摘要:
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency.
摘要翻译:具有100 petaOPS规模计算的多Petascale高效并行超级计算机,其成本,功耗和占地面积都在降低,并且允许从互连角度来看处理节点的最大封装密度。 超级计算机利用了VLSI的技术进步,实现了许多处理器可以集成到单个专用集成电路(ASIC)中的计算模型。 每个ASIC计算节点包括利用集成到一个管芯中的四个或更多个处理器的片上系统ASIC,每个处理器具有对所有系统资源的完全访问,并且使得处理器能够对诸如计算或消息传递I / O 并且优选地,根据应用内的各种算法阶段实现功能的自适应分割,或者如果I / O或其他处理器未被充分利用,则可以参与计算或通信节点通过五维环面网络互连 使用DMA来最大限度地最大化节点之间的分组通信的吞吐量并最小化等待时间。
摘要:
A method of managing a process relocation operation in a computing system is provided and includes determining respective operating temperatures of first, second and additional nodes of the system, where the first node has an elevated operating temperature and the second node has a normal operating temperature, notifying first and second kernels respectively associated with the first and second nodes, of a swapping condition, initially managing the first and second kernels to swap an application between the first and the second nodes while the swapping condition is in effect, and secondarily managing the first and second kernels to perform a barrier operation to end the swapping condition.
摘要:
Methods, apparatus, and products for distributing parallel algorithms of a parallel application among compute nodes of an operational group in a parallel computer are disclosed that include establishing a hardware profile, the hardware profile describing thermal characteristics of each compute node in the operational group; establishing a hardware independent application profile, the application profile describing thermal characteristics of each parallel algorithm of the parallel application; and mapping, in dependence upon the hardware profile and application profile, each parallel algorithm of the parallel application to a compute node in the operational group.
摘要:
Methods, apparatus, and products for distributing parallel algorithms of a parallel application among compute nodes of an operational group in a parallel computer are disclosed that include establishing a hardware profile, the hardware profile describing thermal characteristics of each compute node in the operational group; establishing a hardware independent application profile, the application profile describing thermal characteristics of each parallel algorithm of the parallel application; and mapping, in dependence upon the hardware profile and application profile, each parallel algorithm of the parallel application to a compute node in the operational group.
摘要:
Power throttling may be used to conserve power and reduce heat in a parallel computing environment. Compute nodes in the parallel computing environment may be organized into groups based on, for example, whether they execute tasks of the same job or receive power from the same converter. Once one of compute nodes in the group detects that a parameter (i.e., temperature, current, power consumption, etc.) has exceeded a first threshold, power throttling on all the nodes in the group may be activated. However, before deactivating power throttling, a plurality of parameters associated with the group of compute nodes may be monitored to ensure they are all below a second threshold. If so, the power throttling for all of the compute nodes is deactivated.
摘要:
Methods, systems, and products for dynamically reassigning a connected node to a block of compute nodes for re-launching a failed job that include: identifying that a job failed to execute on the block of compute nodes because connectivity failed between a compute node assigned as at least one of the connected nodes for the block of compute nodes and its supporting I/O node; and re-launching the job, including selecting an alternative connected node that is actively coupled for data communications with an active I/O node; and assigning the alternative connected node as the connected node for the block of compute nodes running the re-launched job.
摘要:
Power throttling may be used to conserve power and reduce heat in a parallel computing environment. Compute nodes in the parallel computing environment may be organized into groups based on, for example, whether they execute tasks of the same job or receive power from the same converter. Once one of compute nodes in the group detects that a parameter (i.e., temperature, current, power consumption, etc.) has exceeded a first threshold, power throttling on all the nodes in the group may be activated. However, before deactivating power throttling, a plurality of parameters associated with the group of compute nodes may be monitored to ensure they are all below a second threshold. If so, the power throttling for all of the compute nodes is deactivated.
摘要:
A method of managing a process relocation operation in a computing system is provided and includes determining respective operating temperatures of first, second and additional nodes of the system, where the first node has an elevated operating temperature and the second node has a normal operating temperature, notifying first and second kernels respectively associated with the first and second nodes, of a swapping condition, initially managing the first and second kernels to swap an application between the first and the second nodes while the swapping condition is in effect, and secondarily managing the first and second kernels to perform a barrier operation to end the swapping condition.
摘要:
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency.
摘要翻译:具有100 petaOPS规模计算的多Petascale高效并行超级计算机,其成本,功耗和占地面积都在降低,并且允许从互连角度来看处理节点的最大封装密度。 超级计算机利用了VLSI的技术进步,实现了许多处理器可以集成到单个专用集成电路(ASIC)中的计算模型。 每个ASIC计算节点包括利用集成到一个管芯中的四个或更多个处理器的片上系统ASIC,每个处理器具有对所有系统资源的完全访问,并且使得处理器能够对诸如计算或消息传递I / O 并且优选地,根据应用内的各种算法阶段实现功能的自适应分割,或者如果I / O或其他处理器未被充分利用,则可以参与计算或通信节点通过五维环面网络互连 使用DMA来最大限度地最大化节点之间的分组通信的吞吐量并最小化等待时间。
摘要:
Methods, systems, and products for dynamically reassigning a connected node to a block of compute nodes for re-launching a failed job that include: identifying that a job failed to execute on the block of compute nodes because connectivity failed between a compute node assigned as at least one of the connected nodes for the block of compute nodes and its supporting I/O node; and re-launching the job, including selecting an alternative connected node that is actively coupled for data communications with an active I/O node; and assigning the alternative connected node as the connected node for the block of compute nodes running the re-launched job.