Messaging In A Parallel Computer Using Remote Direct Memory Access ('RDMA')
    2.
    Patent application
    Status: Lapsed

    Publication No.: US20120331065A1

    Publication date: 2012-12-27

    Application No.: US13167911

    Filing date: 2011-06-24

    IPC classification: G06F15/16

    CPC classification: G06F15/167 G06F15/17331

    Abstract: Messaging in a parallel computer using remote direct memory access (‘RDMA’), including: receiving a send work request; responsive to the send work request: translating a local virtual address on the first node from which data is to be transferred to a physical address on the first node from which data is to be transferred; creating a local RDMA object that includes a counter set to the size of a messaging acknowledgment field; sending, from a messaging unit in the first node to a messaging unit in a second node, a message that includes an RDMA read operation request, the physical address of the local RDMA object, and the physical address on the first node from which data is to be transferred; and receiving, by the first node responsive to the second node's execution of the RDMA read operation request, acknowledgment data in the local RDMA object.
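
The send path in this abstract can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions, not the patented implementation: the `Node` and `RdmaObject` classes, the dictionary page table, the 4-byte acknowledgment size, the `length` field, and the object address scheme are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

ACK_FIELD_SIZE = 4  # assumption: the acknowledgment field is 4 bytes


@dataclass
class RdmaObject:
    """Local RDMA object: a counter plus room for the acknowledgment data."""
    counter: int
    ack: bytearray = field(default_factory=bytearray)

    def deliver(self, data: bytes) -> None:
        # Each delivered byte counts the object down toward completion.
        self.ack.extend(data)
        self.counter -= len(data)

    @property
    def complete(self) -> bool:
        return self.counter == 0


class Node:
    """Hypothetical node with flat 'physical' memory and a page table."""

    def __init__(self, mem_size: int = 64):
        self.memory = bytearray(mem_size)
        self.page_table = {}    # virtual address -> physical address
        self.rdma_objects = {}  # physical address -> RdmaObject
        self.received = b""

    def send(self, peer: "Node", vaddr: int, length: int) -> RdmaObject:
        # 1. Translate the local virtual address to a physical address.
        src_paddr = self.page_table[vaddr]
        # 2. Create a local RDMA object whose counter is the ack field size.
        obj_paddr = 0x1000 + len(self.rdma_objects)  # made-up address scheme
        obj = RdmaObject(counter=ACK_FIELD_SIZE)
        self.rdma_objects[obj_paddr] = obj
        # 3. Send a message carrying the RDMA read request and both addresses.
        message = {"op": "rdma_read", "obj_paddr": obj_paddr,
                   "src_paddr": src_paddr, "length": length}
        peer.on_message(self, message)
        return obj

    def on_message(self, sender: "Node", message: dict) -> None:
        # The second node executes the RDMA read against the sender's memory...
        start = message["src_paddr"]
        self.received = bytes(sender.memory[start:start + message["length"]])
        # ...and the sender then receives acknowledgment data in its object.
        sender.rdma_objects[message["obj_paddr"]].deliver(b"\x01" * ACK_FIELD_SIZE)


first, second = Node(), Node()
first.memory[8:13] = b"hello"
first.page_table[0x4000] = 8
ack_object = first.send(second, 0x4000, 5)
```

After the call, the sender can treat the send as finished once `ack_object.complete` is true, which mirrors the role the counter plays in the abstract.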

    Executing Multiple Instructions Multiple Data (‘MIMD’) programs on a Single Instruction Multiple Data (‘SIMD’) machine
    3.
    Granted patent
    Status: Lapsed

    Publication No.: US07831802B2

    Publication date: 2010-11-09

    Application No.: US11780072

    Filing date: 2007-07-19

    IPC classification: G06F15/76

    CPC classification: G06F15/161

    Abstract: Executing Multiple Instructions Multiple Data (‘MIMD’) programs on a Single Instruction Multiple Data (‘SIMD’) machine, the SIMD machine including a plurality of compute nodes, each compute node capable of executing only a single thread of execution, the compute nodes initially configured exclusively for SIMD operations, the SIMD machine further comprising a data communications network, the network comprising synchronous data communications links among the compute nodes, including establishing a SIMD partition comprising a plurality of the compute nodes; booting the SIMD partition in MIMD mode; executing by launcher programs a plurality of MIMD programs on compute nodes in the SIMD partition; and re-executing a launcher program by an operating system on a compute node in the SIMD partition upon termination of the MIMD program executed by the launcher program.
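
The launcher re-execution step can be pictured as a loop in the node's operating system. The sketch below is a toy model under assumptions: nodes are simulated sequentially in one process, and the per-node job queue that tells the launcher which program to run next is a hypothetical detail that the abstract does not specify.

```python
from collections import deque


def launcher(node_id, job_queue, log):
    """Run exactly one MIMD program on this node, then terminate."""
    program = job_queue.popleft()
    log.append((node_id, program()))


def node_os(node_id, job_queue, log):
    """Stand-in for the compute node OS: whenever the launched program
    ends, re-execute the launcher so the single thread picks up new work."""
    while job_queue:
        launcher(node_id, job_queue, log)


def run_partition(node_ids, programs_per_node):
    """Model a partition booted in MIMD mode, one node after another."""
    log = []
    for node_id in node_ids:
        node_os(node_id, deque(programs_per_node[node_id]), log)
    return log


programs = {0: [lambda: "a", lambda: "b"], 1: [lambda: "c"]}
results = run_partition([0, 1], programs)
```

Each node runs a different sequence of programs, which is the MIMD behaviour, while still executing only one thread at a time.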


    Configuring compute nodes in a parallel computer using remote direct memory access (‘RDMA’)

    Publication No.: US10474625B2

    Publication date: 2019-11-12

    Application No.: US13351419

    Filing date: 2012-01-17

    Abstract: Configuring compute nodes in a parallel computer using remote direct memory access (‘RDMA’), the parallel computer comprising a plurality of compute nodes coupled for data communications via one or more data communications networks, including: initiating, by a source compute node of the parallel computer, an RDMA broadcast operation to broadcast binary configuration information to one or more target compute nodes in the parallel computer; preparing, by each target compute node, the target compute node for receipt of the binary configuration information from the source compute node; transmitting, by each target compute node, a ready message to the target compute node, the ready message indicating that the target compute node is ready to receive the binary configuration information from the source compute node; and performing, by the source compute node, an RDMA broadcast operation to write the binary configuration information into memory of each target compute node.
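
The prepare, ready, and write sequence can be sketched as a small handshake. This is an illustrative model only: `SourceNode`, `TargetNode`, and the tuple-shaped ready message are hypothetical, and the sketch assumes the ready messages are collected by the source node before it writes.

```python
class TargetNode:
    """Hypothetical target node that receives configuration by RDMA write."""

    def __init__(self, name):
        self.name = name
        self.buffer = None

    def prepare(self, size):
        # Prepare for receipt: allocate the memory the source will write into,
        # then report readiness with a ready message.
        self.buffer = bytearray(size)
        return ("ready", self.name)


class SourceNode:
    """Hypothetical source node that broadcasts binary configuration."""

    def broadcast_config(self, config, targets):
        # Initiate the broadcast and gather one ready message per target.
        ready = set()
        for target in targets:
            kind, name = target.prepare(len(config))
            if kind == "ready":
                ready.add(name)
        if ready != {t.name for t in targets}:
            raise RuntimeError("not every target reported ready")
        # Perform the write into each target's prepared memory.
        for target in targets:
            target.buffer[:] = config
        return len(targets)


targets = [TargetNode("n1"), TargetNode("n2")]
written = SourceNode().broadcast_config(b"\x7fCFG", targets)
```

Waiting for every ready message before writing is the design point: the source never writes into memory that a target has not yet set aside.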

    Aggregating Job Exit Statuses Of A Plurality Of Compute Nodes Executing A Parallel Application
    5.
    Patent application
    Status: In force

    Publication No.: US20130339805A1

    Publication date: 2013-12-19

    Application No.: US13524602

    Filing date: 2012-06-15

    IPC classification: G06F9/46 G06F11/07

    Abstract: Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.
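
A compact way to picture the aggregation is a function that collects one status per node and reduces them to a single record. The sketch makes several assumptions that are not in the abstract: the first node of the subset is chosen as job leader, each exit status is a dictionary with a `code` key, and the aggregate reports the maximum exit code plus the list of failed nodes.

```python
def run_job(nodes, app, subset_size):
    """Run `app` on a subset of nodes and return one aggregated exit status."""
    # Identify the subset of compute nodes that will execute the application.
    subset = nodes[:subset_size]
    # Select a job leader (assumption: simply the first node in the subset).
    leader = subset[0]
    # Initiate execution and receive an exit status from every node.
    statuses = {node: app(node) for node in subset}
    # Aggregate the individual statuses into a single record to send onward.
    return {
        "leader": leader,
        "nodes": len(statuses),
        "max_exit_code": max(s["code"] for s in statuses.values()),
        "failed": sorted(n for n, s in statuses.items() if s["code"] != 0),
    }


def sample_app(node):
    # Hypothetical application: node "n2" exits with a non-zero code.
    return {"code": 3 if node == "n2" else 0}


summary = run_job(["n0", "n1", "n2", "n3"], sample_app, subset_size=3)
```

Sending one aggregated record instead of one message per node is what keeps the reporting cost independent of the partition size.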


    Remote Direct Memory Access ('RDMA') In A Parallel Computer
    6.
    Patent application
    Status: Under examination (published)

    Publication No.: US20120331243A1

    Publication date: 2012-12-27

    Application No.: US13167950

    Filing date: 2011-06-24

    IPC classification: G06F12/00

    Abstract: Remote direct memory access (‘RDMA’) in a parallel computer, the parallel computer including a plurality of nodes, each node including a messaging unit, including: receiving an RDMA read operation request that includes a virtual address representing a memory region at which to receive data to be transferred from a second node to the first node; responsive to the RDMA read operation request: translating the virtual address to a physical address; creating a local RDMA object that includes a counter set to the size of the memory region; sending a message that includes a DMA write operation request, the physical address of the memory region on the first node, the physical address of the local RDMA object on the first node, and a remote virtual address on the second node; and receiving the data to be transferred from the second node.
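
The role of the counter in this read path can be shown with a toy model in which the second node writes the data back in packets. Everything here is a sketch under assumptions: the class names, the dictionary page table, the 4-byte packet size, and passing the RDMA object by reference (standing in for its physical address) are illustrative choices, not details from the abstract.

```python
class LocalRdmaObject:
    """Counter initialised to the size of the receiving memory region."""

    def __init__(self, region_size):
        self.counter = region_size


class FirstNode:
    def __init__(self, mem_size=64):
        self.memory = bytearray(mem_size)
        self.page_table = {}  # virtual address -> physical address

    def rdma_read(self, second, local_vaddr, remote_vaddr, size):
        # Translate the virtual address of the receiving region.
        region_paddr = self.page_table[local_vaddr]
        # Create the local RDMA object with counter = region size.
        obj = LocalRdmaObject(size)
        # Ask the second node to DMA-write the data back to us.
        message = {"op": "dma_write", "region_paddr": region_paddr,
                   "object": obj,  # stand-in for the object's physical address
                   "remote_vaddr": remote_vaddr, "size": size}
        second.on_message(self, message)
        return obj

    def dma_write(self, paddr, chunk, obj):
        # Arriving data lands in the region and counts the object down.
        self.memory[paddr:paddr + len(chunk)] = chunk
        obj.counter -= len(chunk)


class SecondNode:
    def __init__(self, data_by_vaddr):
        self.data_by_vaddr = data_by_vaddr

    def on_message(self, first, message, packet=4):
        data = self.data_by_vaddr[message["remote_vaddr"]][:message["size"]]
        for off in range(0, len(data), packet):
            first.dma_write(message["region_paddr"] + off,
                            data[off:off + packet], message["object"])


first = FirstNode()
first.page_table[0x4000] = 16
second = SecondNode({0x9000: b"0123456789"})
read_object = first.rdma_read(second, 0x4000, 0x9000, 10)
```

When `read_object.counter` reaches zero, the whole region has been filled, so the first node needs no separate completion message.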


    Establishing A Data Communications Connection Between A Lightweight Kernel In A Compute Node Of A Parallel Computer And An Input-Output ('I/O') Node Of The Parallel Computer
    7.
    Patent application
    Status: Under examination (published)

    Publication No.: US20120331153A1

    Publication date: 2012-12-27

    Application No.: US13166536

    Filing date: 2011-06-22

    IPC classification: G06F15/16

    CPC classification: G06F15/80 G06F15/17356

    Abstract: Establishing a data communications connection between a lightweight kernel in a compute node of a parallel computer and an input-output (‘I/O’) node of the parallel computer, including: configuring the compute node with the network address and port value for data communications with the I/O node; establishing a queue pair on the compute node, the queue pair identified by a queue pair number (‘QPN’); receiving, in the I/O node on the parallel computer from the lightweight kernel, a connection request message; establishing by the I/O node on the I/O node a queue pair identified by a QPN for communications with the compute node; and establishing by the I/O node the requested connection by sending to the lightweight kernel a connection reply message.
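
The connection handshake can be modelled as an exchange of queue pair numbers. The sketch below does not use a real RDMA library: `Endpoint`, `IONode`, `LightweightKernel`, the message dictionaries, the starting QPN values, and the sample address and port are all hypothetical.

```python
import itertools


class Endpoint:
    """Common queue pair bookkeeping for both sides of the connection."""

    def __init__(self, first_qpn):
        self._qpns = itertools.count(first_qpn)
        self.queue_pairs = {}

    def create_queue_pair(self):
        qpn = next(self._qpns)
        self.queue_pairs[qpn] = {"send": [], "recv": [], "peer_qpn": None}
        return qpn


class IONode(Endpoint):
    def __init__(self, address, port):
        super().__init__(first_qpn=100)
        self.address, self.port = address, port

    def on_connect_request(self, request):
        if request["port"] != self.port:
            return {"type": "reject"}
        # Establish a queue pair on the I/O node for this compute node...
        qpn = self.create_queue_pair()
        self.queue_pairs[qpn]["peer_qpn"] = request["qpn"]
        # ...and complete the connection with a reply message.
        return {"type": "connect_reply", "qpn": qpn}


class LightweightKernel(Endpoint):
    def __init__(self):
        super().__init__(first_qpn=1)
        self.io_address = self.io_port = None

    def configure(self, address, port):
        # Configure the compute node with the I/O node's address and port.
        self.io_address, self.io_port = address, port

    def connect(self, io_node):
        qpn = self.create_queue_pair()
        reply = io_node.on_connect_request(
            {"type": "connect_request", "address": self.io_address,
             "port": self.io_port, "qpn": qpn})
        if reply["type"] != "connect_reply":
            raise ConnectionError("I/O node rejected the connection request")
        self.queue_pairs[qpn]["peer_qpn"] = reply["qpn"]
        return qpn, reply["qpn"]


io_node = IONode("10.0.0.1", 7100)
kernel = LightweightKernel()
kernel.configure("10.0.0.1", 7100)
local_qpn, remote_qpn = kernel.connect(io_node)
```

After the reply arrives, each side knows the other's QPN, which is the minimum state needed before traffic can flow over the pair.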


    Aggregating job exit statuses of a plurality of compute nodes executing a parallel application
    8.
    Granted patent
    Status: In force

    Publication No.: US09086962B2

    Publication date: 2015-07-21

    Application No.: US13524602

    Filing date: 2012-06-15

    IPC classification: G06F11/07 G06F9/52 G06F11/30

    Abstract: Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.


    Configuring Compute Nodes In A Parallel Computer Using Remote Direct Memory Access ('RDMA')
    9.
    Patent application
    Status: Under examination (published)

    Publication No.: US20130185381A1

    Publication date: 2013-07-18

    Application No.: US13351419

    Filing date: 2012-01-17

    IPC classification: G06F15/16

    Abstract: Configuring compute nodes in a parallel computer using remote direct memory access (‘RDMA’), the parallel computer comprising a plurality of compute nodes coupled for data communications via one or more data communications networks, including: initiating, by a source compute node of the parallel computer, an RDMA broadcast operation to broadcast binary configuration information to one or more target compute nodes in the parallel computer; preparing, by each target compute node, the target compute node for receipt of the binary configuration information from the source compute node; transmitting, by each target compute node, a ready message to the target compute node, the ready message indicating that the target compute node is ready to receive the binary configuration information from the source compute node; and performing, by the source compute node, an RDMA broadcast operation to write the binary configuration information into memory of each target compute node.


    Error Recovery During Execution Of An Application On A Parallel Computer
    10.
    Patent application
    Status: Under examination (published)

    Publication No.: US20100017655A1

    Publication date: 2010-01-21

    Application No.: US12174312

    Filing date: 2008-07-16

    IPC classification: G06F11/08

    CPC classification: G06F11/1482

    Abstract: Methods, apparatus, and products are disclosed for error recovery during execution of an application on a parallel computer that includes a plurality of compute nodes. Such error recovery includes: storing, by the application during execution on the nodes, application restore data in a restore buffer at predetermined points during execution of the application, the restore data specifying an execution state of the application at one or more points during application execution; encountering, by at least one of the nodes executing the application, a recoverable error during application execution; determining, by the application, the nodes affected by the recoverable error; restarting, by each of the affected nodes, execution of the application; retrieving, by the restarted application executing on each of the affected nodes, the restore data from the restore buffer; and continuing, by each affected node, execution of the application with the execution state specified by the retrieved restore data.
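
The store, restart, and continue cycle is essentially application-level checkpointing, which the following single-node sketch illustrates. It rests on assumptions not found in the abstract: the restore buffer is a plain dictionary, a checkpoint is taken every two steps, the failure is injected with a hypothetical `fail_at` parameter, and the step of determining which nodes are affected is omitted because only one node is simulated.

```python
class RecoverableError(Exception):
    """Stand-in for a recoverable error hit during application execution."""


def run_app(steps, restore_buffer, fail_at=None, checkpoint_every=2):
    """Sum the integers 0..steps-1, checkpointing into `restore_buffer`."""
    # Retrieve restore data if a previous run stored any; else start fresh.
    state = dict(restore_buffer.get("state", {"step": 0, "total": 0}))
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RecoverableError(f"fault injected at step {fail_at}")
        state["total"] += state["step"]
        state["step"] += 1
        # Store restore data at predetermined points during execution.
        if state["step"] % checkpoint_every == 0:
            restore_buffer["state"] = dict(state)
    return state["total"]


def run_with_recovery(steps, fail_at):
    restore_buffer = {}
    try:
        return run_app(steps, restore_buffer, fail_at=fail_at)
    except RecoverableError:
        # Restart the application; it continues from the stored state.
        return run_app(steps, restore_buffer)


total = run_with_recovery(steps=6, fail_at=5)
```

Because the restarted run resumes from the last checkpoint rather than from step zero, only the work done since that checkpoint is repeated.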
