Error Recovery During Execution Of An Application On A Parallel Computer
    Patent application
    Status: Under examination (published)

    Publication number: US20100017655A1

    Publication date: 2010-01-21

    Application number: US12174312

    Filing date: 2008-07-16

    IPC classification: G06F11/08

    CPC classification: G06F11/1482

    Abstract: Methods, apparatus, and products are disclosed for error recovery during execution of an application on a parallel computer that includes a plurality of compute nodes. Such error recovery includes: storing, by the application during execution on the nodes, application restore data in a restore buffer at predetermined points during execution of the application, the restore data specifying an execution state of the application at one or more points during application execution; encountering, by at least one of the nodes executing the application, a recoverable error during application execution; determining, by the application, the nodes affected by the recoverable error; restarting, by each of the affected nodes, execution of the application; retrieving, by the restarted application executing on each of the affected nodes, the restore data from the restore buffer; and continuing, by each affected node, execution of the application with the execution state specified by the retrieved restore data.
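
    Illustrative sketch: the recovery sequence in the abstract (store restore data at predetermined points, encounter a recoverable error, determine the affected nodes, restart them, retrieve the restore data, continue) can be imitated in a small single-process simulation. This is not the implementation claimed in the application. The names Node, RecoverableError, run_on_node, and run_application, the per-node dictionary used as a restore buffer, and the fixed checkpoint interval are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


class RecoverableError(Exception):
    """Hypothetical signal that a node hit an error it can recover from."""


@dataclass
class Node:
    """Hypothetical stand-in for one compute node of the parallel computer."""
    rank: int
    restore_buffer: Dict[str, int] = field(default_factory=dict)
    restarts: int = 0


def run_on_node(node: Node, total_steps: int, checkpoint_every: int,
                fail_at: Optional[int] = None) -> int:
    """Run the toy application (sum of 0..total_steps-1) on one node."""
    # On start or restart, retrieve the execution state from the restore
    # buffer; an empty buffer means "begin from the initial state".
    step = node.restore_buffer.get("step", 0)
    acc = node.restore_buffer.get("acc", 0)
    while step < total_steps:
        # Inject one recoverable error, only on the node's first run.
        if fail_at is not None and step == fail_at and node.restarts == 0:
            raise RecoverableError(f"node {node.rank} failed at step {step}")
        acc += step
        step += 1
        # Predetermined point: store restore data describing the state.
        if step % checkpoint_every == 0:
            node.restore_buffer = {"step": step, "acc": acc}
    return acc


def run_application(num_nodes: int, total_steps: int, checkpoint_every: int,
                    failures: Dict[int, int]) -> Tuple[Dict[int, int], List[Node]]:
    """Run all nodes, then restart only those affected by an error."""
    nodes = [Node(rank=r) for r in range(num_nodes)]
    results: Dict[int, int] = {}
    affected: List[Node] = []
    for node in nodes:
        try:
            results[node.rank] = run_on_node(
                node, total_steps, checkpoint_every, failures.get(node.rank))
        except RecoverableError:
            # Determine which nodes the recoverable error affected.
            affected.append(node)
    for node in affected:
        # Restart: the rerun resumes from the last stored restore data.
        node.restarts += 1
        results[node.rank] = run_on_node(
            node, total_steps, checkpoint_every, failures.get(node.rank))
    return results, nodes


if __name__ == "__main__":
    results, nodes = run_application(
        num_nodes=4, total_steps=10, checkpoint_every=3, failures={2: 7})
    print(results)
    print([n.restarts for n in nodes])
```

    In this sketch only the node that raised the error is restarted, and it resumes from its most recent checkpoint rather than from step 0. A real parallel machine would additionally have to coordinate state across nodes, which this simulation does not attempt.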
