Abstract:
A system and computer program product for tolerating failures using concurrency in a cluster are provided in the illustrative embodiments. A failure is detected in a first computing node serving an application in a cluster. A subset of actions is selected from a set of actions, the set of actions configured to transfer the serving of the application from the first computing node to a second computing node in the cluster. A waiting period is set for the first computing node. The first computing node is allowed to continue serving the application during the waiting period. During the waiting period, concurrently with the first computing node serving the application, the subset of actions is performed at the second computing node. Responsive to receiving a signal of activity from the first computing node during the waiting period, the concurrent operation of the second computing node is aborted.
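The waiting-period logic described above can be illustrated with a minimal sketch. Everything named here (the FailoverAttempt class, the tick-based clock, the "takeover" outcome when no activity is seen) is an assumption made for illustration and is not taken from the abstract, which only states that the concurrent operation is aborted when the first node signals activity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverAttempt:
    """Hypothetical model of one speculative takeover by the second node."""
    waiting_period: int                     # length of the wait, in abstract ticks
    prep_actions: List[Callable[[], str]]   # the selected subset of transfer actions
    log: List[str] = field(default_factory=list)

    def run(self, activity_at_tick: Callable[[int], bool]) -> str:
        """Return 'aborted' if the first node shows activity, else 'takeover'."""
        pending = list(self.prep_actions)
        for tick in range(self.waiting_period):
            # The first node keeps serving; a sign of life cancels the takeover.
            if activity_at_tick(tick):
                self.log.append(f"tick {tick}: activity seen, abort")
                return "aborted"
            # Otherwise the second node performs one preparatory action per tick.
            if pending:
                self.log.append(f"tick {tick}: {pending.pop(0)()}")
        # Assumption: with no activity during the wait, the transfer proceeds.
        return "takeover"
```

For example, FailoverAttempt(3, [lambda: "mount", lambda: "start"]).run(lambda t: t == 1) performs the first action at tick 0 and then aborts at tick 1.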
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining image search results. One of the methods includes scheduling a first computation for execution on each of a first plurality of worker processes. The first computation changes a respective state of each of one or more of the first worker processes from a first state to a second state. A respective second computation is scheduled for execution on each of a second plurality of worker processes, where each respective second computation will use a different value for a particular variable for two or more of the second plurality of worker processes. The respective state of each of the second plurality of worker processes is updated from the second state to a third state, where the third state corresponds to execution of the second computation using a first value of the particular variable.
Abstract:
A computer switching method to be performed by a computer system including a plurality of computers, a storage system, and a management computer, the plurality of computers including: a plurality of first computers and a plurality of second computers, the storage system providing a logical storage device to each of the plurality of first computers, the logical storage device including a first logical storage device which is a storage area for storing data, the computer switching method including: a step of transmitting, by the management computer, a generation request for instructing the storage system to generate a second logical storage device; a step of generating, by the management computer, change information for mapping the first logical storage device to the second logical storage device for the second computer, and transmitting a change request including the generated change information to the storage system.
Abstract:
An efficient disaster recovery system is constructed at three data centers. A data center includes: a business server for executing an application in response to an input/output request; a storage system for providing a first storage area storing data in response to a request from the business server; and a management server for managing a second data center or a third data center among the plurality of data centers as a failover location when a system of a first data center having the first storage area stops; and wherein the management server: copies all pieces of data stored in the first storage area to a second storage area managed by a storage system of the second data center; and copies part of the data stored in the first storage area to a third storage area managed by a storage system of the third data center.
Abstract:
Cascading failover of blade servers in a data center implemented by transferring by a system management server a data processing workload from a failing blade server to an initial replacement blade server, with the data processing workload characterized by data processing resource requirements and the initial replacement blade server having data processing resources that do not match the data processing resource requirements; and transferring by the system management server the data processing workload from the initial replacement blade server to a subsequent replacement blade server, where the subsequent replacement blade server has data processing resources that better match the data processing resource requirements than do the data processing resources of the initial replacement blade server.
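The two-step transfer described above can be sketched as follows. The mismatch score (total shortfall across required resources) and the function names are assumptions chosen for illustration; the abstract does not say how the quality of a match is measured.

```python
from typing import Dict, List

Resources = Dict[str, int]


def mismatch(resources: Resources, requirements: Resources) -> int:
    """Total shortfall across required resources (0 means a full match)."""
    return sum(max(0, need - resources.get(name, 0))
               for name, need in requirements.items())


def cascade(requirements: Resources, initial: str,
            blades: Dict[str, Resources]) -> List[str]:
    """Return the sequence of replacement blades the workload moves through."""
    path = [initial]
    current_score = mismatch(blades[initial], requirements)
    # Candidates that match the requirements better than the initial replacement.
    better = [name for name, res in blades.items()
              if mismatch(res, requirements) < current_score]
    if better:
        # Move on to the best-matching subsequent replacement blade.
        path.append(min(better, key=lambda n: mismatch(blades[n], requirements)))
    return path
```

With blades b1 (2 CPUs, 8 GB), b2 (8 CPUs, 32 GB), and b3 (4 CPUs, 16 GB) and a workload requiring 8 CPUs and 32 GB, starting on b1 yields the path b1 then b2.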
Abstract:
Techniques for managing disaster recovery sites are disclosed. In one particular embodiment, the techniques may be realized as a method for managing disaster recovery sites comprising generating a heartbeat at a first node, transmitting the heartbeat from the first node to a second node, determining whether a network connection between the first node and the second node has failed, determining whether the second node has received an additional heartbeat from the first node, and changing a state of the second node based on the determination of whether the second node has received the additional heartbeat.
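A minimal sketch of the heartbeat check on the second node follows. The state names ("standby", "active", "isolated"), the timeout value, and the rule mapping the two determinations to a state are all assumptions for illustration; the abstract only says that the state changes based on whether an additional heartbeat was received.

```python
from dataclasses import dataclass


@dataclass
class SecondNode:
    """Hypothetical model of the node that receives heartbeats."""
    state: str = "standby"
    last_heartbeat: float = 0.0
    timeout: float = 5.0  # assumed seconds of silence before a heartbeat counts as missed

    def receive_heartbeat(self, now: float) -> None:
        """Record the arrival time of a heartbeat from the first node."""
        self.last_heartbeat = now

    def evaluate(self, now: float, network_up: bool) -> str:
        """Update and return the state from the two determinations."""
        missed = (now - self.last_heartbeat) > self.timeout
        if not missed:
            self.state = "standby"    # an additional heartbeat arrived in time
        elif network_up:
            self.state = "active"     # link is fine but the first node is silent
        else:
            self.state = "isolated"   # silence may be caused by the failed link
        return self.state
```

Distinguishing a failed link from a failed first node is what lets the second node avoid taking over when only the connection is down.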
Abstract:
An information processing device includes: a detector configured to, when a second processing function unit monitored over a second management network is recovered by using a first processing function unit that performs a function as an information processing device and that is monitored over a first management network, detect a conflict between first network information used by the second processing function unit in the second management network and second network information used by each processing function unit monitored over the first management network; and a recovery execution unit configured to resolve the conflict between the first network information and the second network information detected by the detector so as to recover the second processing function unit by using the first processing function unit.
Abstract:
A solutions manager supports computing solutions running on hosts in an adaptive computing environment by utilizing remote processes or agents placed on the hosts. A remote agent is associated with a computing solution and placed on the host on which the computing solution is running. When the computing solution is relocated to a new host, the remote agent associated with the computing solution is also automatically relocated and restarted on the new host.
Abstract:
In a method for provisioning a virtual machine, a processor rates a plurality of software images that include a first software image and a second software image. A processor provisions the virtual machine with the first software image in a first state and the second software image in a second state, wherein the second software image is rated higher than the first software image.
Abstract:
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
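The selection step described above can be sketched as below. The queue layout (pairs of job name and node count), the first-in-first-out policy, and the function name are assumptions for illustration; the abstract does not specify how jobs are chosen or how many nodes each job uses.

```python
from collections import deque
from typing import Deque, List, Optional, Set, Tuple


def schedule_next(nodes: List[str], allocated: Set[str],
                  queue: Deque[Tuple[str, int]]) -> Optional[Tuple[str, List[str]]]:
    """Run the job at the head of the queue on unallocated nodes, if it fits."""
    # Determine the unallocated subset of the HPC nodes.
    unallocated = [n for n in nodes if n not in allocated]
    if not queue:
        return None
    job_name, needed = queue[0]
    if needed > len(unallocated):
        return None  # not enough free nodes; the job stays queued
    queue.popleft()
    # Use only a portion of the unallocated subset, as many nodes as the job needs.
    chosen = unallocated[:needed]
    allocated.update(chosen)
    return job_name, chosen
```

Given nodes n0 to n3 with n1 already allocated, a queued job needing two nodes is placed on n0 and n2, and a following job needing three nodes stays queued.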