Abstract:
Provided are a method, system, and article of manufacture for preventing data loss. Modified data is stored in a volatile storage. The stored modified data is copied onto a non-volatile storage. A determination is made as to whether the non-volatile storage should be checked for errors. In certain implementations, on determining that the non-volatile storage should be checked for errors, the non-volatile storage is checked for errors. If, on checking, the non-volatile storage is found to have an error, an indication of the error is provided.
Abstract:
Provided are a method, system, and article of manufacture for synchronizing device error information among nodes. A first node performs an action with respect to a first node error counter for a device in communication with the first node and a second node. The first node transmits a message to the second node indicating the device and the action performed with respect to the first node error counter for the device. The second node performs the action indicated in the message with respect to a second node error counter for the device indicated in the message, wherein the second node error counter corresponds to the first node error counter for the device.
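The counter synchronization described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Node` class, the action names `"increment"` and `"reset"`, and the dictionary message format are all assumptions introduced here, and the "transmission" is a direct method call rather than a real inter-node link.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical node that keeps one error counter per shared device."""
    name: str
    counters: dict = field(default_factory=lambda: defaultdict(int))
    peer: Optional["Node"] = None

    def _apply(self, device: str, action: str) -> None:
        # Assumed action set; the abstract only speaks of "an action".
        if action == "increment":
            self.counters[device] += 1
        elif action == "reset":
            self.counters[device] = 0
        else:
            raise ValueError(f"unknown action: {action}")

    def perform(self, device: str, action: str) -> None:
        """Act on the local counter, then tell the peer what was done."""
        self._apply(device, action)
        if self.peer is not None:
            self.peer.receive({"device": device, "action": action})

    def receive(self, message: dict) -> None:
        """Mirror the peer's action on the corresponding local counter."""
        self._apply(message["device"], message["action"])


first, second = Node("first"), Node("second")
first.peer, second.peer = second, first
first.perform("disk0", "increment")
first.perform("disk0", "increment")
second.perform("disk0", "reset")
```

Sending the action instead of the counter value keeps the message small; it assumes both nodes start from the same count and that no message is lost.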
Abstract:
A method is disclosed to adjust error thresholds in a data storage and retrieval system. The method supplies a data storage and retrieval system comprising memory and microcode, wherein that microcode comprises one or more default error thresholds. The method determines if the memory comprises one or more operational error thresholds. If the method determines that the memory comprises one or more operational error thresholds, then the method operates the data storage and retrieval system using those one or more operational error thresholds. Alternatively, if the method determines that the memory does not comprise one or more operational error thresholds, then the method sets the one or more default error thresholds as the one or more operational error thresholds.
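The default-versus-operational decision above can be sketched as a small function. The threshold names and values, the `memory` dictionary, and the key `"operational_thresholds"` are hypothetical stand-ins for the system's memory and microcode, chosen only for illustration.

```python
# Hypothetical defaults standing in for the thresholds shipped in microcode.
DEFAULT_THRESHOLDS = {"media_errors": 10, "timeouts": 5}


def load_operational_thresholds(memory: dict, defaults: dict = DEFAULT_THRESHOLDS) -> dict:
    """Return the thresholds to operate with.

    If memory already holds operational thresholds, use them unchanged.
    Otherwise copy the defaults into memory as the operational thresholds.
    """
    operational = memory.get("operational_thresholds")
    if not operational:
        operational = dict(defaults)  # copy so later tuning leaves the defaults intact
        memory["operational_thresholds"] = operational
    return operational
```

Copying the defaults rather than aliasing them means a later adjustment to the operational thresholds does not alter the defaults.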
Abstract:
An apparatus, system, and method are disclosed for autonomously overriding a global resource lock. The apparatus includes a determination module, an override module, and an assertion module. The determination module determines whether a global resource lock is owned by a peer resource controller and, if so, whether the peer resource controller is offline. The override module atomically overrides ownership of the global resource lock from the peer resource controller. The assertion module asserts active ownership of the global resource lock. The apparatus, system, and method provide an autonomous override of the global resource lock, minimizing system downtime and user intervention.
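One way to picture the determine-then-override sequence is a compare-and-swap on the lock's owner field. The sketch below is an assumption-laden illustration: `GlobalLock`, `try_override`, and the `peer_is_online` callback are hypothetical names, a `threading.Lock` stands in for whatever atomic primitive the hardware provides, and the override and assertion steps are collapsed into a single swap.

```python
import threading


class GlobalLock:
    """Hypothetical lock record with an atomically updatable owner field."""

    def __init__(self, owner=None):
        self._guard = threading.Lock()  # simulates an atomic hardware primitive
        self.owner = owner

    def compare_and_swap(self, expected, new) -> bool:
        """Set owner to `new` only if it is still `expected`."""
        with self._guard:
            if self.owner == expected:
                self.owner = new
                return True
            return False


def try_override(lock: GlobalLock, me: str, peer: str, peer_is_online) -> bool:
    """Take the lock from an offline peer; leave it alone otherwise."""
    if lock.owner != peer:       # determination: does the peer own the lock?
        return False
    if peer_is_online(peer):     # determination: is the owning peer offline?
        return False
    # Override: the swap fails if ownership changed since the checks above.
    return lock.compare_and_swap(peer, me)
```

The swap guards against the case where the peer releases the lock, or a third party takes it, between the check and the override.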
Abstract:
Provided are a method, system, and program for processing complexes to access shared devices. A lock to a plurality of shared devices is maintained and accessible to first and second processing complexes. The first processing complex determines a first delay time and the second processing complex determines a second delay time. The first processing complex issues a request for the lock in response to expiration of the first delay time and the second processing complex issues a request for the lock in response to expiration of the second delay time.
Abstract:
An apparatus, system, and method are disclosed for facilitating monitoring and responding to error events. An apparatus may include a set of counters associated with a processing system resource, each counter associated with an error event and having attributes defining a count value, counter thresholds directly related to time, and empirical status information for the error event related to time. A user may adjust counter thresholds indirectly to set an error tolerance. An update module may update counters within the set based on an error event for the processing system resource. A management module persists and maintains a life cycle for counters based on counter attributes. Each counter may be one of two types: a fixed counter that counts error events from a start time for a defined duration, or a sliding counter that counts error events up to a predefined number of error events within a window of time.
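The two counter types can be contrasted with a short sketch. The class names, the numeric timestamps, and the choice to restart a fixed counter once its duration has elapsed are assumptions made for illustration; the abstract does not specify them.

```python
from collections import deque


class FixedCounter:
    """Counts events from a start time for a defined duration."""

    def __init__(self, start: float, duration: float, threshold: int):
        self.start, self.duration, self.threshold = start, duration, threshold
        self.count = 0

    def record(self, now: float) -> bool:
        """Record one error event; return True if the threshold is reached."""
        if now >= self.start + self.duration:
            # Assumed life-cycle rule: begin a fresh period at this event.
            self.start, self.count = now, 0
        self.count += 1
        return self.count >= self.threshold


class SlidingCounter:
    """Trips when `max_events` events fall within any window of `window` time units."""

    def __init__(self, window: float, max_events: int):
        self.window, self.max_events = window, max_events
        self.events = deque()

    def record(self, now: float) -> bool:
        """Record one error event; return True if the window holds enough events."""
        self.events.append(now)
        while now - self.events[0] > self.window:
            self.events.popleft()  # drop events older than the window
        return len(self.events) >= self.max_events
```

A fixed counter forgets everything at a period boundary, so a burst that straddles the boundary can go unnoticed; a sliding counter catches such bursts at the cost of storing individual event times.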
Abstract:
An apparatus, method, and system associates an identifier with a data packet. The identifier uniquely identifies a communication module, such as a host interface card, within a data storage system. In operation, a computer host sends a data packet to a server. The communication module receives the data packet and associates an identifier, unique to the communication module, with the data packet. The data packet is stored in a disk array, such as a Redundant Array of Independent Disks (RAID) system. When the computer host later requests the stored data packet, a validation module, which may be implemented within a PCI adapter such as a host interface card, retrieves the data packet and determines whether the data packet is corrupt. If the data packet is corrupt, the validation module identifies which host interface card corrupted the data with the use of the unique identifier associated with the data packet. The faulty communication module may then be removed from operation in the data storage system.
Abstract:
An apparatus, system, and method are disclosed for data tracking and, in particular, for facilitating failure management within an electronic data communication system. The apparatus includes a tracking module and an error analysis module. The tracking module stores an adapter identifier in a tracking array. The adapter identifier corresponds to a source adapter from which data is received. The error analysis module determines a source of a data failure in response to recognition of the data failure. The data failure may occur on a host adapter, a device adapter, a communication fabric, a multi-processor, or another communication device. The apparatus, system, and method may be implemented in place of or in addition to hardware-assisted data integrity checking within a data storage system.
Abstract:
Provided are a method, system, and article of manufacture for determining modified data in cache for use during a recovery operation. An event is detected during which processing of writes to a storage device is suspended. A cache including modified data not destaged to the storage device is scanned to determine the data units having modified data in response to detecting the event. The data units having the modified data are indicated in a backup storage. The indication of the data units having the modified data in the backup storage is used during a recovery operation.
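The scan-and-record step can be sketched as below. The cache layout (a dictionary mapping a data-unit identifier to an entry with a `"modified"` flag), the `backup` dictionary, and both function names are hypothetical; a real controller would work on track or block metadata rather than Python objects.

```python
def record_modified_units(cache: dict, backup: dict) -> list:
    """Scan the cache and note in backup storage which units hold undestaged data.

    Intended to run once an event that suspends writes has been detected.
    """
    modified = sorted(unit for unit, entry in cache.items() if entry["modified"])
    backup["modified_units"] = modified
    return modified


def units_to_recover(backup: dict) -> list:
    """During recovery, read back the list of units that had modified data."""
    return list(backup.get("modified_units", []))


cache = {
    "track-7": {"modified": True},
    "track-2": {"modified": False},
    "track-3": {"modified": True},
}
backup = {}
record_modified_units(cache, backup)
```

Only the identifiers are written to backup storage here, not the data itself, which keeps the scan short while writes are suspended.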