摘要:
System-directed checkpointing is enabled in otherwise standard computers through relatively straightforward augmentations to the computer's memory controller hub. Firmware routines executed by a control and dispatch unit that is normally part of any memory controller hub enable it to implement any of six different checkpointing strategies: post-image checkpointing in which an image of the system state at the time of the last checkpoint is maintained in a local shadow memory; post-image checkpointing in which an image of the system state at the time of the last checkpoint is maintained in a shadow memory located in a second, backup computer; post-image checkpointing using a bit-map memory, having one bit representing each data block in system memory, to reduce the amount of memory-to-memory copying required to establish a checkpoint; post-image checkpointing to a local shadow memory using two bit map memories to enable normal processing to continue while the shadow is being updated, post-image checkpointing to a local shadow memory using a block-state memory that eliminates the need for any memory-to-memory copying; and local pre-image checkpointing that does not require a shadow memory. Since each of these implementations has advantages and disadvantages relative to the others and since similar mechanisms are used in the memory controller hub for all of these options, it can be designed to support all of them with hardwired or settable status bits defining which is to be supported in a given situation.
摘要:
While system-directed checkpointing can be implemented in various ways, for example by adding checkpointing support in the memory controller or in the operating system in otherwise standard computers, implementation at the hypervisor level enables the necessary state information to be captured efficiently while providing a number of ancillary advantages over those prior-art methods. This disclosure details procedures for realizing those advantages through relatively minor modifications to normal hypervisor operations. Specifically, by capturing state information in a guest-operating-system-specific manner, any guest operating system can be rolled back independently and resumed without losing either program or input/output (I/O) continuity and without affecting the operation of the other operating systems or their associated applications supported by the same hypervisor. Similarly, by managing I/O queues as described herein, rollback can be accomplished without requiring I/O operations to be repeated and I/O device failures can be circumvented without losing any I/O data in the process.
摘要:
While system-directed checkpointing can be implemented in various ways, for example by adding checkpointing support in the memory controller or in the operating system in otherwise standard computers, implementation at the hypervisor level enables the necessary state information to be captured efficiently while providing a number of ancillary advantages over those prior-art methods. This disclosure details procedures for realizing those advantages through relatively minor modifications to normal hypervisor operations. Specifically, by capturing state information in a guest-operating-system-specific manner, any guest operating system can be rolled back independently and resumed without losing either program or input/output (I/O) continuity and without affecting the operation of the other operating systems or their associated applications supported by the same hypervisor. Similarly, by managing I/O queues as described herein, rollback can be accomplished without requiring I/O operations to be repeated and I/O device failures can be circumvented without losing any I/O data in the process.