Method and apparatus for creating and using persistent images of distributed shared memory segments and in-memory checkpoints
First Claim
1. A method comprising:
updating a first checkpoint image, responsive to a first change made by a first instance of an application executing on a plurality of nodes, wherein the first instance of the application is executing on a first node of the plurality of nodes, wherein the first checkpoint image is maintained at the first node, and wherein the first checkpoint image comprises a first state of the first instance of the application;
updating, responsive to a second change committed by any of the first instance of the application and at least one other instance of the application executing on at least one other node of the plurality of nodes, a second checkpoint image maintained at a distributed shared memory shared by the plurality of nodes wherein the distributed shared memory propagates the second checkpoint image to the plurality of nodes, and wherein the second checkpoint image comprises a second state of any of the first and the at least one other instances of the application;
creating, from at least a portion of the distributed shared memory and in response to the second change, a checkpoint-image snapshot comprising data indicative of the second state of any of the first and the at least one other instances of the application; and
storing the checkpoint-image snapshot on at least one node of the plurality of nodes.
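The four steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the `Node` and `DistributedSharedMemory` classes, the helper functions, and the use of in-process dictionaries in place of real cluster memory are all assumptions made for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cluster node (hypothetical model, not taken from the patent)."""
    name: str
    local_checkpoint: dict = field(default_factory=dict)  # first checkpoint image
    dsm_replica: dict = field(default_factory=dict)       # node's copy of the shared image
    snapshots: list = field(default_factory=list)         # stored checkpoint-image snapshots


class DistributedSharedMemory:
    """Toy stand-in for a distributed shared memory segment (assumption)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.checkpoint = {}  # second checkpoint image

    def commit(self, key, value):
        # Update the shared checkpoint image, then propagate it to every node.
        self.checkpoint[key] = value
        for node in self.nodes:
            node.dsm_replica = copy.deepcopy(self.checkpoint)

    def snapshot(self):
        # Point-in-time copy of the committed checkpoint data.
        return copy.deepcopy(self.checkpoint)


def apply_change(node, key, value):
    """Step 1: an application instance updates its node-local image."""
    node.local_checkpoint[key] = value


def commit_change(dsm, storing_node, key, value):
    """Steps 2-4: update the shared image, snapshot it, store the snapshot."""
    dsm.commit(key, value)
    snap = dsm.snapshot()
    storing_node.snapshots.append(snap)
    return snap


nodes = [Node("node-a"), Node("node-b")]
dsm = DistributedSharedMemory(nodes)
apply_change(nodes[0], "balance", 100)               # first change, local to node-a
snap = commit_change(dsm, nodes[0], "balance", 100)  # second change, committed
```

In this sketch the snapshot is a deep copy, so later commits to the shared image do not alter snapshots that have already been stored.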
Abstract
A method and apparatus that enable quick recovery from failure or restoration of an application state of one or more nodes, applications, and/or communication links in a distributed computing environment, such as a cluster. Recovery or restoration is facilitated by regularly saving persistent images of the in-memory checkpoint data and/or of distributed shared memory segments using snapshots of the committed checkpoint data. When one or more nodes fail, the snapshots can be read and used to restart the application in the most recently saved state prior to the failure or to roll back the application to an earlier state.
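The restart and rollback behavior described in the abstract can be sketched as follows. The `SnapshotStore` class and its method names are hypothetical, and an in-memory list stands in for persistent storage; this is an illustration under those assumptions, not the apparatus itself.

```python
import copy


class SnapshotStore:
    """Hypothetical ordered store of checkpoint snapshots (oldest first)."""

    def __init__(self):
        self._snapshots = []

    def save(self, state):
        # Deep-copy so later changes to the live state cannot alter the snapshot.
        self._snapshots.append(copy.deepcopy(state))
        return len(self._snapshots) - 1  # index usable as a rollback target

    def latest(self):
        # Most recently saved state, used to restart after a failure.
        return copy.deepcopy(self._snapshots[-1])

    def at(self, index):
        # An earlier saved state, used for rollback.
        return copy.deepcopy(self._snapshots[index])


store = SnapshotStore()
state = {"orders": 1}
first = store.save(state)      # snapshot of the first committed state
state["orders"] = 2
store.save(state)              # snapshot of the second committed state
state["orders"] = 3            # change made after the last snapshot; lost on failure

recovered = store.latest()     # restart in the most recently saved state
rolled_back = store.at(first)  # roll back to an earlier state
```

Only states that were captured in a snapshot can be recovered here, which is why the change made after the last `save` does not appear in `recovered`.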
20 Claims
1. A method comprising:
updating a first checkpoint image, responsive to a first change made by a first instance of an application executing on a plurality of nodes, wherein the first instance of the application is executing on a first node of the plurality of nodes, wherein the first checkpoint image is maintained at the first node, and wherein the first checkpoint image comprises a first state of the first instance of the application;
updating, responsive to a second change committed by any of the first instance of the application and at least one other instance of the application executing on at least one other node of the plurality of nodes, a second checkpoint image maintained at a distributed shared memory shared by the plurality of nodes wherein the distributed shared memory propagates the second checkpoint image to the plurality of nodes, and wherein the second checkpoint image comprises a second state of any of the first and the at least one other instances of the application;
creating, from at least a portion of the distributed shared memory and in response to the second change, a checkpoint-image snapshot comprising data indicative of the second state of any of the first and the at least one other instances of the application; and
storing the checkpoint-image snapshot on at least one node of the plurality of nodes.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. An apparatus comprising:
means for updating a first checkpoint image, responsive to a first change made by a first instance of an application executing on a plurality of nodes, wherein the first instance of the application is executing on a first node of the plurality of nodes, wherein the first checkpoint image is maintained at the first node, and wherein the first checkpoint image comprises a first state of the first instance of the application;
means for updating, responsive to a second change committed by any of the first instance of the application and at least one other instance of the application executing on at least one other node of the plurality of nodes, a second checkpoint image maintained at a distributed shared memory shared by the plurality of nodes wherein the distributed shared memory propagates the second checkpoint image to the plurality of nodes, and wherein the second checkpoint image comprises a second state of any of the first and the at least one other instances of the application;
means for creating, from at least a portion of the distributed shared memory and in response to the second change, a checkpoint-image snapshot comprising data indicative of the second state of any of the first and the at least one other instances of the application; and
storage for storing the checkpoint-image snapshot.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
Specification