FAULT TOLERANCE IN DISTRIBUTED SYSTEMS
Abstract
Fault tolerance is provided in a distributed system. The complexity of replicas and rollback requests is avoided; instead, a local failure in a component of a distributed system is tolerated. The local failure is tolerated by storing state related to a requested operation on the component, persisting that stored state in a data store, such as a relational database, asynchronously processing the operation request, and, if a failure occurs, restarting the component using the stored state from the data store.
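The restart behaviour summarized in the abstract can be illustrated with a minimal sketch. It assumes SQLite as the relational data store, and the table name `pending_ops` and the helper names `open_store`, `persist_request`, `mark_done`, and `recover` are hypothetical; none of them come from the patent, and the sketch is not the claimed implementation.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the (assumed) relational store and create the state table if absent."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_ops ("
        " op_id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def persist_request(conn, op_id, payload):
    """Persist state for a requested operation before it is processed asynchronously."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO pending_ops (op_id, payload) VALUES (?, ?)",
            (op_id, json.dumps(payload)),
        )


def mark_done(conn, op_id):
    """Record that asynchronous processing of the operation has finished."""
    with conn:
        conn.execute("UPDATE pending_ops SET done = 1 WHERE op_id = ?", (op_id,))


def recover(conn):
    """On restart, return every persisted operation that has not completed."""
    rows = conn.execute(
        "SELECT op_id, payload FROM pending_ops WHERE done = 0 ORDER BY op_id"
    ).fetchall()
    return [(op_id, json.loads(payload)) for op_id, payload in rows]
```

In this sketch a restarted component would call `recover` once at startup and resume each returned operation. With a file path in place of `:memory:`, the pending rows survive a process crash, which is the property the abstract relies on.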
20 Claims
1. A method of managing execution of operation requests to facilitate fault tolerance in a distributed system having a plurality of components, said method comprising:
receiving at one component of the distributed system an operation request to be processed, the one component executing on a processor;
processing, by the one component, the operation request, the processing including initiating one or more sub-operation requests to be performed by at least one other component of the distributed system;
storing at least an indication of the one or more sub-operation requests in an asynchronous work queue to be asynchronously processed by the at least one other component, the asynchronous work queue including one or more sub-operation requests for which processing is incomplete;
storing state related to the operation request in a persistent data store, said state including at least an indication of the one or more sub-operation requests on the asynchronous work queue; and
responsive to storing the state in the persistent data store and completing the operation request, asynchronously initiating execution of a sub-operation request of the one or more sub-operation requests on the asynchronous work queue.
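The ordering of steps recited in claim 1 (queue the sub-operation requests, persist the state, complete the request, then initiate the sub-operations asynchronously) can be sketched as follows. This is an illustrative reading only: the `Component` class, its `handle` method, the `peer` callable, and the use of a plain dict as a stand-in for the persistent data store are all assumptions made for the example and do not appear in the patent.

```python
import json
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class Component:
    """Hypothetical component that follows the step order of claim 1."""

    def __init__(self, store, executor, peer):
        self.store = store         # dict standing in for a persistent data store (assumption)
        self.executor = executor   # thread pool used for asynchronous initiation
        self.peer = peer           # callable standing in for the other component
        self.work_queue = deque()  # sub-operation requests whose processing is incomplete

    def handle(self, op_id, request):
        # Processing the request yields one or more sub-operation requests.
        sub_ops = [f"{op_id}/{step}" for step in request["steps"]]
        # Store an indication of the sub-operations on the asynchronous work queue.
        self.work_queue.extend(sub_ops)
        # Persist state that includes an indication of the queued sub-operations.
        self.store[op_id] = json.dumps({"request": request, "queued": sub_ops})
        # The operation request itself is now complete from the caller's view.
        result = {"op_id": op_id, "status": "accepted"}
        # Only after persisting and completing: initiate sub-operations asynchronously.
        futures = [self.executor.submit(self._run_next) for _ in sub_ops]
        return result, futures

    def _run_next(self):
        # deque.popleft() is atomic, so worker threads can share the queue.
        sub_op = self.work_queue.popleft()
        return self.peer(sub_op)
```

A caller might construct the component with `ThreadPoolExecutor(max_workers=2)` and a `peer` function that forwards each sub-operation to another service. Because the state is written before any sub-operation starts, a crash after `handle` returns leaves a record of what was still queued.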
11. A computer system for managing execution of operation requests to facilitate fault tolerance in a distributed system having a plurality of components, said computer system comprising:
a memory; and
a processor in communications with the memory, wherein the computer system is configured to perform a method, the method comprising:
receiving at one component of the distributed system an operation request to be processed;
processing, by the one component, the operation request, the processing including initiating one or more sub-operation requests to be performed by at least one other component of the distributed system;
storing at least an indication of the one or more sub-operation requests in an asynchronous work queue to be asynchronously processed by the at least one other component, the asynchronous work queue including one or more sub-operation requests for which processing is incomplete;
storing state related to the operation request in a persistent data store, said state including at least an indication of the one or more sub-operation requests on the asynchronous work queue; and
responsive to storing the state in the persistent data store and completing the operation request, asynchronously initiating execution of a sub-operation request of the one or more sub-operation requests on the asynchronous work queue.
16. A computer program product for managing execution of operation requests to facilitate fault tolerance in a distributed system having a plurality of components, the computer program product comprising:
a non-transitory computer readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising:
receiving at one component of the distributed system an operation request to be processed;
processing, by the one component, the operation request, the processing including initiating one or more sub-operation requests to be performed by at least one other component of the distributed system;
storing at least an indication of the one or more sub-operation requests in an asynchronous work queue to be asynchronously processed by the at least one other component, the asynchronous work queue including one or more sub-operation requests for which processing is incomplete;
storing state related to the operation request in a persistent data store, said state including at least an indication of the one or more sub-operation requests on the asynchronous work queue; and
responsive to storing the state in the persistent data store and completing the operation request, asynchronously initiating execution of a sub-operation request of the one or more sub-operation requests on the asynchronous work queue.
Specification