Containment and recovery of software exceptions in interacting, replicated-state-machine-based fault-tolerant components
First Claim
1. A method of error recovery in a replicated state machine, wherein, at a defined time in an operation of the machine, a batch of inputs are input to the machine, and the machine uses a multitude of components for processing said inputs, and wherein during said processing, one of said components generates an exception, the method comprising the steps of:
after the exception, rolling the state machine back to a defined point in the operation of the machine;
preemptively failing said one of the components;
re-executing the batch of inputs in the state machine;
handling any failure, during said re-executing step, of said one of the components using a defined error handling procedure, including using a second one of said components to handle said any failure in order to contain said exception within said one of the components; and
repeating the rolling, preemptively failing, re-executing and handling steps until the input batch runs to completion without generating any exception in any of the components that are not pre-emptively failed.
Abstract
A method, system and article of manufacture are disclosed for error recovery in a replicated state machine. A batch of inputs is input to the machine, and the machine uses a multitude of components for processing those inputs. During this processing, one of said components generates an exception. The method comprises the steps of: after the exception, rolling the state machine back to a defined point in the operation of the machine; preemptively failing said one of the components; re-executing the input batch in the state machine; and handling any failure, during the re-executing step, of the one of the components using a defined error handling procedure. The rolling, preemptively failing, re-executing and handling steps are repeated until the input batch runs to completion without generating any exception in any of the components that are not preemptively failed.
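The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `ComponentError`, `recover` and `demo_step`, the use of a deep copy as the rollback checkpoint, and the two-list state are all assumptions made for illustration.

```python
import copy


class ComponentError(Exception):
    """Signals that the named component raised an unexpected exception."""

    def __init__(self, name):
        super().__init__(name)
        self.name = name


def recover(state, batch, step):
    """Run `batch` to completion, preemptively failing components that raise."""
    checkpoint = copy.deepcopy(state)   # the defined point to roll back to
    fail_set = set()                    # components to fail preemptively
    while True:
        work = copy.deepcopy(checkpoint)        # start (or roll back) from the checkpoint
        try:
            for item in batch:                  # (re-)execute the whole batch
                step(work, item, fail_set)
            return work, fail_set               # no exception outside fail_set
        except ComponentError as err:
            if err.name in fail_set:
                raise                           # the error handling itself failed
            fail_set.add(err.name)              # fail it before the next run


def demo_step(state, item, fail_set):
    """Hypothetical step: a 'parser' component guarded by its caller."""
    if "parser" in fail_set:
        state["skipped"].append(item)   # caller's defined error handling
        return
    if item < 0:
        raise ComponentError("parser")  # stand-in for a software exception
    state["stored"].append(item)


final_state, failed = recover({"stored": [], "skipped": []}, [1, -2, 3], demo_step)
```

In this run the first pass stores `1` and then raises on `-2`. The second pass restarts from the checkpoint with `parser` in the fail-set, so the caller's error path handles all three inputs and the batch completes.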
20 Claims
1. A method of error recovery in a replicated state machine, wherein, at a defined time in an operation of the machine, a batch of inputs are input to the machine, and the machine uses a multitude of components for processing said inputs, and wherein during said processing, one of said components generates an exception, the method comprising the steps of:
after the exception, rolling the state machine back to a defined point in the operation of the machine;
preemptively failing said one of the components;
re-executing the batch of inputs in the state machine;
handling any failure, during said re-executing step, of said one of the components using a defined error handling procedure, including using a second one of said components to handle said any failure in order to contain said exception within said one of the components; and
repeating the rolling, preemptively failing, re-executing and handling steps until the input batch runs to completion without generating any exception in any of the components that are not pre-emptively failed.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10.
11. A method of error recovery in a replicated state machine, wherein, at a defined time in an operation of the machine, a batch of inputs are input to the machine, and the machine uses a multitude of components for processing said inputs, and wherein during said processing, one of said components generates an exception, the method comprising the steps of:
after the exception, rolling the state machine back to a defined point in the operation of the machine;
preemptively failing said one of the components;
re-executing the batch of inputs in the state machine;
handling any failure, during said re-executing step, of said one of the components using a defined error handling procedure; and
repeating the rolling, preemptively failing, re-executing and handling steps until the input batch runs to completion without generating any exception in any of the components that are not preemptively failed;
wherein the preemptively failing step includes the steps of:
assigning a respective one component ID to each of the components; and
maintaining a fail-set of components that are to be preemptively failed during input-batch processing; and
wherein each of the components receives a new ID when said each component is created, and as a result of a reset by a supervisor component after an exception.
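The ID and fail-set bookkeeping recited in claim 11 might be sketched as follows. This is an illustrative reading, not the patent's implementation. The `Supervisor` class and its method names are hypothetical. The idea that a fresh ID after a reset stops an old fail-set entry from matching the component is an assumption, and is not stated in the claim.

```python
import itertools


class Supervisor:
    """Hypothetical supervisor that hands out component IDs and tracks the fail-set."""

    def __init__(self):
        self._next_id = itertools.count(1)  # source of never-reused IDs
        self.ids = {}                       # component name -> current ID
        self.fail_set = set()               # IDs to fail preemptively

    def create(self, name):
        """Assign a new ID when a component is created."""
        self.ids[name] = next(self._next_id)
        return self.ids[name]

    def record_exception(self, name):
        """Add the component's current ID to the fail-set."""
        self.fail_set.add(self.ids[name])

    def is_preemptively_failed(self, name):
        return self.ids[name] in self.fail_set

    def reset(self, name):
        """Reset after an exception: the component receives a new ID."""
        return self.create(name)


sup = Supervisor()
old_id = sup.create("cache")
sup.record_exception("cache")
failed_before_reset = sup.is_preemptively_failed("cache")
new_id = sup.reset("cache")
failed_after_reset = sup.is_preemptively_failed("cache")
```

Under these assumptions, keying the fail-set on IDs rather than names means the reset instance is treated as a new component, while the old ID stays in the fail-set for any re-execution that still reaches the old instance.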
12. An error recovery system in a replicated state machine, wherein, at a defined time in an operation of the machine, a batch of inputs are input to the machine, and the machine uses a multitude of components for processing said inputs, and wherein during said processing, one of said components generates an exception, the error recovery system comprising:
a computer system including one or more processor units configured for:
after the exception, rolling the state machine back to a defined point in the operation of the machine;
preemptively failing said one of the components;
re-executing the batch of inputs in the state machine;
handling any failure, during said re-executing step, of said one of the components using a defined error handling procedure, including using a second one of said components to handle said any failure in order to contain said exception within said one of the components; and
repeating the rolling, preemptively failing, re-executing and handling steps until the input batch runs to completion without generating any exception in any of the components that are not preemptively failed.
Dependent claims: 13, 14, 15, 16.
17. An article of manufacture comprising:
at least one computer usable tangible medium having computer readable program code logic tangibly embodied therein to execute a machine instruction in a processing unit for error recovery in a replicated state machine, wherein, at a defined time in an operation of the machine, a batch of inputs are input to the machine, and the machine uses a multitude of components for processing said inputs, and wherein during said processing, one of said components generates an exception, said computer readable program code logic, when executing, performing the following steps:
after the exception, rolling the state machine back to a defined point in the operation of the machine;
preemptively failing said one of the components;
re-executing the batch of inputs in the state machine;
handling any failure, during said re-executing step, of said one of the components using a defined error handling procedure, including using a second one of said components to handle said any failure in order to contain said exception within said one of the components; and
repeating the rolling, preemptively failing, re-executing and handling steps until the input batch runs to completion without generating any exception in any of the components that are not preemptively failed.
Dependent claims: 18, 19, 20.
Specification