Method and system for automatic failover of distributed query processing using distributed shared memory
First Claim
1. A method for implementing automatic recovery from failure of resources in a grid-based distributed database, the grid comprising a plurality of multi-cast subgroup of nodes, wherein each subgroup of nodes comprises one or more worker nodes and one or more idle nodes, the method comprising:
- determining a category for each node in the subgroup of nodes, wherein each node is categorized as a worker node or an idle node;
saving a set of data structures corresponding to each step involved in execution of a task assigned to each worker node at pre-determined time intervals in a shared memory distributed across the sub-group of nodes, wherein the saved set of data structures is assigned a first value;
monitoring each worker node by the one or more idle nodes in each sub-group by polling the shared memory for detecting changes to the saved set of data structures corresponding to each worker node at pre-determined time intervals, wherein a second value is assigned to the saved set of data structures corresponding to each worker node if a change is detected;
raising a failure notification by the one or more idle nodes if the second value is not detected after a pre-determined period of time for at least one of the worker nodes, wherein the detection of no change from the first value to the second value indicates interruption in the execution of at least one of the steps involved in the execution of the task, thereby the saved set of data structures acts as a checkpoint for determining the saved set of data structures that are to be reloaded to the one or more idle nodes;
reloading the saved set of data structures corresponding to each step involved in execution of the task of the at least one worker node to at least one of the one or more idle nodes after the detection of no change from the first value to the second value; and
resuming, by the idle node, execution from the interrupted step involved in the execution of the task.
2 Assignments
0 Petitions
Accused Products
Abstract
A method and system for implementing automatic recovery from failure of resources in a grid-based distributed database is provided. The method includes determining the category of each node in the subgroup of nodes, where the determination identifies each node as at least one of a worker node and an idle node. The method further includes saving state of each worker node engaged in execution of a task in a shared memory at pre-determined time intervals. Each worker node is monitored by one or more idle nodes in each sub-group. Upon detection of no change in state of worker node for a pre-determined period of time, a failure notification is raised by one or more idle nodes that have detected failure of the worker node.
70 Citations
11 Claims
-
1. A method for implementing automatic recovery from failure of resources in a grid-based distributed database, the grid comprising a plurality of multi-cast subgroup of nodes, wherein each subgroup of nodes comprises one or more worker nodes and one or more idle nodes, the method comprising:
-
determining a category for each node in the subgroup of nodes, wherein each node is categorized as a worker node or an idle node; saving a set of data structures corresponding to each step involved in execution of a task assigned to each worker node at pre-determined time intervals in a shared memory distributed across the sub-group of nodes, wherein the saved set of data structures is assigned a first value; monitoring each worker node by the one or more idle nodes in each sub-group by polling the shared memory for detecting changes to the saved set of data structures corresponding to each worker node at pre-determined time intervals, wherein a second value is assigned to the saved set of data structures corresponding to each worker node if a change is detected; raising a failure notification by the one or more idle nodes if the second value is not detected after a pre-determined period of time for at least one of the worker nodes, wherein the detection of no change from the first value to the second value indicates interruption in the execution of at least one of the steps involved in the execution of the task, thereby the saved set of data structures acts as a checkpoint for determining the saved set of data structures that are to be reloaded to the one or more idle nodes; reloading the saved set of data structures corresponding to each step involved in execution of the task of the at least one worker node to at least one of the one or more idle nodes after the detection of no change from the first value to the second value; and resuming, by the idle node, execution from the interrupted step involved in the execution of the task. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A system for implementing automatic recovery from failure of resources in a Distributed Query Processing (DQP) engine implemented in a grid-based distributed database, wherein the grid comprises one or more worker nodes configured to execute a query and one or more idle nodes configured to monitor the one or more worker nodes, the system comprising:
-
a processor having; a State Manager module configured to save a set of data structures corresponding to each step involved in execution of a task assigned to each worker node at pre-determined time intervals in a distributed shared memory, wherein the saved set of data structures is assigned a first value; and a Fault Detector module operating concurrently with the State Manager module and configured to; monitor each of the worker nodes by polling the distributed shared memory for detecting changes to the saved set of data structures corresponding to each worker node at pre-determined time intervals, wherein a second value is assigned to the saved set of data structures corresponding to each worker node if a change is detected, detect node failure in the event of no change from the first value to the second value, raise alarms in case of node failure, wherein the node failure indicates interruption in the execution of at least one of the steps involved in the execution of the task, thereby the saved set of data structures acts as a checkpoint for determining the saved set of data structures that are to be reloaded to the one or more idle nodes reload the saved set of data structures corresponding to each step involved in execution of the task of the at least one worker node to at least one of the one or more idle nodes after the detection of node failure, and resume execution by the idle node from the last interrupted step involved in the execution of the task. - View Dependent Claims (10, 11)
-
Specification