METHOD AND SYSTEM FOR AUTOMATIC FAILOVER OF DISTRIBUTED QUERY PROCESSING USING DISTRIBUTED SHARED MEMORY
First Claim
1. A method for implementing automatic recovery from failure of resources in a grid-based distributed database, the grid comprising a plurality of multi-cast subgroup of nodes, wherein each subgroup of nodes comprises one or more worker nodes and one or more idle nodes, the method comprising:
- determining the category of each node in the subgroup of nodes, wherein the determination identifies each node as at least one of a worker node and an idle node;
saving state of each worker node engaged in execution of a task, wherein the state is saved in shared memory distributed across nodes in a sub-group, further wherein the state of each worker node is saved at pre-determined time intervals having a first fixed value;
monitoring each worker node by one or more idle nodes in each sub-group, wherein monitoring comprises polling the shared memory for changes to state of the each worker node at pre-determined time intervals having a second fixed value;
raising a failure notification by the one or more idle nodes, wherein the failure notification is raised upon detection of no change in state of the each worker node for a pre-determined period of time; and
resuming execution of task of the failed worker node by an idle node selected from amongst the one or more worker nodes.
2 Assignments
0 Petitions
Accused Products
Abstract
A method and system for implementing automatic recovery from failure of resources in a grid-based distributed database is provided. The method includes determining the category of each node in the subgroup of nodes, where the determination identifies each node as at least one of a worker node and an idle node. The method further includes saving state of each worker node engaged in execution of a task in a shared memory at pre-determined time intervals. Each worker node is monitored by one or more idle nodes in each sub-group. Upon detection of no change in state of worker node for a pre-determined period of time, a failure notification is raised by one or more idle nodes that have detected failure of the worker node.
-
Citations
12 Claims
-
1. A method for implementing automatic recovery from failure of resources in a grid-based distributed database, the grid comprising a plurality of multi-cast subgroup of nodes, wherein each subgroup of nodes comprises one or more worker nodes and one or more idle nodes, the method comprising:
-
determining the category of each node in the subgroup of nodes, wherein the determination identifies each node as at least one of a worker node and an idle node; saving state of each worker node engaged in execution of a task, wherein the state is saved in shared memory distributed across nodes in a sub-group, further wherein the state of each worker node is saved at pre-determined time intervals having a first fixed value; monitoring each worker node by one or more idle nodes in each sub-group, wherein monitoring comprises polling the shared memory for changes to state of the each worker node at pre-determined time intervals having a second fixed value; raising a failure notification by the one or more idle nodes, wherein the failure notification is raised upon detection of no change in state of the each worker node for a pre-determined period of time; and resuming execution of task of the failed worker node by an idle node selected from amongst the one or more worker nodes. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A system for implementing automatic recovery from failure of resources in a DQP engine implemented in a grid-based distributed database, wherein the grid comprises one or more worker nodes configured to execute a query and one or more idle nodes configured to monitor the one or more worker nodes, the system comprising:
-
a State Manager module configured to create, read and invalidate states of worker nodes in a distributed shared memory, wherein the state of a node is the minimal set of required data on which execution of a process is dependant; a Fault Detector module operating concurrently with the State Manager module and configured to detect node failures and raise alarms in case of node failures; and a Fault Handler module invoked by the Fault Detector module upon detection of a worker node failure and configured to trigger the worker nodes to modify their data exchange plan dynamically in response to detection of the worker node failure. - View Dependent Claims (11, 12)
-
Specification