METHOD AND SYSTEM FOR AUTOMATIC FAILOVER OF DISTRIBUTED QUERY PROCESSING USING DISTRIBUTED SHARED MEMORY

US 20110228668A1
Filed: 05/28/2010
Published: 09/22/2011
Est. Priority Date: 03/22/2010
Status: Active Grant

Abstract

A method and system for implementing automatic recovery from failure of resources in a grid-based distributed database is provided. The method includes determining the category of each node in the subgroup of nodes, where the determination identifies each node as at least one of a worker node and an idle node. The method further includes saving state of each worker node engaged in execution of a task in a shared memory at pre-determined time intervals. Each worker node is monitored by one or more idle nodes in each sub-group. Upon detection of no change in state of worker node for a pre-determined period of time, a failure notification is raised by one or more idle nodes that have detected failure of the worker node.
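The detection step in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `SharedMemory`, `IdleMonitor`, `save`, and `poll` are hypothetical, an in-process dictionary stands in for the distributed shared memory, and the clock value is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SharedMemory:
    """Hypothetical stand-in for the shared memory: worker id -> (version, saved_at)."""
    states: Dict[str, Tuple[int, float]] = field(default_factory=dict)

    def save(self, worker_id: str, version: int, now: float) -> None:
        # A worker node overwrites its saved state at each save interval.
        self.states[worker_id] = (version, now)


@dataclass
class IdleMonitor:
    """Hypothetical idle node that polls shared memory for state changes."""
    memory: SharedMemory
    timeout: float  # period with no state change after which a failure is raised
    # worker id -> (last version seen, time that version was first observed)
    last_seen: Dict[str, Tuple[int, float]] = field(default_factory=dict)

    def poll(self, now: float) -> List[str]:
        """Return the workers whose state has not changed for at least `timeout`."""
        failed = []
        for worker_id, (version, _) in self.memory.states.items():
            seen = self.last_seen.get(worker_id)
            if seen is None or seen[0] != version:
                # State changed since the last poll: the worker is making progress.
                self.last_seen[worker_id] = (version, now)
            elif now - seen[1] >= self.timeout:
                failed.append(worker_id)
        return failed
```

In this sketch a worker that keeps calling `save` with a new version is never reported, while one whose version stays the same across polls is reported once `timeout` has elapsed.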

12 Claims

1. A method for implementing automatic recovery from failure of resources in a grid-based distributed database, the grid comprising a plurality of multi-cast subgroup of nodes, wherein each subgroup of nodes comprises one or more worker nodes and one or more idle nodes, the method comprising:
- determining the category of each node in the subgroup of nodes, wherein the determination identifies each node as at least one of a worker node and an idle node;
- saving state of each worker node engaged in execution of a task, wherein the state is saved in shared memory distributed across nodes in a sub-group, further wherein the state of each worker node is saved at pre-determined time intervals having a first fixed value;
- monitoring each worker node by one or more idle nodes in each sub-group, wherein monitoring comprises polling the shared memory for changes to state of the each worker node at pre-determined time intervals having a second fixed value;
- raising a failure notification by the one or more idle nodes, wherein the failure notification is raised upon detection of no change in state of the each worker node for a pre-determined period of time; and
- resuming execution of task of the failed worker node by an idle node selected from amongst the one or more worker nodes.

2. The method of claim 1, wherein the second fixed value is pre-determined to be greater than the first fixed value.

3. The method of claim 1, wherein the task processed by a worker node comprises processing of a query.

4. The method of claim 3, wherein the implementation of automatic recovery from failure of resources is executed using an OGSA-DQP architecture.

5. The method of claim 4, wherein a state of worker node saved in distributed shared memory is a checkpoint including a minimal set of data structures required for part of query assigned to a worker node to be re-loaded on another node in order to continue execution of a query.

6. The method of claim 1, wherein the idle node selected for resuming the task of the failed worker node is chosen using a lock based agreement scheme.

7. The method of claim 6, wherein the lock based agreement scheme comprises at least one of Bully algorithm, Coin-flipping protocol and Byzantine protocol.

8. The method of claim 1, wherein the distributed shared memory is implemented using software objects.

9. The method of claim 1, wherein the distributed shared memory is implemented as associative memory.
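The selection and resume steps (claims 5 and 6) might look roughly like the sketch below. It assumes a plain first-to-acquire lock, which is a simplification and not one of the protocols named in claim 7. The names `TakeoverLocks`, `try_acquire`, and `take_over`, and the checkpoint fields, are hypothetical, and a `threading.Lock` stands in for whatever atomic primitive the shared memory would provide.

```python
import threading
from typing import Any, Dict, Optional


class TakeoverLocks:
    """Hypothetical lock table: failed worker id -> idle node that owns its task."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._owners: Dict[str, str] = {}

    def try_acquire(self, failed_worker: str, idle_node: str) -> bool:
        # Atomic test-and-set: the first idle node to ask becomes the owner,
        # and later requests from other idle nodes are refused.
        with self._guard:
            return self._owners.setdefault(failed_worker, idle_node) == idle_node


def take_over(
    locks: TakeoverLocks,
    checkpoints: Dict[str, Dict[str, Any]],
    failed_worker: str,
    idle_node: str,
) -> Optional[Dict[str, Any]]:
    """Return a copy of the failed worker's checkpoint if this idle node wins the lock."""
    if not locks.try_acquire(failed_worker, idle_node):
        return None  # another idle node is already resuming the task
    # The checkpoint is the saved state needed to continue the query fragment.
    return dict(checkpoints[failed_worker])
```

Only one of several competing idle nodes receives the checkpoint, so the failed worker's task is resumed exactly once in this sketch.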

10. A system for implementing automatic recovery from failure of resources in a DQP engine implemented in a grid-based distributed database, wherein the grid comprises one or more worker nodes configured to execute a query and one or more idle nodes configured to monitor the one or more worker nodes, the system comprising:
- a State Manager module configured to create, read and invalidate states of worker nodes in a distributed shared memory, wherein the state of a node is the minimal set of required data on which execution of a process is dependant;
- a Fault Detector module operating concurrently with the State Manager module and configured to detect node failures and raise alarms in case of node failures; and
- a Fault Handler module invoked by the Fault Detector module upon detection of a worker node failure and configured to trigger the worker nodes to modify their data exchange plan dynamically in response to detection of the worker node failure.

11. The system of claim 10, wherein the automatic recovery implemented is independent of time of occurrence of failure.

12. The system of claim 10, wherein the State Manager module, the Fault Detector module and the Fault Handler module are implemented as software modules in an idle node.
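One way to picture how the three modules of claim 10 fit together is the sketch below. All class and method names beyond the module names themselves are hypothetical. The data exchange plan is modelled as a simple mapping from plan fragment to executing node, which is an assumption made for illustration, and the handler rewrites that mapping directly rather than signalling real worker nodes.

```python
from typing import Callable, Dict, Optional


class StateManager:
    """Creates, reads and invalidates worker states (a dict stands in for shared memory)."""

    def __init__(self) -> None:
        self._states: Dict[str, dict] = {}

    def create(self, worker_id: str, state: dict) -> None:
        self._states[worker_id] = dict(state)

    def read(self, worker_id: str) -> Optional[dict]:
        return self._states.get(worker_id)

    def invalidate(self, worker_id: str) -> None:
        self._states.pop(worker_id, None)


class FaultHandler:
    """Reroutes the data exchange plan away from a failed worker."""

    def __init__(self, exchange_plan: Dict[str, str]) -> None:
        self.exchange_plan = exchange_plan  # plan fragment -> node executing it

    def handle(self, failed: str, replacement: str) -> None:
        for fragment, node in self.exchange_plan.items():
            if node == failed:
                self.exchange_plan[fragment] = replacement


class FaultDetector:
    """Raises the alarm for a worker and invokes the Fault Handler."""

    def __init__(self, states: StateManager, handler: FaultHandler,
                 pick_replacement: Callable[[], str]) -> None:
        self.states = states
        self.handler = handler
        self.pick_replacement = pick_replacement  # assumed selection policy

    def report_failure(self, worker_id: str) -> Optional[dict]:
        """Return the checkpoint handed to the replacement, or None if none is saved."""
        checkpoint = self.states.read(worker_id)
        if checkpoint is None:
            return None
        self.handler.handle(worker_id, self.pick_replacement())
        self.states.invalidate(worker_id)  # the old state must not be reused
        return checkpoint
```

Keeping the handler separate from the detector mirrors the claim's wording that the Fault Handler is invoked by the Fault Detector, so the rerouting policy can change without touching detection.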

Current Assignee: Infosys Limited
Original Assignee: Infosys Technologies Limited (Infosys Limited)
Inventors: PILLAI, Brijesh; KRISHNAMOORTHY, Srikumar; SINGH, Aakanksha Gagrani

Granted Patent: US 8,874,961 B2

US Class Current: 370/217

CPC Class Codes:
G06F 11/2023   Failover techniques
G06F 11/203   using migration
G06F 11/2043   where the redundant compone...
G06F 16/2471   Distributed queries
G06F 2201/80   Database-specific techniques
G06F 2201/815   Virtual
