SYSTEM AND METHOD FOR HANDLING MULTI-NODE FAILURES IN A DISASTER RECOVERY CLUSTER

US 20160085647A1
Filed: 12/03/2014
Published: 03/24/2016
Est. Priority Date: 09/22/2014
Status: Active Grant

Abstract

A system and method for handling multi-node failures in a disaster recovery cluster is provided. In the event of an error condition, a switchover operation occurs from the failed nodes to one or more surviving nodes. Data stored in non-volatile random access memory is recovered by the surviving nodes to bring storage objects, e.g., disks, aggregates and/or volumes, into a consistent state.

20 Claims

1. A method comprising:
- determining that a candidate node is not available for a switchover operation;
- identifying an alternate node for the switchover operation;
- determining whether the identified alternate node is capable of handling a load from a plurality of other nodes;
- in response to determining that the identified alternate node is capable of handling the load from the plurality of other nodes, performing a switchover operation to transfer ownership of one or more objects from the plurality of other nodes to the identified alternate node;
- recovering data from a non-volatile memory to the one or more objects; and
- bringing online the one or more objects.

2. The method of claim 1 wherein the one or more objects comprise volumes.

3. The method of claim 1 wherein the one or more objects comprise aggregates.

4. The method of claim 1 wherein the plurality of other nodes comprises a first node arranged in a high availability pairing with a second node.

5. The method of claim 4 further comprising:
- detecting a failure on the first node; and
- performing a failover operation from the first node to the second node.

6. The method of claim 5 further comprising detecting an error condition on the second node.

7. The method of claim 1 wherein recovering data from the non-volatile memory further comprises replaying a portion of data stored in the non-volatile memory that was mirrored from the plurality of other nodes.

8. The method of claim 1 wherein the one or more objects are stored on magnetic storage media.

9. The method of claim 1 wherein the one or more objects are stored on storage devices operatively interconnected with a shared switching fabric.
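
The steps recited in claim 1 can be illustrated with a short sketch. This is not the patented implementation: the `Node` and `StorageObject` classes, the `object_limit` field, and the rule used in `can_handle_load` (owned plus incoming objects must fit under a per-node limit) are assumptions made for illustration only. The claim itself does not say how load capacity is measured.

```python
from dataclasses import dataclass, field


@dataclass
class StorageObject:
    name: str
    owner: str
    data: list = field(default_factory=list)
    online: bool = False


@dataclass
class Node:
    name: str
    available: bool = True
    object_limit: int = 4  # hypothetical capacity, counted in objects
    owned: list = field(default_factory=list)
    # Hypothetical NVRAM contents: (object_name, payload) pairs mirrored from partners.
    nvram_log: list = field(default_factory=list)


def can_handle_load(node, incoming):
    """Assumed load test: current plus incoming objects must fit the node's limit."""
    return node.available and len(node.owned) + len(incoming) <= node.object_limit


def switchover(candidate, alternates, failed_nodes):
    """Return the node that took ownership, or None if no node qualified."""
    incoming = [obj for n in failed_nodes for obj in n.owned]
    # Use the candidate if it is available; otherwise consider the alternates.
    choices = [candidate] if candidate.available else alternates
    for target in choices:
        if not can_handle_load(target, incoming):
            continue
        # Transfer ownership of the objects to the chosen node.
        for obj in incoming:
            obj.owner = target.name
            target.owned.append(obj)
        for n in failed_nodes:
            n.owned.clear()
        # Recover data from non-volatile memory into the transferred objects.
        by_name = {obj.name: obj for obj in incoming}
        for obj_name, payload in target.nvram_log:
            if obj_name in by_name:
                by_name[obj_name].data.append(payload)
        # Bring the objects online.
        for obj in incoming:
            obj.online = True
        return target
    return None
```

In this sketch a node that is available but too small is skipped rather than partially loaded, which is one possible reading of the capacity check; other policies (for example, splitting objects across several alternates) are equally conceivable.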

10. A system comprising:
- a first high availability pair comprising a first and a second node operatively interconnected by a first cluster interconnect, the first node associated with first data storage objects and the second node associated with second data storage objects;
- a second high availability pair comprising a third and a fourth node operatively interconnected by a second cluster interconnect, the third node associated with third data storage objects and the fourth node associated with fourth data storage objects, the first and second high availability pairs organized as a disaster recovery group;
- wherein the first node is configured to perform a takeover operation of the second data storage objects in response to an error condition of the second node; and
- wherein the third node is configured to perform a switchover operation to manage the first data storage objects and the fourth node is configured to perform a switchover operation to manage the second data storage objects in response to a subsequent error condition affecting the first node.

11. The system of claim 10 wherein the third node is further configured to recover a portion of data stored in a third non-volatile memory, wherein the recovery of data causes the third node to write the data to the first data storage objects.

12. The system of claim 10 wherein the fourth node is further configured to recover a portion of data stored in a fourth non-volatile memory, wherein the recovery of data causes the fourth node to write the data to the fourth data storage objects.

13. The system of claim 10 wherein the first, second, third and fourth data storage objects comprise volumes.

14. The system of claim 10 wherein the first, second, third and fourth data storage objects comprise aggregates.

15. The system of claim 10 wherein the first, second, third and fourth nodes are operatively interconnected with a shared switching fabric.

16. The system of claim 15 wherein the first data objects are stored on storage devices operatively interconnected with the shared switching fabric.

17. The system of claim 16 wherein ownership of the storage devices is modified to conform to the node managing the first data objects stored thereon.

18. The system of claim 10 wherein the third node further comprises a storage operating system having a management host module, the management host module storing an object limit option for the third node.
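
The failure sequence recited in claim 10 (takeover within a high availability pair, followed by switchover to the other pair) can be traced with a small model. The `DRGroup` class, the node names, and the mapping of each node to a disaster recovery partner are hypothetical; the claim only states which node ends up managing which data storage objects.

```python
class DRGroup:
    """Toy model of two HA pairs organized as a disaster recovery group."""

    def __init__(self):
        # Which node currently manages each set of data storage objects.
        self.manager = {
            "first": "node1", "second": "node2",
            "third": "node3", "fourth": "node4",
        }
        # Original (home) node of each object set.
        self.home = dict(self.manager)
        # HA partners share a cluster interconnect.
        self.ha_partner = {
            "node1": "node2", "node2": "node1",
            "node3": "node4", "node4": "node3",
        }
        # Assumed DR partner mapping between the two pairs.
        self.dr_partner = {
            "node1": "node3", "node2": "node4",
            "node3": "node1", "node4": "node2",
        }
        self.failed = set()

    def fail(self, node):
        """Record an error condition on a node and reassign its object sets."""
        self.failed.add(node)
        partner = self.ha_partner[node]
        if partner not in self.failed:
            # Takeover: the surviving HA partner manages the failed node's objects.
            for objs, mgr in self.manager.items():
                if mgr == node:
                    self.manager[objs] = partner
        else:
            # Both HA members are down: switch each object set over to the
            # DR partner of the node that originally owned it.
            for objs, mgr in self.manager.items():
                if mgr in (node, partner):
                    self.manager[objs] = self.dr_partner[self.home[objs]]
```

Calling `fail("node2")` and then `fail("node1")` on a fresh `DRGroup` leaves the first objects managed by `node3` and the second objects managed by `node4`, which mirrors the outcome described in the claim.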

19. A computer readable medium, including program instructions executable on a processor, the computer readable medium comprising:
- program instructions that determine that a candidate node is not available for a switchover operation;
- program instructions that identify an alternate node for the switchover operation;
- program instructions that determine whether the identified alternate node is capable of handling a load from a plurality of other nodes;
- in response to determining that the identified alternate node is capable of handling the load from the plurality of other nodes, program instructions that perform a switchover operation to transfer ownership of one or more objects from the plurality of other nodes to the identified alternate node;
- program instructions that recover data from a non-volatile memory to the one or more objects; and
- program instructions that bring online the one or more objects.

20. The computer readable medium of claim 19 wherein the program instructions that recover data from the non-volatile memory further comprise program instructions that replay a portion of data stored in the non-volatile memory that was mirrored from the plurality of other nodes.
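
The replay step recited in claims 7 and 20 can be sketched as a filter over a log. The `LogEntry` layout (sequence number, origin node, object name, offset, payload) and the choice to apply entries in sequence order are assumptions for illustration; the claims only say that the mirrored portion of the non-volatile memory is replayed.

```python
from collections import namedtuple

# Hypothetical NVRAM log record; the field names are illustrative only.
LogEntry = namedtuple("LogEntry", "seq origin obj offset payload")


def replay_mirrored(nvram, failed_nodes, objects):
    """Apply, in sequence order, only the entries mirrored from failed nodes.

    `objects` maps an object name to a dict of offset -> payload.
    Returns the number of entries applied.
    """
    applied = 0
    for entry in sorted(nvram, key=lambda e: e.seq):
        # Skip the surviving node's own entries and unknown objects.
        if entry.origin in failed_nodes and entry.obj in objects:
            objects[entry.obj][entry.offset] = entry.payload
            applied += 1
    return applied
```

Because later entries overwrite earlier ones at the same offset, replaying in sequence order leaves each object holding the most recent logged write, which is one simple way to reach a consistent state.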

Current Assignee: NetApp, Inc.
Original Assignee: NetApp, Inc.
Inventors: Ramasubramaniam, Vaiapuri; Kadayam, Harihara; Cho, Yong Eun; Patel, Chaitanya; Keremane, Hrishikesh; Deshmukh, Prachi; Sarfare, Parag

Granted Patent: US 9,811,428 B2
US Class Current: 1/1
CPC Class Codes
- G06F 11/2028   eliminating a faulty proces...
- G06F 11/2033   switching over of hardware ...
- G06F 11/2041   with more than one idle spa...
- G06F 11/2046   where the redundant compone...
- G06F 11/2097   maintaining the standby con...
- G06F 2201/805   Real-time
