Fault tolerance and recovery in a high-performance computing (HPC) system

US 20060112297A1
Filed: 11/17/2004
Published: 05/25/2006
Est. Priority Date: 11/17/2004
Status: Active Grant

Abstract

In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node, executing a host, in an HPC system including multiple nodes. A fabric couples the multiple nodes to each other and to storage that is accessible to each of the multiple nodes and capable of storing multiple hosts, each executable at any of the multiple nodes. The method includes, if a fault occurs at the currently running node, discontinuing operation of the currently running node and booting the host at a free node in the HPC system from the storage.
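
The flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Host`, `Node`, and `Manager` classes, their fields, and the in-memory `storage` dictionary are all hypothetical stand-ins chosen for illustration, and "booting" is reduced to assigning a host to a node.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Host:
    """A bootable host kept in shared storage (fields are hypothetical)."""
    name: str
    boot_image: str


@dataclass
class Node:
    node_id: int
    host: Optional[Host] = None  # host currently executing here, if any
    online: bool = True          # False once the manager discontinues the node


@dataclass
class Manager:
    nodes: Dict[int, Node]
    storage: Dict[str, Host]     # stands in for storage reachable over the fabric

    def free_node(self) -> Optional[Node]:
        """Return an online node that is not executing a host, if one exists."""
        for node in self.nodes.values():
            if node.online and node.host is None:
                return node
        return None

    def handle_fault(self, node_id: int) -> Optional[Node]:
        """Discontinue the faulty node and boot its host at a free node."""
        failed = self.nodes[node_id]
        host_name = failed.host.name if failed.host else None
        failed.online = False    # discontinue operation of the faulty node
        failed.host = None
        if host_name is None:
            return None          # nothing was running, so nothing to move
        target = self.free_node()
        if target is not None:
            # "Boot" the host from shared storage at the free node.
            target.host = self.storage[host_name]
        return target
```

Because the host is looked up in shared storage rather than copied from the failed node, the sketch does not depend on the failed node still being reachable, which mirrors the role the abstract gives to storage accessible to every node.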

38 Claims

1. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising:
- a fabric coupling a plurality of nodes in an HPC system to each other;
  
  storage coupled to the fabric and accessible to each of the nodes, the storage operable to store a plurality of hosts each executable at any of the nodes; and
  
  a manager coupled to the fabric, the manager operable to monitor a currently running node in the HPC system executing a host and, if a fault occurs at the currently running node, discontinue operation of the currently running node and boot the host at a free node in the HPC system from the storage.
- - 2. The system of claim 1, wherein the manager is further operable to identify the fault at the currently running node according to one or more messages from a daemon at the currently running node indicating a status of the currently running node.
  - 3. The system of claim 2, wherein the status of the currently running node comprises one or more of an average speed of a fan at the currently running node, a current temperature of the currently running node, and a level of power consumption at the currently running node.
  - 4. The system of claim 2, wherein the daemon communicates the messages to the manager at regular intervals.
  - 5. The system of claim 1, wherein the daemon communicates the messages to the manager across each interface between the currently running node and the fabric.
  - 6. The system of claim 1, wherein the manager is further operable to checkpoint the host to enable the manager to boot the host at the free node from a checkpoint.
  - 7. The system of claim 1, wherein the manager is further operable, if a fault occurs at the currently running node, to update one or more routing tables in the HPC system to enable communication to and from the host at the free node.
  - 8. The system of claim 1, wherein the manager is further operable, if a fault occurs at the currently running node, to notify an administrator of the HPC system of the occurrence of the fault.
  - 9. The system of claim 1, wherein the manager is operable, to discontinue operation of the currently running node, to do one or more of the following:
    - prevent communication to and from the currently running node;
      
      prevent the currently running node from accessing the storage;
      
      cause the currently running node to idle;
      
      cause the currently running node to power down;
      
      or cause the currently running node to reboot.
  - 10. The system of claim 1, wherein the fabric comprises a plurality of switches coupling the nodes to each other according to a topology comprising a three dimensional torus.
  - 11. The system of claim 10, wherein the switches are INFINIBAND switches.
  - 12. The system of claim 1, wherein a host comprises an Internet Protocol (IP) address, a boot image, a configuration, and a file system usable to boot the host at a node in the HPC system.
  - 13. The system of claim 1, wherein the fault at the currently running node comprises a fault in a hardware component at the currently running node.
  - 14. The system of claim 1, wherein the fault at the currently running node comprises a fault in a software component at the currently running node.
  - 15. The system of claim 1, wherein the fault at the currently running node comprises a fault in an interface between the currently running node and the fabric.
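
Claims 2 through 4 describe a daemon that reports node status to the manager at regular intervals. The following Python sketch shows one way a manager might identify a fault from such messages. The message fields follow claim 3, but every threshold, the interval length, and the rule that several missed intervals count as a fault are assumptions made for illustration; the claims specify none of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusMessage:
    node_id: int
    timestamp: float      # seconds, when the daemon sent the message
    fan_rpm: float        # average fan speed
    temperature_c: float  # current temperature
    power_w: float        # level of power consumption


# Hypothetical limits; the claims do not specify any of these values.
MIN_FAN_RPM = 1000.0
MAX_TEMPERATURE_C = 85.0
MAX_POWER_W = 400.0
HEARTBEAT_INTERVAL_S = 5.0
MISSED_INTERVALS_ALLOWED = 3


def identify_fault(last: StatusMessage, now: float) -> Optional[str]:
    """Return a fault description, or None if the node looks healthy."""
    # Assumed rule: silence for several intervals is treated as a fault.
    if now - last.timestamp > HEARTBEAT_INTERVAL_S * MISSED_INTERVALS_ALLOWED:
        return "missed heartbeats"
    if last.fan_rpm < MIN_FAN_RPM:
        return "fan speed too low"
    if last.temperature_c > MAX_TEMPERATURE_C:
        return "temperature too high"
    if last.power_w > MAX_POWER_W:
        return "power consumption too high"
    return None
```

A manager could call `identify_fault` on the most recent message from each node once per interval and start recovery for any node that returns a non-`None` result.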

16. A method for fault tolerance and recovery in a high-performance computing (HPC) system, the method comprising:
- monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes; and
  
  if a fault occurs at the currently running node:
  
  discontinuing operation of the currently running node; and
  
  booting the host at a free node in the HPC system from the storage.
- - 17. The method of claim 16, further comprising identifying the fault at the currently running node according to one or more messages from a daemon at the currently running node indicating a status of the currently running node.
  - 18. The method of claim 17, wherein the status of the currently running node comprises one or more of an average speed of a fan at the currently running node, a current temperature of the currently running node, and a level of power consumption at the currently running node.
  - 19. The method of claim 17, wherein the daemon communicates the messages to the manager at regular intervals.
  - 20. The method of claim 16, wherein the daemon communicates the messages to the manager across each interface between the currently running node and the fabric.
  - 21. The method of claim 16, further comprising checkpointing the host to enable booting the host at the free node from a checkpoint.
  - 22. The method of claim 16, further comprising, if a fault occurs at the currently running node, updating one or more routing tables in the HPC system to enable communication to and from the host at the free node.
  - 23. The method of claim 16, further comprising, if a fault occurs at the currently running node, notifying an administrator of the HPC system of the occurrence of the fault.
  - 24. The method of claim 16, wherein discontinuing operation of the currently running node comprises one or more of:
    - preventing communication to and from the currently running node;
      
      preventing the currently running node from accessing the storage;
      
      causing the currently running node to idle;
      
      causing the currently running node to power down; and
      
      causing the currently running node to reboot.
  - 25. The method of claim 16, wherein the fabric comprises a plurality of switches coupling the nodes to each other according to a topology comprising a three dimensional torus.
  - 26. The method of claim 25, wherein the switches are INFINIBAND switches.
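
Claim 25 refers to a topology comprising a three-dimensional torus. As a small illustration of what that means for connectivity, the Python sketch below computes the six neighbors of a node in such a torus, with coordinates wrapping around at each edge. The coordinate scheme and the function name `torus_neighbors` are assumptions for illustration and do not come from the patent.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Return the +/-1 neighbors along x, y, and z, wrapping at the edges.

    If a dimension has size 2, the +1 and -1 neighbors along that axis
    are the same node, so the returned list may contain duplicates.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z),
        ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z),
        (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz),
        (x, y, (z - 1) % nz),
    ]
```

The wraparound is what distinguishes a torus from a plain mesh: a node on a face of the grid still has a neighbor in every direction, so no node sits on a boundary.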

27. Logic for fault tolerance and recovery in a high-performance computing (HPC) system, the logic encoded in a computer-readable medium and when executed operable to:
- monitor a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes; and
  
  if a fault occurs at the currently running node:
  
  discontinue operation of the currently running node; and
  
  boot the host at a free node in the HPC system from the storage.
- - 28. The logic of claim 27, further operable to identify the fault at the currently running node according to one or more messages from a daemon at the currently running node indicating a status of the currently running node.
  - 29. The logic of claim 28, wherein the status of the currently running node comprises one or more of an average speed of a fan at the currently running node, a current temperature of the currently running node, and a level of power consumption at the currently running node.
  - 30. The logic of claim 28, wherein the daemon communicates the messages to the manager at regular intervals.
  - 31. The logic of claim 27, wherein the daemon communicates the messages to the manager across each interface between the currently running node and the fabric.
  - 32. The logic of claim 27, further operable to checkpoint the host to enable booting the host at the free node from a checkpoint.
  - 33. The logic of claim 27, further operable, if a fault occurs at the currently running node, to update one or more routing tables in the HPC system to enable communication to and from the host at the free node.
  - 34. The logic of claim 27, further operable, if a fault occurs at the currently running node, to notify an administrator of the HPC system of the occurrence of the fault.
  - 35. The logic of claim 27, operable, to discontinue operation of the currently running node, to do one or more of the following:
    - prevent communication to and from the currently running node;
      
      prevent the currently running node from accessing the storage;
      
      cause the currently running node to idle;
      
      cause the currently running node to power down; and
      
      cause the currently running node to reboot.
  - 36. The logic of claim 27, wherein the fabric comprises a plurality of switches coupling the nodes to each other according to a topology comprising a three dimensional torus.
  - 37. The logic of claim 36, wherein the switches are INFINIBAND switches.
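
Claims 33 and 35 cover updating routing tables and the ways a faulty node may be discontinued. The Python sketch below combines the two under stated assumptions: the routing table is modeled as a plain dictionary from host IP address to node identifier, the `Fence` enum simply names the five options listed in claim 35, and the function only records which actions were requested rather than performing them. None of these names or data structures appear in the patent.

```python
from enum import Enum, auto
from typing import Dict, List, Set


class Fence(Enum):
    """The five ways of discontinuing a node listed in claim 35."""
    BLOCK_COMMUNICATION = auto()
    BLOCK_STORAGE = auto()
    IDLE = auto()
    POWER_DOWN = auto()
    REBOOT = auto()


def recover(
    routes: Dict[str, int],
    fenced: Dict[int, Set[Fence]],
    failed_node: int,
    free_node: int,
    actions: Set[Fence],
) -> List[str]:
    """Record fencing of the failed node, then repoint its routes.

    routes: host IP address -> id of the node currently serving it
    fenced: node id -> fencing actions recorded so far
    Returns the IP addresses whose routes were moved to the free node.
    """
    if not actions:
        raise ValueError("at least one fencing action is required")
    fenced.setdefault(failed_node, set()).update(actions)
    moved = [ip for ip, node in routes.items() if node == failed_node]
    for ip in moved:
        routes[ip] = free_node
    return moved
```

Fencing is recorded before the routes change so that, in this sketch, traffic is never pointed at the new node while the old one might still be serving the same host.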

38. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising:
- means for monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes; and
  
  means for, if a fault occurs at the currently running node:
  
  discontinuing operation of the currently running node; and
  
  booting the host at a free node in the HPC system from the storage.

Current Assignee: Raytheon Company (Rtx Corporation)
Original Assignee: Raytheon Company (Rtx Corporation)
Inventors: Davidson, Shannon V.

Granted Patent: US 7,475,274 B2
US Class Current: 714/2
CPC Class Codes

G06F 11/1438   Restarting or rejuvenating

G06F 11/2005   using redundant communicati...

G06F 11/2025   using centralised failover ...

G06F 11/203   using migration

G06F 11/2046   where the redundant compone...

G06F 11/2051   in regular structures
