Fault tolerance and recovery in a high-performance computing (HPC) system

DC CAFC

US 7,475,274 B2
Filed: 11/17/2004
Issued: 01/06/2009
Est. Priority Date: 11/17/2004
Status: Active Grant




Abstract

In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system including multiple nodes. A fabric couples the multiple nodes to each other and to storage that is accessible to each of the multiple nodes and capable of storing multiple hosts, each executable at any of the multiple nodes. The method includes, if a fault occurs at the currently running node, discontinuing operation of the currently running node and booting the host at a free node in the HPC system from the storage.
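
The monitor, discontinue, and reboot sequence described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Node` and `Manager` classes, the `faulted` flag, and the in-memory assignment that stands in for booting a host from shared storage are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    host: Optional[str] = None  # host currently booted at this node, if any
    running: bool = True        # False once the manager discontinues the node
    faulted: bool = False       # assumed to be set by some fault detector


@dataclass
class Manager:
    nodes: List[Node] = field(default_factory=list)

    def free_node(self) -> Optional[Node]:
        # Assumption: a free node is running, healthy, and executes no host.
        for node in self.nodes:
            if node.running and not node.faulted and node.host is None:
                return node
        return None

    def recover(self, failed: Node) -> Optional[Node]:
        """Discontinue a faulted node and boot its host at a free node."""
        host = failed.host
        failed.running = False  # discontinue operation of the faulted node
        failed.host = None
        target = self.free_node()
        if target is not None and host is not None:
            target.host = host  # stands in for booting from shared storage
        return target

    def monitor_once(self) -> List[Tuple[str, str]]:
        """One monitoring pass; returns (host, new node) for each failover."""
        moves = []
        for node in list(self.nodes):
            if node.running and node.faulted and node.host is not None:
                host = node.host
                target = self.recover(node)
                if target is not None:
                    moves.append((host, target.name))
        return moves


mgr = Manager([Node("n0", host="hostA"), Node("n1")])
mgr.nodes[0].faulted = True
print(mgr.monitor_once())  # [('hostA', 'n1')]
```

In this sketch a host is lost if no free node exists when its node faults; a fuller design would queue the host until a node becomes available.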


38 Claims

1. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising:
- a fabric coupling a plurality of nodes in an HPC system to each other, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card;
  
  storage coupled to the fabric and accessible to each of the nodes, the storage operable to store a plurality of hosts each executable at any of the nodes; and
  
  a manager coupled to the fabric, the manager operable to monitor a currently running node in the HPC system executing a host and, if a fault occurs at the currently running node, discontinue operation of the currently running node and boot the host at a free node in the HPC system from the storage.
- - 2. The system of claim 1, wherein the manager is further operable to identify the fault at the currently running node according to one or more messages from a daemon at the currently running node indicating a status of the currently running node.
  - 3. The system of claim 2, wherein the status of the currently running node comprises one or more of an average speed of a fan at the currently running node, a current temperature of the currently running node, and a level of power consumption at the currently running node.
  - 4. The system of claim 2, wherein the daemon communicates the messages to the manager at regular intervals.
  - 5. The system of claim 2, wherein the daemon communicates the messages to the manager across each interface between the currently running node and the fabric.
  - 6. The system of claim 1, wherein the manager is further operable to checkpoint the host to enable the manager to boot the host at the free node from a checkpoint.
  - 7. The system of claim 1, wherein the manager is further operable, if a fault occurs at the currently running node, to update one or more routing tables in the HPC system to enable communication to and from the host at the free node.
  - 8. The system of claim 1, wherein the manager is further operable, if a fault occurs at the currently running node, to notify an administrator of the HPC system of the occurrence of the fault.
  - 9. The system of claim 1, wherein the manager is operable, to discontinue operation of the currently running node, to do one or more of the following:
    - prevent communication to and from the currently running node;
      
      prevent the currently running node from accessing the storage;
      
      cause the currently running node to idle;
      
      cause the currently running node to power down;
      
      or cause the currently running node to reboot.
  - 10. The system of claim 1, wherein the fabric comprises a plurality of switches coupling the nodes to each other according to a topology comprising a three dimensional torus.
  - 11. The system of claim 10, wherein the switches are INFINIBAND switches.
  - 12. The system of claim 1, wherein a host comprises an Internet Protocol (IP) address, a boot image, a configuration, and a file system usable to boot the host at a node in the HPC system.
  - 13. The system of claim 1, wherein the fault at the currently running node comprises a fault in a hardware component at the currently running node.
  - 14. The system of claim 1, wherein the fault at the currently running node comprises a fault in a software component at the currently running node.
  - 15. The system of claim 1, wherein the fault at the currently running node comprises a fault in an interface between the currently running node and the fabric.
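
Claims 2 through 4 describe a daemon that reports node status (average fan speed, current temperature, power consumption) to the manager at regular intervals. One way a manager might evaluate such messages is sketched below in Python. The threshold values, the field names, and the rule that a node is suspect after several missed reports are assumptions made for illustration; the claims do not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Status:
    node: str
    timestamp: float  # seconds; when the daemon sent the message
    fan_rpm: float    # average fan speed
    temp_c: float     # current temperature
    power_w: float    # level of power consumption


# Hypothetical limits; real values would depend on the hardware.
MIN_FAN_RPM = 1000.0
MAX_TEMP_C = 85.0
MAX_POWER_W = 400.0
REPORT_INTERVAL_S = 5.0
MISSED_REPORTS_ALLOWED = 3


def status_faults(status: Status) -> List[str]:
    """Return the names of the readings that fall outside the assumed limits."""
    faults = []
    if status.fan_rpm < MIN_FAN_RPM:
        faults.append("fan")
    if status.temp_c > MAX_TEMP_C:
        faults.append("temperature")
    if status.power_w > MAX_POWER_W:
        faults.append("power")
    return faults


def silent_nodes(last_seen: Dict[str, float], now: float) -> List[str]:
    """Return nodes whose last report is older than the allowed window."""
    limit = REPORT_INTERVAL_S * MISSED_REPORTS_ALLOWED
    return sorted(n for n, t in last_seen.items() if now - t > limit)


print(status_faults(Status("n0", 0.0, 500.0, 90.0, 200.0)))
print(silent_nodes({"n0": 0.0, "n1": 10.0}, now=16.0))
```

Treating silence as a fault is a design choice of this sketch: a daemon on a crashed node cannot report its own failure, so the absence of a regular message is the only signal the manager receives.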

16. A method for fault tolerance and recovery in a high-performance computing (HPC) system, the method comprising:
- monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to a storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card; and
  
  if a fault occurs at the currently running node:
  
  discontinuing operation of the currently running node; and
  
  booting a host at a free node in the HPC system from the storage.
- - 17. The method of claim 16, further comprising identifying the fault at the currently running node according to one or more messages from a daemon at the currently running node indicating a status of the currently running node.
  - 18. The method of claim 17, wherein the status of the currently running node comprises one or more of an average speed of a fan at the currently running node, a current temperature of the currently running node, and a level of power consumption at the currently running node.
  - 19. The method of claim 17, wherein the daemon communicates the messages to the manager at regular intervals.
  - 20. The method of claim 17, wherein the daemon communicates the messages to the manager across each interface between the currently running node and the fabric.
  - 21. The method of claim 16, further comprising checkpointing the host to enable booting the host at the free node from a checkpoint.
  - 22. The method of claim 16, further comprising, if a fault occurs at the currently running node, updating one or more routing tables in the HPC system to enable communication to and from the host at the free node.
  - 23. The method of claim 16, further comprising, if a fault occurs at the currently running node, notifying an administrator of the HPC system of the occurrence of the fault.
  - 24. The method of claim 16, wherein discontinuing operation of the currently running node comprises one or more of:
    - preventing communication to and from the currently running node;
      
      preventing the currently running node from accessing the storage;
      
      causing the currently running node to idle;
      
      causing the currently running node to power down; and
      
      causing the currently running node to reboot.
  - 25. The method of claim 16, wherein the fabric comprises a plurality of switches coupling the nodes to each other according to a topology comprising a three dimensional torus.
  - 26. The method of claim 25, wherein the switches are INFINIBAND switches.
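
Claims 10 and 25 recite a fabric whose switches couple the nodes in a topology comprising a three-dimensional torus. In a torus each coordinate axis wraps around, so a node at the edge of the grid is adjacent to the node at the opposite edge. The Python sketch below computes a node's neighbors under that wraparound rule; the coordinate scheme and the function name `torus_neighbors` are assumptions for illustration, not details from the patent.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Return the distinct neighbors of coord in a torus of size dims."""
    neighbors: List[Coord] = []
    for axis in range(3):
        for step in (-1, 1):
            moved = list(coord)
            # Modular arithmetic provides the wraparound link on each axis.
            moved[axis] = (moved[axis] + step) % dims[axis]
            candidate = (moved[0], moved[1], moved[2])
            # Axes of length 1 or 2 would otherwise yield self or duplicates.
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

With every axis at least three nodes long, each node has exactly six neighbors, which gives a failed node's replacement several alternative paths to the rest of the fabric.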

27. One or more computer-readable storage media storing logic for fault tolerance and recovery in a high-performance computing (HPC) system, the logic when executed operable to:
- monitor a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to a storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card; and
  
  if a fault occurs at the currently running node:
  
  discontinue operation of the currently running node; and
  
  boot a host at a free node in the HPC system from the storage.
- - 28. The computer-readable storage media of claim 27, further operable to identify the fault at the currently running node according to one or more messages from a daemon at the currently running node indicating a status of the currently running node.
  - 29. The computer-readable storage media of claim 28, wherein the status of the currently running node comprises one or more of an average speed of a fan at the currently running node, a current temperature of the currently running node, and a level of power consumption at the currently running node.
  - 30. The computer-readable storage media of claim 28, wherein the daemon communicates the messages to the manager at regular intervals.
  - 31. The computer-readable storage media of claim 28, wherein the daemon communicates the messages to the manager across each interface between the currently running node and the fabric.
  - 32. The computer-readable storage media of claim 27, further operable to checkpoint the host to enable booting the host at the free node from a checkpoint.
  - 33. The computer-readable storage media of claim 27, further operable, if a fault occurs at the currently running node, to update one or more routing tables in the HPC system to enable communication to and from the host at the free node.
  - 34. The computer-readable storage media of claim 27, further operable, if a fault occurs at the currently running node, to notify an administrator of the HPC system of the occurrence of the fault.
  - 35. The computer-readable storage media of claim 27, operable, to discontinue operation of the currently running node, to do one or more of the following:
    - prevent communication to and from the currently running node;
      
      prevent the currently running node from accessing the storage;
      
      cause the currently running node to idle;
      
      cause the currently running node to power down; and
      
      cause the currently running node to reboot.
  - 36. The computer-readable storage media of claim 27, wherein the fabric comprises a plurality of switches coupling the nodes to each other according to a topology comprising a three dimensional torus.
  - 37. The computer-readable storage media of claim 36, wherein the switches are INFINIBAND switches.
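
Claims 7, 22, and 33 recite updating one or more routing tables so that traffic reaches the host at its new node. A minimal Python sketch of that step follows. It assumes each routing table maps a host's IP address (claim 12 lists an IP address as part of a host) to the name of the node where the host runs; that table layout and the names `RoutingTable` and `reroute` are hypothetical.

```python
from typing import Dict

# Assumed layout: host IP address -> name of the node executing the host.
RoutingTable = Dict[str, str]


def reroute(tables: Dict[str, RoutingTable], host_ip: str, new_node: str) -> int:
    """Point host_ip at new_node in every table; return how many changed."""
    changed = 0
    for table in tables.values():
        if table.get(host_ip) != new_node:
            table[host_ip] = new_node
            changed += 1
    return changed


tables = {
    "switch0": {"10.0.0.5": "n0"},
    "switch1": {"10.0.0.5": "n0"},
}
print(reroute(tables, "10.0.0.5", "n1"))  # 2
print(tables["switch0"]["10.0.0.5"])      # n1
```

Because the host keeps its IP address when it moves, only the tables change; peers that address the host by IP need no reconfiguration in this sketch.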

38. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising computer-readable storage media comprising:
- means for monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card; and
  
  means for, if a fault occurs at the currently running node:
  
  discontinuing operation of the currently running node; and
  
  booting the host at a free node in the HPC system from the storage.


Current Assignee: Raytheon Company (Rtx Corporation)
Original Assignee: Raytheon Company (Rtx Corporation)
Inventors: Davidson, Shannon V.
Primary Examiner(s): LE, DIEU MINH T

Application Number: US10/991,754
Publication Number: US 20060112297A1
Time in Patent Office: 1,511 Days
Field of Search: 714/2, 714/4, 714/25, 714/47, 714/48
US Class Current: 714/4.4

CPC Class Codes:
G06F 11/1438   Restarting or rejuvenating
G06F 11/2005   using redundant communicati...
G06F 11/2025   using centralised failover ...
G06F 11/203   using migration
G06F 11/2046   where the redundant compone...
G06F 11/2051   in regular structures
