Fault tolerance and recovery in a high-performance computing (HPC) system
Abstract
In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system including multiple nodes. A fabric couples the multiple nodes to each other and to storage that is accessible to each of the multiple nodes and capable of storing multiple hosts, each executable at any of the multiple nodes. The method includes, if a fault occurs at the currently running node, discontinuing operation of the currently running node and booting the host at a free node in the HPC system from the storage.
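The recovery step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Cluster` class, its field names, the node and host identifiers, and the rule for picking a free node are all hypothetical, and the fabric is left implicit (every node is assumed to reach the shared storage). The sketch only shows the claimed sequence: on a fault, discontinue the running node, then boot its host from shared storage at a free node.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Cluster:
    """Hypothetical in-memory model of the manager's view of the HPC system."""

    # Hosts kept in shared storage; any node may boot any of them.
    storage: Set[str]
    # Nodes that are not currently executing a host.
    free_nodes: Set[str]
    # Maps each currently running node to the host it executes.
    running: Dict[str, str] = field(default_factory=dict)
    # Nodes whose operation was discontinued after a fault.
    discontinued: Set[str] = field(default_factory=set)

    def boot(self, host: str, node: str) -> None:
        """Boot `host` from shared storage at a free node."""
        if host not in self.storage:
            raise KeyError(f"host {host!r} is not in shared storage")
        if node not in self.free_nodes:
            raise ValueError(f"node {node!r} is not free")
        self.free_nodes.remove(node)
        self.running[node] = host

    def recover(self, faulty_node: str) -> Optional[str]:
        """Discontinue the faulty node and reboot its host at a free node.

        Returns the replacement node, or None when no free node exists
        (a case this sketch does not handle further).
        """
        host = self.running.pop(faulty_node)  # discontinue operation
        self.discontinued.add(faulty_node)
        if not self.free_nodes:
            return None
        # Assumption: any free node will do; pick one deterministically.
        replacement = min(self.free_nodes)
        self.boot(host, replacement)
        return replacement


# Usage: host-a runs on n1, n1 faults, and the host is rebooted elsewhere.
cluster = Cluster(storage={"host-a"}, free_nodes={"n1", "n2", "n3"})
cluster.boot("host-a", "n1")
new_node = cluster.recover("n1")
```

Because the host lives in storage reachable from every node rather than on the faulty node itself, the replacement node needs nothing from the failed one in order to boot it, which is what makes the free-node choice arbitrary in this sketch.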
38 Claims
1. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising:
a fabric coupling a plurality of nodes in an HPC system to each other;
storage coupled to the fabric and accessible to each of the nodes, the storage operable to store a plurality of hosts each executable at any of the nodes; and
a manager coupled to the fabric, the manager operable to monitor a currently running node in the HPC system executing a host and, if a fault occurs at the currently running node, discontinue operation of the currently running node and boot the host at a free node in the HPC system from the storage.
Dependent claims: 2-15
16. A method for fault tolerance and recovery in a high-performance computing (HPC) system, the method comprising:
monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes; and
if a fault occurs at the currently running node:
discontinuing operation of the currently running node; and
booting the host at a free node in the HPC system from the storage.
Dependent claims: 17-26
27. Logic for fault tolerance and recovery in a high-performance computing (HPC) system, the logic encoded in a computer-readable medium and when executed operable to:
monitor a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes; and
if a fault occurs at the currently running node:
discontinue operation of the currently running node; and
boot the host at a free node in the HPC system from the storage.
Dependent claims: 28-37
38. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising:
means for monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes; and
means for, if a fault occurs at the currently running node:
discontinuing operation of the currently running node; and
booting the host at a free node in the HPC system from the storage.
Specification