Fault tolerance and recovery in a high-performance computing (HPC) system
Abstract
In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system including multiple nodes. A fabric couples the multiple nodes to each other and to storage that is accessible to each of the multiple nodes and capable of storing multiple hosts, each executable at any of the multiple nodes. The method includes, if a fault occurs at the currently running node, discontinuing operation of the currently running node and booting the host at a free node in the HPC system from the storage.
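The recovery flow summarized in the abstract can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the patented implementation: the names `Node`, `NodeState`, `Manager`, `report_fault`, and `monitor_once` are hypothetical, the shared storage is modeled as an in-memory dictionary of host images, and "booting" is reduced to a state change.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class NodeState(Enum):
    FREE = "free"        # idle and available as a failover target
    RUNNING = "running"  # currently executing a host
    FAULTED = "faulted"  # fault detected, not yet handled
    OFFLINE = "offline"  # operation discontinued by the manager


@dataclass
class Node:
    node_id: str
    state: NodeState = NodeState.FREE
    host: Optional[str] = None  # name of the host booted on this node


class Manager:
    """Hypothetical manager: watches nodes and moves hosts off faulted ones."""

    def __init__(self, nodes: List[Node], storage: Dict[str, bytes]) -> None:
        self.nodes = {n.node_id: n for n in nodes}
        # Assumption: shared storage maps a host name to its boot image.
        self.storage = storage

    def boot(self, host: str, node_id: str) -> None:
        """Boot a stored host on a free node (modeled as a state change)."""
        if host not in self.storage:
            raise KeyError(f"host {host!r} is not in shared storage")
        node = self.nodes[node_id]
        if node.state is not NodeState.FREE:
            raise RuntimeError(f"node {node_id!r} is not free")
        node.host = host
        node.state = NodeState.RUNNING

    def report_fault(self, node_id: str) -> None:
        """Stand-in for fault detection on a running node."""
        node = self.nodes[node_id]
        if node.state is NodeState.RUNNING:
            node.state = NodeState.FAULTED

    def monitor_once(self) -> Dict[str, Optional[str]]:
        """One monitoring pass; returns host -> new node id (None if no free node)."""
        moved: Dict[str, Optional[str]] = {}
        for node in list(self.nodes.values()):
            if node.state is NodeState.FAULTED and node.host is not None:
                host = node.host
                moved[host] = self._fail_over(node)
        return moved

    def _fail_over(self, failed: Node) -> Optional[str]:
        host = failed.host
        # Discontinue operation of the faulted node.
        failed.host = None
        failed.state = NodeState.OFFLINE
        # Boot the same host from storage at the first free node, if any.
        for candidate in self.nodes.values():
            if candidate.state is NodeState.FREE:
                self.boot(host, candidate.node_id)
                return candidate.node_id
        return None


manager = Manager([Node("n0"), Node("n1"), Node("n2")], {"host-a": b"image"})
manager.boot("host-a", "n0")
manager.report_fault("n0")
moved = manager.monitor_once()
```

In this sketch the host is identified only by its name in storage, which mirrors the idea that any host is executable at any node; choosing the first free node is an arbitrary policy picked for brevity.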
38 Claims
1. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising:
- a fabric coupling a plurality of nodes in an HPC system to each other, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card;
- storage coupled to the fabric and accessible to each of the nodes, the storage operable to store a plurality of hosts each executable at any of the nodes; and
- a manager coupled to the fabric, the manager operable to monitor a currently running node in the HPC system executing a host and, if a fault occurs at the currently running node, discontinue operation of the currently running node and boot the host at a free node in the HPC system from the storage.
Dependent claims: 2-15
16. A method for fault tolerance and recovery in a high-performance computing (HPC) system, the method comprising:
- monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to a storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card; and
- if a fault occurs at the currently running node:
  - discontinuing operation of the currently running node; and
  - booting a host at a free node in the HPC system from the storage.
Dependent claims: 17-26
27. One or more computer-readable storage media storing logic for fault tolerance and recovery in a high-performance computing (HPC) system, the logic when executed operable to:
- monitor a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to a storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card; and
- if a fault occurs at the currently running node:
  - discontinue operation of the currently running node; and
  - boot a host at a free node in the HPC system from the storage.
Dependent claims: 28-37
38. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising computer-readable storage media comprising:
- means for monitoring a currently running node in an HPC system comprising a plurality of nodes, a fabric coupling the plurality of nodes to each other and coupling the plurality of nodes to storage accessible to each of the plurality of nodes and operable to store a plurality of hosts each executable at any of the plurality of nodes, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card; and
- means for, if a fault occurs at the currently running node:
  - discontinuing operation of the currently running node; and
  - booting the host at a free node in the HPC system from the storage.
Specification