Fault tolerance and recovery in a high-performance computing (HPC) system

  • US 7,475,274 B2
  • Filed: 11/17/2004
  • Issued: 01/06/2009
  • Est. Priority Date: 11/17/2004
  • Status: Active Grant
First Claim

1. A system for fault tolerance and recovery in a high-performance computing (HPC) system, the system for fault tolerance and recovery comprising:

  • a fabric coupling a plurality of nodes in an HPC system to each other, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card;

  • storage coupled to the fabric and accessible to each of the nodes, the storage operable to store a plurality of hosts each executable at any of the nodes; and

  • a manager coupled to the fabric, the manager operable to monitor a currently running node in the HPC system executing a host and, if a fault occurs at the currently running node, discontinue operation of the currently running node and boot the host at a free node in the HPC system from the storage.
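
The recovery behavior recited for the manager can be illustrated with a minimal sketch. It is not the patented implementation: the names `Node`, `Manager`, `check`, and `find_free_node`, the `healthy` flag used as a fault signal, and the dictionary standing in for shared storage are all assumptions made for illustration. The claim does not say how faults are detected, how a free node is chosen, or what happens when no free node exists.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical node record; the claim only requires nodes coupled by a fabric."""
    node_id: str
    healthy: bool = True         # assumed fault indicator, set by some external monitor
    running: bool = True         # False once the manager discontinues the node
    host: Optional[str] = None   # name of the host executing on this node, if any


@dataclass
class Manager:
    """Illustrative manager: monitors a node and reboots its host on a free node."""
    nodes: Dict[str, Node]
    storage: Dict[str, bytes]    # stand-in for shared storage: host name -> boot image

    def find_free_node(self) -> Optional[Node]:
        # A node is treated as free if it is healthy, running, and executes no host.
        for node in self.nodes.values():
            if node.healthy and node.running and node.host is None:
                return node
        return None

    def check(self, node_id: str) -> Optional[str]:
        """Run one monitoring pass; return the replacement node id after a failover."""
        node = self.nodes[node_id]
        if node.healthy or node.host is None:
            return None              # no fault, or nothing to recover
        host = node.host
        if host not in self.storage:
            raise KeyError(f"host {host!r} is not present in shared storage")
        # Discontinue operation of the faulted node.
        node.running = False
        node.host = None
        # Boot the host from shared storage at a free node, if one exists.
        free = self.find_free_node()
        if free is None:
            return None              # assumption: the host stays down until a node frees up
        free.host = host
        return free.node_id


nodes = {"n0": Node("n0", host="hostA"), "n1": Node("n1")}
manager = Manager(nodes, storage={"hostA": b"boot-image"})
nodes["n0"].healthy = False          # simulate a fault at the currently running node
replacement = manager.check("n0")
```

In this sketch the host can move to any free node because its image lives in storage reachable by every node, which mirrors the claim's requirement that each host be "executable at any of the nodes".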
