System and method for detecting and managing HPC node failure

  • US 7,711,977 B2
  • Filed: 04/15/2004
  • Issued: 05/04/2010
  • Est. Priority Date: 04/15/2004
  • Status: Active Grant
First Claim

1. Software encoded in one or more computer-readable tangible media and when executed operable to:

  • determine that one of a plurality of nodes has failed, each node comprising:

    at least two first processors operable to communicate with each other via a direct link between them, the first processors integrated to a first card; and

    a first switch integrated to the first card, the first processors communicably coupled to the first switch, the first switch operable to communicably couple the first processors to at least six second cards each comprising at least two second processors integrated to the second card and a second switch integrated to the second card operable to communicably couple the second processors to the first card and at least five third cards each comprising at least two third processors integrated to the third card and a third switch integrated to the third card;

    the first processors operable to communicate with particular second processors on a particular second card via the first switch and the second switch on the particular second card;

    the first processors operable to communicate with particular third processors on a particular third card via the first switch, a particular second switch on a particular second card between the first card and the particular third card, and the third switch on the particular third card without communicating via either second processor on the particular second card;

    remove the failed node from a virtual list of nodes, the virtual list comprising one logical entry for each of the plurality of nodes;

    determine that at least a portion of a job was being executed on the failed node;

    terminate at least the portion of the job;

    determine that the job was associated with a subset of the plurality of nodes; and

    deallocate the subset of nodes from the job.
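
The failure-handling steps recited above (remove the failed node from the virtual list, find any job running on it, terminate that job, and deallocate the job's subset of nodes) can be illustrated with a minimal sketch. This is not the patented implementation: the names `Job`, `Cluster`, `virtual_list`, `free_nodes`, and `handle_node_failure` are hypothetical, nodes are reduced to string identifiers, and the card and switch topology of the claim is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    nodes: set[str]        # subset of nodes allocated to this job (assumed model)
    running: bool = True


@dataclass
class Cluster:
    virtual_list: list[str]                           # one logical entry per node
    jobs: dict[str, Job] = field(default_factory=dict)
    free_nodes: set[str] = field(default_factory=set)

    def handle_node_failure(self, failed: str) -> list[str]:
        """Apply the claimed steps for one failed node; return terminated job ids."""
        # Remove the failed node from the virtual list of nodes.
        if failed in self.virtual_list:
            self.virtual_list.remove(failed)
        self.free_nodes.discard(failed)

        terminated: list[str] = []
        for job in self.jobs.values():
            # Determine whether a portion of the job was executing on the failed node.
            if job.running and failed in job.nodes:
                # Terminate the job (here just a flag; a real system would signal processes).
                job.running = False
                # Deallocate the job's subset of nodes; only surviving nodes become free.
                self.free_nodes |= job.nodes - {failed}
                job.nodes = set()
                terminated.append(job.job_id)
        return terminated
```

In this sketch the surviving nodes of a terminated job are returned to a free pool, which is one plausible reading of "deallocate"; the claim itself does not say what happens to the nodes afterwards.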
