Method and apparatus for analyzing error conditions in a massively parallel computer system by identifying anomalous nodes within a communicator set

  • US 7,930,595 B2
  • Filed: 06/22/2006
  • Issued: 04/19/2011
  • Est. Priority Date: 06/22/2006
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method for analyzing errors in a parallel computer system, said parallel computer system comprising multiple nodes arranged in a lattice for inter-nodal communications, each node comprising at least one processor for executing a respective application sub-process and a nodal memory, said method comprising the steps of:

  • executing a respective unique application sub-process of a common application in each node of a plurality of said nodes of said parallel computer system to produce respective independent state data in each node of said plurality of said nodes;

  • obtaining said respective independent state data corresponding to each node of said plurality of nodes of said parallel computer system;

  • analyzing said independent state data to identify a first node having anomalous corresponding independent state data with respect to respective independent state data corresponding to a plurality of neighboring nodes of said first node, each said neighboring node being a node adjacent said first node within said lattice, wherein said step of analyzing said independent state data to identify a first node comprises:

      • identifying a first subset of said plurality of nodes, said first subset consisting of all nodes having independent state data which matches according to a pre-defined matching criterion; and

      • using the topology of said lattice to identify said first node as an anomalous neighbor of at least one node of said first subset; and

  • presenting results of said analyzing step to a user.
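
The analyzing step of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a 2D torus lattice, treats exact equality of a per-node state string as the pre-defined matching criterion, and takes the largest group of matching nodes as the first subset; the claim itself fixes none of these choices. The names torus_neighbors, find_anomalous_nodes, and the sample state values are hypothetical.

```python
from collections import Counter
from itertools import product


def torus_neighbors(coord, dims):
    """Return the coordinates adjacent to coord in a torus of the given dims.

    Assumption: the lattice wraps around in every dimension.
    """
    result = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (shifted[axis] + step) % size
            shifted = tuple(shifted)
            if shifted != coord:  # skip self-links in size-1 dimensions
                result.add(shifted)
    return result


def find_anomalous_nodes(state, dims, key=lambda s: s):
    """Identify nodes whose state differs from a matching neighbor group.

    state: dict mapping node coordinate -> independent state data
    key:   matching criterion; states with equal keys are said to match
    """
    # First subset: all nodes whose state matches the most common key
    # (choosing the largest group is an assumption of this sketch).
    counts = Counter(key(s) for s in state.values())
    common_key, _ = counts.most_common(1)[0]
    first_subset = {c for c, s in state.items() if key(s) == common_key}

    # Use the lattice topology: a node outside the first subset that is
    # adjacent to at least one member of it is reported as anomalous.
    anomalous = {}
    for coord in state:
        if coord in first_subset:
            continue
        matching = torus_neighbors(coord, dims) & first_subset
        if matching:
            anomalous[coord] = sorted(matching)
    return first_subset, anomalous


# Hypothetical 4x4 lattice: every node waits in a barrier except one.
dims = (4, 4)
state = {c: "barrier" for c in product(range(4), range(4))}
state[(1, 2)] = "recv_wait"

first_subset, anomalous = find_anomalous_nodes(state, dims)

# Present the results to the user.
for node, neighbors in anomalous.items():
    print(f"node {node} is anomalous relative to neighbors {neighbors}")
```

In this sample, the single node holding a different state is reported together with the adjacent nodes it disagrees with. A different matching criterion, such as comparing only the top stack frame, can be supplied through the key parameter.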
