Method and apparatus for analyzing error conditions in a massively parallel computer system by identifying anomalous nodes within a communicator set
Abstract
An analytical mechanism for a massively parallel computer system automatically analyzes data retrieved from the system, and identifies nodes which exhibit anomalous behavior in comparison to their immediate neighbors. Preferably, anomalous behavior is determined by comparing call-return stack tracebacks for each node, grouping like nodes together, and identifying neighboring nodes which do not themselves belong to the group. A node, not itself in the group, having a large number of neighbors in the group, is a likely locality of error. The analyzer preferably presents this information to the user by sorting the neighbors according to number of adjoining members of the group.
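The grouping and ranking described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the function names `lattice_neighbors` and `rank_anomalous_neighbors`, the representation of a traceback as a tuple of function names, exact equality as the matching criterion, the choice of the largest group as the group of like nodes, and a lattice with wraparound links.

```python
from collections import defaultdict
from itertools import product


def lattice_neighbors(coord, dims):
    """Return the nodes adjacent to `coord` in a lattice of shape `dims`.

    Assumption: links wrap around in each dimension (a torus).
    """
    neighbors = set()
    for axis in range(len(dims)):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            n = tuple(n)
            if n != coord:  # a dimension of size 1 has no distinct neighbor
                neighbors.add(n)
    return neighbors


def rank_anomalous_neighbors(tracebacks, dims):
    """Rank nodes outside the group by how many of their neighbors are in it.

    `tracebacks` maps a node coordinate to its call-return stack traceback,
    given here as a sequence of function names (hypothetical format).
    """
    # Group like nodes: exact traceback equality is the assumed criterion.
    groups = defaultdict(set)
    for node, traceback in tracebacks.items():
        groups[tuple(traceback)].add(node)

    # Assumption: the largest set of matching nodes is the group of interest.
    group = max(groups.values(), key=len)

    # For each node not in the group, count adjoining group members.
    counts = {}
    for node in tracebacks:
        if node in group:
            continue
        adjoining = sum(1 for n in lattice_neighbors(node, dims) if n in group)
        if adjoining > 0:
            counts[node] = adjoining

    # Sort by number of adjoining group members, largest first.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Usage: a 3 x 3 x 3 lattice in which three nodes are not waiting at a barrier.
dims = (3, 3, 3)
tracebacks = {c: ("main", "solve", "MPI_Barrier") for c in product(range(3), repeat=3)}
tracebacks[(1, 1, 1)] = ("main", "solve", "compute_step")
tracebacks[(0, 0, 0)] = ("main", "solve", "exchange_halo")
tracebacks[(0, 0, 1)] = ("main", "solve", "exchange_halo")

for node, adjoining in rank_anomalous_neighbors(tracebacks, dims):
    print(node, adjoining)
```

In this sketch the node surrounded entirely by group members is listed first, which mirrors the abstract's observation that a non-member with many neighbors in the group is a likely locality of error. A practical matching criterion would probably be looser than exact equality, for example ignoring return addresses or trailing frames.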
17 Claims
1. A computer-implemented method for analyzing errors in a parallel computer system, said parallel computer system comprising multiple nodes arranged in a lattice for inter-nodal communications, each node comprising at least one processor for executing a respective application sub-process and a nodal memory, said method comprising the steps of:
executing a respective unique application sub-process of a common application in each node of a plurality of said nodes of said parallel computer system to produce respective independent state data in each node of said plurality of said nodes;
obtaining said respective independent state data corresponding to each node of said plurality of nodes of said parallel computer system;
analyzing said independent state data to identify a first node having anomalous corresponding independent state data with respect to respective independent state data corresponding to a plurality of neighboring nodes of said first node, each said neighboring node being a node adjacent said first node within said lattice, wherein said step of analyzing said independent state data to identify a first node comprises;
identifying a first subset of said plurality of nodes, said first subset consisting of all nodes having independent state data which matches according to a pre-defined matching criterion; and
using the topology of said lattice to identify said first node as an anomalous neighbor of at least one node of said first subset; and
presenting results of said analyzing step to a user.
Dependent Claims: 2, 3, 4, 5, 6, 7
8. A program product for analyzing errors in a parallel computer system, said parallel computer system comprising multiple nodes arranged in a lattice for inter-nodal communications, each node comprising at least one processor for executing a respective application sub-process and a nodal memory, the program product comprising:
a plurality of computer executable instructions recorded on tangible computer-readable storage media, wherein said instructions, when executed by at least one computer system, cause the at least one computer system to perform the steps of;
receiving respective independent state data corresponding to each of a plurality of said nodes, said respective independent state data being produced as a result of executing a respective unique application sub-process of a common application in each node of said plurality of nodes of said parallel computing lattice;
analyzing said independent state data to identify a first node of said plurality of said nodes having anomalous corresponding independent state data with respect to respective independent state data corresponding to a plurality of neighboring nodes of said first node, each said neighboring node being a node adjacent said first node within said lattice, wherein said analyzing said independent state data to identify a first node comprises;
identifying a first subset of said plurality of nodes using said state data, said first subset consisting of all nodes having respective independent state data which matches according to a pre-defined matching criterion; and
using the topology of said lattice to identify said first node as an anomalous neighbor of at least one node of said first subset; and
presenting results of said analyzing step to a user.
Dependent Claims: 9, 10
11. A computer system which analyzes errors in a parallel computing lattice, said lattice comprising a plurality of nodes coupled by inter-nodal communications paths, each node comprising at least one processor for executing a respective application sub-process and a nodal memory, the computer system comprising:
at least one processor;
a memory for storing data addressable by said at least one processor;
an analytical program embodied as computer executable instructions storable in said memory and executable on said at least one processor, said analytical program comprising;
(a) a state data function which receives respective independent state data corresponding to each of a plurality of nodes of said parallel computing lattice, said respective independent state data being produced as a result of executing a respective unique application sub-process of a common application in each node of said plurality of nodes of said parallel computing lattice;
(b) an anomaly detector function which identifies a first node having anomalous corresponding independent state data with respect to respective independent state data corresponding to a plurality of neighboring nodes of said first node, each said neighboring node being a node adjacent said first node within said lattice, wherein said anomaly detector function identifies a first subset of said plurality of nodes, said first subset consisting of all nodes having state data which matches according to a pre-defined matching criterion, and wherein said anomaly detector function uses the topology of said lattice to identify said first node as an anomalous neighbor of at least one node of said first subset; and
(c) an output function which presents results of said state data and anomaly detector functions to a user.
Dependent Claims: 12, 13, 14, 15, 16, 17
Specification