Methods and apparatus using commutative error detection values for fault isolation in multiple node computers
Abstract
Methods and apparatus perform fault isolation in multiple node computing systems using commutative error detection values (for example, checksums) to identify and isolate faulty nodes. When information associated with a reproducible portion of a computer program is injected into a network by a node, a commutative error detection value is calculated. At intervals, node fault detection apparatus associated with the multiple node computer system retrieves the commutative error detection values associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created and stored in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in values indicate a possible faulty node.
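The role of commutativity described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the error detection value is a sum of 32-bit words modulo 2^32; the abstract names checksums only as an example and does not fix a particular function. The names `packet_word_sum`, `InjectionChecksum`, and `run` are hypothetical and do not come from the patent. Because addition is commutative, two runs that inject the same packets in a different order produce the same value, while a corrupted packet changes it.

```python
import random

MASK32 = 0xFFFFFFFF  # assumption: a 32-bit accumulator, chosen for illustration


def packet_word_sum(packet: bytes) -> int:
    """Sum the packet's big-endian 32-bit words modulo 2**32 (zero-padded)."""
    padded = packet + b"\x00" * (-len(packet) % 4)
    words = (int.from_bytes(padded[i:i + 4], "big") for i in range(0, len(padded), 4))
    return sum(words) & MASK32


class InjectionChecksum:
    """Hypothetical stand-in for a node's commutative error detection apparatus."""

    def __init__(self) -> None:
        self.value = 0

    def inject(self, packet: bytes) -> None:
        # Accumulate as each packet is injected; the order of calls does not matter.
        self.value = (self.value + packet_word_sum(packet)) & MASK32


def run(packets, seed: int) -> int:
    """Simulate one execution that injects the packets in a seed-dependent order."""
    order = list(packets)
    random.Random(seed).shuffle(order)
    unit = InjectionChecksum()
    for packet in order:
        unit.inject(packet)
    return unit.value


packets = [b"halo row 0", b"halo row 1", b"reduce partial", b"boundary"]
first = run(packets, seed=1)
second = run(packets, seed=2)  # same data, different injection order
corrupted = run([packets[0], b"halo row 9", *packets[2:]], seed=1)

print(first == second)     # True: the value does not depend on injection order
print(corrupted == first)  # False: the altered packet changes the value
```

A non-commutative value such as a running CRC over the packet stream would differ between the two clean runs whenever the injection order differed, which would make run-to-run comparison unreliable on a network where ordering is not deterministic.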
18 Citations
41 Claims
1. A signal-bearing computer memory medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform node fault detection operations in a computing system using commutative error detection values, where the computing system comprises a plurality of nodes, where each of the nodes comprises at least a node processor, a node memory, a network interface and a commutative error detection apparatus;
- the computing system further comprising a network connecting the plurality of nodes through the network interfaces of the nodes, and wherein node fault detection occurs when the computing system executes at least a portion of an application program at least twice, wherein during each execution of the portion of the application program at least one commutative error detection value is generated and saved to the commutative error detection apparatus associated with at least one node of the plurality when data generated during execution of a reproducible segment of the portion of the application program is injected into the network by the at least one node, the node fault detection operations comprising;
retrieving the at least one commutative error detection value generated during a first execution of the portion of the application program from the commutative error detection apparatus of the at least one node; saving the at least one commutative error detection value associated with the first execution of the portion of the application program to a computer memory medium; retrieving the at least one commutative error detection value generated during a second execution of the portion of the application program from the commutative error detection apparatus of the at least one node; and comparing the at least one commutative error detection value from the first execution of the portion of the application program to the at least one commutative error detection value from the second execution of the portion of the application program. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. A signal-bearing computer memory medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform node fault detection operations in a computing system using commutative error detection values, where the computing system comprises a plurality of nodes, where each of the nodes comprises at least a node processor, a node memory, a network interface and a commutative error detection apparatus;
- the computing system further comprising a network connecting the plurality of nodes through the network interfaces of the nodes, and wherein node fault detection occurs when the computing system executes multiple portions of an application program at least twice, wherein during each execution of the multiple portions of the application program a plurality of commutative error detection values are generated and saved to the respective commutative error detection apparatus associated with the plurality of nodes when data generated during execution of reproducible segments of the multiple portions of the application program is injected into the network by the nodes, the node fault detection operations comprising;
during the first execution and second executions of the multiple portions of the application program, retrieving the commutative error detection values from the commutative error detection apparatus associated with the plurality of nodes; saving the plurality of commutative error detection values associated with at least the first execution of the multiple portions of the application program to a computer memory medium; and comparing on a node-by-node basis the plurality of commutative error detection values associated with the first execution of the multiple portions of the application program to the plurality of commutative error detection values associated with the second execution of the multiple portions of the application program, where at least one difference in commutative error detection values between the first and second executions of the application program indicates a node fault condition. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
24. A node fault detection apparatus for use in a computing system, where the computing system comprises a plurality of nodes, where each of the nodes comprises at least a node processor, a node memory, a network interface and a commutative error detection apparatus;
- and a network connecting the plurality of nodes through the network interfaces of the nodes, the node fault detection apparatus comprising;
at least one node fault detection processor for performing node fault detection operations; a node fault detection computer memory medium for storing commutative error detection values retrieved from the commutative error detection apparatus of the plurality of nodes; and a network interface connecting the node fault detection apparatus to the network of the computing system, where the at least one processor of the node fault detection apparatus performs at least the following node fault detection operations when the computing system executes a portion of an application program at least twice, wherein during each execution of the portion of the application program at least one commutative error detection value is generated and saved to a commutative error detection apparatus associated with at least one node of the plurality when data generated during execution of a reproducible segment of the portion of the application program is injected into the network by the at least one node, the node fault detection operations comprising; retrieving the at least one commutative error detection value created during a first execution of the application program from the commutative error detection apparatus of the at least one node; saving the at least one commutative error detection value from the first execution of the application program to the node fault detection apparatus computer memory medium; retrieving the at least one commutative error detection value created during a second execution of the application program from the commutative error detection apparatus of the at least one node; and comparing the at least one commutative error detection value from the first execution of the application program to the at least one commutative error detection value from the second execution of the application program. - View Dependent Claims (25, 26, 27)
28. A computing system using commutative error detection values for node fault detection, the computing system comprising:
a plurality of nodes, where each of the nodes comprises at least a node processor, a node memory, a network interface and a commutative error detection apparatus; a network connecting the plurality of nodes through the network interfaces of the nodes; a node fault detection apparatus comprising; a processor for performing node fault detection operations; a computer memory medium for storing commutative error detection values; and a network interface connecting the node fault detection apparatus to the network, where the node fault detection apparatus processor performs at least the following node fault detection operations during first and second executions of multiple portions of an application program by the computing system, wherein during the first and second executions of the multiple portions of the application program a plurality of commutative error detection values are saved to the commutative error detection apparatus of the plurality of nodes when the nodes inject information generated during execution of reproducible segments of the multiple portions of the application program into the network, the node fault detection operations comprising; during the first execution and second executions of the multiple portions of the application program, retrieving the commutative error detection values from the commutative error detection apparatus associated with the plurality of nodes; saving the plurality of commutative error detection values associated with at least the first execution of the multiple portions of the application program to the node fault detection computer memory medium; and comparing on a node-by-node basis the plurality of commutative error detection values associated with the first execution of the multiple portions of the application program to the plurality of commutative error detection values associated with the second execution of the multiple portions of the application program, where at least one difference in commutative error detection values between the first and second executions of the application program indicates a node fault condition. - View Dependent Claims (29, 30, 31, 32, 33, 34, 35, 36, 37)
38. A node fault detection apparatus for use in a computing system, where the computing system comprises a plurality of nodes, where each of the nodes comprises at least a node processor, a node memory, a network interface and a commutative error detection apparatus;
- and a network connecting the plurality of nodes through the network interfaces of the nodes, the node fault detection apparatus comprising;
processor means for performing node fault detection operations; memory means for storing commutative error detection values retrieved from the commutative error detection apparatus of the plurality of nodes; and network interface means connecting the node fault detection apparatus to the network of the computing system, where the processor means of the node fault detection apparatus performs at least the following node fault detection operations when the computing system executes a portion of an application program at least twice, wherein during each execution of the portion of the application program at least one commutative error detection value is generated and saved to a commutative error detection apparatus associated with at least one node of the plurality when data generated during execution of a reproducible segment of the portion of the application program is injected into the network by the at least one node, the node fault detection operations comprising; retrieving the at least one commutative error detection value created during a first execution of the application program from the commutative error detection apparatus of the at least one node; saving the at least one commutative error detection value from the first execution of the application program to the memory means; retrieving the at least one commutative error detection value created during a second execution of the application program from the commutative error detection apparatus of the at least one node; and comparing the at least one commutative error detection value from the first execution of the application program to the at least one commutative error detection value from the second execution of the application program. - View Dependent Claims (39)
40. A node fault detection method for identifying faulty nodes in a computing system using commutative error detection values, where the computing system comprises a plurality of nodes, where each of the nodes comprises at least a node processor, a node memory, a network interface and a commutative error detection apparatus;
- and a network connecting the plurality of nodes through the network interfaces of the nodes, and wherein node fault detection occurs when the computing system executes multiple portions of an application program at least twice, wherein during the first and second executions of the multiple portions of the application program a plurality of commutative error detection values are saved to the commutative error detection apparatus of the plurality of nodes when the nodes inject information generated during execution of reproducible segments of the multiple portions of the application program into the network, the method comprising;
during the first execution and second executions of the multiple portions of the application program, retrieving the commutative error detection values from the commutative error detection apparatus associated with the plurality of nodes; saving the plurality of commutative error detection values associated with at least the first execution of the multiple portions of the application program to a computer memory medium; and comparing on a node-by-node basis the plurality of commutative error detection values associated with the first execution of the multiple portions of the application program to the plurality of commutative error detection values associated with the second execution of the multiple portions of the application program, where at least one difference in commutative error detection values between the first and second executions of the application program indicates a node fault condition. - View Dependent Claims (41)
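The node-by-node comparison recited in the method claim can be sketched as follows. This is an illustrative sketch only, under stated assumptions: each run is represented as a mapping from a node identifier to the list of error detection values read from that node at successive retrieval points, and the names `Snapshot` and `find_suspect_nodes` are hypothetical rather than taken from the claims. Reporting the first mismatching retrieval point per node is an added convenience, not something the claim requires.

```python
from typing import Dict, List

# Assumption: node id -> values retrieved from that node at successive points in a run.
Snapshot = Dict[int, List[int]]


def find_suspect_nodes(first_run: Snapshot, second_run: Snapshot) -> Dict[int, int]:
    """Return {node_id: index of the first retrieval point whose values differ}."""
    suspects: Dict[int, int] = {}
    for node_id, first_values in first_run.items():
        second_values = second_run.get(node_id, [])
        # zip stops at the shorter list, so only points present in both runs are compared.
        for point, (a, b) in enumerate(zip(first_values, second_values)):
            if a != b:
                suspects[node_id] = point
                break
    return suspects


# Values saved from the first execution (made-up numbers for illustration).
first_run = {0: [0x1A2B, 0x3C4D], 1: [0x0F0F, 0x7777], 2: [0x1111, 0x2222]}
# Values from the second execution; node 1 disagrees at its second retrieval point.
second_run = {0: [0x1A2B, 0x3C4D], 1: [0x0F0F, 0x7778], 2: [0x1111, 0x2222]}

print(find_suspect_nodes(first_run, second_run))  # {1: 1}
```

An empty result means every compared value matched between the two executions; any entry marks a node whose values differ, which the claim treats as indicating a node fault condition.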
Specification