Coordinated multinode dump collection in response to a fault

US 6,643,802 B1
Filed: 04/27/2000
Issued: 11/04/2003
Est. Priority Date: 04/27/2000
Status: Active Grant

Abstract

A multi-node parallel processing system includes multiple nodes each including the capability to handle faults. When a fault is detected, a fault handling procedure is launched. This may include invoking a dump capture handler to stop execution of all applications as well as control the saving of dump information to slave dump databases. The node in which the fault occurred sends a broadcast message to all the other nodes to indicate the fault condition. In response to the message, each of the other nodes also captures predetermined dump information into slave dump databases. The system is then restarted. After the system has started up again, a master dump save routine is started on a master node. The master dump save routine then launches slave dump save routines in each of the nodes to coordinate the collection of dump information. The master dump save routine can query the information stored in each of the slave dump routines and select a subset of the information desired. In response, the slave dump save routines then communicate the requested dump information to the master node.
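
The sequence in the abstract (local capture on fault, broadcast to the other nodes, then master-coordinated collection after restart) can be sketched as a small in-process simulation. This is only an illustrative sketch, not the patented implementation: the names `Node`, `capture_dump`, `slave_dump_save`, `handle_fault`, and `master_dump_save` are hypothetical, the broadcast is modelled as a plain loop, and the system restart is not simulated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node; `slave_dump_db` stands in for its local slave dump database."""
    node_id: int
    active_processes: list
    slave_dump_db: dict = field(default_factory=dict)

    def capture_dump(self, fault_origin):
        # Save dump information locally, then stop all applications
        # (stopping is modelled here by clearing the process list).
        self.slave_dump_db = {
            "fault_origin": fault_origin,
            "processes": list(self.active_processes),
        }
        self.active_processes = []

    def slave_dump_save(self, wanted_keys):
        # Return only the subset of dump information the master asked for.
        return {k: self.slave_dump_db[k] for k in wanted_keys if k in self.slave_dump_db}


def handle_fault(nodes, faulty_id):
    """The faulting node captures its dump, then notifies every other node."""
    by_id = {n.node_id: n for n in nodes}
    by_id[faulty_id].capture_dump(fault_origin=faulty_id)
    for node in nodes:
        if node.node_id != faulty_id:
            # Each other node reacts to the (simulated) broadcast message.
            node.capture_dump(fault_origin=faulty_id)


def master_dump_save(nodes, wanted_keys):
    """Run on the master node after restart: gather the selected subset from every node."""
    return {node.node_id: node.slave_dump_save(wanted_keys) for node in nodes}


nodes = [Node(0, ["dbms"]), Node(1, ["dbms", "loader"]), Node(2, ["dbms"])]
handle_fault(nodes, faulty_id=1)
master_db = master_dump_save(nodes, wanted_keys=["processes"])
print(master_db)
```

In this sketch the master requests only the `processes` key, mirroring the abstract's point that the master dump save routine can select a subset of the stored dump information rather than transferring everything.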

39 Claims

1. A method of handling faults in a system having plural nodes including first and second nodes, comprising:
- detecting a fault condition in the first node;
- capturing predetermined information of each node; and
- starting a routine in the second node to coordinate the saving of the predetermined information to a common database.
2. The method of claim 1, wherein capturing the predetermined information includes storing information relating to one or more processes active in each node when the fault condition occurred.
3. The method of claim 1, wherein the nodes are coupled by an interconnect network, the method further comprising sending one or more messages to indicate the fault condition to the nodes.
4. The method of claim 3, wherein sending the one or more messages is performed by the first node to other nodes, the other nodes comprising the second node and at least one other node.
5. The method of claim 4, wherein capturing the predetermined condition comprises capturing dump information, the method further comprising:
6. The method of claim 5, further comprising:
- launching, in at least each of the nodes other than the second node, a handler to collect the stored dump information in the corresponding node for storage into the common database.
7. The method of claim 6, further comprising transmitting the collected dump information to the second node, wherein the common database is stored by the second node.
8. The method of claim 1, wherein capturing the predetermined information includes storing information relating to one or more threads active in the corresponding node when the fault condition occurred.
9. The method of claim 8, further comprising running a WINDOWS® operating system in each of the nodes.
10. The method of claim 1, wherein capturing the predetermined information includes storing information relating to software routines active in the corresponding node when the fault condition occurred.
11. The method of claim 1, wherein the common database is contained in the second node, the method further comprising the routine receiving the predetermined information from each of the nodes.
12. The method of claim 11, wherein receiving the predetermined information includes receiving information relating to processes and threads.
13. The method of claim 12, further comprising running a WINDOWS® operating system in each of the nodes.
14. The method of claim 1, further comprising the second node checking that each of the other nodes has completed capturing the information.
15. The method of claim 1, wherein detecting the fault condition in the first node comprises detecting the fault condition by a software routine in the first node.
16. The method of claim 1, wherein the system comprises a database system, and wherein each of the plural nodes contains a database management application.

17. An article including one or more machine-readable storage media containing instructions for handling faults in a system having a plurality of nodes including first and second nodes, the instructions when executed causing the system to:
- detect, by the first node, an occurrence of a fault in the first node;
- collect predetermined information associated with each node in response to occurrence of the fault; and
- communicate the collected predetermined information to the second node for storage in the second node.
18. The article of claim 17, wherein detecting the occurrence of the fault in the first node is performed by a software routine in the first node.
19. The article of claim 17, wherein the system comprises a database system, and wherein the instructions when executed cause the system to execute a database management application in each node.
20. The article of claim 17, wherein the instructions when executed cause the system to:
21. The article of claim 20, wherein the instructions when executed cause the first node to determine that each of the nodes has completed collecting the predetermined information.
22. The article of claim 21, wherein the instructions when executed cause a restart of the system.
23. The article of claim 22, wherein the second node is designated as the master node, and wherein the instructions when executed cause the master node to invoke a master routine.
24. The article of claim 23, wherein the instructions when executed cause the master routine to invoke a slave routine in each of the other nodes.
25. The article of claim 24, wherein the instructions when executed cause the slave routines to communicate the collected predetermined information to the master routine.
26. The article of claim 20, wherein the instructions when executed cause the system to:
- start, in response to an indication from the second node, a routine in each of the nodes to collect the predetermined information.

27. A system comprising a plurality of nodes including a master node and a first node,
- the first node to detect a fault condition in the first node and to store predetermined information in response to the fault condition;
- nodes other than the first node to also store predetermined information in response to the fault condition, the nodes other than the first node including the master node; and
- the master node comprising: a storage to store a database, a handler to issue a request to capture the predetermined information stored by the nodes.
28. The system of claim 27, the first node to send one or more messages to the other nodes in response to the fault condition, the other nodes to store respective predetermined information in response to the one or more messages.
29. The system of claim 27, the master node to cause routines to launch in each of the nodes to collect respective predetermined information.
30. The system of claim 29, wherein the nodes contain routines to communicate the predetermined information to the master node, and wherein each of the routines is a slave routine, the master node further comprising a master routine capable of controlling tasks performed by each slave routine.
31. The system of claim 29, wherein the database to store collected predetermined information received by the master routine from each of the slave routines.
32. The system of claim 31, wherein the master routine receives an indication that slave routines have completed collecting the information.
33. The system of claim 27, wherein the predetermined information comprises dump information.
34. The system of claim 27, wherein the database to store predetermined information received from each of the nodes.
35. The system of claim 27, wherein the predetermined information includes a process context containing information relating to processes.
36. The system of claim 35, wherein the predetermined information includes a thread context containing information relating to threads.
37. The system of claim 36, wherein the predetermined information includes a node context containing pointers to processes in the node.
38. The system of claim 37, wherein the predetermined information includes application-related files.
39. The system of claim 27, wherein each node includes a WINDOWS® operating system.
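
Claims 35 to 38 name the parts of the predetermined information: a process context, a thread context, a node context holding pointers to processes, and application-related files. One possible way to model that layering is sketched below. Every class and field name here is an assumption made for illustration, and nesting threads under their owning process is a modelling choice that the claims do not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreadContext:
    """Information relating to one thread (fields are hypothetical)."""
    thread_id: int
    state: str


@dataclass
class ProcessContext:
    """Information relating to one process; threads nested here by assumption."""
    process_id: int
    name: str
    threads: List[ThreadContext] = field(default_factory=list)


@dataclass
class NodeContext:
    """Per-node entry point holding references to the node's processes."""
    node_id: int
    processes: List[ProcessContext] = field(default_factory=list)


@dataclass
class DumpRecord:
    """Everything captured for one node: contexts plus application-related files."""
    node: NodeContext
    application_files: List[str] = field(default_factory=list)

    def thread_count(self) -> int:
        # Walk node -> processes -> threads, following the references.
        return sum(len(p.threads) for p in self.node.processes)


record = DumpRecord(
    node=NodeContext(
        node_id=3,
        processes=[
            ProcessContext(101, "dbms", [ThreadContext(1, "running"), ThreadContext(2, "blocked")]),
        ],
    ),
    application_files=["app.log"],
)
print(record.thread_count())
```

Starting from the node context and following references down to processes and threads gives a collector a single root per node, which is one plausible reading of why the node context is described as containing pointers to processes.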

Current Assignee: Teradata US, Inc. (Teradata Corporation)
Original Assignee: NCR Corporation
Inventors: Lewis, Donald J.; Hsieh, Carl Chih-Fen; Cochran, Nancy J.; Geisert, Mark A.; Frost, Bruce J.; Calkins, Dennis R.
Primary Examiner(s): Beausoliel, Robert
Assistant Examiner(s): CHU, GABRIEL L
Application Number: US09/558,984
Time in Patent Office: 1,286 Days
Field of Search: 714/31, 714/45, 714/48, 714/57, 714/25, 714/37
US Class Current: 714/37
CPC Class Codes:
G06F 11/0712   in a virtual computing plat...
G06F 11/0724   in a multiprocessor or a mu...
G06F 11/0778   Dumping, i.e. gathering err...
