Fault detection, diagnosis, and prevention for complex computing systems

US 8,949,671 B2
Filed: 01/30/2008
Issued: 02/03/2015
Est. Priority Date: 01/30/2008
Status: Expired due to Fees


Abstract

A method is provided for diagnosing failures in an object-oriented software system. The method comprises collecting runtime diagnostic information; maintaining a record of the diagnostic information in a storage buffer; and dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval. The diagnostic information includes at least one set of call stack information for at least one currently running application and at least one set of other information. Each of the at least one set of other information is selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process.
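The rolling record described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `DiagnosticEntry`, `RollingRecord`, and `collect` are hypothetical, the predetermined interval is assumed to be a time span in seconds, and the "other information" is represented as a plain dictionary.

```python
from collections import deque
from dataclasses import dataclass, field
import traceback


@dataclass
class DiagnosticEntry:
    timestamp: float   # seconds, supplied by the caller (assumption)
    call_stack: list   # formatted stack frames of the running application
    other: dict        # e.g. memory access, data access, or paging information


@dataclass
class RollingRecord:
    interval: float                                # assumed "predetermined interval", in seconds
    entries: deque = field(default_factory=deque)  # the storage buffer

    def collect(self, timestamp, other):
        """Append a new entry, then drop entries older than the interval."""
        stack = traceback.format_stack()
        self.entries.append(DiagnosticEntry(timestamp, stack, dict(other)))
        while self.entries and timestamp - self.entries[0].timestamp > self.interval:
            self.entries.popleft()


record = RollingRecord(interval=10.0)
for t in (0.0, 4.0, 9.0, 15.0):
    record.collect(t, {"page_faults": int(t)})

# Only entries from the most recent 10-second interval remain in the record.
kept = [entry.timestamp for entry in record.entries]
```

Timestamps are passed in explicitly so the pruning behaviour is easy to follow; a real collector would more likely read a monotonic clock.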

28 Claims

1. A method for diagnosing failures in an object-oriented software system, the method comprising:
- continually collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one other set of information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
  
  maintaining a record of the diagnostic information in a storage buffer including snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
  
  identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected;
  
  generating a failure model classified by type and category of the stress conditions from the diagnostic information;
  
  localizing one or more failure conditions within the failure model using a multivariate normal distribution;
  
  dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;
  
  dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
  
  dynamically evaluating the collected diagnostic information to diagnose causes of failure; and
  
  providing preventative information during run time and changing operation based on the preventative information to avoid future failures.
  - 2. The method of claim 1, further comprising monitoring the software system to detect occurrences of failures related to the runtime interactions between the set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of the failure that is detected, the failures related to the runtime interactions between the set of objects including one or more of data access violations, memory existing in an inconsistent state, and sudden impact from large resource usages.
  - 3. The method of claim 2, further comprising monitoring a set of resources being utilized by the software system, and recording the snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.
  - 4. The method of claim 3, further comprising monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any of the set of resources is being utilized beyond a specified threshold for that resource.
  - 5. The method of claim 4, wherein the method is performed by one or more separate processes executing within an address space of the software system, one or more threads of a process executing within the software system, one or more daemons executing within the software system, one or more services executing outside the address space of the software system, or combinations thereof.
  - 6. The method of claim 1, wherein the diagnostic information is generated by one or more processes executing within the software system.
  - 7. The method of claim 1, wherein the diagnostic information is generated by one or more threads of one or more processes executing within the software system.
  - 8. The method of claim 1, wherein the software system is communicatively coupled to a data repository, and wherein the storage buffer is maintained in the data repository, such that in a state in which the storage buffer is full, the maintaining the record of the diagnostic information includes recording current diagnostic information over oldest diagnostic information existing in the storage buffer for the record.
  - 9. The method of claim 1, wherein the record of the diagnostic information is maintained in a log trace file.
  - 10. The method of claim 1, wherein the predetermined interval is specified according to a maximum size for the storage buffer, a maximum number of call stack changes, or a maximum period of time.
  - 11. The method of claim 2, further comprising allowing an analyst to access the record of the diagnostic information to attempt identify and categorize the cause of each occurrence of a failure that is detected.
  - 12. The method of claim 11, further comprising sorting and extracting information from the record of diagnostic information that is relevant to the each occurrence of a failure that is detected.
  - 13. The method of claim 3, wherein the software system is communicatively coupled to a data repository, and wherein each snapshot of the record of diagnostic information that is recorded is maintained in a record of snapshots in the data repository.
  - 14. The method of claim 13, further comprising utilizing the record of snapshots to predict occurrences of failures related to runtime interactions between a set of objects in the software system.
  - 15. The method of claim 14, wherein utilizing the record of snapshots to predict occurrences of failures comprises performing regular regression testing.
  - 16. The method of claim 14, wherein utilizing the record of snapshots to predict occurrences of failures comprises creating a failure model of data clusters for utilization conditions of the set of resources.
  - 17. The method of claim 1, wherein the software system is an application selected from operating system applications, database management applications, server-side software applications, web-based applications, and client-side software applications.
  - 18. The method of claim 1, wherein the software system is executing on a single processor, several processors in close proximity, or distributed across a network.
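Claims 3 and 8 together describe a bounded buffer that overwrites its oldest entries and a snapshot taken whenever a monitored resource passes its threshold. A rough sketch of that combination is shown below. The class `SnapshotRecorder`, the metric names, and every threshold value are assumptions made for illustration; the claims name the kinds of resources but give no numbers.

```python
from collections import deque

# Hypothetical thresholds: the claims list resource kinds but no values.
THRESHOLDS = {"cpu_pct": 90.0, "memory_pct": 85.0, "wait_list_len": 50}


class SnapshotRecorder:
    def __init__(self, capacity):
        # deque(maxlen=...) discards the oldest entry once the buffer is full,
        # which mirrors "recording current ... over oldest" in claim 8.
        self.buffer = deque(maxlen=capacity)
        self.snapshots = []

    def observe(self, sample):
        """Record a sample; snapshot the buffer if any resource is over threshold."""
        self.buffer.append(sample)
        exceeded = sorted(
            name for name, limit in THRESHOLDS.items() if sample.get(name, 0) > limit
        )
        if exceeded:
            self.snapshots.append({"exceeded": exceeded, "record": list(self.buffer)})
        return exceeded


recorder = SnapshotRecorder(capacity=3)
samples = [
    {"cpu_pct": 20.0, "memory_pct": 40.0, "wait_list_len": 1},
    {"cpu_pct": 35.0, "memory_pct": 45.0, "wait_list_len": 2},
    {"cpu_pct": 50.0, "memory_pct": 60.0, "wait_list_len": 5},
    {"cpu_pct": 95.0, "memory_pct": 70.0, "wait_list_len": 60},
]
for sample in samples:
    recorder.observe(sample)
```

After the fourth sample, one snapshot exists. It holds the three most recent samples, because the first was overwritten when the buffer filled, and it names `cpu_pct` and `wait_list_len` as the resources that crossed their limits.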

19. A system for diagnosing failures in an object-oriented software system, the system comprising:
- a first software module, executed by a processor, to collect runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application in the software system, and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
  
  a storage buffer configured to maintain a record of the diagnostic information, the storage buffer being configured to automatically receive the diagnostic information collected by the first software module and dynamically update the record to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval wherein the storage buffer also receives and collects snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
  
  a failure diagnosis component for evaluating the diagnostic information record to determine diagnose causes of the failure by type; and
  
  a failure prediction component having a prevention component configured to generate a failure model classified by type and category of the stress conditions from the diagnostic information, configured to;
  
  identify and categorize the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected;
  
  localize one or more failure conditions within the failure model using a multivariate normal distribution; and
  
  update the failure model responsive to changes to configuration settings;
  
  wherein the record of the diagnostic information is used to reproduce the failure for diagnostics.
  - 20. The diagnostic system of claim 19, further comprising a second software module configured to be notified by the software system of the each occurrence of a failure related to the runtime interactions between the set of objects in the software system, the second software module being configured to access the storage buffer to evaluate the record of the diagnostic information to attempt to localize the cause of each occurrence of a failure for which the second software module receives notification.
  - 21. The diagnostic system of claim 20, further comprising a third software module configured to monitor a set of resources being utilized by the software system, the third software module being further configured to record a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.
  - 22. The diagnostic system of claim 21, wherein the first, second, and third software modules are implemented within one or more libraries of functions, one or more plug-in modules, one or more dynamic link-libraries, or combinations thereof.
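The independent claims recite localizing failure conditions "using a multivariate normal distribution" but do not say how the distribution is applied. One possible reading, offered only as an assumption, is to fit a normal distribution to the resource readings in snapshots of one failure category and then score new readings by their squared Mahalanobis distance from that cluster. The sketch below uses a two-variable case so the covariance inverse can be written in closed form; the function names and all data points are invented for illustration.

```python
import math


def fit_bivariate_normal(points):
    """Return the sample mean and covariance terms (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 matrix inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Invented (cpu_pct, memory_pct) readings from snapshots of one failure category.
failure_cluster = [(92, 88), (95, 90), (90, 85), (97, 93), (93, 87)]
mean, cov = fit_bivariate_normal(failure_cluster)

# With 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
# so the 99% cutoff has the closed form -2 * ln(0.01).
cutoff = -2 * math.log(0.01)

near = mahalanobis_sq((94, 89), mean, cov)   # reading close to the cluster
far = mahalanobis_sq((30, 40), mean, cov)    # reading typical of light load
matches_failure_condition = [d <= cutoff for d in (near, far)]
```

A reading whose distance falls under the cutoff is treated as belonging to the modeled failure condition. With more than two resource parameters the same idea applies, but the covariance matrix would need a general inverse from a linear algebra library.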

23. A computer having a non transitory machine usable medium including computer readable instructions stored thereon for execution by a processor to perform a method for diagnosing failures in an object-oriented software system, the method comprising:
- collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
  
  maintaining a record of the diagnostic information in a storage buffer including a snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
  
  identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected;
  
  generating a failure model classified by type and category of the stress conditions from the diagnostic information;
  
  localizing one or more failure conditions within the failure model using a multivariate normal distribution;
  
  dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;
  
  dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
  
  evaluating the collected diagnostic information to diagnose causes of failure; and
  
  providing preventative information based on the evaluation to prevent future failures.
  - 24. The computer-usable medium of claim 23, wherein the method further comprises monitoring the software system to detect occurrences of failures related to runtime interactions between a set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of the failure that is detected.
  - 25. The computer-usable medium of claim 23, wherein the method further comprises monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.

26. A data processing system comprising:
- a central processing unit;
  
  a random access memory for storing data and programs for execution by the central processing unit;
  
  a first storage level comprising a nonvolatile storage device; and
  
  computer readable instructions stored in the random access memory for execution by central processing unit to perform a method for diagnosing failures in an object-oriented software system, the method comprising;
  
  collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
  
  maintaining a record of the diagnostic information in a storage buffer including a snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
  
  identifying and categorizing runtime interactions between a set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected;
  
  generating a failure model classified by type and category of stress conditions from the diagnostic information;
  
  localizing one or more failure conditions within the failure model using a multivariate normal distribution;
  
  dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval; and
  
  dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
  
  evaluating the collected diagnostic information to diagnose causes of failure; and
  
  providing preventative information based on the evaluation to prevent future failures.
  - 27. The data processing system of claim 26, wherein the method further comprises monitoring the software system to detect occurrences of failures related to runtime interactions between a set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of a failure that is detected.
  - 28. The data processing system of claim 26, wherein the method further comprises monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Mukherjee, Maharaj
Primary Examiner(s): TRUONG, LOAN
Application Number: US12/022,453
Publication Number: US 20090193298A1
Time in Patent Office: 2,561 Days
Field of Search: 714/38, 714/47, 714/26, 714/38.1, 714/47.1
US Class Current: 714/38.1
CPC Class Codes:
G06F 11/0718 in an object-oriented system
G06F 11/0766 Error or fault reporting or...
