Fault detection, diagnosis, and prevention for complex computing systems
Abstract
A method is provided for diagnosing failures in an object-oriented software system. The method comprises collecting runtime diagnostic information; maintaining a record of the diagnostic information in a storage buffer; and dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval. The diagnostic information includes at least one set of call stack information for at least one currently running application and at least one set of other information. Each of the at least one set of other information is selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process.
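The rolling record described in the abstract can be illustrated with a minimal sketch. It assumes that "a most recent occurrence of a predetermined interval" means a sliding time window, which is one possible reading. The names `Sample`, `RollingRecord`, and `interval_seconds` are hypothetical and do not come from the patent.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    """One unit of diagnostic information (hypothetical layout)."""
    timestamp: float      # seconds since collection started
    call_stack: list      # call stack of a running application
    paging_faults: int    # stand-in for the "other information" set


class RollingRecord:
    """Storage buffer that keeps only the most recent interval of samples."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self._samples = deque()

    def add(self, sample):
        # Append the new sample, then drop everything older than the window.
        self._samples.append(sample)
        cutoff = sample.timestamp - self.interval
        while self._samples and self._samples[0].timestamp < cutoff:
            self._samples.popleft()

    def snapshot(self):
        # Copy of the current window, e.g. to preserve it when a failure occurs.
        return list(self._samples)


record = RollingRecord(interval_seconds=10.0)
for t in range(0, 30, 5):
    record.add(Sample(float(t), ["main", "handle_request"], paging_faults=t))

kept = [s.timestamp for s in record.snapshot()]
print(kept)  # [15.0, 20.0, 25.0]
```

A deque is used here because samples arrive in time order, so expiry only ever removes items from the front.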
28 Claims
1. A method for diagnosing failures in an object-oriented software system, the method comprising:
continually collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one other set of information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
maintaining a record of the diagnostic information in a storage buffer including snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected;
generating a failure model classified by type and category of the stress conditions from the diagnostic information;
localizing one or more failure conditions within the failure model using a multivariate normal distribution;
dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;
dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
dynamically evaluating the collected diagnostic information to diagnose causes of failure; and
providing preventative information during run time and changing operation based on the preventative information to avoid future failures.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
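Claim 1 recites localizing failure conditions "using a multivariate normal distribution" without stating how. One common way to apply such a distribution, offered here only as an illustrative sketch and not as the patented method, is to fit a mean and covariance to readings taken under normal load and flag readings whose squared Mahalanobis distance is improbably large. The two variables (CPU and memory utilization in percent), the baseline values, and the 1% tail cutoff are all assumptions made for the example.

```python
import math


def fit_bivariate(samples):
    """Estimate mean and sample covariance of 2-D readings (x, y)."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is not positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical baseline: (CPU %, memory %) observed under normal load.
baseline = [(30, 40), (35, 42), (32, 45), (28, 38), (34, 44), (31, 41)]
mean, cov = fit_bivariate(baseline)

# For two variables the squared distance follows a chi-square distribution
# with 2 degrees of freedom, whose upper 1% point is -2 * ln(0.01).
cutoff = -2 * math.log(0.01)

normal_reading = (32, 42)
stressed_reading = (95, 90)
normal_is_anomalous = mahalanobis_sq(normal_reading, mean, cov) > cutoff
stressed_is_anomalous = mahalanobis_sq(stressed_reading, mean, cov) > cutoff
print(normal_is_anomalous, stressed_is_anomalous)  # False True
```

A real failure model would cover more than two resource parameters and would need a general matrix inverse. The two-variable case is used only to keep the arithmetic visible.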
19. A system for diagnosing failures in an object-oriented software system, the system comprising:
a first software module, executed by a processor, to collect runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application in the software system, and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
a storage buffer configured to maintain a record of the diagnostic information, the storage buffer being configured to automatically receive the diagnostic information collected by the first software module and dynamically update the record to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval, wherein the storage buffer also receives and collects snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
a failure diagnosis component for evaluating the diagnostic information record to diagnose causes of the failure by type; and
a failure prediction component having a prevention component configured to generate a failure model classified by type and category of the stress conditions from the diagnostic information, configured to:
identify and categorize the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected;
localize one or more failure conditions within the failure model using a multivariate normal distribution; and
update the failure model responsive to changes to configuration settings;
wherein the record of the diagnostic information is used to reproduce the failure for diagnostics.
Dependent claims: 20, 21, 22
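The threshold-triggered snap shots recited in claim 19 can be sketched as follows. This is a simplified illustration under stated assumptions: the parameter names, the threshold values, and the `SnapshotBuffer` class are hypothetical, and a snap shot is taken whenever any single parameter exceeds its limit, which is only one possible reading of the claim.

```python
import copy

# Hypothetical limits for the resource parameters listed in the claim.
THRESHOLDS = {
    "cpu_utilization": 0.90,        # fraction of CPU time in use
    "memory_utilization": 0.85,     # fraction of logical/physical memory
    "page_file_usage": 0.80,        # fraction of page file in use
    "disk_io_utilization": 0.90,    # fraction of disk bandwidth in use
    "thread_count": 500,            # concurrently executing threads
    "wait_list_length": 50,         # pending data access requests
    "network_throughput": 900.0,    # Mbit/s
}


def exceeded(params, thresholds):
    """Return the sorted names of parameters above their threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if params.get(name, 0) > limit
    )


class SnapshotBuffer:
    """Stores a snap shot of the diagnostics whenever a threshold is crossed."""

    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.snapshots = []

    def observe(self, params, diagnostics):
        over = exceeded(params, self.thresholds)
        if over:
            # Deep copy so later changes to the live data do not alter the snap shot.
            self.snapshots.append({
                "exceeded": over,
                "params": dict(params),
                "diagnostics": copy.deepcopy(diagnostics),
            })
        return over


buffer = SnapshotBuffer(THRESHOLDS)
buffer.observe({"cpu_utilization": 0.40, "wait_list_length": 3},
               {"call_stack": ["main", "idle"]})
buffer.observe({"cpu_utilization": 0.97, "wait_list_length": 120},
               {"call_stack": ["main", "Order.save", "Db.lock"]})

print(len(buffer.snapshots))            # 1
print(buffer.snapshots[0]["exceeded"])  # ['cpu_utilization', 'wait_list_length']
```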
23. A computer having a non-transitory machine usable medium including computer readable instructions stored thereon for execution by a processor to perform a method for diagnosing failures in an object-oriented software system, the method comprising:
collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
maintaining a record of the diagnostic information in a storage buffer including snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected;
generating a failure model classified by type and category of the stress conditions from the diagnostic information;
localizing one or more failure conditions within the failure model using a multivariate normal distribution;
dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;
dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
evaluating the collected diagnostic information to diagnose causes of failure; and
providing preventative information based on the evaluation to prevent future failures.
Dependent claims: 24, 25
26. A data processing system comprising:
a central processing unit;
a random access memory for storing data and programs for execution by the central processing unit;
a first storage level comprising a nonvolatile storage device; and
computer readable instructions stored in the random access memory for execution by the central processing unit to perform a method for diagnosing failures in an object-oriented software system, the method comprising:
collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
maintaining a record of the diagnostic information in a storage buffer including snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
identifying and categorizing runtime interactions between a set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected;
generating a failure model classified by type and category of stress conditions from the diagnostic information;
localizing one or more failure conditions within the failure model using a multivariate normal distribution;
dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;
dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
evaluating the collected diagnostic information to diagnose causes of failure; and
providing preventative information based on the evaluation to prevent future failures.
Dependent claims: 27, 28
Specification