Fault detection, diagnosis, and prevention for complex computing systems

  • US 8,949,671 B2
  • Filed: 01/30/2008
  • Issued: 02/03/2015
  • Est. Priority Date: 01/30/2008
  • Status: Expired due to Fees
First Claim

1. A method for diagnosing failures in an object-oriented software system, the method comprising:

    continually collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one other set of information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;

    maintaining a record of the diagnostic information in a storage buffer including snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;

    identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected;

    generating a failure model classified by type and category of the stress conditions from the diagnostic information;

    localizing one or more failure conditions within the failure model using a multivariate normal distribution;

    dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;

    dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;

    dynamically evaluating the collected diagnostic information to diagnose causes of failure; and

    providing preventative information during run time and changing operation based on the preventative information to avoid future failures.
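
The record-keeping steps of the claim (a storage buffer that retains the most recent interval of diagnostic information and records a snap shot when a resource parameter exceeds a predetermined threshold) might be sketched as follows. This is an illustrative sketch only, not the patented implementation: the class name `DiagnosticRecorder`, the metric keys, and the threshold values are all assumptions chosen for the example.

```python
from collections import deque

# Hypothetical thresholds; the claim only says "predetermined threshold".
THRESHOLDS = {"cpu_pct": 90.0, "mem_pct": 85.0, "thread_count": 500.0}


class DiagnosticRecorder:
    """Keeps the most recent samples and snapshots them on threshold breach."""

    def __init__(self, capacity=100):
        # A bounded deque drops the oldest sample automatically, so the
        # record always covers only the most recent interval.
        self.samples = deque(maxlen=capacity)
        self.snapshots = []

    def record(self, sample):
        """Store one sample; return the names of any exceeded parameters."""
        self.samples.append(sample)
        exceeded = sorted(
            name
            for name, limit in THRESHOLDS.items()
            if sample.get(name, 0.0) > limit
        )
        if exceeded:
            # Copy the current window so later samples cannot alter the
            # evidence captured at the moment of the breach.
            self.snapshots.append(
                {"exceeded": exceeded, "window": list(self.samples)}
            )
        return exceeded


recorder = DiagnosticRecorder(capacity=3)
for cpu in (20.0, 35.0, 50.0, 97.0):
    recorder.record({"cpu_pct": cpu, "mem_pct": 40.0, "thread_count": 12.0})
```

With a capacity of three, the snapshot taken on the fourth sample contains only the three most recent samples, which is the context that would be replayed to reproduce the failure.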

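The claim localizes failure conditions "using a multivariate normal distribution" without specifying how. One common way to apply such a distribution, assumed here purely for illustration, is to fit a mean and covariance to baseline resource measurements and score new observations by squared Mahalanobis distance. The sketch below is limited to two parameters so it needs no external library; the function names, the baseline values, and the 99% cutoff are hypothetical.

```python
from statistics import mean

# For a 2-D normal, squared Mahalanobis distance follows a chi-square
# distribution with 2 degrees of freedom; its 0.99 quantile is
# -2 * ln(0.01), roughly 9.21. The cutoff itself is an assumed choice.
CUTOFF_99 = 9.21


def fit_2d(points):
    """Return the sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mu, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def is_anomalous(point, mu, cov, cutoff=CUTOFF_99):
    """Flag observations that are unlikely under the fitted distribution."""
    return mahalanobis_sq(point, mu, cov) > cutoff


# Hypothetical baseline of (CPU %, memory %) pairs under normal load.
baseline = [(40, 50), (42, 52), (38, 49), (41, 51), (39, 48), (43, 53)]
mu, cov = fit_2d(baseline)
```

Because the covariance captures how the two parameters move together, an observation such as (44, 47) can be flagged even though each value is close to its usual range, which per-parameter thresholds alone would miss.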