Method and apparatus for monitoring the health of a computer system

Abstract
A system that monitors the health of a computer system is presented. During operation, the system receives a first-difference function for the variance of a time series for a monitored telemetry variable within the computer system. The system then determines whether the first-difference function indicates that the computer system is at the onset of degradation. If so, the system performs a remedial action.
21 Claims
1. A method for monitoring the health of a computer system, comprising:
receiving a first-difference function for the variance of a time series for a monitored telemetry variable within the computer system;
determining whether the first-difference function indicates that the computer system is at the onset of degradation; and
if so, performing a remedial action.
11. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for monitoring the health of a computer system, wherein the method comprises:
receiving a first-difference function for the variance of a time series for a monitored telemetry variable within the computer system;
determining whether the first-difference function indicates that the computer system is at the onset of degradation; and
if so, performing a remedial action.
21. An apparatus that monitors the health of a computer system, comprising:
a receiving mechanism configured to receive a first-difference function for the variance of a time series for a monitored telemetry variable within the computer system;
a degradation-detection mechanism configured to determine whether the first-difference function indicates that the computer system is at the onset of degradation; and
a remedial-action mechanism, wherein if the degradation-detection mechanism determines that the computer system is at the onset of degradation, the remedial-action mechanism is configured to perform a remedial action.
1 Specification
[0001]The subject matter of this application is related to the subject matter in a co-pending non-provisional application by Dan Vacar, David K. McElfresh, Kenny C. Gross, and Leoncio D. Lopez entitled, "Characterizing Degradation of Components During Reliability-Evaluation Studies," having Ser. No. 11/452,632, and filing date 13 Jun. 2006 (Attorney Docket No. SUN060365).
[0002]1. Field of the Invention
[0003]The present invention relates to techniques for monitoring the health of a computer system. More specifically, the present invention relates to a method and apparatus for determining whether a computer system is at the onset of degradation by monitoring a difference function for the variance of a monitored telemetry variable.
[0004]2. Related Art
[0005]An increasing number of businesses are using computer systems for mission-critical applications. In such applications, a component failure can have a devastating effect on the business. For example, the airline industry is critically dependent on computer systems that manage flight reservations, and would essentially cease to function if these systems failed. Hence, it is critically important to monitor the health of components within the computer system so that remedial actions can be performed on components that are at the onset of degradation.
[0006]One technique for monitoring the health of components within the computer system is to monitor telemetry variables generated within the computer system. These telemetry variables can include physical signals generated by transducers, such as temperature, voltage, current, and vibration, and can include software signals monitored by an operating system, such as hard disk activity, central processing unit (CPU) load, and memory usage. Existing health-monitoring techniques detect changes in the mean value of the monitored telemetry variables, or changes in the patterns of correlation among dynamically [...] a first-difference function for the variance of a time series for a monitored telemetry variable within the computer system. The system then determines whether the first-difference function indicates that the computer system is at the onset of degradation. If so, the system performs a remedial action.
[0007]In one embodiment, prior to receiving the first-difference function, the system receives the variance of the time series for the monitored telemetry variable. Next, the system calculates a residual function of the variance of the time series for the monitored telemetry variable. The system then calculates the first-difference function from the residual function of the variance.
[0008]In one embodiment, prior to receiving the variance for the time series for the monitored telemetry variable, the system receives the time series for the monitored telemetry variable. The system then calculates the variance of the time series for the monitored telemetry variable.
[0009]In one embodiment, while calculating the first-difference function of the time series, for each time point within the time series, the system subtracts a value of the time series at a previous time point from the value of the time series at a present time point.
[0010]In one embodiment, the system divides the result of the subtraction by the value of a length of a time interval between the previous time point and the present time point.
[0011]In one embodiment, while calculating the residual function for a time series, for each time interval in the time series, the system (1) calculates a running average of values for the time series up to and including a present time interval; and (2) subtracts the running average from a value of the time series at the present time interval.
[0012]In one embodiment, while determining whether the first-difference function indicates that the computer system is at the onset of degradation, the system determines whether the first-difference function exceeds a specified threshold.
[0013]In one embodiment, while determining whether the first-difference function indicates that the computer system is at the onset of degradation, the system performs a Sequential Probability Ratio Test (SPRT) on the first-difference function. The system then determines whether the SPRT generates an alarm.
[0014]In one embodiment, the SPRT can include one or more of: a positive variance first-difference test, which generates an alarm if the first-difference function for the variance of the time series for the monitored telemetry variable is increasing; and a negative variance first-difference test, which generates an alarm if the first-difference function for the variance of the time series for the monitored telemetry variable is decreasing.
[0015]In one embodiment, while performing the remedial action the system performs one or more of the following actions: recording a time when the onset of degradation occurred; notifying a system administrator that the computer system is at the onset of degradation; shutting down the computer system; backing up data stored on the computer system; failing over to a redundant computer system; replacing one or more components which are at the onset of degradation; and performing other remedial actions.
[0016]FIG. 1 presents a plot illustrating the output of a failing temperature sensor.
[0017]FIG. 2 presents a plot illustrating the dynamic resistance of a failing interconnect.
[0018]FIG. 3 presents a block diagram illustrating a computer system in accordance with an embodiment of the present invention.
[0019]FIG. 4 illustrates a real-time telemetry system in accordance with an embodiment of the present invention.
[0020]FIG. 5 presents a flow chart illustrating the process of using the first-difference function for the variance of telemetry variables to monitor the health of a computer system in accordance with an embodiment of the present invention.
[0021]FIG. 6 illustrates an in-situ reliability stress-test chamber in accordance with an embodiment of the present invention.
[0022]FIG. 7 presents a flow chart illustrating the process of using the first-difference function for the variance of telemetry variables to detect the onset of degradation in components during reliability-evaluation studies in accordance with an embodiment of the present invention.
[0023]The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
[0024]The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
Overview
[0025]One embodiment of the present invention detects precursor failure mechanisms that do not show up as anomalies in the mean value or in the correlation patterns for telemetry variables. More specifically, one embodiment of the present invention detects failure mechanisms that appear as changes in the variance (including the degree of spikiness or burstiness) of monitored telemetry variables.
[0026]One embodiment of the present invention proactively detects and monitors the evolution of computer system failure mechanisms through a binary-hypothesis test that continuously monitors the digitized rate-of-change of the variance for monitored telemetry variables. In doing so, the present invention can detect anomalies that show up as a change-in-gain, change-in-variance, or change-in-spikiness/burstiness, without a change-in-mean value.
[0027]One embodiment of the present invention detects increases in the variance of a monitored telemetry variable and quantifies the rate of increase in the variance. Another embodiment of the present invention detects decreases in the variance of a monitored telemetry variable and quantifies the rate of decrease in the variance. Note that decreases in variance include the case where physical transducers degrade with what are known as "stuck-at" failures.
[0028]Note that for the sake of clarity, the present invention is described in terms of "telemetry variables," which can generally include, but are not limited to, sensor signals generated by physical sensors or software sensors, instrumentation signals, inferential variables which are inferred from sensor signals or inferred from other variables or signals, and any other variable or signal that can be used to determine the health of a computer system or a component.
Computer System
[0029]FIG. 3 presents a block diagram illustrating a computer system 300 in accordance with an embodiment of the present invention. Computer system 300 includes processor 301, memory 302, storage device 303, and real-time telemetry system 304.
[0030]Processor 301 can generally include any type of processor, including, but not limited to, a microprocessor, a mainframe computer, a digital signal processor, a personal organizer, a device controller and a computational engine within an appliance. Memory 302 can include any type of memory, including but not limited to, dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, read only memory (ROM), and any other type of memory now known or later developed. Storage device 303 can include any type of non-volatile storage device that can be coupled to a computer system. This includes, but is not limited to, magnetic, optical, and magneto-optical storage devices, as well as storage devices based on flash memory and/or battery-backed up memory.
[0031]In one embodiment of the present invention, real-time telemetry system 304 is separate from computer system 300. Note that real-time telemetry system 304 is described in more detail below with reference to FIG. 4.
Real-Time Telemetry System
[0032]FIG. 4 illustrates real-time telemetry system 304 in accordance with an embodiment of the present invention. Referring to FIG. 4, computer system 300 can generally include any computational device including a mechanism for servicing requests from a client for computational and/or data storage resources. In one embodiment, computer system 300 is a high-end uniprocessor or multiprocessor server that is being monitored by real-time telemetry system 304.
[0033]Real-time telemetry system 304 includes telemetry device 400, analytical resampling program 401, sensitivity analysis tool 402, and SPRT module 403. Telemetry device 400 gathers information from the various sensors and monitoring tools within computer system 300. In one embodiment, telemetry device 400 directs the signals to a remote location that contains analytical resampling program 401, sensitivity analysis tool 402, and SPRT module 403. In another embodiment of the present invention, one or more of analytical resampling program 401, sensitivity analysis tool 402, and SPRT module 403 are located within computer system 300.
[0034]Analytical resampling program 401 ensures that the monitored telemetry variables have a uniform sampling rate. In doing so, analytical resampling program 401 uses interpolation techniques, if necessary, to fill in missing data points, or to equalize the sampling intervals when the raw data is nonuniformly sampled.
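The resampling step can be illustrated with a minimal sketch. It assumes linear interpolation onto a uniform grid; the description does not specify which interpolation technique analytical resampling program 401 uses, and the function name resample_uniform and the sample data are hypothetical.

```python
from bisect import bisect_right


def resample_uniform(times, values, step):
    """Linearly interpolate (times, values) onto a grid with spacing `step`.

    Assumes `times` is strictly increasing. Linear interpolation is an
    assumption made for this sketch, not a detail taken from the description.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two samples of equal length")
    n_steps = int((times[-1] - times[0]) / step + 1e-9)
    grid, resampled = [], []
    for k in range(n_steps + 1):
        t = times[0] + k * step
        # Index of the raw sample at or just before t, clamped so that
        # a following sample always exists for the interpolation.
        i = min(bisect_right(times, t) - 1, len(times) - 2)
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        grid.append(t)
        resampled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return grid, resampled


# Non-uniformly sampled readings (hypothetical): the sample at t=2.0 and
# t=3.0 is missing and is filled in from the neighbours at 1.0, 2.5, 4.0.
raw_t = [0.0, 1.0, 2.5, 4.0]
raw_v = [10.0, 12.0, 15.0, 12.0]
grid_t, grid_v = resample_uniform(raw_t, raw_v, step=1.0)
print(grid_t, grid_v)
```

After this step every telemetry variable shares the same time grid, which is what the later alignment and differencing steps rely on.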
[0035]After the telemetry variables pass through analytical resampling program 401, they are aligned and correlated by sensitivity analysis tool 402. For example, in one embodiment of the present invention sensitivity analysis tool 402 incorporates a novel moving window technique that "slides" through the telemetry variables with systematically varying window widths. The system systematically varies the alignment between sliding windows for different telemetry variables to optimize the degree of association between the telemetry variables, as quantified by an "F-statistic," which is computed and ranked for all telemetry variable windows by sensitivity analysis tool 402.
[0036]While statistically comparing the quality of two fits, F-statistics reveal the measure of regression. The higher the value of the F-statistic, the better the correlation is between two telemetry variables. The lead/lag value for the sliding window that results in the F-statistic with the highest value is chosen, and the candidate telemetry variable is aligned to maximize this value. This process is repeated for each telemetry variable by sensitivity analysis tool 402.
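The lead/lag search can be sketched as follows. This is only an illustration: it assumes the F-statistic is the textbook one for a simple linear regression of one variable on the other (computed from the squared correlation), it searches a fixed range of integer lags rather than varying window widths, and the names f_statistic and best_lag are hypothetical.

```python
import math


def f_statistic(x, y):
    """F-statistic of a simple linear regression of y on x.

    Uses F = r^2 (n - 2) / (1 - r^2); this choice of statistic is an
    assumption made for the sketch.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) * (a - mx) for a in x)
    syy = sum((b - my) * (b - my) for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant signal carries no regression information
    r2 = (sxy * sxy) / (sxx * syy)
    if r2 >= 1.0:
        return math.inf  # perfect linear relationship
    return r2 * (n - 2) / (1.0 - r2)


def best_lag(reference, candidate, max_lag):
    """Return (lag, F) for the shift of `candidate` that maximizes F.

    A positive lag means `candidate` trails `reference` by that many samples.
    Both series must have the same length and sampling rate.
    """
    n = len(reference)
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = reference[:n - lag], candidate[lag:]
        else:
            x, y = reference[-lag:], candidate[:n + lag]
        if len(x) < 3:
            continue
        score = f_statistic(x, y)
        if score > best[1]:
            best = (lag, score)
    return best


# Hypothetical data: the candidate is the reference delayed by 3 samples.
reference = [math.sin(0.3 * i) + 0.5 * math.sin(1.1 * i) for i in range(60)]
candidate = [reference[0]] * 3 + reference[:-3]
lag, score = best_lag(reference, candidate, max_lag=5)
print(lag)
```

Once the best lag is known, the candidate variable can be shifted by that amount before any further comparison.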
[0037]Telemetry variables that have an F-statistic very close to 1 are "completely correlated" and can be discarded. This can result when two telemetry variables are measuring the same metric, but are expressing them in different engineering units. For example, a telemetry variable can convey a temperature in degrees Fahrenheit, while a second telemetry variable conveys the same temperature in degrees Centigrade. Since these two telemetry variables are perfectly correlated, one does not contain any additional information over the other, and therefore, one may be discarded.
[0038]Some telemetry variables may exhibit little correlation, or no correlation whatsoever. In this case, these telemetry variables may be dropped because they add little predictive information. Once a highly correlated subset of the telemetry variables has been determined, they are combined into one group or cluster for processing by the SPRT module 403.
[0039]One embodiment of the present invention continuously monitors a variety of telemetry variables (e.g., sensor signals) in real time during operation of the server. (Note that although we refer to a single computer system in this disclosure, the present invention can also apply to a collection of computer systems).
[0040]These telemetry variables can also include signals associated with internal performance parameters maintained by software within the computer system. For example, these internal performance parameters can include, but are not limited to, system throughput, transaction latencies, queue lengths, central processing unit (CPU) utilization, load on CPU, idle time, memory utilization, load on the memory, load on the cache, I/O traffic, bus saturation metrics, FIFO overflow statistics, and various operational profiles gathered through "virtual sensors" located within the operating system.
[0041]These telemetry variables can also include signals associated with canary performance parameters for synthetic user transactions, which are periodically generated for the purpose of measuring quality of service from the end user's perspective.
[0042]These telemetry variables can additionally include hardware variables, including, but not limited to, internal temperatures, voltages, currents, and fan speeds.
[0043]Furthermore, these telemetry variables can include disk-related metrics for a remote storage device, including, but not limited to, average service time, average response time, number of kilobytes (kB) read per second, number of kB written per second, number of read requests per second, number of write requests per second, and number of soft errors per second.
[0044]In one embodiment of the present invention, the foregoing telemetry variables are monitored continuously with one or more SPRT tests.
[0045]In one embodiment of the present invention, the components from which the telemetry variables originate are field replaceable units (FRUs), which can be independently monitored. Note that all major system components, including both hardware and software components, can be decomposed into FRUs. (For example, a software FRU can include: an operating system, a middleware component, a database, or an application.)
[0046]Also note that the present invention is not meant to be limited to server computer systems. In general, the present invention can be applied to any type of computer system. This includes, but is not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a personal organizer, a device controller, and a computational engine within an appliance.
Detecting Changes in Variance
[0047]One embodiment of the present invention computes the first-difference function of the variance estimates for digitized time-series telemetry variables under surveillance. In one embodiment of the present invention, the time-series signals are "residuals" formed by subtracting the mean value of a monitored telemetry variable from the value of the monitored telemetry variable. Note that the first-difference function of the variance is a numerical approximation of the derivative of the sequence of variance estimates.
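In symbols, the two quantities described above can be written as follows. The notation is assumed here for illustration and does not appear in the description: x_k is the monitored value at time t_k, \bar{x} is its mean, and V_k is the variance estimate at time t_k. The division by the time interval corresponds to the embodiment in which the difference is normalized by the interval length.

```latex
\begin{align}
r_k &= x_k - \bar{x}, \\
\Delta V_k &= \frac{V_k - V_{k-1}}{t_k - t_{k-1}}
  \approx \left.\frac{dV}{dt}\right|_{t = t_k}.
\end{align}
```

With a uniform sampling interval the denominator is constant, so the first difference is proportional to V_k - V_{k-1}.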
[0048]In one embodiment of the present invention, a sequential probability ratio test (SPRT) is used to monitor the first-difference function of the variance of a telemetry variable. In one embodiment, the SPRT generates a warning flag if the variance of the telemetry variable is increasing (positive variance first-difference function SPRT). In another embodiment, the SPRT generates a warning flag if the variance of the telemetry variable is decreasing (negative variance first-difference function SPRT). Note that more than one SPRT test can be used at the same time.
[0049]A comparison of SPRT alarms issuing from a positive variance-derivative SPRT and/or a negative variance first-difference function SPRT provides a wealth of diagnostic information on a class of failure modes known collectively as a "change-in-gain without a change-in-mean". For example, if the positive variance-derivative SPRT triggers warning flags, it is an indication that there has been a sudden increase in the variance (or degree of spikiness or burstiness) of the process. If this SPRT subsequently ceases triggering warning flags, it is an indication that the degradation mode responsible for the increased noisiness has gone to completion. Such information can be beneficial in root causing the origin of the degradation and helping to eliminate the degradation mechanism from future product designs.
[0050]Similarly, if the negative variance first-difference function SPRT starts triggering alarms, there is a decrease in variance for the process. If the negative variance first-difference function SPRT ceases issuing warning flags, it is an indication that the degradation mode has gone to completion. In safety critical processes, this failure mode (decreasing variance without a change in mean) is dangerous in conventional systems that are monitored only by threshold limit tests. The reason it is dangerous is that a shrinking variance, when it occurs as a result of a transducer that is losing its ability to respond, never trips a threshold limit. (Whereas degradation that manifests as a linear decalibration bias, or even an increasing variance, will eventually trip a high or low threshold limit and sound a warning). A sustained shrinking variance, which can occur, for example, when oil-filled pressure transmitters leak their oil, or electrolytic capacitors leak their electrolyte, never trips a threshold, but can be detected by the positive and negative variance first-difference function SPRT tests.
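The point about threshold limit tests can be made concrete with a small sketch. It assumes a sliding-window variance estimate; the window length, the limit values, and the synthetic sensor data are all hypothetical and chosen only to illustrate a "stuck-at" failure.

```python
from statistics import pvariance


def windowed_variance(series, window):
    """Population variance over each full sliding window of `window` samples."""
    return [pvariance(series[i - window:i])
            for i in range(window, len(series) + 1)]


# Synthetic sensor (hypothetical): it alternates between 48 and 52 while
# healthy, then sticks at its mean value of 50.
healthy = [50.0 + (2.0 if i % 2 else -2.0) for i in range(20)]
stuck = [50.0] * 20
signal = healthy + stuck

# Conventional high/low limit test: no sample ever leaves the band.
HIGH_LIMIT, LOW_LIMIT = 60.0, 40.0
limit_alarms = [v for v in signal if v > HIGH_LIMIT or v < LOW_LIMIT]

# Variance estimates and their first difference (uniform sampling assumed,
# so the time interval is taken as 1).
variance = windowed_variance(signal, window=8)
variance_diff = [b - a for a, b in zip(variance, variance[1:])]

print(len(limit_alarms))                 # no limit violations
print(variance[0], variance[-1])         # variance collapses to zero
print(min(variance_diff))                # negative while the variance shrinks
```

The limit test stays silent throughout, while the first difference of the variance turns negative as soon as stuck samples enter the window, which is the signature a negative variance first-difference test looks for.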
[0051]FIG. 5 presents a flow chart illustrating the process of using the first-difference function for the variance of telemetry variables to monitor the health of a computer system in accordance with an embodiment of the present invention. The process begins when the system monitors telemetry variables (step 500). In one embodiment of the present invention, the telemetry variable is a variance function of the monitored telemetry variable. In another embodiment of the present invention, the system receives a time series for the monitored telemetry variable and calculates a variance function of the time series for the monitored telemetry variable.
[0052]Next, the system calculates a running average for each monitored telemetry variable (step 501). The system then calculates residuals for the telemetry variables (step 502) to produce residuals 503. In one embodiment of the present invention, the system calculates residuals for the telemetry variables by subtracting the running average for each monitored telemetry variable from the corresponding monitored telemetry variable.
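Steps 501 and 502 can be sketched in a few lines. The sketch assumes the running average is the cumulative mean of all values up to and including the present one, as in the summary of the residual calculation; the function name running_average_residuals is hypothetical.

```python
def running_average_residuals(series):
    """Subtract the running (cumulative) average from each value."""
    residuals = []
    total = 0.0
    for count, value in enumerate(series, start=1):
        total += value
        running_average = total / count   # average up to and including now
        residuals.append(value - running_average)
    return residuals


# Hypothetical variance estimates for one telemetry variable.
variance_series = [2.0, 4.0, 6.0, 8.0]
residuals = running_average_residuals(variance_series)
print(residuals)  # running averages are 2, 3, 4, 5 -> [0.0, 1.0, 2.0, 3.0]
```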
[0053]Next, the system calculates the first-difference function for the residuals of the variance for the monitored telemetry variables (step 504). In one embodiment of the present invention, for each time point within the time series for the telemetry variable, the system calculates the first-difference function by subtracting a value of the time series at a previous time point from the value of the time series at a present time point. In one embodiment of the present invention, the system divides the result of the subtraction by the value of a length of a time interval between the previous time point and the present time point.
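Step 504 can be sketched as follows, covering both the plain subtraction and the embodiment that divides by the interval length. The function name first_difference and the sample values are hypothetical.

```python
def first_difference(values, times=None):
    """First difference of `values`, divided by the time step when given.

    If `times` is omitted, a unit interval between samples is assumed,
    which reduces the result to the plain subtraction.
    """
    if times is None:
        times = range(len(values))
    return [
        (values[k] - values[k - 1]) / (times[k] - times[k - 1])
        for k in range(1, len(values))
    ]


# Hypothetical variance residuals with one irregular interval (2.0 -> 4.0).
residual_values = [1.0, 1.0, 1.5, 3.0]
sample_times = [0.0, 1.0, 2.0, 4.0]
rate = first_difference(residual_values, sample_times)
print(rate)  # [0.0, 0.5, 0.75]
```

The output has one fewer element than the input because the first sample has no predecessor to subtract.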
[0054]The system then performs a Sequential Probability Ratio Test (SPRT) on the first-difference functions (step 505). Note that the system also receives alpha and beta values 506 for the SPRT. (SPRTs are described in more detail below.)
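A minimal sketch of step 505 follows, assuming the classical Wald SPRT for a shift in the mean of Gaussian data, with alpha taken as the false-alarm probability and beta as the missed-alarm probability. The shift magnitude, the noise level, the reset-after-decision behavior, and the function name sprt_alarms are assumptions made for this sketch rather than details taken from the description.

```python
import math


def sprt_alarms(samples, shift, sigma, alpha, beta):
    """Wald SPRT of H0: mean 0 against H1: mean `shift` (Gaussian, std sigma).

    Returns the sample indices at which H1 is accepted (an alarm).
    A positive `shift` gives a positive test, a negative `shift` a
    negative test. The statistic is reset after every decision.
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1.0 - alpha))   # crossing it accepts H0
    llr = 0.0
    alarms = []
    for index, x in enumerate(samples):
        # Log-likelihood ratio increment for one Gaussian sample.
        llr += (shift / sigma ** 2) * (x - shift / 2.0)
        if llr >= upper:
            alarms.append(index)
            llr = 0.0
        elif llr <= lower:
            llr = 0.0
    return alarms


# Hypothetical first-difference values: flat at first, then rising steadily.
diffs = [0.0] * 10 + [1.0] * 20
positive_alarms = sprt_alarms(diffs, shift=1.0, sigma=1.0, alpha=0.01, beta=0.01)
negative_alarms = sprt_alarms(diffs, shift=-1.0, sigma=1.0, alpha=0.01, beta=0.01)
print(positive_alarms)  # [19, 29]
print(negative_alarms)  # []
```

Running the positive and negative tests side by side on the same input mirrors the use of more than one SPRT at the same time; smaller alpha and beta values widen the decision thresholds and so require more evidence before an alarm.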
[0055]If an alarm is generated by the SPRTs (step 507, yes), the system records the time of failure (step 508), and continues monitoring the telemetry variables (step 509). If no alarm is generated by the SPRTs (step 507, no), the system continues monitoring the telemetry variables (step 509).
[0056]In one embodiment of the present invention, a remedial action is also performed in step 508. In one embodiment of the present invention, the remedial action can involve performing one or more of: recording a time when the onset of degradation occurred; notifying a system administrator that the computer system is at the onset of degradation; shutting down the computer system; backing up data stored on the computer system; failing over to a redundant computer system; replacing one or more components which are at the onset of degradation; and other remedial actions.
Accelerated-Life Studies
[0057]For devices undergoing accelerated-life studies, it is often desirable to supply power to the devices under test while they are in the stress-test chambers. Even though it may not be possible to apply the full pass/fail functional testing to the devices inside the stress-test chamber, a change in the electrical behavior of the device can be detected by monitoring the signatures of the electrical current being applied to the devices. Note that subtle anomalies in the noise-signature time series of the current for the device appear when the device degrades and/or fails. Also note that the current to the device can provide an indirect measure of the health of a device. More specifically, the current-noise time series can be used as an "inferential variable" for high-resolution annunciation of the onset of degradation and, in many cases, the exact point of failure in time in the components undergoing accelerated-life studies.
[0058]FIG. 6 illustrates an in-situ reliability stress-test chamber 600 in accordance with an embodiment of the present invention. A component under test 601, which can be any type of device from a computer system, is placed inside stress-test chamber 600. Note that component under test 601 can include, but is not limited to: power supplies, capacitors, sockets, integrated circuit chips, hard drives, and transceivers.
[0059]Stress control module 602 applies and controls one or more stress variables to the stress-test chamber 600. These stress variables can include, but are not limited to: temperature, humidity, vibration, voltage, chemical/environmental, and radiation. In one embodiment of the present invention, stress control module 602 applies sufficient stress factors to stress-test chamber 600 to create accelerated-life studies for the component under test 601. The same setup can also be applied to early failure rate studies of a component, burn-in screens of a component, and repair-center reliability evaluations of a returned component.
[0060]As is shown in FIG. 6, stress-test chamber 600 can contain multiple units (specimens) of component under test 601, wherein an array of nine specimens 603 of component under test 601 is shown. Stress-test chamber 600 provides a supply of power to each specimen of component under test 601, and obtains telemetry variable outputs (e.g., inferential variables) from each specimen. The telemetry variable outputs are coupled to a fault-monitoring module 604. In one embodiment of the present invention, fault-monitoring module 604 is a Continuous System Telemetry Harness (CSTH).
[0061]Note that the output data series can be either processed in real time or post-processed. In one embodiment of the present invention, fault-monitoring module 604 analyzes the output data series in real time while the telemetry variables are being collected from all of the specimens 603 of component under test 601, and predicts the likelihood of failure for each of specimens 603. In another embodiment of the present invention, fault-monitoring module 604 post-processes the output data series at a later time and detects whether failures have occurred at an earlier time, and if so, determines the time of failures. Note that the output data series can include but is not limited to: a time series, a number of cycles, and a number of incidents.
[0062]Furthermore, note that the telemetry variable from each specimen of the component can include current, voltage, resistance, temperature, and other physical variables. Also, note that all of the specimens 603 in stress-test chamber 600 can be tested at the same time and under the same conditions. Moreover, instead of testing multiple individual components, the stress-test chamber can be configured to test a single component.
[0063]One embodiment of the present invention uses an ultrasensitive sequential detection technique called the Sequential Probability Ratio Test (SPRT) for telemetry variable surveillance to accurately identify the onset of component degradation and/or failure. Moreover, a tandem SPRT can be run on the derivative of the telemetry variable's time series to accurately assess the time of complete failure. The combination of tandem SPRTs that monitor the telemetry variables provides a robust surveillance scheme which has the capability to:
[0064]1. detect the onset of degradation in any individual component under stress, even when the overall functionality of that component cannot be measured directly; and to
[0065]2. detect the time of complete failure for any component under stress.
[0066]In one embodiment of the present invention, information from the tandem SPRT analyses is combined with discrete-time ex-situ pass/fail testing to construct a detailed population failure distribution.
[0067]One embodiment of the present invention lessens the constraints on the tradeoff between the number of units under test and the duration of the experiments, while yielding much higher resolution information on the dynamic evolution of the health of the components as a function of age and cumulative stress. This higher resolution facilitates higher confidence in selecting a mathematical model that accurately predicts the long-term reliability of the component for a time point beyond the number of hours the component was actually tested.
[0068]Also note that the present invention minimizes expensive ex-situ functional evaluations.
[0069]FIG. 7 presents a flow chart illustrating the process of using the first-difference function for the variance of telemetry variables to detect the onset of degradation in components during reliability-evaluation studies in accordance with an embodiment of the present invention. The process begins when the system monitors telemetry variables (step 700). In one embodiment of the present invention, the telemetry variable is a variance function of the monitored telemetry variable. In another embodiment of the present invention, the system receives a time series for the monitored telemetry variable and calculates a variance function of the time series for the monitored telemetry variable. Note that FIG. 5 and FIG. 7 are different. FIG. 5 illustrates the process of monitoring a number of distinct telemetry variables for a single computer system or a single component. In contrast, FIG. 7 illustrates the process of monitoring a single telemetry variable from a number of specimens of a component. In one embodiment of the present invention, the process illustrated in FIG. 7 can be performed on one or more telemetry variables across a number of specimens of a component.
[0070]Next, the system calculates a running average across all monitored telemetry variables (step 701). The system then calculates residuals for the telemetry variables (step 702) to produce residuals 703. In one embodiment of the present invention, the system calculates residuals for the telemetry variables by subtracting the running average for all monitored telemetry variables from each monitored telemetry variable.
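As an illustration of steps 701 and 702, the following minimal sketch computes residuals for a set of specimens. It assumes that the running average is taken across specimens at each time point; the function name `specimen_residuals` and the list-of-lists layout are hypothetical and are not part of the disclosure.

```python
def specimen_residuals(signals):
    """Subtract the cross-specimen average from each specimen's signal.

    signals: list of equal-length lists, one per specimen (assumed layout).
    Returns a list of residual series with the same shape.
    """
    n_specimens = len(signals)
    n_samples = len(signals[0])
    residuals = [[0.0] * n_samples for _ in range(n_specimens)]
    for t in range(n_samples):
        # Average over all specimens at time point t (step 701).
        average = sum(row[t] for row in signals) / n_specimens
        for i in range(n_specimens):
            # Residual for specimen i at time point t (step 702).
            residuals[i][t] = signals[i][t] - average
    return residuals


example = specimen_residuals([[1.0, 2.0, 3.0],
                              [1.0, 2.0, 5.0],
                              [1.0, 2.0, 4.0]])
```

In this example only the third time point differs between specimens, so only that column of the residuals is nonzero.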
[0071]Next, the system calculates the first-difference function for the residuals of the variance for the monitored telemetry variables (step 704). In one embodiment of the present invention, for each time point within the time series for the telemetry variable, the system calculates the first-difference function by subtracting a value of the time series at a previous time point from the value of the time series at a present time point. In one embodiment of the present invention, the system divides the result of the subtraction by the length of the time interval between the previous time point and the present time point.
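The calculation in step 704 can be sketched as follows. The function name `first_difference` and the optional `times` argument are illustrative assumptions; when `times` is supplied, each difference is divided by the corresponding time interval, as in the second embodiment described above.

```python
def first_difference(values, times=None):
    """Return the first-difference function of a time series.

    values: observations ordered in time.
    times:  optional time stamps; if given, each difference is divided
            by the interval between consecutive time points.
    """
    diffs = []
    for k in range(1, len(values)):
        delta = values[k] - values[k - 1]
        if times is not None:
            delta /= times[k] - times[k - 1]
        diffs.append(delta)
    return diffs


plain = first_difference([1.0, 3.0, 6.0])
scaled = first_difference([1.0, 3.0, 6.0], times=[0.0, 2.0, 4.0])
```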
[0072]The system then performs a Sequential Probability Ratio Test (SPRT) on the first-difference functions (step 705). Note that the system receives alpha and beta values 706 for the SPRT. Note that SPRTs are described in more detail below.
[0073]If an alarm is generated by the SPRTs (step 707, yes), the system records the time of failure (step 708). The system then determines whether the reliability-evaluation study should be altered (step 710). If so, the system stops and alters the reliability-evaluation study (step 711). The system then continues monitoring the telemetry variables (step 709). If the system determines that the reliability-evaluation study should not be altered, the system continues monitoring the telemetry variables (step 709).
[0074]If no alarm is generated by the SPRTs (step 707, no), the system continues monitoring the telemetry variables (step 709).
SPRT (Sequential Probability Ratio Test)
[0075]The Sequential Probability Ratio Test is a statistical hypothesis test that differs from standard fixed-sample tests. In fixed-sample statistical tests, a given number of observations are used to select one hypothesis from one or more alternative hypotheses. The SPRT, however, examines one observation at a time, and then makes a decision as soon as it has sufficient information to ensure that pre-specified confidence bounds are met.
[0076]The basic approach taken by the SPRT technique is to analyze successive observations of a discrete process. Let y.sub.n represent a sample from the process at a given moment t.sub.n in time. In one embodiment of the present invention, the sequence of values {Y.sub.n}=y.sub.0, y.sub.1, . . . y.sub.n comes from a stationary process characterized by a Gaussian, white-noise probability density function (PDF) with mean 0. (Note that since the sequence is from a nominally stationary process, any process variables with a nonzero mean can first be normalized to a mean of zero with no loss of generality.)
[0077]The SPRT is a binary hypothesis test that analyzes process observations sequentially to determine whether or not the telemetry variable is consistent with normal behavior. When a SPRT reaches a decision about current process behavior (i.e., the telemetry variable is behaving normally or abnormally), the system reports the decision and continues to process observations.
[0078]For each of the eight types of tandem SPRT tests described below, the telemetry variable data adheres to a Gaussian PDF with mean 0 and variance .sigma..sup.2 for normal signal behavior, referred to as the null hypothesis, H.sub.0. The system computes eight specific SPRT hypothesis tests in parallel for each telemetry variable monitored. One embodiment of the present invention applies a SPRT to an electrical current time series. Other embodiments of the present invention apply a SPRT to other telemetry variables, including voltage, internal temperature, or stress variables.
[0079]The SPRT surveillance module executes all 8 tandem hypothesis tests in parallel. Each test determines whether the current sequence of process observations is consistent with the null hypothesis versus an alternative hypothesis. The first four tests are: (SPRT 1) the positive-mean test, (SPRT 2) the negative-mean test, (SPRT 3) the nominal-variance test, and (SPRT 4) the inverse-variance test. For the positive-mean test, the telemetry variable data for the corresponding alternative hypothesis, H.sub.1, adheres to a Gaussian PDF with mean +M and variance .sigma..sup.2. For the negative-mean test, the telemetry variable data for the corresponding alternative hypothesis, H.sub.2, adheres to a Gaussian PDF with mean -M and variance .sigma..sup.2. For the nominal-variance test, the telemetry variable data for the corresponding alternative hypothesis, H.sub.3, adheres to a Gaussian PDF with mean 0 and variance V.sigma..sup.2 (with scalar factor V). For the inverse-variance test, the telemetry variable data for the corresponding alternative hypothesis, H.sub.4, adheres to a Gaussian PDF with mean 0 and variance .sigma..sup.2/V.
[0080]The next two tandem SPRT tests are performed not on the raw telemetry variables as above, but on the first-difference function of the telemetry variable. For discrete time series, the first-difference function (i.e., difference between each observation and the observation preceding it) gives an estimate of the numerical derivative of the time series. During uninteresting time periods, the observations in the first-difference function are a nominally stationary random process centered about zero. If an upward or downward trend suddenly appears in the telemetry variable, SPRTs number 5 and 6 observe an increase or decrease, respectively, in the slope of the telemetry variable.
[0081]For example, if there is a decrease in the value of the telemetry variable, SPRT alarms are triggered for SPRTs 2 and 6. SPRT 2 generates a warning because the sequence of raw observations drops with time. And SPRT 6 generates a warning because the slope of the telemetry variable changes from zero to something less than zero. The advantage of monitoring the mean SPRT and slope SPRT in tandem is that the system correlates the SPRT readings from the eight tests and determines if the component has failed. For example, if the telemetry variable levels off to a new stationary value (or plateau), the alarms from SPRT 6 cease because the slope returns to zero when the raw telemetry variable reaches a plateau. However, SPRT 2 will continue generating a warning because the new mean value of the telemetry variable is different from the value prior to the degradation. Therefore, the system correctly identifies that the component has failed.
[0082]If SPRT 3 or SPRT 4 generates a warning, the variance of the telemetry variable is either increasing or decreasing, respectively. An increasing variance that is not accompanied by a change in mean (inferred from SPRTs 1 and 2 and SPRTs 5 and 6) signifies an episodic event that is "bursty" or "spiky" with time. A decreasing variance that is not accompanied by a change in mean is a common symptom of a failing component that is characterized by an increasing time constant. Therefore, having variance SPRTs available in parallel with slope and mean SPRTs provides a wealth of supplementary diagnostic information.
[0083]The final two tandem SPRT tests, SPRT 7 and SPRT 8, are performed on the first-difference function of the variance estimates for the telemetry variable. The first-difference function of the variance estimates is a numerical approximation of the derivative of the sequence of variance estimates. As such, SPRT 7 triggers a warning flag if the variance of the telemetry variable is increasing, while SPRT 8 triggers a warning flag if the variance of the telemetry variable is decreasing. A comparison of SPRT alarms from SPRTs 3, 4, 7, and 8 gives a great deal of diagnostic information on a class of failure modes known collectively as a "change-in-gain without a change-in-mean." For example, if SPRTs 3 and 7 both trigger warning flags, it is an indication that there has been a sudden increase in the variance of the process. If SPRT 3 continues to trigger warning flags but SPRT 7 ceases issuing warning flags, it is an indication that the degradation mode responsible for the increased noisiness has gone to completion. Such information can be beneficial in root-causing the origin of the degradation and eliminating it from future product designs.
[0084]Similarly, if SPRTs 4 and 8 both start triggering alarms, there is a decrease in variance for the process. If SPRT 4 continues to issue warning flags but SPRT 8 ceases issuing warning flags, it is an indication that the degradation mode has gone to completion. In safety-critical processes, this failure mode (decreasing variance without a change in mean) is dangerous in conventional systems that are monitored only by threshold limit tests. The reason it is dangerous is that a shrinking variance, when it occurs as a result of a transducer that is losing its ability to respond, never trips a threshold limit. (In contrast, degradation that manifests as a linear decalibration bias, or even an increasing variance, eventually trips a high or low threshold limit and sounds a warning.) A sustained decreasing variance, which happens, for example, when oil-filled pressure transmitters leak their oil, or electrolytic capacitors leak their electrolyte, never trips a threshold in conventional systems, but will be readily detected by the suite of 8 tandem SPRT tests taught in this invention.
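The input to SPRT 7 and SPRT 8 can be illustrated with the following sketch. The disclosure does not state how the variance estimates are formed, so the fixed-length sliding window used here, together with the names `windowed_variance` and `variance_first_difference`, is an assumption made purely for illustration.

```python
def windowed_variance(samples, window):
    """Variance estimate over each sliding window of the given length."""
    estimates = []
    for end in range(window, len(samples) + 1):
        chunk = samples[end - window:end]
        mean = sum(chunk) / window
        estimates.append(sum((y - mean) ** 2 for y in chunk) / window)
    return estimates


def variance_first_difference(samples, window):
    """First difference of the sequence of windowed variance estimates."""
    est = windowed_variance(samples, window)
    return [est[k] - est[k - 1] for k in range(1, len(est))]


# A signal whose fluctuations die out while its mean stays near zero.
signal = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0]
variances = windowed_variance(signal, window=2)
slopes = variance_first_difference(signal, window=2)
```

In this example the signal never leaves the range it occupied while healthy, so a fixed high or low threshold would not trip, yet the first difference of the variance estimates turns negative as the fluctuations disappear.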
[0085]The SPRT technique provides a quantitative framework that permits a decision to be made between the null hypothesis and the eight alternative hypotheses with specified misidentification probabilities. If the SPRT accepts one of the alternative hypotheses, an alarm flag is set and data is transmitted.
[0086]The SPRT operates as follows. At each time step in a calculation, the system calculates a test index and compares it to two stopping boundaries A and B (defined below). The test index is equal to the natural log of a likelihood ratio (L.sub.n), which for a given SPRT is the ratio of the probability that the alternative hypothesis for the test (H.sub.j, where j is the appropriate subscript for the SPRT in question) is true, to the probability that the null hypothesis (H.sub.0) is true.
$$L_n = \frac{\text{probability of observed sequence } \{Y_n\} \text{ given } H_j \text{ is true}}{\text{probability of observed sequence } \{Y_n\} \text{ given } H_0 \text{ is true}} \tag{1}$$
[0087]If the logarithm of the likelihood ratio is greater than or equal to the logarithm of the upper threshold limit [i.e., ln(L.sub.n) ≥ ln(B)], then the alternative hypothesis is true. If the logarithm of the likelihood ratio is less than or equal to the logarithm of the lower threshold limit [i.e., ln(L.sub.n) ≤ ln(A)], then the null hypothesis is true. If the log likelihood ratio falls between the two limits [i.e., ln(A) < ln(L.sub.n) < ln(B)], then there is not enough information to make a decision (and, incidentally, no other statistical test could yet reach a decision with the same given Type I and II misidentification probabilities).
[0088]Equation (2) relates the threshold limits to the misidentification probabilities .alpha. and .beta.:
$$A = \frac{\beta}{1 - \alpha}, \qquad B = \frac{1 - \beta}{\alpha} \tag{2}$$
where .alpha. is the probability of accepting H.sub.j when H.sub.0 is true (i.e., the false-alarm probability), and .beta. is the probability of accepting H.sub.0 when H.sub.j is true (i.e., the missed-alarm probability).
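As a worked illustration of equation (2), the sketch below computes the logarithms of the two stopping boundaries from the false-alarm and missed-alarm probabilities. The function name `sprt_log_boundaries` and the example values of 0.01 are assumptions chosen for illustration only.

```python
import math


def sprt_log_boundaries(alpha, beta):
    """Return (ln A, ln B) for false-alarm probability alpha and
    missed-alarm probability beta, following equation (2)."""
    lower = math.log(beta / (1.0 - alpha))   # ln A
    upper = math.log((1.0 - beta) / alpha)   # ln B
    return lower, upper


ln_a, ln_b = sprt_log_boundaries(alpha=0.01, beta=0.01)
```

With equal error probabilities the two boundaries are symmetric about zero, here approximately -4.6 and +4.6.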
[0089]The first two SPRT tests for normal distributions examine the mean of the process observations. If the distribution of observations exhibits a nonzero mean (e.g., a mean of either +M or -M, where M is the pre-assigned system disturbance magnitude for the mean test), the mean tests determine that the system is degraded. Assuming that the sequence {Y.sub.n} adheres to a Gaussian PDF, then the probability that the null hypothesis H.sub.0 is true (i.e., mean 0 and variance .sigma..sup.2) is:
$$P(y_1, y_2, \ldots, y_n \mid H_0) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{k=1}^{n} y_k^2\right] \tag{3}$$
[0090]Similarly, the probability that alternative hypothesis H.sub.1 is true (i.e., mean +M and variance .sigma..sup.2) is:
$$P(y_1, y_2, \ldots, y_n \mid H_1) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \left(\sum_{k=1}^{n} y_k^2 - 2\sum_{k=1}^{n} y_k M + \sum_{k=1}^{n} M^2\right)\right] \tag{4}$$
[0091]The ratio of the probabilities in (3) and (4) gives the likelihood ratio L.sub.n for the positive-mean test:
$$L_n = \exp\left[-\frac{1}{2\sigma^2} \sum_{k=1}^{n} M(M - 2y_k)\right] \tag{5}$$
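The intermediate algebra behind (5) can be written out as follows. This is a reader's expansion of the ratio of (4) to (3) and introduces no symbols beyond those already defined: the common normalizing factor cancels, and the sums of squared observations cancel inside the exponent.

```latex
\begin{aligned}
L_n &= \frac{P(y_1,\ldots,y_n \mid H_1)}{P(y_1,\ldots,y_n \mid H_0)} \\
    &= \exp\left[-\frac{1}{2\sigma^2}\left(\sum_{k=1}^{n} y_k^2
        - 2\sum_{k=1}^{n} y_k M + \sum_{k=1}^{n} M^2\right)
        + \frac{1}{2\sigma^2}\sum_{k=1}^{n} y_k^2\right] \\
    &= \exp\left[-\frac{1}{2\sigma^2}\sum_{k=1}^{n}\left(M^2 - 2 y_k M\right)\right]
     = \exp\left[-\frac{1}{2\sigma^2}\sum_{k=1}^{n} M\,(M - 2 y_k)\right]
\end{aligned}
```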
[0092]Taking the logarithm of the likelihood ratio given by (5) produces the SPRT index for the positive-mean test (SPRT.sub.pos):
$$\mathrm{SPRT}_{pos} = -\frac{1}{2\sigma^2} \sum_{k=1}^{n} M(M - 2y_k) = \frac{M}{\sigma^2} \sum_{k=1}^{n} \left(y_k - \frac{M}{2}\right) \tag{6}$$
[0093]The SPRT index for the negative-mean test (SPRT.sub.neg) is derived by substituting -M for each instance of M in (4) through (6) above, resulting in:
$$\mathrm{SPRT}_{neg} = \frac{M}{\sigma^2} \sum_{k=1}^{n} \left(-y_k - \frac{M}{2}\right) \tag{7}$$
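Equations (6) and (7) can be evaluated incrementally, one observation at a time. The sketch below is a minimal illustration; the function name `mean_sprt_indices` and the argument name `sigma2` (the training-phase variance) are hypothetical.

```python
def mean_sprt_indices(samples, M, sigma2):
    """Accumulate the positive-mean and negative-mean SPRT indices.

    samples: zero-mean-normalized observations y_1 ... y_n
    M:       disturbance magnitude for the mean tests
    sigma2:  variance of the process under the null hypothesis
    """
    pos = 0.0
    neg = 0.0
    for y in samples:
        pos += (M / sigma2) * (y - M / 2.0)    # term of equation (6)
        neg += (M / sigma2) * (-y - M / 2.0)   # term of equation (7)
    return pos, neg


pos_index, neg_index = mean_sprt_indices([1.0, 1.0], M=1.0, sigma2=1.0)
```

Observations that sit at +M drive the positive-mean index upward and the negative-mean index downward, as expected for a positive shift in mean.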
[0094]The next two SPRT tests examine the variance of the sequence. This capability gives the SPRT module the ability to detect and quantitatively characterize changes in variability for processes, which is vitally important for 6-sigma QA/QC improvement initiatives. In the variance tests, the system is degraded if the sequence exhibits a change in variance by a factor of V or 1/V, where V, the pre-assigned system disturbance magnitude for the variance test, is a positive scalar. The probability that the alternative hypothesis H.sub.3 is true (i.e., mean 0 and variance V.sigma..sup.2) is given by (3) with .sigma..sup.2 replaced by V.sigma..sup.2:
$$P(y_1, y_2, \ldots, y_n \mid H_3) = \frac{1}{(2\pi V\sigma^2)^{n/2}} \exp\left[-\frac{1}{2V\sigma^2} \sum_{k=1}^{n} y_k^2\right] \tag{8}$$
[0095]The likelihood ratio for the variance test is given by the ratio of (8) to (3):
$$L_n = V^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\, \frac{1 - V}{V} \sum_{k=1}^{n} y_k^2\right] \tag{9}$$
[0096]Taking the logarithm of the likelihood ratio given in (9) produces the SPRT index for the nominal-variance test (SPRT.sub.nom):
$$\mathrm{SPRT}_{nom} = \frac{1}{2\sigma^2} \left(\frac{V - 1}{V}\right) \sum_{k=1}^{n} y_k^2 - \frac{n}{2} \ln V \tag{10}$$
[0097]The SPRT index for the inverse-variance test (SPRT.sub.inv) is derived by substituting 1/V for each instance of V in (8) through (10), resulting in:
$$\mathrm{SPRT}_{inv} = \frac{1}{2\sigma^2} (1 - V) \sum_{k=1}^{n} y_k^2 + \frac{n}{2} \ln V \tag{11}$$
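Equations (10) and (11) can be evaluated as in the following sketch. The function name `variance_sprt_indices` and the argument name `sigma2` are hypothetical; the final line checks numerically that (11) is what (10) becomes when 1/V is substituted for V.

```python
import math


def variance_sprt_indices(samples, V, sigma2):
    """Return the nominal-variance and inverse-variance SPRT indices.

    samples: zero-mean-normalized observations y_1 ... y_n
    V:       disturbance magnitude for the variance tests (positive scalar)
    sigma2:  variance of the process under the null hypothesis
    """
    n = len(samples)
    sum_sq = sum(y * y for y in samples)
    nom = (sum_sq / (2.0 * sigma2)) * ((V - 1.0) / V) - (n / 2.0) * math.log(V)  # (10)
    inv = (sum_sq / (2.0 * sigma2)) * (1.0 - V) + (n / 2.0) * math.log(V)        # (11)
    return nom, inv


nom_index, inv_index = variance_sprt_indices([2.0, 2.0], V=2.0, sigma2=1.0)
# Substituting 1/V into (10) should reproduce (11).
nom_with_inverse_v, _ = variance_sprt_indices([2.0, 2.0], V=0.5, sigma2=1.0)
```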
[0098]The tandem SPRT module performs mean, variance, and SPRT tests on the raw process telemetry variable and on its first-difference function. To initialize the module for analysis of a telemetry variable time series, the user specifies the system disturbance magnitudes for the tests (M and V), the false-alarm probability (.alpha.), and the missed-alarm probability (.beta.).
[0099]Then, during the training phase (before the first failure of a component under test), the module calculates the mean and variance of the monitored variable process signal. For most telemetry variables the mean of the raw observations for the telemetry variable will be nonzero; in this case the mean calculated from the training phase is used to normalize the telemetry variable during the monitoring phase. The system disturbance magnitude for the mean tests specifies the number of standard deviations (or fractions thereof) that the distribution must shift in the positive or negative direction to trigger an alarm. The system disturbance magnitude for the variance tests specifies the fractional change of the variance necessary to trigger an alarm.
[0100]At the beginning of the monitoring phase, the system sets all eight SPRT indices to 0. Then, during each time step of the calculation, the system updates the SPRT indices using (6), (7), (10), and (11). The system then compares each SPRT index to the upper [i.e., ln((1-.beta.)/.alpha.)] and lower [i.e., ln(.beta./(1-.alpha.))] decision boundaries, with these three possible outcomes:
[0101]1. the lower limit is reached, in which case the process is declared healthy, the test statistic is reset to zero, and sampling continues;
[0102]2. the upper limit is reached, in which case the process is declared degraded, an alarm flag is raised indicating a sensor or process fault, the test statistic is reset to zero, and sampling continues; or
[0103]3. neither limit has been reached, in which case no decision concerning the process can yet be made, and the sampling continues.
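The monitoring-phase loop described above can be sketched for a single test, here the positive-mean test of equation (6). This is an illustrative sketch only: the function name `run_positive_mean_sprt`, the string labels for the three outcomes, and the example parameter values are assumptions, and a full module would maintain all eight indices in the same manner.

```python
import math


def run_positive_mean_sprt(samples, M, sigma2, alpha, beta):
    """Apply the positive-mean SPRT to each observation in turn.

    Returns one outcome per observation: "healthy", "degraded",
    or "undecided". The index is reset to zero after every decision.
    """
    lower = math.log(beta / (1.0 - alpha))   # lower decision boundary
    upper = math.log((1.0 - beta) / alpha)   # upper decision boundary
    index = 0.0
    outcomes = []
    for y in samples:
        index += (M / sigma2) * (y - M / 2.0)   # update per equation (6)
        if index >= upper:
            outcomes.append("degraded")    # alarm flag raised
            index = 0.0
        elif index <= lower:
            outcomes.append("healthy")
            index = 0.0
        else:
            outcomes.append("undecided")   # keep sampling
    return outcomes


shifted = run_positive_mean_sprt([1.0] * 5, M=1.0, sigma2=1.0, alpha=0.1, beta=0.1)
nominal = run_positive_mean_sprt([0.0] * 5, M=1.0, sigma2=1.0, alpha=0.1, beta=0.1)
```

With these example values the boundaries are roughly -2.2 and +2.2, and each observation moves the index by 0.5, so a decision is reached on the fifth observation in both cases.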
[0104]The advantages of using a SPRT are twofold:
[0105]1. early detection of very subtle anomalies in noisy process variables; and
[0106]2. pre-specification of quantitative false-alarm and missed-alarm probabilities.
[0107]The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.