Hybrid method for event prediction and system control

US 20050114739A1
Filed: 11/24/2003
Published: 05/26/2005
Est. Priority Date: 11/24/2003
Status: Active Grant

Abstract

A hybrid method of predicting the occurrence of future critical events in a computer cluster having a series of nodes records system performance parameters and the occurrence of past critical events. A data filter filters the logged data to eliminate redundancies and decrease the data storage requirements of the system. Time-series models and rule-based classification schemes are used to associate various system parameters with the past occurrence of critical events and predict the occurrence of future critical events. Ongoing processing jobs are migrated to nodes for which no critical events are predicted, and future jobs are routed to more robust nodes.
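The data filter mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how redundancy is detected, so the rule used here (drop a repeat of the same node and event within a fixed time window) is an assumption, as are the names `LogEntry`, `filter_redundant`, and `window`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical record layout; the source does not define one.
    timestamp: float  # seconds since an arbitrary start
    node: str
    event: str


def filter_redundant(entries, window=60.0):
    """Keep an entry only if the same (node, event) pair was not
    already kept within the previous `window` seconds (assumed rule)."""
    last_kept = {}  # (node, event) -> timestamp of the last kept entry
    kept = []
    for entry in sorted(entries, key=lambda e: e.timestamp):
        key = (entry.node, entry.event)
        previous = last_kept.get(key)
        if previous is None or entry.timestamp - previous > window:
            kept.append(entry)
            last_kept[key] = entry.timestamp
    return kept


log = [
    LogEntry(0.0, "n1", "ECC_ERROR"),
    LogEntry(5.0, "n2", "ECC_ERROR"),
    LogEntry(10.0, "n1", "ECC_ERROR"),   # repeat within the window, dropped
    LogEntry(30.0, "n1", "ECC_ERROR"),   # repeat within the window, dropped
    LogEntry(100.0, "n1", "ECC_ERROR"),  # outside the window, kept
]
filtered = filter_redundant(log)
```

A filter of this kind reduces storage while keeping the first report of each burst, which is usually the entry that matters for later prediction.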

21 Claims

1. A method of predicting the occurrence of critical events in a computer cluster having a series of nodes, said method comprising:
- maintaining an event log that contains information concerning critical events that have occurred in the computer cluster;
  
  maintaining a system parameter log that contains information concerning system parameters for each node in the cluster; and
  
  predicting a future performance of a node in the cluster based upon said event log and said system parameter log.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1 comprising developing a Bayesian network model that represents said computer cluster and said nodes based upon the information in said event log and said system parameter log.
  - 3. The method of claim 1 wherein maintaining said system parameter log comprises recording a temperature of a node in the cluster and a corresponding time value.
  - 4. The method of claim 1 wherein maintaining said system parameter log comprises recording a utilization parameter of a central processing unit of a node in the cluster and a corresponding time value.
  - 5. The method of claim 1 comprising filtering said event log and said system parameter log such that some critical event information and some system parameter information is not maintained in said event log and said system parameter log.
  - 6. The method of claim 1 comprising using a time-series mathematical model to predict future values of said system parameters.
  - 7. The method of claim 1 comprising using a rule based classification system to predict future critical events based upon said critical event information and said system parameter information.
  - 8. The method of claim 1 wherein the step of predicting comprises forming a warning window for each node in the cluster such that said warning window contains a predicted performance parameter or critical event occurrence for the node for a predetermined future period of time.
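The warning window of claim 8, combined with the temperature log of claim 3 and a rule of the kind named in claim 7, might be sketched as follows. This is an illustration under stated assumptions, not the claimed method: the forecast is a plain linear extrapolation of the last two samples (a stand-in for the time-series model of claim 6), samples are assumed to be evenly spaced, and `warning_window`, `steps`, and `temp_limit` are hypothetical names.

```python
def warning_window(samples, steps, temp_limit=85.0):
    """Build a warning window for one node.

    samples: list of (time, temperature) pairs ordered by time, with at
             least two entries and even spacing (assumption).
    steps:   number of future sampling intervals to cover.
    Returns a list of (time, predicted_temperature, flagged) tuples.
    """
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    dt, dv = t1 - t0, v1 - v0
    window = []
    for k in range(1, steps + 1):
        # Linear extrapolation: a stand-in for a real time-series model.
        predicted = v1 + dv * k
        # Assumed rule: a temperature above the limit signals a critical event.
        window.append((t1 + dt * k, predicted, predicted > temp_limit))
    return window


window = warning_window([(0, 70.0), (60, 74.0)], steps=3)
```

Each tuple in `window` pairs a future time with a predicted value and a flag, so a scheduler can check whether any entry in the predetermined period is flagged before placing a job on the node.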

9. A method of improving the performance of a computer cluster having a series of nodes comprising:
- monitoring the occurrence of critical events in said nodes in said computer cluster;
  
  monitoring system performance parameters of said nodes in said computer cluster;
  
  creating a node representation for each node in said computer cluster based upon said monitoring;
  
  creating a cluster representation based on said node representations;
  
  periodically examining said node representations to predict future node performance; and
  
  using said cluster representation to redistribute tasks among said nodes based upon said predicted node performance.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The method of claim 9 wherein creating said cluster representation and said node representation comprises creating a Bayesian Network that represents relationships between the occurrence of said critical events and said system performance parameters.
  - 11. The method of claim 9 comprising saving information concerning said critical events and said system performance parameters in a database.
  - 12. The method of claim 11 comprising filtering said saved information to remove information wherein said removed information is not determined to be useful in predicting a future performance of said nodes.
  - 13. The method of claim 9 comprising applying a time-series mathematical model to said system performance parameters to predict future values of said system performance parameters.
  - 14. The method of claim 13 wherein said time series mathematical model is one of an auto regression, a moving average and an autoregressive moving average model.
  - 15. The method of claim 9 comprising using rule based classifications to associate some system performance parameters with occurrence of said critical events.
  - 16. The method of claim 9 wherein said system performance parameters concern at least one of a node temperature, processor utilization value, network bandwidth and available memory space.
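Claims 9, 13 and 14 together describe forecasting a performance parameter with a time-series model and redistributing tasks according to the forecast. The sketch below assumes a first-order autoregressive model fitted by least squares on the mean-centered series, and a simple policy that moves every task off a node whose forecast exceeds a limit. The functions `ar1_forecast` and `redistribute`, the `limit` parameter, and the balancing policy are all hypothetical; the claims do not specify them.

```python
def ar1_forecast(series):
    """One-step-ahead AR(1) forecast for a non-empty list of values."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    # Least-squares estimate of the AR(1) coefficient.
    num = sum(a * b for a, b in zip(centered[1:], centered[:-1]))
    den = sum(b * b for b in centered[:-1])
    phi = num / den if den else 0.0
    return mean + phi * centered[-1]


def redistribute(tasks, history, limit):
    """Move tasks away from nodes whose forecast exceeds `limit`.

    tasks:   dict mapping node -> list of task names.
    history: dict mapping node -> list of past parameter values
             (assumed to contain every node present in `tasks`).
    """
    predicted = {node: ar1_forecast(history[node]) for node in tasks}
    plan = {node: list(names) for node, names in tasks.items()}
    healthy = [node for node in plan if predicted[node] <= limit]
    if not healthy:
        return plan  # nowhere better to move work; leave it in place
    for node in plan:
        if predicted[node] > limit:
            while plan[node]:
                # Assumed policy: send work to the least-loaded healthy node.
                target = min(healthy, key=lambda n: len(plan[n]))
                plan[target].append(plan[node].pop())
    return plan


history = {
    "n1": [0.5, 0.6, 0.7, 0.8, 0.9],  # rising processor utilization
    "n2": [0.3, 0.2, 0.3, 0.2, 0.3],  # stable, low utilization
}
plan = redistribute({"n1": ["a", "b"], "n2": ["c"]}, history, limit=0.8)
```

Here the forecast for `n1` is above the limit, so its tasks are reassigned to `n2`. A moving-average or ARMA model, the other options listed in claim 14, could replace `ar1_forecast` without changing `redistribute`.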

17. An information processing system comprising:
- a computer cluster having a series of nodes;
  
  a control system for monitoring critical events that occur in said computer cluster and system parameters of said nodes;
  
  a memory for storing information related to said occurrence of said critical events and said system parameters of said nodes; and
  
  a Bayesian Network model for predicting a future occurrence of a critical event based upon an observed relationship between said system parameters and said occurrence of critical events.
- Dependent claims (18, 19, 20, 21):
  - 18. The information processing system of claim 17 comprising a filter for removing redundant information from said stored information.
  - 19. The information processing system of claim 17 wherein said Bayesian Network comprises a time-series modeler for predicting future values of said system parameters.
  - 20. The information processing system of claim 17 wherein said Bayesian Network comprises a rule based classification system for associating said system parameters with said occurrences of said critical events.
  - 21. The information processing system of claim 17 comprising a dynamic probe generator for determining when to collect additional information concerning said system parameters or said critical event occurrence.
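As a rough illustration of the Bayesian Network element of claim 17, the sketch below builds the smallest possible network: one parameter-state node with an edge to one critical-event node. The conditional probability table is estimated from logged observations with add-one (Laplace) smoothing. The network shape, the state labels, and the names `fit_cpt` and `event_probability` are assumptions chosen for brevity; the claims do not describe the network structure.

```python
from collections import Counter


def fit_cpt(observations):
    """Estimate P(event | state) from (state, event_occurred) pairs,
    using add-one smoothing so unseen outcomes keep nonzero probability."""
    totals, events = Counter(), Counter()
    for state, occurred in observations:
        totals[state] += 1
        if occurred:
            events[state] += 1
    return {s: (events[s] + 1) / (totals[s] + 2) for s in totals}


def event_probability(cpt, state_belief):
    """Marginal P(event) given a belief over parameter states,
    e.g. produced by a forecast of the node's temperature."""
    return sum(p * cpt[state] for state, p in state_belief.items())


# Hypothetical log: temperature state paired with whether a critical event followed.
observed = (
    [("hot", True)] * 3
    + [("hot", False)]
    + [("normal", False)] * 4
)
cpt = fit_cpt(observed)
risk = event_probability(cpt, {"hot": 0.5, "normal": 0.5})
```

With this data the table gives 4/6 for the `hot` state and 1/6 for `normal`, so an even belief over the two states yields a critical-event probability of 5/12.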

Current Assignee
GlobalFoundries, Inc.
Original Assignee
International Business Machines Corporation
Inventors
Gupta, Manish; Oliner, Adam J.; Sahoo, Ramendra K.; Moreira, Jose E.

Granted Patent

US 7,451,210 B2
US Class Current

714/39
CPC Class Codes

G06F 11/008   Reliability or availability...

G06F 11/3006   where the computing system ...

G06F 11/3058   Monitoring arrangements for...

G06F 11/3447   Performance evaluation by m...

G06F 11/3476   Data logging G06F11/14, G06...

G06F 2201/86   Event-based monitoring

Y10S 706/908   Electronic or computer, int...

Y10S 706/916   Electronic or computer, int...
