Autonomic monitoring in a grid environment

US 7,340,654 B2
Filed: 06/17/2004
Issued: 03/04/2008
Est. Priority Date: 06/17/2004
Status: Active Grant

5 Assignments

0 Petitions

Abstract

A system for performing autonomic monitoring in a computing grid is described. The system includes a plurality of modules, which when implemented into a computing grid, are operable to analyze objects of the grid and identify exception conditions associated with the objects. The system includes a configuration module for receiving information on specified objects to be monitored and exception conditions for the objects, an information collection module to collect job execution data associated with the objects, and an exception module to evaluate the job execution data associated with the objects and identify existing exception conditions. Related methods of performing autonomic monitoring in a grid system are also described.

101 Citations

30 Claims

1. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
- defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
  
  collecting status information on the plurality of jobs during their execution;
  
  evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
  
  acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is specified relative to a queue level.
- Dependent claims (2):
  - 2. The method of claim 1 wherein the at least one of the one or more exception conditions is specified relative to a cluster-specific queue level.

3. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
- defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
  
  collecting status information on the plurality of jobs during their execution;
  
  evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
  
  acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is specified relative to job execution time.

4. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
- defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
  
  collecting status information on the plurality of jobs during their execution;
  
  evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
  
  acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is selected from the group consisting of minimum expected job run time, maximum expected job run time, and minimum expected CPU consumption.
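
As an illustration only, the three condition types named in claim 4 could be evaluated against collected job status roughly as sketched below. The claim does not say how each threshold is applied, so the readings here (a job finishing too early, a job running too long, a running job using too little CPU) are assumptions, as are all names (`JobStatus`, `JobExceptionConfig`, `evaluate_job`, and the labels "underrun", "overrun", "idle").

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobStatus:
    """Hypothetical snapshot of one job, as collected during execution."""
    job_id: str
    run_time_s: float   # wall-clock seconds the job has run so far
    cpu_time_s: float   # CPU seconds consumed so far
    finished: bool      # True once the job has exited


@dataclass
class JobExceptionConfig:
    """Thresholds for the three condition types in claim 4 (None = disabled)."""
    min_run_time_s: Optional[float] = None  # minimum expected job run time
    max_run_time_s: Optional[float] = None  # maximum expected job run time
    min_cpu_ratio: Optional[float] = None   # minimum expected CPU time / run time


def evaluate_job(job: JobStatus, cfg: JobExceptionConfig) -> List[str]:
    """Return the labels of the exception conditions that hold for this job."""
    found = []
    # Assumed reading: a job that exits sooner than expected is suspicious.
    if (cfg.min_run_time_s is not None and job.finished
            and job.run_time_s < cfg.min_run_time_s):
        found.append("underrun")
    # Assumed reading: a job that runs longer than expected may be hung.
    if cfg.max_run_time_s is not None and job.run_time_s > cfg.max_run_time_s:
        found.append("overrun")
    # Assumed reading: a running job that uses too little CPU may be stalled.
    if (cfg.min_cpu_ratio is not None and not job.finished
            and job.run_time_s > 0
            and job.cpu_time_s / job.run_time_s < cfg.min_cpu_ratio):
        found.append("idle")
    return found
```

A job may trigger more than one condition at once, which is why the sketch returns a list rather than a single label.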

5. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
- defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
  
  collecting status information on the plurality of jobs during their execution;
  
  evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
  
  acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is specified relative to an individual host.
- Dependent claims (6, 7, 8):
  - 6. The method of claim 5 wherein the at least one of the one or more exception conditions is specified based on the number of jobs that exit abnormally within a certain time for a particular host.
  - 7. The method of claim 5 wherein the at least one of the one or more exception conditions is specified based on the minimum duration that an exceptional condition should exist for a particular host.
  - 8. The method of claim 5 wherein multiple exception conditions can be differently defined for different hosts within the networked computing grid.
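
The host-level condition of claim 6 (a count of abnormal job exits within a certain time on a particular host) can be pictured as a sliding-window counter. The sketch below is a minimal illustration, not the patented implementation: `HostExitMonitor` and its parameters are hypothetical names, and treating a nonzero exit code as an abnormal exit is an assumption.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class HostExitMonitor:
    """Tracks abnormal job exits per host over a sliding time window."""

    def __init__(self, max_abnormal_exits: int, window_s: float) -> None:
        self.max_abnormal_exits = max_abnormal_exits  # allowed exits per window
        self.window_s = window_s                      # window length in seconds
        self._exits: Dict[str, Deque[float]] = defaultdict(deque)

    def record_exit(self, host: str, exit_code: int, now: float) -> bool:
        """Record one job exit; return True if the host is in an exception state."""
        times = self._exits[host]
        if exit_code != 0:  # assumption: nonzero exit code means abnormal exit
            times.append(now)
        # Drop abnormal exits that have fallen out of the window.
        while times and now - times[0] > self.window_s:
            times.popleft()
        return len(times) > self.max_abnormal_exits
```

Different thresholds per host, as in claim 8, could be handled by keeping one monitor instance per host or a per-host threshold table. The minimum-duration condition of claim 7 is not covered by this sketch.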

9. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
- specifying a plurality of objects to be monitored, the specified objects comprising job queues and/or hosts within the networked computing grid;
  
  defining one or more exception conditions for the plurality of objects, the defined exception conditions being indicative of problem conditions in at least some of the plurality of objects being monitored;
  
  defining a response to each defined exception condition, whereby each of the specified plurality of objects, together with its one or more defined exception conditions and defined responses to the exception conditions, comprises an autonomic object within the networked computing grid;
  
  collecting values of the variables for the exception conditions associated with one or more of the autonomic objects within the networked computing grid;
  
  evaluating the collected variable values to determine whether at least one exception condition exists that is indicative of a problem condition in the autonomic object; and
  
  acting to correct the problem condition of the autonomic object that was indicated by the at least one exception condition.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17, 18, 19):
  - 10. The method of claim 9 wherein the acting to correct the problem condition comprises closing one or more hosts that are associated with the autonomic object.
  - 11. The method of claim 9 wherein the acting to correct the problem condition comprises notifying a system administrator of the existence of the exception condition.
  - 12. The method of claim 9 wherein the at least one exception condition is specified relative to a queue level.
  - 13. The method of claim 12 wherein the at least one exception condition is specified relative to a cluster-specific queue level.
  - 14. The method of claim 9 wherein the at least one exception condition is specified relative to job execution time.
  - 15. The method of claim 9 wherein the at least one exception condition is selected from the group consisting of minimum expected job run time, maximum expected job run time, and minimum expected CPU consumption.
  - 16. The method of claim 9 wherein the at least one exception condition is specified relative to an individual host.
  - 17. The method of claim 16 wherein the at least one exception condition is specified based on the number of jobs that exit abnormally within a certain time for a particular host.
  - 18. The method of claim 16 wherein the at least one exception condition is specified based on a minimum duration that an exception condition should exist for a particular host.
  - 19. The method of claim 16 wherein multiple exception conditions can be differently defined for different hosts within the networked computing grid.
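
Claim 9 bundles a monitored object (a job queue or host) with its exception conditions and the responses to them, and calls the result an autonomic object. One way to picture that bundle is sketched below. All names (`ExceptionRule`, `AutonomicObject`, `close_host`, `notify_admin`, the `abnormal_exits` variable) are hypothetical, and the responses only return a description of the action rather than touching a real grid.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Variables = Dict[str, float]  # collected variable values, keyed by name


@dataclass
class ExceptionRule:
    """One exception condition paired with its defined response."""
    name: str
    condition: Callable[[Variables], bool]  # True when the exception exists
    response: Callable[[str], str]          # takes the object name, returns the action taken


@dataclass
class AutonomicObject:
    """A monitored queue or host plus its exception rules."""
    name: str
    kind: str  # "queue" or "host"
    rules: List[ExceptionRule] = field(default_factory=list)

    def evaluate_and_act(self, values: Variables) -> List[str]:
        """Evaluate every rule against collected values and run matching responses."""
        return [rule.response(self.name)
                for rule in self.rules if rule.condition(values)]


def close_host(name: str) -> str:
    # Stand-in for the "closing one or more hosts" response of claim 10.
    return f"close host {name}"


def notify_admin(name: str) -> str:
    # Stand-in for the administrator notification of claim 11.
    return f"notify administrator about {name}"
```

For example, a host object with two rules could be built and evaluated like this:

```python
host = AutonomicObject("hostA", "host", [
    ExceptionRule("exit_rate", lambda v: v["abnormal_exits"] > 5, close_host),
    ExceptionRule("any_exit", lambda v: v["abnormal_exits"] > 0, notify_admin),
])
actions = host.evaluate_and_act({"abnormal_exits": 6})
```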

20. A system for performing autonomic monitoring of objects in a networked computing grid having a plurality of resources for executing a plurality of jobs, the autonomic monitoring operable to detect problem conditions in the grid hardware or grid-enabling software operation in the networked computing grid, the system comprising:
- a configuration module for receiving information on one or more objects to be monitored, for defining exception conditions for the one or more objects, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation in the networked computing grid, and for defining responses to be taken to each defined exception condition, whereby each of the specified plurality of objects with its one or more defined exception conditions and defined responses to the exception conditions comprises an autonomic object within the networked computing grid;
  
  an information collection module in communication with the configuration module, the information collection module operable to collect values of the variables for the exception conditions associated with one or more of the autonomic objects defined through the configuration module;
  
  an exception module in communication with the information collection module and the configuration module, the exception module being operable to identify the existence of the one or more exception conditions for the autonomic objects defined through the configuration module by evaluating the variables for the exception conditions collected through the information collection module; and
  
  an action module in communication with the exception module, the action module being operable to invoke actions to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition.
- Dependent claims (21, 22, 23, 24, 25, 26, 27, 28, 29, 30):
  - 21. The system of claim 20 wherein the actions invoked by the action module comprise closing one or more hosts associated with the autonomic object.
  - 22. The system of claim 20 wherein the actions invoked by the action module comprise notifying a system administrator of the existence of an exception condition.
  - 23. The system of claim 20 wherein at least one of the defined exception conditions is specified relative to a queue level.
  - 24. The system of claim 23 wherein at least one of the defined exception conditions is specified relative to a cluster-specific queue level.
  - 25. The system of claim 20 wherein at least one of the defined exception conditions is specified relative to job execution time.
  - 26. The system of claim 20 wherein at least one of the defined exception conditions is selected from the group consisting of minimum expected job run time, maximum expected job run time, and minimum expected CPU consumption.
  - 27. The system of claim 20 wherein at least one of the defined exception conditions is specified relative to an individual host.
  - 28. The system of claim 27 wherein the at least one exception condition is specified based on the number of jobs that exit abnormally within a certain time for the individual host.
  - 29. The system of claim 27 wherein the at least one exception condition is specified based on the minimum duration that an exceptional condition should exist for the individual host.
  - 30. The system of claim 20 wherein multiple exception conditions can be differently defined for different hosts within the networked computing grid.

Current Assignee: International Business Machines Corporation
Original Assignee: Platform Computing Corporation
Inventors: Wei, Xiaohui; Bigagli, David
Primary Examiner(s): DUNCAN, MARC M
Application Number: US10/871,350
Publication Number: US 20050283788A1
Time in Patent Office: 1,356 Days
Field of Search: 714/47
US Class Current: 714/47.2
CPC Class Codes: H04L 43/00 Arrangements for monitoring...
