Autonomic monitoring in a grid environment
Abstract
A system for performing autonomic monitoring in a computing grid is described. The system includes a plurality of modules, which when implemented into a computing grid, are operable to analyze objects of the grid and identify exception conditions associated with the objects. The system includes a configuration module for receiving information on specified objects to be monitored and exception conditions for the objects, an information collection module to collect job execution data associated with the objects, and an exception module to evaluate the job execution data associated with the objects and identify existing exception conditions. Related methods of performing autonomic monitoring in a grid system are also described.
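The three modules named in the abstract can be pictured as a small pipeline. The following is a minimal sketch only: the abstract names the modules but not their interfaces, so every class, method, and field name below (`ExceptionCondition`, `register`, `collect`, `evaluate`, `failed_jobs`, and so on) is a hypothetical choice made for illustration, and the data source is a stand-in for whatever mechanism a real grid would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

JobData = Dict[str, float]  # job execution data for one monitored object


@dataclass
class ExceptionCondition:
    """A named predicate over an object's job execution data (hypothetical)."""
    name: str
    is_triggered: Callable[[JobData], bool]


@dataclass
class ConfigurationModule:
    """Receives the objects to monitor and the exception conditions for each."""
    conditions: Dict[str, List[ExceptionCondition]] = field(default_factory=dict)

    def register(self, obj: str, condition: ExceptionCondition) -> None:
        self.conditions.setdefault(obj, []).append(condition)


class InformationCollectionModule:
    """Collects job execution data; `source` stands in for a real grid query."""

    def __init__(self, source: Callable[[str], JobData]) -> None:
        self._source = source

    def collect(self, obj: str) -> JobData:
        return self._source(obj)


class ExceptionModule:
    """Evaluates collected data against the configured exception conditions."""

    def __init__(self, config: ConfigurationModule,
                 collector: InformationCollectionModule) -> None:
        self._config = config
        self._collector = collector

    def evaluate(self) -> Dict[str, List[str]]:
        existing: Dict[str, List[str]] = {}
        for obj, conditions in self._config.conditions.items():
            data = self._collector.collect(obj)
            hits = [c.name for c in conditions if c.is_triggered(data)]
            if hits:
                existing[obj] = hits
        return existing
```

In this sketch the exception module depends only on the other two modules' narrow interfaces, which mirrors the abstract's separation of configuration, collection, and evaluation.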
30 Claims
1. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
collecting status information on the plurality of jobs during their execution;
evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is specified relative to a queue level.
Dependent claims: 2

3. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
collecting status information on the plurality of jobs during their execution;
evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is specified relative to job execution time.

4. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
collecting status information on the plurality of jobs during their execution;
evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is selected from the group consisting of minimum expected job run time, maximum expected job run time, and minimum expected CPU consumption.

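The three exception conditions listed in claim 4 can be illustrated with a short threshold check. This is a sketch under stated assumptions, not the claimed method: the labels `underrun`, `overrun`, and `idle`, the `JobStatus` and `Thresholds` types, and the choice to measure CPU consumption as the ratio of CPU time to run time are all hypothetical and do not come from the claim text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobStatus:
    job_id: str
    run_time_s: float   # wall-clock run time so far, in seconds
    cpu_time_s: float   # CPU time consumed so far, in seconds
    finished: bool


@dataclass
class Thresholds:
    min_run_time_s: Optional[float] = None   # minimum expected job run time
    max_run_time_s: Optional[float] = None   # maximum expected job run time
    min_cpu_ratio: Optional[float] = None    # assumed: cpu_time / run_time


def job_exceptions(job: JobStatus, t: Thresholds) -> List[str]:
    """Return the (hypothetical) labels of exception conditions the job meets."""
    found: List[str] = []
    # A job that finished sooner than expected may have failed early.
    if (t.min_run_time_s is not None and job.finished
            and job.run_time_s < t.min_run_time_s):
        found.append("underrun")
    # A job that runs longer than expected may be hung.
    if t.max_run_time_s is not None and job.run_time_s > t.max_run_time_s:
        found.append("overrun")
    # A running job that uses very little CPU may be stalled.
    if (t.min_cpu_ratio is not None and not job.finished
            and job.run_time_s > 0
            and job.cpu_time_s / job.run_time_s < t.min_cpu_ratio):
        found.append("idle")
    return found
```

For example, under these assumptions a job that exits after 2 seconds against a 60-second minimum would be reported as `underrun`, while a running job with 10 seconds of CPU time over 1000 seconds of run time and a minimum ratio of 0.1 would be reported as `idle`.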
5. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
defining one or more exception conditions for a plurality of jobs being executed on one or more hosts within the grid, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation;
collecting status information on the plurality of jobs during their execution;
evaluating the collected job status information to determine whether an exception condition exists that is indicative of a problem condition in the grid hardware or grid-enabling software operation; and
acting to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition, wherein at least one of the one or more exception conditions is specified relative to an individual host.
Dependent claims: 6, 7, 8

9. A method for performing autonomic monitoring of jobs in a networked computing grid in order to determine problem conditions in the grid hardware or grid-enabling software operation, the method comprising:
specifying a plurality of objects to be monitored, the specified objects comprising job queues and/or hosts within the networked computing grid;
defining one or more exception conditions for the plurality of objects, the defined exception conditions being indicative of problem conditions in at least some of the plurality of objects being monitored;
defining a response to each defined exception condition, whereby each of the specified plurality of objects, together with its one or more defined exception conditions and defined responses to the exception conditions, comprises an autonomic object within the networked computing grid;
collecting values of the variables for the exception conditions associated with one or more of the autonomic objects within the networked computing grid;
evaluating the collected variable values to determine whether at least one exception condition exists that is indicative of a problem condition in the autonomic object; and
acting to correct the problem condition of the autonomic object that was indicated by the at least one exception condition.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19

20. A system for performing autonomic monitoring of objects in a networked computing grid having a plurality of resources for executing a plurality of jobs, the autonomic monitoring operable to detect problem conditions in the grid hardware or grid-enabling software operation in the networked computing grid, the system comprising:
a configuration module for receiving information on one or more objects to be monitored, for defining exception conditions for the one or more objects, the exception conditions being indicative of problem conditions in the grid hardware or grid-enabling software operation in the networked computing grid, and for defining responses to be taken to each defined exception condition, whereby each of the specified plurality of objects with its one or more defined exception conditions and defined responses to the exception conditions comprises an autonomic object within the networked computing grid;
an information collection module in communication with the configuration module, the information collection module operable to collect values of the variables for the exception conditions associated with one or more of the autonomic objects defined through the configuration module;
an exception module in communication with the information collection module and the configuration module, the exception module being operable to identify the existence of the one or more exception conditions for the autonomic objects defined through the configuration module by evaluating the variables for the exception conditions collected through the information collection module; and
an action module in communication with the exception module, the action module being operable to invoke actions to correct the problem condition of the grid hardware or grid-enabling software operation that was indicated by the exception condition.
Dependent claims: 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

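Claims 9 and 20 bundle a monitored object with its exception conditions and the responses to them, and call the result an autonomic object. One way to picture that pairing is sketched below. The claims do not prescribe any data structure, so the `Rule` and `AutonomicObject` types, the `monitor_once` function, the variable names, and the example responses `close_host` and `notify_admin` are all hypothetical; the responses only return a description of what they would do rather than touching a real grid.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Values = Dict[str, float]  # collected variable values for one object


@dataclass
class Rule:
    """One exception condition paired with its defined response."""
    label: str
    condition: Callable[[Values], bool]
    response: Callable[[str], str]


@dataclass
class AutonomicObject:
    """A monitored host or job queue plus its conditions and responses."""
    name: str
    kind: str  # "host" or "queue"
    rules: List[Rule] = field(default_factory=list)


def close_host(name: str) -> str:
    # Hypothetical response: a real system would stop dispatching jobs here.
    return f"closed host {name}"


def notify_admin(name: str) -> str:
    # Hypothetical response: a real system would send a notification here.
    return f"notified administrator about {name}"


def monitor_once(objects: List[AutonomicObject],
                 collect: Callable[[str], Values]) -> List[str]:
    """Collect values, evaluate each condition, and invoke matching responses."""
    actions: List[str] = []
    for obj in objects:
        values = collect(obj.name)
        for rule in obj.rules:
            if rule.condition(values):
                actions.append(f"{rule.label}: {rule.response(obj.name)}")
    return actions
```

Keeping the response next to its condition means the monitoring loop needs no per-object logic of its own: adding a new monitored host or queue in this sketch only requires constructing another `AutonomicObject`.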
Specification