Computer recovery method and system for recovering automatically from fault, and fault monitoring apparatus and program used in computer system
Abstract
A fault monitoring apparatus is connected to computer systems and monitors a fault in the computer systems. The fault monitoring apparatus is provided with a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in the computer systems, and when a fault occurs in the computer systems, retrieves the rules previously set in the fault recovery information and instructs the computer systems so as to perform a recovery operation corresponding to a rule matching to the fault which occurs in the computer systems.
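The lookup described in the abstract can be pictured as a small rule table that is scanned when a fault is reported. The following is only a minimal sketch under stated assumptions: the names `RecoveryRule`, `instruct_recovery`, and the example operations are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecoveryRule:
    name: str
    matches: Callable[[dict], bool]  # predicate over the reported fault (assumed shape)
    operation: str                   # recovery operation to instruct (hypothetical name)


def instruct_recovery(rules: list, fault: dict) -> Optional[str]:
    """Return the operation of the first stored rule that matches the fault."""
    for rule in rules:
        if rule.matches(fault):
            return rule.operation
    return None  # no stored rule matches this fault


# Illustrative fault recovery information held in the storage section.
RULES = [
    RecoveryRule("cpu-failure", lambda f: f.get("component") == "cpu", "detach_cpu_board"),
    RecoveryRule("disk-failure", lambda f: f.get("component") == "disk", "switch_to_spare_disk"),
]
```

In this sketch a fault is a plain dictionary; a real monitoring apparatus would receive richer fault information from the monitored system.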
55 Citations
21 Claims
1. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, wherein said fault recovery information includes component characteristic information including information showing characteristics of components making up said at least one computer system to be monitored by said fault monitoring apparatus, and wherein said fault monitoring apparatus, when instructing a fault recovery operation to said at least one computer system in which said fault occurs, considers efficiencies concerning components included in said at least one computer system based on said component characteristic information, and instructs said at least one computer system in which said fault occurs so as to select components which are used efficiently.
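Claim 1 does not define what "efficiently" means. As one possible reading, the sketch below picks the idle spare that meets a required capacity with the least over-provisioning. The `Component` fields, the `throughput` measure, and the selection criterion are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    ident: str
    kind: str          # e.g. "cpu" or "memory"
    throughput: float  # assumed characteristic recorded for the component
    in_use: bool


def select_replacement(components, kind: str, required: float) -> Optional[Component]:
    """Choose the idle component of the given kind that suffices with the least excess."""
    candidates = [
        c for c in components
        if c.kind == kind and not c.in_use and c.throughput >= required
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.throughput)


INVENTORY = [
    Component("cpu-a", "cpu", 2.0, False),
    Component("cpu-b", "cpu", 3.0, False),
    Component("cpu-c", "cpu", 8.0, False),
    Component("cpu-d", "cpu", 2.6, True),  # already allocated, so never selected
]
```

Other efficiency criteria (power draw, cost, locality) would fit the same shape by changing the key function.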
2. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, wherein said fault recovery information includes classification of each of faults which occurred previously and configuration information of said at least one computer system at a time at which a fault occurs as fault example storage information, and wherein said fault monitoring apparatus further comprises an avoidance instructing section which, when instructing a fault recovery operation to said at least one computer system in which said fault occurs, refers to fault information of past occurrences, and instructs said at least one computer system in which experienced said fault so as to avoid a computer system configuration in which a fault is apt to occur.
3. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, wherein said fault recovery information includes a system requisite rule which is a rule of a computer system to be met by said at least one computer system which is an object of monitoring and information for defining an operation for satisfying said rule as computer system configuration rule information, and wherein said fault monitoring apparatus further comprises a change instructing section which, when fault recovery of said at least one computer system in which a fault occurs, instructs said at least one computer system to change a computer system configuration after said fault recovery operation in accordance with a request specification of a whole system of said at least one computer system based on said system requisite rule.
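The system requisite rule of claim 3 pairs a condition the whole system must satisfy with an operation that restores it. A minimal sketch, assuming each rule is a (check, operation) pair over a configuration dictionary; the field names and operations are hypothetical.

```python
def unmet_requisites(config: dict, requisites: list) -> list:
    """Return the corrective operations for every requisite rule the configuration fails."""
    return [operation for check, operation in requisites if not check(config)]


# Illustrative computer system configuration rule information.
REQUISITES = [
    (lambda c: c["cpu_boards"] >= 2, "add_cpu_board"),
    (lambda c: c["memory_gb"] >= 16, "add_memory_board"),
]
```

After a recovery operation has removed a failed CPU board, for example, calling `unmet_requisites` on the resulting configuration yields the change instruction needed to bring the system back to its required specification.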
4. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, and wherein said fault monitoring apparatus further comprises:
a first processing section, when a fault occurs in said at least one computer system, for checking whether said fault which occurs is a component fault or not, and for, when said fault is said component fault, storing system configuration information at a time when said fault occurs in a storage area for memorizing fault examples as fault example storage information; a second processing section for referring to said fault example storage information to refer to past fault examples, for checking whether or not there is a same fault example that has occurred this time, for, when there is said same fault example, comparing system configuration information in the past same fault example with a computer system configuration in which said fault occurs, for extracting a feature of said computer system configuration, and for memorizing said characteristic related to said fault information as fault example storage information; a counting section for counting a frequency of fault occurrences for every feature of said computer system configuration when said fault occurs based on an extracted feature of computer system configuration; and a third processing section for checking a frequency of fault occurrences for every feature of said computer system configuration, and for registering a rule for avoiding an extracted feature of a computer system configuration in said component characteristic information, when said frequency of fault occurrences is more than a predetermined number.
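The processing sections of claim 4 amount to a learning loop: store each component fault with its configuration, compare it with earlier examples of the same fault, count the configuration features they share, and register an avoidance rule once a count exceeds a threshold. The sketch below is one way to read that loop. It assumes a configuration is a flat dictionary with hashable values and that a "feature" is a key/value pair shared with a past example; the class and attribute names are hypothetical.

```python
from collections import Counter


class FaultExampleStore:
    """Hypothetical store for past fault examples and derived avoidance rules."""

    def __init__(self, threshold: int):
        self.threshold = threshold       # the "predetermined number" of the claim
        self.examples = []               # (fault_type, frozenset of config items)
        self.feature_counts = Counter()  # occurrences per configuration feature
        self.avoidance_rules = set()     # features to avoid in later recoveries

    def record(self, fault_type: str, config: dict, is_component_fault: bool = True):
        # First section: only component faults are stored as fault examples.
        if not is_component_fault:
            return
        items = frozenset(config.items())
        past = [p for t, p in self.examples if t == fault_type]
        self.examples.append((fault_type, items))
        # Second section: features shared with any past example of the same fault.
        shared = {feature for p in past for feature in items & p}
        # Counting and third sections: count, then register a rule above the threshold.
        for feature in shared:
            self.feature_counts[feature] += 1
            if self.feature_counts[feature] > self.threshold:
                self.avoidance_rules.add(feature)
```

With a threshold of 1, a feature becomes an avoidance rule the second time it is found in common with an earlier example of the same fault.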
5. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, wherein in said fault monitoring apparatus, said fault recovery information includes a fault type judging rule in which, when a fault occurs in said at least one computer system, a rule for judging at which position said fault occurs and what type of a fault as a recovery rule, and an operation specifying section in which an operation is specified when said fault occurs, wherein when a fault occurs in said at least one computer system, said fault information indicating a fault cause is notified from said fault monitoring agent to said fault monitoring apparatus, wherein said fault monitoring apparatus which receives said fault information refers to said fault recovery rule, retrieves a fault type judging rule corresponding to a condition of said fault which occurs, wherein said fault monitoring apparatus instructs said fault monitoring agent of an operation of contents described in said operation specifying section corresponding to said fault type judging rule matching said fault, and wherein a fault type judging rule used in a case of an unknown fault occurrence is previously prepared, and an operation specifying section corresponding to said fault type judging rule is registered in a lowest order of priority.
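Claim 5 adds a fallback: a rule for unknown faults is prepared in advance and registered at the lowest priority, so it fires only when nothing more specific matches. A minimal sketch, assuming rules are (priority, condition, operation) tuples where a smaller number means higher priority; the rule contents are hypothetical.

```python
def judge_fault(rules: list, fault: dict) -> str:
    """Return the operation of the highest-priority rule whose condition matches."""
    for _priority, condition, operation in sorted(rules, key=lambda r: r[0]):
        if condition(fault):
            return operation
    raise LookupError("no fault type judging rule matched")


JUDGING_RULES = [
    # Catch-all for unknown faults, deliberately given the lowest priority.
    (999, lambda f: True, "notify_operator"),
    (10, lambda f: f.get("position") == "memory" and f.get("type") == "ecc", "replace_memory_board"),
    (20, lambda f: f.get("position") == "cpu", "detach_cpu_board"),
]
```

Because the catch-all always matches, `judge_fault` never raises with this table; the `LookupError` only guards tables that omit the fallback.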
6. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, wherein in said fault monitoring apparatus, said fault recovery information includes a fault type judging rule in which, when a fault occurs in said at least one computer system, a rule for judging at which position said fault occurs and what type of a fault as a recovery rule, and an operation specifying section in which an operation is specified when said fault occurs, wherein when a fault occurs in said at least one computer system, said fault information indicating a fault cause is notified from said fault monitoring agent to said fault monitoring apparatus, wherein said fault monitoring apparatus which receives said fault information refers to said fault recovery rule, retrieves a fault type judging rule corresponding to a condition of said fault which occurs, wherein said fault monitoring apparatus instructs said fault monitoring agent of an operation of contents described in said operation specifying section corresponding to said fault type judging rule matching said fault, and wherein in said fault monitoring apparatus, a condition where a load of an operating system exceeds a predetermined load state is previously registered as said fault type judging rule, and an operation is defined in which a CPU (Central Processing Unit) board is added to a corresponding computer system as a fault recovery operation corresponding to said fault type judging rule.
7. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, and wherein said fault recovery information includes component characteristic information including information showing characteristics of components making up said at least one computer system to be monitored by said fault monitoring apparatus, wherein said fault monitoring apparatus, when instructing a fault recovery operation to said at least one computer system in which said fault occurs, considers efficiencies concerning components included in said at least one computer system based on said component characteristic information, and instructs said at least one computer system in which said fault occurs so as to select components which are used efficiently, and wherein said at least one computer system has a plurality of partitions respectively made up of a sub-computer system, and wherein said partitions are defined in said component characteristic information as alternative components, and when a fault occurs in an arbitrary component making up one of said plurality of partitions, said component is automatically changed to an alternative component.
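The automatic change to an alternative component in claim 7 can be sketched as a lookup in a table of alternatives followed by an in-place swap inside the affected partition. The data layout below (partition name mapped to a list of component identifiers, and a per-component list of alternatives) is an assumption for illustration, as is the rule that an alternative already used by any partition is skipped.

```python
from typing import Optional


def replace_faulty(partitions: dict, alternatives: dict, partition: str, faulty: str) -> Optional[str]:
    """Swap a faulty component for the first alternative not used by any partition."""
    spare = next(
        (
            alt for alt in alternatives.get(faulty, [])
            if not any(alt in comps for comps in partitions.values())
        ),
        None,
    )
    if spare is None:
        return None  # no free alternative is defined for this component
    comps = partitions[partition]
    comps[comps.index(faulty)] = spare
    return spare
```

For example, with partitions `{"p0": ["cpu0", "mem0"], "p1": ["cpu1", "mem1"]}` and alternatives `{"cpu0": ["cpu1", "cpu2"]}`, a fault in `cpu0` skips `cpu1` (in use by `p1`) and installs `cpu2` in `p0`.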
8. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, wherein in said fault monitoring apparatus, said fault recovery information includes a fault type judging rule in which, when a fault occurs in said at least one computer system, a rule for judging at which position said fault occurs and what type of a fault as a recovery rule, and an operation specifying section in which an operation is specified when said fault occurs, wherein when a fault occurs in said at least one computer system, said fault information indicating a fault cause is notified from said fault monitoring agent to said fault monitoring apparatus, wherein said fault monitoring apparatus which receives said fault information refers to said fault recovery rule, retrieves a fault type judging rule corresponding to a condition of said fault which occurs, wherein said fault monitoring apparatus instructs said fault monitoring agent of an operation of contents described in said operation specifying section corresponding to said fault type judging rule matching said fault, wherein said at least one computer system has a plurality of partitions respectively made up of a sub-computer system, wherein said fault monitoring apparatus is provided with said fault type judging rule and said operation specifying section which are different for each of said at least one computer systems, and wherein when said operating system differs for each of said partitions, said fault monitoring apparatus integrates each of partitions and executes an automatic fault recovery operation.
9. A computer recovery system for recovering automatically from a fault comprising at least one computer system and a fault monitoring apparatus for monitoring a fault in said at least one computer system,
wherein said fault monitoring apparatus comprises a storage section for storing and holding fault recovery information including rules for defining recovery operations when faults occur in said at least one computer system, and a recovery instructing section, when a fault occurs in said at least one computer system, for retrieving said rules previously set in said fault recovery information and for instructing said at least one computer system in such a manner that a recovery operation corresponding a rule matching to said fault which occurs in said at least one computer system is carried out, and wherein each of said plurality of computer systems makes up a cluster system whereby a node is configured, and wherein said fault monitoring apparatus includes at least one piece of node information, information showing with which nodes each node is capable of forming a cluster, and communication speed information of each network in said fault recovery information.
10. A computer recovery method for recovering automatically from a fault comprising:
a first step, when a fault occurs in at least one computer system, of notifying a fault monitoring apparatus of fault information by a fault monitoring agent in said at least one computer system in which said fault occurs; a second step, by said fault monitoring apparatus, of storing said fault information in a fault example storage area, and of extracting a feature of a computer system configuration for said fault information; a third step, by said fault monitoring apparatus, of referring to a fault recovery rule, of retrieving a fault type judging rule corresponding to a condition, and of instructing a fault monitoring agent to execute an operation described in a corresponding operation specifying section; and a fourth step, by said fault monitoring apparatus, of referring to a system configuration rule, of checking whether all of system requisite rules are met or not, and of instructing said fault monitoring agent to execute an operation described in said operation specifying section corresponding to said system requisite rule when there exists a system requisite rule which said at least one computer system does not meet.
Dependent claims: 11, 12, 13, 14, 15, 16.
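The four steps of claim 10 can be sketched as a single handler on the monitoring apparatus side. This is an illustrative sketch only: the function and parameter names are hypothetical, the agent is modeled as a callable that applies an operation and returns the new configuration, and the feature extraction of the second step is omitted for brevity.

```python
def handle_fault(fault, config, examples, recovery_rules, requisite_rules, agent):
    """Process one fault notification (step 1 is the call itself) and return issued operations."""
    # Step 2: store the fault information together with the configuration at fault time.
    examples.append((fault, dict(config)))
    issued = []
    # Step 3: find the matching fault type judging rule and instruct its operation.
    for condition, operation in recovery_rules:
        if condition(fault):
            config = agent(operation, config)
            issued.append(operation)
            break
    # Step 4: check every system requisite rule against the recovered configuration.
    for is_met, operation in requisite_rules:
        if not is_met(config):
            config = agent(operation, config)
            issued.append(operation)
    return issued


def fake_agent(operation, config):
    """Stand-in for a fault monitoring agent; adjusts a CPU board count."""
    config = dict(config)
    if operation == "detach_cpu_board":
        config["cpu_boards"] -= 1
    elif operation == "add_cpu_board":
        config["cpu_boards"] += 1
    return config


RECOVERY = [(lambda f: f["component"] == "cpu", "detach_cpu_board")]
REQUIRED = [(lambda c: c["cpu_boards"] >= 2, "add_cpu_board")]
```

With two CPU boards and a CPU fault, the handler first detaches the failed board and then, because the two-board requisite is no longer met, instructs that a board be added.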
17. A fault monitoring apparatus connected to at least one computer system and monitoring a fault in said at least one computer system, comprising:
a storage section for memorizing and holding fault recovery information including a rule which defines a recovery operation when a fault occurs in said at least one computer system, wherein said fault monitoring apparatus, when a fault occurs in said at least one computer system, retrieves a rule previously set in said fault recovery information and instructs said at least one computer system to execute a recovery operation corresponding to said fault which occurs in said at least one computer system, wherein said fault recovery information includes component characteristic information including information showing characteristics of components included in said at least one computer system monitored by said fault monitoring apparatus, and wherein said fault monitoring apparatus, when instructing a fault recovery operation to said at least one computer system in which said fault occurs, considers efficiencies concerning components included in said at least one computer system based on said component characteristic information, and instructs said at least one computer system in which said fault occurs so as to select components which are used efficiently.
18. A fault monitoring apparatus connected to at least one computer system and monitoring a fault in said at least one computer system, comprising:
a storage section for memorizing and holding fault recovery information including a rule which defines a recovery operation when a fault occurs in said at least one computer system, wherein said fault monitoring apparatus, when a fault occurs in said at least one computer system, retrieves a rule previously set in said fault recovery information and instructs said at least one computer system to execute a recovery operation corresponding to said fault which occurs in said at least one computer system, wherein types of past faults which occurred and configuration information of said at least one computer system at a time at which a fault has occurred are registered in said storage section as fault example storage information, and wherein said fault monitoring apparatus, when instructing a fault recovery operation to said at least one computer system in which said fault has occurred, refers to fault information of past occurrences in said fault example storage information, and instructs said at least one computer system in which said fault occurs so as to avoid a computer system in which a fault is apt to occur.
19. A fault monitoring apparatus connected to at least one computer system and monitoring a fault in said at least one computer system, comprising:
a storage section for memorizing and holding fault recovery information including a rule which defines a recovery operation when a fault occurs in said at least one computer system, wherein said fault monitoring apparatus, when a fault occurs in said at least one computer system, retrieves a rule previously set in said fault recovery information and instructs said at least one computer system to execute a recovery operation corresponding to said fault which occurs in said at least one computer system, wherein a system requisite rule which is a rule of a computer system to be met by said at least one computer system as an object to be monitored and information for defining an operation for meeting said rule are registered in said storage section as system configuration rule information, and wherein there is provided a controller that, when fault recovery of said at least one computer system in which a fault occurs, instructs said at least one computer system to change a computer system configuration after said fault recovery operation in accordance with a request specification of a whole system of said at least one computer system based on said system requisite rule.
20. A fault monitoring apparatus connected to at least one computer system and monitoring a fault in said at least one computer system, comprising:
a storage section for memorizing and holding fault recovery information including a rule which defines a recovery operation when a fault occurs in said at least one computer system, wherein said fault monitoring apparatus, when a fault occurs in said at least one computer system, retrieves a rule previously set in said fault recovery information and instructs said at least one computer system to execute a recovery operation corresponding to said fault which occurs in said at least one computer system; a first processing section, when a fault occurs in said at least one computer system, for checking whether said fault which occurs is a component fault or not, and for, when said fault is said component fault, storing system configuration information at a time when said fault occurs in fault example storage information; a second processing section for referring to said fault example storage information to refer to past fault examples, for checking whether or not there is a same fault example that has occurred this time, for, when there is the same fault example, comparing system configuration information in said past same fault example with a computer system configuration in which said fault has occurred this time, for extracting a feature of said computer system configuration, and for memorizing said characteristic related to said fault information as fault example storage information in said storage section; a counting section for counting a frequency of fault occurrences for every feature of said at least one computer system when said fault occurs based on an extracted feature of computer system configuration; and a third processing section for checking a frequency of fault occurrences for every feature of said computer system configuration, and for registering a rule for avoiding an extracted feature of a computer system configuration in said component characteristic information, when said frequency of fault occurrences is more than a predetermined number.
21. A medium storing a program being used in a fault monitoring apparatus connected to a first computer,
wherein said fault monitoring apparatus is programmed with a fault recovery information including a rule defining a recovery operation when a fault occurs in said first computer, wherein when a fault occurs in said first computer, said fault monitoring apparatus' programming causes a second computer to execute a process that refers to said rule, instructs said first computer to perform a fault recovery operation corresponding to said fault and to execute a recovery operation corresponding to said rule, and further causing said second computer to execute:
a process, when a fault occurs in said first computer, of storing fault information notified from a fault monitoring agent in said first computer in which said fault occurs in a fault example storage area, and of extracting a feature of said first computer configuration for said fault information; a process, when said fault occurs in said first computer, of referring to a fault recovery rule including a fault type judging rule for judging which position said fault occurs and what type of said fault and an operation specifying section in which an operation to be executed when a fault occurs, of retrieving a fault type judging rule corresponding to a condition, and of instructing a fault monitoring agent to execute an operation described in a corresponding operation specifying section; and a process of referring to a system configuration rule including a system requisite rule which is a rule of said first computer's configuration to be met by said second computer to be monitored and an operation specifying section for defining an operation to satisfy said rule, of checking whether all of system requisite rules are met or not, and of instructing said fault monitoring agent to execute an operation described in said operation specifying section corresponding to said system requisite rule when there is a non-met system requisite rule.
Specification