System and method for using failure casting to manage failures in computer systems
Abstract
A system and method for using failure casting to manage failures in a computer system. In accordance with an embodiment, the system uses a failure casting hierarchy to cast failures of one type into failures of another type. In doing this, the system allows incidents, problems, or failures to be cast into a (typically smaller) set of failures, which the system knows how to handle. In accordance with a particular embodiment, failures can be cast into a category that is considered reboot-curable. If a failure is reboot-curable then rebooting the system will likely cure the problem. Examples include hardware failures, and reboot-specific methods that can be applied to disk failures and to failures within clusters of databases. The system can even be used to handle failures that were hitherto unforeseen—failures can be cast into known failures based on the failure symptoms, rather than any underlying cause.
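The casting idea described in the abstract can be illustrated with a small sketch. The Python below is a hypothetical illustration rather than the patented implementation: the failure names, the `PARENT` mapping, the `HANDLED` set, and the `cast` function are all assumptions invented for the example. It walks a failure up a hierarchy until it reaches one of the few types the system manager knows how to handle.

```python
# Hypothetical failure casting hierarchy: each failure type maps to its parent.
# All names here are illustrative assumptions, not taken from the patent.
PARENT = {
    "disk_timeout": "io_stall",
    "io_stall": "reboot_curable",
    "memory_leak": "reboot_curable",
    "disk_media_error": "non_reboot_curable",
}

# The (typically smaller) set of failure types the system manager can handle.
HANDLED = {"reboot_curable", "non_reboot_curable"}


def cast(failure, default="non_reboot_curable"):
    """Cast a failure up the hierarchy until it reaches a handled type.

    Failures missing from the hierarchy fall back to `default`; that
    fallback policy is an assumption of this sketch.
    """
    seen = set()
    current = failure
    while current not in HANDLED:
        if current in seen or current not in PARENT:
            return default  # unknown type or a cycle in the hierarchy
        seen.add(current)
        current = PARENT[current]
    return current


print(cast("disk_timeout"))      # cast through io_stall to a handled type
print(cast("disk_media_error"))  # cast directly to a handled type
```

Keeping the hierarchy as data means a newly observed symptom can be handled by adding one mapping entry, without changing the system manager.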
7 Claims
1. A system for managing failures in a computer system using failure casting, the computer system including an array of disks, comprising:
one or more processors operable to provide a system manager that performs actions on the computer system to address failures that occur within the computer system;
a failure casting logic that detects failures as they occur in the computer system;
a failure casting hierarchy that defines a plurality of failures that can occur within the computer system, and which is used by the failure casting logic upon detecting the occurrence of a failure to cast the failure from a first failure type to a second failure type, wherein the second failure type is then communicated to the system manager to allow the system manager to treat the failure as if it were the second failure type;
wherein the failure casting hierarchy defines at least two sets of failures, including a set of reboot-curable failures and a set of non-reboot-curable failures, wherein the reboot-curable failures are addressed by the system manager by rebooting the computer system or component thereof that includes the failure;
wherein the failure casting logic and the failure casting hierarchy are part of a script that detects the occurrence of failures in the computer system and then casts the failure into one of either a reboot-curable failure or non-reboot-curable failure, wherein the script is executed by the computer system when powered on;
wherein the script is used to address failures within the array of disks at boot time by verifying the health of each disk prior to adding a disk to the array; and
wherein the failure casting hierarchy in the script includes the set of non-reboot curable failures that are checked at boot time, and if a disk, when added to the array, exhibits a failure upon bootup within the set of non-reboot-curable failures, then the disk is not added to the array.
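The boot-time disk check recited in claim 1 might look roughly like the following. This is a minimal sketch under stated assumptions: the `Disk` class, the symptom names, and the `cast_symptom` and `build_array` functions are hypothetical, and treating every unlisted symptom as reboot-curable is a choice made for the example, not something the claim specifies. Only the exclusion rule is modeled; restarting a disk with a reboot-curable failure is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

REBOOT_CURABLE = "reboot_curable"
NON_REBOOT_CURABLE = "non_reboot_curable"

# Hypothetical symptoms that are cast to the non-reboot-curable set at boot.
NON_REBOOT_CURABLE_SYMPTOMS = {"media_error", "smart_failure"}


@dataclass
class Disk:
    name: str
    symptom: Optional[str] = None  # None means the health check found nothing


def cast_symptom(symptom: str) -> str:
    """Cast a raw symptom into one of the two failure sets (assumed mapping)."""
    if symptom in NON_REBOOT_CURABLE_SYMPTOMS:
        return NON_REBOOT_CURABLE
    return REBOOT_CURABLE  # assumption: unlisted symptoms are reboot-curable


def build_array(disks: Iterable[Disk]) -> Tuple[List[str], List[str]]:
    """Verify each disk before adding it; return (array, rejected) disk names."""
    array: List[str] = []
    rejected: List[str] = []
    for disk in disks:
        failed_hard = (
            disk.symptom is not None
            and cast_symptom(disk.symptom) == NON_REBOOT_CURABLE
        )
        if failed_hard:
            rejected.append(disk.name)  # the disk is not added to the array
        else:
            array.append(disk.name)
    return array, rejected


array, rejected = build_array(
    [Disk("sda"), Disk("sdb", "media_error"), Disk("sdc", "timeout")]
)
print(array, rejected)
```

In this example `sdb` is kept out of the array because its symptom falls in the non-reboot-curable set, while `sda` and `sdc` are admitted.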
2. A method for managing failures in a computer system using failure casting, the computer system including an array of disks, comprising:
detecting the occurrence of failures in the computer system;
referring to a failure casting hierarchy that defines a plurality of failures that can occur within the computer system;
using the failure casting hierarchy to cast the failure from a first failure type to a second failure type;
communicating the second failure type to a system manager;
performing an action by the system manager on the computer system to address the failure including treating the failure as if it were the second failure type,
wherein the failure casting hierarchy defines at least two sets of failures, including a set of reboot-curable failures, and a set of non-reboot-curable failures, and addressing the reboot curable failures by the system manager by restarting the computer system or component thereof that includes the failure, and
wherein the failure casting logic and the failure casting hierarchy are part of a script, and detecting the occurrence of failures in the computer system using the script and then casting the failure into one of either a reboot-curable failure or non-reboot-curable failure, and executing the script by the computer system when first powered on;
using the script to address failures within the array of disks at boot time by verifying the health of each disk prior to adding a disk to the array; and
checking at boot time the failure casting hierarchy in the script includes the set of non-reboot curable failures, and not adding a disk to the array if the disk, when added to the array, exhibits a failure upon bootup within the set of non-reboot-curable failures.
4. A non-transient system readable medium, including executable instructions stored thereon, which when executed by a system having an array of disks, causes the system to perform the following:
executing a script that detects the occurrence of a failure in the computer system;
using a failure casting hierarchy within the script to cast the failure from a first failure type to a second failure type, different than the first failure type, and communicating the second failure type to a system manager to allow the system manager to treat the failure as if it was the second failure type;
using the script to address failures within the array of disks at boot time by verifying the health of each disk prior to adding a disk to the array; and
not including a disk in the array if the disk, when added to the array, exhibits a failure upon bootup within a set of non-reboot-curable failures; and
wherein the script defines at least two sets of failures, including a set of reboot-curable failures and a set of non-reboot-curable failures, and including addressing the reboot-curable failures by restarting the system or component thereof that includes the failure.
6. A system for managing failures in a computer system using failure casting, the computer system including an array of disks, comprising:
(i) one or more processors operable to provide a system manager that performs actions on the computer system to address failures that occur within the computer system;
(ii) a failure casting logic that detects failures as they occur in the computer system;
(iii) a failure casting hierarchy that defines a plurality of failures that can occur within the computer system, and which is used by the failure casting logic upon detecting the occurrence of a failure to cast the failure from a first failure type to a second failure type, wherein the second failure type is then communicated to the system manager to allow the system manager to treat the failure as if it were the second failure type;
(iv) wherein the failure casting hierarchy defines at least two sets of failures, including a set of reboot-curable failures and a set of non-reboot-curable failures, wherein the reboot-curable failures are addressed by the system manager by rebooting the computer system or component thereof that includes the failure;
(v) wherein the failure casting logic and the failure casting hierarchy are part of a script that detects the occurrence of failures in the computer system and then casts the failure into one of either a reboot-curable failure or non-reboot-curable failure, wherein the script is executed by the computer system when powered on; and
(vi) wherein the script is used to address failures within the array of disks at boot time by verifying the health of each disk prior to adding a disk to the array.
7. A method for managing failures in a computer system using failure casting, the computer system including an array of disks, comprising the steps of:
(i) detecting the occurrence of failures in the computer system;
(ii) referring to a failure casting hierarchy that defines a plurality of failures that can occur within the computer system;
(iii) using the failure casting hierarchy to cast the failure from a first failure type to a second failure type;
(iv) communicating the second failure type to a system manager;
(v) performing an action by the system manager on the computer system to address the failure including treating the failure as if it were the second failure type;
(vi) wherein the failure casting hierarchy defines at least two sets of failures, including a set of reboot-curable failures, and a set of non-reboot-curable failures, and addressing the reboot curable failures by the system manager by restarting the computer system or component thereof that includes the failure;
(vii) wherein the failure casting logic and the failure casting hierarchy are part of a script, and detecting the occurrence of failures in the computer system using the script and then casting the failure into one of either a reboot-curable failure or non-reboot-curable failure, and executing the script by the computer system when first powered on; and
(viii) using the script to address failures within the array of disks at boot time by verifying the health of each disk prior to adding a disk to the array.
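Steps (ii) through (v) of the method, together with the restart rule in step (vi), can be sketched as a small dispatch routine. Everything named here is hypothetical: the `SystemManager` class, the `manage_failure` function, the example hierarchy, and in particular the "escalate" action for non-reboot-curable failures, which the claim does not specify and which is assumed purely for illustration.

```python
REBOOT_CURABLE = "reboot_curable"
NON_REBOOT_CURABLE = "non_reboot_curable"


class SystemManager:
    """Hypothetical system manager that acts only on the cast failure type."""

    def __init__(self):
        self.actions = []  # record of (action, component) pairs, for inspection

    def handle(self, component, cast_type):
        if cast_type == REBOOT_CURABLE:
            # Reboot-curable failures are addressed by restarting the component.
            self.actions.append(("restart", component))
        else:
            # Assumption of this sketch: anything else is escalated.
            self.actions.append(("escalate", component))


def manage_failure(component, first_type, hierarchy, manager):
    """Look up the hierarchy, cast the failure, communicate it, and act on it."""
    # Assumption: failure types missing from the hierarchy are cast to
    # non-reboot-curable so that they are never silently restarted.
    second_type = hierarchy.get(first_type, NON_REBOOT_CURABLE)
    manager.handle(component, second_type)
    return second_type


hierarchy = {"hung_process": REBOOT_CURABLE, "bad_sector": NON_REBOOT_CURABLE}
manager = SystemManager()
manage_failure("db-node-1", "hung_process", hierarchy, manager)
manage_failure("disk-3", "bad_sector", hierarchy, manager)
print(manager.actions)
```

The system manager never sees the first failure type; it receives only the second (cast) type, which is what lets a small set of handlers cover a large set of raw failures.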
Specification