Expert system for identifying likely failure points in a digital data processing system
First Claim
1. A method of detecting one of a plurality of likely failures of components in a digital data processing system, comprising the steps of storing a plurality of error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, analyzing, through use of a digital expert system, said plurality of differing indicia contained within said error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of identifying components of differing types that corresponds with one of a plurality of failure theories, and, based on said failure theory, identifying a said likely failure of a said component.
Abstract
An expert system for determining the likelihood of failure of a unit in a computer system. The operating system of the computer system maintains a log of the errors occurring for each unit in the computer system. If a predetermined number of errors have been entered in the log for a specific unit, the expert system retrieves the error entries relating to that unit and processes them to determine whether a failure is likely to occur. The processing performed by the expert system is arranged so that tests relating to components of increasing particularity, and decreasing generality, are performed after the tests relating to more general components.
51 Claims
-
1. A method of detecting one of a plurality of likely failures of components in a digital data processing system, comprising the steps of
storing a plurality of error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, analyzing, through use of a digital expert system, said plurality of differing indicia contained within said error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of identifying components of differing types that corresponds with one of a plurality of failure theories, and, based on said failure theory, identifying a said likely failure of a said component.
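The central test in this claim, deciding whether indicia for one component type are spread roughly at random or concentrated on a few components, can be illustrated with a minimal Python sketch. The claim does not specify a statistic; the share-based cutoff below, the function name `classify_distribution`, the `concentration_share` parameter, and the dictionary layout of an error entry are all assumptions made for illustration only.

```python
from collections import Counter

def classify_distribution(entries, component_type, concentration_share=0.5):
    """Classify the indicia for one component type (hypothetical helper).

    Returns ("concentrated", [components]) when one or more components
    account for at least `concentration_share` of the indicia, otherwise
    ("random", []). The cutoff is an assumed stand-in for whatever
    statistic a real expert system would use.
    """
    counts = Counter(e[component_type] for e in entries if component_type in e)
    total = sum(counts.values())
    if total == 0:
        return ("random", [])
    hot = [c for c, n in counts.items() if n / total >= concentration_share]
    return ("concentrated", sorted(hot)) if hot else ("random", [])

# Each error entry carries several differing indicia for one error event.
entries = [
    {"head": 3, "cylinder": 10, "sector": 5},
    {"head": 3, "cylinder": 250, "sector": 17},
    {"head": 3, "cylinder": 77, "sector": 2},
    {"head": 1, "cylinder": 401, "sector": 9},
]
head_result = classify_distribution(entries, "head")
cylinder_result = classify_distribution(entries, "cylinder")
```

In this sample the errors concentrate on one head while the cylinders look scattered, the kind of cross-type pattern that could be matched against a failure theory such as a bad head.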
-
18. An expert system for detecting one of a plurality of likely failures of components in a digital data processing system, comprising
a collector module means for collecting a plurality of stored error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, and an analyzer module means for analyzing said plurality of differing indicia contained within said error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said analyzer module means identifying a said likely failure of a said component based on said failure theory, said collector module means and said analyzer module means being adapted for implementation by a digital data processing system.
-
35. An expert system for detecting one of a plurality of likely failures of components in a digital data processing system, said digital data processing system comprising a plurality of units each comprising a plurality of said components, said expert system comprising
an operating system means for storing in an error log a plurality of error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, a monitor module means for monitoring said error entries to determine whether the number of error entries associated with a particular unit exceeds a threshold, generating a fault entry for each unit having error entries that exceed said threshold, each said fault entry identifying a unit and identifying said error entries associated with said unit, and inserting said fault entries into a fault queue, a collector module means for retrieving a fault entry from said fault queue, retrieving, from said error log, stored error entries associated with a unit identified in said fault entry, and inserting said error entries into an error log subset, an analyzer module means for analyzing said plurality of differing indicia contained within said error entries in said error log subset, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said analyzer module means storing said failure theory in a theory file, a notification module means for querying said theory file, and, based on a failure theory in said theory file, notifying a user of likely failure of said component, and a recovery module means for querying said theory file, and, based on a failure theory in said theory file, initiating recovery operations, to avoid data loss, said operating system means, said monitor module means, said collector module means, said analyzer module means, said notification module means, and said recovery module means being adapted for implementation by a digital data processing system.
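The monitor and collector stages recited in this claim can be sketched in Python as follows. This is only an illustration under stated assumptions: the in-memory list standing in for the error log, the `deque` standing in for the fault queue, and the names `monitor`, `collect`, `unit`, and `entry_ids` are hypothetical and do not come from the patent.

```python
from collections import Counter, deque

def monitor(error_log, threshold):
    """Build a fault queue holding one fault entry per unit whose
    number of error entries exceeds the threshold."""
    counts = Counter(e["unit"] for e in error_log)
    fault_queue = deque()
    for unit, n in sorted(counts.items()):
        if n > threshold:
            # The fault entry identifies the unit and its error entries.
            ids = [i for i, e in enumerate(error_log) if e["unit"] == unit]
            fault_queue.append({"unit": unit, "entry_ids": ids})
    return fault_queue

def collect(fault_queue, error_log):
    """Retrieve the next fault entry and return the error log subset
    for the unit it identifies."""
    fault = fault_queue.popleft()
    return fault["unit"], [error_log[i] for i in fault["entry_ids"]]

error_log = [
    {"unit": "disk0", "head": 3},
    {"unit": "disk0", "head": 3},
    {"unit": "disk1", "head": 0},
    {"unit": "disk0", "head": 2},
]
queue = monitor(error_log, threshold=2)
unit, subset = collect(queue, error_log)
```

Only the subset returned by `collect` would be handed to the analyzer, so a unit with few logged errors never triggers the more expensive analysis.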
-
38. A method of detecting one of a plurality of likely failures of components in a digital data processing system, said digital data processing system comprising a plurality of units each comprising a plurality of said components, comprising the steps of
storing a plurality of error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, monitoring said stored error entries to determine whether the number of stored error entries associated with a particular unit exceeds a threshold, collecting said stored error entries associated with said particular unit if said number of said stored entries associated with said particular unit exceeds said threshold, analyzing, through use of a digital expert system, said plurality of differing indicia contained within said collected error entries associated with said particular unit, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, and, based on said failure theory, identifying a said likely failure of a said component.
-
42. A method of detecting one of a plurality of likely failures of components in a digital data processing system, comprising the steps of
storing a plurality of error entries identifying characteristics of differing types of error events in said digital data processing system, storing a plurality of error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, analyzing, through use of a digital expert system, said plurality of differing indicia contained within said error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said step of analyzing said error entries comprising the steps of determining whether a sufficient number of error entries that identify communications errors have been stored to justify generating a fault theory entry indicating a likely communications failure, if said sufficient number of error entries that identify communications errors have been stored, comparing at least one number representing occurrences of said error entries that identify communications errors with at least one number representing occurrences of error entries that identify non-media drive-detected errors, and determining whether a ratio of said number representing occurrences of error entries that identify communications errors to said number representing occurrences of error entries that identify non-media drive-detected errors is sufficient to justify generating a fault theory indicating a likely communications failure, and, based on said failure theory, identifying a said likely failure of a said component.
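The two-stage communications test in this claim (a minimum count, then a ratio against non-media drive-detected errors) might look like the following Python sketch. The numeric cutoffs, the `kind` field, the function name `communications_theory`, and the treatment of a zero denominator are assumptions; the claim says only that the count and the ratio must be "sufficient".

```python
def communications_theory(entries, min_comm_errors=5, min_ratio=2.0):
    """Return a fault theory entry if a communications failure looks
    likely, else None. Both cutoffs are illustrative assumptions."""
    comm = sum(1 for e in entries if e["kind"] == "communications")
    drive = sum(1 for e in entries if e["kind"] == "non_media_drive_detected")
    # Stage 1: enough communications errors to justify a theory at all?
    if comm < min_comm_errors:
        return None
    # Stage 2: ratio of communications errors to non-media drive-detected
    # errors. Assumption: with no drive-detected errors the ratio test passes.
    if drive == 0 or comm / drive >= min_ratio:
        return {"theory": "communications_failure", "comm": comm, "drive": drive}
    return None

def make(n_comm, n_drive):
    """Build a hypothetical error log subset with the given counts."""
    return ([{"kind": "communications"}] * n_comm
            + [{"kind": "non_media_drive_detected"}] * n_drive)

likely = communications_theory(make(6, 2))     # ratio 3.0
unlikely = communications_theory(make(6, 6))   # ratio 1.0
too_few = communications_theory(make(3, 0))    # below the minimum count
```

The ratio stage keeps a drive that is failing internally, and so also logging many drive-detected errors, from being blamed on the communications path.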
-
43. A method of detecting one of a plurality of likely failures of components in a digital data processing system, comprising the steps of
storing a plurality of error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, analyzing, through use of a digital expert system, said plurality of differing indicia contained within said error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said step of analyzing said error entries comprising the steps of determining whether a sufficient number of error entries that identify non-media drive-detected errors have been stored to justify generating a fault theory entry indicating a likely drive-detected non-media failure, if said sufficient number of said error entries that identify non-media drive-detected errors have been stored, and if most of said error entries that identify non-media drive-detected errors identify a common error type, generating a fault theory entry indicating a likelihood of said common error type, and if said sufficient number of said error entries that identify non-media drive-detected errors have been stored, and if most of said error entries that identify non-media drive-detected errors do not identify a common error type, generating fault theory entries identifying error types most frequently identified by said error entries that identify non-media drive-detected errors, and, based on said failure theory, identifying a said likely failure of a said component.
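The branch structure of this claim (one theory when most errors share a common type, otherwise theories for the most frequent types) can be sketched in Python. The reading of "most" as a strict majority, the `min_errors` and `top_n` parameters, the field names, and the function name `drive_detected_theories` are assumptions for illustration.

```python
from collections import Counter

def drive_detected_theories(entries, min_errors=4, top_n=2):
    """Return the error types to record as fault theory entries.

    Assumptions: "most" means a strict majority, and when there is no
    majority type the `top_n` most frequent types are reported.
    """
    types = [e["error_type"] for e in entries
             if e["kind"] == "non_media_drive_detected"]
    if len(types) < min_errors:
        return []                      # not enough evidence for any theory
    counts = Counter(types)
    top_type, top_count = counts.most_common(1)[0]
    if top_count > len(types) / 2:
        return [top_type]              # one common error type dominates
    return [t for t, _ in counts.most_common(top_n)]

def make(*error_types):
    """Build hypothetical non-media drive-detected error entries."""
    return [{"kind": "non_media_drive_detected", "error_type": t}
            for t in error_types]

dominant = drive_detected_theories(make("spindle", "spindle", "spindle", "seek"))
mixed = drive_detected_theories(
    make("spindle", "spindle", "spindle", "seek", "seek", "power"))
```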
-
44. A method of detecting one of a plurality of likely failures of components in a digital data processing system, comprising the steps of
storing a plurality of error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, analyzing, through use of a digital expert system, said plurality of differing indicia contained within said error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said step of analyzing said error entries comprising the steps of performing analysis in connection with head matrix failure, if no head matrix failure is likely, performing analysis in connection with bad surfaces, if there are no likely bad surfaces, performing analysis in connection with head slaps, if no head slap failure is likely, performing analysis in connection with errors directed to a servo system, if no servo system failure is likely, performing analysis in connection with read path failure, if no read path failure is likely, performing analysis in connection with bad heads on opposing media surfaces, if there are no likely bad heads on opposing media surfaces, performing analysis in connection with radial scratches, and, if there are no likely radial scratches, performing analysis in connection with bad heads, and, based on said failure theory, identifying a said likely failure of a said component.
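The fixed ordering of analyses in this claim, where each analysis runs only if the previous one found no likely failure, amounts to a short-circuiting cascade. The Python sketch below shows the control flow only; the individual analyses are stand-in predicates, and the names `ANALYSIS_ORDER`, `run_cascade`, and `servo_error` are hypothetical.

```python
# Order taken from the claim: more general failures are tested first.
ANALYSIS_ORDER = [
    "head_matrix", "bad_surfaces", "head_slaps", "servo_system",
    "read_path", "opposing_heads", "radial_scratches", "bad_heads",
]

def run_cascade(entries, analyses):
    """Run analyses in claim order, stopping at the first likely failure.

    `analyses` maps each name to a predicate over the error entries.
    Returns (theory_name_or_None, names_of_analyses_attempted).
    """
    attempted = []
    for name in ANALYSIS_ORDER:
        attempted.append(name)
        if analyses[name](entries):
            return name, attempted
    return None, attempted

# Stand-in predicates: every analysis finds nothing except the servo test.
analyses = {name: (lambda entries: False) for name in ANALYSIS_ORDER}
analyses["servo_system"] = lambda entries: any(e.get("servo_error") for e in entries)

theory, attempted = run_cascade([{"servo_error": True}], analyses)
```

Because the cascade stops at the first match, a general fault that would also produce symptoms of a specific one (for example, a servo problem that looks like a bad head) is reported at the more general level.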
-
45. An expert system for detecting one of a plurality of likely failures of components in a digital data processing system, said digital data processing system comprising a plurality of units each comprising a plurality of said components, comprising
a monitor module means for monitoring a plurality of stored error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, to determine whether the number of error entries associated with a particular unit exceeds a threshold, a collector module means for collecting said error entries associated with said particular unit, and an analyzer module means for analyzing said plurality of differing indicia contained within said collected error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said analyzer module means identifying a said likely failure of a said component based on said failure theory, said monitor module means, said collector module means and said analyzer module means being adapted for implementation by a digital data processing system.
-
49. An expert system for detecting one of a plurality of likely failures of components in a digital data processing system, comprising
a collector module means for collecting a plurality of stored error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, and an analyzer module means for analyzing said plurality of differing indicia contained within said collected error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said analyzer module means identifying a said likely failure of a said component based on said failure theory, said analyzer module means determining whether a sufficient number of error entries that identify communications errors have been stored to justify generating a fault theory entry indicating a likely communications failure, and if said sufficient number of error entries that identify communications errors have been stored, said analyzer module means comparing at least one number representing occurrences of said error entries that identify communications errors with at least one number representing occurrences of error entries that identify non-media drive-detected errors, and said analyzer module means determining whether a ratio of said number representing occurrences of error entries that identify communications errors to said number representing occurrences of error entries that identify non-media drive-detected errors is sufficient to justify generating a fault theory indicating a likely communications failure, said collector module means and said analyzer module means being adapted for implementation by a digital data processing system.
-
50. An expert system for detecting one of a plurality of likely failures of components in a digital data processing system, comprising
a collector module means for collecting a plurality of stored error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, and an analyzer module means for analyzing said plurality of differing indicia contained within said collected error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said analyzer module means identifying a said likely failure of a said component based on said failure theory, said analyzer module means determining whether a sufficient number of error entries that identify non-media drive-detected errors have been stored to justify generating a fault theory entry indicating a likely drive-detected non-media failure, if said sufficient number of said error entries that identify non-media drive-detected errors have been stored, and if most of said error entries that identify non-media drive-detected errors identify a common error type, said analyzer module means generating a fault theory entry indicating a likelihood of said common error type, and if said sufficient number of said error entries that identify non-media drive-detected errors have been stored, and if most of said error entries that identify non-media drive-detected errors do not identify a common error type, said analyzer module means generating fault theory entries identifying error types most frequently identified by said error entries that identify non-media drive-detected errors, said collector module means and said analyzer module means being adapted for implementation by a digital data processing system.
-
51. An expert system for detecting one of a plurality of likely failures of components in a digital data processing system, comprising
a collector module means for collecting a plurality of stored error entries, each error entry containing a plurality of differing indicia identifying components of differing types associated with a single error event in said digital data processing system, and an analyzer module means for analyzing said plurality of differing indicia contained within said collected error entries, by determining whether there is a substantially random distribution of indicia with respect to a plurality of components of a given type or a concentration of indicia with respect to one or more components of a given type as at least one step in identifying a pattern of indicia identifying components of differing types that corresponds with one of a plurality of failure theories, said analyzer module means identifying a said likely failure of a said component based on said failure theory, said analyzer module means performing analysis in connection with head matrix failure, if no head matrix failure is likely, said analyzer module means performing analysis in connection with bad surfaces, if there are no likely bad surfaces, said analyzer module means performing analysis in connection with head slaps, if no head slap failure is likely, said analyzer module means performing analysis in connection with errors directed to a servo system, if no servo system failure is likely, said analyzer module means performing analysis in connection with read path failure, if no read path failure is likely, said analyzer module means performing analysis in connection with bad heads on opposing media surfaces, if there are no likely bad heads on opposing media surfaces, said analyzer module means performing analysis in connection with radial scratches, and if there are no likely radial scratches, said analyzer module means performing analysis in connection with bad heads, said collector module means and said analyzer module means being adapted for implementation by a digital data processing system.
Specification