Computer system fault recovery based on historical analysis
First Claim
1. In a computer system comprising a processing unit (CPU), a peripheral unit (SM), and at least one interface unit (MSCU, MMP, MICU, TMS, DLI) connecting the processing unit to the peripheral unit, and in which prescribed ones of the units include means for generating and returning error report messages to the processing unit in response to detections of prescribed error conditions, a software implemented method for controlling the processing unit to identify a faulty system unit, said method comprising the steps of in response to receipt of an error report identifying a predefined error condition and one of the prescribed units generating the error report, generating an initial list of predetermined fault probability weights for the system units, aging a history list containing fault weights generated by the next-mentioned step as a result of receipt of a last error report by reducing the fault weights for each unit in the history list according to a first prescribed algorithm based on elapsed time since receipt of the last error report, generating a new history list by individually combining the fault weights in the initial list with the fault weights in the aged history list for each unit according to a second prescribed algorithm, and selecting as faulty unit that unit having the largest fault weight in the new history list that also has a non-zero fault weight in the initial list.
Abstract
A method of identifying faulty units in a computer-controlled system. The system units generate error reports in response to the detection of error conditions. When an error report is received, an initial list is generated containing probable fault weights for each of the system units based on the type of the error report. The probable fault weights are prespecified based on a logical analysis of the fault modes and error propagation paths in the system. A history list of fault weights is first aged and then combined with the initial list to generate a resultant list. The resultant list becomes the new history list. The resultant list is then masked by the initial list to form a selection list from which a most probable fault unit is selected.
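The sequence in the abstract (generate an initial list, age the history list, combine, then mask and select) can be illustrated with a minimal sketch. It is not the patented implementation. The source says only that aging and combining follow "prescribed" algorithms, so the exponential half-life decay and the simple addition used here are assumptions, as are the weight table, the report type names, and every function name.

```python
# Hypothetical table: error report type -> predetermined fault weights per unit.
# The unit names come from claim 1; the weights and report types are invented.
INITIAL_WEIGHTS = {
    "parity_error": {"MSCU": 50.0, "DLI": 30.0, "SM": 20.0},
    "timeout": {"DLI": 60.0, "SM": 40.0},
}

HALF_LIFE_SECONDS = 600.0  # assumed aging constant, not from the source


def age_history(history, elapsed_seconds, half_life=HALF_LIFE_SECONDS):
    """Reduce every weight according to time elapsed since the last report.

    Assumption: exponential decay stands in for the "first prescribed algorithm".
    """
    factor = 0.5 ** (elapsed_seconds / half_life)
    return {unit: weight * factor for unit, weight in history.items()}


def combine(initial, aged):
    """Combine the lists unit by unit.

    Assumption: addition stands in for the "second prescribed algorithm".
    """
    units = set(initial) | set(aged)
    return {u: initial.get(u, 0.0) + aged.get(u, 0.0) for u in units}


def select_faulty_unit(new_history, initial):
    """Mask the new history by the initial list, then pick the largest weight."""
    candidates = {u: w for u, w in new_history.items() if initial.get(u, 0.0) > 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def process_error_report(report_type, history, elapsed_seconds):
    """Run one full cycle and return (suspected unit, new history list)."""
    initial = INITIAL_WEIGHTS[report_type]
    aged = age_history(history, elapsed_seconds)
    new_history = combine(initial, aged)
    return select_faulty_unit(new_history, initial), new_history
```

In this sketch a unit with a large historical weight, such as a CPU that accumulated weight from earlier reports, is never selected for a report type whose initial list gives it zero weight. That is the effect of the masking step the abstract describes.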
69 Citations
12 Claims
-
1. In a computer system comprising a processing unit (CPU), a peripheral unit (SM), and at least one interface unit (MSCU, MMP, MICU, TMS, DLI) connecting the processing unit to the peripheral unit, and in which prescribed ones of the units include means for generating and returning error report messages to the processing unit in response to detections of prescribed error conditions, a software implemented method for controlling the processing unit to identify a faulty system unit, said method comprising the steps of
in response to receipt of an error report identifying a predefined error condition and one of the prescribed units generating the error report, generating an initial list of predetermined fault probability weights for the system units,
aging a history list containing fault weights generated by the next-mentioned step as a result of receipt of a last error report by reducing the fault weights for each unit in the history list according to a first prescribed algorithm based on elapsed time since receipt of the last error report,
generating a new history list by individually combining the fault weights in the initial list with the fault weights in the aged history list for each unit according to a second prescribed algorithm, and
selecting as faulty unit that unit having the largest fault weight in the new history list that also has a non-zero fault weight in the initial list.
-
12. In a computer system comprising replicated processing units, a peripheral unit, and at least one set of replicated interface units connecting the processing units to the peripheral unit via alternative communication paths, and in which each unit includes means for detecting a plurality of different error conditions signifying fault conditions and for generating and returning unique error reports to the processing unit in response to the detection of the different error conditions, a software implemented method for identifying a faulty unit, said method comprising the steps of
in response to a receipt of an error report of a given type, determining the specific unit generating the error report,
determining an active communication path containing the specific unit,
determining all other units contained in the active communication path,
obtaining predefined probable fault weights for each type of unit in the system based on the error report type,
generating an initial list of suspect units by assigning the predefined probable fault weights for each type of system unit to the corresponding units in the active communication path,
obtaining an existing history list containing present fault weights for each of the system units,
aging the history list by reducing the fault weights for each unit according to a second prescribed algorithm based on time elapsed since receipt of a last error report,
generating a new history list by logically combining the initial list and the present history list in a predetermined manner, and
selecting as a fault unit that unit having the largest fault weight in the new history list that also has a non-zero fault weight in the initial list.
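Claim 12 adds a path-resolution stage before the weights are assigned: the reporting unit is located on an active communication path, and per-type weights are mapped onto the units of that path. A minimal sketch of that stage follows. The data layout (each path as a mapping from unit instance name to unit type), the instance names such as `DLI0`, the weight values, and the function name are all hypothetical and not taken from the patent.

```python
def build_initial_list(report_type, reporting_unit, active_paths, type_weights):
    """Build the initial suspect list for one error report.

    active_paths: list of dicts mapping unit instance name -> unit type
                  (hypothetical representation of the active paths).
    type_weights: dict mapping report type -> {unit type: predefined weight}.
    """
    # Determine the active communication path containing the reporting unit.
    path = next((p for p in active_paths if reporting_unit in p), None)
    if path is None:
        return {}  # the reporting unit is not on any active path

    # Obtain the predefined weights per unit type for this report type.
    weights = type_weights[report_type]

    # Assign each unit on the path the weight predefined for its type.
    return {unit: weights.get(unit_type, 0.0) for unit, unit_type in path.items()}


# Hypothetical example data.
ACTIVE_PATHS = [{"CPU0": "CPU", "MSCU0": "MSCU", "DLI0": "DLI", "SM": "SM"}]
TYPE_WEIGHTS = {"timeout": {"DLI": 60.0, "SM": 40.0}}

initial = build_initial_list("timeout", "DLI0", ACTIVE_PATHS, TYPE_WEIGHTS)
```

Replicated units that are not on the active path (for example a standby `DLI1`) do not appear in the resulting list at all, so they cannot be selected for this report even if their historical weights are high.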
Specification