Global hard error distribution using the SCI interconnect
Abstract
An error propagation system and method uses a central control point at each node of a multinodal computer system to control error message distribution. The central point at each node ORs all of the error messages from the functional units at that node and then distributes an error signal to all of the local functional units and to a next node via the SCI linkage. A single bit in the SCI protocol alerts the next node that an error has occurred on another node. The central point at that next node then distributes the error signal to all of its local functional units, and the error signal is passed along to a further node, where the process repeats. Clock stoppage, which would normally occur when an error is detected, is inhibited long enough to allow the error signal to be passed along to a next node. The clock stoppage inhibiting circuit is itself inhibited if error information could otherwise be lost, thereby allowing immediate clock stoppage without regard to propagating the error to the next node.
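The distribution scheme described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the nodes are assumed to form a simple ring, the per-node combination of unit error signals is modeled as a logical OR (so that any single unit's error raises the node-level signal), and the names `Node`, `handle`, and `propagate` are hypothetical. Clock-stop inhibition is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node: a central control point plus its local functional units."""
    name: str
    unit_errors: dict[str, bool]  # unit name -> is this unit raising an error?
    log: list[str] = field(default_factory=list)
    notified_units: set[str] = field(default_factory=set)

    def handle(self, incoming_bit: bool) -> bool:
        """Log local errors, notify local units, and return the outgoing bit."""
        # Log every local unit that is currently signalling an error.
        for unit, raised in self.unit_errors.items():
            if raised:
                self.log.append(unit)
        # Combine local errors (logical OR) with the bit received from the
        # previous node; the result is the single bit sent to the next node.
        outgoing = incoming_bit or any(self.unit_errors.values())
        if outgoing:
            # Distribute the error signal to all local functional units.
            self.notified_units = set(self.unit_errors)
        return outgoing


def propagate(ring: list[Node], start: int) -> None:
    """Carry the error bit once around the ring, beginning at ring[start]."""
    bit = False
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        bit = node.handle(bit)
```

In this model only the originating node records an entry in its log, while every node downstream of it ends up with all of its local units notified, which mirrors the abstract's distinction between logging an error locally and merely learning that an error occurred on another node.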
45 Citations
22 Claims
-
1. A multiple node computer system having a communication linkage between the various nodes, each node having a plurality of functional units each such functional unit capable of monitoring for errors occurring with respect to such functional unit and for sending out an error signal when such an error has occurred, said system comprising:
-
means at each node for receiving from each functional unit at said node error signals which have been sent from any said functional unit at said node and for logging said received error signal;
said receiving means including means operative in response to a logged error signal from a particular functional unit for sending an error signal to each other functional unit at said node indicating that said particular node has logged an error signal; and
said receiving means including means operative in response to a logged error signal from a particular functional unit for communicating said logged error to the receiving means associated with each other node.
means for placing a particular bit on said communication linkage.
-
-
4. The invention set forth in claim 2 further including:
means at each node for inhibiting clock stoppage at that node until the received error signal has been passed to a next node.
-
5. The invention set forth in claim 4 further including:
means for inhibiting said inhibiting means.
-
6. The invention set forth in claim 5 further including
means for determining that an error will be cleared within a relatively few clock cycles and whereby said means for inhibiting said inhibiting means is enabled by said determining means.
7. The invention set forth in claim 1 further including:
means for confining the propagation of the error received signal to certain defined ones of said nodes.
-
8. A method of tracking errors in a multinode SCI computer system when an error at a functional unit of a first node can yield error signals at one or more other nodes, the method comprising the steps of:
-
logging all of the errors occurring at said first node; and
distributing to each other node in turn a notification that at least one error has occurred at a node remote to said each other node during a particular clock cycle.
adding all of the error signals from all functional units within a particular node to form a single error signal for distribution to other nodes.
-
-
10. The method set forth in claim 8 further including the step of:
inhibiting clock stoppage at each node until the received error signal has been passed to a next node.
-
11. The method set forth in claim 10 further including the step of:
inhibiting said step of inhibiting when it is determined that inhibiting clock stoppage will adversely impact error detection.
-
12. The method set forth in claim 8 further including the step of:
analyzing the relative distribution of error notifications to determine which errors can be grouped into a single error.
-
13. The method set forth in claim 8 further including the step of:
analyzing the relative distribution of error notifications to determine a first occurrence of a particular error.
-
14. The method set forth in claim 13 further including the step of:
removing the error signal from all of the nodes other than said node having said first occurrence of a particular error.
-
22. The method of claim 8, wherein:
each node of the multinode SCI computer system includes a plurality of functional units each such functional unit capable of monitoring for errors occurring with respect to such functional unit and for sending out an error signal when such an error has occurred.
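The analysis steps of claims 13 and 14 (finding the first occurrence of an error and removing the error signal from the other nodes) can be sketched in software. The claims do not say how the analysis is performed, so the following is a hypothetical model: it assumes each notification carries the clock cycle at which the node logged the error, and the names `Notification`, `first_occurrence`, and `clear_others` are illustrative only.

```python
from typing import NamedTuple


class Notification(NamedTuple):
    node: str
    cycle: int  # assumed: clock cycle at which this node logged the error


def first_occurrence(notes: list[Notification]) -> Notification:
    """Return the earliest notification; ties go to the earliest list entry."""
    # min() raises ValueError on an empty list, i.e. when nothing was logged.
    return min(notes, key=lambda n: n.cycle)


def clear_others(flags: dict[str, bool], origin: str) -> dict[str, bool]:
    """Keep the error flag only at the originating node; clear all others."""
    return {node: flag and node == origin for node, flag in flags.items()}
```

A caller would first pick the origin with `first_occurrence` and then pass its node name to `clear_others`, leaving the error visible only where it first appeared.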
-
15. A method for controlling error signal distribution between nodes in a multiple node, multiple processor SCI computer system, wherein each node has a plurality of functional units, each such functional unit capable of monitoring for errors occurring with respect to such functional unit and wherein such functional units are operable for sending out an error signal when such an error has occurred, said method comprising the steps of:
-
receiving error signals at a common point at each node from functional units at said node from which error signals have been sent;
sending in response to received signals from a particular functional unit an error signal to each other functional unit at said node; and
sending in response to received signals from a particular functional unit an error signal over the SCI link to a next node.
at said next node passing an error signal to a common point at said next node;
sending in response to received signals from another node an error signal to each other functional unit at said next node; and
sending in response to received signals from another node an error signal over the SCI link to a next node.
-
-
17. The method set forth in claim 15 wherein said next node sending step includes the step of:
placing a particular bit in one of the protocols on said SCI link between nodes.
-
18. The method set forth in claim 15 further including the step of:
at each node inhibiting clock stoppage at that node until a received error signal has been passed to a next node.
-
19. The method set forth in claim 18, further including the step of:
at each node inhibiting said inhibiting step thereby allowing clocks to stop immediately.
-
20. The method set forth in claim 19, further including the steps of:
-
determining that error information may be lost; and
inhibiting said step of inhibiting thereby allowing clocks to stop in order to preserve such error information.
-
-
21. The method set forth in claim 15 further including the step of:
confining the sending of said error signal to certain ones of said nodes.
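The clock-stoppage logic of claims 18 through 20 can be summarized as a small decision function. This is a minimal sketch under assumptions, not the claimed circuit: the three boolean inputs and the function name `may_stop_clocks` are hypothetical, and the real hardware conditions for "error information may be lost" are not given in the claims.

```python
def may_stop_clocks(error_pending: bool, forwarded: bool, info_at_risk: bool) -> bool:
    """Decide whether a node may stop its clocks in the current cycle.

    error_pending: an error signal has been received or raised at this node.
    forwarded:     the error signal has already been passed to the next node.
    info_at_risk:  assumed flag meaning error information may be lost if the
                   clocks keep running.
    """
    if not error_pending:
        return False  # no error, so nothing calls for a clock stop
    if info_at_risk:
        return True   # claims 19-20: override the inhibit, stop immediately
    return forwarded  # claim 18: hold off until the signal has been passed on
```

The ordering of the checks reflects the priority described in the claims: preserving error information takes precedence over propagating the error to the next node.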
Specification