Global hard error distribution using the SCI interconnect

US 6,175,931 B1
Filed: 01/31/1997
Issued: 01/16/2001
Est. Priority Date: 01/31/1997
Status: Expired due to Fees

3 Assignments

0 Petitions

Abstract

An error propagation system and method uses a central control point at each node of a multinodal computer system to control error message distribution. The central point at each node ORs all of the error messages from the functional units at that node and then distributes an error signal to all of the local functional units and to the next node via the SCI linkage. A single bit in the SCI protocol alerts the next node that an error has occurred on another node. The central point at that node then distributes the error signal to all of its local functional units and passes it along to the next node, where the process repeats. Clock stoppage, which would normally occur when an error is detected, is inhibited long enough to allow the error signal to be passed along to the next node. The clock stoppage inhibiting circuit is itself inhibited if the error information could be lost, thereby allowing immediate clock stoppage without regard to propagating the error to the next node.
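
The distribution scheme in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the `Node` class, its fields, and the `propagate` function are hypothetical names, and the local error signals are combined so that a single failing unit is enough to raise the node-level error (a logical OR, which is what the claims' "a particular functional unit" wording implies).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one node: a central point plus its functional units."""
    name: str
    unit_errors: dict[str, bool]          # unit name -> has it raised an error?
    notified_units: set[str] = field(default_factory=set)
    remote_error_seen: bool = False

    def local_error(self) -> bool:
        # Logical OR: any one failing unit raises the node-level error.
        return any(self.unit_errors.values())

    def distribute(self, incoming_bit: bool) -> bool:
        """Fan the error out to local units; return the bit to forward."""
        if incoming_bit:
            self.remote_error_seen = True
        error = incoming_bit or self.local_error()
        if error:
            self.notified_units = set(self.unit_errors)
        return error


def propagate(ring: list[Node]) -> None:
    """Pass the error bit once around the ring from the first failing node."""
    start = next((i for i, n in enumerate(ring) if n.local_error()), None)
    if start is None:
        return  # no node has a local error, so nothing is sent
    bit = False
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        bit = node.distribute(bit)


# Usage: node B has one failing unit; after one pass every node's units are notified.
ring = [
    Node("A", {"cpu": False, "mem": False}),
    Node("B", {"cpu": False, "mem": True}),
    Node("C", {"cpu": False, "io": False}),
]
propagate(ring)
```

The model forwards a single boolean between nodes, mirroring the single protocol bit the abstract mentions; it says nothing about how that bit is encoded on a real SCI link.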

45 Citations

22 Claims

1. A multiple node computer system having a communication linkage between the various nodes, each node having a plurality of functional units each such functional unit capable of monitoring for errors occurring with respect to such functional unit and for sending out an error signal when such an error has occurred, said system comprising:
- means at each node for receiving from each functional unit at said node error signals which have been sent from any said functional unit at said node and for logging said received error signal;
  
  said receiving means including means operative in response to a logged error signal from a particular functional unit for sending an error signal to each other functional unit at said node indicating that said particular node has logged an error signal; and
  
  said receiving means including means operative in response to a logged error signal from a particular functional unit for communicating said logged error to the receiving means associated with each other node.
2. The invention set forth in claim 1 wherein said communicating means is a ring linkage where messages pass serially from node to node.
3. The invention set forth in claim 2 wherein said communicating means further includes:
4. The invention set forth in claim 2 further including:
- means at each node for inhibiting clock stoppage at that node until the received error signal has been passed to a next node.
5. The invention set forth in claim 4 further including:
- means for inhibiting said inhibiting means.
6. The invention set forth in claim 5 further including means for determining that an error will be cleared within a relatively few clock cycles and whereby said means for inhibiting said inhibiting means is enabled by said determining means.
7. The invention set forth in claim 1 further including:
- means for confining the propagation of the error received signal to certain defined ones of said nodes.
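
The interaction between the clock-stop inhibit of claim 4 and the override of claim 5 can be written as a small decision function. This is a minimal sketch, assuming the three conditions are available as boolean inputs; the function name and parameter names are hypothetical, and the condition that drives the override is deliberately left as an input rather than modeled.

```python
def may_stop_clocks(error_logged: bool,
                    forwarded_to_next_node: bool,
                    override: bool) -> bool:
    """Return True when this node is allowed to stop its clocks.

    error_logged:           the node's central point has logged an error
    forwarded_to_next_node: the error signal has been passed along the ring
    override:               the inhibit is itself inhibited (claim 5)
    """
    if not error_logged:
        return False               # nothing to react to
    if override:
        return True                # stop immediately, without waiting to forward
    return forwarded_to_next_node  # otherwise hold the clocks running until sent
```

With the override off, a logged error stops the clocks only after the signal has moved on; with it on, the clocks stop at once.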

8. A method of tracking errors in a multinode SCI computer system when an error at a functional unit of a first node can yield error signals at one or more other nodes, the method comprising the steps of:
- logging all of the errors occurring at said first node; and
  
  distributing to each other node in turn a notification that at least one error has occurred at a node remote to said each other node during a particular clock cycle.
9. The method set forth in claim 8 wherein said distributing means includes the step of:
10. The method set forth in claim 8 further including the step of:
- inhibiting clock stoppage at each node until the received error signal has been passed to a next node.
11. The method set forth in claim 10 further including the step of:
- inhibiting said step of inhibiting when it is determined that inhibiting clock stoppage will adversely impact error detection.
12. The method set forth in claim 8 further including the step of:
- analyzing the relative distribution of error notifications to determine which errors can be grouped into a single error.
13. The method set forth in claim 8 further including the step of:
- analyzing the relative distribution of error notifications to determine a first occurrence of a particular error.
14. The method set forth in claim 13 further including the step of:
- removing the error signal from all of the nodes other than said node having said first occurrence of a particular error.
22. The method of claim 8, wherein:
- each node of the multinode SCI computer system includes a plurality of functional units each such functional unit capable of monitoring for errors occurring with respect to such functional unit and for sending out an error signal when such an error has occurred.
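
Claims 13 and 14 do not say how the first occurrence is identified. The sketch below assumes, purely for illustration, that each node's log records the clock cycle at which it first saw the error, and treats the earliest cycle as the origin. The function names and the log layout are hypothetical.

```python
def first_occurrence(logged_cycle: dict[str, int]) -> str:
    """Return the node whose log shows the earliest cycle for this error.

    Assumption: logged_cycle maps node name -> cycle at which the node
    first logged the error. Ties go to the first node in iteration order.
    """
    return min(logged_cycle, key=logged_cycle.get)


def clear_secondary(logged_cycle: dict[str, int]) -> dict[str, int]:
    """Keep only the originating node's entry, dropping the propagated ones."""
    origin = first_occurrence(logged_cycle)
    return {origin: logged_cycle[origin]}


# Usage: node "n2" logged the error before it reached "n3" and "n1".
log = {"n1": 112, "n2": 100, "n3": 106}
origin = first_occurrence(log)
remaining = clear_secondary(log)
```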

15. A method for controlling error signal distribution between nodes in a multiple node, multiple processor SCI computer system, wherein each node has a plurality of functional units, each such functional unit capable of monitoring for errors occurring with respect to such functional unit and wherein such functional units are operable for sending out an error signal when such an error has occurred, said method comprising the steps of:
- receiving error signals at a common point at each node from functional units at said node from which error signals have been sent;
  
  sending in response to received signals from a particular functional unit an error signal to each other functional unit at said node; and
  
  sending in response to received signals from a particular functional unit an error signal over the SCI link to a next node.
16. The method set forth in claim 15 further including the step of:
17. The method set forth in claim 15 wherein said next node sending step includes the step of:
- placing a particular bit in one of the protocols on said SCI link between nodes.
18. The method set forth in claim 15 further including the step of:
- at each node inhibiting clock stoppage at that node until a received error signal has been passed to a next node.
19. The method set forth in claim 18, further including the step of:
- at each node inhibiting said inhibiting step thereby allowing clocks to stop immediately.
20. The method set forth in claim 19, further including the steps of:
- determining that error information may be lost; and
  
  inhibiting said step of inhibiting thereby allowing clocks to stop in order to preserve such error information.
21. The method set forth in claim 15 further including the step of:
- confining the sending of said error signal to certain ones of said nodes.
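
Claims 17 and 21 can be pictured together: an error flag carried as one bit of a word sent between nodes, and a filter that stops the flag from leaving a defined set of nodes. The sketch below is an assumption-laden illustration; the bit position, the integer "header" word, and the function names are all hypothetical and are not taken from the SCI standard.

```python
ERROR_BIT = 1 << 0  # hypothetical bit position, chosen only for this example


def mark_error(header: int) -> int:
    """Set the error bit in an outgoing header word."""
    return header | ERROR_BIT


def has_error(header: int) -> bool:
    """Check whether an incoming header word carries the error bit."""
    return bool(header & ERROR_BIT)


def forward(header: int, next_node: str, allowed: set[str]) -> int:
    """Clear the error bit when the next node lies outside the confined set."""
    if next_node in allowed:
        return header
    return header & ~ERROR_BIT


# Usage: the error is confined to nodes "n1" and "n2", so "n3" never sees it.
hdr = mark_error(0b1000)
to_n2 = forward(hdr, "n2", {"n1", "n2"})
to_n3 = forward(hdr, "n3", {"n1", "n2"})
```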

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Hewlett-Packard Company (HP Inc.)
Inventors: Hornung, Bryan
Primary Examiner(s): LE, DIEU MINH T
Application Number: US08/792,324
Time in Patent Office: 1,446 Days
Field of Search: 395/182.02, 395/182.18, 395/185.01, 395/185.02, 395/185.08, 714/4, 714/20, 714/48, 714/49, 714/55
US Class Current: 714/4.4
CPC Class Codes:
G06F 11/0724   in a multiprocessor or a mu...
G06F 11/0772   Means for error signaling, ...
G06F 11/0784   Routing of error reports, e...
