
Apparatus for recovery from failures in a multiprocessing system

  • US 4,503,535 A
  • Filed: 06/30/1982
  • Issued: 03/05/1985
  • Est. Priority Date: 06/30/1982
  • Status: Expired due to Term
First Claim

1. In a data processing system including a number of bus interface unit (BIU) nodes and memory control unit (MCU) nodes and in which a switching matrix provides electrical interconnections between horizontal MACD buses and vertical ACD buses connected in said matrix by means of said BIU nodes located at the intersections of said MACD and ACD buses, said memory control unit (MCU) nodes connected to said MACD buses, means for detecting an error, an error-reporting matrix including horizontal Bus Error Report Lines (BERLs) and vertical Module Error Report Lines (MERLs), said BERLs being associated with said MACD buses such that all BIU and MCU nodes sharing an MACD bus are connected with a pair of BERLs, said MERLs being associated with said ACD buses such that all nodes sharing an ACD bus are connected with a MERL, and, error-reporting means in a particular node connected to said means for detecting an error, said error-reporting means including means for receiving error messages transmitted over at least said one BERL, and means for reporting error messages over at least said one BERL, said error messages identifying the type of error and the locations (ID) at which the error was detected, a recovery mechanism in said particular node comprising:

  • a recovery machine;

    said recovery machine including first means for causing said particular node to become quiescent for a first timeout period to thereby wait for transients to subside, said first means including means for disabling the reporting of errors by said error-reporting means for the duration of said first timeout period;

    said recovery machine including second means for causing said particular node to enter a second timeout period;

    means for storing memory accesses;

    means for generating memory accesses;

    means connected to said second means, operative during said second timeout period, for retrying a memory access stored in said storing means;

    permanent error determining means connected to said means for detecting an error, to said recovery means, and to said error reporting means, operative upon the condition that an error recurs during said second timeout period, for causing said error-reporting means in said particular node to propagate a permanent-error error report message, said error message identifying the type of error and the location (ID) at which the permanent error was detected; and

    error report logging means in said particular node connected to at least one of said error report lines, for logging received error report messages propagated to said particular node.
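
The recovery sequence recited above (a quiescent first timeout with error reporting disabled, a second timeout in which stored memory accesses are retried, and a permanent-error report if an error recurs during that second period) can be pictured as a small state machine. The following is only an illustrative software sketch, not the patented hardware: the names `RecoveryNode`, `ErrorReport`, `State`, the tick-based timeouts, and the `"permanent:"` prefix are all assumptions introduced here and do not appear in the claim.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List


class State(Enum):
    NORMAL = auto()
    QUIESCENT = auto()   # first timeout period: wait for transients to subside
    RETRY = auto()       # second timeout period: retry stored accesses
    PERMANENT = auto()   # an error recurred during the second period


@dataclass(frozen=True)
class ErrorReport:
    error_type: str      # type of error
    node_id: int         # location (ID) at which it was detected


@dataclass
class RecoveryNode:
    """Hypothetical model of one node's recovery mechanism."""
    node_id: int
    first_timeout: int = 3    # assumed length in ticks
    second_timeout: int = 3   # assumed length in ticks
    state: State = State.NORMAL
    reporting_enabled: bool = True
    pending_accesses: List[str] = field(default_factory=list)  # stored accesses
    sent: List[ErrorReport] = field(default_factory=list)      # reports put on a BERL
    log: List[ErrorReport] = field(default_factory=list)       # received reports
    _ticks_left: int = 0

    def _begin_recovery(self) -> None:
        # Become quiescent and disable error reporting for the first timeout.
        self.state = State.QUIESCENT
        self.reporting_enabled = False
        self._ticks_left = self.first_timeout

    def receive_report(self, report: ErrorReport) -> None:
        # Log every report propagated to this node, then start recovery.
        self.log.append(report)
        if self.state is State.NORMAL:
            self._begin_recovery()

    def detect_error(self, error_type: str) -> None:
        if self.state is State.RETRY:
            # Error recurred during the second timeout: report it as permanent.
            self.sent.append(ErrorReport("permanent:" + error_type, self.node_id))
            self.state = State.PERMANENT
        elif self.state is State.NORMAL and self.reporting_enabled:
            self.sent.append(ErrorReport(error_type, self.node_id))
            self._begin_recovery()
        # In QUIESCENT, reporting is disabled, so the error is not reported.

    def tick(self, retry: Callable[[str], bool]) -> None:
        """Advance one time unit; `retry` returns True if an access succeeds."""
        if self.state is State.QUIESCENT:
            self._ticks_left -= 1
            if self._ticks_left == 0:
                self.state = State.RETRY
                self.reporting_enabled = True
                self._ticks_left = self.second_timeout
                for access in list(self.pending_accesses):
                    if retry(access):
                        self.pending_accesses.remove(access)
                    else:
                        self.detect_error("retry-failed")
                        return
        elif self.state is State.RETRY:
            self._ticks_left -= 1
            if self._ticks_left == 0:
                self.state = State.NORMAL
```

In this sketch, a transient error leads back to `State.NORMAL` once both timeouts expire and the stored accesses have been retried successfully, while a failed retry produces a single permanent-error report carrying the error type and the node ID.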
