Fault-tolerant cache coherence over a lossy network
First Claim
1. A method, comprising:
storing, in a hardware unit of each node of a plurality of nodes, a respective first plurality of data records for each message sent by said each node, wherein each data record of the respective first plurality of data records comprises:
a message type for said each message, a source node identifier for said each message, a destination node identifier for said each message, route information of said each message between the source node and the destination node of said each message, and a sequence number for said each message;
detecting that a particular message containing a particular sequence number was not received by a first node of the plurality of nodes;
in response to the detecting that the particular message was not received by the first node, sending a Nack message to a second node of the plurality of nodes, wherein the second node is the source node of the particular message, and wherein the Nack message identifies a lost sequence number and the route information for the particular message;
in response to receiving the Nack message at the second node, identifying, from the respective first plurality of data records stored at the second node, a particular data record for the particular message, based on the lost sequence number and the route information for the particular message; and
using the particular data record to process the particular message again.
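The sender-side steps of the claim (store a record per sent message, then look the record up from a Nack) can be sketched in software as follows. This is only an illustrative model, not the patented hardware: the claim places the records in a hardware unit, while the names `MessageRecord`, `Nack`, `SentLog`, and `handle_nack`, the tuple-of-hop-IDs route encoding, and the `"ReadShared"` message type in the usage note are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Route = Tuple[int, ...]  # assumed encoding: hop IDs from source to destination


@dataclass(frozen=True)
class MessageRecord:
    """One record per sent message, holding the five fields the claim recites."""
    msg_type: str
    source: int
    destination: int
    route: Route
    seq: int


@dataclass(frozen=True)
class Nack:
    """Nack sent back to the source node: lost sequence number plus route."""
    lost_seq: int
    route: Route


class SentLog:
    """Software stand-in for the per-node hardware unit that stores the records."""

    def __init__(self) -> None:
        # Keyed by (route, sequence number), the two values a Nack carries.
        self._records: Dict[Tuple[Route, int], MessageRecord] = {}

    def store(self, record: MessageRecord) -> None:
        self._records[(record.route, record.seq)] = record

    def lookup(self, nack: Nack) -> Optional[MessageRecord]:
        return self._records.get((nack.route, nack.lost_seq))


def handle_nack(log: SentLog, nack: Nack) -> Optional[MessageRecord]:
    """Return the record to process again, or None if no stored record matches."""
    return log.lookup(nack)
```

For example, after `log.store(MessageRecord("ReadShared", 2, 1, (2, 5, 1), 41))`, calling `handle_nack(log, Nack(lost_seq=41, route=(2, 5, 1)))` returns that same record, which the source node could then use to process the message again. Keying on both route and sequence number mirrors the claim language, which identifies the record "based on the lost sequence number and the route information".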
Abstract
A cache coherence system manages both internode and intranode cache coherence in a cluster of nodes. Each node in the cluster of nodes is either a collection of processors running an intranode coherence protocol among themselves, or a single processor. A node comprises a plurality of coherence ordering units (COUs) that are hardware circuits configured to manage intranode coherence of caches within the node and/or internode coherence with caches on other nodes in the cluster. Each node contains one or more directories that track the state of cache line entries managed by the particular node. Each node may also contain one or more scoreboards for managing the status of ongoing transactions. The internode cache coherence protocol implemented in the COUs may be used to detect and resolve communications errors, such as dropped message packets between nodes, late message delivery at a node, or node failure. Additionally, a transport layer manages communication between the nodes in the cluster, and can also be used to detect and resolve communications errors.
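The abstract says dropped message packets can be detected but does not say how. One possible approach, assumed here purely for illustration and not taken from the patent text, is for a receiving node to track the next expected sequence number per route and report any gap. The class name `GapDetector` and the assumption that sequence numbers increase by one per message on a route are hypothetical.

```python
from typing import Dict, List, Tuple

Route = Tuple[int, ...]  # assumed encoding: hop IDs from source to destination


class GapDetector:
    """Tracks the next expected sequence number per route (assumed scheme)."""

    def __init__(self) -> None:
        self._next_expected: Dict[Route, int] = {}

    def on_receive(self, route: Route, seq: int) -> List[int]:
        """Record an arrival and return the sequence numbers skipped before it."""
        # The first message seen on a route sets the baseline, so nothing is missing.
        expected = self._next_expected.get(route, seq)
        missing = list(range(expected, seq))
        if seq >= expected:
            self._next_expected[route] = seq + 1
        # A late arrival (seq < expected) reports no gap and leaves the state alone.
        return missing
```

Each sequence number returned by `on_receive` is a candidate for a Nack to the source node of that route. In this sketch, receiving sequence numbers 0 and then 3 on the same route reports 1 and 2 as missing, while a late arrival of 1 afterward reports nothing.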
50 Citations
18 Claims
1. A method, comprising:
storing, in a hardware unit of each node of a plurality of nodes, a respective first plurality of data records for each message sent by said each node, wherein each data record of the respective first plurality of data records comprises:
a message type for said each message, a source node identifier for said each message, a destination node identifier for said each message, route information of said each message between the source node and the destination node of said each message, and a sequence number for said each message;
detecting that a particular message containing a particular sequence number was not received by a first node of the plurality of nodes;
in response to the detecting that the particular message was not received by the first node, sending a Nack message to a second node of the plurality of nodes, wherein the second node is the source node of the particular message, and wherein the Nack message identifies a lost sequence number and the route information for the particular message;
in response to receiving the Nack message at the second node, identifying, from the respective first plurality of data records stored at the second node, a particular data record for the particular message, based on the lost sequence number and the route information for the particular message; and
using the particular data record to process the particular message again.
Dependent claims: 2, 3, 4, 5, 6
7. A computer system, comprising:
a plurality of nodes, wherein each node of the plurality of nodes comprises one or more hardware units, wherein each hardware unit of the one or more hardware units comprises one or more processors, registers, content-addressable memories, and/or other computer-implemented hardware circuitry;
wherein each hardware unit of the one or more hardware units is coupled to a particular memory and a particular cache and each particular hardware unit of the one or more hardware units is configured as a cache controller of the particular memory and the particular cache;
each node of the plurality of nodes is configured to:
store, in a first hardware unit of the node, a respective first plurality of data records for each message sent by said each node, wherein each data record of the respective first plurality of data records comprises:
a message type for said each message, a source node identifier for said each message, a destination node identifier for said each message, route information of said each message between the source node and the destination node, and a sequence number for said each message;
a first node of the plurality of nodes configured to:
detect that a particular message containing a particular sequence number was not received by the first node;
in response to the detecting that the particular message was not received by the first node, send a Nack message to a second node of the plurality of nodes, wherein the second node is the source node of the particular message, and wherein the Nack message identifies a lost sequence number and the route information for the particular message;
the second node of the plurality of nodes configured to:
in response to receiving the Nack message at the second node, identify from the respective first plurality of data records stored at the second node, a particular data record for the particular message, based on the lost sequence number and the route information for the particular message; and
use the particular data record to process the particular message again.
Dependent claims: 8, 9, 10, 11, 12
13. One or more non-transitory computer-readable storage media storing instructions, which when executed by one or more processors, cause:
storing, in a hardware unit of each node of a plurality of nodes, a respective first plurality of data records for each message sent by said each node, wherein each data record of the respective first plurality of data records comprises:
a message type for said each message, a source node identifier for said each message, a destination node identifier for said each message, route information of said each message between the source node and the destination node of said each message, and a sequence number for said each message;
detecting that a particular message containing a particular sequence number was not received by a first node of the plurality of nodes;
in response to the detecting that the particular message was not received by the first node, sending a Nack message to a second node of the plurality of nodes, wherein the second node is the source node of the particular message, and wherein the Nack message identifies a lost sequence number and the route information for the particular message;
in response to receiving the Nack message at the second node, identifying, from the respective first plurality of data records stored at the second node, a particular data record for the particular message, based on the lost sequence number and the route information for the particular message; and
using the particular data record to process the particular message again.
Dependent claims: 14, 15, 16, 17, 18
Specification