Latency tolerant distributed shared memory multiprocessor computer
First Claim
1. A computer system comprising:
a network;
one or more processing nodes connected via the network, wherein each processing node includes:
a plurality of processors, wherein each processor includes a scalar processing unit, a vector processing unit, means for operating the scalar processing unit independently of the vector processing unit, a processor cache and a translation look-aside buffer (TLB), wherein the scalar processing unit places instructions for the vector processing unit in a queue for execution by the vector processing unit and the scalar processing unit continues to execute additional instructions; and
a shared memory, wherein the shared memory is connected to each of the processors within the processing node, wherein the shared memory includes a Remote Address Translation Table (RTT), wherein the RTT contains translation information for an entire virtual memory address space associated with the processing node and wherein the RTT translates memory addresses received from other processing nodes such that the memory addresses are translated into physical addresses within the shared memory;
wherein processors on one node can load data directly from and store data directly to shared memory on another processing node via addresses that are translated on the other processing node using the other processing node's RTT; and
wherein each TLB in a corresponding processing node exists separate from the RTT in that processing node and wherein each TLB translates memory references from its associated processor to the shared memory on its processing node.
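As context for the queued vector-instruction clause above, here is a minimal software sketch of a decoupled scalar/vector arrangement: the scalar side places an encoded vector instruction into a FIFO and continues with its own instruction stream, while the vector side drains the queue for execution. The types and functions (vec_queue_t, vq_push, vq_pop) are hypothetical illustrations and not part of the patent.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define VQ_DEPTH 16

typedef struct {
    uint64_t opcode;    /* encoded vector instruction */
    uint64_t operands;  /* packed operand descriptors */
} vec_instr_t;

typedef struct {
    vec_instr_t slots[VQ_DEPTH];
    size_t head, tail, count;
} vec_queue_t;

/* Scalar side: enqueue a vector instruction and keep going; the scalar
 * unit only waits when the queue is full. */
bool vq_push(vec_queue_t *q, vec_instr_t vi) {
    if (q->count == VQ_DEPTH) return false;
    q->slots[q->tail] = vi;
    q->tail = (q->tail + 1) % VQ_DEPTH;
    q->count++;
    return true;        /* scalar unit continues with additional instructions */
}

/* Vector side: dequeue the next instruction for execution. */
bool vq_pop(vec_queue_t *q, vec_instr_t *out) {
    if (q->count == 0) return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % VQ_DEPTH;
    q->count--;
    return true;
}
```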
Abstract
A computer system having low memory access latency. In one embodiment, the computer system includes a network and one or more processing nodes connected via the network, wherein each processing node includes a plurality of processors and a shared memory connected to each of the processors. The shared memory includes a cache. Each processor includes a scalar processing unit, a vector processing unit and means for operating the scalar processing unit independently of the vector processing unit. Processors on one node can load data directly from and store data directly to shared memory on another processing node via the network.
15 Claims
1. A computer system comprising:
a network;
one or more processing nodes connected via the network, wherein each processing node includes:
a plurality of processors, wherein each processor includes a scalar processing unit, a vector processing unit, means for operating the scalar processing unit independently of the vector processing unit, a processor cache and a translation look-aside buffer (TLB), wherein the scalar processing unit places instructions for the vector processing unit in a queue for execution by the vector processing unit and the scalar processing unit continues to execute additional instructions; and
a shared memory, wherein the shared memory is connected to each of the processors within the processing node, wherein the shared memory includes a Remote Address Translation Table (RTT), wherein the RTT contains translation information for an entire virtual memory address space associated with the processing node and wherein the RTT translates memory addresses received from other processing nodes such that the memory addresses are translated into physical addresses within the shared memory;
wherein processors on one node can load data directly from and store data directly to shared memory on another processing node via addresses that are translated on the other processing node using the other processing node's RTT; and
wherein each TLB in a corresponding processing node exists separate from the RTT in that processing node and wherein each TLB translates memory references from its associated processor to the shared memory on its processing node.
Dependent claims: 2, 3, 4, 5, 6.
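The separation recited above between a per-processor TLB and a per-node RTT can be pictured with a short sketch: a processor's own references are translated by its TLB, while addresses arriving from other nodes are translated by the destination node's RTT, which holds an entry for every page of that node's virtual space and therefore never misses. All names, sizes, and layouts below are assumed for illustration only.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)
#define TLB_SLOTS  64
#define RTT_PAGES  (1ull << 20)   /* one entry per page: full coverage of the node's space */

typedef struct { uint64_t vpn, pfn; int valid; } tlb_entry_t;

typedef struct {
    tlb_entry_t tlb[TLB_SLOTS];   /* per-processor cache of translations */
} cpu_t;

typedef struct {
    uint64_t *rtt;                /* node-level table with RTT_PAGES entries */
    cpu_t     cpus[4];
} node_t;

/* Local path: a processor's own reference is translated by its TLB. */
int tlb_translate(const cpu_t *p, uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_SHIFT;
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (p->tlb[i].valid && p->tlb[i].vpn == vpn) {
            *pa = (p->tlb[i].pfn << PAGE_SHIFT) | (va & PAGE_MASK);
            return 1;             /* hit */
        }
    }
    return 0;                     /* miss: refill from page tables (omitted) */
}

/* Remote path: an address arriving over the network is translated by the
 * destination node's RTT into a physical address in that node's memory. */
uint64_t rtt_translate(const node_t *n, uint64_t remote_va) {
    uint64_t vpn = (remote_va >> PAGE_SHIFT) % RTT_PAGES;
    return (n->rtt[vpn] << PAGE_SHIFT) | (remote_va & PAGE_MASK);
}
```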
7. A computer system comprising:
a network;
one or more processing nodes connected via the network, wherein each processing node includes:
four processors configured as a Multi-Streaming Processor, wherein each processor includes a scalar processing unit, a vector processing unit, means for operating the scalar processing unit independently of the vector processing unit, a processor cache connected to each of the processing units and a translation look-aside buffer (TLB), wherein the scalar processing unit places instructions for the vector processing unit in a queue for execution by the vector processing unit and the scalar processing unit continues to execute additional instructions; and
a shared memory, wherein the shared memory is connected to each of the processors within the processing node, wherein the shared memory includes a Remote Address Translation Table (RTT), wherein the RTT contains translation information for an entire virtual memory address space associated with the processing node and wherein the RTT translates memory addresses received from other processing nodes such that the memory addresses are translated into physical addresses within the shared memory;
wherein processors on one node can load data directly from and store data directly to shared memory on another processing node via addresses that are translated on the other processing node using the other processing node's RTT; and
wherein each TLB in a corresponding processing node exists separate from the RTT in that processing node and wherein each TLB translates memory references from its associated processor to the shared memory on its processing node.
Dependent claims: 8.
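A compact way to read the node composition in claim 7 is as a data-structure sketch: four processors, each with a scalar unit, a vector unit, a cache connected to both units, and a TLB, grouped as one Multi-Streaming Processor and attached to a node-level shared memory and RTT. The struct and field names are hypothetical and chosen only to mirror the claim language.

```c
#include <stdint.h>

typedef struct { uint64_t s_regs[64]; }          scalar_unit_t;
typedef struct { uint64_t v_regs[32][64]; }      vector_unit_t;
typedef struct { uint8_t  lines[32 * 1024]; }    processor_cache_t;
typedef struct { uint64_t vpn, pfn; int valid; } tlb_entry_t;

typedef struct {
    scalar_unit_t     scalar;    /* runs independently of the vector unit */
    vector_unit_t     vector;    /* executes instructions queued by the scalar unit */
    processor_cache_t cache;     /* connected to both processing units */
    tlb_entry_t       tlb[64];   /* translates this processor's local references */
} msp_processor_t;

typedef struct {
    msp_processor_t cpus[4];       /* four processors configured as one MSP */
    uint64_t       *rtt;           /* node-level Remote Address Translation Table */
    uint8_t        *shared_memory; /* connected to all four processors */
} msp_node_t;
```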
9. A method of providing a latency tolerant distributed shared memory multiprocessor computer system, the method comprising:
connecting one or more processing nodes via a network, wherein each processing node includes:
a plurality of processors, wherein each processor includes a scalar processing unit, a vector processing unit, means for operating the scalar processing unit independently of the vector processing unit, a processor cache and a translation look-aside buffer (TLB), wherein the scalar processing unit places instructions for the vector processing unit in a queue for execution by the vector processing unit and the scalar processing unit continues to execute additional instructions; and
a shared memory, wherein the shared memory is connected to each of the processors within the processing node, wherein the shared memory includes a Remote Address Translation Table (RTT), wherein the RTT contains translation information for an entire virtual memory address space associated with the processing node;
storing data from a processor on a first processing node to shared memory on a second processing node via the network, wherein storing includes translating via the RTT on the second processing node memory addresses received from the first processing node such that the memory addresses received from the first processing node are translated into physical addresses within the shared memory of the second processing node; and
reading data from shared memory on the second processing node to the processor on the first processing node;
wherein memory references from the processor on the first processing node to the shared memory on the first processing node are translated by the associated TLB in the first processing node, wherein the TLB exists separate from the RTT in the first processing node.
Dependent claims: 10, 11, 12, 13.
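The store-then-read sequence recited in claim 9 can be simulated in a few lines: an address issued on one node crosses the network untranslated and is mapped into the owning node's physical memory by that node's RTT, for both the store and the later load. This is an illustrative simulation under assumed page sizes and table layout, not the patented mechanism.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 6                       /* tiny pages keep the example small */
#define PAGES      4
#define PAGE_BYTES (1u << PAGE_SHIFT)

typedef struct {
    uint64_t rtt[PAGES];                   /* virtual page -> physical page */
    uint8_t  shared_mem[PAGES * PAGE_BYTES];
} owner_node_t;

/* Translation is performed on the node that owns the memory. */
static uint64_t rtt_translate(const owner_node_t *n, uint64_t va) {
    return (n->rtt[(va >> PAGE_SHIFT) % PAGES] << PAGE_SHIFT) |
           (va & (PAGE_BYTES - 1));
}

/* Store arriving from a remote processor: translated, then applied. */
static void remote_store(owner_node_t *target, uint64_t va, uint8_t value) {
    target->shared_mem[rtt_translate(target, va)] = value;
}

/* Load arriving from a remote processor: translated, then served. */
static uint8_t remote_load(const owner_node_t *target, uint64_t va) {
    return target->shared_mem[rtt_translate(target, va)];
}

int main(void) {
    /* Node 1 maps its four virtual pages to physical pages in reverse order. */
    owner_node_t node1 = { .rtt = { 3, 2, 1, 0 }, .shared_mem = { 0 } };
    remote_store(&node1, 0x05, 42);                   /* issued by node 0 */
    printf("loaded %u\n", remote_load(&node1, 0x05)); /* prints: loaded 42 */
    return 0;
}
```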
14. A method of providing a latency tolerant distributed shared memory multiprocessor computer system, the method comprising:
connecting one or more processing nodes via a network, wherein each processing node includes:
four processors configured as a Multi-Streaming Processor, wherein each processor includes a scalar processing unit, a vector processing unit, means for operating the scalar processing unit independently of the vector processing unit, a processor cache connected to each of the processing units and a translation look-aside buffer (TLB), wherein the scalar processing unit places instructions for the vector processing unit in a queue for execution by the vector processing unit and the scalar processing unit continues to execute additional instructions; and
a shared memory, wherein the shared memory is connected to each of the processors within the processing node, wherein the shared memory includes a Remote Address Translation Table (RTT), wherein the RTT contains translation information for an entire virtual memory address space associated with the processing node;
storing data from a processor on a first processing node to shared memory on a second processing node via the network, wherein storing includes translating via the RTT on the second processing node memory addresses received from the first processing node such that the memory addresses received from the first processing node are translated into physical addresses within the shared memory of the second processing node; and
reading data from shared memory on the second processing node to the processor on the first processing node;
wherein memory references from the processor on the first processing node to the shared memory on the first processing node are translated by the associated TLB in the first processing node, wherein the TLB exists separate from the RTT in the first processing node.
Dependent claims: 15.
Specification