Building a WaveCache
9 Assignments
Abstract
A microarchitecture and instruction set that support multiple, simultaneously executing threads. The approach is disclosed in regard to its applicability to a recently developed microarchitecture called “WaveScalar.” A WaveScalar compiler breaks the control flow graph of a program into pieces called waves, whose instructions are partially ordered (i.e., a wave contains no back-edges) and which control enters at a single point. Certain aspects of the present approach are also generally applicable to executing multiple threads on a more conventional microarchitecture. In one aspect of this approach, instructions are provided that enable and disable wave-ordered memory. Additional memory access instructions bypass wave-ordered memory, exposing additional parallelism. A lightweight inter-thread synchronization mechanism that models hardware queue locks is also employed. Finally, a simple fence instruction allows applications to handle relaxed memory consistency.
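The wave-partitioning idea in the abstract can be illustrated in software. Below is a rough Python sketch, assuming the control flow graph is given as an adjacency dict and the loop back-edges are already known (e.g., from a prior DFS). All names are invented for this illustration; a real wave-forming compiler would additionally split regions reachable from multiple entries so that every wave keeps a single entrance.

```python
def partition_into_waves(cfg, entry, back_edges):
    """Split a control flow graph into waves by cutting back-edges.

    cfg: dict mapping node -> list of successor nodes
    entry: the program entry node
    back_edges: set of (src, dst) loop back-edges, assumed precomputed
    Each back-edge target starts a new wave; a wave is everything
    reachable from its entry without crossing a back-edge or entering
    another wave's entry, so each wave is acyclic.
    """
    wave_entries = {entry} | {dst for _, dst in back_edges}
    waves = {}
    for e in wave_entries:
        seen, stack = set(), [e]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            for s in cfg.get(n, []):
                # Stop at back-edges and at other wave entries.
                if (n, s) not in back_edges and s not in wave_entries:
                    stack.append(s)
        waves[e] = seen
    return waves
```

For a simple loop A → B → C → (back to B) → D, cutting the back-edge (C, B) yields one wave holding A and another holding the loop body B, C, D.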
139 Citations
38 Claims
1. A method for facilitating thread synchronization in a dataflow computing device having a plurality of processing elements, when the plurality of processing elements are employed for executing at least one thread, comprising the steps of:
(a) dividing a control flow graph of a thread into a plurality of waves, wherein each wave comprises a connected, directed acyclic portion of the control flow graph, with a single entrance;
(b) providing wave number tags to be used in identifying each individual dynamic instance of data used when executing a thread, each memory operation being annotated with its location in the wave and in regard to its ordering relationship with other memory operations in the wave, to define a wave-ordered memory, the wave-ordered memory ensuring that the results of earlier memory operations are visible to later memory operations;
(c) annotating data values to include a thread identification indicating a specific thread with which each data value is associated, wherein each thread being executed has a different thread identification; and
(d) including a memory fence instruction that, when executed, indicates a commitment to memory of prior load and prior store operations that have occurred during execution of a thread with which the memory fence instruction is employed.
Dependent claims: 2-9.
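Steps (b)-(d) of claim 1 can be modeled in a few lines of Python. This is a toy sketch, not the patent's hardware or encoding: the class and method names are invented, tokens simply carry a thread id and wave number, each thread's memory operations commit in issue order, and the fence reports completion only once every prior operation of that thread has committed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """A dataflow token tagged as in steps (b) and (c): the thread id
    and wave number together identify the dynamic instance of a value."""
    thread_id: int
    wave_number: int
    value: object

class WaveOrderedMemory:
    """Toy wave-ordered memory: per-thread operations commit in order,
    and a fence (step (d)) signals only when all prior ops committed."""
    def __init__(self):
        self.pending = {}    # thread_id -> uncommitted ops, oldest first
        self.committed = {}  # thread_id -> count of committed ops

    def issue(self, thread_id, op):
        self.pending.setdefault(thread_id, []).append(op)

    def commit_oldest(self, thread_id):
        # Earlier operations become visible before later ones.
        op = self.pending[thread_id].pop(0)
        self.committed[thread_id] = self.committed.get(thread_id, 0) + 1
        return op

    def fence(self, thread_id):
        # True only once every prior load/store of the thread committed.
        return not self.pending.get(thread_id)
```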
10. A method for ensuring that memory operations carried out when executing a first thread are visible to a second thread that is executing on the same wave-ordered memory system, comprising the steps of:
(a) carrying out a memory operation in connection with execution of the first thread; and
(b) issuing an acknowledgement visible to the first thread and indicating that the memory operation carried out by the first thread has been completed, the acknowledgement also triggering the first thread to execute any further instruction that is next to be executed in the thread, and triggering the second thread to access data resulting from the memory operation.
Dependent claims: 11, 12.
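The acknowledgement mechanism of claim 10 can be sketched as follows. This is an illustrative Python model only (the names and the dict-backed store are invented): completing a write issues an acknowledgement, and a second thread may consume the result only once that acknowledgement exists.

```python
class AckMemory:
    """Toy model of claim 10: a completed memory operation produces an
    acknowledgement; only acknowledged data is visible to other threads."""
    def __init__(self):
        self.store = {}    # addr -> value
        self.acked = set() # addresses whose writes are acknowledged

    def write(self, addr, value):
        # Step (a): the first thread's memory operation completes;
        # step (b): an acknowledgement is issued for it.
        self.store[addr] = value
        self.acked.add(addr)
        return "ack"

    def read_if_acked(self, addr):
        # A second thread accesses the data only once acknowledged.
        if addr in self.acked:
            return self.store[addr]
        return None
```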
13. A method for use in a processing system in which a plurality of memory operations are executed in a defined order, comprising the steps of:
(a) executing a memory sequence start instruction to start a memory ordering sequence, so that in response to the memory sequence start instruction, a new memory sequence is created; and
(b) executing a memory sequence stop instruction to terminate the memory ordering sequence.
Dependent claims: 14-22.
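The start/stop pair of claim 13 amounts to opening and closing an ordering sequence. A minimal Python sketch, assuming sequences are named by integer ids (the ids and class name are illustrative, not the patent's encoding):

```python
class MemorySequencer:
    """Toy model of claim 13: a memory-sequence-start instruction
    creates a new ordering sequence; a memory-sequence-stop terminates it."""
    def __init__(self):
        self.next_id = 0
        self.active = set()  # ids of live memory ordering sequences

    def sequence_start(self):
        # Step (a): a new memory sequence is created.
        sid = self.next_id
        self.next_id += 1
        self.active.add(sid)
        return sid

    def sequence_stop(self, sid):
        # Step (b): the memory ordering sequence terminates.
        self.active.discard(sid)
```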
23. A method for synchronizing a plurality of threads in a dataflow processing architecture, comprising the steps of:
(a) providing tags to be used in identifying each individual dynamic instance of data used when executing the thread;
(b) annotating data values used in the plurality of threads, to include a specific thread identification indicating a specific thread with which each data value is associated, the thread identification being used in tokens for the instructions employed in the plurality of threads; and
(c) providing a thread coordinate instruction that executes when a data value of a first token supplied to a first input of the thread coordinate instruction matches a thread identification of a second token supplied to a second input of the thread coordinate instruction, producing an output token having a tag of the first input, and a data value from the second input.
Dependent claims: 24-27.
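The matching rule in step (c) of claim 23 is concrete enough to state directly in code. Below is a Python sketch (the `Token` tuple and the `None` return for a failed match are illustrative choices, not the patent's hardware behavior): the instruction fires only when the first token's data value equals the second token's thread id, and the output carries the first input's tag with the second input's data value.

```python
from collections import namedtuple

# A token's tag is its (thread_id, wave_number) pair; names follow the
# claims loosely and are chosen for this sketch.
Token = namedtuple("Token", "thread_id wave_number value")

def thread_coordinate(first, second):
    """Sketch of claim 23's thread coordinate instruction: executes when
    the DATA VALUE of the first token matches the THREAD ID of the second
    token, producing a token with the first input's tag and the second
    input's data value. Returns None when the match condition fails."""
    if first.value != second.thread_id:
        return None
    return Token(first.thread_id, first.wave_number, second.value)
```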
28. A method for use with a dataflow processing architecture that includes a plurality of processing elements employed for executing instructions of a plurality of threads, the method facilitating processing of both ordered instructions and unordered instructions by the plurality of processing elements, comprising the steps of:
(a) dividing a control flow graph of at least one thread into a plurality of waves, wherein each wave comprises a connected, directed acyclic portion of the control flow graph, with a single entrance;
(b) providing wave number tags to be used in identifying each individual dynamic instance of data used when executing the at least one thread, each memory operation being annotated with its location in the wave and in regard to its ordering relationship with other memory operations in the wave, to define a wave-ordered memory comprising the ordered memory instructions;
(c) executing at least a portion of the ordered memory instructions;
(d) providing an indication that at least the portion of the ordered instructions has completed;
(e) in response to the indication, enabling the unordered instructions to execute on at least one of the plurality of processing elements; and
(f) after the unordered instructions have executed, starting execution of another portion of the ordered instructions on at least one of the plurality of processing elements.
Dependent claims: 29-33.
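Steps (c)-(f) of claim 28 describe a hand-off between ordered and unordered instructions. A toy Python schedule, assuming the instructions are given as simple lists and `execute` is a callback standing in for dispatch to the processing elements (all names invented for this sketch):

```python
def run_mixed(first_ordered, unordered, next_ordered, execute):
    """Toy hand-off per claim 28: the unordered instructions run only
    after the first ordered portion signals completion, and the next
    ordered portion starts only after the unordered ones finish."""
    log = [execute(op) for op in first_ordered]  # (c) ordered portion runs
    log.append("complete")                       # (d) completion indication
    log += [execute(op) for op in unordered]     # (e) unordered now enabled
    log += [execute(op) for op in next_ordered]  # (f) next ordered portion
    return log
```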
34. A method for managing memory ordering hardware to allow storing of memory addresses and memory data comprising memory operations, so that a memory address and memory data can be supplied to the memory ordering hardware at different times, comprising the steps of:
(a) providing a partial store structure for temporarily storing memory addresses and memory data for memory operations, where the memory addresses and the memory data for a memory operation arrive at the partial store structure at different times;
(b) if a specific memory operation inserted into the memory ordering hardware is a memory load or a memory store, and if the memory address for the memory operation is already stored in the partial store structure, transferring the specific memory operation to the partial store structure;
(c) if a specific memory operation is a memory store, but a memory datum for the specific memory operation has not yet arrived at the memory ordering hardware, and if a partial store structure does not yet exist for the memory address of the specific memory operation, providing a new partial store structure for temporarily storing the memory store until its datum arrives at the memory ordering hardware; and
(d) once both the memory datum and the memory address for the specific memory operation have been inserted into the memory ordering hardware and temporarily stored in a partial store structure, transferring the memory datum and the memory address for all memory operations in the partial store structure to another portion of a memory system.
Dependent claims: 35-38.
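The partial store structure of claim 34 can be modeled as a small buffer. This is an illustrative Python sketch (class name, dict-backed memory, and string "operations" are all invented): a store whose address arrives before its data opens a partial store entry, later operations on that address queue behind it, and everything drains to the memory system once the datum arrives.

```python
class PartialStoreBuffer:
    """Toy model of claims 34-38: holds stores whose address and data
    arrive at the memory ordering hardware at different times."""
    def __init__(self, memory):
        self.memory = memory  # backing memory, addr -> value
        self.partial = {}     # addr -> ops queued behind the partial store

    def store_addr(self, addr):
        # Step (c): a store's address arrived but its datum has not,
        # so open a partial store entry for that address.
        self.partial.setdefault(addr, [])

    def queue_op(self, addr, op):
        # Step (b): a load or store to an address with a partial store
        # entry is transferred into that entry instead of executing.
        if addr in self.partial:
            self.partial[addr].append(op)
            return True
        return False

    def store_data(self, addr, value):
        # Step (d): the datum arrives, so complete the store and release
        # every queued operation to the rest of the memory system.
        queued = self.partial.pop(addr)
        self.memory[addr] = value
        return queued
```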
Specification