MAXIMIZED MEMORY THROUGHPUT ON PARALLEL PROCESSING DEVICES
First Claim
1. A method for processing an input data stream comprising a plurality of input data elements, the method comprising:
storing the input data elements of the input data stream in memory;
defining a number of thread arrays to be executed concurrently by parallel processing hardware, each thread array comprising a number of concurrent threads, each thread having a unique thread identifier and each thread array having a unique array identifier, wherein each thread is assigned to process one or more of the input data elements, an input data element for a given thread being selected based on the unique thread identifier and the unique array identifier associated with the thread;
executing, using the parallel processing hardware, the plurality of thread arrays to process the input data stream and write an output data stream to the memory, wherein executing one of the plurality of thread arrays includes:
organizing the threads of the thread array into one or more SIMD groups, wherein at least a first one of the SIMD groups includes a plurality of threads; and
retrieving the input data elements for all threads of the first SIMD group from the memory in a single memory access operation.
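The element-selection and grouping steps recited above can be illustrated with a small host-side simulation. This is a minimal sketch, not the claimed implementation: it assumes a linear mapping `array_id * threads_per_array + thread_id`, a SIMD group width of 4, and 4-byte elements, none of which the claim specifies. The names `element_index`, `simd_groups`, `group_addresses`, and `is_contiguous` are hypothetical.

```python
def element_index(array_id, thread_id, threads_per_array):
    """Select an input element from the array and thread identifiers.

    The linear mapping is an assumption made for this sketch.
    """
    return array_id * threads_per_array + thread_id


def simd_groups(threads_per_array, group_width):
    """Partition the thread IDs of one thread array into SIMD groups."""
    ids = list(range(threads_per_array))
    return [ids[i:i + group_width] for i in range(0, len(ids), group_width)]


def group_addresses(array_id, group, threads_per_array, element_size, base=0):
    """Byte addresses read by every thread of one SIMD group."""
    return [base + element_size * element_index(array_id, t, threads_per_array)
            for t in group]


def is_contiguous(addresses, element_size):
    """True when the addresses form one unbroken span of elements."""
    return all(b - a == element_size for a, b in zip(addresses, addresses[1:]))


# Hypothetical sizes: thread arrays of 8 threads, SIMD groups of 4 threads.
THREADS_PER_ARRAY, GROUP_WIDTH, ELEMENT_SIZE = 8, 4, 4
first_group = simd_groups(THREADS_PER_ARRAY, GROUP_WIDTH)[0]
addrs = group_addresses(1, first_group, THREADS_PER_ARRAY, ELEMENT_SIZE)
print(addrs, is_contiguous(addrs, ELEMENT_SIZE))
```

Under this assumed mapping, neighbouring threads of a SIMD group read neighbouring elements, so the group's reads cover one unbroken address span that a single wide memory access could serve.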
Abstract
In parallel processing devices performing streaming computations, processing each data element of the stream may not be computationally intensive, so the computation may take relatively little time compared with the memory access times required to read the stream and write the results. Memory throughput therefore often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
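A rough estimate can show why such a computation is limited by memory throughput rather than arithmetic. The sketch below is illustrative only: the bandwidth, arithmetic rate, element size, and operation count are hypothetical figures, not values from the patent, and the model assumes each element is read once and one result of the same size is written.

```python
def stream_times(n_elements, bytes_per_element, ops_per_element,
                 bandwidth_bytes_per_s, ops_per_s):
    """Return (memory_time, compute_time) in seconds for one pass over the stream."""
    # Assumption: one read and one same-sized write per element.
    bytes_moved = 2 * n_elements * bytes_per_element
    memory_time = bytes_moved / bandwidth_bytes_per_s
    compute_time = n_elements * ops_per_element / ops_per_s
    return memory_time, compute_time


# Hypothetical figures chosen only for illustration.
mem_t, cmp_t = stream_times(
    n_elements=1_000_000,
    bytes_per_element=4,
    ops_per_element=2,
    bandwidth_bytes_per_s=100e9,   # assumed 100 GB/s memory interface
    ops_per_s=1e12,                # assumed 1e12 operations per second
)
memory_bound = mem_t > cmp_t
print(f"memory {mem_t:.2e} s, compute {cmp_t:.2e} s, memory-bound: {memory_bound}")
```

With these assumed figures the memory traffic takes far longer than the arithmetic, so any gain in achieved memory throughput translates almost directly into a shorter run time.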
16 Claims
1. A method for processing an input data stream comprising a plurality of input data elements (independent claim; full text appears under "First Claim" above).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for processing an input data stream comprising a plurality of input data elements, the system comprising:
a memory adapted to store data including input data elements of an input data stream;
a parallel processing unit communicatively coupled to the memory and adapted to concurrently execute a plurality of thread arrays, each thread array comprising a plurality of concurrent threads, each thread having a unique thread identifier and each thread array having a unique array identifier, wherein each thread processes one or more of the input data elements, an input data element for a given thread being selected based on the unique thread identifier and the unique array identifier associated with the thread,
wherein the parallel processing hardware is further configured to execute the threads of each thread array in one or more SIMD groups and to retrieve the respective input data elements for all threads of a same one of the SIMD groups in a single memory access operation.
Dependent claims: 12, 13, 14, 15, 16
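Whether a SIMD group's reads can be served by a single memory access depends on how elements are assigned when each thread processes more than one. The sketch below compares two hypothetical assignments under an assumed cost model in which every distinct aligned 64-byte segment touched costs one access. Real hardware rules vary, and the names `transactions`, `interleaved`, and `blocked` are illustrative.

```python
def transactions(addresses, segment_bytes):
    """Count the distinct aligned memory segments touched by a set of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


def interleaved(thread_id, step, n_threads):
    """Thread t reads element t, then t + n_threads, and so on."""
    return thread_id + step * n_threads


def blocked(thread_id, step, elems_per_thread):
    """Thread t reads its own contiguous chunk of elems_per_thread elements."""
    return thread_id * elems_per_thread + step


GROUP = range(16)   # hypothetical SIMD group of 16 threads
ELEM, SEG = 4, 64   # assumed 4-byte elements and 64-byte segments
PER_THREAD = 8      # assumed number of elements each thread processes

# Addresses touched by the whole group on its first step under each assignment.
inter = transactions([ELEM * interleaved(t, 0, len(GROUP)) for t in GROUP], SEG)
block = transactions([ELEM * blocked(t, 0, PER_THREAD) for t in GROUP], SEG)
print(inter, block)
```

In this model the interleaved assignment keeps the group's reads inside one segment, while the blocked assignment spreads them over several, which is why the choice of mapping from identifiers to elements matters for throughput.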
Specification