Method for synchronizing independent cooperative thread arrays running on a graphics processing unit
Abstract
One embodiment of the present invention sets forth a technique for synchronizing the execution of multiple cooperative thread arrays (CTAs) implementing a parallel algorithm that is mapped onto a graphics processing unit. An array of semaphores provides synchronization status to each CTA, while one designated thread within each CTA provides updated status for the CTA. The designated thread within each participating CTA reports completion of a given computational phase by updating a current semaphore within the array of semaphores. The designated thread then polls the status of the current semaphore until all participating CTAs have reported completion of the current computational phase. After each CTA has completed the current computational phase, all participating CTAs may proceed to the next computational phase.
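The reporting and polling scheme described in the abstract can be sketched on the CPU. This is a minimal model under stated assumptions, not the patented implementation: each `std::thread` stands in for the designated thread of one CTA, `std::atomic<int>` stands in for a semaphore in GPU memory, and the names `PhaseSemaphores`, `arrive_and_wait`, and `run_two_phases` are hypothetical. On a GPU the increment would typically be an atomic add on global memory.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical model of the "array of semaphores": one counter per phase.
class PhaseSemaphores {
public:
    PhaseSemaphores(int num_phases, int num_ctas)
        : counters_(num_phases), num_ctas_(num_ctas) {
        for (auto& c : counters_) c.store(0);
    }

    // Called by the designated thread of a CTA: report completion of
    // `phase`, then poll until every participating CTA has reported.
    void arrive_and_wait(int phase) {
        counters_[phase].fetch_add(1);
        while (counters_[phase].load() < num_ctas_) {
            std::this_thread::yield();
        }
    }

private:
    std::vector<std::atomic<int>> counters_;
    int num_ctas_;
};

// Phase 0: each CTA writes its own slot.
// Phase 1: each CTA reads its neighbour's slot, which is only safe
// once all CTAs have finished phase 0.
std::vector<int> run_two_phases(int num_ctas) {
    assert(num_ctas > 0);
    PhaseSemaphores sems(2, num_ctas);
    std::vector<int> stage0(num_ctas, 0), stage1(num_ctas, 0);
    std::vector<std::thread> ctas;
    for (int cta = 0; cta < num_ctas; ++cta) {
        ctas.emplace_back([&, cta] {
            stage0[cta] = cta + 1;                            // phase 0 work
            sems.arrive_and_wait(0);                          // current semaphore
            stage1[cta] = stage0[(cta + 1) % num_ctas] * 10;  // phase 1 work
            sems.arrive_and_wait(1);                          // next semaphore
        });
    }
    for (auto& t : ctas) t.join();
    return stage1;
}
```

Calling `run_two_phases(4)` fills `stage0` with 1, 2, 3, 4 and returns 20, 30, 40, 10. Using a separate counter per phase means no counter needs to be reset while other CTAs may still be polling it.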
18 Citations
20 Claims
1. A method for synchronizing a plurality of cooperative thread arrays (CTAs) executing an algorithm within a parallel processing system, the method comprising:
for a first plurality of threads within each CTA of the plurality of CTAs, performing a first set of computations associated with a first pass of the algorithm;
for each CTA in the plurality of CTAs, performing a thread synchronization operation across the first plurality of threads with the CTA to ensure that all threads within the first plurality of threads within the CTA have completed the first set of computations;
for each CTA in the plurality of CTAs, performing an atomic add operation via a first thread within the CTA to increment a first semaphore to indicate that all threads within the first plurality of threads within the CTA have completed the first set of computations; and
for each CTA in the plurality of CTAs, performing a semaphore wait operation to ensure that each CTA within the plurality of CTAs has completed the first set of computations.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
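The four steps of claim 1 can be traced in a small CPU-side model. This is a sketch under assumptions, not the claimed GPU implementation: every simulated thread is a `std::thread`, the struct `CtaState` and the function `run_first_pass` are hypothetical names, and the intra-CTA synchronization is approximated with an `arrived` counter plus a `released` flag rather than a hardware barrier.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-CTA bookkeeping for this model (not from the patent).
struct CtaState {
    std::atomic<int> arrived{0};        // threads that finished the first pass
    std::atomic<bool> released{false};  // set once every CTA has reported
};

// Returns, for each CTA, the sum of all first-pass results as observed by
// its first thread after the semaphore wait.
std::vector<long> run_first_pass(int num_ctas, int threads_per_cta) {
    assert(num_ctas > 0 && threads_per_cta > 0);
    std::atomic<int> semaphore{0};  // plays the role of the "first semaphore"
    std::vector<CtaState> ctas(num_ctas);
    std::vector<int> pass1(num_ctas * threads_per_cta, 0);
    std::vector<long> seen(num_ctas, 0);
    std::vector<std::thread> workers;

    for (int cta = 0; cta < num_ctas; ++cta) {
        for (int tid = 0; tid < threads_per_cta; ++tid) {
            workers.emplace_back([&, cta, tid] {
                // Step 1: first set of computations.
                int gid = cta * threads_per_cta + tid;
                pass1[gid] = gid + 1;

                // Step 2: synchronization across the threads of this CTA.
                ctas[cta].arrived.fetch_add(1);
                if (tid == 0) {
                    while (ctas[cta].arrived.load() < threads_per_cta) {
                        std::this_thread::yield();
                    }
                    // Step 3: the first thread increments the semaphore.
                    semaphore.fetch_add(1);
                    // Step 4: wait until every CTA has reported.
                    while (semaphore.load() < num_ctas) {
                        std::this_thread::yield();
                    }
                    long sum = 0;
                    for (int v : pass1) sum += v;  // all results are now complete
                    seen[cta] = sum;
                    ctas[cta].released.store(true);
                } else {
                    // Other threads of the CTA wait for their first thread.
                    while (!ctas[cta].released.load()) {
                        std::this_thread::yield();
                    }
                }
            });
        }
    }
    for (auto& w : workers) w.join();
    return seen;
}
```

With 3 CTAs of 4 threads each, the first pass writes the values 1 through 12, so every CTA observes the full sum of 78 after the wait. Only one thread per CTA touches the shared semaphore, which keeps contention on it proportional to the number of CTAs rather than the number of threads.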
9. A non-transitory computer-readable medium including instructions that, when executed by a processing unit, cause the processing unit to synchronize a plurality of cooperative thread arrays (CTAs) executing an algorithm, by performing the steps of:
for a first plurality of threads within each CTA of the plurality of CTAs, performing a first set of computations associated with a first pass of the algorithm;
for each CTA in the plurality of CTAs, performing a thread synchronization operation across the first plurality of threads of each CTA to ensure that all threads within the first plurality of threads within the CTA have completed the first set of computations;
for each CTA in the plurality of CTAs, performing an atomic add operation via a first thread within the CTA to increment a first semaphore to indicate that all threads within the first plurality of threads within the CTA have completed the first set of computations; and
for each CTA in the plurality of CTAs, performing a semaphore wait operation to ensure that each CTA within the plurality of CTAs has completed the first set of computations.
Dependent claims: 10, 11, 12, 13, 14.
15. A computing device configured to synchronize a plurality of cooperative thread arrays (CTAs) executing an algorithm, the computing device comprising:
a memory; and
a parallel processing unit coupled to the memory, wherein the plurality of CTAs executes within the parallel processing unit, and the plurality of CTAs is configured such that;
a first plurality of threads within each CTA of the plurality of CTAs performs a first set of computations associated with a first pass of the algorithm,
each CTA performs a thread synchronization operation across the first plurality of threads to ensure that all threads within the first plurality of threads within the CTA have completed the first set of computations,
a first thread within each CTA performs an atomic add operation to increment a first semaphore to indicate that all threads within the first plurality of threads within the CTA have completed the first set of computations, and
each CTA in the plurality of CTAs performs a semaphore wait operation to ensure that each CTA within the plurality of CTAs has completed the first set of computations.
Dependent claims: 16, 17, 18, 19, 20.
Specification