Method for synchronizing independent cooperative thread arrays running on a graphics processing unit
Abstract
One embodiment of the present invention sets forth a technique for synchronizing the execution of multiple cooperative thread arrays (CTAs) implementing a parallel algorithm that is mapped onto a graphics processing unit. An array of semaphores provides synchronization status to each CTA, while one designated thread within each CTA provides updated status for the CTA. The designated thread within each participating CTA reports completion of a given computational phase by updating a current semaphore within the array of semaphores. The designated thread then polls the status of the current semaphore until all participating CTAs have reported completion of the current computational phase. After each CTA has completed the current computational phase, all participating CTAs may proceed to the next computational phase.
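The reporting and polling scheme described in the abstract can be sketched on a CPU as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each CTA is modeled as a group of Python threads, `threading.Barrier` stands in for the intra-CTA thread synchronization operation, and a lock-protected counter stands in for an atomic increment of the current semaphore. The function name `run_two_pass`, the two toy computation passes, and the single-phase semaphore array are all hypothetical choices made for the example.

```python
import threading
import time


def run_two_pass(num_ctas=4, threads_per_cta=4):
    """Simulate two computation passes separated by an inter-CTA barrier."""
    semaphores = [0]             # semaphore array: one slot per phase (assumption)
    sem_lock = threading.Lock()  # stands in for a hardware atomic increment
    # One reusable barrier per CTA models the intra-CTA synchronization.
    cta_barriers = [threading.Barrier(threads_per_cta) for _ in range(num_ctas)]
    pass1 = [[None] * threads_per_cta for _ in range(num_ctas)]
    pass2 = [[None] * threads_per_cta for _ in range(num_ctas)]

    def thread_body(cta, tid):
        # First pass: every thread produces one value.
        pass1[cta][tid] = cta * threads_per_cta + tid

        # Wait until all threads in this CTA have finished the first pass.
        cta_barriers[cta].wait()

        if tid == 0:
            # The designated thread reports completion for its CTA...
            with sem_lock:
                semaphores[0] += 1
            # ...then polls until every participating CTA has reported.
            while semaphores[0] < num_ctas:
                time.sleep(0)  # yield so other threads can make progress

        # Hold the remaining threads until the designated thread is released.
        cta_barriers[cta].wait()

        # Second pass: safely read a value written by a different CTA.
        neighbour = (cta + 1) % num_ctas
        pass2[cta][tid] = pass1[neighbour][tid] + 1

    workers = [
        threading.Thread(target=thread_body, args=(c, t))
        for c in range(num_ctas)
        for t in range(threads_per_cta)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return semaphores[0], pass2
```

Calling `run_two_pass(num_ctas=3, threads_per_cta=2)` returns the final semaphore value together with the second-pass results. The second pass reads data from a neighbouring CTA, which is the situation that makes a barrier across CTAs necessary in the first place. On a real GPU, a polling barrier of this kind would presumably also require that all participating CTAs be resident at once, since a CTA that has not started cannot report completion.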
16 Claims
1. A method for synchronizing a plurality of cooperative thread arrays (CTAs) executing an algorithm within a parallel processing system, wherein the plurality of CTAs includes a first CTA, the method comprising:
- for a first plurality of threads within the first CTA, performing a first set of computations associated with a first pass of the algorithm;
- performing a thread synchronization operation across the first plurality of threads to ensure that all threads within the first CTA have completed the first set of computations; and
- for a first thread within the first CTA, incrementing a value stored in a unique location associated with each CTA in the plurality of CTAs within a semaphore array when all threads within the first CTA have completed the first set of computations, wherein the value stored in the unique location indicates the number of CTAs included in the plurality of CTAs that have completed the first pass of the algorithm, and wherein a subgroup of threads within the first CTA and a subgroup of threads within each of the other CTAs in the plurality of CTAs polls the unique location to determine whether all threads within all CTAs have completed the first set of computations.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer-readable medium including instructions that, when executed by a processing unit, cause the processing unit to synchronize a plurality of cooperative thread arrays (CTAs) executing an algorithm and including a first CTA, by performing the steps of:
- for a first plurality of threads within the first CTA, performing a first set of computations associated with a first pass of the algorithm;
- performing a thread synchronization operation across the first plurality of threads to ensure that all threads within the first CTA have completed the first set of computations; and
- for a first thread within the first CTA, incrementing a value stored in a unique location associated with each CTA in the plurality of CTAs within a semaphore array when all threads within the first CTA have completed the first set of computations, wherein the value stored in the unique location indicates the number of CTAs included in the plurality of CTAs that have completed the first pass of the algorithm, and wherein a subgroup of threads within the first CTA and a subgroup of threads within each of the other CTAs in the plurality of CTAs polls the unique location to determine whether all threads within all CTAs have completed the first set of computations.
Dependent claims: 10, 11, 12, 13, 14, 15
16. A computing device configured to synchronize a plurality of cooperative thread arrays (CTAs) executing an algorithm and including a first CTA, the computing device comprising:
- a memory; and
- a parallel processing unit coupled to the memory, wherein the plurality of CTAs executes within the parallel processing unit and the plurality of CTAs is configured such that;
  - a first plurality of threads within the first CTA performs a first set of computations associated with a first pass of the algorithm,
  - the first CTA performs a thread synchronization operation across the first plurality of threads to ensure that all threads within the first CTA have completed the first set of computations; and
  - a first thread within the first CTA, incrementing a value stored in a unique location associated with each CTA in the plurality of CTAs within a semaphore array when all threads within the first CTA have completed the first set of computations, wherein the value stored in the unique location indicates the number of CTAs included in the plurality of CTAs that have completed the first pass of the algorithm, and wherein a subgroup of threads within the first CTA and a subgroup of threads within each of the other CTAs in the plurality of CTAs polls the unique location to determine whether all threads within all CTAs have completed the first set of computations.
Specification