Parallel Gibbs sampler using butterfly-patterned partial sums
First Claim
1. A method for parallel random sampling from a plurality of discrete probability distributions by one or more multiprocessors, wherein the method comprises:
- initializing a partial sums table based on a matrix product of parameters for the plurality of discrete probability distributions, wherein the partial sums table comprises a plurality of blocks, each of which has a maximum size W, which is a value that is based on a number of threads in a warp of at least one of the one or more multiprocessors;
applying one or more iterations of butterfly patterned replacements to each of the plurality of blocks such that a single row of each of the plurality of blocks is un-transposed; and
drawing, in parallel from each of the plurality of discrete probability distributions, at least one respective z value, wherein each parallel drawing performs, respectively;
a first binary search on the single un-transposed row of each of the plurality of blocks to determine a candidate block, anda second binary search within the candidate block to draw the at least one respective z value;
wherein the method is performed by one or more computing devices.
1 Assignment
0 Petitions
Accused Products
Abstract
An efficient parallel Gibbs sampler using butterfly-patterned partial sums is provided. Instead of building and searching a complete prefix sums table, an alternative “butterfly patterned partial sums table” is described that integrates a lightweight transposition and partial sums operation. Accordingly, the usual full matrix transposition and full prefix sums table building operations can be omitted in favor of building the butterfly-patterned partial sums table, which requires less computational and communication effort. This butterfly-patterned partial sums table is used by a modified binary search phase that calculates the needed prefix-sum table values on-the-fly using the butterfly-patterned partial sums table. Transposed memory access is also provided while avoiding the full matrix transform, providing significant performance benefits for highly parallel architectures, such as graphics processing units (GPUs) where 1-stride or sequential memory accesses are important for optimization.
15 Citations
24 Claims
-
1. A method for parallel random sampling from a plurality of discrete probability distributions by one or more multiprocessors, wherein the method comprises:
-
initializing a partial sums table based on a matrix product of parameters for the plurality of discrete probability distributions, wherein the partial sums table comprises a plurality of blocks, each of which has a maximum size W, which is a value that is based on a number of threads in a warp of at least one of the one or more multiprocessors; applying one or more iterations of butterfly patterned replacements to each of the plurality of blocks such that a single row of each of the plurality of blocks is un-transposed; and drawing, in parallel from each of the plurality of discrete probability distributions, at least one respective z value, wherein each parallel drawing performs, respectively; a first binary search on the single un-transposed row of each of the plurality of blocks to determine a candidate block, and a second binary search within the candidate block to draw the at least one respective z value; wherein the method is performed by one or more computing devices. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. One or more non-transitory computer-readable media storing one or more sequences of instructions which, when executed by one or more processors, cause parallel random sampling from a plurality of discrete probability distributions by one or more multiprocessors comprising:
-
initializing a partial sums table based on a matrix product of parameters for the plurality of discrete probability distributions, wherein the partial sums table comprises a plurality of blocks, each of which has a maximum size W, which is a value that is based on a number of threads in a warp of at least one of the one or more multiprocessors; applying one or more iterations of butterfly patterned replacements to each of the plurality of blocks such that a single row of each of the plurality of blocks is un-transposed; and drawing, in parallel from each of the plurality of discrete probability distributions, at least one respective z value, wherein each parallel drawing performs, respectively; a first binary search on the single un-transposed row of each of the plurality of blocks to determine a candidate block, and a second binary search within the candidate block to draw the at least one respective z value. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
-
-
17. A system comprising one or more multiprocessors configured to perform parallel random sampling from a plurality of discrete probability distributions by the one or more multiprocessors by being configured to:
-
initialize a partial sums table based on a matrix product of parameters for the plurality of discrete probability distributions, wherein the partial sums table comprises a plurality of blocks, each of which has a maximum size W, which is a value that is based on a number of threads in a warp of at least one of the one or more multiprocessors; apply one or more iterations of butterfly patterned replacements to each of the plurality of blocks such that a single row of each of the plurality of blocks is un-transposed; and draw, in parallel from each of the plurality of discrete probability distributions, at least one respective z value, wherein each parallel draw performs, respectively; a first binary search on the single un-transposed row of each of the plurality of blocks to determine a candidate block, and a second binary search within the candidate block to draw the at least one respective z value. - View Dependent Claims (18, 19, 20, 21, 22, 23, 24)
-
Specification