Parallel processing of data sets
Abstract
Systems, methods, and devices are described for implementing learning algorithms on data sets. A data set may be partitioned into a plurality of data partitions that may be distributed to two or more processors, such as a graphics processing unit. The data partitions may be processed in parallel by each of the processors to determine local counts associated with the data partitions. The local counts may then be aggregated to form a global count that reflects the local counts for the data set. The partitioning may be performed by a data partition algorithm and the processing and the aggregating may be performed by a parallel collapsed Gibbs sampling (CGS) algorithm and/or a parallel collapsed variational Bayesian (CVB) algorithm. In addition, the CGS and/or the CVB algorithms may be associated with the data partition algorithm and may be parallelized to train a latent Dirichlet allocation model.
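The partition, parallel local count, and aggregation flow described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the patent: plain word counts stand in for the counts a CGS or CVB algorithm would maintain, a thread pool stands in for the plurality of processors, and the function names `partition`, `local_count`, and `global_count` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(documents, num_partitions):
    """Split documents round-robin into disjoint subsets (hypothetical scheme)."""
    return [documents[i::num_partitions] for i in range(num_partitions)]


def local_count(docs):
    """Count word occurrences within a single data partition."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts


def global_count(documents, num_workers=2):
    """Process each partition in parallel, then aggregate the local counts."""
    parts = partition(documents, num_workers)
    # Each partition is handed to exactly one worker.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        local_counts = list(pool.map(local_count, parts))
    # Aggregation step: sum the local counts into one global count.
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return total


docs = ["topic model topic", "gibbs sampling", "topic sampling"]
counts = global_count(docs, num_workers=2)
print(counts)
```

Because the partitions are disjoint, the summed local counts equal the counts a single sequential pass over the whole data set would produce.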
20 Claims
1. A method comprising:
partitioning a data set into a plurality of data partitions, the partitioning including removing dependencies in the data set that require some of the data partitions to be processed sequentially rather than in parallel;
distributing the plurality of data partitions to a plurality of processors, each of the plurality of data partitions being assigned to a single one of the plurality of processors;
processing, by the plurality of processors, each of the plurality of data partitions in parallel; and
synchronizing the plurality of processors to obtain a global record corresponding to the processed data partitions.
Dependent claims: 2, 3, 4, 5, 6
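Claim 1 does not say how the dependencies are removed. One possible scheme, offered purely as an assumption and not as the claimed method, splits both the documents and the vocabulary into P groups and rotates the pairing in each epoch, so that no two processors ever touch the same document group or the same word group at the same time. The function name `diagonal_schedule` is hypothetical.

```python
def diagonal_schedule(num_processors):
    """Return, for each epoch, the (doc_group, word_group) block per processor.

    Assumed scheme: in epoch e, processor m works on documents from group m
    and words from group (m + e) mod P.
    """
    p = num_processors
    return [[(m, (m + epoch) % p) for m in range(p)] for epoch in range(p)]


schedule = diagonal_schedule(3)
for epoch, blocks in enumerate(schedule):
    print(f"epoch {epoch}: {blocks}")
```

Within any single epoch the blocks share no document group and no word group, so the processors can update their counts without conflicting writes; after P epochs every (document group, word group) block has been processed exactly once.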
7. A method comprising:
distributing subsets of a plurality of documents of a data set across a plurality of processors, the plurality of documents being partitioned into the subsets to remove dependencies between the plurality of documents, the dependencies between the plurality of documents causing the plurality of documents to be processed sequentially rather than in parallel;
processing, by a particular one of the plurality of processors, a particular subset of the plurality of documents in parallel with the plurality of processors to identify local counts associated with the subset of documents; and
aggregating the local counts from each of the processors to generate a global count that is representative of the data set.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15
16. A system comprising:
a plurality of processors; and
memory to store computer-executable instructions that, when executed by one of the plurality of processors, perform operations comprising:
distributing, across the plurality of processors, subsets of documents partitioned from a plurality of documents included in a data set;
determining, by each processor and in parallel with the plurality of processors, an expected local count corresponding to topics or words expected to be identified in the documents distributed to each processor; and
synchronizing, based at least in part on the expected local counts, the plurality of processors to determine variational parameters that represent a distribution of the topics or the words expected to be identified in the plurality of documents.
Dependent claims: 17, 18, 19, 20
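The expected local counts and the synchronization step of claim 16 can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed CVB procedure: the per-token topic probabilities are fixed hypothetical inputs rather than iteratively updated values, a thread pool stands in for the processors, and the names `expected_local_count`, `synchronize`, `phi`, and the smoothing value `beta` are all invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_TOPICS = 2
VOCAB = ["gpu", "topic", "word"]


def expected_local_count(tokens):
    """Sum soft topic assignments into a NUM_TOPICS x len(VOCAB) table.

    `tokens` is a list of (word, probabilities) pairs; the probabilities are
    hypothetical per-topic weights for that token.
    """
    table = [[0.0] * len(VOCAB) for _ in range(NUM_TOPICS)]
    for word, probs in tokens:
        w = VOCAB.index(word)
        for k in range(NUM_TOPICS):
            table[k][w] += probs[k]
    return table


def synchronize(local_tables, beta=0.01):
    """Sum the local tables, then normalize each topic row into a distribution."""
    phi = []
    for k in range(NUM_TOPICS):
        row = [sum(t[k][w] for t in local_tables) for w in range(len(VOCAB))]
        denom = sum(row) + beta * len(VOCAB)
        phi.append([(count + beta) / denom for count in row])
    return phi


# Two document subsets, one per (simulated) processor.
shards = [
    [("gpu", [0.9, 0.1]), ("topic", [0.2, 0.8])],
    [("word", [0.5, 0.5]), ("topic", [0.1, 0.9])],
]
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    local_tables = list(pool.map(expected_local_count, shards))
phi = synchronize(local_tables)
```

Each row of `phi` sums to one and can be read as a smoothed word distribution for one topic, built only from the per-processor expected counts.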
Specification