HASH PARTITIONING STREAMED DATA
First Claim
1. At a computer system including one or more processors and system memory, the computer system also including an input data stream, a plurality of worker threads, and a corresponding plurality of output data streams, each worker thread in the plurality of worker threads directly corresponding to an output data stream in the plurality of output data streams, each worker thread in the plurality of worker threads configured to determine the equivalence of data elements from the input data stream in accordance with an equivalence function and to assign data elements determined to be equivalent to the same worker thread for output on the worker'"'"'s corresponding output data stream such that equivalent data elements are output on the same output data stream, a method for distributing data elements from the input data stream across the plurality of output data streams, the method comprising:
- an act of a first worker thread, from among the plurality of worker threads, acquiring a lock on the input data stream, the lock shared among the plurality of worker threads;
an act of the first worker thread accessing one or more data elements from the input data stream subsequent to locking the input data stream;
an act of the first worker thread releasing the lock on the input data stream;
for each data element in the one or more data elements;
an act of the first worker thread using the equivalence function to identify the appropriate output data stream, from among the plurality of output data streams, for outputting the data element; and
an act of the first worker thread assigning the data element to a worker thread corresponding to the identified appropriate output data stream such that data elements determined to be equivalent by the equivalence function are assigned to the same worker thread; and
an act of the first worker thread determining if other worker threads assigned any data elements from the input data stream to the first worker thread for output on the first worker thread'"'"'s corresponding output data stream.
2 Assignments
0 Petitions
Accused Products
Abstract
The present invention extends to methods, systems, and computer program products for partitioning streaming data. Embodiments of the invention can be used to hash partition a stream of data and thus avoids unnecessary memory usage (e.g., associated with buffering). Hash partitioning can be used to split an input sequence (e.g., a data stream) into multiple partitions that can be processed independently. Other embodiments of the invention can be used to hash repartition a plurality of streams of data. Hash repartitioning converts a set of partitions into another set of partitions with the hash partitioned property. Partitioning and repartitioning can be done in a streaming manner at runtime by exchanging values between worker threads responsible for different partitions.
-
Citations
20 Claims
-
1. At a computer system including one or more processors and system memory, the computer system also including an input data stream, a plurality of worker threads, and a corresponding plurality of output data streams, each worker thread in the plurality of worker threads directly corresponding to an output data stream in the plurality of output data streams, each worker thread in the plurality of worker threads configured to determine the equivalence of data elements from the input data stream in accordance with an equivalence function and to assign data elements determined to be equivalent to the same worker thread for output on the worker'"'"'s corresponding output data stream such that equivalent data elements are output on the same output data stream, a method for distributing data elements from the input data stream across the plurality of output data streams, the method comprising:
-
an act of a first worker thread, from among the plurality of worker threads, acquiring a lock on the input data stream, the lock shared among the plurality of worker threads; an act of the first worker thread accessing one or more data elements from the input data stream subsequent to locking the input data stream; an act of the first worker thread releasing the lock on the input data stream; for each data element in the one or more data elements; an act of the first worker thread using the equivalence function to identify the appropriate output data stream, from among the plurality of output data streams, for outputting the data element; and an act of the first worker thread assigning the data element to a worker thread corresponding to the identified appropriate output data stream such that data elements determined to be equivalent by the equivalence function are assigned to the same worker thread; and an act of the first worker thread determining if other worker threads assigned any data elements from the input data stream to the first worker thread for output on the first worker thread'"'"'s corresponding output data stream. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19)
-
-
11. At a computer system including one or more processors and system memory, the computer system also including a plurality of input data streams, a plurality of worker threads, and a plurality of output data streams, each of the plurality of input data streams streaming a previously partitioned subset of data to one or more of the plurality of worker threads, data streamed on each input data stream determined to be equivalent by a prior equivalence function, each worker thread in the plurality of worker threads directly corresponding to an output data stream in the plurality of output data streams, each worker thread in the plurality of worker threads configured to determine the equivalence of data elements from the plurality of input data streams in accordance with an equivalence function and to assign data elements determined to be equivalent by the equivalence function to the same worker thread for output on the worker'"'"'s corresponding output data stream such that data elements determined to be equivalent by the equivalence function are output on the same output data stream, a method for repartitioning the plurality of input data streams, the method comprising:
at each worker thread in the plurality of worker threads; an act of the worker thread accessing one or more data elements from an input data stream, the data elements determined to be equivalent by the prior equivalence function; for each of the one or more data elements; an act of using the equivalence function to identify the appropriate output data stream, from among the plurality of output data streams, for outputting the data element; an act of assigning the data element to a worker thread corresponding to the identified appropriate output data stream such that data elements determined to be equivalent by the equivalence function are assigned to the same worker thread; an act of checking a storage buffer for data elements assigned to the worker thread by other worker threads, data elements in the storage buffer determined to be equivalent by the equivalence function; and an act of outputting any data elements in the storage buffer on the worker thread'"'"'s corresponding output data stream.
-
20. A computer program product for use at a computer system, the computer system including an input data stream, a plurality of worker threads, and a corresponding plurality of output data streams, each worker thread in the plurality of worker threads directly corresponding to an output data stream in the plurality of output data streams, each worker thread in the plurality of worker threads configured to determine the equivalence of data elements from the input data stream in accordance with an equivalence function and to assign data elements determined to be equivalent to the same worker thread for output on the worker'"'"'s corresponding output data stream, the computer program product for implementing a method for distributing data elements from the input data stream across the plurality of output data streams, the computer program product comprising one or more computer storage media having stored thereon computer-executable instructions that, when execution at a processor, cause the computer system to perform the method, including the following at a worker thread from among the plurality of worker threads:
-
acquire a lock on the input data stream, the lock shared among the plurality of worker threads; access one or more data elements from the input data stream subsequent to locking the input data stream; release the lock on the input data stream; for each data element in the one or more data elements; use the equivalence function to compute a hash code for the data element; map the data element to the appropriate output data stream, from among the plurality of output data streams, based on the hash code; and assign the data element to a worker thread corresponding to the appropriate output data stream such that data elements determined to be equivalent by the equivalence function are mapped to the same worker thread; and check a lock-free stack corresponding to the worker thread to determining if other worker threads assigned any data elements from the input data stream to the worker thread for output on the worker thread'"'"'s corresponding output data stream.
-
Specification