Hash partitioning streamed data
First Claim
1. At a computer system including one or more processors and system memory, a computer-implemented method for distributing data elements from an input data stream across a plurality of output data streams to facilitate parallel processing of the distributed data elements, the computer-implemented method comprising the following acts:
- receiving at a computing system an input data stream comprised of a plurality of data elements for distribution among a plurality of worker threads, with each worker thread providing a separate output data stream for those data elements output by it, and each worker thread having a storage buffer for receiving data elements from other worker threads;
at each worker thread that accesses data elements from the input data stream, performing the following acts;
prior to acquiring a lock on the input data stream when accessing data elements, each worker thread first checking its storage buffer to determine whether a data element has been assigned to it for output on that worker thread'"'"'s output data stream;
if a data element has been stored in its storage buffer, outputting any data elements stored in its storage buffer to its own output data stream prior to acquiring the lock on the input data stream;
if a data element has not been stored in its storage buffer, then acquiring the lock and using an equivalence function to identify a plurality of data elements that are essentially equivalent in terms of processing requirements; and
distributing at least some of the essentially equivalent data elements to one or more storage buffers of one or more other worker threads so that the at least some essentially equivalent data elements are output on separate output data streams corresponding to the one or more other worker threads so as to facilitate parallel processing of the distributed data elements.
2 Assignments
0 Petitions
Accused Products
Abstract
The present invention extends to methods, systems, and computer program products for partitioning streaming data. Embodiments of the invention can be used to hash partition a stream of data and thus avoids unnecessary memory usage (e.g., associated with buffering). Hash partitioning can be used to split an input sequence (e.g., a data stream) into multiple partitions that can be processed independently. Other embodiments of the invention can be used to hash repartition a plurality of streams of data. Hash repartitioning converts a set of partitions into another set of partitions with the hash partitioned property. Partitioning and repartitioning can be done in a streaming manner at runtime by exchanging values between worker threads responsible for different partitions.
-
Citations
10 Claims
-
1. At a computer system including one or more processors and system memory, a computer-implemented method for distributing data elements from an input data stream across a plurality of output data streams to facilitate parallel processing of the distributed data elements, the computer-implemented method comprising the following acts:
-
receiving at a computing system an input data stream comprised of a plurality of data elements for distribution among a plurality of worker threads, with each worker thread providing a separate output data stream for those data elements output by it, and each worker thread having a storage buffer for receiving data elements from other worker threads; at each worker thread that accesses data elements from the input data stream, performing the following acts; prior to acquiring a lock on the input data stream when accessing data elements, each worker thread first checking its storage buffer to determine whether a data element has been assigned to it for output on that worker thread'"'"'s output data stream; if a data element has been stored in its storage buffer, outputting any data elements stored in its storage buffer to its own output data stream prior to acquiring the lock on the input data stream; if a data element has not been stored in its storage buffer, then acquiring the lock and using an equivalence function to identify a plurality of data elements that are essentially equivalent in terms of processing requirements; and distributing at least some of the essentially equivalent data elements to one or more storage buffers of one or more other worker threads so that the at least some essentially equivalent data elements are output on separate output data streams corresponding to the one or more other worker threads so as to facilitate parallel processing of the distributed data elements. - View Dependent Claims (2, 3, 4)
-
-
5. At a computer system including one or more processors and system memory, a computer-implemented method for distributing data elements from a plurality of input data streams across a plurality of output data streams to facilitate parallel processing of the distributed data elements, the computer-implemented method comprising:
-
receiving at a computing system a plurality of input data streams each comprised of a plurality of data elements for distribution among a plurality of worker threads, with each worker thread providing a separate output data stream for those data elements output by it, and each worker thread having a storage buffer for receiving data elements from other worker threads; at each worker thread in the plurality of worker threads that accesses one or more data elements from an input stream, performing the following acts; prior to acquiring a lock on the input data stream when accessing data elements, each worker thread first checking its storage buffer to determine whether a data element has been assigned to it for output on that worker thread'"'"'s output data stream; if a data element has been stored in its storage buffer, outputting any data elements stored in its storage buffer to its own output data stream prior to acquiring the lock on the input data stream; if a data element has not been stored in its storage buffer, then acquiring the lock and using an equivalence function to identify a plurality of data elements that are essentially equivalent in terms of processing requirements and also using the equivalence function to identify an output data stream for outputting each of the essentially equivalent data elements; and distributing at least some of the essentially equivalent data elements to the storage buffer of one or more other worker threads corresponding to the identified output data stream for the essentially equivalent data elements so that the at least some essentially equivalent data elements are output on separate output data streams corresponding to the one or more other worker threads so as to facilitate parallel processing of the distributed data elements. - View Dependent Claims (6, 7, 8)
-
-
9. In a computer system including one or more processors and system memory, a computer program product comprising physical storage media having computer-executable instructions stored thereon, which when executed by one or more processors, cause the computing system to perform a computer-implemented method for distributing data elements from an input data stream across a plurality of output data streams to facilitate parallel processing of the distributed data elements, and wherein the computer-implemented method is comprised of the following acts:
-
receiving at a computing system an input data stream comprised of a plurality of data elements for distribution among a plurality of worker threads, with each worker thread providing a separate output data stream for those data elements output by it, and each worker thread having a storage buffer for receiving data elements from other worker threads; at each worker thread that accesses data elements from the input data stream, performing the following acts; prior to acquiring a lock on the input data stream when accessing data elements, each worker thread first checking its storage buffer to determine whether a data element has been assigned to it for output on that worker thread'"'"'s output data stream; if a data element has been stored in its storage buffer, outputting any data elements stored in its storage buffer to its own output data stream prior to acquiring the lock on the input data stream; if a data element has not been stored in its storage buffer, then acquiring the lock and using an equivalence function to identify a plurality of data elements that are essentially equivalent in terms of processing requirements; and distributing at least some of the essentially equivalent data elements to one or more storage buffers of one or more other worker threads so that the at least some essentially equivalent data elements are output on separate output data streams corresponding to the one or more other worker threads so as to facilitate parallel processing of the distributed data elements.
-
-
10. In a computer system including one or more processors and system memory, a computer program product comprising physical storage media having computer-executable instructions stored thereon, which when executed by one or more processors, cause the computing system to perform a computer-implemented method for distributing data elements from an input data stream across a plurality of output data streams to facilitate parallel processing of the distributed data elements, and wherein the computer-implemented method is comprised of the following acts:
-
receiving at a computing system a plurality of input data streams each comprised of a plurality of data elements for distribution among a plurality of worker threads, with each worker thread providing a separate output data stream for those data elements output by it, and each worker thread having a storage buffer for receiving data elements from other worker threads; at each worker thread in the plurality of worker threads that accesses one or more data elements from an input stream, performing the following acts; prior to acquiring a lock on the input data stream when accessing data elements, each worker thread first checking its storage buffer to determine whether a data element has been assigned to it for output on that worker thread'"'"'s output data stream; if a data element has been stored in its storage buffer, outputting any data elements stored in its storage buffer to its own output data stream prior to acquiring the lock on the input data stream; if a data element has not been stored in its storage buffer, then acquiring the lock and using an equivalence function to identify a plurality of data elements that are essentially equivalent in terms of processing requirements and also using the equivalence function to identify an output data stream for outputting each of the essentially equivalent data elements; and distributing at least some of the essentially equivalent data elements to the storage buffer of one or more other worker threads corresponding to the identified output data stream for the essentially equivalent data elements so that the at least some essentially equivalent data elements are output on separate output data streams corresponding to the one or more other worker threads so as to facilitate parallel processing of the distributed data elements.
-
Specification