Handling multiple task sequences in a stream processing framework
First Claim
1. A method of handling multiple task sequences, including long-tail task sequences, the method comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
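For illustration only (not part of the claims), the dispatch decision above can be sketched in Python. All names here are hypothetical, and `ThreadPoolExecutor` stands in for the container's pool of physical worker threads:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Placeholder for the pipeline's per-batch work.
    return sum(batch)

def dispatch_batches(batches, physical_threads, logical_parallelism):
    """Run batches concurrently when the count of physical threads
    equals or exceeds the configured logical parallelism; otherwise
    multiplex the batches sequentially over the available threads."""
    if physical_threads >= logical_parallelism:
        # Enough physical threads: process all batches concurrently.
        with ThreadPoolExecutor(max_workers=physical_threads) as pool:
            return list(pool.map(process_batch, batches))
    # Fewer physical threads than logical threads: run each batch
    # to completion before starting the next one.
    return [process_batch(b) for b in batches]
```

`pool.map` preserves input order, so both branches return results in the same order the batches were dispatched.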
Abstract
The technology disclosed improves existing stream processing systems by enabling resources within the infrastructure of a stream processing system to be both scaled up and scaled down. In particular, the technology disclosed relates to a dispatch system for a stream processing system that adapts its behavior to the system's computational capacity based on a run-time evaluation. The technical solution includes, during run-time execution of a pipeline, comparing a count of available physical threads against a set number of logically parallel threads. When the count of available physical threads equals or exceeds the number of logically parallel threads, the solution concurrently processes the batches at the physical threads. When there are fewer available physical threads than logically parallel threads, the solution multiplexes the batches sequentially over the available physical threads.
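The sequential multiplexing described in the abstract, where each batch runs until completion or a time budget expires before the next batch starts, can be sketched as follows. This is an illustrative simplification, not the disclosed implementation; `multiplex` and its parameters are hypothetical names:

```python
import concurrent.futures

def multiplex(batches, worker, timeout_s=5.0):
    """Run batches one at a time on a single physical thread, waiting
    for each batch to complete or time out before the next begins."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for batch in batches:
            future = pool.submit(worker, batch)
            try:
                results.append(future.result(timeout=timeout_s))
            except concurrent.futures.TimeoutError:
                # Batch exceeded its time budget; record the miss and
                # move on (a real system would also reclaim the thread).
                results.append(None)
    return results
```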
17 Claims
1. A method of handling multiple task sequences, including long-tail task sequences, the method comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
Dependent claims: 2, 3, 4, 5, 6.
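The final claim element, a scheduler that runs a first pipeline's batches ahead of a second pipeline's according to an assigned priority level, can be illustrated with a priority queue. This is a sketch under assumed semantics (lower numbers run first, ties keep arrival order); `PipelineScheduler` is a hypothetical name:

```python
import heapq
import itertools

class PipelineScheduler:
    """Orders batch execution across pipelines by pipeline priority.
    Lower priority numbers run first; ties keep enqueue (FIFO) order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving FIFO order

    def enqueue(self, priority, pipeline, batch):
        heapq.heappush(self._heap, (priority, next(self._seq), pipeline, batch))

    def drain(self):
        """Pop all queued batches in execution order."""
        order = []
        while self._heap:
            _, _, pipeline, batch = heapq.heappop(self._heap)
            order.append((pipeline, batch))
        return order
```

With a first pipeline at priority 1 and a second at priority 2, every queued batch of the first pipeline drains before any batch of the second, matching the claimed ordering.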
7. A system for handling multiple task sequences including long-tail task sequences, the system including one or more processors coupled to memory and configured to perform operations comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
Dependent claims: 8, 9, 10, 11, 12.
13. A non-transitory computer-readable storage medium with computer program instructions stored thereon for handling multiple task sequences including long-tail task sequences, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform operations comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
Dependent claims: 14, 15, 16, 17.
Specification