Handling multiple task sequences in a stream processing framework
First Claim
1. A method of handling multiple task sequences, including long-tail task sequences, the method comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
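For illustration only (not part of the claims), the dispatch decision above can be sketched in Python. All names here are hypothetical, and `ThreadPoolExecutor` stands in for the container's pool of physical worker threads:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Placeholder for the pipeline's per-batch work.
    return sum(batch)

def dispatch_batches(batches, physical_threads, logical_parallelism):
    """Run batches concurrently when the count of physical threads
    equals or exceeds the configured logical parallelism; otherwise
    multiplex the batches sequentially over the available threads."""
    if physical_threads >= logical_parallelism:
        # Enough physical threads: process all batches concurrently.
        with ThreadPoolExecutor(max_workers=physical_threads) as pool:
            return list(pool.map(process_batch, batches))
    # Fewer physical threads than logical threads: run each batch
    # to completion before starting the next one.
    return [process_batch(b) for b in batches]
```

`pool.map` preserves input order, so both branches return results in the same order the batches were dispatched.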
Abstract
The technology disclosed improves existing stream processing systems by enabling resources within the infrastructure of a stream processing system to be both scaled up and scaled down. In particular, the technology disclosed relates to a dispatch system for a stream processing system that adapts its behavior to the system's computational capacity based on a run-time evaluation. The technical solution includes, during run-time execution of a pipeline, comparing a count of available physical threads against a set number of logically parallel threads. When the count of available physical threads equals or exceeds the number of logically parallel threads, the solution concurrently processes the batches at the physical threads. When there are fewer available physical threads than logically parallel threads, the solution multiplexes the batches sequentially over the available physical threads.
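The sequential multiplexing described in the abstract, where each batch runs until completion or a time budget expires before the next batch starts, can be sketched as follows. This is an illustrative simplification, not the disclosed implementation; `multiplex` and its parameters are hypothetical names:

```python
import concurrent.futures

def multiplex(batches, worker, timeout_s=5.0):
    """Run batches one at a time on a single physical thread, waiting
    for each batch to complete or time out before the next begins."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for batch in batches:
            future = pool.submit(worker, batch)
            try:
                results.append(future.result(timeout=timeout_s))
            except concurrent.futures.TimeoutError:
                # Batch exceeded its time budget; record the miss and
                # move on (a real system would also reclaim the thread).
                results.append(None)
    return results
```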
17 Claims
1. A method of handling multiple task sequences, including long-tail task sequences, the method comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
Dependent claims: 2, 3, 4, 5, 6.
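The final claim element, a scheduler that runs a first pipeline's batches ahead of a second pipeline's according to an assigned priority level, can be illustrated with a priority queue. This is a sketch under assumed semantics (lower numbers run first, ties keep arrival order); `PipelineScheduler` is a hypothetical name:

```python
import heapq
import itertools

class PipelineScheduler:
    """Orders batch execution across pipelines by pipeline priority.
    Lower priority numbers run first; ties keep enqueue (FIFO) order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving FIFO order

    def enqueue(self, priority, pipeline, batch):
        heapq.heappush(self._heap, (priority, next(self._seq), pipeline, batch))

    def drain(self):
        """Pop all queued batches in execution order."""
        order = []
        while self._heap:
            _, _, pipeline, batch = heapq.heappop(self._heap)
            order.append((pipeline, batch))
        return order
```

With a first pipeline at priority 1 and a second at priority 2, every queued batch of the first pipeline drains before any batch of the second, matching the claimed ordering.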
7. A system for handling multiple task sequences including long-tail task sequences, the system including one or more processors coupled to memory and configured to perform operations comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
Dependent claims: 8, 9, 10, 11, 12.
13. A non-transitory computer-readable storage medium with computer program instructions stored thereon for handling multiple task sequences including long-tail task sequences, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform operations comprising:
defining a container over worker nodes having up to one physical thread per processor core of a worker node;
for multiple task sequences, queuing data from a plurality of incoming near real-time (NRT) data streams in at least one pipeline running in the container;
determining, based on analysis of a given NRT data stream for processing by at least one task sequence of the multiple task sequences, that a data volume for the at least one task sequence is expected to decrease after a surge in the data volume;
processing data from the plurality of NRT data streams in a plurality of batches via a container-coordinator configured to control batch dispatching;
dispatching the plurality of batches to physical threads of the worker nodes for processing, wherein the dispatching comprises:
during execution, comparing a count of physical threads available for batch processing against a set number of logically parallel threads available for batch dispatching;
when a count of the physical threads available for batch processing equals or exceeds the number of logically parallel threads available for batch dispatching, concurrently processing the plurality of batches at the physical threads; and
when the count of physical threads available for batch processing is less than the number of logically parallel threads available for batch dispatching, multiplexing the plurality of batches sequentially over at least one of the physical threads available for batch processing, including processing a batch of the plurality of batches in the at least one pipeline until completion or timeout before processing a next batch of the plurality of batches in the at least one pipeline; and
assigning, via a scheduler, a priority level to at least a first pipeline and a second pipeline, wherein execution of a first number of batches of the plurality of batches in the first pipeline is performed before execution of a second number of batches of the plurality of batches in the second pipeline, according to the priority level.
Dependent claims: 14, 15, 16, 17.
Specification