Scalable distributed data streaming computations across multiple data processing clusters

  • US 10,404,787 B1
  • Filed: 08/22/2017
  • Issued: 09/03/2019
  • Est. Priority Date: 04/06/2015
  • Status: Active Grant
First Claim

1. A method comprising:

  • initiating distributed data streaming computations across a plurality of data processing clusters associated with respective data zones;

    in each of the data processing clusters, separating a data stream provided by a data source of the corresponding data zone into a plurality of data batches and processing the data batches to generate respective result batches;

    associating multiple ones of the data batches across the data processing clusters with a global data batch data structure;

    associating multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and

    processing the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations;

    wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed data streaming computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes;

    wherein the global data batch data structure is organized in levels with different levels of the global data batch data structure corresponding to respective ones of the levels of the global computation graph and wherein a given one of the levels of the global data batch data structure comprises data batches generated by nodes of the corresponding level in the global computation graph, the data batches at the given level of the global data batch data structure being approximately synchronized with one another as belonging to a common iteration of a global data stream data structure based at least in part on at least one of a time interval during which the data batch was generated, a sequence number associated with generation of the data batch and a time-stamp associated with generation of the data batch; and

    wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
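
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not taken from the patent and is not the patented implementation. All names (`DataBatch`, `ResultBatch`, `split_into_batches`, `build_global_data_batches`, `run`) are hypothetical. It assumes that batches are aligned by sequence number only, that each cluster reduces a batch with a simple sum, and that all clusters run in one process. The claim also allows alignment by time interval or time-stamp and requires only approximate synchronization.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class DataBatch:
    cluster: str            # hypothetical name of the cluster (graph node)
    level: int              # level of that node in the global computation graph
    seq: int                # sequence number used here to align iterations
    items: Tuple[int, ...]  # records from the cluster's local data stream


@dataclass(frozen=True)
class ResultBatch:
    cluster: str
    level: int
    seq: int
    value: int              # local result computed from one data batch


def split_into_batches(cluster: str, level: int, stream: Iterable[int],
                       batch_size: int) -> List[DataBatch]:
    """Separate one cluster's data stream into fixed-size data batches."""
    items = list(stream)
    return [
        DataBatch(cluster, level, seq, tuple(items[i:i + batch_size]))
        for seq, i in enumerate(range(0, len(items), batch_size))
    ]


def build_global_data_batches(
    batches: Iterable[DataBatch],
) -> Dict[Tuple[int, int], List[DataBatch]]:
    """Group batches by (graph level, iteration).

    Assumption: batches with the same sequence number belong to the same
    iteration of the global data stream.
    """
    grouped: Dict[Tuple[int, int], List[DataBatch]] = defaultdict(list)
    for batch in batches:
        grouped[(batch.level, batch.seq)].append(batch)
    return dict(grouped)


def run(streams: Dict[str, Iterable[int]], levels: Dict[str, int],
        batch_size: int,
        local_fn: Callable[[Iterable[int]], int] = sum,
        global_fn: Callable[[Iterable[int]], int] = sum) -> List[int]:
    """Return a global result stream with one value per iteration."""
    all_batches: List[DataBatch] = []
    for cluster, stream in streams.items():
        all_batches.extend(
            split_into_batches(cluster, levels[cluster], stream, batch_size))

    global_data = build_global_data_batches(all_batches)

    # The global result batch structure reuses the keys of the global data
    # batch structure, so each result batch stays tied to its data batch.
    global_results = {
        key: [ResultBatch(b.cluster, b.level, b.seq, local_fn(b.items))
              for b in group]
        for key, group in global_data.items()
    }

    # Combine the result batches of each iteration across all levels.
    by_seq: Dict[int, List[int]] = defaultdict(list)
    for (_level, seq), results in global_results.items():
        by_seq[seq].extend(r.value for r in results)
    return [global_fn(by_seq[seq]) for seq in sorted(by_seq)]
```

For example, with two hypothetical clusters `"a"` and `"b"` holding the streams `[1, 2, 3, 4]` and `[10, 20, 30, 40]` and a batch size of 2, `run` yields one global result per iteration: `[33, 77]`. Keying both structures by `(level, seq)` is one simple way to mirror the claim's requirement that the global result batch structure be based on the global data batch structure.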

  • 7 Assignments