Scalable distributed data streaming computations across multiple data processing clusters
First Claim
1. A method comprising:
- initiating distributed data streaming computations across a plurality of data processing clusters associated with respective data zones;
in each of the data processing clusters, separating a data stream provided by a data source of the corresponding data zone into a plurality of data batches and processing the data batches to generate respective result batches;
associating multiple ones of the data batches across the data processing clusters with a global data batch data structure;
associating multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and
processing the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations;
wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed data streaming computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes;
wherein the global data batch data structure is organized in levels with different levels of the global data batch data structure corresponding to respective ones of the levels of the global computation graph and wherein a given one of the levels of the global data batch data structure comprises data batches generated by nodes of the corresponding level in the global computation graph, the data batches at the given level of the global data batch data structure being approximately synchronized with one another as belonging to a common iteration of a global data stream data structure based at least in part on at least one of a time interval during which the data batch was generated, a sequence number associated with generation of the data batch and a time-stamp associated with generation of the data batch; and
wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
7 Assignments
0 Petitions
Accused Products
Abstract
An apparatus in one embodiment comprises at least one processing device having a processor coupled to a memory. The processing device is configured to initiate distributed data streaming computations across data processing clusters associated with respective data zones, and in each of the data processing clusters, to separate a data stream provided by a data source of the corresponding data zone into a plurality of data batches and process the data batches to generate respective result batches. Multiple ones of the data batches across the data processing clusters are associated with a global data batch data structure, and multiple ones of the result batches across the data processing clusters are associated with a global result batch data structure based at least in part on the global data batch data structure. The result batches are processed in accordance with the global result batch data structure to generate one or more global result streams.
-
Citations
25 Claims
-
1. A method comprising:
-
initiating distributed data streaming computations across a plurality of data processing clusters associated with respective data zones; in each of the data processing clusters, separating a data stream provided by a data source of the corresponding data zone into a plurality of data batches and processing the data batches to generate respective result batches; associating multiple ones of the data batches across the data processing clusters with a global data batch data structure; associating multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and processing the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations; wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed data streaming computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes; wherein the global data batch data structure is organized in levels with different levels of the global data batch data structure corresponding to respective ones of the levels of the global computation graph and wherein a given one of the levels of the global data batch data structure comprises data batches generated by nodes of the corresponding level in the global computation graph, the data batches at the given level of the global data batch data structure being approximately synchronized with one another as belonging to a common iteration of a global data stream data structure based at least in part on at least one of a time interval during which the data batch was generated, a sequence number associated with generation of the data batch and a time-stamp associated with generation of the data batch; and wherein the method is performed by at least one processing device comprising a processor coupled to a memory. - View Dependent Claims (2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 15, 16, 17, 18, 19)
-
-
7. A method comprising:
-
initiating distributed data streaming computations across a plurality of data processing clusters associated with respective data zones; in each of the data processing clusters, separating a data stream provided by a data source of the corresponding data zone into a plurality of data batches and processing the data batches to generate respective result batches; associating multiple ones of the data batches across the data processing clusters with a global data batch data structure; associating multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and processing the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations; wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed data streaming computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes; wherein the global result batch data structure is organized in levels with different levels of the global result batch data structure corresponding to respective ones of the levels of the global computation graph and wherein a given one of the levels of the global result batch data structure comprises result batches generated by nodes of the corresponding level in the global computation graph, the result batches at the given level of the global result batch data structure being approximately synchronized with one another as belonging to a common iteration of a global data stream data structure based at least in part on at least one of a time interval during which the result batch was generated, a sequence number associated with generation of the result batch and a time-stamp associated with generation of the result batch; and wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
-
-
11. A method comprising:
-
initiating distributed data streaming computations across a plurality of data processing clusters associated with respective data zones; in each of the data processing clusters, separating a data stream provided by a data source of the corresponding data zone into a plurality of data batches and processing the data batches to generate respective result batches; associating multiple ones of the data batches across the data processing clusters with a global data batch data structure; associating multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and processing the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations; wherein the global data batch data structure comprises a plurality of local data batch data structures of respective ones of the data processing clusters and wherein at least a subset of the local data batch data structures have respective different formats so as to support local data batch heterogeneity across the data processing clusters; wherein at least one of the local data batch data structures itself comprises a global data batch data structure having a plurality of additional local data batch data structures of respective additional data processing clusters associated therewith; and wherein the method is performed by at least one processing device comprising a processor coupled to a memory. - View Dependent Claims (24)
-
-
13. A method comprising:
-
initiating distributed data streaming computations across a plurality of data processing clusters associated with respective data zones; in each of the data processing clusters, separating a data stream provided by a data source of the corresponding data zone into a plurality of data batches and processing the data batches to generate respective result batches; associating multiple ones of the data batches across the data processing clusters with a global data batch data structure; associating multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and processing the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations; wherein the global result batch data structure comprises a plurality of local result batch data structures of respective ones of the data processing clusters and wherein at least a subset of the local result batch data structures have respective different formats so as to support local result batch heterogeneity across the data processing clusters; wherein at least one of the local result batch data structures itself comprises a global result batch data structure having a plurality of additional local result batch data structures of respective additional data processing clusters associated therewith; and wherein the method is performed by at least one processing device comprising a processor coupled to a memory. - View Dependent Claims (25)
-
-
20. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes said at least one processing device:
-
to initiate distributed data streaming computations across a plurality of data processing clusters associated with respective data zones; in each of the data processing clusters, to separate a data stream provided by a data source of the corresponding data zone into a plurality of data batches and process the data batches to generate respective result batches; to associate multiple ones of the data batches across the data processing clusters with a global data batch data structure; to associate multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and to process the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations; wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed data streaming computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes; and wherein the global data batch data structure is organized in levels with different levels of the global data batch data structure corresponding to respective ones of the levels of the global computation graph and wherein a given one of the levels of the global data batch data structure comprises data batches generated by nodes of the corresponding level in the global computation graph, the data batches at the given level of the global data batch data structure being approximately synchronized with one another as belonging to a common iteration of a global data stream data structure based at least in part on at least one of a time interval during which the data batch was generated, a sequence number associated with generation of the data batch and a time-stamp associated with generation of the data batch. - View Dependent Claims (21)
-
-
22. An apparatus comprising:
-
at least one processing device having a processor coupled to a memory; wherein said at least one processing device is configured; to initiate distributed data streaming computations across a plurality of data processing clusters associated with respective data zones; in each of the data processing clusters, to separate a data stream provided by a data source of the corresponding data zone into a plurality of data batches and process the data batches to generate respective result batches; to associate multiple ones of the data batches across the data processing clusters with a global data batch data structure; to associate multiple ones of the result batches across the data processing clusters with a global result batch data structure based at least in part on the global data batch data structure; and to process the result batches in accordance with the global result batch data structure to generate one or more global result streams providing global results of the distributed data streaming computations; wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed data streaming computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes; and wherein the global data batch data structure is organized in levels with different levels of the global data batch data structure corresponding to respective ones of the levels of the global computation graph and wherein a given one of the levels of the global data batch data structure comprises data batches generated by nodes of the corresponding level in the global computation graph, the data batches at the given level of the global data batch data structure being approximately synchronized with one another as belonging to a common iteration of a global data stream data structure based at least in part on at least one of a time interval during which the data batch was generated, a sequence number associated with generation of the data batch and a time-stamp associated with generation of the data batch. - View Dependent Claims (23)
-
Specification