Scalable distributed computations utilizing multiple distinct computational frameworks
Abstract
An apparatus in one embodiment comprises at least one processing device having a processor coupled to a memory. The processing device is configured to initiate distributed computations across a plurality of data processing clusters associated with respective data zones, and to combine local processing results of the distributed computations from respective ones of the data processing clusters. Each of the data processing clusters is configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster. A first one of data processing clusters utilizes a first local data structure configured to support a first computational framework, and at least a second one of the data processing clusters utilizes a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework.
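As a rough illustration of the arrangement the abstract describes, the sketch below models two clusters in different data zones, each with its own local data structure and its own way of computing, plus a global data structure that combines their local results. Every name in it (`BatchCluster`, `StreamingCluster`, `GlobalResult`, `run_distributed`) is hypothetical and does not come from the patent. The plain-Python word counts are stand-ins for whatever computational frameworks the clusters would actually run.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BatchCluster:
    """Hypothetical cluster whose local data structure is one flat list of records."""
    zone: str
    records: List[str]

    def run_local(self) -> Dict[str, int]:
        # Batch-style stand-in: count every record in a single pass.
        counts: Dict[str, int] = {}
        for word in self.records:
            counts[word] = counts.get(word, 0) + 1
        return counts


@dataclass
class StreamingCluster:
    """Hypothetical cluster whose local data structure is a sequence of micro-batches."""
    zone: str
    micro_batches: List[List[str]]

    def run_local(self) -> Dict[str, int]:
        # Streaming-style stand-in: update running counts one micro-batch at a time.
        counts: Dict[str, int] = {}
        for batch in self.micro_batches:
            for word in batch:
                counts[word] = counts.get(word, 0) + 1
        return counts


@dataclass
class GlobalResult:
    """Hypothetical global data structure keyed by data zone."""
    per_zone: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def add(self, zone: str, local: Dict[str, int]) -> None:
        self.per_zone[zone] = local

    def totals(self) -> Dict[str, int]:
        # Merge the per-zone counts into one global count per key.
        merged: Dict[str, int] = {}
        for local in self.per_zone.values():
            for key, value in local.items():
                merged[key] = merged.get(key, 0) + value
        return merged


def run_distributed(clusters) -> GlobalResult:
    """Initiate the local computation on each cluster and combine the results."""
    result = GlobalResult()
    for cluster in clusters:
        result.add(cluster.zone, cluster.run_local())
    return result


clusters = [
    BatchCluster("zone-a", ["x", "y", "x"]),
    StreamingCluster("zone-b", [["x"], ["z", "z"]]),
]
print(run_distributed(clusters).totals())
```

The two cluster classes deliberately share only the `run_local` method, so the combining step does not need to know which local data structure or framework produced a given result.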
20 Claims
1. A method comprising:
initiating distributed computations across a plurality of data processing clusters associated with respective data zones; and
combining local processing results of the distributed computations from respective ones of the data processing clusters;
each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
wherein at least one of the data processing clusters is configured in accordance with a Spark batch framework and one or more other ones of the data processing clusters are configured in accordance with a Spark streaming framework;
wherein the Spark batch framework implements one or more batch mode extensions comprising at least one of a Spark SQL extension, a Spark MLlib extension and a Spark GraphX extension;
wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations; and
wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
Dependent claims: 2, 3, 4, 5, 6, 7, 10, 11, 12

8. A method comprising:
initiating distributed computations across a plurality of data processing clusters associated with respective data zones; and
combining local processing results of the distributed computations from respective ones of the data processing clusters;
each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes;
wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations;
wherein portions of the local processing results at a given level of the global data structure are approximately synchronized with one another as belonging to a common iteration of the global data structure; and
wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
Dependent claims: 9

13. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes said at least one processing device:
to initiate distributed computations across a plurality of data processing clusters associated with respective data zones; and
to combine local processing results of the distributed computations from respective ones of the data processing clusters;
each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
wherein at least one of the data processing clusters is configured in accordance with a Spark batch framework and one or more other ones of the data processing clusters are configured in accordance with a Spark streaming framework;
wherein the Spark batch framework implements one or more batch mode extensions comprising at least one of a Spark SQL extension, a Spark MLlib extension and a Spark GraphX extension; and
wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations.
Dependent claims: 14, 17, 18

15. An apparatus comprising:
at least one processing device having a processor coupled to a memory;
wherein said at least one processing device is configured:
to initiate distributed computations across a plurality of data processing clusters associated with respective data zones; and
to combine local processing results of the distributed computations from respective ones of the data processing clusters;
each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
wherein at least one of the data processing clusters is configured in accordance with a Spark batch framework and one or more other ones of the data processing clusters are configured in accordance with a Spark streaming framework;
wherein the Spark batch framework implements one or more batch mode extensions comprising at least one of a Spark SQL extension, a Spark MLlib extension and a Spark GraphX extension; and
wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations.
Dependent claims: 16, 19, 20

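Claim 8 recites nodes arranged in multiple levels of a global computation graph, with local results at a given level treated as belonging to a common iteration. One possible reading can be sketched as follows. This is an assumption-laden toy and does not reproduce the claimed method: the names `LocalResult`, `group_by_level_and_iteration` and `combine_level` are hypothetical, "approximately synchronized" is reduced to an exact match on an iteration tag, and summation stands in for whatever combining operation is actually used.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LocalResult:
    node: str       # hypothetical identifier of the cluster (graph node)
    level: int      # level of that node in the global computation graph
    iteration: int  # iteration of the global data structure the result belongs to
    value: float    # the local processing result, reduced to a number here


def group_by_level_and_iteration(
    results: List[LocalResult],
) -> Dict[int, Dict[int, List[LocalResult]]]:
    """Index results as grouped[level][iteration] -> list of results."""
    grouped: Dict[int, Dict[int, List[LocalResult]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for result in results:
        grouped[result.level][result.iteration].append(result)
    return grouped


def combine_level(results: List[LocalResult], level: int, iteration: int) -> float:
    """Sum only the results at one level that share the given iteration tag."""
    grouped = group_by_level_and_iteration(results)
    return sum(r.value for r in grouped[level][iteration])


results = [
    LocalResult("cluster-1", level=1, iteration=0, value=2.0),
    LocalResult("cluster-2", level=1, iteration=0, value=3.0),
    LocalResult("cluster-3", level=1, iteration=1, value=10.0),
    LocalResult("root", level=0, iteration=0, value=5.0),
]
print(combine_level(results, level=1, iteration=0))
```

Tagging each result with its iteration keeps a late result from `cluster-3` out of the combination for iteration 0, which is the property this sketch is meant to show.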
Specification