Scalable distributed computations utilizing multiple distinct computational frameworks

US 10,366,111 B1
Filed: 08/22/2017
Issued: 07/30/2019
Est. Priority Date: 04/06/2015
Status: Active Grant

Abstract

An apparatus in one embodiment comprises at least one processing device having a processor coupled to a memory. The processing device is configured to initiate distributed computations across a plurality of data processing clusters associated with respective data zones, and to combine local processing results of the distributed computations from respective ones of the data processing clusters. Each of the data processing clusters is configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster. A first one of data processing clusters utilizes a first local data structure configured to support a first computational framework, and at least a second one of the data processing clusters utilizes a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework.
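The combining step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the two "clusters" are plain Python functions standing in for a batch-style and a streaming-style framework, and every name here (`run_batch_cluster`, `run_streaming_cluster`, `to_global`, `combine`) is hypothetical. The point is only that each cluster keeps its own local data structure, and a shared global structure is derived from them.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

# Hypothetical local data structures (assumptions, not from the patent):
#   batch-style cluster     -> a Counter of key occurrences
#   streaming-style cluster -> a list of (key, count) pairs, one per micro-batch key
BatchResult = Counter
StreamResult = List[Tuple[str, int]]


def run_batch_cluster(records: List[str]) -> BatchResult:
    """Stand-in for a batch framework: count all records of one data zone in one pass."""
    return Counter(records)


def run_streaming_cluster(micro_batches: List[List[str]]) -> StreamResult:
    """Stand-in for a streaming framework: emit (key, count) pairs per micro-batch."""
    out: StreamResult = []
    for batch in micro_batches:
        out.extend(Counter(batch).items())
    return out


def to_global(local: Union[BatchResult, StreamResult]) -> Dict[str, int]:
    """Normalize either local structure into the shared global form (key -> count)."""
    if isinstance(local, Counter):
        return dict(local)
    merged: Dict[str, int] = {}
    for key, count in local:
        merged[key] = merged.get(key, 0) + count
    return merged


def combine(local_results: List[Union[BatchResult, StreamResult]]) -> Dict[str, int]:
    """Fold the normalized local results into one global result."""
    total: Dict[str, int] = {}
    for local in local_results:
        for key, count in to_global(local).items():
            total[key] = total.get(key, 0) + count
    return total


if __name__ == "__main__":
    zone_a = run_batch_cluster(["x", "y", "x"])
    zone_b = run_streaming_cluster([["x"], ["y", "y"]])
    print(combine([zone_a, zone_b]))
```

Only the normalized results cross zone boundaries in this sketch; the raw records stay with the cluster that processed them.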

20 Claims

1. A method comprising:
- initiating distributed computations across a plurality of data processing clusters associated with respective data zones; and
  
  combining local processing results of the distributed computations from respective ones of the data processing clusters;
  
  each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
  
  a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
  
  at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
  
  wherein at least one of the data processing clusters is configured in accordance with a Spark batch framework and one or more other ones of the data processing clusters are configured in accordance with a Spark streaming framework;
  
  wherein the Spark batch framework implements one or more batch mode extensions comprising at least one of a Spark SQL extension, a Spark MLlib extension and a Spark GraphX extension;
  
  wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations; and
  
  wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
- Dependent claims (2, 3, 4, 5, 6, 7, 10, 11, 12)
  - 2. The method of claim 1 wherein the first computational framework comprises a MapReduce framework and the second computational framework comprises a Spark framework.
  - 3. The method of claim 2 wherein the Spark framework comprises one of a Spark batch framework and a Spark streaming framework.
  - 4. The method of claim 1 wherein the Spark streaming framework is configured to support at least one of Spark iterative processing and Spark interactive processing.
  - 5. The method of claim 1 wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes.
  - 6. The method of claim 5 wherein a particular one of the data processing clusters corresponding to a root node of the global computation graph initiates the distributed computations in accordance with a control flow that propagates from the root node toward leaf nodes of the global computation graph via one or more intermediate nodes of the global computation graph and wherein local processing results from respective ones of the data processing clusters corresponding to respective ones of the nodes propagate back from those nodes toward the root node.
  - 7. The method of claim 5 wherein the global data structure is organized in levels with different levels of the global data structure corresponding to respective ones of the levels of the global computation graph and wherein a given one of the levels of the global data structure comprises local processing results generated by nodes of the corresponding level in the global computation graph.
  - 10. The method of claim 1 wherein each of the data processing clusters generates its corresponding portion of the local processing results independently of and at least partially in parallel with the other data processing clusters.
  - 11. The method of claim 1 wherein each of the data processing clusters generates its portion of the local processing results asynchronously with respect to portions of the local processing results generated by the other data processing clusters but the portions of the local processing results are eventually synchronized across the plurality of data processing clusters in conjunction with generation of the global processing results in accordance with the global data structure.
  - 12. The method of claim 1 wherein the data processing clusters are implemented in one or more clouds of a particular type provided by a common cloud service provider.
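The multi-level global computation graph of claims 5 to 7 can be sketched as a small tree walk. This is an illustrative assumption about one possible shape, not the patent's implementation: `Node`, `run`, and the integer-valued `local_compute` callbacks are hypothetical. Calls descend from the root toward the leaves, each node's local result is recorded under its level, and partial totals return toward the root.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One data processing cluster in a hypothetical global computation graph."""
    name: str
    local_compute: Callable[[], int]  # stand-in for the cluster's local computation
    children: List["Node"] = field(default_factory=list)


def run(node: Node, level: int, by_level: Dict[int, Dict[str, int]]) -> int:
    """Start the computation at `node` and return the combined result of its subtree.

    The call descends to the children first (control flow toward the leaves);
    each return value carries results back toward the root.
    """
    child_total = sum(run(child, level + 1, by_level) for child in node.children)
    local = node.local_compute()
    # Level-organized global structure: level -> {cluster name -> local result}.
    by_level.setdefault(level, {})[node.name] = local
    return local + child_total


if __name__ == "__main__":
    leaf_a = Node("leaf_a", lambda: 2)
    leaf_b = Node("leaf_b", lambda: 3)
    mid = Node("mid", lambda: 1, [leaf_a, leaf_b])
    root = Node("root", lambda: 10, [mid])

    levels: Dict[int, Dict[str, int]] = {}
    print(run(root, 0, levels), levels)
```

The sketch runs sequentially for clarity; the parallel and asynchronous execution described in claims 10 and 11 would need a concurrent executor, which is omitted here.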

8. A method comprising:
- initiating distributed computations across a plurality of data processing clusters associated with respective data zones; and
  
  combining local processing results of the distributed computations from respective ones of the data processing clusters;
  
  each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
  
  a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
  
  at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
  
  wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes;
  
  wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations;
  
  wherein portions of the local processing results at a given level of the global data structure are approximately synchronized with one another as belonging to a common iteration of the global data structure; and
  
  wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
- Dependent claims (9)
  - 9. The method of claim 8 wherein the portions of the local processing results at the given level of the global data structure are determined to belong to the common iteration of the global data structure based at least in part on at least one of a time interval during which the portions of the local processing results were generated, a sequence number associated with generation of the portions of the local processing results and a time-stamp associated with generation of the portions of the local processing results.
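One way to read the "common iteration" test of claim 9 is as bucketing local results by the time interval in which they were generated. The sketch below assumes that reading and uses hypothetical names (`LocalResult`, `group_by_interval`) and a hypothetical fixed interval length; a sequence number could serve as the bucket key in the same way.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LocalResult:
    """A local processing result tagged with its generation time (hypothetical shape)."""
    cluster: str
    value: float
    timestamp: float  # seconds since an arbitrary shared epoch (assumption)


def group_by_interval(results: List[LocalResult], interval: float) -> Dict[int, List[LocalResult]]:
    """Assign each result to iteration floor(timestamp / interval).

    Results falling in the same interval are treated as approximately
    synchronized, i.e. as belonging to the same iteration.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    buckets: Dict[int, List[LocalResult]] = defaultdict(list)
    for result in results:
        buckets[int(result.timestamp // interval)].append(result)
    return dict(buckets)


if __name__ == "__main__":
    results = [
        LocalResult("zone_a", 1.0, 0.5),
        LocalResult("zone_b", 2.0, 9.9),
        LocalResult("zone_c", 3.0, 10.1),
    ]
    for iteration, members in sorted(group_by_interval(results, 10.0).items()):
        print(iteration, [m.cluster for m in members])
```

Fixed-width intervals are the simplest choice. Clock skew between clusters is one reason the claim speaks of results being "approximately" synchronized, and this sketch does not attempt to correct for it.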

13. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes said at least one processing device:
- to initiate distributed computations across a plurality of data processing clusters associated with respective data zones; and
  
  to combine local processing results of the distributed computations from respective ones of the data processing clusters;
  
  each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
  
  a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
  
  at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
  
  wherein at least one of the data processing clusters is configured in accordance with a Spark batch framework and one or more other ones of the data processing clusters are configured in accordance with a Spark streaming framework;
  
  wherein the Spark batch framework implements one or more batch mode extensions comprising at least one of a Spark SQL extension, a Spark MLlib extension and a Spark GraphX extension; and
  
  wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations.
- Dependent claims (14, 17, 18)
  - 14. The computer program product of claim 13 wherein the first computational framework comprises a MapReduce framework and the second computational framework comprises a Spark framework.
  - 17. The computer program product of claim 13 wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes.
  - 18. The computer program product of claim 17 wherein a particular one of the data processing clusters corresponding to a root node of the global computation graph initiates the distributed computations in accordance with a control flow that propagates from the root node toward leaf nodes of the global computation graph via one or more intermediate nodes of the global computation graph and wherein local processing results from respective ones of the data processing clusters corresponding to respective ones of the nodes propagate back from those nodes toward the root node.

15. An apparatus comprising:
- at least one processing device having a processor coupled to a memory;
  
  wherein said at least one processing device is configured:
  
  to initiate distributed computations across a plurality of data processing clusters associated with respective data zones; and
  
  to combine local processing results of the distributed computations from respective ones of the data processing clusters;
  
  each of the data processing clusters being configured to process data from a data source of the corresponding data zone using a local data structure and an associated computational framework of that data processing cluster;
  
  a first one of data processing clusters utilizing a first local data structure configured to support a first computational framework; and
  
  at least a second one of the data processing clusters utilizing a second local data structure different than the first local data structure and configured to support a second computational framework different than the first computational framework;
  
  wherein at least one of the data processing clusters is configured in accordance with a Spark batch framework and one or more other ones of the data processing clusters are configured in accordance with a Spark streaming framework;
  
  wherein the Spark batch framework implements one or more batch mode extensions comprising at least one of a Spark SQL extension, a Spark MLlib extension and a Spark GraphX extension; and
  
  wherein the local processing results of the distributed computations from respective ones of the data processing clusters are combined utilizing a global data structure configured based at least in part on the local data structures in order to produce global processing results of the distributed computations.
- Dependent claims (16, 19, 20)
  - 16. The apparatus of claim 15 wherein the first computational framework comprises a MapReduce framework and the second computational framework comprises a Spark framework.
  - 19. The apparatus of claim 15 wherein the plurality of data processing clusters associated with the respective data zones are organized in accordance with a global computation graph for performance of the distributed computations and wherein the global computation graph comprises a plurality of nodes corresponding to respective ones of the data processing clusters and further wherein the plurality of nodes are arranged in multiple levels each including at least one of the nodes.
  - 20. The apparatus of claim 19 wherein a particular one of the data processing clusters corresponding to a root node of the global computation graph initiates the distributed computations in accordance with a control flow that propagates from the root node toward leaf nodes of the global computation graph via one or more intermediate nodes of the global computation graph and wherein local processing results from respective ones of the data processing clusters corresponding to respective ones of the nodes propagate back from those nodes toward the root node.

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors: Florissi, Patricia Gomes Soares; Masad, Ofri
Primary Examiner(s): Tokuta, Shean
Application Number: US15/683,243
Time in Patent Office: 707 Days

CPC Class Codes

G06F 16/182   Distributed file systems

G06F 16/2246   Trees, e.g. B+trees

G06F 16/2477   Temporal data queries

G06F 16/27   Replication, distribution o...

G06F 16/285   Clustering or classification

G06F 16/951   Indexing; Web crawling tech...
