Scalable distributed in-memory computation

US 10,656,861 B1
Filed: 04/12/2017
Issued: 05/19/2020
Est. Priority Date: 12/29/2015
Status: Active Grant

Abstract

An apparatus in one embodiment comprises at least one processing device having a processor coupled to a memory. The processing device is configured to distribute in-memory computations across at least first and second nodes of respective distinct data processing clusters of a plurality of data processing clusters over at least one network, and to aggregate results of the distributed in-memory computations for delivery to a requesting client device. The data processing clusters are associated with respective distinct data zones, and the first and second nodes of the respective distinct data processing clusters are configured to perform corresponding portions of the distributed in-memory computations utilizing respective ones of first and second in-memory datasets locally accessible within their respective data zones. The in-memory computations in some embodiments illustratively comprise Spark computations, such as Spark Core batch computations. The in-memory datasets in such an arrangement may comprise respective Spark resilient distributed datasets.
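
The flow described in the abstract can be illustrated with a minimal sketch, which is not the patented implementation. It assumes plain Python lists in place of Spark RDDs, threads in place of cluster nodes, and a global mean as the computation; the names `ZONE_DATASETS`, `local_computation`, `aggregate`, and `distributed_mean` are hypothetical. Each zone reduces its own data to a `(sum, count)` pair, so only that summary, never the raw data, reaches the aggregation step.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory datasets, one per data zone. In the Spark embodiment
# these would be RDDs; plain lists stand in for them here.
ZONE_DATASETS = {
    "zone-a": [4.0, 8.0, 6.0],
    "zone-b": [10.0, 2.0],
}


def local_computation(zone):
    """Runs inside a zone: reduce the local dataset to a (sum, count) pair."""
    values = ZONE_DATASETS[zone]
    return sum(values), len(values)


def aggregate(local_results):
    """Global result as a function of the local results: the overall mean."""
    total = sum(s for s, _ in local_results)
    count = sum(n for _, n in local_results)
    return total / count


def distributed_mean(zones):
    # Threads simulate dispatching work to nodes in distinct clusters.
    with ThreadPoolExecutor() as pool:
        local_results = list(pool.map(local_computation, zones))
    return aggregate(local_results)


if __name__ == "__main__":
    print(distributed_mean(["zone-a", "zone-b"]))
```

The mean is chosen because it cannot be obtained by averaging per-zone means when zones differ in size, which shows why the global result has to be defined as a function of suitably chosen local results.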

22 Claims

1. A method comprising:
- distributing in-memory computations across at least first and second nodes of respective distinct data processing clusters of a plurality of data processing clusters over at least one network; and
  
  aggregating results of the distributed in-memory computations for delivery to a requesting client device, wherein the results of the distributed in-memory computations are generated in respective ones of the at least first and second nodes in a decentralized and privacy-preserving manner;
  
  wherein the data processing clusters are associated with respective distinct data zones, the first and second nodes of the respective distinct data processing clusters being configured to perform corresponding portions of the distributed in-memory computations utilizing respective ones of first and second in-memory datasets locally accessible within their respective data zones;
  
  wherein the aggregating comprises processing local results received from respective ones of the at least first and second nodes of the data processing clusters to generate a global result as a function of the local results; and
  
  wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
- Dependent claims (2-16):
  - 2. The method of claim 1 wherein the in-memory computations comprise Spark computations.
  - 3. The method of claim 2 wherein the Spark computations comprise Spark Core batch computations.
  - 4. The method of claim 1 wherein the first and second in-memory data sets comprise respective Spark resilient distributed datasets (RDDs).
  - 5. The method of claim 1 wherein the plurality of data processing clusters comprise respective YARN clusters.
  - 6. The method of claim 1 wherein the distributing and aggregating are performed at least in part in a worldwide data node coupled to one or more of the data processing clusters.
  - 7. The method of claim 1 wherein the distributing and aggregating are performed at least in part in a worldwide data node that comprises a processing node of a given one of the data processing clusters.
  - 8. The method of claim 1 wherein a given one of the data processing clusters comprises:
    - an in-memory processing driver;
      
      a distributed processing application master; and
      
      a resource manager coupled to the in-memory processing driver and the distributed processing application master;
      
      wherein the in-memory processing driver and the distributed processing application master are configured to communicate with one another via the resource manager.
  - 9. The method of claim 8 wherein the in-memory processing driver comprises a Spark Core driver program.
  - 10. The method of claim 8 wherein the distributed processing application master comprises a WWH application master.
  - 11. The method of claim 8 wherein the distributed processing application master of the given data processing cluster is configured to interact with a distributed processing application master of another one of the data processing clusters via a resource manager of that other data processing cluster.
  - 12. The method of claim 1 wherein a given one of the data processing clusters comprises at least one in-memory aggregator instance generated by a distributed processing application master of the given data processing cluster and configured to combine in-memory processing results from respective other ones of the data processing clusters.
  - 13. The method of claim 12 wherein the in-memory aggregator instance comprises a Spark aggregator instance generated by a WWH application master.
  - 14. The method of claim 1 wherein the distributed in-memory computations are performed utilizing multiple instances of local code running on respective nodes within respective ones of the data processing clusters and at least one instance of global code running on an initiating node within or otherwise associated with a particular one of the data processing clusters, and wherein the global code receives respective results from the multiple instances of the local code running on the respective nodes within the respective ones of the data processing clusters and aggregates those results.
  - 15. The method of claim 14 wherein one or more of the local code, the global code and a list of data resources are received in a distributed processing application master of a worldwide data node from an application running on the client device, and wherein the local code and the global code are executed against data resources identified at least in part by the list of data resources.
  - 16. The method of claim 1 wherein at least one of the portions of the distributed in-memory computations in a given one of the data processing clusters is itself distributed across multiple nodes in respective other ones of the data processing clusters.
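
The local code and global code split of claim 14, combined with the nested distribution of claim 16, might be sketched as follows. This is an illustration under stated assumptions rather than the claimed system: the `Cluster` class, its `delegates` field, and the word-count workload are all hypothetical, and the recursion runs in a single process instead of across real clusters.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    dataset: list  # in-memory data local to this cluster's data zone
    delegates: list = field(default_factory=list)  # clusters it fans out to


def local_code(dataset):
    """Per-cluster computation: count words in the locally held lines."""
    counts = {}
    for line in dataset:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def global_code(results):
    """Aggregation: merge per-cluster word counts into one dictionary."""
    merged = {}
    for counts in results:
        for word, n in counts.items():
            merged[word] = merged.get(word, 0) + n
    return merged


def run(cluster):
    """Compute locally, recurse into delegate clusters, then aggregate."""
    results = [local_code(cluster.dataset)]
    results.extend(run(other) for other in cluster.delegates)
    return global_code(results)


# A three-level chain: c1 delegates to c2, which delegates to c3.
leaf = Cluster("c3", ["spark core"])
middle = Cluster("c2", ["spark rdd"], [leaf])
root = Cluster("c1", ["yarn spark"], [middle])
totals = run(root)
```

Because `run` applies the same global code at every level, an intermediate cluster returns an already aggregated result to its caller, which is one way a portion of the computation can itself be distributed.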

17. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes said at least one processing device:
- to distribute in-memory computations across at least first and second nodes of respective distinct data processing clusters of a plurality of data processing clusters over at least one network; and
  
  to aggregate results of the distributed in-memory computations for delivery to a requesting client device, wherein the results of the distributed in-memory computations are generated in respective ones of the at least first and second nodes in a decentralized and privacy-preserving manner;
  
  wherein the data processing clusters are associated with respective distinct data zones, the first and second nodes of the respective distinct data processing clusters being configured to perform corresponding portions of the distributed in-memory computations utilizing respective ones of first and second in-memory datasets locally accessible within their respective data zones; and
  
  wherein the aggregating comprises processing local results received from respective ones of the at least first and second nodes of the data processing clusters to generate a global result as a function of the local results.
- Dependent claims (18, 19):
  - 18. The computer program product of claim 17 wherein the distributed in-memory computations are performed utilizing multiple instances of local code running on respective nodes within respective ones of the data processing clusters and at least one instance of global code running on an initiating node within or otherwise associated with a particular one of the data processing clusters, and wherein the global code receives respective results from the multiple instances of the local code running on the respective nodes within the respective ones of the data processing clusters and aggregates those results.
  - 19. The computer program product of claim 18 wherein one or more of the local code, the global code and a list of data resources are received in a distributed processing application master of a worldwide data node from an application running on the client device, and wherein the local code and the global code are executed against data resources identified at least in part by the list of data resources.
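
The submission described in claim 19, where a client application hands local code, global code, and a list of data resources to an application master, could look roughly like the sketch below. Everything named here is an assumption made for illustration: the `Job` class, the `CATALOG` mapping, the resource names, and the `application_master` function are hypothetical and do not reflect the actual WWH interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical catalog held by the application master: resource name mapped
# to the in-memory dataset that lives in some data zone.
CATALOG: Dict[str, List[int]] = {
    "zone-a/readings": [3, 9, 4],
    "zone-b/readings": [7, 1],
    "zone-c/readings": [12],
}


@dataclass
class Job:
    local_code: Callable[[List[int]], Any]   # runs next to each resource
    global_code: Callable[[List[Any]], Any]  # combines the local results
    data_resources: List[str]                # which resources to run against


def application_master(job: Job) -> Any:
    """Run the local code per listed resource, then the global code once."""
    local_results = [job.local_code(CATALOG[name]) for name in job.data_resources]
    return job.global_code(local_results)


# Client side: a global maximum over two of the three catalogued resources.
job = Job(
    local_code=max,
    global_code=max,
    data_resources=["zone-a/readings", "zone-b/readings"],
)
result = application_master(job)
```

Only the resources named in `data_resources` are touched, so `zone-c/readings` plays no part in this result even though it appears in the catalog.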

20. An apparatus comprising:
- at least one processing device having a processor coupled to a memory;
  
  wherein said at least one processing device is configured:
  
  to distribute in-memory computations across at least first and second nodes of respective distinct data processing clusters of a plurality of data processing clusters over at least one network; and
  
  to aggregate results of the distributed in-memory computations for delivery to a requesting client device, wherein the results of the distributed in-memory computations are generated in respective ones of the at least first and second nodes in a decentralized and privacy-preserving manner;
  
  wherein the data processing clusters are associated with respective distinct data zones, the first and second nodes of the respective distinct data processing clusters being configured to perform corresponding portions of the distributed in-memory computations utilizing respective ones of first and second in-memory datasets locally accessible within their respective data zones; and
  
  wherein the aggregating comprises processing local results received from respective ones of the at least first and second nodes of the data processing clusters to generate a global result as a function of the local results.
- Dependent claims (21, 22):
  - 21. The apparatus of claim 20 wherein the distributed in-memory computations are performed utilizing multiple instances of local code running on respective nodes within respective ones of the data processing clusters and at least one instance of global code running on an initiating node within or otherwise associated with a particular one of the data processing clusters, and wherein the global code receives respective results from the multiple instances of the local code running on the respective nodes within the respective ones of the data processing clusters and aggregates those results.
  - 22. The apparatus of claim 21 wherein one or more of the local code, the global code and a list of data resources are received in a distributed processing application master of a worldwide data node from an application running on the client device, and wherein the local code and the global code are executed against data resources identified at least in part by the list of data resources.

Current Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors
Florissi, Patricia Gomes Soares; Masad, Ofri; Vijendra, Sudhir; Singer, Ido
Primary Examiner(s)
Tokuta, Shean
Assistant Examiner(s)
Turriate Gastulo, Juan C

Application Number

US15/485,843
Time in Patent Office

1,133 Days
CPC Class Codes

G06F 3/0604   Improving or facilitating a...

G06F 3/0643   Management of files

G06F 3/067   Distributed or networked st...

G06F 9/5072   Grid computing

G06F 9/5077   Logical partitioning of res...
