Duplicate in-memory shared-intermediate data detection and reuse module in spark framework

US 10,311,025 B2
Filed: 01/11/2017
Issued: 06/04/2019
Est. Priority Date: 09/06/2016
Status: Active Grant

Abstract

A cache management system for managing a plurality of intermediate data includes a processor, and a memory having stored thereon the plurality of intermediate data and instructions that, when executed by the processor, cause the processor to perform: identifying a new intermediate data to be accessed; loading the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and, in response to not identifying the new intermediate data as one of the plurality of intermediate data, identifying a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, and generating the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data.

18 Claims

1. A cache management system for managing a plurality of intermediate data, the cache management system comprising:
- a processor; and
  
  a memory having stored thereon the plurality of intermediate data and instructions that when executed by the processor, cause the processor to perform:
  
  identifying a new intermediate data to be accessed;
  
  loading the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and
  
  in response to not identifying the new intermediate data as one of the plurality of intermediate data:
  
  identifying a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, each generating logic chain comprising a chain of functions applied to a corresponding intermediate data source, a first function of the chain of functions operating on the corresponding intermediate data source and each subsequent function of the chain of functions operating on an output of a previous function of the chain of functions; and
  
  generating the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data, wherein the identifying of the reusable intermediate data comprises:
  
  identifying a subset of intermediate data, each one of the subset of intermediate data having a same intermediate data source as the new intermediate data and a generating logic chain that is not longer than that of the new intermediate data; and
  
  comparing functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data, in a sequence, to identify the reusable intermediate data as one of the subset of intermediate data having a longest sequence of matching functions with the new intermediate data.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The cache management system of claim 1, wherein the instructions further cause the processor to perform:
    - in response to a lack of sufficient space within the memory to store the new intermediate data:
      
      selecting a set of victim intermediate data from the plurality of intermediate data to evict from the memory; and
      
      evicting the set of victim intermediate data from the memory.
  - 3. The cache management system of claim 1, wherein the sequence is an order of functions from the intermediate data source to the each one of the subset of intermediate data.
  - 4. The cache management system of claim 1, wherein the intermediate data source of each of the subset of intermediate data has a same size as, and a same data structure as, that of the new intermediate data.
  - 5. The cache management system of claim 1, wherein the identifying of the subset of intermediate data comprises:
    - comparing, byte-to-byte, the intermediate data source of each one of the subset of intermediate data with the new intermediate data.
  - 6. The cache management system of claim 1, wherein the comparing of the functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data comprises:
    - determining that the functions of the generating logic chain of each one of the subset of intermediate data and those of the new intermediate data are deterministic; and
      
      in response, determining that types of the functions of the generating logic chain of each one of the subset of intermediate data are the same as those of the new intermediate data.
  - 7. The cache management system of claim 6, wherein the determining that the functions are deterministic comprises:
    - matching each of the functions with one of a preset list of deterministic functions.
  - 8. The cache management system of claim 1, wherein the generating of the new intermediate data from the reusable intermediate data comprises:
    - successively applying, to the reusable intermediate data, functions of the generating logic chain of the new intermediate data not in common with the reusable intermediate data to generate the new intermediate data.
  - 9. The cache management system of claim 1, wherein each one of the plurality of intermediate data is a resilient distributed data (RDD).
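
The reuse logic recited in claims 1, 3, 6, 7, and 8 can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the names `CachedData`, `FUNCTIONS`, `DETERMINISTIC`, `find_reusable`, and `get_or_generate` are hypothetical, plain Python lists stand in for RDDs, and function "type" is approximated by a function name. The sketch also assumes that a cached item is reusable only when its whole chain is a prefix of the new chain, since a cached item with extra non-matching functions could not serve as a starting point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry of named functions; the names are illustrative only.
FUNCTIONS: Dict[str, Callable[[list], list]] = {
    "double": lambda xs: [x * 2 for x in xs],
    "keep_even": lambda xs: [x for x in xs if x % 2 == 0],
    "increment": lambda xs: [x + 1 for x in xs],
}

# Assumed preset list of deterministic function types (claim 7).
DETERMINISTIC = {"double", "keep_even", "increment"}


@dataclass
class CachedData:
    source_id: str          # identifies the intermediate data source
    chain: Tuple[str, ...]  # function names, ordered from the source outward
    value: list             # the cached intermediate data itself


def functions_match(a: str, b: str) -> bool:
    # Claim 6: both functions must be deterministic and of the same type.
    return a in DETERMINISTIC and b in DETERMINISTIC and a == b


def matching_prefix_length(cached: Tuple[str, ...], new: Tuple[str, ...]) -> int:
    # Compare functions in sequence, starting at the data source (claim 3).
    n = 0
    for a, b in zip(cached, new):
        if not functions_match(a, b):
            break
        n += 1
    return n


def find_reusable(cache: List[CachedData], source_id: str,
                  new_chain: Tuple[str, ...]) -> Optional[CachedData]:
    # Subset: same source and a chain no longer than the new chain.
    subset = [c for c in cache
              if c.source_id == source_id and len(c.chain) <= len(new_chain)]
    best = None
    for c in subset:
        # Assumption: the entire cached chain must match to be reusable.
        if matching_prefix_length(c.chain, new_chain) == len(c.chain):
            if best is None or len(c.chain) > len(best.chain):
                best = c
    return best


def get_or_generate(cache: List[CachedData], source_id: str,
                    new_chain: Tuple[str, ...], source_value: list) -> list:
    # Hit: the new intermediate data is already cached, so load it.
    for c in cache:
        if c.source_id == source_id and c.chain == new_chain:
            return c.value
    # Miss: start from the longest reusable item, else from the source.
    reusable = find_reusable(cache, source_id, new_chain)
    value = reusable.value if reusable else source_value
    done = len(reusable.chain) if reusable else 0
    # Claim 8: successively apply the functions not in common.
    for name in new_chain[done:]:
        value = FUNCTIONS[name](value)
    cache.append(CachedData(source_id, new_chain, value))
    return value
```

With two cached items on the same source, one with chain `("double",)` and one with chain `("double", "keep_even")`, a request for `("double", "keep_even", "increment")` would start from the second item and apply only `increment`.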

10. A method of managing a plurality of intermediate data stored in memory, the method comprising:
- identifying, by a processor, a new intermediate data to be accessed;
  
  loading, by the processor, the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and
  
  in response to not identifying the new intermediate data as one of the plurality of intermediate data:
  
  identifying, by the processor, a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, each generating logic chain comprising a chain of functions applied to a corresponding intermediate data source, a first function of the chain of functions operating on the corresponding intermediate data source and each subsequent function of the chain of functions operating on an output of a previous function of the chain of functions; and
  
  generating, by the processor, the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data, wherein the identifying of the reusable intermediate data comprises:
  
  identifying, by the processor, a subset of intermediate data, each one of the subset of intermediate data having a same intermediate data source as the new intermediate data and a generating logic chain that is not longer than that of the new intermediate data; and
  
  comparing, by the processor, functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data, in a sequence, to identify the reusable intermediate data as one of the subset of intermediate data having a longest sequence of matching functions with the new intermediate data.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18)
  - 11. The method of claim 10, further comprising:
    - in response to a lack of sufficient space within the memory to store the new intermediate data:
      
      selecting, by the processor, a set of victim intermediate data from the plurality of intermediate data to evict from the memory; and
      
      evicting, by the processor, the set of victim intermediate data from the memory.
  - 12. The method of claim 10, wherein the sequence is an order of functions from the intermediate data source to the each one of the subset of intermediate data.
  - 13. The method of claim 10, wherein the intermediate data source of each of the subset of intermediate data has a same size as, and a same data structure as, that of the new intermediate data.
  - 14. The method of claim 10, wherein the identifying of the subset of intermediate data comprises:
    - comparing, byte-to-byte, by the processor, the intermediate data source of each one of the subset of intermediate data with the new intermediate data.
  - 15. The method of claim 10, wherein the comparing of the functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data comprises:
    - determining, by the processor, that the functions of the generating logic chain of each one of the subset of intermediate data and those of the new intermediate data are deterministic; and
      
      in response, determining, by the processor, that types of the functions of the generating logic chain of each one of the subset of intermediate data are the same as those of the new intermediate data.
  - 16. The method of claim 15, wherein the determining that the functions are deterministic comprises:
    - matching, by the processor, each of the functions with one of a preset list of deterministic functions.
  - 17. The method of claim 10, wherein the generating of the new intermediate data from the reusable intermediate data comprises:
    - successively applying, to the reusable intermediate data, by the processor, functions of the generating logic chain of the new intermediate data not in common with the reusable intermediate data to generate the new intermediate data.
  - 18. The method of claim 10, wherein each one of the plurality of intermediate data is a resilient distributed data (RDD).
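
The eviction step of claims 2 and 11 can be illustrated with a small capacity-bounded cache. The claims do not say how the set of victim intermediate data is chosen, so this sketch assumes a least-recently-used policy purely as a stand-in. The class `IntermediateDataCache`, its methods, and the explicit `size` argument are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, List


class IntermediateDataCache:
    """Capacity-bounded store; victim choice is LRU by assumption."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        # key -> (value, size), ordered from least to most recently used
        self.entries: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Any:
        value, _ = self.entries[key]
        self.entries.move_to_end(key)  # mark as most recently used
        return value

    def select_victims(self, needed: int) -> List[str]:
        # Pick least recently used entries until enough space would be free.
        victims, freed = [], 0
        free = self.capacity - self.used
        for key, (_, size) in self.entries.items():
            if free + freed >= needed:
                break
            victims.append(key)
            freed += size
        return victims

    def put(self, key: str, value: Any, size: int) -> List[str]:
        if size > self.capacity:
            raise ValueError("item is larger than the whole cache")
        if key in self.entries:
            # Replace an existing entry: release its space first.
            _, old_size = self.entries.pop(key)
            self.used -= old_size
        # Only evict when there is not enough space for the new data.
        victims = self.select_victims(size)
        for victim in victims:
            _, victim_size = self.entries.pop(victim)
            self.used -= victim_size
        self.entries[key] = (value, size)
        self.used += size
        return victims
```

Returning the evicted keys from `put` is a convenience for inspection only; a different victim-selection policy could be swapped into `select_victims` without changing the rest of the sketch.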

Current Assignee: Samsung Electronics Co. Ltd.
Original Assignee: Samsung Electronics Co. Ltd.
Inventors: Yang, Zhengyu; Wang, Jiayin; Evans, Thomas David
Primary Examiner(s): Bragdon, Reginald G
Assistant Examiner(s): Knight, Paul M
Application Number: US15/404,100
Publication Number: US 20180067861A1
Time in Patent Office: 874 Days
Field of Search: None
CPC Class Codes

G06F 12/0808   with cache invalidating mea...

G06F 12/0811   with multilevel cache hiera...

G06F 12/0842   for multiprocessing or mult...

G06F 12/0862   with prefetch

G06F 12/0868   Data transfer between cache...

G06F 12/124   being minimized, e.g. non MRU

G06F 12/126   with special data handling,...

G06F 16/172   Caching, prefetching or hoa...

G06F 16/188   Virtual file systems

G06F 2009/4557   Distribution of virtual mac...

G06F 2009/45579   I/O management, e.g. provid...

G06F 2009/45583   Memory management, e.g. acc...

G06F 2212/1021   Hit rate improvement

G06F 2212/602   Details relating to cache p...

G06F 2212/6024   History based prefetching

G06F 2212/62   Details of cache specific t...

G06F 3/06   Digital input from, or digi...

G06F 3/061   Improving I/O performance

G06F 3/0634   by changing the state or mo...

G06F 3/0685   Hybrid storage combining he...

G06F 9/45558   Hypervisor-specific managem...
