Duplicate in-memory shared-intermediate data detection and reuse module in spark framework

  • US 10,311,025 B2
  • Filed: 01/11/2017
  • Issued: 06/04/2019
  • Est. Priority Date: 09/06/2016
  • Status: Active Grant
First Claim

1. A cache management system for managing a plurality of intermediate data, the cache management system comprising:

  • a processor; and

    a memory having stored thereon the plurality of intermediate data and instructions that when executed by the processor, cause the processor to perform:

    identifying a new intermediate data to be accessed;

    loading the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and

    in response to not identifying the new intermediate data as one of the plurality of intermediate data:

    identifying a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, each generating logic chain comprising a chain of functions applied to a corresponding intermediate data source, a first function of the chain of functions operating on the corresponding intermediate data source and each subsequent function of the chain of functions operating on an output of a previous function of the chain of functions; and

    generating the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data, wherein the identifying of the reusable intermediate data comprises:

    identifying a subset of intermediate data, each one of the subset of intermediate data having a same intermediate data source as the new intermediate data and a generating logic chain that is not longer than that of the new intermediate data; and

    comparing functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data, in a sequence, to identify the reusable intermediate data as one of the subset of intermediate data having a longest sequence of matching functions with the new intermediate data.
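
The lookup steps recited in the claim can be illustrated with a small sketch. This is not the patented implementation and not Spark code. It is a minimal Python model written under several assumptions: the class name `IntermediateCache`, the `load_source` callback, and the `get` method are hypothetical; two functions are treated as "the same" only if they are the identical Python object; and a cached entry counts as reusable only when its entire chain matches the start of the new chain, since otherwise its stored value would not be a valid starting point.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Func = Callable[[object], object]
Key = Tuple[Hashable, Tuple[Func, ...]]


class IntermediateCache:
    """Toy cache keyed by (data source, generating logic chain). Hypothetical."""

    def __init__(self, load_source: Callable[[Hashable], object]):
        # load_source is an assumed callback that reads the raw data source.
        self._load_source = load_source
        self._entries: Dict[Key, object] = {}

    def get(self, source: Hashable, chain: Sequence[Func]) -> object:
        chain = tuple(chain)
        key = (source, chain)

        # Case 1: the new intermediate data is already cached, so load it.
        if key in self._entries:
            return self._entries[key]

        # Case 2: search for the reusable entry with the longest matching chain.
        best_len, best_value = -1, None
        for (src, cached_chain), value in self._entries.items():
            # Subset filter: same source, chain not longer than the new chain.
            if src != source or len(cached_chain) > len(chain):
                continue
            # Compare the functions in sequence and count the matches.
            matched = 0
            for cached_fn, new_fn in zip(cached_chain, chain):
                if cached_fn is not new_fn:
                    break
                matched += 1
            # Assumption: reuse only when the whole cached chain matched.
            if matched == len(cached_chain) and matched > best_len:
                best_len, best_value = matched, value

        # Nothing reusable: fall back to the raw data source.
        if best_len < 0:
            best_len, best_value = 0, self._load_source(source)

        # Apply only the part of the chain not shared with the reused entry.
        result = best_value
        for fn in chain[best_len:]:
            result = fn(result)

        self._entries[key] = result
        return result
```

As a usage illustration, if the chain `[double, inc]` over a source has already been cached, a later request for `[double, inc, total]` over the same source reuses the cached result and runs only `total`. A production system would also need a stable way to decide that two functions are equivalent, plus an eviction policy, both of which this sketch leaves out.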
