Duplicate in-memory shared-intermediate data detection and reuse module in Spark framework
First Claim
1. A cache management system for managing a plurality of intermediate data, the cache management system comprising:
a processor; and
a memory having stored thereon the plurality of intermediate data and instructions that when executed by the processor, cause the processor to perform:
identifying a new intermediate data to be accessed;
loading the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and
in response to not identifying the new intermediate data as one of the plurality of intermediate data:
identifying a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, each generating logic chain comprising a chain of functions applied to a corresponding intermediate data source, a first function of the chain of functions operating on the corresponding intermediate data source and each subsequent function of the chain of functions operating on an output of a previous function of the chain of functions; and
generating the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data, wherein the identifying of the reusable intermediate data comprises:
identifying a subset of intermediate data, each one of the subset of intermediate data having a same intermediate data source as the new intermediate data and a generating logic chain that is not longer than that of the new intermediate data; and
comparing functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data, in a sequence, to identify the reusable intermediate data as one of the subset of intermediate data having a longest sequence of matching functions with the new intermediate data.
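The identifying step recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Intermediate`, `matching_prefix_length`, and `find_reusable` are hypothetical, and functions are represented by string identifiers for simplicity. The sketch also assumes that a cached item is reusable only when its entire generating logic chain matches the start of the new chain, since a cached result already reflects every function in its own chain.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Intermediate:
    """Hypothetical record describing one cached intermediate data item."""
    source: str              # identifier of the intermediate data source
    chain: Tuple[str, ...]   # function identifiers, in the order applied


def matching_prefix_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the functions that match, in sequence, from the start of both chains."""
    n = 0
    for fa, fb in zip(a, b):
        if fa != fb:
            break
        n += 1
    return n


def find_reusable(new: Intermediate,
                  cached: Sequence[Intermediate]) -> Optional[Intermediate]:
    # Step 1: keep only items with the same source and a chain that is
    # not longer than that of the new intermediate data.
    subset = [c for c in cached
              if c.source == new.source and len(c.chain) <= len(new.chain)]

    # Step 2: compare functions in sequence and keep the longest match.
    best, best_len = None, 0
    for c in subset:
        n = matching_prefix_length(c.chain, new.chain)
        # Assumption (not stated in the claim): the whole cached chain must
        # match, otherwise the cached output cannot stand in for a prefix.
        if n == len(c.chain) and n > best_len:
            best, best_len = c, n
    return best


cached = [
    Intermediate("src", ("map",)),
    Intermediate("src", ("map", "filter")),
    Intermediate("src", ("map", "filter", "join")),
    Intermediate("other", ("map", "filter", "sort")),
    Intermediate("src", ("map", "filter", "sort", "count", "save")),
]
new = Intermediate("src", ("map", "filter", "sort", "count"))
reusable = find_reusable(new, cached)
remaining = new.chain[len(reusable.chain):]
```

In this example the item with chain `("map", "filter")` is selected, and only the remaining functions `("sort", "count")` would need to be applied to it to produce the new intermediate data.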
Abstract
A cache management system for managing a plurality of intermediate data includes a processor, and a memory having stored thereon the plurality of intermediate data and instructions that, when executed by the processor, cause the processor to perform: identifying a new intermediate data to be accessed; loading the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and, in response to not identifying the new intermediate data as one of the plurality of intermediate data, identifying a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, and generating the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data.
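The overall flow summarized in the abstract (load on a hit, otherwise reuse the longest matching cached chain and apply only the remaining functions) might be sketched as follows. This is an illustrative sketch under stated assumptions, not the system described in the specification. The class `ChainCache` and its members are hypothetical. As a simplification, it looks up chain prefixes directly in a dictionary rather than comparing candidate chains one by one, and it stores each newly generated result, which the abstract does not require.

```python
from typing import Callable, Dict, List, Tuple

Func = Callable[[list], list]
Key = Tuple[str, Tuple[str, ...]]   # (source identifier, chain of function names)


class ChainCache:
    """Hypothetical in-memory cache keyed by source and generating logic chain."""

    def __init__(self, sources: Dict[str, list], functions: Dict[str, Func]):
        self.sources = sources        # raw intermediate data sources
        self.functions = functions    # function name -> callable
        self.store: Dict[Key, list] = {}
        self.applied: List[str] = []  # log of functions actually executed

    def get(self, source: str, chain: Tuple[str, ...]) -> list:
        key = (source, chain)
        if key in self.store:
            # Hit: the requested intermediate data is already in memory.
            return self.store[key]

        # Miss: start from the longest cached prefix of the chain, if any,
        # otherwise from the raw source.
        data, done = self.sources[source], 0
        for k in range(len(chain) - 1, 0, -1):
            if (source, chain[:k]) in self.store:
                data, done = self.store[(source, chain[:k])], k
                break

        # Apply only the portion of the chain not in common with the
        # reusable data.
        for name in chain[done:]:
            data = self.functions[name](data)
            self.applied.append(name)

        self.store[key] = data        # assumption: new results are cached
        return data


functions = {
    "double": lambda xs: [x * 2 for x in xs],
    "mult4": lambda xs: [x for x in xs if x % 4 == 0],
    "total": lambda xs: [sum(xs)],
}
cache = ChainCache({"nums": [1, 2, 3, 4]}, functions)
first = cache.get("nums", ("double", "mult4"))
second = cache.get("nums", ("double", "mult4", "total"))
third = cache.get("nums", ("double", "mult4", "total"))
```

After the three calls, `cache.applied` lists each function once: the second request reuses the result of the first and runs only `total`, and the third is served directly from memory.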
18 Claims
1. A cache management system for managing a plurality of intermediate data, the cache management system comprising:
a processor; and
a memory having stored thereon the plurality of intermediate data and instructions that when executed by the processor, cause the processor to perform:
identifying a new intermediate data to be accessed;
loading the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and
in response to not identifying the new intermediate data as one of the plurality of intermediate data:
identifying a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, each generating logic chain comprising a chain of functions applied to a corresponding intermediate data source, a first function of the chain of functions operating on the corresponding intermediate data source and each subsequent function of the chain of functions operating on an output of a previous function of the chain of functions; and
generating the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data, wherein the identifying of the reusable intermediate data comprises:
identifying a subset of intermediate data, each one of the subset of intermediate data having a same intermediate data source as the new intermediate data and a generating logic chain that is not longer than that of the new intermediate data; and
comparing functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data, in a sequence, to identify the reusable intermediate data as one of the subset of intermediate data having a longest sequence of matching functions with the new intermediate data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method of managing a plurality of intermediate data stored in memory, the method comprising:
identifying, by a processor, a new intermediate data to be accessed;
loading, by the processor, the intermediate data from the memory in response to identifying the new intermediate data as one of the plurality of intermediate data; and
in response to not identifying the new intermediate data as one of the plurality of intermediate data:
identifying, by the processor, a reusable intermediate data having a longest duplicate generating logic chain that is at least in part the same as a generating logic chain of the new intermediate data, each generating logic chain comprising a chain of functions applied to a corresponding intermediate data source, a first function of the chain of functions operating on the corresponding intermediate data source and each subsequent function of the chain of functions operating on an output of a previous function of the chain of functions; and
generating, by the processor, the new intermediate data from the reusable intermediate data and a portion of the generating logic chain of the new intermediate data not in common with the reusable intermediate data, wherein the identifying of the reusable intermediate data comprises:
identifying, by the processor, a subset of intermediate data, each one of the subset of intermediate data having a same intermediate data source as the new intermediate data and a generating logic chain that is not longer than that of the new intermediate data; and
comparing, by the processor, functions of the generating logic chain of each one of the subset of intermediate data to those of the new intermediate data, in a sequence, to identify the reusable intermediate data as one of the subset of intermediate data having a longest sequence of matching functions with the new intermediate data.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
Specification