Out-of core similarity matching
First Claim
1. A computer-implemented method for data reduction, the method comprising:
- in response to a request for compressing data in a data storage system, partitioning the data into a plurality of data chunks, including a target data chunk and a base data chunk;
generating representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk, the representative data including fingerprints of the target data chunk and the base data chunk and a plurality of features extracted from the target data chunk and the base data chunk, wherein each of the plurality of features is a value having a property that a probability of the target data chunk having same representative value as the base data chunk is proportional to data similarity of the target data chunk and the base data chunk;
sorting the representative data for the target data chunk and the base data chunk to form a sorted representative data list based on a first feature defined in the representative data for the target data chunk and the base data chunk; and
generating a delta data chunk as the difference between the target data chunk and the base data chunk where the representative data of the target chunk is proximate to the representative data of the base data chunk in the sorted representative data list.
6 Assignments
0 Petitions
Accused Products
Abstract
A method for storing data in a data storage system by partitioning the data into a plurality of data chunks and generating representative data for each of the plurality of chunks by applying a predetermined algorithm to each chunk of the plurality of chunks. Subsequently, the representative data is compared and sorted. Representative data for base data chunks and representative data for other data chunks that can be stored relative to the base data chunks are identified by evaluating the sorted set of representative data. Finally, each of the other data chunks identified as those that can be stored relative to a base data chunk are stored in the data storage system as the difference between the data chunk and a base data chunk.
-
Citations
27 Claims
-
1. A computer-implemented method for data reduction, the method comprising:
-
in response to a request for compressing data in a data storage system, partitioning the data into a plurality of data chunks, including a target data chunk and a base data chunk; generating representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk, the representative data including fingerprints of the target data chunk and the base data chunk and a plurality of features extracted from the target data chunk and the base data chunk, wherein each of the plurality of features is a value having a property that a probability of the target data chunk having same representative value as the base data chunk is proportional to data similarity of the target data chunk and the base data chunk; sorting the representative data for the target data chunk and the base data chunk to form a sorted representative data list based on a first feature defined in the representative data for the target data chunk and the base data chunk; and generating a delta data chunk as the difference between the target data chunk and the base data chunk where the representative data of the target chunk is proximate to the representative data of the base data chunk in the sorted representative data list. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A non-transitory computer-readable storage medium having instructions stored therein, which when executed by a computer, cause the computer to:
-
in response to a request for compressing data in a data storage system, partition the data into a plurality of data chunks, including a target data chunk and base data chunk; generate representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk, the representative data including fingerprints of the target data chunk and the base data chunk and a plurality of features extracted from the target data chunk and the base data chunk, wherein each of the plurality of features is a value having a property that a probability of the target data chunk having same representative value as the base data chunk is proportional to data similarity of the target data chunk and the base data chunk; sort the representative data for the target data chunk and the base data chunk to form a sorted representative data list based on a first feature defined in the representative data for the target data chunk and the base data chunk; and generate a delta data chunk in the data storage system as the difference between the target data chunk and the base data chunk where the representative data of the target chunk is proximate to the representative data of the base data chunk in the sorted representative data list. - View Dependent Claims (18, 19, 20, 21, 22, 23, 24)
-
-
25. A data storage system, comprising:
-
a memory unit to store a chunk storage engine, a compression engine, a comparison and sorting module, a similarity matching module and a delta encoding module; a processor coupled to the memory unit, the processor configured to execute the chunk storage engine, the compression engine, the comparison and sorting module, the similarity matching module, and the delta encoding module, the chunk storage engine to partition data into a plurality of data chunks, including a target data chunk and a base data chunk, the compression engine to generate representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk, the representative data including fingerprints of the target data chunk and the base data chunk and a plurality of features extracted from the target data chunk and the base data chunk, wherein each of the plurality of features is a value having a property that a probability of the target data chunk having same representative value as the base data chunk is proportional to data similarity of the target data chunk and the base data chunk, the comparison and sorting module to sort the representative data for the target data chunk and the base data chunk based on similarity of a first feature defined in the representative data for the target data chunk and a first feature defined in the representative data for the base data chunk to form a sorted representative data list, the similarity matching module to evaluate where representative data of the target chunk are proximate to representative data of the base data chunk in the sorted representative data list, and the delta encoding module to generate a delta data chunk as the difference between the target data chunk and the base data chunk, wherein the delta data chunk and the base data chunk represent the target data chunk. - View Dependent Claims (26, 27)
-
Specification