Systems and methods for efficient data searching, storage and reduction
First Claim
1. A method enabling lossless data reduction comprising:
- partitioning version data into:
a) data corresponding to data already stored in a repository; and
b) data not already stored in the repository;
wherein the data already stored in the repository comprise a plurality of repository chunks, wherein the version data comprise a plurality of version chunks, the method further comprising:
storing in an index a plurality of n repository distinguishing characteristics (RDCs) and a position in the repository of each of the plurality of repository chunks, where n is smaller than size m of the repository chunk, where m is a value representative of a number of bytes of the repository chunk; and
for each version chunk:
determining a plurality of k input distinguishing characteristics (IDCs) of the version chunk, where k is greater than or equal to n;
determining whether a similar repository chunk exists based on a plurality of matching distinguishing characteristics in the version chunk and similar repository chunk, wherein the similarity determination includes searching for each of the k distinguishing characteristics of the version chunk in the index until at most n matches are found;
determining that one or more similar repository chunks exist where the number of matches satisfies a threshold;
determining differences between the version chunk and similar repository chunk by comparing full data of the respective chunks; and
storing the differences in the repository.
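The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the claim does not say how a distinguishing characteristic is computed, so the sketch takes the largest hashes of fixed-size byte windows; the names `Repository`, `characteristics`, `SEED`, `N_RDC`, `K_IDC` and `THRESHOLD` are hypothetical; and `difflib.SequenceMatcher` stands in for the full-data comparison step.

```python
import hashlib
from difflib import SequenceMatcher

SEED = 16       # assumed window size in bytes used to derive characteristics
N_RDC = 4       # n: characteristics indexed per repository chunk (n << chunk size m)
K_IDC = 8       # k: characteristics computed per version chunk (k >= n)
THRESHOLD = 2   # assumed minimum number of matches to call two chunks similar


def characteristics(chunk, count):
    """Return the `count` largest distinct window hashes of `chunk` (assumed scheme)."""
    hashes = {
        int.from_bytes(hashlib.blake2b(chunk[i:i + SEED], digest_size=8).digest(), "big")
        for i in range(len(chunk) - SEED + 1)
    }
    return sorted(hashes, reverse=True)[:count]


class Repository:
    def __init__(self):
        self.chunks = []   # repository chunks; position = list index
        self.index = {}    # characteristic -> position of its repository chunk
        self.deltas = []   # stored differences: (base position, operations)

    def add_chunk(self, chunk):
        """Store a chunk and index its n RDCs together with its position."""
        pos = len(self.chunks)
        self.chunks.append(chunk)
        for dc in characteristics(chunk, N_RDC):
            self.index.setdefault(dc, pos)
        return pos

    def find_similar(self, version_chunk):
        """Look up each of the k IDCs in the index, stopping after at most n matches."""
        votes, found = {}, 0
        for dc in characteristics(version_chunk, K_IDC):
            pos = self.index.get(dc)
            if pos is None:
                continue
            votes[pos] = votes.get(pos, 0) + 1
            found += 1
            if found == N_RDC:
                break
        if not votes:
            return None
        best = max(votes, key=votes.get)
        return best if votes[best] >= THRESHOLD else None

    def store_version_chunk(self, version_chunk):
        """Store only differences if a similar chunk exists, else the whole chunk."""
        pos = self.find_similar(version_chunk)
        if pos is None:
            return ("new", self.add_chunk(version_chunk))
        base = self.chunks[pos]
        matcher = SequenceMatcher(None, base, version_chunk, autojunk=False)
        ops = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))                   # bytes already in repository
            elif j2 > j1:
                ops.append(("insert", version_chunk[j1:j2]))   # bytes not yet stored
        self.deltas.append((pos, ops))
        return ("delta", len(self.deltas) - 1)

    def restore(self, delta_id):
        """Rebuild the version chunk from its base chunk and stored differences."""
        pos, ops = self.deltas[delta_id]
        base = self.chunks[pos]
        return b"".join(base[op[1]:op[2]] if op[0] == "copy" else op[1] for op in ops)
```

In this sketch the index holds only n small values per chunk, so the similarity search costs a handful of dictionary lookups regardless of how many chunks the repository holds. The full chunk data is read only once a similar chunk has been identified, and `restore` shows that the stored differences are enough to rebuild the version chunk exactly, which is what makes the reduction lossless.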
Abstract
Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.
19 Claims
Specification