Systems and Methods for Efficient Data Searching, Storage and Reduction
2 Assignments
0 Petitions
Accused Products
Abstract
Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.
-
Citations
152 Claims
-
1-115. -115. (canceled)
-
116. A method in repository data for data that are similar to an input data, wherein the repository data comprises a plurality of repository data chunks and the input data comprises a plurality of input data chunks, the method comprising:
-
for each repository data chunk, generating a corresponding set of repository distinguishing characteristics (RDCs); for each input data chunk, generating a corresponding set of input distinguishing characteristics (IDCs); and searching for data in the repository data that is similar to the input data by comparing the IDC s and RDCs, wherein each set of ROCs and IOCs is generated by; applying a hash function to the respective input data chunk or repository data chunk to generate a plurality of hashes, each hash comprising a hash value and a hash position within the data chunk; applying a first function to the plurality of generated hashes to identify a first subset of hashes distributed across the data chunk; applying a second function to the hash positions of the hashes of the first subset to identify a second subset of the plurality of generated hashes; and
defining the second subset of hashes as the set of respective IDCs or RDCs. - View Dependent Claims (117, 118, 119, 120, 121, 122, 123, 124, 125)
-
-
126-1. A. The method of claim 126, wherein the comparing is conducted in a time independent of a size of the repository data and linear in a size of the input data.
-
128. A computer-readable medium encoded with computer-executable instructions that cause a computer to perform a method of searching in repository data for data that are similar to an input data, wherein the repository data is divided into one or more repository data chunks and the input data comprises a plurality of input data chunks, the medium comprising:
-
program code for generating a corresponding set of repository distinguishing characteristics (RDCs) for each repository data chunk; program code for generating a set of input distinguishing characteristics (IDCs) for each input data chunk; and program code for searching for data in the repository data that is similar to the input data by comparing the IDCs and RDCs, wherein the program code for generating each set of IDCs and the program code for generating each set of RDCs comprises; program code for applying a hash function to the respective input data chunk or repository data chunk to generate a plurality of hashes, each hash comprising a hash value and a hash position within the data chunk; program code for applying a first function to the plurality of generated hashes to identify a first subset of hashes distributed across the data chunk; program code for applying a second function to the hash positions of the hashes of the first subset to identify a second subset of the plurality of generated hashes; and program code for defining the second subset of hashes as the set of respective IDCs or RDCs. - View Dependent Claims (129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139)
-
-
140. A system for searching in repository data for data that are similar to an input data, wherein the repository data is divided into one or more repository data chunks and the input data comprises a plurality of input data chunks, the system comprising:
-
means for, for each repository data chunk, generating a corresponding set of repository distinguishing characteristics (RDCs); means for generating a corresponding set of input distinguishing characteristics (IDCs) for each input data chunk; and means for searching for data in the repository data that is similar to the input data by comparing the IDCs and RDCs, wherein the means for generating the IDCs and the means for generating the RDCs each comprises; means for applying a hash function to the respective input data chunk or repository data chunk to generate a plurality of hashes, each hash comprising a hash value and a hash position within the data chunk; means for applying a first function to the plurality of generated hashes to identify a first subset of hashes distributed across the data chunk; means for applying a second function to the hash positions of the hashes of the first subset to identify a second subset of the plurality of generated hashes; and means for defining the second subset of hashes as the set of respective IDCs or RDCs. - View Dependent Claims (141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151)
-
-
152-186. -186. (canceled)
Specification