Systems and methods for efficient data searching, storage and reduction
First Claim
1. A method for identifying input data in repository data comprising:
- providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
partitioning the input data into a plurality of input chunks;
for each input chunk, determining at least K distinguishing characteristics and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J≦
N≦
K; and
computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.
3 Assignments
0 Petitions
Accused Products
Abstract
Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.
-
Citations
94 Claims
-
1. A method for identifying input data in repository data comprising:
-
providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
partitioning the input data into a plurality of input chunks;
for each input chunk, determining at least K distinguishing characteristics and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J≦
N≦
K; and
computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
-
-
21. A method for identifying common sections of two data intervals comprising:
-
determining anchors that define corresponding intervals in the two data intervals which are likely to contain matching data, each anchor comprising a pair of matching seeds in the two data intervals; and
comparing the data between and in the vicinity of the anchors in the corresponding intervals to find matching data intervals. - View Dependent Claims (22, 23, 24, 25)
-
-
26. A system for identifying input data in repository data comprising:
-
means for providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
means for partitioning the input data into a plurality of input chunks;
means for determining at least K distinguishing characteristics for each input chunk and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J≦
N≦
K; and
means for computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.
-
-
27. A system for identifying input data in repository data comprising:
-
a processor; and
a memory, wherein the processor and memory are configured to perform a method comprising;
providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
partitioning the input data into a plurality of input chunks;
for each input chunk, determining at least K distinguishing characteristics and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J≦
N≦
K; and
computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.
-
-
28. A computer-readable medium containing instructions to configure a data processor to perform a method for identifying input data in repository data, the method comprising:
-
providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
partitioning the input data into a plurality of input chunks;
for each input chunk, determining at least K distinguishing characteristics and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J≦
N≦
K; and
computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.
-
-
29. A system for identifying input data in repository data, the system comprising at least one memory comprising:
-
code that provides an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
code that partitions the input data into a plurality of input chunks;
code that determines at least K distinguishing characteristics for each input chunk and searches the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J≦
N≦
K; and
code that computes at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.
-
-
30. A computer readable media for identifying input data in repository data, the computer readable media comprising code, the code comprising:
-
code that provides an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
code that partitions the input data into a plurality of input chunks;
code that determines at least K distinguishing characteristics for each input chunk and searches the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J≦
N≦
K; and
code that computes at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.
-
-
31. A method enabling lossless data reduction by partitioning version data into:
-
a) data already stored in a repository; and
b) data not already stored in the repository;
wherein, each of the repository data and the version data comprise a plurality of data chunks, and wherein the method comprises, for each version chunk;
determining whether a similar repository chunk exists based on a plurality of matching distinguishing characteristics in the version chunk and similar repository chunk; and
determining differences between the version chunk and similar repository chunk by comparing the full data of the respective chunks. - View Dependent Claims (32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55)
-
-
56. A method of locating matching data in a repository to input data comprising:
-
applying a hash-based function to determine, for each of a plurality of chunks of the input data, a set of representation values for each input chunk;
selecting a subset of the set of representation values to determine a set of distinguishing characteristics for each input chunk;
using the set of input distinguishing characteristics to locate a chunk of the repository data deemed likely to contain matching data;
using the input representation values to identify matching data in the repository chunk. - View Dependent Claims (57, 58, 59)
-
-
60. A method of searching a repository of binary uninterpretted data for a location of common data to an input data comprising:
-
analyzing segments of each of the repository and input data to determine a repository segment that is similar to an input segment, the analyzing step including searching an index of representation values of the repository data for matching representation values of the input in a time independent of a size of the repository and linear in a size of the input data; and
analyzing the similar repository segment with respect to the input segment to determine their common data sections while utilizing at least some of the matching representation values for data alignment, in a time linear in a size of the input segment.
-
-
61. A method of indexing repository data comprising:
-
generating distinguishing characteristics of input data;
using the input data distinguishing characteristics for locating a similar data segment in the repository data;
using the input data distinguishing characteristics for locating common data sections in the similar repository data segment;
storing in the index at least some of the distinguishing characteristics of the input data; and
storing at least some noncommon data sections of the input data in the repository data.
-
-
62. A method comprising:
computing data characteristics for incoming data; and
searching for elements of the incoming data characteristics within an index of repository data characteristics and declaring a similarity match between a portion of a repository and a portion of the new data if the matched characteristics pass a threshold. - View Dependent Claims (63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73)
-
74. A method for searching in repository data for parts that are sufficiently similar to an input data according to a similarity criterion, comprising:
-
1. processing the repository data by a. dividing the repository data into parts called repository chunks;
b. for each of the repository chunks, calculating one or a plurality of repository distinguishing characteristics (RDCs), each RDC belonging an interval of integers called value range;
c. creating pairs associating each RDC with a corresponding repository chunk; and
d. maintaining an index storing the pairs;
2. processing the input data by a. dividing the input data into parts called input chunks; and
b. performing for each of the input chunks;
i. calculating one or a plurality of input distinguishing characteristics (IDCs);
ii. searching for the IDCs in the pairs stored in the index; and
iii. if a threshold j of the IDCs has been found in the pairs stored in the index, declaring a match between the input chunk and the corresponding repository chunk(s) that are associated with the IDCs in the pairs. - View Dependent Claims (75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94)
-
Specification