Systems and methods for efficient data searching, storage and reduction
First Claim
1. A computer-readable storage media encoded with computer-executable instructions to configure a processor to perform a method for identifying input data in repository data, the method comprising:
- providing an index of the repository data comprising a plurality of repository distinguishing characteristics (RDCs) for each of a plurality of chunks of the repository data;
partitioning the input data into a plurality of input chunks and for each input chunk, determining a plurality of input distinguishing characteristics (IDCs);
wherein the distinguishing characteristics (DCs) of the repository and input chunks are determined by;
selecting a seed size and calculating hash values for every seed of the chunk;
selecting a subset of the plurality of hash values;
determining positions of the seeds within a seed sequence of the selected subset of hash values;
applying a function to the determined positions to determine corresponding other positions within the seed sequence;
defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions;
conducting a similarity search for each input chunk comprising searching the index for matches of the IDCs of the input chunk with the RDCs, wherein the similarity searching requires a threshold number of matching IDCs and RDCs for a declared similarity of an input chunk and similar repository chunk; and
computing at least one of common and noncommon sections of the input data and repository data using the locations of pairs of matching distinguishing characteristics of an input chuck and similar repository chunk as anchors to define corresponding intervals in the input data and repository data for use in identifying said common or noncommon data sections.
3 Assignments
0 Petitions
Accused Products
Abstract
Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.
-
Citations
31 Claims
-
1. A computer-readable storage media encoded with computer-executable instructions to configure a processor to perform a method for identifying input data in repository data, the method comprising:
-
providing an index of the repository data comprising a plurality of repository distinguishing characteristics (RDCs) for each of a plurality of chunks of the repository data; partitioning the input data into a plurality of input chunks and for each input chunk, determining a plurality of input distinguishing characteristics (IDCs); wherein the distinguishing characteristics (DCs) of the repository and input chunks are determined by; selecting a seed size and calculating hash values for every seed of the chunk; selecting a subset of the plurality of hash values; determining positions of the seeds within a seed sequence of the selected subset of hash values; applying a function to the determined positions to determine corresponding other positions within the seed sequence; defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions; conducting a similarity search for each input chunk comprising searching the index for matches of the IDCs of the input chunk with the RDCs, wherein the similarity searching requires a threshold number of matching IDCs and RDCs for a declared similarity of an input chunk and similar repository chunk; and computing at least one of common and noncommon sections of the input data and repository data using the locations of pairs of matching distinguishing characteristics of an input chuck and similar repository chunk as anchors to define corresponding intervals in the input data and repository data for use in identifying said common or noncommon data sections. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
-
-
31. A system for identifying input data in repository data comprising:
-
a processor; and a memory, wherein the processor and memory are configured to perform a method comprising; providing an index of the repository data comprising a plurality of repository distinguishing characteristics (RDCs) for each of a plurality of chunks of the repository data; partitioning the input data into a plurality of input chunks and for each input chunk, determining a plurality of input distinguishing characteristics (IDCs); wherein the distinguishing characteristics (DCs) of the repository and input chunks are determined by; selecting a seed size and calculating hash values for every seed of the chunk; selecting a subset of the plurality of hash values; determining positions of the seeds within a seed sequence of the selected subset of hash values; applying a function to the determined positions to determine corresponding other positions within the seed sequence; defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions; conducting a similarity search for each of input chunk comprising searching the index for matches of the IDCs of the input chunk with the RDCs, wherein the similarity searching requires a threshold number of matching IDCs and RDCs for a declared similarity of an input chunk and similar repository chunk; and computing at least one of common and noncommon sections of the input data and repository data using the locations of pairs of matching distinguishing characteristics of an input chuck and similar repository chunk as anchors to define corresponding intervals in the input data and repository data for use in identifying said common or noncommon data sections.
-
Specification