Systems and methods for efficient data searching, storage and reduction
Abstract
Systems and methods enable searching a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in the size of the input data, and in a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, in a time that is linear in the segment size and in constant space.
93 Citations
7 Claims
1. A system for providing input data to a repository to search repository data in the repository for data that are similar to the input data, the input data being divided into one or more input chunks, the system comprising:
- a data processor and a memory storing instructions for, for each input chunk, calculating a corresponding set of input distinguishing characteristics (IDCs), each set of IDCs comprising a plurality of distinguishing characteristics, said data processor being configured to partition the respective input chunk into a plurality of seeds, each seed being a smaller part of the respective input chunk and ordered in a seed sequence and to apply a hash function to each of the seeds to generate a plurality of hash values wherein each seed yields one hash value, characterized in that:
said memory storing instructions configured to cause the data processor to select a subset (k) of the plurality of hash values;
determine positions of the seeds within the seed sequence corresponding to the selected subset of hash values;
apply a function to the determined positions to determine corresponding other positions within the seed sequence; and
define the set of distinguishing characteristics as the hash values of the seeds at the determined other positions.
Dependent claims: 2, 3
4. A method for providing input data to a repository to search repository data in the repository for data that is similar to the input data, the method comprising:
- dividing the input data into one or more input chunks;
- calculating a set of input distinguishing characteristics (IDCs) for each chunk, the set of input distinguishing characteristics comprising a plurality of characteristics and being obtained by:
partitioning the respective input chunk into a plurality of seeds (s), each seed being a smaller part of the respective input chunk and ordered in a seed sequence;
applying a hash function to each of the seeds to generate a plurality of hash values wherein each seed yields one hash value;
selecting a subset (k) of the plurality of hash values;
determining positions of the seeds within the seed sequence corresponding to the selected subset of hash values;
applying a function to the determined positions to determine corresponding other positions within the seed sequence; and
defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions.
Dependent claims: 5, 6, 7
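The steps recited in claim 4 can be illustrated with a minimal Python sketch. The claim text leaves several choices open, so everything concrete here is an assumption made for illustration only: seeds are taken as overlapping byte windows of a fixed `seed_size`, the hash function is a truncated BLAKE2b digest, the subset of k hash values is chosen as the k largest, and the position function is a fixed `shift` to the right (clamped to the end of the seed sequence). The function names `seed_hashes` and `distinguishing_characteristics` are hypothetical.

```python
import hashlib


def seed_hashes(chunk: bytes, seed_size: int) -> list[int]:
    """Hash every seed of the chunk, in seed-sequence order.

    Assumption: a seed is each overlapping window of `seed_size` bytes.
    """
    return [
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + seed_size], digest_size=8).digest(), "big"
        )
        for i in range(len(chunk) - seed_size + 1)
    ]


def distinguishing_characteristics(
    chunk: bytes, seed_size: int = 8, k: int = 4, shift: int = 1
) -> list[int]:
    """Return up to k distinguishing characteristics for one input chunk."""
    hashes = seed_hashes(chunk, seed_size)
    if not hashes:
        return []  # chunk is shorter than one seed
    last = len(hashes) - 1

    # Select a subset of k hash values (assumption: the k largest) and
    # record the positions of their seeds within the seed sequence.
    selected = sorted(range(len(hashes)), key=lambda i: hashes[i], reverse=True)[:k]

    # Apply a function to those positions to obtain the "other" positions
    # (assumption: a fixed shift, clamped to the last seed).
    others = [min(p + shift, last) for p in selected]

    # The distinguishing characteristics are the hash values at the other positions.
    return [hashes[p] for p in others]


if __name__ == "__main__":
    chunk = bytes(range(64))
    print(distinguishing_characteristics(chunk))
```

In this sketch the selected hash values serve only to locate positions; the values actually reported come from the shifted positions, which mirrors the distinction the claim draws between the "determined positions" and the "other positions".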
Specification