Systems and methods for efficient data searching, storage and reduction

US 8,275,756 B2
Filed: 03/20/2009
Issued: 09/25/2012
Est. Priority Date: 09/15/2004
Status: Expired due to Fees

Abstract

Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.

20 Claims

1. A system for searching in a repository data for data that are similar to an input data, the repository data being divided into one or more repository chunks, the system comprising:
- means for, for each repository chunk, calculating a corresponding set of repository distinguishing characteristics (RDCs), each set of RDCs comprising a plurality of distinguishing characteristics, said means arranged to partition the respective data chunks into a plurality of seeds, each seed being a smaller part of the respective data chunk and ordered in a seed sequence and to apply a hash function to each of the seeds to generate a plurality of hash values wherein each seed yields one hash value;
  
  means for maintaining an index associating each set of RDCs and the corresponding repository chunk;
  
  means for comparing input distinguishing characteristics of an input chunk of input data to one or more sets of RDCs stored in the index to determine whether a similarity exists between the input chunk and the distinguishing repository chunk, characterized in that;
  
  said comparing means is configured to determine a similarity exists if a similarity threshold (j) of a set of input distinguishing characteristics is found in a set of RDCs stored in the index; and
  
  in that said calculating means is configured to select a subset (k) of the plurality of hash values;
  
  to determine positions of the seeds within the seed sequence corresponding to the selected subset of hash values;
  
  to apply a function to the determined positions to determine corresponding other positions within the seed sequence; and
  
  to define the set of distinguishing characteristics as the hash values of the seeds at the determined other positions.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The system of claim 1, wherein:
    - the subset (k) of hash values is selected by identifying the k largest hash values; and
      
      the function applied to determine the corresponding other positions is to identify a next seed in the seed (s) sequence.
  - 3. The system of claim 1, further comprising:
    - means for determining one or more differences between the input data chunk and the identified similar repository data chunk by comparing the full data of the respective data chunks.
  - 4. The system of claim 3 further comprising:
    - means for storing the determined differences in a same repository in which the repository data is stored.
  - 5. The system of claim 1, used for at least one of:
    - data factoring, and data backup.
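
The characteristic-selection steps recited in claims 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the choice of BLAKE2b as the hash function, byte-wise overlapping seeds, the default seed size and k, and the handling of the final seed (it has no successor, so it is excluded from selection) are all hypothetical choices made for the example.

```python
import hashlib


def seed_hashes(chunk: bytes, seed_size: int) -> list[int]:
    """Hash every seed of the chunk, in seed-sequence order.

    Assumption: seeds are overlapping windows of seed_size bytes,
    and BLAKE2b (truncated to 64 bits) stands in for the hash function.
    """
    return [
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + seed_size], digest_size=8).digest(), "big"
        )
        for i in range(len(chunk) - seed_size + 1)
    ]


def distinguishing_characteristics(
    chunk: bytes, seed_size: int = 8, k: int = 4
) -> list[int]:
    """Return up to k distinguishing characteristics for one chunk."""
    hashes = seed_hashes(chunk, seed_size)
    # Select the positions of the k largest hash values. The last
    # position is skipped (assumption) because it has no next seed.
    candidates = range(len(hashes) - 1)
    positions = sorted(candidates, key=hashes.__getitem__, reverse=True)[:k]
    # Apply the position function "next seed in the seed sequence" and
    # take the hash values found at those other positions.
    return [hashes[p + 1] for p in positions]
```

Calling `distinguishing_characteristics` on a repository chunk would give its RDCs, and the same call on an input chunk would give its IDCs, so both sides are derived in the same way.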

6. A method of searching in repository data for data that are similar to an input data, wherein the repository data is divided into one or more repository chunks, the method comprising:
- for each repository chunk, calculating a corresponding set of repository distinguishing characteristics (RDCs), each set of RDCs comprising a plurality (n) of distinguishing characteristics;
  
  maintaining an index associating each set of RDCs and the corresponding repository chunk;
  
  calculating input distinguishing characteristics (IDCs) for an input chunk of data;
  
  comparing the IDCs to one or more sets of RDCs stored in the index to determine if a similarity exists between the input chunk and the corresponding repository chunk, wherein the RDCs and the IDCs are obtained by;
  
  partitioning the respective data chunk into a plurality of seed(s), each seed being a smaller part of the respective data chunk and ordered in a seed sequence;
  
  applying a hash function to each of the seeds to generate a plurality of hash values, wherein each seed yields one hash value;
  
  characterized in that a set of IDCs is calculated for each input data chunk, the set comprising a plurality (k) of distinguishing characteristics, said set being compared with the sets of RDCs;
  
  in that it is determined that similarity exists if a similarity threshold (j) of the distinguishing characteristics in the set of IDCs is found in a set of RDCs stored in the index; and
  
  in that each set of RDCs and IDCs is obtained by the further steps of selecting a subset (k) of the plurality of hash values;
  
  determining positions of the seeds within the seed sequence corresponding to the selected subset of hash values;
  
  applying a function to the determined positions to determine corresponding other positions within the seed sequence;
  
  and defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions.
- View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
- - 7. The method of claim 6, wherein:
    - the subset of hash values is selected by identifying the k largest hash values; and
      
      the function applied to determine the corresponding other positions is to identify a next seed in the seed sequence.
  - 8. The method of claim 6, further comprising:
    - determining one or more differences between the input data chunk and the identified similar repository data chunk by comparing the full data of the respective data chunks.
  - 9. The method of claim 6, further comprising:
    - storing the determined differences in a same repository in which the repository data is stored.
  - 10. The method of claim 7, wherein the method is used for at least one of:
    - data factoring, and data backup.
  - 11. The method of claim 7, wherein the similarity threshold is met when a predetermined number of the distinguishing characteristics in the set of IDCs is found in a set of RDCs.
  - 12. The method of claim 7, wherein the subset (k) of the plurality of hash values are the k largest mathematical hash values in a set, and wherein the function which is applied to the determined positions is to take the next sequential seed relative to each seed corresponding to each of the k largest mathematical hash values.
  - 13. The method of claim 7, wherein the hash function is selected from a rolling hash function and a modular hash function.
  - 14. The method of claim 7, wherein the sets of RDCs are stored in the index as at least one of:
    - a binary tree, a B tree, a sorted list, and a hash table.
  - 15. The method of claim 7, wherein each seed is a consecutive sequence of base elements and has the same seed size s.
  - 16. The method of claim 15, wherein the seeds comprise overlapping seeds.
  - 17. The method of claim 7, wherein the step of comparing the IDCs to one or more sets of RDCs to determine if a similarity exists is conducted in a time independent of a size of the repository and linear in a size of the input data.
  - 18. A non-transitory computer-readable medium encoded with computer-executable instructions that cause a computer to perform a method comprising identifying input data in repository data wherein the repository data comprise repository data chunks and the input data comprise input data chunks, and wherein each repository data chunk has a corresponding set of one or more repository data chunk distinguishing characteristics (RDCs), the method including the steps of claim 7.
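
Claims 13, 15 and 16 mention a rolling hash function applied to fixed-size, overlapping seeds. A minimal sketch of one such function is given below. It assumes a polynomial (Rabin-Karp style) hash with bytes as the base elements; the base, the modulus and both function names are hypothetical choices for illustration and are not taken from the patent.

```python
def direct_seed_hash(seed: bytes, base: int = 257,
                     modulus: int = (1 << 61) - 1) -> int:
    """Polynomial hash of a single seed, computed from scratch."""
    h = 0
    for byte in seed:
        h = (h * base + byte) % modulus
    return h


def rolling_seed_hashes(data: bytes, seed_size: int, base: int = 257,
                        modulus: int = (1 << 61) - 1) -> list[int]:
    """Hash every overlapping seed of seed_size bytes in one pass."""
    if seed_size <= 0 or len(data) < seed_size:
        return []
    # Weight of the byte that leaves the window on each step.
    high = pow(base, seed_size - 1, modulus)
    h = direct_seed_hash(data[:seed_size], base, modulus)
    hashes = [h]
    for i in range(seed_size, len(data)):
        # Remove the outgoing byte, shift, then add the incoming byte.
        h = (h - data[i - seed_size] * high) % modulus
        h = (h * base + data[i]) % modulus
        hashes.append(h)
    return hashes
```

Each step updates the previous hash in constant time instead of rehashing the whole seed, so the total work in this sketch grows linearly with the length of the data.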

19. A method of searching in repository data for data that is similar to an input data wherein the repository data is divided into one or more repository chunks, the method comprising:
- for each repository chunk, calculating a corresponding set of repository distinguishing characteristics, each set of RDCs comprising a plurality of distinguishing characteristics and being obtained by partitioning the respective data chunk into a plurality of seeds(s), each seed being a smaller part of the respective data chunk and ordered in a seed sequence, and applying a hash function to each of the seeds to generate a plurality of hash values, wherein each seed yields one hash value;
  
  maintaining an index associating each set of RDCs and the corresponding repository chunk;
  
  comparing input distinguishing characteristics of an input chunk of input data to one or more sets of RDCs stored in the index to determine whether a similarity exists between the input chunk and the corresponding repository chunk, characterized in that;
  
  it is determined that a similarity exists between the input chunk and the corresponding repository chunk if a similarity threshold (j) of the distinguishing characteristics in the set of IDCs is found in a set of RDCs stored in the index; and
  
  in that the set of RDCs is obtained by;
  
  selecting a subset (k) of the plurality of hash values;
  
  determining positions of the seeds within the seed sequence corresponding to the selected subset of hash values;
  
  applying a function to the determined positions to determine corresponding other positions within the seed sequence; and
  
  defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions.
- View Dependent Claims (20)
- - 20. A non-transitory computer-readable medium encoded with computer-executable instructions that cause a computer to perform a method of searching in repository data for data that is similar to an input data, wherein the repository data is divided into one or more repository chunks, the method comprising the steps of claim 19.
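
The index and the similarity threshold (j) recited in claims 6, 11 and 19 could be sketched along the following lines. The class name, method names and the vote-counting approach are assumptions made for illustration; a Python dictionary plays the role of the hash table that claim 14 lists as one possible index structure.

```python
from collections import Counter, defaultdict


class SimilarityIndex:
    """Hypothetical index mapping distinguishing characteristics to chunks."""

    def __init__(self) -> None:
        # characteristic value -> ids of repository chunks containing it
        self._chunks_by_dc: defaultdict[int, set[str]] = defaultdict(set)

    def add_chunk(self, chunk_id: str, rdcs: list[int]) -> None:
        """Associate a repository chunk with its set of RDCs."""
        for dc in rdcs:
            self._chunks_by_dc[dc].add(chunk_id)

    def find_similar(self, idcs: list[int], j: int) -> list[tuple[str, int]]:
        """Return (chunk_id, match_count) for chunks sharing >= j IDCs."""
        votes: Counter[str] = Counter()
        for dc in set(idcs):
            # .get avoids creating empty entries for unknown values.
            for chunk_id in self._chunks_by_dc.get(dc, ()):
                votes[chunk_id] += 1
        return [(cid, n) for cid, n in votes.most_common() if n >= j]
```

In this sketch the search performs one expected constant-time dictionary lookup per input characteristic, so the lookup count depends on the number of IDCs rather than on how many chunks the repository holds.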

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Hirsch, Michael; Bitner, Haim; Aronovich, Lior; Asher, Ron; Bachmat, Eitan; Klein, Shmuel T.
Primary Examiner(s)
LEWIS, ALICIA M

Application Number

US12/407,788
Publication Number

US 20090228455A1
Time in Patent Office

1,285 Days
Field of Search

707/999.003, 707/999.203, 707/687, 707/695-698, 707/705, 707/736, 707/741, 707/747, 707/758
US Class Current

707/687
CPC Class Codes

G06F 11/1448   Management of the data invo...

G06F 11/1453   using de-duplication of the...

G06F 16/137   Hash-based content-based in...

G06F 16/1744   using compression, e.g. spa...

G06F 16/2255   Hash tables

G06F 16/2455   Query execution

G06F 2201/80   Database-specific techniques

G06F 2201/805   Real-time

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99953   Recoverability
