Systems and methods for efficient data searching, storage and reduction

US 7,523,098 B2
Filed: 09/15/2004
Issued: 04/21/2009
Est. Priority Date: 09/15/2004
Status: Expired due to Fees

Abstract

Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.

31 Claims

1. A computer-readable storage media encoded with computer-executable instructions to configure a processor to perform a method for identifying input data in repository data, the method comprising:
- providing an index of the repository data comprising a plurality of repository distinguishing characteristics (RDCs) for each of a plurality of chunks of the repository data;
  
  partitioning the input data into a plurality of input chunks and for each input chunk, determining a plurality of input distinguishing characteristics (IDCs);
  
  wherein the distinguishing characteristics (DCs) of the repository and input chunks are determined by:
  
  selecting a seed size and calculating hash values for every seed of the chunk;
  
  selecting a subset of the plurality of hash values;
  
  determining positions of the seeds within a seed sequence of the selected subset of hash values;
  
  applying a function to the determined positions to determine corresponding other positions within the seed sequence;
  
  defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions;
  
  conducting a similarity search for each input chunk comprising searching the index for matches of the IDCs of the input chunk with the RDCs, wherein the similarity searching requires a threshold number of matching IDCs and RDCs for a declared similarity of an input chunk and similar repository chunk; and
  
  computing at least one of common and noncommon sections of the input data and repository data using the locations of pairs of matching distinguishing characteristics of an input chunk and similar repository chunk as anchors to define corresponding intervals in the input data and repository data for use in identifying said common or noncommon data sections.
- Dependent claims (2 to 30):
  - 2. The media of claim 1, wherein the computing step includes moving in seed-size steps on one of the corresponding intervals, and moving in sub-seed size steps on the other of the corresponding intervals.
  - 3. The media of claim 1, wherein the computing step includes determining for each corresponding interval a matching interval of matching data.
  - 4. The media of claim 3, wherein the determining matching interval step includes determining a directive which includes a size and positions of the matching interval in the input chunk and similar repository chunk.
  - 5. The media of claim 4, wherein the determining directive step includes providing the directive in order of ascending input data position.
  - 6. The media of claim 1, wherein the computing step includes selecting at least two anchor pairs as an anchor set located within a position range.
  - 7. The media of claim 6, wherein the computing step includes partitioning the input chunk into a plurality of consecutive non-overlapping anchor sets.
  - 8. The media of claim 7, wherein the computing step includes performing a binary difference method on each anchor set.
  - 9. The media of claim 1, wherein the computing step includes, for each corresponding interval, expanding the matching data around the anchors, and determining the expanded matches as anchor matches.
  - 10. The media of claim 9, wherein the computing step includes determining a directive for the anchor match which includes a size and positions in the input chunk and repository data of the anchor match.
  - 11. The media of claim 9, wherein the computing step includes determining hash values of consecutive non-overlapping seeds in the repository interval, excluding the anchor matches.
  - 12. The media of claim 1, wherein the computing step includes comparing hash values of seeds in the input interval with hash values of the seeds in the repository interval, to determine matching data.
  - 13. The media of claim 12, wherein the computing step includes expanding the matching data to determine a matching interval.
  - 14. The media of claim 13, wherein the computing step includes determining a directive for the matching interval which includes a size and positions in the input chunk and repository data of the copy interval.
  - 15. The media of claim 1, wherein the method is used for data storage.
  - 16. The media of claim 1, wherein the method is used for data reduction.
  - 17. The media of claim 1, wherein the method is used for backup data storage.
  - 18. The media of claim 1, wherein the method is used for data factoring.
  - 19. The media of claim 1, wherein the method is used for updating the repository data.
  - 20. The media of claim 1, wherein the hash values determined for every seed of the input chunk are reused in the computing step for identifying said common or noncommon data sections.
  - 21. The media of claim 1, wherein the DCs are uniformly distributed over a value range.
  - 22. The media of claim 1, wherein each DC has a characteristic location, and the characteristic locations of the DCs for a given chunk are well spread over the chunk.
  - 23. The media of claim 22, wherein the DCs are robust to changes of the data in the respective chunk.
  - 24. The media of claim 1, wherein the computing step is applied to a one most similar repository chunk.
  - 25. The media of claim 1, wherein the computing step is applied to plural most similar repository chunks.
  - 26. The media of claim 1, wherein the computing step is applied to the corresponding repository interval having the largest number of successive anchors.
  - 27. The media of claim 1, wherein:
    - the subset of hash values is selected by identifying the k largest hash values; and
      
      the function applied to determine the corresponding other positions is to identify a next seed in the seed sequence.
  - 28. The media of claim 1, wherein the step of selecting a subset of the plurality of hash values comprises one or more of:
    - selecting a number of the largest hash values;
      
      selecting a number of the smallest hash values;
      
      selecting a number of the hash values closest to a median value of the generated hash values for the corresponding data chunk;
      
      selecting a number of the hash values closest to a constant value; and
      
      selecting a number of the hash values closest to a percentile value of the generated hash values for the corresponding data chunk.
  - 29. The media of claim 1, wherein the step of applying a function to the determined positions comprises:
    - applying a constant value to each hash position corresponding to each of the hashes of the selected subset.
  - 30. The media of claim 29, wherein:
    - an absolute value of the constant value is 1.
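
The DC computation recited in claim 1, with the particular choices of claims 27, 29 and 30 (the k largest hash values, shifted by a constant of 1 to the next seed), can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the default `seed_size` and `k`, and the use of `hashlib.blake2b` as the seed hash are all assumptions made for the example, and a practical system would more likely use a rolling hash so that hashing every seed stays linear in the chunk size.

```python
import hashlib


def seed_hashes(chunk: bytes, seed_size: int) -> list[int]:
    """Hash every seed, i.e. every window of seed_size consecutive bytes.

    blake2b is only a stand-in hash chosen for this sketch.
    """
    return [
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + seed_size], digest_size=8).digest(), "big"
        )
        for i in range(len(chunk) - seed_size + 1)
    ]


def distinguishing_characteristics(chunk: bytes, seed_size: int = 8,
                                   k: int = 4, shift: int = 1):
    """Return a list of (hash value, seed position) pairs for one chunk.

    Steps, following the wording of claim 1:
      1. hash every seed of the chunk;
      2. select a subset of the hashes (here: the k largest, as in claim 27);
      3. take the positions of those seeds in the seed sequence;
      4. apply a function to the positions (here: add a constant, claim 29);
      5. the DCs are the hashes of the seeds at the shifted positions.
    """
    hashes = seed_hashes(chunk, seed_size)
    n = len(hashes)
    # Positions of the k largest hash values.
    top_positions = sorted(range(n), key=lambda i: hashes[i], reverse=True)[:k]
    dcs = []
    for pos in top_positions:
        other = pos + shift
        if 0 <= other < n:  # a shifted position may fall off the chunk end
            dcs.append((hashes[other], other))
    return dcs
```

Calling `distinguishing_characteristics(data, seed_size=8, k=4)` on a chunk yields at most `k` pairs; the same routine would be applied to repository chunks (to build the index of RDCs) and to input chunks (to obtain IDCs).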

31. A system for identifying input data in repository data comprising:
- a processor; and
  
  a memory, wherein the processor and memory are configured to perform a method comprising:
  
  providing an index of the repository data comprising a plurality of repository distinguishing characteristics (RDCs) for each of a plurality of chunks of the repository data;
  
  partitioning the input data into a plurality of input chunks and for each input chunk, determining a plurality of input distinguishing characteristics (IDCs);
  
  wherein the distinguishing characteristics (DCs) of the repository and input chunks are determined by:
  
  selecting a seed size and calculating hash values for every seed of the chunk;
  
  selecting a subset of the plurality of hash values;
  
  determining positions of the seeds within a seed sequence of the selected subset of hash values;
  
  applying a function to the determined positions to determine corresponding other positions within the seed sequence;
  
  defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions;
  
  conducting a similarity search for each input chunk comprising searching the index for matches of the IDCs of the input chunk with the RDCs, wherein the similarity searching requires a threshold number of matching IDCs and RDCs for a declared similarity of an input chunk and similar repository chunk; and
  
  computing at least one of common and noncommon sections of the input data and repository data using the locations of pairs of matching distinguishing characteristics of an input chunk and similar repository chunk as anchors to define corresponding intervals in the input data and repository data for use in identifying said common or noncommon data sections.
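
The similarity search and anchor steps shared by claims 1 and 31 can be illustrated with a small sketch. It assumes DCs are available as (hash value, position) pairs per chunk; the in-memory dictionary index, the function names and the vote counting are hypothetical choices for the example and are not taken from the specification. The last function shows one plausible reading of claims 9 and 10: growing the match around an anchor and reporting a directive of input position, repository position and size.

```python
from collections import Counter, defaultdict


def build_index(repo_dcs):
    """Map each RDC value to the (chunk id, position) pairs where it occurs.

    repo_dcs: dict of chunk id -> list of (dc_value, position).
    """
    index = defaultdict(list)
    for chunk_id, dcs in repo_dcs.items():
        for value, pos in dcs:
            index[value].append((chunk_id, pos))
    return index


def similarity_search(index, input_dcs, threshold):
    """Return (chunk id, anchors) for repository chunks sharing at least
    `threshold` DCs with the input chunk, most matches first.

    Each anchor is an (input position, repository position) pair.
    """
    votes = Counter()
    anchors = defaultdict(list)
    for value, in_pos in input_dcs:
        # .get avoids creating empty entries in the defaultdict index
        for chunk_id, repo_pos in index.get(value, ()):
            votes[chunk_id] += 1
            anchors[chunk_id].append((in_pos, repo_pos))
    return [(c, anchors[c]) for c, n in votes.most_common() if n >= threshold]


def expand_anchor(inp: bytes, repo: bytes, in_pos: int, repo_pos: int,
                  seed_size: int):
    """Grow the match around one anchor byte by byte in both directions.

    Assumes the seeds at the anchor really are identical; equal hashes do
    not guarantee that, so a real system would verify the seed bytes first.
    Returns a directive (input start, repository start, size).
    """
    start_i, start_r = in_pos, repo_pos
    while start_i > 0 and start_r > 0 and inp[start_i - 1] == repo[start_r - 1]:
        start_i -= 1
        start_r -= 1
    end_i, end_r = in_pos + seed_size, repo_pos + seed_size
    while end_i < len(inp) and end_r < len(repo) and inp[end_i] == repo[end_r]:
        end_i += 1
        end_r += 1
    return start_i, start_r, end_i - start_i
```

Because each lookup touches only the index entries for the input chunk's own DCs, the cost of the search in this sketch depends on the number of IDCs rather than on the total repository size, which is the property the abstract emphasizes.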

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Asher, Ron; Aronovich, Lior; Hirsch, Michael; Klein, Shmuel T.; Bachmat, Eitan; Bitner, Haim
Primary Examiner(s)
Alam; Shahid A
Assistant Examiner(s)
Alvesteffer; Jason L

Application Number
US10/941,632
Publication Number
US 20060059173A1
Time in Patent Office
1,679 Days
Field of Search
707/3, 707/202, 707/203, 707/204, 707/2, 711/162
US Class Current
1/1
CPC Class Codes

G06F 11/1448   Management of the data invo...

G06F 11/1453   using de-duplication of the...

G06F 16/137   Hash-based content-based in...

G06F 16/1744   using compression, e.g. spa...

G06F 16/2255   Hash tables

G06F 16/2455   Query execution

G06F 2201/80   Database-specific techniques

G06F 2201/805   Real-time

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99953   Recoverability
