Systems and methods for efficient data searching, storage and reduction

US 20060059173A1
Filed: 09/15/2004
Published: 03/16/2006
Est. Priority Date: 09/15/2004
Status: Active Grant

Abstract

Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.

94 Claims

1. A method for identifying input data in repository data comprising:
- providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
  
  partitioning the input data into a plurality of input chunks;
  
  for each input chunk, determining at least K distinguishing characteristics and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J ≤ N ≤ K; and
  
  computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.
- Dependent claims:
  - 2. The method of claim 1, wherein the computing step includes moving in seed-size steps on one of the corresponding intervals, and moving in sub-seed size steps on the other of the corresponding intervals.
  - 3. The method of claim 1, wherein the computing step includes determining for each corresponding interval a matching interval of matching data.
  - 4. The method of claim 3, wherein the determining matching interval step includes determining a directive which includes a size and positions of the matching interval in the input chunk and similar repository chunk.
  - 5. The method of claim 4, wherein the determining directive step includes providing the directive in order of ascending input data position.
  - 6. The method of claim 1, wherein the computing step includes determining an anchor set, the anchor set including at least two anchor pairs in the input chunk and repository data which define the input interval and repository interval of the corresponding interval.
  - 7. The method of claim 6, wherein the determining anchor set step includes selecting anchors located within a position range.
  - 8. The method of claim 7, wherein the method includes partitioning the input chunk into a plurality of consecutive non-overlapping anchor sets.
  - 9. The method of claim 8, wherein the computing step includes performing a binary difference method on each anchor set.
  - 10. The method of claim 6, wherein the computing step includes, for each corresponding interval, expanding the matching data around the anchors, and determining the expanded matches as anchor matches.
  - 11. The method of claim 10, wherein the computing step includes determining a directive for the anchor match which includes a size and positions in the input chunk and repository data of the anchor match.
  - 12. The method of claim 10, wherein the computing step includes determining hash values of consecutive non-overlapping seeds in the repository interval, excluding the anchor matches.
  - 13. The method of claim 6, wherein the computing step includes comparing hash values of seeds in the input interval with hash values of the seeds in the repository interval, to determine matching data.
  - 14. The method of claim 13, wherein the computing step includes expanding the matching data to determine a matching interval.
  - 15. The method of claim 14, wherein the computing step includes determining a directive for the matching interval which includes a size and positions in the input chunk and repository data of the copy interval.
  - 16. The method of claim 1, wherein the method is used for data storage.
  - 17. The method of claim 1, wherein the method is used for data reduction.
  - 18. The method of claim 1, wherein the method is used for backup data storage.
  - 19. The method of claim 1, wherein the method is used for data factoring.
  - 20. The method of claim 1, wherein the method is used for updating the repository data.
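
The similarity search of claim 1 can be illustrated with a minimal Python sketch. It assumes the index is a plain dictionary mapping each distinguishing characteristic to the repository chunks that contain it; the function name `find_similar_chunk` and the toy values are hypothetical and are not taken from the patent.

```python
from collections import Counter

def find_similar_chunk(index, input_dcs, j):
    """Look up each input characteristic in turn and stop as soon as one
    repository chunk has accumulated j matches. Returns that chunk id,
    or None if no chunk reaches j matches."""
    hits = Counter()
    for dc in input_dcs:                      # the K characteristics of one input chunk
        for chunk_id in index.get(dc, ()):    # repository chunks sharing this value
            hits[chunk_id] += 1
            if hits[chunk_id] >= j:
                return chunk_id               # early exit once J matches are found
    return None

# Toy index (assumed layout): each repository chunk contributes N = 3 characteristics.
index = {}
for chunk_id, dcs in {"r0": [11, 42, 97], "r1": [5, 42, 63]}.items():
    for dc in dcs:
        index.setdefault(dc, []).append(chunk_id)

input_dcs = [63, 8, 5, 70]                    # K = 4, so J = 2 <= N = 3 <= K = 4
print(find_similar_chunk(index, input_dcs, j=2))   # prints r1
```

With a hash-table index, each lookup costs the same regardless of how many chunks the repository holds, which is the property the abstract describes as search time independent of repository size.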

21. A method for identifying common sections of two data intervals comprising:
- determining anchors that define corresponding intervals in the two data intervals which are likely to contain matching data, each anchor comprising a pair of matching seeds in the two data intervals; and
  
  comparing the data between and in the vicinity of the anchors in the corresponding intervals to find matching data intervals.
- Dependent claims:
  - 22. The method of claim 21, wherein the comparing step includes moving in seed-size steps on one of the corresponding intervals, and moving in sub-seed size steps on the other of the corresponding intervals.
  - 23. The method of claim 21, wherein the method includes determining a set of directives in ascending position order of one of the data intervals, wherein each directive includes a size and positions in the two data intervals of the matching data interval.
  - 24. The method of claim 21, wherein the comparing step includes expanding the matching data around the anchors.
  - 25. The method of claim 21, wherein the determining anchors step includes determining matching distinguishing characteristics of the data intervals.
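
Expanding matching data around an anchor (claim 24) and reporting the result as a directive with a size and two positions (claim 23) might look like the following Python sketch. The function name `expand_anchor` and the tuple layout of the directive are assumptions made for illustration only.

```python
def expand_anchor(a, b, pos_a, pos_b, seed_size):
    """Grow a matching seed (a[pos_a:pos_a+seed_size] == b[pos_b:pos_b+seed_size])
    backward and forward while the bytes keep matching.
    Returns a directive (position in a, position in b, size)."""
    assert a[pos_a:pos_a + seed_size] == b[pos_b:pos_b + seed_size]
    start_a, start_b = pos_a, pos_b
    # Extend to the left of the anchor.
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = pos_a + seed_size, pos_b + seed_size
    # Extend to the right of the anchor.
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return (start_a, start_b, end_a - start_a)

# The 4-byte seed b"LO W" sits at offset 5 in the first buffer and 4 in the second.
directive = expand_anchor(b"xxHELLO WORLDyy", b"zHELLO WORLDq", 5, 4, 4)
print(directive)   # prints (2, 1, 11)
```

Collecting such tuples and sorting them by their first field would give the directives in ascending position order of one data interval.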

26. A system for identifying input data in repository data comprising:
- means for providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
  
  means for partitioning the input data into a plurality of input chunks;
  
  means for determining at least K distinguishing characteristics for each input chunk and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J ≤ N ≤ K; and
  
  means for computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.

27. A system for identifying input data in repository data comprising:
- a processor; and
  
  a memory, wherein the processor and memory are configured to perform a method comprising:
  
  providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
  
  partitioning the input data into a plurality of input chunks;
  
  for each input chunk, determining at least K distinguishing characteristics and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J ≤ N ≤ K; and
  
  computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.

28. A computer-readable medium containing instructions to configure a data processor to perform a method for identifying input data in repository data, the method comprising:
- providing an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
  
  partitioning the input data into a plurality of input chunks;
  
  for each input chunk, determining at least K distinguishing characteristics and searching the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J ≤ N ≤ K; and
  
  computing at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.

29. A system for identifying input data in repository data, the system comprising at least one memory comprising:
- code that provides an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
  
  code that partitions the input data into a plurality of input chunks;
  
  code that determines at least K distinguishing characteristics for each input chunk and searches the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J ≤ N ≤ K; and
  
  code that computes at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.

30. A computer readable media for identifying input data in repository data, the computer readable media comprising code, the code comprising:
- code that provides an index of repository data, including at least N distinguishing characteristics for each of a plurality of chunks of the repository data;
  
  code that partitions the input data into a plurality of input chunks;
  
  code that determines at least K distinguishing characteristics for each input chunk and searches the index for each of the K distinguishing characteristics until at least J matches with the repository distinguishing characteristics are found, and if J matches are found for an input chunk and a respective repository chunk, the respective repository chunk being determined to be a similar repository chunk where J ≤ N ≤ K; and
  
  code that computes at least one of common and noncommon sections of the input chunk and similar repository chunk using the matching distinguishing characteristics as anchors to define corresponding intervals in the input chunk and similar repository chunk.

31. A method enabling lossless data reduction by partitioning version data into:
- a) data already stored in a repository; and
  
  b) data not already stored in the repository;
  
  wherein each of the repository data and the version data comprises a plurality of data chunks, and wherein the method comprises, for each version chunk:
  
  determining whether a similar repository chunk exists based on a plurality of matching distinguishing characteristics in the version chunk and similar repository chunk; and
  
  determining differences between the version chunk and similar repository chunk by comparing the full data of the respective chunks.
- Dependent claims:
  - 32. The method of claim 31, further comprising storing the differences in the repository.
  - 33. The method of claim 31, wherein the determining differences step includes use of a method selected from the group consisting of binary difference and byte-wise factoring.
  - 34. The method of claim 31, wherein the determining similar repository chunk step includes searching an index of the distinguishing characteristics of the repository chunk.
  - 35. The method of claim 34, wherein the index includes n distinguishing characteristics of the repository chunk, where n is substantially smaller than size m of the repository chunk.
  - 36. The method of claim 34, wherein the index includes a location in the repository of the distinguishing characteristics of the repository chunk.
  - 37. The method of claim 35, wherein the determining similar repository chunk step includes determining k distinguishing characteristics of the version chunk, where k is greater than or equal to n.
  - 38. The method of claim 37, wherein the method includes searching for each of the k distinguishing characteristics of the version chunk until at most n matches are found.
  - 39. The method of claim 31, wherein the determining similar repository chunk step includes searching until at least two matching distinguishing characteristics are found for the version chunk and one or more repository chunks.
  - 40. The method of claim 31, wherein the distinguishing characteristics are determined by a hash function.
  - 41. The method of claim 40, wherein the distinguishing characteristics are determined by a rolling hash function.
  - 42. The method of claim 41, wherein the distinguishing characteristics are determined by a modular hash function.
  - 43. The method of claim 34, wherein the index is stored as a binary tree, a B tree, a sorted list, or a hash table.
  - 44. The method of claim 43, wherein the index is stored as a hash table.
  - 45. The method of claim 32, wherein pointers are provided to data of the version chunk already stored in the repository.
  - 46. The method of claim 31, wherein the repository and version chunks each comprise a plurality of seeds, each seed being a consecutive sequence of base elements and having the same seed size s, and wherein the distinguishing characteristics are hash values of a selected subset of the seeds of the respective chunk.
  - 47. The method of claim 46, wherein the seeds comprise overlapping seeds.
  - 48. The method of claim 31, wherein the method is used for data factoring.
  - 49. The method of claim 31, wherein the method is used for data backup.
  - 50. The method of claim 31, wherein the method is used for data backup with a data repository of a size for storing up to one or more petabytes of data.
  - 51. The method of claim 31, wherein the determining similar repository chunk step is conducted in a time independent of a size of the repository and linear in a size of the version data.
  - 52. The method of claim 31, wherein a ratio of space needed to store the repository chunk to space needed to store the distinguishing characteristics of the repository chunk is up to 250,000:1.
  - 53. The method of claim 31, wherein the method includes:
    - storing in an index n distinguishing characteristics and a position in the repository of each of a plurality of repository chunks, where n is substantially smaller than size m of the repository chunk;
      
      determining k distinguishing characteristics of the version chunk, where k is greater than or equal to n;
      
      searching for each of the k distinguishing characteristics of the version chunk in the index until at most n matches are found;
      
      determining that one or more similar repository chunks exist where the number of matches satisfies a threshold.
  - 54. The method of claim 53, wherein the method includes modifying the index to include a selected n of the k distinguishing characteristics of the version chunk.
  - 55. The method of claim 54, wherein the method includes modifying the repository to include the differences.
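
Claims 40 to 42, 46 and 47 describe distinguishing characteristics as hash values of a selected subset of overlapping, fixed-size seeds, computed with a rolling modular hash. A minimal Python sketch follows. The base `256`, the prime `1_000_003`, the function names, and the choice to keep the largest hash values are all assumptions for illustration; the claims leave the selection rule open.

```python
def seed_hashes(data, seed_size, base=256, prime=1_000_003):
    """Rolling modular hash of every overlapping seed of `seed_size` bytes.
    Each step updates the previous hash instead of rehashing the whole seed."""
    if len(data) < seed_size:
        return []
    top = pow(base, seed_size - 1, prime)   # weight of the byte that leaves the window
    h = 0
    for byte in data[:seed_size]:           # hash of the first seed, computed directly
        h = (h * base + byte) % prime
    hashes = [h]
    for i in range(seed_size, len(data)):
        # Drop the oldest byte, shift, and add the newest byte.
        h = ((h - data[i - seed_size] * top) * base + data[i]) % prime
        hashes.append(h)
    return hashes

def distinguishing_characteristics(data, seed_size, n):
    """Keep the n largest distinct seed hashes as the chunk's characteristics."""
    return sorted(set(seed_hashes(data, seed_size)), reverse=True)[:n]

chunk = b"the quick brown fox"
print(distinguishing_characteristics(chunk, seed_size=4, n=3))
```

Because each update touches only two bytes, hashing all seeds of a chunk takes time linear in the chunk size, and storing only `n` values per chunk keeps the index far smaller than the data it describes.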

56. A method of locating matching data in a repository to input data comprising:
- applying a hash-based function to determine, for each of a plurality of chunks of the input data, a set of representation values for each input chunk;
  
  selecting a subset of the set of representation values to determine a set of distinguishing characteristics for each input chunk;
  
  using the set of input distinguishing characteristics to locate a chunk of the repository data deemed likely to contain matching data;
  
  using the input representation values to identify matching data in the repository chunk.
- Dependent claims:
  - 57. The method of claim 56, comprising:
    - storing in an index distinguishing characteristics of the repository data determined by the same hash-based function.
  - 58. The method of claim 57, wherein the storing includes storing the index as a data structure which allows searching of the index in a time independent of the size of the repository.
  - 59. The method of claim 56, comprising:
    - storing in one memory device the input representation values for use in the identifying matching data step;
      
      storing in another memory device a set of repository representation values determined by the same hash-based function for use in the identifying matching data step.

60. A method of searching a repository of binary uninterpreted data for a location of common data to an input data comprising:
- analyzing segments of each of the repository and input data to determine a repository segment that is similar to an input segment, the analyzing step including searching an index of representation values of the repository data for matching representation values of the input in a time independent of a size of the repository and linear in a size of the input data; and
  
  analyzing the similar repository segment with respect to the input segment to determine their common data sections while utilizing at least some of the matching representation values for data alignment, in a time linear in a size of the input segment.

61. A method of indexing repository data comprising:
- generating distinguishing characteristics of input data;
  
  using the input data distinguishing characteristics for locating a similar data segment in the repository data;
  
  using the input data distinguishing characteristics for locating common data sections in the similar repository data segment;
  
  storing in the index at least some of the distinguishing characteristics of the input data; and
  
  storing at least some noncommon data sections of the input data in the repository data.

62. A method comprising:
- computing data characteristics for incoming data; and
  
  searching for elements of the incoming data characteristics within an index of repository data characteristics and declaring a similarity match between a portion of a repository and a portion of the new data if the matched characteristics pass a threshold.
- Dependent claims:
  - 63. The method of claim 62, wherein the index is stored in a memory faster than a memory storing the repository itself.
  - 64. The method of claim 63, wherein the computing involves only the faster memory.
  - 65. The method of claim 62, wherein one or more of computing time and memory are substantially independent of repository size.
  - 66. The method of claim 62, wherein the index characteristics are such that the number of matches indicates a degree of similarity.
  - 67. The method of claim 62, wherein the threshold is variable.
  - 68. The method of claim 62, wherein the threshold varies in response to a statistical analysis of prior results of the computing and/or searching steps.
  - 69. The method of claim 62, including a step of verifying the declared similarity.
  - 70. The method of claim 62, wherein the index includes a location within the repository of the similarity matched portion.
  - 71. The method of claim 62, including a step of acting upon a declared similarity by matching the similar data portions.
  - 72. The method of claim 62, including a step of data compression.
  - 73. The method of claim 62, including a step of updating the repository and the index.
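
Claims 62, 66 and 67 tie a similarity match to a threshold on the number of matched characteristics, with the count serving as a degree of similarity. The Python sketch below is one possible reading; the dictionary-based index, the function name `similarity_matches`, and the sample values are hypothetical.

```python
from collections import Counter

def similarity_matches(index, characteristics, threshold):
    """Count, per repository portion, how many incoming characteristics it
    shares, and keep only the portions whose count passes the threshold."""
    counts = Counter()
    for value in set(characteristics):      # ignore repeated incoming values
        counts.update(index.get(value, ()))
    return {portion: c for portion, c in counts.items() if c >= threshold}

# Assumed index layout: characteristic value -> repository portions containing it.
index = {1: ["a"], 2: ["a", "b"], 3: ["b"], 4: ["a"]}
incoming = [1, 2, 4, 9]

print(similarity_matches(index, incoming, threshold=2))   # prints {'a': 3}
```

Passing the threshold as a parameter lets a caller tighten or relax it, for example in response to earlier results as claim 68 suggests.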

74. A method for searching in repository data for parts that are sufficiently similar to an input data according to a similarity criterion, comprising:
- 1. processing the repository data by:
  
  a. dividing the repository data into parts called repository chunks;
  
  b. for each of the repository chunks, calculating one or a plurality of repository distinguishing characteristics (RDCs), each RDC belonging to an interval of integers called value range;
  
  c. creating pairs associating each RDC with a corresponding repository chunk; and
  
  d. maintaining an index storing the pairs;
  
  2. processing the input data by:
  
  a. dividing the input data into parts called input chunks; and
  
  b. performing for each of the input chunks:
  
  i. calculating one or a plurality of input distinguishing characteristics (IDCs);
  
  ii. searching for the IDCs in the pairs stored in the index; and
  
  iii. if a threshold j of the IDCs has been found in the pairs stored in the index, declaring a match between the input chunk and the corresponding repository chunk(s) that are associated with the IDCs in the pairs.
- Dependent claims:
  - 75. The method of claim 74, wherein the dividing of the repository (1a) is done into disjoint parts.
  - 76. The method according to claim 74, wherein the RDCs are unambiguously determinable, efficiently calculable and uniformly distributed over the value range.
  - 77. The method according to claim 76, wherein the RDCs are obtained by:
    - 1. partitioning the repository chunk into smaller parts called seeds;
      
      2. applying a hash function to each of the seeds, yielding one hash value for each of the seeds;
      
      3. selecting a subset of the hash values of the seeds;
      
      4. using the pairs in the index, identifying indices of seeds, called indices, corresponding to the hash in the subset;
      
      5. applying a relocation function to the indices to determine relocated indices; and
      
      6. defining the RDCs as the hash values of the seeds at the relocated indices.
  - 78. The method of claim 77, wherein the hash function is a modular function, modulo a prime number.
  - 79. The method of claim 77, wherein the subset of hash values includes values belonging to a set including a predetermined number of values that are largest, smallest or closest to a percentile among all the hash values.
  - 80. The method of claim 79, wherein the predetermined number of values is 8.
  - 81. The method of claim 79, wherein the percentile is a median value of the hash values.
  - 82. The method of claim 77, wherein the relocation function is adding a predetermined (positive or negative) constant to the indices.
  - 83. The method of claim 82, wherein the predetermined (positive or negative) constant is 1.
  - 84. The method of claim 74, wherein the index is maintained as a data structure selected from the group comprising hash tables, binary trees, B-trees and sorted lists.
  - 85. The method of claim 74, wherein the dividing of the input data (2a) is done into disjoint parts.
  - 86. The method of claim 74, wherein the dividing of the input data (2a) is done into overlapping parts.
  - 87. The method of claim 74, further comprising the step 2.b.iv.:
    - adding the input chunk to the repository by creating pairs associating the IDCs with the input chunk and inserting the pairs into the index.
  - 88. The method of claim 87, further comprising the step 2.b.v.:
    - if a match is declared in step 2.b.iii., updating the index by:
      
      A. removing the corresponding repository chunks from the reference; and
      
      B. removing the pairs associating the RDCs with the corresponding repository chunks from the index.
  - 89. The method of claim 74, wherein the method is used for lossless data reduction comprising:
    - processing the data according to claim 74 to find similar data in the repository;
      
      comparing the input chunk to the similar repository data, and identifying as common factors variable size ranges of data in the input chunk that match exactly ranges in the similar repository data; and
      
      saving the data such that:
      
      the common factors are saved only once, and saving a data directory that shows a plurality of positions within the data where the common factors belong; and
      
      saving in full the data ranges not included in the common factors.
  - 90. The method of claim 89, wherein the number of RDCs per chunk, and the chunk size, are chosen to create fewer entries in the index than there are entries in the directory, enabling the directory to describe the common factors in high resolution, while the index is kept small.
  - 91. The method of claim 89 that makes double use of the hash values, calculated for the data chunk, first for finding the similar data in the repository data, and second for identifying the common factors.
  - 92. The method of lossless data reduction of claim 89, wherein the method is used for space-saving data backup and data restore by saving data in a data repository, and restoring data from the data repository.
  - 93. The method of claim 92, where the lossless data reduction is done online, as received by the repository, and substantially in the order of receipt by the repository.
  - 94. The method of claim 92, where the lossless data reduction is done offline comprising:
    - saving the data in the data repository without processing it according to claim 89;
      
      marking or keeping a list of the not processed data;
      
      processing the data, according to claim 89, to achieve the lossless data reduction, according to:
      
      a. any predetermined schedule and order; and/or
      
      b. when a repository management system designates a time to process the data based on one or more of:
      
      i. how busy is the system;
      
      ii. how much the data to be processed is accessed;
      
      iii. how much space is predicted for the data by applying the lossless data reduction; and
      
      iv. the used capacity of the repository and the unused capacity of the repository.
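
The relocation step of claims 77 and 79 to 83 can be sketched in Python as follows: select the seeds with the largest hash values, shift their indices by a constant, and take the hash values at the shifted indices as the RDCs. The function name, the tie-breaking rule, and the decision to drop indices that fall outside the chunk are assumptions that the claims do not specify.

```python
def relocated_characteristics(hashes, count, shift=1):
    """hashes[i] is the hash value of the seed at index i of one chunk.
    Returns the hash values found at the relocated indices."""
    # Indices of the `count` largest hash values (ties broken by lower index).
    top = sorted(range(len(hashes)), key=lambda i: (-hashes[i], i))[:count]
    # Relocation function: add a constant, keep indices still inside the chunk.
    relocated = [i + shift for i in top if 0 <= i + shift < len(hashes)]
    return [hashes[i] for i in relocated]

seed_hash_values = [17, 90, 4, 55, 88, 23, 61, 9]
# Largest values sit at indices 1, 4 and 6; shifting by 1 selects indices 2, 5 and 7.
print(relocated_characteristics(seed_hash_values, count=3))   # prints [4, 23, 9]
```

One effect of relocation in this sketch is that the stored values are no longer the extreme hash values themselves, so they are spread more evenly over the value range, in line with the uniform distribution asked for in claim 76.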

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Bitner, Haim; Asher, Ron; Aronovich, Lior; Hirsch, Michael; Klein, Shmuel T.; Bachmat, Eitan

Granted Patent

US 7,523,098 B2
US Class Current

1/1
CPC Class Codes

G06F 11/1448   Management of the data invo...

G06F 11/1453   using de-duplication of the...

G06F 16/137   Hash-based content-based in...

G06F 16/1744   using compression, e.g. spa...

G06F 16/2255   Hash tables

G06F 16/2455   Query execution

G06F 2201/80   Database-specific techniques

G06F 2201/805   Real-time

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99953   Recoverability
