Systems and methods for efficient data searching, storage and reduction

US 8,275,755 B2
Filed: 03/19/2009
Issued: 09/25/2012
Est. Priority Date: 09/15/2004
Status: Expired due to Fees

Abstract

Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.
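
The search described in the abstract can be illustrated with a minimal Python sketch. It assumes the distinguishing characteristics are already available as integers and that the index is an in-memory hash table (a `dict`); the names `build_index` and `find_similar`, and the rule that a later chunk overwrites an earlier one on a colliding characteristic, are hypothetical choices made for illustration and are not taken from the patent. Each lookup takes expected constant time, so the cost per input chunk depends on the number of characteristics looked up and not on the size of the repository.

```python
from collections import Counter


def build_index(repository_dcs):
    """Map each distinguishing characteristic to the position of its chunk.

    repository_dcs: dict of {chunk position: iterable of integer characteristics}.
    Assumption: on a collision the later chunk simply overwrites the earlier one.
    """
    index = {}
    for position, dcs in repository_dcs.items():
        for dc in dcs:
            index[dc] = position
    return index


def find_similar(input_dcs, index, threshold):
    """Return positions of repository chunks sharing >= threshold characteristics."""
    votes = Counter()
    for dc in input_dcs:
        position = index.get(dc)  # expected O(1), independent of repository size
        if position is not None:
            votes[position] += 1
    # most_common() orders candidates from most to fewest matches
    return [pos for pos, count in votes.most_common() if count >= threshold]


if __name__ == "__main__":
    index = build_index({0: [11, 22, 33], 4096: [44, 55, 66]})
    print(find_similar([22, 33, 99, 44], index, threshold=2))
```

In this sketch only the small set of characteristics per chunk is held in the index, which is how the space can stay a small fraction of the repository size.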

19 Claims

1. A method enabling lossless data reduction comprising:
- partitioning version data into:
  
  a) data corresponding to data already stored in a repository; and
  
  b) data not already stored in the repository;
  
  wherein the data already stored in the repository comprise a plurality of repository chunks, wherein the version data comprise a plurality of version chunks, the method further comprising:
  
  storing in an index a plurality of n repository distinguishing characteristics (RDCs) and a position in the repository of each of the plurality of repository chunks, where n is smaller than size m of the repository chunk, where m is a value representative of a number of bytes of the repository chunk; and
  
  for each version chunk:
  
  determining a plurality of k input distinguishing characteristics (IDCs) of the version chunk, where k is greater than or equal to n;
  
  determining whether a similar repository chunk exists based on a plurality of matching distinguishing characteristics in the version chunk and similar repository chunk, wherein the similarity determination includes searching for each of the k distinguishing characteristics of the version chunk in the index until at most n matches are found;
  
  determining that one or more similar repository chunks exist where the number of matches satisfies a threshold;
  
  determining differences between the version chunk and similar repository chunk by comparing full data of the respective chunks; and
  
  storing the differences in the repository.

2. The method of claim 1, wherein the determining differences step includes use of a method selected from the group consisting of binary difference and byte-wise factoring.
3. The method of claim 1, wherein the index includes a location in the repository of the distinguishing characteristics of the repository chunk.
4. The method of claim 1, wherein the determining similar repository chunk step includes searching until at least two matching distinguishing characteristics are found for the version chunk and one or more repository chunks.
5. The method of claim 1, wherein the distinguishing characteristics are determined by a hash function.
6. The method of claim 5, wherein the distinguishing characteristics are determined by a rolling hash function.
7. The method of claim 6, wherein the distinguishing characteristics are determined by a modular hash function.
8. The method of claim 1, wherein the index is stored as a binary tree, a B tree, a sorted list, or a hash table.
9. The method of claim 8, wherein the index is stored as a hash table.
10. The method of claim 1, wherein pointers are provided to data of the version chunk already stored in the repository.
11. The method of claim 1, wherein the repository and version chunks each comprise a plurality of seeds, each seed being a consecutive sequence of base elements and having the same seed size s, and wherein the distinguishing characteristics are hash values of a selected subset of the seeds of the respective chunk.
12. The method of claim 11, wherein the seeds comprise overlapping seeds.
13. The method of claim 1, wherein the method is used for data factoring.
14. The method of claim 1, wherein the method is used for data backup.
15. The method of claim 1, wherein the method is used for data backup with a data repository of a size for storing up to one or more petabytes of data.
16. The method of claim 1, wherein the determining similar repository chunk step is conducted in a time independent of a size of the repository and linear in a size of the version data.
17. The method of claim 1, wherein a ratio of space needed to store the repository chunk to space needed to store the distinguishing characteristics of the repository chunk is up to 250,000:1.
18. The method of claim 1, wherein the method includes modifying the index to include a selected n of the k distinguishing characteristics of the version chunk.
19. The method of claim 18, wherein the method includes modifying the repository to include the differences.
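
Claims 5 to 7, 11 and 12 describe distinguishing characteristics as hash values of a selected subset of overlapping, fixed-size seeds, computed with a rolling, modular hash. The following Python sketch shows one way this could look. The polynomial hash, the `base` and `modulus` values, the function names, and the rule of selecting the largest hash values are all assumptions made for illustration; the claims only require "a selected subset" and do not prescribe these choices.

```python
import heapq


def rolling_seed_hashes(chunk, seed_size, base=257, modulus=(1 << 61) - 1):
    """Yield (offset, hash) for every overlapping seed of length seed_size.

    Uses a polynomial hash modulo `modulus`; each step removes the oldest
    byte and appends the newest, so the whole pass is linear in len(chunk).
    `base` and `modulus` are illustrative values, not taken from the patent.
    """
    if seed_size <= 0 or len(chunk) < seed_size:
        return
    high = pow(base, seed_size - 1, modulus)  # weight of the oldest byte
    h = 0
    for byte in chunk[:seed_size]:
        h = (h * base + byte) % modulus
    yield 0, h
    for i in range(seed_size, len(chunk)):
        h = (h - chunk[i - seed_size] * high) % modulus  # drop oldest byte
        h = (h * base + chunk[i]) % modulus              # append newest byte
        yield i - seed_size + 1, h


def distinguishing_characteristics(chunk, seed_size, count):
    """Pick `count` seed hashes to represent the chunk.

    Assumption: the subset is the `count` largest distinct hash values.
    """
    hashes = {h for _, h in rolling_seed_hashes(chunk, seed_size)}
    return heapq.nlargest(count, hashes)


if __name__ == "__main__":
    data = b"the quick brown fox jumps over the lazy dog"
    print(distinguishing_characteristics(data, seed_size=8, count=4))
```

Because each seed hash depends only on the seed's bytes and not on its offset, the same seed yields the same value wherever it occurs in a chunk, which is what lets characteristics of a version chunk match those stored for a repository chunk.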

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Hirsch, Michael; Bitner, Haim; Aronovich, Lior; Asher, Ron; Bachmat, Eitan; Klein, Shmuel T.
Primary Examiner(s): Lewis, Alicia
Application Number: US12/407,765
Publication Number: US 20090234821A1
Time in Patent Office: 1,286 Days
Field of Search: 707/999.003, 707/999.006, 707/695, 707/696, 707/705, 707/609, 707/758, 707/769
US Class Current: 707/687
CPC Class Codes

G06F 11/1448   Management of the data invo...

G06F 11/1453   using de-duplication of the...

G06F 16/137   Hash-based content-based in...

G06F 16/1744   using compression, e.g. spa...

G06F 16/2255   Hash tables

G06F 16/2455   Query execution

G06F 2201/80   Database-specific techniques

G06F 2201/805   Real-time

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99953   Recoverability
