Methods and apparatus for content-aware data de-duplication

US 7,925,683 B2
Filed: 12/18/2009
Issued: 04/12/2011
Est. Priority Date: 12/18/2008
Status: Active Grant



Abstract

The systems and methods partition digital data units in a content aware fashion without relying on any ancestry information, which enables one to find duplicate chunks in unrelated units of digital data even across millions of documents spread across thousands of computer systems.
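Editorial illustration, not part of the patent text: a minimal Python sketch of why content-aware partitioning can find duplicate chunks in unrelated files. It assumes a hypothetical toy container format (one per-instance header line, then sections joined by a `SEPARATOR`); the format, the function names, and the choice of SHA-256 are assumptions made for the example, not details taken from the patent.

```python
import hashlib

# Hypothetical section delimiter of a toy container format (an assumption).
SEPARATOR = b"\n--\n"


def partition(data: bytes) -> tuple[list[bytes], bytes]:
    """Split a toy container into content blocks plus its per-instance header."""
    header, _, body = data.partition(b"\n")
    blocks = body.split(SEPARATOR) if body else []
    return blocks, header


def block_id(block: bytes) -> str:
    """Identify a block by a hash of its content only."""
    return hashlib.sha256(block).hexdigest()


def shared_blocks(a: bytes, b: bytes) -> set[str]:
    """Return identifiers of blocks that occur in both inputs."""
    ids_a = {block_id(x) for x in partition(a)[0]}
    ids_b = {block_id(x) for x in partition(b)[0]}
    return ids_a & ids_b


# Two unrelated files that embed the same object at different offsets
# and carry different instance-dependent headers.
report = b"created=2009-01-05\nintro text" + SEPARATOR + b"<logo bytes>"
slides = b"created=2010-07-19\n<logo bytes>" + SEPARATOR + b"closing remarks"
common = shared_blocks(report, slides)
```

Because the split follows the format's own section boundaries rather than fixed offsets, the embedded object hashes to the same identifier in both files even though neither file is derived from the other.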


20 Claims

1. A method comprising:
- partitioning digital data into a plurality of blocks, including a first block, and additional data, wherein the additional data includes at least one of position-dependent data, instance-dependent data, format-specific headers or footers, and format-specific transformations, and wherein a combination of the plurality of blocks and the additional data together represents all of the digital data;
  
  generating a file identifier based at least in part on the digital data;
  
  associating the file identifier with the digital data;
  
  generating a block identifier based at least in part on the first block;
  
  associating the block identifier with the first block;
  
  determining if the first block has already been stored;
  
  storing the first block if the first block has not already been stored;
  
  determining if a block map associated with the file identifier has already been stored, wherein the block map includes block identifiers associated respectively with each block of which the digital data is comprised, and if the block map associated with the file identifier has not already been stored;
  
  creating the block map, storing the block map, associating the additional data with the block map, and associating the file identifier with the block map.
- - 2. The method of claim 1, wherein the digital data is at least one of a digital file, block based storage, binary large object (BLOB), and a data stream.
  - 3. The method of claim 1, wherein generating the file identifier is based on determining a checksum or hash of the digital data.
  - 4. The method of claim 1, wherein generating the block identifier is based on determining a checksum or hash of the first block.
  - 5. The method of claim 1, further comprising:
    - maintaining a reference count of references to the first block that are made by the block maps in a catalog;
      
      in response to a request to delete a stored block map, updating the reference count if the stored block map contained a block identifier associated with the first block; and
      
      deleting the first block if the updated reference count indicates that no other block maps contain block identifiers associated with the first block.
  - 6. The method of claim 1, further comprising:
    - maintaining a reference count of references that are made to the block map from other objects in a catalog;
      
      in response to a request to delete a file, the file having an associated file identifier, updating the reference count if the file identifier was associated with the stored block map; and
      
      deleting the stored block map if the updated reference count indicates that no other file identifiers are associated with the stored block map.
  - 7. The method of claim 1, further comprising storing with the block map additional information about where in the original file the additional data was located.
  - 8. The method of claim 1, further comprising storing with the block map additional information indicative of the format-specific transformations.
  - 9. The method of claim 1, wherein the additional data includes at least two of position-dependent data, instance-dependent data, format-specific headers or footers, and format-specific transformations.
  - 10. The method of claim 1, where the partitioning, associating, and determining steps are performed on a set of digital files, and wherein the result is a set of de-duplicated digital files.
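Editorial illustration, not part of the patent text: the storing steps recited in claim 1, with the hash-based identifiers of claims 3 and 4, can be sketched in Python as follows. The class and method names (`DedupStore`, `BlockMap`, `put`) are hypothetical, SHA-256 is an assumed choice of hash, and the partitioning is supplied by the caller rather than implemented here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BlockMap:
    block_ids: list[str]      # one identifier per block of the digital data
    additional_data: bytes    # e.g. headers or instance-dependent data


@dataclass
class DedupStore:
    blocks: dict[str, bytes] = field(default_factory=dict)
    block_maps: dict[str, BlockMap] = field(default_factory=dict)

    def put(self, digital_data: bytes, blocks: list[bytes],
            additional_data: bytes) -> str:
        """Store already-partitioned data and return its file identifier."""
        # File identifier based on the digital data (hash, as in claim 3).
        file_id = hashlib.sha256(digital_data).hexdigest()
        block_ids = []
        for block in blocks:
            # Block identifier based on the block (hash, as in claim 4).
            bid = hashlib.sha256(block).hexdigest()
            block_ids.append(bid)
            # Store the block only if it has not already been stored.
            if bid not in self.blocks:
                self.blocks[bid] = block
        # Create and store the block map only if none exists for this file.
        if file_id not in self.block_maps:
            self.block_maps[file_id] = BlockMap(block_ids, additional_data)
        return file_id


store = DedupStore()
first = store.put(b"HDR1alphabeta", [b"alpha", b"beta"], b"HDR1")
second = store.put(b"HDR2betagamma", [b"beta", b"gamma"], b"HDR2")
```

In this usage the block `b"beta"` appears in both inputs but is stored once, so the store holds three blocks and two block maps. A real system would also need persistent storage and a policy for hash collisions, which the sketch leaves out.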

11. A system comprising:
- a memory capable of storing data; and
  
  a processor configured for;
  
  partitioning digital data into a plurality of blocks, including a first block, and additional data, wherein the additional data includes at least one of position-dependent data, instance-dependent data, format-specific headers or footers, and format-specific transformations, and wherein a combination of the plurality of blocks and the additional data together represents all of the digital data;
  
  generating a file identifier based at least in part on the digital data;
  
  associating the file identifier with the digital data;
  
  generating a block identifier based at least in part on the first block;
  
  associating a block identifier with the first block;
  
  determining if the first block has already been stored;
  
  storing the first block in the memory if the first block has not already been stored;
  
  determining if a block map associated with the file identifier has already been stored, wherein the block map includes block identifiers associated respectively with each block of which the digital data is comprised, and if the block map associated with the file identifier has not already been stored;
  
  creating the block map, storing the block map in the memory, associating the additional data with the block map, and associating the file identifier with the block map.
- - 12. The system of claim 11, wherein the digital data is at least one of a digital file, block based storage, binary large object, and a data stream.
  - 13. The system of claim 11, wherein generating the file identifier is based on determining a checksum or hash of the digital data.
  - 14. The system of claim 11, wherein generating the block identifier is based on determining a checksum or hash of the first block.
  - 15. The system of claim 11, further comprising:
    - maintaining a reference count of references to the first block that are made by the block maps in a catalog;
      
      in response to a request to delete a stored block map, updating the reference count if the stored block map contained a block identifier associated with the first block; and
      
      deleting the first block if the updated reference count indicates that no other block maps contain block identifiers associated with the first block.
  - 16. The system of claim 11, further comprising:
    - maintaining a reference count of references that are made to the block map from other objects in a catalog;
      
      in response to a request to delete a file, the file having an associated file identifier, updating the reference count if the file identifier was associated with the stored block map; and
      
      deleting the stored block map if the updated reference count indicates that no other file identifiers are associated with the stored block map.
  - 17. The system of claim 11, further comprising storing with the block map additional information about where in the original file the additional data was located.
  - 18. The system of claim 11, further comprising storing with the block map additional information indicative of the format-specific transformations.
  - 19. The system of claim 11, wherein the additional data includes at least two of position-dependent data, instance-dependent data, format-specific headers or footers, and format-specific transformations.
  - 20. The system of claim 11, where the partitioning, associating, and determining steps are performed on a set of digital files, and wherein the result is a set of de-duplicated digital files.
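Editorial illustration, not part of the patent text: the reference-count bookkeeping described in claims 5 and 6 (and their system counterparts, claims 15 and 16) might look like the following Python sketch. The `Catalog` class, its method names, and the use of file names as keys are assumptions made for the example; identifiers are passed in as plain strings instead of being computed.

```python
from collections import Counter


class Catalog:
    """Toy catalog tracking references to blocks and to block maps."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}          # block id -> content
        self.block_maps: dict[str, list[str]] = {}  # file id -> block ids
        self.files: dict[str, str] = {}             # file name -> file id
        self.block_refs: Counter = Counter()        # references from block maps
        self.map_refs: Counter = Counter()          # references from files

    def add_file(self, name: str, file_id: str,
                 blocks: list[tuple[str, bytes]]) -> None:
        if file_id not in self.block_maps:
            self.block_maps[file_id] = [bid for bid, _ in blocks]
            for bid, content in blocks:
                self.blocks.setdefault(bid, content)
                self.block_refs[bid] += 1
        self.files[name] = file_id
        self.map_refs[file_id] += 1

    def delete_file(self, name: str) -> None:
        file_id = self.files.pop(name)
        self.map_refs[file_id] -= 1
        # Delete the block map once no other file refers to it.
        if self.map_refs[file_id] == 0:
            del self.map_refs[file_id]
            self.delete_block_map(file_id)

    def delete_block_map(self, file_id: str) -> None:
        for bid in self.block_maps.pop(file_id):
            self.block_refs[bid] -= 1
            # Delete the block once no other block map refers to it.
            if self.block_refs[bid] == 0:
                del self.block_refs[bid]
                del self.blocks[bid]


catalog = Catalog()
catalog.add_file("a.doc", "f1", [("b1", b"x"), ("b2", b"y")])
catalog.add_file("copy.doc", "f1", [("b1", b"x"), ("b2", b"y")])
catalog.add_file("b.doc", "f2", [("b2", b"y"), ("b3", b"z")])
catalog.delete_file("a.doc")     # block map f1 is still used by copy.doc
catalog.delete_file("copy.doc")  # f1 is removed, and with it block b1
```

After both deletions, block `b1` is gone because only the deleted block map referred to it, while `b2` survives because the block map of `b.doc` still references it.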


Current Assignee: Malikie Innovations Limited (Key Patent Innovations Limited)
Original Assignee: Copiun Incorporated (Blackberry Limited)
Inventors: Jain, Sanjay; Chaudhry, Puneesh
Primary Examiner(s): Coby, Frantz
Application Number: US12/642,023
Publication Number: US 20100161608A1
Time in Patent Office: 480 Days
Field of Search: 707/967-974, 707/609, 707/694, 707/737, 707/821-828, 715/513, 711/170-173, 710/1, 375/240.17
US Class Current: 711/162
CPC Class Codes:
G06F 16/1748   De-duplication implemented ...
G06F 3/0608   Saving storage space on sto...
G06F 3/0641   De-duplication techniques
G06F 3/067   Distributed or networked st...
Y10S 707/968   Partitioning
Y10S 707/969   Horizontal partitioning
Y10S 707/97   Vertical partitioning
Y10S 707/971   Federated
Y10S 707/972   Partitioning
Y10S 707/973   Horizontal partitioning
Y10S 707/974   Vertical partitioning
