ELIMINATION OF DUPLICATE OBJECTS IN STORAGE CLUSTERS

US 20130339314A1
Filed: 06/13/2012
Published: 12/19/2013
Est. Priority Date: 06/13/2012
Status: Active Grant

First Claim

Patent Images

1. A method of detecting duplicates of a first digital object within a storage cluster, said method comprising:

receiving, within said storage cluster, a hash value of said first digital object and a unique identifier associated with said first digital object, said hash value falling within an address space made up of a plurality of non-overlapping ranges;

identifying a portion of said hash value whose value represents a first range of addresses within said address space;

accessing a page mapping table using said value indicated by said portion of said hash value to identify a computer node of said storage cluster, said page mapping table mapping each range of addresses of said address space to one of said computer nodes;

accessing a hash table of said computer node to determine whether said hash value and said unique identifier are represented in said hash table;

accessing a second digital object within said storage cluster when it is determined that said hash value is represented in an entry in said hash table but said unique identifier is not; and

comparing a hash value stored in association with said second digital object with said received hash value and determining that said second digital object is a duplicate of said first digital object when said hash values match.

View all claims

4 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Digital objects within a fixed-content storage cluster use a page mapping table and a hash-to-UID table to store a representation of each object. For each object stored within the cluster, a record in the hash-to-UID table stores the object'"'"'s hash value and its unique identifier (or portions thereof). To detect a duplicate of an object, a portion of its hash value is used as a key into the page mapping table. The page mapping table indicates a node holding a hash-to-UID table indicating currently stored objects in a particular page range. Finding the same hash value but with a different unique identifier in the table indicates that a duplicate of an object exists. Portions of the hash value and unique identifier may be used in the hash-to-UID table. Unneeded duplicate objects are deleted by copying their metadata to a manifest and then redirecting their unique identifiers to point at the manifest.

Citations

16 Claims

1. A method of detecting duplicates of a first digital object within a storage cluster, said method comprising:
- receiving, within said storage cluster, a hash value of said first digital object and a unique identifier associated with said first digital object, said hash value falling within an address space made up of a plurality of non-overlapping ranges;
  
  identifying a portion of said hash value whose value represents a first range of addresses within said address space;
  
  accessing a page mapping table using said value indicated by said portion of said hash value to identify a computer node of said storage cluster, said page mapping table mapping each range of addresses of said address space to one of said computer nodes;
  
  accessing a hash table of said computer node to determine whether said hash value and said unique identifier are represented in said hash table;
  
  accessing a second digital object within said storage cluster when it is determined that said hash value is represented in an entry in said hash table but said unique identifier is not; and
  
  comparing a hash value stored in association with said second digital object with said received hash value and determining that said second digital object is a duplicate of said first digital object when said hash values match.
- View Dependent Claims (2, 3, 4, 5, 6)
- - 2. A method as recited in claim 1 further comprising:
    - identifying a second unique identifier of said entry; and
      
      accessing said second digital object using said second unique identifier.
  - 3. A method as recited in claim 1 further comprising:
    - accessing said hash table using a first part of said hash value to determine whether said hash value is represented in said entry.
  - 4. A method as recited in claim 1 further comprising:
    - determining that said hash value is represented by multiple entries in said hash table; and
      
      for each of said multiple entries, accessing a digital object within said storage cluster having a unique identifier corresponding to each of said entries.
  - 5. A method as recited in claim 1 further comprising:
    - identifying a second unique identifier of said entry; and
      
      accessing said page mapping table using a portion of said second unique identifier to identify a computer node of said storage cluster having an index indicating the location of said second digital object within said storage cluster.
  - 6. A method as recited in claim 1 further comprising:
    - identifying a second unique identifier of said entry; and
      
      broadcasting a message to all computer nodes of said storage cluster to find said second digital object identified by said second unique identifier.

7. A method of deleting a duplicate of a first digital object within a storage cluster, said method comprising:
- receiving a first unique identifier that identifies the location of a first digital object within said storage cluster;
  
  receiving a second unique identifier that identifies the location of a second digital object within said storage cluster, wherein said digital objects being duplicates;
  
  storing metadata associated with said second digital object in association with metadata associated with said first digital object in a metadata storage location;
  
  creating a reference associated with said metadata storage location that identifies said location of said first digital object;
  
  deleting said second digital object from said storage cluster; and
  
  redirecting said second unique identifier such that said second unique identifier now identifies said metadata storage location, whereby said second unique identifier identifies said first digital object via said metadata storage location and said reference.
- View Dependent Claims (8, 9)
- - 8. A method as recited in claim 7 wherein said deleting is performed by overwriting said second digital object, reclaiming disk space occupied by said second digital object, or redirecting said second unique identifier.
  - 9. A method as recited in claim 7 further comprising:
    - accessing a page mapping table using a value indicated by said portion of a hash value of said second unique identifier to identify a computer node of said storage cluster;
      
      accessing a hash table of said computer node to find an entry corresponding to said hash value and said second unique identifier; and
      
      deleting said entry, whereby said hash table no longer includes a representation that said second digital object is present within said storage cluster.

10. A method of storing a digital object in a storage cluster, said method comprising:
- receiving said digital object at a first computer node within said storage cluster having a plurality of computer nodes;
  
  calculating a hash value for said digital object that falls within an address space and storing said digital object on a disk of said first computer node, said address space made up of a plurality of non-overlapping ranges;
  
  generating a unique identifier for said digital object;
  
  identifying a portion of said hash value whose value represents a first range of addresses within said address space;
  
  accessing a page mapping table using said value indicated by said portion of said hash value to identify a second computer node, said page mapping table mapping each range of addresses of said address space to one of said computer nodes;
  
  sending said hash value and said unique identifier to said second computer node; and
  
  storing at least a first part of said hash value and at least a first part of said unique identifier in a table on said second computer node to indicate that said digital object is present within said storage cluster.
- View Dependent Claims (11, 12, 13, 14, 15, 16)
- - 11. A method as recited in claim 10 wherein for every digital object stored within said cluster having a hash value falling within said first range of addresses, said table stores a record indicating that said every digital object is stored within said cluster.
  - 12. A method as recited in claim 10 wherein said storage cluster is a write-once read-many (WORM) fixed-content stored cluster.
  - 13. A method as recited in claim 10 wherein said table maps said first part of said hash value to a plurality of unique identifiers.
  - 14. A method as recited in claim 10 wherein said page mapping table and said table are located in RAM.
  - 15. A method as recited in claim 10 wherein said unique identifier is generated using a hash function or by a random number generator.
  - 16. A method as recited in claim 10 wherein said digital object has associated metadata indicating that said digital object is likely unique and that searching for duplicates of said digital object is unnecessary.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
DataCore Software Corporation (Flagship Ventures 2004 Fund LLC)
Original Assignee
Caringo Incorporated
Inventors
CARPENTIER, Paul R. M., TURPIN, Russell

Granted Patent

US 8,762,353 B2
Time in Patent Office

Days
Field of Search
US Class Current

707/692
CPC Class Codes

G06F 3/0608   Saving storage space on sto...

G06F 3/0641   De-duplication techniques

G06F 3/067   Distributed or networked st...

ELIMINATION OF DUPLICATE OBJECTS IN STORAGE CLUSTERS

First Claim

4 Assignments

0 Petitions

Accused Products

Abstract

Citations

16 Claims

Specification

Solutions

Use Cases

Quick Links

ELIMINATION OF DUPLICATE OBJECTS IN STORAGE CLUSTERS

First Claim

4 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

16 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links