ELIMINATION OF DUPLICATE OBJECTS IN STORAGE CLUSTERS
First Claim
1. A method of detecting duplicates of a first digital object within a storage cluster, said method comprising:
- receiving, within said storage cluster, a hash value of said first digital object and a unique identifier associated with said first digital object, said hash value falling within an address space made up of a plurality of non-overlapping ranges;
identifying a portion of said hash value whose value represents a first range of addresses within said address space;
accessing a page mapping table using said value indicated by said portion of said hash value to identify a computer node of said storage cluster, said page mapping table mapping each range of addresses of said address space to one of said computer nodes;
accessing a hash table of said computer node to determine whether said hash value and said unique identifier are represented in said hash table;
accessing a second digital object within said storage cluster when it is determined that said hash value is represented in an entry in said hash table but said unique identifier is not; and
comparing a hash value stored in association with said second digital object with said received hash value and determining that said second digital object is a duplicate of said first digital object when said hash values match.
4 Assignments
0 Petitions
Accused Products
Abstract
Digital objects within a fixed-content storage cluster use a page mapping table and a hash-to-UID table to store a representation of each object. For each object stored within the cluster, a record in the hash-to-UID table stores the object'"'"'s hash value and its unique identifier (or portions thereof). To detect a duplicate of an object, a portion of its hash value is used as a key into the page mapping table. The page mapping table indicates a node holding a hash-to-UID table indicating currently stored objects in a particular page range. Finding the same hash value but with a different unique identifier in the table indicates that a duplicate of an object exists. Portions of the hash value and unique identifier may be used in the hash-to-UID table. Unneeded duplicate objects are deleted by copying their metadata to a manifest and then redirecting their unique identifiers to point at the manifest.
-
Citations
16 Claims
-
1. A method of detecting duplicates of a first digital object within a storage cluster, said method comprising:
-
receiving, within said storage cluster, a hash value of said first digital object and a unique identifier associated with said first digital object, said hash value falling within an address space made up of a plurality of non-overlapping ranges; identifying a portion of said hash value whose value represents a first range of addresses within said address space; accessing a page mapping table using said value indicated by said portion of said hash value to identify a computer node of said storage cluster, said page mapping table mapping each range of addresses of said address space to one of said computer nodes; accessing a hash table of said computer node to determine whether said hash value and said unique identifier are represented in said hash table; accessing a second digital object within said storage cluster when it is determined that said hash value is represented in an entry in said hash table but said unique identifier is not; and comparing a hash value stored in association with said second digital object with said received hash value and determining that said second digital object is a duplicate of said first digital object when said hash values match. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method of deleting a duplicate of a first digital object within a storage cluster, said method comprising:
-
receiving a first unique identifier that identifies the location of a first digital object within said storage cluster; receiving a second unique identifier that identifies the location of a second digital object within said storage cluster, wherein said digital objects being duplicates; storing metadata associated with said second digital object in association with metadata associated with said first digital object in a metadata storage location; creating a reference associated with said metadata storage location that identifies said location of said first digital object; deleting said second digital object from said storage cluster; and redirecting said second unique identifier such that said second unique identifier now identifies said metadata storage location, whereby said second unique identifier identifies said first digital object via said metadata storage location and said reference. - View Dependent Claims (8, 9)
-
-
10. A method of storing a digital object in a storage cluster, said method comprising:
-
receiving said digital object at a first computer node within said storage cluster having a plurality of computer nodes; calculating a hash value for said digital object that falls within an address space and storing said digital object on a disk of said first computer node, said address space made up of a plurality of non-overlapping ranges; generating a unique identifier for said digital object; identifying a portion of said hash value whose value represents a first range of addresses within said address space; accessing a page mapping table using said value indicated by said portion of said hash value to identify a second computer node, said page mapping table mapping each range of addresses of said address space to one of said computer nodes; sending said hash value and said unique identifier to said second computer node; and storing at least a first part of said hash value and at least a first part of said unique identifier in a table on said second computer node to indicate that said digital object is present within said storage cluster. - View Dependent Claims (11, 12, 13, 14, 15, 16)
-
Specification