Fast and low-RAM-footprint indexing for data deduplication
First Claim
Patent Images
1. In a computing environment, method performed at least in part on at least one processor, comprising:
- maintaining a session cache at a primary storage device for a deduplication session, the session cache including a bundle of entries comprising hash values computed from chunks of deduplicated file data, and for each hash value, associated metadata corresponding to a storage location of the chunk from which that hash value was computed;
writing the bundle of entries from the session cache in the primary storage device into a hash index maintained at a secondary storage device at the end of the deduplication session for subsequent use in locating existing chunks; and
updating a compact index table in the primary storage device as the bundle of entries is written from the session cache to the hash index, the compact index table including compact signatures representative of the hash values.
2 Assignments
0 Petitions
Accused Products
Abstract
The subject disclosure is directed towards a data deduplication technology in which a hash index service'"'"'s index maintains a hash index in a secondary storage device such as a hard drive, along with a compact index table and look-ahead cache in RAM that operate to reduce the I/O to access the secondary storage device during deduplication operations. Also described is a session cache for maintaining data during a deduplication session, and encoding of a read-only compact index table for efficiency.
-
Citations
20 Claims
-
1. In a computing environment, method performed at least in part on at least one processor, comprising:
-
maintaining a session cache at a primary storage device for a deduplication session, the session cache including a bundle of entries comprising hash values computed from chunks of deduplicated file data, and for each hash value, associated metadata corresponding to a storage location of the chunk from which that hash value was computed; writing the bundle of entries from the session cache in the primary storage device into a hash index maintained at a secondary storage device at the end of the deduplication session for subsequent use in locating existing chunks; and updating a compact index table in the primary storage device as the bundle of entries is written from the session cache to the hash index, the compact index table including compact signatures representative of the hash values. - View Dependent Claims (2, 3)
-
-
4. In a computing environment, a system comprising:
-
a hash index maintained in a secondary storage device, in which entries of the hash index comprise hash values of deduplicated file chunks and associated chunk metadata with each hash value, the hash index updated by appending bundles of new entries from a session cache to the hash index; a compact index table in a primary storage device that includes compact signatures representative of the hash values, and for each compact signature, a pointer to a corresponding hash value, metadata entry in the hash index; and a hash index service configured to access the compact index table to look for up to one or more entries of the compact index table with corresponding compact signatures corresponding to a requested hash value to lookup, and if one or more matching compact signatures are matched, to follow each matching signature'"'"'s pointer to look for an entry in the hash index that possibly contains requested hash value. - View Dependent Claims (5, 6, 7, 8, 9, 10, 11)
-
-
12. One or more computer storage devices having computer-executable instructions, which in response to execution by a computer, cause the computer to perform steps comprising:
-
(a) maintaining a hash index in a secondary storage device, in which entries of the hash index comprise hash values and associated metadata with each hash value; (b) maintaining a look-ahead cache in a primary storage device that includes hash values and metadata entries cached from the index table, (c) maintaining a compact index table in a primary storage device that includes compact signatures representative of the hash values, and for each compact signature, a pointer to a corresponding hash value, metadata entry in the hash index; (d) receiving a requested hash value in a request to return the metadata associated with the requested hash value or a not-found response; (e) accessing the look-ahead cache to lookup the hash value, and if found in the look-ahead cache, including the metadata in the result and advancing to step (h); (f) accessing the compact index table to lookup a compact signature corresponding to the requested hash value, and to include the not-found response and advance to step (h) if none of the compact signatures corresponding to the hash value and entries are found in the compact index table, (g) following the pointer to look for an entry in the hash index that matches the requested hash value, and including the metadata associated with the hash value in the result if a match is found in the hash index, or including the not-found response if no match is found in the hash index; and (h) returning the result in response to the request. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
-
Specification