Using index partitioning and reconciliation for data deduplication
First Claim
1. An apparatus, comprising:
- a processor; and
a memory containing logic operative on the processor to cause the processor to perform a process comprising;
loading a subspace index comprising less than all index entries of a signature index service from a secondary media into a primary memory, wherein the loaded subspace index corresponds to a set of subspaces individually containing a set of signatures each corresponding to a data chunk stored in a data store, two of the subspaces having at least one signature representative of one subspace generally matching at least one signature representative of the other subspace; and
reconciling the two subspaces associated with the loaded subspace index to remove at least one duplicate chunk by;
using a resemblance metric to compare a signature of a data chunk in one of the two subspaces to multiple signatures of data chunks in the other of the two subspaces; and
in response to determining that the signature of the data chunk in one of the two subspaces matches another signature in the other of the two subspaces, marking a data chunk corresponding to the signature or the another signature for deletion from a corresponding data store.
0 Assignments
0 Petitions
Accused Products
Abstract
The subject disclosure is directed towards a data deduplication technology in which a hash index service'"'"'s index is partitioned into subspace indexes, with less than the entire hash index service'"'"'s index cached to save memory. The subspace index is accessed to determine whether a data chunk already exists or needs to be indexed and stored. The index may be divided into subspaces based on criteria associated with the data to index, such as file type, data type, time of last usage, and so on. Also described is subspace reconciliation, in which duplicate entries in subspaces are detected so as to remove entries and chunks from the deduplication system. Subspace reconciliation may be performed at off-peak time, when more system resources are available, and may be interrupted if resources are needed. Subspaces to reconcile may be based on similarity, including via similarity of signatures that each compactly represents the subspace'"'"'s hashes.
128 Citations
20 Claims
-
1. An apparatus, comprising:
-
a processor; and a memory containing logic operative on the processor to cause the processor to perform a process comprising; loading a subspace index comprising less than all index entries of a signature index service from a secondary media into a primary memory, wherein the loaded subspace index corresponds to a set of subspaces individually containing a set of signatures each corresponding to a data chunk stored in a data store, two of the subspaces having at least one signature representative of one subspace generally matching at least one signature representative of the other subspace; and reconciling the two subspaces associated with the loaded subspace index to remove at least one duplicate chunk by; using a resemblance metric to compare a signature of a data chunk in one of the two subspaces to multiple signatures of data chunks in the other of the two subspaces; and in response to determining that the signature of the data chunk in one of the two subspaces matches another signature in the other of the two subspaces, marking a data chunk corresponding to the signature or the another signature for deletion from a corresponding data store. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A method performed by a computing device in a network storage system, the method comprising:
-
selecting first and second subspaces of a global index for reconciliation, the first and second subspaces individually containing multiple (i) signatures of data chunks each stored in a data store in the network storage system and (ii) references to the data chunks stored in the corresponding data stores, wherein selecting the first and second subspaces including; using a resemblance metric to determine a subspace similarity between the first and second subspaces; and choosing the first and second subspaces for reconciliation based on the determined subspace similarity; and reconciling the first and second subspaces to remove at least one duplicate chunk from the first and second subspaces, comprising; determining whether a first signature in the first subspace matches a second signature in the second subspace; and in response to determining that the first signature in the first subspace matches the second signature in the second subspace, marking a data chunk referenced to either the first signature or the second signature for deletion from a corresponding data store. - View Dependent Claims (12, 13, 14, 15, 16)
-
-
17. In a computing environment, a method performed on one or more processors, the method comprising:
-
loading a subspace index comprising less than all index entries of a hash index service from a secondary media into a primary memory, wherein the subspace index corresponds to a partitioning of a global index into separate subspaces; reconciling at least two of the subspaces to remove at least one duplicate chunk, including using a resemblance metric to compare a subspace and another subspace by determining a similarity between at least one signature representative of the subspace and at least one signature representative of the other subspace; and using the subspace index to deduplicate a dataset, including; chunking the dataset into one or more chunks; for each chunk, determining whether a hash value computed for that chunk matches a hash value of an entry in the primary memory; in response to determining that the hash value computed for the chunk does not match any hash value of an entry in the primary memory, storing the chunk and adding an entry for the computed hash value associated with the chunk into the subspace index in association with a reference to the chunk; and in response to determining that the hash value computed for the chunk matches a hash value of an entry in the primary memory, returning data representing information by which an existing chunk corresponding to the hash value of the entry in the primary memory is locatable. - View Dependent Claims (18, 19, 20)
-
Specification