Distributed deduplication in a distributed system of hybrid storage and compute nodes
First Claim
1. A non-transitory machine-readable medium having executable instructions to cause one or more processing units to perform a method to determine duplicative data in a distributed storage system, the method comprising:
- determining, at a first one of a plurality of storage controller servers, if a first entity is duplicated by a second entity in the distributed storage system for deduplication, wherein the second entity is stored on a second one of the plurality of storage controller servers in the distributed storage system, the distributed storage system includes the plurality of storage controller servers, and the determining if the first entity is duplicated includes,receiving the first entity to be stored in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast storage to persistent storage,building a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system,looking up the first entity in the data deduplication table,if the first entity exists in the data deduplication table, updating metadata for the first entity to indicate that a virtual node associated with the second entity stores a duplicate of the first entity, wherein the virtual node is further mapped to the second one of the plurality of storage controller servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node,wherein the building the data deduplication table comprises;
for each of a plurality of stored entities,computing a current fingerprint for each of the plurality of the stored entities,if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table,
incrementing a reference count for the current fingerprint, andif the current fingerprint is not in the top-K fingerprints,
decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and
removing zero reference count fingerprints, andif the first entity is duplicated, removing the first entity from the distributed storage system.
4 Assignments
0 Petitions
Accused Products
Abstract
A distributed storage system called StorFS that performs distributed data deduplication is described. In an exemplary embodiment, a storage controller server determines if there is duplicative data in a distributed storage system. In this embodiment, the storage controller server determines if an entity is duplicated in the distributed storage system in line with an incoming input/output operation. The storage controller server determines if the entity is duplicated in the distributed storage system by receiving the entity and looking up the entity in a data deduplication table. If the entity exists in the data deduplication table, the storage controller server updates the metadata for the entity to point to the duplicate entity.
-
Citations
20 Claims
-
1. A non-transitory machine-readable medium having executable instructions to cause one or more processing units to perform a method to determine duplicative data in a distributed storage system, the method comprising:
-
determining, at a first one of a plurality of storage controller servers, if a first entity is duplicated by a second entity in the distributed storage system for deduplication, wherein the second entity is stored on a second one of the plurality of storage controller servers in the distributed storage system, the distributed storage system includes the plurality of storage controller servers, and the determining if the first entity is duplicated includes, receiving the first entity to be stored in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast storage to persistent storage, building a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system, looking up the first entity in the data deduplication table, if the first entity exists in the data deduplication table, updating metadata for the first entity to indicate that a virtual node associated with the second entity stores a duplicate of the first entity, wherein the virtual node is further mapped to the second one of the plurality of storage controller servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node, wherein the building the data deduplication table comprises; for each of a plurality of stored entities, computing a current fingerprint for each of the plurality of the stored entities, if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table,
incrementing a reference count for the current fingerprint, andif the current fingerprint is not in the top-K fingerprints,
decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and
removing zero reference count fingerprints, andif the first entity is duplicated, removing the first entity from the distributed storage system. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A computerized method that determines duplicative data in a distributed storage system, the method comprising:
-
determining, at the first one of a plurality of storage controller servers, if a first entity stored on a first virtual node is duplicated by a second entity in the distributed storage system for deduplication, wherein a second entity is stored on a second virtual node, the distributed storage system includes the plurality of storage controller servers and a plurality of virtual nodes, the plurality of virtual nodes includes the first and second virtual nodes, and the determining if the first entity is duplicated includes, receiving the first entity to be stored in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast storage to persistent storage, building a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system, looking up the first entity in the data deduplication table, if the first entity exists in the data deduplication table, updating metadata for the first entity to indicate that the second virtual node stores a duplicate of the first entity, wherein the second virtual node is further mapped to one of the plurality of storage controller servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node, wherein the building the data deduplication table comprises; for each of a plurality of stored entities, computing a current fingerprint for each of the plurality of the stored entities, if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table,
incrementing a reference count for the current fingerprint, andif the current fingerprint is not in the top-K fingerprints,
decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and
removing zero reference count fingerprints, andif the first entity is duplicated, removing the first entity from the distributed storage system. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
-
-
17. A distributed storage system to determine duplicative data in a distributed storage system, the distributed storage system comprising:
-
an interconnection network; and a plurality of storage servers, interconnected by the interconnection network, wherein each of the plurality of storage servers includes, a processing unit, a first set of instructions, executed by the processing unit, that determines if a first entity is duplicated by a second entity in the distributed storage system for deduplication, wherein the first entity is stored on a first one of the plurality of storage servers, the second entity is stored on a second one of the plurality storage servers in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast physical storage to persistent physical storage of the distributed storage system, and the first set of instructions includes, a second set of instructions that receives the first entity to be stored in the distributed storage system, and builds a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system, a third set of instructions that looks up the first entity in the data deduplication table, and a fourth set of instructions that updates metadata for the first entity to indicate a virtual node associated with the second entity that stores a duplicate of the first entity if the second entity exists in the data deduplication table, wherein the second virtual node is further mapped to one of the plurality of storage servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node wherein the second set of instructions includes instructions to cause the building the data deduplication table by; for each of a plurality of stored entities,
computing a current fingerprint for each of the plurality of the stored entities,
if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table,
incrementing a reference count for the current fingerprint, and,
if the current fingerprint is not in the top-K fingerprints,
decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and
removing zero reference count fingerprints, anda fifth set of instructions that, if the first entity is duplicated, removes the first entity from the distributed storage system. - View Dependent Claims (18, 19, 20)
-
Specification