Distributed deduplication in a distributed system of hybrid storage and compute nodes

US 10,019,459 B1
Filed: 12/19/2013
Issued: 07/10/2018
Est. Priority Date: 12/19/2012
Status: Expired due to Fees


Abstract

A distributed storage system called StorFS that performs distributed data deduplication is described. In an exemplary embodiment, a storage controller server determines if there is duplicative data in a distributed storage system. In this embodiment, the storage controller server determines if an entity is duplicated in the distributed storage system in line with an incoming input/output operation. The storage controller server determines if the entity is duplicated in the distributed storage system by receiving the entity and looking up the entity in a data deduplication table. If the entity exists in the data deduplication table, the storage controller server updates the metadata for the entity to point to the duplicate entity.
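
The lookup path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not StorFS code: the SHA-256 fingerprint, the `DedupTable` class, the `write_entity` function, and the `metadata` and `store` dictionaries are all hypothetical names introduced for the example. For simplicity the table here records every fingerprint it sees, whereas the claims restrict it to top-K entries.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(data: bytes) -> str:
    """Content fingerprint. SHA-256 is an assumption; the source names no hash."""
    return hashlib.sha256(data).hexdigest()


class DedupTable:
    """Hypothetical table mapping a fingerprint to the key of a stored entity."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def lookup(self, fp: str) -> Optional[str]:
        return self._entries.get(fp)

    def add(self, fp: str, key: str) -> None:
        self._entries[fp] = key


def write_entity(
    key: str,
    data: bytes,
    table: DedupTable,
    metadata: Dict[str, str],
    store: Dict[str, bytes],
) -> bool:
    """Handle one incoming write; return True if it was deduplicated."""
    fp = fingerprint(data)
    existing = table.lookup(fp)
    if existing is not None:
        # Duplicate found: only the metadata changes, pointing at the existing copy.
        metadata[key] = existing
        return True
    # No match in the table: store the data and remember its fingerprint.
    store[key] = data
    metadata[key] = key
    table.add(fp, key)
    return False
```

In this sketch a second write with identical content leaves `store` untouched and only adds a metadata pointer to the first entity.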

20 Claims

1. A non-transitory machine-readable medium having executable instructions to cause one or more processing units to perform a method to determine duplicative data in a distributed storage system, the method comprising:
- determining, at a first one of a plurality of storage controller servers, if a first entity is duplicated by a second entity in the distributed storage system for deduplication, wherein the second entity is stored on a second one of the plurality of storage controller servers in the distributed storage system, the distributed storage system includes the plurality of storage controller servers, and the determining if the first entity is duplicated includes,
  - receiving the first entity to be stored in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast storage to persistent storage,
  - building a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system,
  - looking up the first entity in the data deduplication table,
  - if the first entity exists in the data deduplication table, updating metadata for the first entity to indicate that a virtual node associated with the second entity stores a duplicate of the first entity, wherein the virtual node is further mapped to the second one of the plurality of storage controller servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node,
  - wherein the building the data deduplication table comprises;
    - for each of a plurality of stored entities,
      - computing a current fingerprint for each of the plurality of the stored entities,
      - if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table,
        - incrementing a reference count for the current fingerprint, and
      - if the current fingerprint is not in the top-K fingerprints,
        - decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and
        - removing zero reference count fingerprints, and
  - if the first entity is duplicated, removing the first entity from the distributed storage system.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The non-transitory machine-readable medium of claim 1, wherein the data deduplication table includes hints about which entities are stored in the distributed storage system.
  - 3. The non-transitory machine-readable medium of claim 2, wherein each of the hints is a fingerprint of an entity that is stored in the distributed storage system.
  - 4. The non-transitory machine-readable medium of claim 3, wherein the looking up the first entity comprises:
    - computing a fingerprint for the first entity when the first entity arrives in the distributed storage system; and
    - looking up the fingerprint in the data deduplication table.
  - 5. The non-transitory machine-readable medium of claim 1, wherein the data deduplication table includes hints about which of a top-K entities are stored in the distributed storage system.
  - 6. The non-transitory machine-readable medium of claim 1, wherein the data deduplication table is globally available to the plurality of storage controller servers for data deduplication determinations.
  - 7. The non-transitory machine-readable medium of claim 1, wherein the building the data deduplication table further comprises, if the current fingerprint is not in the top-K fingerprints, adding the current fingerprint to the data deduplication table, and setting a reference count for the current fingerprint to one.
  - 8. The non-transitory machine-readable medium of claim 7, wherein the adding includes replacing one of the removed zero reference count fingerprints with the current fingerprint.
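
The table-building steps recited in claims 1, 7 and 8 can be sketched as below. This is one possible reading, not the patented implementation: the SHA-256 fingerprint and the function name `build_top_k_table` are hypothetical, and inserting a new fingerprint directly while the table still has a free slot is an assumption, since the claims do not say what happens before the table fills up.

```python
import hashlib
from typing import Dict, Iterable


def fingerprint(data: bytes) -> str:
    """SHA-256 is an assumption; the claims do not name a hash function."""
    return hashlib.sha256(data).hexdigest()


def build_top_k_table(entities: Iterable[bytes], k: int) -> Dict[str, int]:
    """Build a fingerprint -> reference count table holding at most k entries."""
    table: Dict[str, int] = {}
    for data in entities:
        fp = fingerprint(data)
        if fp in table:
            # Fingerprint is already tracked: increment its reference count.
            table[fp] += 1
            continue
        if len(table) < k:
            # Assumption: a free slot exists, so track the new fingerprint directly.
            table[fp] = 1
            continue
        # Table is full: decrement every tracked fingerprint and
        # remove those whose reference count reaches zero.
        for other in list(table):
            table[other] -= 1
            if table[other] == 0:
                del table[other]
        if len(table) < k:
            # Claims 7 and 8: the current fingerprint takes a freed slot
            # with its reference count set to one.
            table[fp] = 1
    return table
```

For the stream `a, a, b, c, a` with `k = 2`, the table ends up tracking the fingerprints of `a` (count 2) and `c` (count 1): `b` is evicted when `c` arrives and the counts are decremented.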

9. A computerized method that determines duplicative data in a distributed storage system, the method comprising:
- determining, at the first one of a plurality of storage controller servers, if a first entity stored on a first virtual node is duplicated by a second entity in the distributed storage system for deduplication, wherein a second entity is stored on a second virtual node, the distributed storage system includes the plurality of storage controller servers and a plurality of virtual nodes, the plurality of virtual nodes includes the first and second virtual nodes, and the determining if the first entity is duplicated includes,
  - receiving the first entity to be stored in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast storage to persistent storage,
  - building a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system,
  - looking up the first entity in the data deduplication table,
  - if the first entity exists in the data deduplication table, updating metadata for the first entity to indicate that the second virtual node stores a duplicate of the first entity, wherein the second virtual node is further mapped to one of the plurality of storage controller servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node,
  - wherein the building the data deduplication table comprises;
    - for each of a plurality of stored entities,
      - computing a current fingerprint for each of the plurality of the stored entities,
      - if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table,
        - incrementing a reference count for the current fingerprint, and
      - if the current fingerprint is not in the top-K fingerprints,
        - decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and
        - removing zero reference count fingerprints, and
  - if the first entity is duplicated, removing the first entity from the distributed storage system.
- Dependent claims (10, 11, 12, 13, 14, 15, 16)
  - 10. The method of claim 9, wherein the data deduplication table includes hints about which entities are stored in the distributed storage system.
  - 11. The method of claim 10, wherein each of the hints is a fingerprint of an entity that is stored in the distributed storage system.
  - 12. The method of claim 11, wherein the looking up the first entity comprises:
    - computing a fingerprint for the first entity in line with an incoming input/output operation; and
    - looking up the fingerprint in the data deduplication table.
  - 13. The method of claim 9, wherein the data deduplication table includes hints about which of a top-K entities are stored in the distributed storage system.
  - 14. The method of claim 9, wherein the data deduplication table is globally available to the plurality of storage controller servers for data deduplication determinations.
  - 15. The method of claim 9, wherein the building the data deduplication table further comprises, if the current fingerprint is not in the top-K fingerprints, adding the current fingerprint to the data deduplication table, and setting a reference count for the current fingerprint to one.
  - 16. The method of claim 15, wherein the adding includes replacing one of the removed zero reference count fingerprints with the current fingerprint.
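
The flush-time check in claim 9, where deduplication is decided as an entity moves from the write log to persistent storage and the metadata records which virtual node holds the duplicate, might look roughly like this. The `VirtualNode` dataclass, the `flush_write_log` function, and the plain dictionaries used for the deduplication table and metadata are hypothetical simplifications; in particular, the claims store the metadata in another virtual node, which this sketch replaces with a single in-memory dictionary.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def fingerprint(data: bytes) -> str:
    """SHA-256 is an assumption; the claims do not name a hash function."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class VirtualNode:
    """Hypothetical virtual node: a collection of objects mapped to one server."""
    vnode_id: int
    server: str
    objects: Dict[str, bytes] = field(default_factory=dict)


def flush_write_log(
    write_log: List[Tuple[str, bytes]],
    target: VirtualNode,
    dedup_table: Dict[str, int],
    metadata: Dict[str, int],
) -> None:
    """Drain the write log, deduplicating each entity on its way to persistent storage.

    dedup_table maps a fingerprint to the id of the virtual node holding that data;
    metadata maps an entity key to the id of the virtual node that stores its data.
    """
    for key, data in write_log:
        owner = dedup_table.get(fingerprint(data))
        if owner is not None:
            # Hit: record the virtual node that already holds the data;
            # the incoming copy is not persisted.
            metadata[key] = owner
        else:
            # Miss: persist the entity on the target virtual node.
            target.objects[key] = data
            metadata[key] = target.vnode_id
    write_log.clear()
```

Because the table only tracks a subset of fingerprints, a miss does not prove the entity is unique; it only means the duplicate, if any, is not among the tracked entries.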

17. A distributed storage system to determine duplicative data in a distributed storage system, the distributed storage system comprising:
- an interconnection network; and
- a plurality of storage servers, interconnected by the interconnection network, wherein each of the plurality of storage servers includes,
  - a processing unit,
  - a first set of instructions, executed by the processing unit, that determines if a first entity is duplicated by a second entity in the distributed storage system for deduplication, wherein the first entity is stored on a first one of the plurality of storage servers, the second entity is stored on a second one of the plurality storage servers in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast physical storage to persistent physical storage of the distributed storage system, and the first set of instructions includes,
    - a second set of instructions that receives the first entity to be stored in the distributed storage system, and builds a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system,
    - a third set of instructions that looks up the first entity in the data deduplication table, and
    - a fourth set of instructions that updates metadata for the first entity to indicate a virtual node associated with the second entity that stores a duplicate of the first entity if the second entity exists in the data deduplication table, wherein the second virtual node is further mapped to one of the plurality of storage servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node
    - wherein the second set of instructions includes instructions to cause the building the data deduplication table by;
      - for each of a plurality of stored entities,
        - computing a current fingerprint for each of the plurality of the stored entities,
        - if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table,
          - incrementing a reference count for the current fingerprint, and,
        - if the current fingerprint is not in the top-K fingerprints,
          - decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and
          - removing zero reference count fingerprints, and
    - a fifth set of instructions that, if the first entity is duplicated, removes the first entity from the distributed storage system.
- Dependent claims (18, 19, 20)
  - 18. The distributed storage system of claim 17, wherein the third set of instructions that looks up looks up by:
    - computing a fingerprint for the first entity when the first entity arrives in the distributed storage system; and
    - looking up the fingerprint in the data deduplication table.
  - 19. The distributed storage system of claim 17, wherein the data deduplication table is globally available to the plurality of storage servers for data deduplication determinations.
  - 20. The distributed storage system of claim 17, wherein the second set of instructions builds the data deduplication table by, if the current fingerprint is not in the top-K fingerprints, adding the current fingerprint to the data deduplication table, and setting a reference count for the current fingerprint to one.

Current Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee: Springpath, LLC. (Cisco Systems, Inc.)
Inventors: Agarwala, Sandip; Gaonkar, Shravan; Mahalingam, Mallikarjunan; Shah, Smit; Shaikh, Faraz; Vegulla, Praveen; Yadappanavar, Krishna
Primary Examiner(s): Mackes, Kris
Assistant Examiner(s): Nguyen, Merilyn
Application Number: US14/135,495
Time in Patent Office: 1,664 Days
CPC Class Codes

G06F 12/0246   in block erasable memory, e...

G06F 12/0253   Garbage collection, i.e. re...

G06F 12/0292   using tables or multilevel ...

G06F 12/0811   with multilevel cache hiera...

G06F 16/00   Information retrieval; Data...

G06F 16/1727   Details of free space manag...

G06F 16/1748   De-duplication implemented ...

G06F 16/1752   based on file chunks

G06F 16/182   Distributed file systems

G06F 16/2365   Ensuring data consistency a...

G06F 3/0608   Saving storage space on sto...

G06F 3/061   Improving I/O performance

G06F 3/0614   Improving the reliability o...

G06F 3/0619   in relation to data integri...

G06F 3/0641   De-duplication techniques

G06F 3/065   Replication mechanisms

G06F 3/0652   Erasing, e.g. deleting, dat...

G06F 3/0659   Command handling arrangemen...

G06F 3/067   Distributed or networked st...

G06F 3/0689   Disk arrays, e.g. RAID, JBOD

G11C 7/1072   for memories with random ac...

H04L 67/1097   for distributed storage of ...
