
Distributed deduplication in a distributed system of hybrid storage and compute nodes

  • US 10,019,459 B1
  • Filed: 12/19/2013
  • Issued: 07/10/2018
  • Est. Priority Date: 12/19/2012
  • Status: Expired due to Fees
First Claim

1. A non-transitory machine-readable medium having executable instructions to cause one or more processing units to perform a method to determine duplicative data in a distributed storage system, the method comprising:

  • determining, at a first one of a plurality of storage controller servers, if a first entity is duplicated by a second entity in the distributed storage system for deduplication, wherein the second entity is stored on a second one of the plurality of storage controller servers in the distributed storage system, the distributed storage system includes the plurality of storage controller servers, and the determining if the first entity is duplicated includes,

    • receiving the first entity to be stored in the distributed storage system, wherein the determination if the first entity is for deduplication occurs when the first entity is flushed from a write log in fast storage to persistent storage,

    • building a data deduplication table indicating a top-K entities in the distributed storage system, wherein a number of the top-K entities is less than a total number of entities in the distributed storage system,

    • looking up the first entity in the data deduplication table,

    • if the first entity exists in the data deduplication table, updating metadata for the first entity to indicate that a virtual node associated with the second entity stores a duplicate of the first entity, wherein the virtual node is further mapped to the second one of the plurality of storage controller servers, the virtual node stores a collection of a plurality of objects that includes the second entity, and the metadata for the first entity is stored in another virtual node,

    • wherein the building the data deduplication table comprises;

      • for each of a plurality of stored entities,

        • computing a current fingerprint for each of the plurality of the stored entities,

        • if the current fingerprint is in top-K fingerprints as indicated in the data deduplication table, incrementing a reference count for the current fingerprint, and

        • if the current fingerprint is not in the top-K fingerprints,

          • decrementing reference counts of all other fingerprints indicated in the top-K fingerprints, and

          • removing zero reference count fingerprints, and

  • if the first entity is duplicated, removing the first entity from the distributed storage system.
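
The table-building and lookup steps recited in the claim can be illustrated with a short Python sketch. It is not the patented implementation. It is similar in spirit to counter-based frequent-item summaries such as Misra-Gries, and it rests on several assumptions that the claim does not state: SHA-256 is used as the fingerprint, a new fingerprint is admitted with a count of 1 while the table holds fewer than K entries, and the names `fingerprint`, `build_top_k_table`, `flush_entity`, `vnode_of`, and `metadata` are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content fingerprint. SHA-256 is an assumption; the claim names no hash."""
    return hashlib.sha256(data).hexdigest()


def build_top_k_table(entities, k):
    """Build a bounded table of frequently seen fingerprints and their counts."""
    table = {}  # fingerprint -> reference count
    for data in entities:
        fp = fingerprint(data)
        if fp in table:
            # Fingerprint is already tracked: increment its reference count.
            table[fp] += 1
        elif len(table) < k:
            # Assumption (not in the claim): admit new fingerprints while
            # the table has room.
            table[fp] = 1
        else:
            # Fingerprint is not tracked and the table is full: decrement
            # every tracked count and drop entries that reach zero.
            for key in list(table):
                table[key] -= 1
                if table[key] == 0:
                    del table[key]
    return table


def flush_entity(data, entity_id, table, vnode_of, metadata):
    """Decide, at flush time, whether an entity must be written.

    vnode_of is a hypothetical map from fingerprint to the virtual node that
    already stores an entity with that fingerprint. Returns True if the data
    should be written to persistent storage, False if it was deduplicated.
    """
    fp = fingerprint(data)
    if fp in table and fp in vnode_of:
        # Record only a reference to the virtual node holding the duplicate.
        metadata[entity_id] = {"fingerprint": fp, "duplicate_vnode": vnode_of[fp]}
        return False
    return True
```

Because the table never holds more than K fingerprints, its memory use stays bounded regardless of how many entities the system stores. The trade-off in this sketch is that entities whose fingerprints fall outside the table are written without deduplication.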
