Deduplicating storage with enhanced frequent-block detection

US 9,177,028 B2
Filed: 04/30/2012
Issued: 11/03/2015
Est. Priority Date: 04/30/2012
Status: Active Grant

Abstract

Detecting data duplication comprises maintaining a fingerprint directory including one or more entries, each entry including a data fingerprint and a data location for a data chunk. Each entry is associated with a seen-count attribute which is an indication of how often the fingerprint has been seen in arriving data chunks. Higher-frequency entries in the directory are retained, while also taking into account recency of data accesses. A data duplication detector detects that the data fingerprint for a new chunk is the same as the data fingerprint contained in an entry in the fingerprint directory.
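As an illustration only, the directory described in the abstract might be sketched as follows. The class and field names, the integer chunk locations, the use of SHA-256 as the fingerprint function, and the write counter standing in for data access age are all assumptions made for this sketch; the source does not specify them.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    location: int         # where the chunk is stored (hypothetical integer address)
    seen_count: int = 1   # how many arriving chunks carried this fingerprint
    last_access: int = 0  # logical clock value of the most recent access


class FingerprintDirectory:
    """Hypothetical in-memory directory: fingerprint -> Entry."""

    def __init__(self):
        self.entries = {}
        self.clock = 0  # counts writes; stands in for "data access age"

    @staticmethod
    def fingerprint(chunk: bytes) -> bytes:
        # SHA-256 is an assumption; the source does not name a hash function.
        return hashlib.sha256(chunk).digest()

    def write(self, chunk: bytes, location: int):
        """Return the location of an identical stored chunk, or None if new."""
        self.clock += 1
        fp = self.fingerprint(chunk)
        entry = self.entries.get(fp)
        if entry is not None:
            # Duplicate detected: bump the seen-count and refresh recency.
            entry.seen_count += 1
            entry.last_access = self.clock
            return entry.location
        self.entries[fp] = Entry(location, last_access=self.clock)
        return None
```

In this sketch a second write of the same bytes returns the location recorded by the first write, and the entry's `seen_count` rises to 2, which is what would mark it as multiply-seen.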

20 Claims

1. A method for detecting data duplication, comprising:
- maintaining a fingerprint directory comprising one or more entries, each entry including a data fingerprint and a data location for a data chunk;
  
  associating each said entry with a seen-count attribute which is an indication of how often a data fingerprint has been seen in arriving data chunks to be written in a storage system, and distinguishes multiply-seen entries for data fingerprints present in at least two data chunks from once-seen entries for data fingerprints present in no more than a single data chunk;
  
  retaining higher-frequency entries, while also taking into account recency of data accesses for the higher-frequency entries based on the seen-count attribute and the data access age; and
  
  detecting that the data fingerprint for a new chunk is the same as the data fingerprint contained in an entry in the fingerprint directory, wherein a policy is applied for distinguishing multiple seen-count categories based on tracking data access ages of entries in the fingerprint directory for different seen-count categories.
  - 2. The method of claim 1, wherein:
    - the fingerprint directory comprises a multiply-seen entry which has been found, and a once-seen entry which is inserted more recently, and the fingerprint module discards the once-seen entry substantially sooner than the multiply-seen entry;
      
      the seen-count attribute provides the distinction between a multiply-seen entry and a once-seen entry; and
      
      the data access ages of entries in the fingerprint directory are tracked for distinguishing the multiple seen-count categories based on a fixed ratio of age-at-eviction between multiple seen-count categories.
  - 3. The method of claim 2, further comprising:
    - maintaining a probabilistic shadow list comprising a record of fingerprint values not contained in the fingerprint directory;
      
      maintaining a shadow list module including the shadow list;
      
      detecting that the data fingerprint for a new chunk is contained in the shadow list;
      
      removing the data fingerprint for said new chunk from the shadow list; and
      
      adding to the fingerprint directory an entry containing the data fingerprint and the data location of the new chunk.
  - 4. The method of claim 3, further comprising:
    - adding to the shadow list the data fingerprint for a new chunk whose data fingerprint was not found in the fingerprint directory by the duplicate detection module.
  - 5. The method of claim 3, further comprising:
    - discarding a once-seen entry from the fingerprint directory and adding to the shadow list the data fingerprint from the discarded entry.
  - 6. The method of claim 3, wherein:
    - the shadow list further comprises a probabilistic set-object data structure with a bounded error rate.
  - 7. The method of claim 6, wherein:
    - the probabilistic set-object data structure comprises one of a set-object data structure on a collapsed key and a Bloom filter.
  - 8. The method of claim 6, wherein the data access ages of the entries in the fingerprint directory are tracked in one of time units and total input/output operations performed by the storage system.
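The shadow-list flow of claims 3 and 4 can be sketched roughly as below. This is one possible reading, not the patented implementation: the "set-object data structure on a collapsed key" of claim 7 is interpreted here as a plain set of truncated fingerprints, and the names `ShadowList`, `handle_write`, and `key_bytes`, along with the SHA-256 fingerprint, are hypothetical. A truncated key is used rather than a standard Bloom filter because the flow needs a removal step, which a standard Bloom filter does not support.

```python
import hashlib


class ShadowList:
    """Set of truncated ("collapsed") fingerprints; false positives are possible."""

    def __init__(self, key_bytes: int = 4):
        self.key_bytes = key_bytes  # assumed truncation length
        self.keys = set()

    def _collapse(self, fp: bytes) -> bytes:
        return fp[: self.key_bytes]

    def add(self, fp: bytes) -> None:
        self.keys.add(self._collapse(fp))

    def __contains__(self, fp: bytes) -> bool:
        return self._collapse(fp) in self.keys

    def remove(self, fp: bytes) -> None:
        self.keys.discard(self._collapse(fp))


def handle_write(directory: dict, shadow: ShadowList, chunk: bytes, location: int):
    """Return the stored location if the chunk is a known duplicate, else None."""
    fp = hashlib.sha256(chunk).digest()  # hash choice is an assumption
    if fp in directory:
        return directory[fp]       # duplicate detected in the directory
    if fp in shadow:
        shadow.remove(fp)          # seen before: promote into the directory
        directory[fp] = location
    else:
        shadow.add(fp)             # first sighting: remember it cheaply
    return None
```

Under these assumptions the first write of a chunk only records a collapsed key, the second write promotes the fingerprint into the directory with the second chunk's location, and the third write is detected as a duplicate. With keys of `k` bytes and `n` keys stored, the chance that an unseen fingerprint matches is at most `n / 2**(8*k)`, which is one way to obtain a bounded error rate. The step of claim 5, moving an evicted once-seen fingerprint into the shadow list, is omitted here.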

9. A computer program product for detecting data duplication, the computer program product comprising:
- a non-transitory tangible storage medium readable by a computer system and storing instructions for execution by the computer system for performing a method comprising:
  
  maintaining a fingerprint directory comprising one or more entries, each entry including a data fingerprint and a data location for a data chunk;
  
  associating each said entry with a seen-count attribute which is an indication of how often the data fingerprint has been seen in arriving data chunks to be written in a storage system and is used for distinguishing multiply-seen entries for data fingerprints present in at least two data chunks from once-seen entries for data fingerprints present in no more than a single data chunk;
  
  retaining higher-frequency entries, while also taking into account recency of data accesses for the higher-frequency entries based on the seen-count attribute and data access age; and
  
  detecting that the data fingerprint for a new chunk is the same as the data fingerprint contained in an entry in the fingerprint directory, wherein a policy is applied for distinguishing multiple seen-count categories based on tracking data access ages of entries in the fingerprint directory for different seen-count categories.
  - 10. The computer program product of claim 9, wherein:
    - the fingerprint directory comprises a multiply-seen entry which has been found, and a once-seen entry which is inserted more recently, and the fingerprint module discards the once-seen entry substantially sooner than the multiply-seen entry; and
      
      the seen-count attribute provides the distinction between a multiply-seen entry and a once-seen entry.
  - 11. The computer program product of claim 10, further comprising:
    - maintaining a probabilistic shadow list comprising a record of fingerprint values not contained in the fingerprint directory;
      
      maintaining a shadow list module including the shadow list;
      
      detecting that the data fingerprint for a new chunk is contained in the shadow list;
      
      removing the data fingerprint for said new chunk from the shadow list; and
      
      adding to the fingerprint directory an entry containing the data fingerprint and the data location of the new chunk.
  - 12. The computer program product of claim 11, wherein:
    - the shadow list further comprises a probabilistic set-object data structure; and
      
      the data access ages of entries in the fingerprint directory are tracked for distinguishing multiple seen-count categories based on a fixed ratio of age-at-eviction between multiple seen-count categories.

13. A system for detecting data duplication, comprising:
- a memory device;
  
  a fingerprint controller coupled to the memory device, the fingerprint controller maintains a fingerprint directory comprising one or more entries, each entry including a data fingerprint and a data location for a data chunk in a storage device;
  
  wherein each entry is associated with a seen-count attribute which is an indication of how often the fingerprint has been seen in arriving data chunks to be written in the system, and distinguishes multiply-seen entries for data fingerprints present in at least two data chunks from once-seen entries for data fingerprints present in no more than a single data chunk, and wherein the fingerprint controller retains higher-frequency entries, while also taking into account recency of data accesses for the higher-frequency entries based on the seen-count attribute and data access age; and
  
  a duplicate detector that detects if the data fingerprint for a new chunk is the same as the data fingerprint contained in an entry in the fingerprint directory, wherein a policy is applied for distinguishing multiple seen-count categories based on tracking data access ages of entries in the fingerprint directory for different seen-count categories.
  - 14. The system of claim 13, wherein:
    - the fingerprint directory comprises a multiply-seen entry which has been found, and a once-seen entry which is inserted more recently, and the fingerprint controller discards the once-seen entry substantially sooner than the multiply-seen entry; and
      
      the seen-count attribute provides the distinction between a multiply-seen entry and a once-seen entry.
  - 15. The system of claim 14, further comprising:
    - a shadow list controller coupled to the memory device, the shadow list controller maintains a probabilistic shadow list comprising a record of fingerprint values not contained in the fingerprint directory, wherein the shadow list controller detects that the data fingerprint for a new chunk is contained in the shadow list, removes the data fingerprint for said new chunk from the shadow list, and adds to the fingerprint directory an entry containing the data fingerprint and the data location of the new chunk.
  - 16. The system of claim 15, wherein:
    - the shadow list controller adds to the shadow list the data fingerprint for a new chunk whose data fingerprint was not found in the fingerprint directory by the duplicate detector.
  - 17. The system of claim 15, wherein:
    - the fingerprint controller discards a once-seen entry from the fingerprint directory and adds to the shadow list the data fingerprint from the discarded entry.
  - 18. The system of claim 15, wherein:
    - the shadow list further comprises a probabilistic set-object data structure with a bounded error rate; and
      
      the data access ages of entries in the fingerprint directory are tracked for distinguishing the multiple seen-count categories based on a fixed ratio of age-at-eviction between multiple seen-count categories.
  - 19. The system of claim 18, wherein the probabilistic set-object data structure comprises one of a set-object data structure on a collapsed key and a Bloom filter.
  - 20. The system of claim 18, wherein the data access ages of the entries in the fingerprint directory are tracked in one of time units and total input/output operations performed by the system.
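One way to picture the "fixed ratio of age-at-eviction" policy, with ages counted in total I/O operations, is the sketch below. It is an interpretation made for illustration, not the method the patent discloses: the two-queue layout, the class name `TwoCategoryDirectory`, the `age_ratio` parameter, and its default of 4 are all assumptions.

```python
from collections import OrderedDict


class TwoCategoryDirectory:
    """Hypothetical directory with separate once-seen and multiply-seen queues."""

    def __init__(self, capacity: int, age_ratio: float = 4.0):
        self.capacity = capacity      # assumed to be at least 1
        self.age_ratio = age_ratio    # multiply-seen may age this many times longer
        self.io_count = 0             # age unit: total I/O operations
        self.once = OrderedDict()     # fingerprint -> (location, last_access)
        self.multi = OrderedDict()    # both queues are ordered oldest first

    def _evict_one(self):
        if not self.multi:
            return self.once.popitem(last=False)
        if not self.once:
            return self.multi.popitem(last=False)
        age_once = self.io_count - next(iter(self.once.values()))[1]
        age_multi = self.io_count - next(iter(self.multi.values()))[1]
        # Evict the once-seen entry unless the multiply-seen one is more than
        # age_ratio times older.
        if age_once * self.age_ratio >= age_multi:
            return self.once.popitem(last=False)
        return self.multi.popitem(last=False)

    def access(self, fp, location):
        """Return the stored location on a hit, or None after inserting a miss."""
        self.io_count += 1
        for queue in (self.multi, self.once):
            if fp in queue:
                loc, _ = queue.pop(fp)
                self.multi[fp] = (loc, self.io_count)  # now multiply-seen, most recent
                return loc
        if len(self.once) + len(self.multi) >= self.capacity:
            self._evict_one()
        self.once[fp] = (location, self.io_count)
        return None
```

With a capacity of 3, accessing `a`, `a`, `b`, `c`, `d` in that order leaves `a` in the multiply-seen queue and evicts `b`, even though `a` was last touched earlier than `b`. That matches the behaviour described in claim 14, where a once-seen entry inserted more recently is discarded sooner than a multiply-seen one.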

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chambliss, David D.; Constantinescu, Mihail C.; Glider, Joseph S.; Lu, Maohua
Primary Examiner(s): Truong, Dennis
Application Number: US13/460,653
Publication Number: US 20130290277A1
Time in Patent Office: 1,282 Days
Field of Search: 707/691, 707/662, 707/663, 707/664, 707/666, 707/813, 707/814, 707/692
US Class Current: 1/1
CPC Class Codes:
G06F 16/2365   Ensuring data consistency a...
G06F 16/24556   Aggregation; Duplicate elim...
G06F 16/955   using information identifie...
