METHOD FOR REMOVING DUPLICATE DATA FROM A STORAGE ARRAY

US 20130086006A1
Filed: 09/30/2011
Published: 04/04/2013
Est. Priority Date: 09/30/2011
Status: Active Grant

Abstract

A system and method for efficiently removing duplicate data blocks at a fine granularity from a storage array. A data storage subsystem supports multiple deduplication tables. Table entries in one deduplication table have the highest associated probability of being deduplicated. Table entries may move from one deduplication table to another as the probabilities change. Additionally, a table entry may be evicted from all deduplication tables if a corresponding estimated probability falls below a given threshold. The probabilities are based on attributes associated with a data component and attributes associated with a virtual address corresponding to a received storage access request. A strategy for searches of the multiple deduplication tables may also be determined by the attributes associated with a given storage access request.
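
The placement and eviction policy described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `TieredDedupTables`, the constants `PROMOTE_THRESHOLD` and `EVICT_THRESHOLD`, and their numeric values are hypothetical, and the probability estimate is simply passed in by the caller instead of being derived from data-component and virtual-address attributes.

```python
from dataclasses import dataclass, field

# Hypothetical values; the abstract only refers to "a given threshold".
PROMOTE_THRESHOLD = 0.5
EVICT_THRESHOLD = 0.05


@dataclass
class TieredDedupTables:
    """Two hash-keyed tables; `first` holds the most likely duplicates."""
    first: dict = field(default_factory=dict)   # hash -> physical location
    second: dict = field(default_factory=dict)  # hash -> physical location

    def place(self, digest: bytes, location: int, probability: float) -> str:
        """Insert or move an entry according to its estimated probability.

        Returns the name of the table that now holds the entry, or
        "evicted" if the estimate fell below EVICT_THRESHOLD.
        """
        # Drop any existing copy so an entry lives in at most one table.
        self.first.pop(digest, None)
        self.second.pop(digest, None)
        if probability < EVICT_THRESHOLD:
            return "evicted"
        if probability > PROMOTE_THRESHOLD:
            self.first[digest] = location
            return "first"
        self.second[digest] = location
        return "second"


tables = TieredDedupTables()
print(tables.place(b"h1", 100, 0.9))   # high estimate: goes to the first table
print(tables.place(b"h1", 100, 0.2))   # estimate dropped: moves to the second table
print(tables.place(b"h1", 100, 0.01))  # estimate below eviction threshold: removed
```

Calling `place` again with a new estimate is how this sketch models entries moving between tables as probabilities change.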

21 Claims

1. A computer system comprising:
- a data storage medium;
  
  a first deduplication table comprising a first plurality of entries and a second deduplication table comprising a second plurality of entries, wherein each entry of the first and the second plurality of entries includes a hash corresponding to a data component; and
  
  a data storage controller configured to:
  
  store at least one entry in the first deduplication table rather than the second deduplication table based at least in part on a prediction that the at least one entry has a likelihood of being deduplicated that exceeds a given threshold;
  
  search the first deduplication table based on a first hash corresponding to a storage access request prior to any search of the second deduplication table with the first hash;
  
  initiate additional deduplication processing steps, in response to detecting a hit in the first deduplication table during the search; and
  
  forego said additional deduplication processing steps, in response to detecting a miss in the first deduplication table during the search.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The system as recited in claim 1, wherein the storage access request is a write request, and wherein the data storage controller is configured to store a data component corresponding to the write request in the data storage medium in response to detecting a miss in the first deduplication table during the search, wherein the data component is stored in the data storage medium prior to any search of the second deduplication table with the first hash.
  - 3. The system as recited in claim 2, wherein the first deduplication table and the second deduplication table are stored across a hierarchy of different storage media.
  - 4. The system as recited in claim 3, wherein the data storage controller is further configured to restrict a search of the first deduplication table or the second deduplication table to a subset of the hierarchy of different storage media and a number of accesses to a given level of the hierarchy based on attributes of a corresponding data component and the characteristics of the storage media.
  - 5. The system as recited in claim 1, wherein the data storage controller is further configured to determine whether the search should continue in one or more additional deduplication tables in response to detecting a miss in the first deduplication table during the search, wherein the determination is based on resource and performance issues.
  - 6. The system as recited in claim 1, wherein in response to detecting a match between a first fingerprint and a second fingerprint, the data storage controller is configured to compare a first data component corresponding to the first fingerprint with a second data component corresponding to the second fingerprint.
  - 7. The system as recited in claim 6, wherein in response to detecting that the content of the first data component and the content of the second data component are the same, the data storage controller is further configured to:
    - forego a write to storage of the first data component;
      
      store a virtual-to-physical address mapping corresponding to the write in a first mapping table; and
      
      store a reverse physical-to-virtual address mapping corresponding to the write in a second mapping table.
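
The write path recited in claims 1, 2, 6, and 7 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed system: the class `DedupWritePath`, its in-memory dictionaries, and the choice of SHA-256 as the hash are all hypothetical, and the second deduplication table is present but never searched inline.

```python
import hashlib


class DedupWritePath:
    """Toy controller: inline search of the first table only."""

    def __init__(self):
        self.first_table = {}   # hash -> physical address (likely duplicates)
        self.second_table = {}  # hash -> physical address (not searched inline)
        self.medium = {}        # physical address -> stored bytes
        self.v2p = {}           # first mapping table: virtual -> physical
        self.p2v = {}           # second mapping table: physical -> virtual set
        self._next_phys = 0

    def _map(self, virt, phys):
        # Unlink any previous mapping so the reverse table stays consistent.
        old = self.v2p.get(virt)
        if old is not None:
            self.p2v[old].discard(virt)
        self.v2p[virt] = phys
        self.p2v.setdefault(phys, set()).add(virt)

    def write(self, virt, data):
        digest = hashlib.sha256(data).digest()
        phys = self.first_table.get(digest)
        # Hit: run the additional step of comparing the actual contents.
        if phys is not None and self.medium[phys] == data:
            self._map(virt, phys)  # forego the write, record both mappings
            return "deduplicated"
        # Miss (or content mismatch): store the data without consulting
        # the second table first.
        phys = self._next_phys
        self._next_phys += 1
        self.medium[phys] = data
        self._map(virt, phys)
        return "stored"


path = DedupWritePath()
print(path.write(0, b"block"))
# Assume a predictor judged this block likely to recur, so its entry
# is placed in the first table.
path.first_table[hashlib.sha256(b"block").digest()] = path.v2p[0]
print(path.write(1, b"block"))
```

Keeping the byte-for-byte comparison behind the hash hit means a hash collision falls through to an ordinary write instead of aliasing two different blocks.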

8. A method comprising:
- maintaining a first deduplication table comprising a first plurality of entries and a second deduplication table comprising a second plurality of entries in a computer system, wherein each entry of the first and the second plurality of entries includes a hash corresponding to a data component;
  
  storing at least one entry in the first deduplication table rather than the second deduplication table based at least in part on a prediction that the at least one entry has a likelihood of being deduplicated that exceeds a given threshold;
  
  searching the first deduplication table based on a first hash corresponding to a storage access request prior to any search of the second deduplication table with the first hash;
  
  initiating additional deduplication processing steps, in response to detecting a hit in the first deduplication table during the search; and
  
  foregoing said additional deduplication processing steps, in response to detecting a miss in the first deduplication table during the search.
- Dependent claims (9, 10, 11, 12, 13, 14)
  - 9. The method as recited in claim 8, wherein the storage access request is a write request, the method further comprising storing a data component corresponding to the write request in the data storage medium in response to detecting a miss in the first deduplication table during the search, wherein the data component is stored in the data storage medium prior to any search of the second deduplication table with the first hash.
  - 10. The method as recited in claim 9, wherein the first deduplication table and the second deduplication table are stored across a hierarchy of different storage media.
  - 11. The method as recited in claim 10, further comprising restricting a search of the first deduplication table or the second deduplication table to a subset of the hierarchy of different storage media and a number of accesses to a given level of the hierarchy based on attributes of a corresponding data component and the characteristics of the storage media.
  - 12. The method as recited in claim 8, further comprising determining whether the search should continue in one or more additional deduplication tables in response to detecting a miss in the first deduplication table during the search, wherein the determination is based on resource and performance issues.
  - 13. The method as recited in claim 8, wherein in response to detecting a match between a first fingerprint and a second fingerprint, the method further comprises comparing a first data component corresponding to the first fingerprint with a second data component corresponding to the second fingerprint.
  - 14. The method as recited in claim 13, wherein in response to detecting that the content of the first data component and the content of the second data component are the same, the method further comprises:
    - foregoing a write to storage of the first data component;
      
      storing a virtual-to-physical address mapping corresponding to the write in a first mapping table; and
      
      storing a reverse physical-to-virtual address mapping corresponding to the write in a second mapping table.
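
Claims 11 and 12 describe limiting a search to part of the storage hierarchy and deciding, after a miss, whether to continue. One possible reading is sketched below in Python. The function `search_with_budget`, the per-tier cost units, and the notion of an access budget are hypothetical; the claims refer only to "resource and performance issues" and do not specify how the decision is made.

```python
def search_with_budget(digest, tiers, access_budget):
    """Search tiers in order until a hit or until the budget runs out.

    `tiers` is a list of (cost, table) pairs ordered fastest first, where
    `table` maps hash -> physical location. Returns (location, spent);
    location is None when no searched tier contained the hash.
    """
    spent = 0
    for cost, table in tiers:
        # Stop before any lookup that would exceed the budget.
        if spent + cost > access_budget:
            break
        spent += cost
        if digest in table:
            return table[digest], spent
    return None, spent


# Hypothetical three-level hierarchy with illustrative relative costs.
ram = {b"a": 1}
ssd = {b"b": 2}
disk = {b"c": 3}
tiers = [(1, ram), (10, ssd), (100, disk)]

print(search_with_budget(b"b", tiers, access_budget=11))   # found in the SSD tier
print(search_with_budget(b"c", tiers, access_budget=11))   # disk tier never searched
```

A caller could choose `access_budget` per request, for example a small budget for latency-sensitive writes and a larger one for background deduplication.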

15. A non-transitory computer readable storage medium comprising program instructions, wherein said program instructions are executable to:
- maintain a first deduplication table comprising a first plurality of entries and a second deduplication table comprising a second plurality of entries in a computer system, wherein each entry of the first and the second plurality of entries includes a hash corresponding to a data component;
  
  store at least one entry in the first deduplication table rather than the second deduplication table based at least in part on a prediction that the at least one entry has a likelihood of being deduplicated that exceeds a given threshold;
  
  search the first deduplication table based on a first hash corresponding to a storage access request prior to any search of the second deduplication table with the first hash;
  
  initiate additional deduplication processing steps, in response to detecting a hit in the first deduplication table during the search; and
  
  forego said additional deduplication processing steps, in response to detecting a miss in the first deduplication table during the search.
- Dependent claims (16, 17, 18, 19, 20, 21)
  - 16. The storage medium as recited in claim 15, wherein the storage access request is a write request, and wherein the program instructions are further executable to store a data component corresponding to the write request in the data storage medium in response to detecting a miss in the first deduplication table during the search, wherein the data component is stored in the data storage medium prior to any search of the second deduplication table with the first hash.
  - 17. The storage medium as recited in claim 16, wherein the first deduplication table and the second deduplication table are stored across a hierarchy of different storage media.
  - 18. The storage medium as recited in claim 17, wherein the program instructions are further executable to restrict a search of the first deduplication table or the second deduplication table to a subset of the hierarchy of different storage media and a number of accesses to a given level of the hierarchy based on attributes of a corresponding data component and the characteristics of the storage media.
  - 19. The storage medium as recited in claim 15, wherein the program instructions are further executable to determine whether the search should continue in one or more additional deduplication tables in response to detecting a miss in the first deduplication table during the search, wherein the determination is based on resource and performance issues.
  - 20. The storage medium as recited in claim 15, wherein in response to detecting a match between a first fingerprint and a second fingerprint, the program instructions are further executable to compare a first data component corresponding to the first fingerprint with a second data component corresponding to the second fingerprint.
  - 21. The storage medium as recited in claim 20, wherein in response to detecting that the content of the first data component and the content of the second data component are the same, the program instructions are further executable to:
    - forego a write to storage of the first data component;
      
      store a virtual-to-physical address mapping corresponding to the write in a first mapping table; and
      
      store a reverse physical-to-virtual address mapping corresponding to the write in a second mapping table.

Current Assignee: Pure Storage, Inc.
Original Assignee: Ethan Miller
Inventors: Colgrove, John; Hayes, John; Miller, Ethan; Sandvig, Cary; Hasbani, Joseph S.

Granted Patent: US 8,930,307 B2
US Class Current: 707/692
CPC Class Codes

G06F 16/137   Hash-based content-based in...

G06F 16/1752   based on file chunks

G06F 3/0608   Saving storage space on sto...

G06F 3/0641   De-duplication techniques

G06F 3/0688   Non-volatile semiconductor ...
