Granular partial recall of deduplicated files

US 10,180,943 B2
Filed: 02/28/2013
Issued: 01/15/2019
Est. Priority Date: 02/28/2013
Status: Active Grant

Abstract

The subject disclosure is directed towards partially recalling file ranges of deduplicated files by tracking dirty ranges (ranges modified by user writes) in a way that eliminates or minimizes reading and writing of adjacent, already-optimized data. The granularity of the ranges does not depend on any file-system granularity for tracking ranges. In one aspect, tracking data is flushed lazily in a manner that preserves data integrity and crash consistency. In another aspect, granular partial recall is supported on an open file while a data deduplication system is optimizing that file.
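As a rough illustration of tracking at a granularity chosen independently of the file system, the sketch below maps a write's byte span onto fixed-size tracking ranges. The function name `dirty_range_indices` and the 64 KiB default are assumptions made for this example; the abstract does not specify a range size.

```python
def dirty_range_indices(offset: int, length: int,
                        granularity: int = 64 * 1024) -> range:
    """Return the indices of the tracking ranges touched by a write.

    `granularity` is a hypothetical tracking-range size in bytes; it is
    chosen here for illustration and is unrelated to any file-system
    cluster or allocation size.
    """
    if length <= 0:
        return range(0)  # nothing written, nothing to mark dirty
    first = offset // granularity
    last = (offset + length - 1) // granularity  # index of the last byte written
    return range(first, last + 1)
```

A two-byte write straddling a range boundary dirties both neighbouring ranges, while a write aligned to a single range dirties only that one.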

20 Claims

1. A computing device comprising:
- one or more processing units; and
  
  one or more computer-readable media comprising computer-executable instructions, which, when executed by the one or more processing units, cause the computing device to:
  
  detect a writing of data into a deduplicated file that comprises references to chunks of data in a chunk store; and
  
  separately modify, in response to the detecting, each of at least two different data structures that are hierarchically arranged, wherein the computer-executable instructions that cause the computing device to perform the separate modifications comprise computer-executable instructions that cause the computing device to:
  
  modify one or more entries in a main recall table to identify as dirty one or more ranges of data of the deduplicated file that comprise the written data, wherein the main recall table is a hierarchically lower one of the at least two different data structures such that each of the one or more entries in the main recall table identifies whether a corresponding single one of the one or more ranges of data of the deduplicated file is either clean or dirty; and
  
  modify one or more entries in a recall index table to identify one or more blocks of multiple entries in the main recall table as comprising at least one entry identifying that a corresponding range of data of the deduplicated file is dirty, wherein the recall index table is a hierarchically higher one of the at least two different data structures such that a single entry of the recall index table identifies whether a corresponding block of multiple entries in the main recall table either comprises only entries that identify corresponding ranges of data of the deduplicated file as clean, or includes at least one entry that identifies a corresponding range of data of the deduplicated file as dirty;
  
  wherein a deduplicated file metadata that is stored as part of a file structure of the deduplicated file comprises a root recall index table; and
  
  wherein further the main recall table is stored externally to the deduplicated file metadata that is stored as part of the file structure of the deduplicated file.
  - 2. The computing device of claim 1, wherein, if the single entry of the recall index table identifies that the corresponding block of multiple entries in the main recall table includes the at least one entry that identifies the corresponding range of data of the deduplicated file as dirty, then the single entry of the recall index table comprises an identification of which copy of the main recall table is currently active;
    - and wherein further the modifying the one or more entries in the main recall table comprises modifying one or more entries in alternating copies of the main recall table such that a last-modified copy of the main recall table is the currently active copy of the main recall table as identified by entries of the recall index table.
  - 3. The computing device of claim 1, wherein the single entry of the recall index table further identifies whether the corresponding block of multiple entries in the main recall table comprises only entries that identify the corresponding ranges of data of the deduplicated file as dirty.
  - 4. The computing device of claim 1, wherein a single entry of the root recall index table corresponds to a block of multiple entries of the recall index table in a hierarchically same manner as the single entry of the recall index table corresponds to the block of multiple entries of the main recall table.
  - 5. The computing device of claim 1, wherein the root recall index table is stored in a reparse point of the deduplicated file.
  - 6. The computing device of claim 1, wherein the computer-readable media comprise further computer-executable instructions which, when executed by the one or more processing units, cause the computing device to:
    - increment an update sequence number after each modifying of the main recall table.
  - 7. The computing device of claim 1, wherein changes to at least one of the main recall table or the recall index table are flushed to disk only after a delay of a predetermined duration.
  - 8. The computing device of claim 1, wherein changes to at least one of the main recall table or the recall index table are flushed to disk upon at least one of:
    - a file flush, a volume flush, a file handle close, or a write-through file modification.
  - 9. The computing device of claim 1, wherein the computer-readable media comprise further computer-executable instructions which, when executed by the one or more processing units, cause the computing device to:
    - modify a second main recall table to identify as dirty one or more other ranges of data of the deduplicated file that comprise the data written by a second, subsequent writing of data into the deduplicated file; and
      
      modify a second recall index table to identify one or more blocks of multiple entries in the main recall table as comprising at least one entry identifying that a corresponding range of data of the deduplicated file is dirty;
      
      wherein the second main recall table and the second recall index table are hierarchically arranged such that the second main recall table is hierarchically lower and the second recall index table is hierarchically higher; and
      
      wherein further the second main recall table and the second recall index table provide for distinguishing between writes that occur before optimization processing of a region of the deduplicated file and writes that occur after the optimization processing of the region of the deduplicated file.
  - 10. The computing device of claim 1, wherein the computer-readable media comprise further computer-executable instructions which, when executed by the one or more processing units, cause the computing device to:
    - receive a request for a first read of data that is to be truncated as part of a deduplication optimization, the request for the first read being received prior to a truncation start;
      
      increment a first counter in response to a commencement of the first read, the first read obtaining data from the file structure of the deduplicated file;
      
      decrement the first counter in response to the first read completing;
      
      receive a request for a second read of data that is also to be truncated as part of the deduplication optimization, the request for the second read being received subsequent to the truncation start;
      
      leave the first counter unchanged in response to a commencement of the second read, the second read obtaining data from the chunk store; and
      
      delay truncation until the first counter is zero, the truncation comprising zeroing out the data being truncated such that the file structure of the deduplicated file instead comprises pointers to chunks in the chunk store for the data that was truncated.
  - 17. The computing device of claim 1, wherein the recall index table is the root recall index table.
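The two-level bookkeeping recited in claim 1, together with the "only dirty" state of claim 3, might be sketched roughly as follows. This is a minimal in-memory illustration and not the patented implementation: the names `RecallTables` and `BlockState`, the list-based storage, and the block size of four entries are all assumptions, and on-disk placement (reparse point versus external storage) is not modelled.

```python
from enum import Enum


class BlockState(Enum):
    ALL_CLEAN = 0   # every main-table entry in the block is clean
    SOME_DIRTY = 1  # at least one entry in the block is dirty
    ALL_DIRTY = 2   # every entry in the block is dirty (claim 3)


class RecallTables:
    """Hypothetical main recall table plus one recall index table."""

    def __init__(self, num_ranges: int, entries_per_block: int = 4):
        self.entries_per_block = entries_per_block
        self.main = [False] * num_ranges  # lower level: True means dirty
        num_blocks = -(-num_ranges // entries_per_block)  # ceiling division
        self.index = [BlockState.ALL_CLEAN] * num_blocks  # higher level

    def mark_dirty(self, first: int, last: int) -> None:
        """Record a write covering range indices first..last inclusive."""
        for i in range(first, last + 1):
            self.main[i] = True  # first modification: main recall table
        epb = self.entries_per_block
        for b in range(first // epb, last // epb + 1):
            block = self.main[b * epb:(b + 1) * epb]
            # second, separate modification: recall index table
            self.index[b] = (BlockState.ALL_DIRTY if all(block)
                             else BlockState.SOME_DIRTY)

    def is_dirty(self, i: int) -> bool:
        """Consult the index first; read the main table only when needed."""
        state = self.index[i // self.entries_per_block]
        if state is BlockState.ALL_CLEAN:
            return False
        if state is BlockState.ALL_DIRTY:
            return True
        return self.main[i]
```

The point of the higher level is that a lookup in an all-clean or all-dirty block never has to touch the main table, which the claims place outside the file's own metadata.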

11. A method of partially deduplicating data files at a finer granularity to increase data access performance, the method comprising:
- detecting a writing of data into a deduplicated file that comprises references to chunks of data in a chunk store;
  
  separately modifying, in response to the detecting, each of at least two different data structures that are hierarchically arranged, wherein the separately modifying comprises:
  
  modifying one or more entries in a main recall table to identify as dirty one or more ranges of data of the deduplicated file that comprise the written data, wherein the main recall table is a hierarchically lower one of the at least two different data structures such that each of the one or more entries in the main recall table identifies whether a corresponding single one of the one or more ranges of data of the deduplicated file is either clean or dirty; and
  
  modifying one or more entries in a recall index table to identify one or more blocks of multiple entries in the main recall table as comprising at least one entry identifying that a corresponding range of data of the deduplicated file is dirty, wherein the recall index table is a hierarchically higher one of the at least two different data structures such that a single entry of the recall index table identifies whether a corresponding block of multiple entries in the main recall table either comprises only entries that identify corresponding ranges of data of the deduplicated file as clean, or includes at least one entry that identifies a corresponding range of data of the deduplicated file as dirty;
  
  wherein a deduplicated file metadata that is stored as part of a file structure of the deduplicated file comprises a root recall index table; and
  
  wherein further the main recall table is stored externally to the deduplicated file metadata that is stored as part of the file structure of the deduplicated file.
  - 12. The method of claim 11, wherein, if the single entry of the recall index table identifies that the corresponding block of multiple entries in the main recall table includes the at least one entry that identifies the corresponding range of data of the deduplicated file as dirty, then the single entry of the recall index table comprises an identification of which copy of the main recall table is currently active;
    - and wherein further the modifying the one or more entries in the main recall table comprises modifying one or more entries in alternating copies of the main recall table such that a last-modified copy of the main recall table is the currently active copy of the main recall table as identified by entries of the recall index table.
  - 13. The method of claim 11, wherein a single entry of the root recall index table corresponds to a block of multiple entries of the recall index table in a hierarchically same manner as the single entry of the recall index table corresponds to the block of multiple entries of the main recall table.
  - 14. The method of claim 11, wherein the root recall index table is stored in a reparse point of the deduplicated file.
  - 15. The method of claim 11, further comprising:
    - modifying a second main recall table to identify as dirty one or more other ranges of data of the deduplicated file that comprise the data written by a second, subsequent writing of data into the deduplicated file; and
      
      modifying a second recall index table to identify one or more blocks of multiple entries in the main recall table as comprising at least one entry identifying that a corresponding range of data of the deduplicated file is dirty;
      
      wherein the second main recall table and the second recall index table are hierarchically arranged such that the second main recall table is hierarchically lower and the second recall index table is hierarchically higher; and
      
      wherein further the second main recall table and the second recall index table provide for distinguishing between writes that occur before optimization processing of a region of the deduplicated file and writes that occur after the optimization processing of the region of the deduplicated file.
  - 16. The method of claim 11, further comprising:
    - receiving a request for a first read of data that is to be truncated as part of a deduplication optimization, the request for the first read being received prior to a truncation start;
      
      incrementing a first counter in response to a commencement of the first read, the first read obtaining data from the file structure of the deduplicated file;
      
      decrementing the first counter in response to the first read completing;
      
      receiving a request for a second read of data that is also to be truncated as part of the deduplication optimization, the request for the second read being received subsequent to the truncation start;
      
      leaving the first counter unchanged in response to a commencement of the second read, the second read obtaining data from the chunk store; and
      
      delaying truncation until the first counter is zero, the truncation comprising zeroing out the data being truncated such that the file structure of the deduplicated file instead comprises pointers to chunks in the chunk store for the data that was truncated.
  - 18. The method of claim 11, wherein the single entry of the recall index table further identifies whether the corresponding block of multiple entries in the main recall table comprises only entries that identify the corresponding ranges of data of the deduplicated file as dirty.
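The counter-gated truncation of claims 10 and 16 could be sketched along the following lines. The class `TruncationGate`, its method names, and the use of a condition variable are assumptions for illustration only; the claims recite just the counter behaviour, not a particular synchronization primitive.

```python
import threading
from typing import Optional


class TruncationGate:
    """Hypothetical gate that delays truncation until file reads drain."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._inflight_file_reads = 0  # the "first counter" of the claims
        self._truncation_started = False

    def begin_read(self) -> str:
        """Return the source the read must use, counting file reads."""
        with self._cond:
            if self._truncation_started:
                return "chunk_store"  # counter is left unchanged
            self._inflight_file_reads += 1
            return "file"

    def end_read(self, source: str) -> None:
        """Decrement the counter when a read from the file completes."""
        if source != "file":
            return
        with self._cond:
            self._inflight_file_reads -= 1
            if self._inflight_file_reads == 0:
                self._cond.notify_all()

    def start_truncation(self) -> None:
        """Mark truncation start; later reads go to the chunk store."""
        with self._cond:
            self._truncation_started = True

    def wait_until_safe(self, timeout: Optional[float] = None) -> bool:
        """Block until no pre-truncation read is still using file data."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._inflight_file_reads == 0, timeout)
```

Under this sketch the optimizer would call `start_truncation()`, then `wait_until_safe()`, and only then zero out the region, so a read that began against the file structure never observes zeroed data.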

19. A computing device comprising:
- one or more processing units; and
  
  one or more computer-readable media comprising computer-executable instructions, which, when executed by the one or more processing units, cause the computing device to:
  
  receive a request to read a first set of data from a file that is only partially deduplicated, the file comprising:
  
  (1) pointers to chunks of data stored externally to the file in a chunk store and (2) dirtied file data comprising data that was changed after the file was last deduplicated into the chunks of data;
  
  determine from which portion of a file system to obtain the first set of data, in response to the request, by referencing a set of recall tables that are hierarchically arranged, the set of recall tables comprising:
  
  a main recall table that is a hierarchically lower table of the set of recall tables, wherein each entry of the main recall table identifies whether a corresponding range of data of the file is either clean or dirty; and
  
  a recall index table that is a hierarchically higher table of the set of recall tables, wherein each entry of the recall index table identifies whether a corresponding block of multiple entries in the main recall table either comprises only entries that identify corresponding ranges of data of the file as clean, or includes at least one entry that identifies a corresponding range of data of the file as dirty;
  
  source, in response to the read request, a first subset of the first set of data from the dirtied file data stored with the file if the set of recall tables indicate that the first subset is dirty; and
  
  source, in response to the read request, a second subset of the first set of data from one or more of the chunks of data stored externally to the file if the set of recall tables indicate that the second subset is clean.
  - 20. The computing device of claim 19, wherein a root recall index table, being a hierarchically highest table of the set of recall tables, is stored in a reparse point of the file, and wherein further either the recall index table is the root recall index table, or a single entry of the root recall index table corresponds to a block of multiple entries of the recall index table in a hierarchically same manner as the single entry of the recall index table corresponds to the block of multiple entries of the main recall table.
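The read path of claim 19 can be pictured as splitting a request into segments by source. The sketch below is an assumption-laden illustration: `plan_read`, the boolean encoding of both tables, and the `"file"` and `"chunk_store"` labels are hypothetical names, and the actual fetching of bytes is omitted.

```python
from typing import List, Tuple


def plan_read(offset: int, length: int, main: List[bool], index: List[bool],
              granularity: int,
              entries_per_block: int) -> List[Tuple[str, int, int]]:
    """Split a read into (source, start, end) byte segments.

    main[i] is True when tracking range i is dirty; index[b] is True when
    block b of the main table holds at least one dirty entry.
    """
    segments: List[Tuple[str, int, int]] = []
    end = offset + length
    pos = offset
    while pos < end:
        i = pos // granularity
        # Consult the higher-level table first; a clean block short-circuits
        # the lookup so the main table entry is never read.
        dirty = index[i // entries_per_block] and main[i]
        source = "file" if dirty else "chunk_store"
        seg_end = min(end, (i + 1) * granularity)
        if segments and segments[-1][0] == source:
            # Extend the previous segment when the source is unchanged.
            segments[-1] = (source, segments[-1][1], seg_end)
        else:
            segments.append((source, pos, seg_end))
        pos = seg_end
    return segments
```

Dirty segments would then be served from the data stored with the file and clean segments from the externally stored chunks, matching the two "source" steps of the claim.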

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Xie, Ping; Cheung, Chun Ho; Hasan, Kashif; Gupta, Abhishek; Kalach, Ran; Hefenbrock, Daniel
Primary Examiner(s)
Gurmu, Muluemebet

Application Number
US13/781,585
Publication Number
US 20140244601A1
Time in Patent Office
2,147 Days
Field of Search
707692
US Class Current
CPC Class Codes
G06F 16/162 Delete operations erasing i...
G06F 16/1748 De-duplication implemented ...
