HIGH PERFORMANCE DATA DEDUPLICATION IN A VIRTUAL TAPE SYSTEM

US 20090049260A1
Filed: 08/12/2008
Published: 02/19/2009
Est. Priority Date: 08/13/2007
Status: Abandoned Application

Abstract

Data deduplication in a storage system that achieves high performance through minimal overhead during a backup operation, fewer disk read operations to locate duplicate data, and minimal impact on restore operations involving deduplicated data.

156 Citations

23 Claims

1. A method for data deduplication comprising:
- receiving a plurality of backup datasets, each backup dataset comprising a plurality of data blocks;
  
  storing metadata in a plurality of metadata disk segments (meta-segment(s));
  
  storing the received data blocks in a plurality of data disk segments (data-segment(s));
  
  identifying one or more data-segment(s) comprising duplicate data, wherein the duplicate data in a data-segment is identical to data from one or more previous data-segment(s), and for each identified data-segment modifying metadata corresponding to duplicate data to correspond to the identical data, and releasing the identified data-segment; and
  
  updating metadata for each data-segment checked for data deduplication.
- - 2. The method of claim 1 wherein the step of storing metadata in a plurality of meta-segment(s) comprises:
    - storing metadata for the received data blocks;
      
      parsing the received data blocks for directory and file information; and
      
      storing metadata for each parsed directory and file.
  - 3. The method of claim 2 wherein storing metadata for the received data blocks comprises storing metadata for each received data block, and wherein storing metadata for each data block further comprises storing metadata for a plurality of spans of data such that the plurality of spans of data together comprise the span of data for the data block.
  - 4. The method of claim 3 wherein storing metadata for a span of data comprises:
    - storing location information for the span of data;
      
      storing the size of the span of data;
      
      storing the compression state of the span of data;
      
      storing the size of the data block; and
      
      storing information of the data-segment wherein the span of data is stored.
  - 5. The method of claim 4 wherein the metadata for a span of data is a BlkEntry.
  - 6. The method of claim 2 wherein storing parsed directory information comprises storing the name of the directory and the location of the metadata for the directory in the metadata of its corresponding parent directory.
  - 7. The method of claim 6 wherein the metadata for a directory is a DEntryHeader and the directory information stored in the metadata of a parent directory is in a DEntry.
  - 8. The method of claim 2 wherein storing parsed file information comprises storing the name of the file and the location of the metadata for the file in the metadata of its corresponding parent directory.
  - 9. The method of claim 8 wherein the metadata for a file is a FEntry and the file information stored in the metadata of a parent directory is in a DEntry.
  - 10. The method of claim 6 and claim 8 wherein a parent directory corresponding to a directory or a file is the parent directory determined from the parsed directory information of a dataset and if the directory or file has no parent directory the parent directory is the dataset directory of the backup dataset.
  - 11. The method of claim 10 wherein a dataset directory for a backup dataset is a directory created for each backup dataset received and wherein the name of the dataset directory corresponds to the time when the backup dataset was received.
  - 12. The method of claim 2 wherein parsing the data blocks for file information further comprises parsing for file segment information corresponding to the file, and for each parsed file segment:
    - computing fingerprint information for the data corresponding to the file segment;
      
      storing the computed fingerprint information in the metadata corresponding to the file segment;
      
      storing information of the location of the file segment data in the metadata corresponding to the file segment; and
      
      storing information of the location of the metadata corresponding to the file segment in the metadata corresponding to the file.
  - 13. The method of claim 12 wherein the metadata for a file segment is a DDLookup.
  - 14. The method of claim 1 wherein the step of identifying a data-segment with duplicate data further comprises:
    - traversing file and directory information stored for each backup dataset and, for each file traversed, locating a previous file with an identical traversal path and, if found, comparing fingerprint information for each file segment of the file with the fingerprint information of the corresponding file segment in the previous file and, for each file segment of the file with identical fingerprint information, identifying data-segment(s) for the data corresponding to the file segment.
  - 15. The method of claim 1 wherein the step of modifying metadata in an identified data-segment further comprises:
    - locating the metadata corresponding to the duplicate data in the identified data-segment and modifying the metadata to correspond to the identical data in the previous data-segment(s);
      
      locating metadata corresponding to non-duplicate data in the identified data-segment, copying the non-duplicate data to another data-segment, and modifying the metadata to correspond to the location where the data was copied.
  - 16. The method of claim 15 wherein the step of modifying metadata to correspond to the identical data in the previous data-segment(s) further comprises:
    - modifying metadata to correspond to the location of the identical data;
      
      modifying metadata to correspond to the data-segment of the identical data;
      
      modifying metadata indicating an offset within a plurality of data blocks corresponding to the start of the span of the identical data; and
      
      modifying metadata to indicate the compression state of the plurality of data blocks.
  - 17. The method of claim 14 wherein the step of traversing file and directory information comprises:
    - reading the stored root directories information and, for each stored root directory information reading its stored dataset directories information and, for each stored dataset directory information traversing its subdirectories and, for each file in a directory, reading its corresponding file information and, traversing the file segment information for each file segment corresponding to the file.
  - 18. The method of claim 14 wherein the step of locating previous file information with an identical traversal path for a file comprises:
    - locating a previous dataset directory with a dataset time less than the dataset time of the dataset directory corresponding to the file;
      
      starting from the subdirectories of the previous dataset directory and the dataset directory for the file, locating previous directory information in the previous dataset directory with the names of the directories traversed identical to the names of the directories traversed for the file; and
      
      locating a file in the previous directory information wherein the names of the two files are identical.
  - 19. The method of claim 18 wherein the dataset time is determined by the name of the dataset directory.
  - 20. The method of claim 1 wherein the step of releasing the identified data-segment further comprises decrementing a reference to the corresponding disk segment.
  - 21. The method of claim 1 wherein the step of updating metadata for each data-segment checked for data deduplication comprises:
    - locating metadata corresponding to file segment(s) which end in the data-segment and updating the metadata corresponding to each such file segment indicating that data deduplication has been performed for the file segment;
      
      locating file information for which the metadata corresponding to all the file segments of the file indicate that a data deduplication check has been performed and updating the metadata for the file indicating that data deduplication has been performed for the file; and
      
      locating directory information for which the metadata corresponding to all subdirectories and files indicate that data deduplication has been performed and updating the metadata for the directory indicating that data deduplication has been performed for the directory.
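
As an illustration of the matching described in claims 12, 14 and 18, the following Python sketch compares per-segment fingerprints between a file and the file with the same traversal path in an earlier dataset. It is a minimal sketch under stated assumptions, not the applicant's implementation: the `Dataset` and `FileSegment` classes, the use of SHA-256, the integer dataset times, and the choice of the most recent earlier dataset are all hypothetical. The claims require only a previous dataset directory with an earlier dataset time and do not name a fingerprint function.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FileSegment:
    """Hypothetical stand-in for per-segment metadata (claim 13 calls it a DDLookup)."""
    fingerprint: str      # digest of the file segment's data
    data_segment: int     # illustrative id of the data-segment holding that data


@dataclass
class Dataset:
    """One backup dataset: maps a traversal path to the file's segments."""
    time: int                                  # assumed to be derived from the dataset directory name
    files: dict = field(default_factory=dict)  # path -> list of FileSegment


def fingerprint(data: bytes) -> str:
    # SHA-256 is an assumption; the claims do not specify the fingerprint.
    return hashlib.sha256(data).hexdigest()


def previous_dataset(datasets, current):
    """Return the most recent dataset received before `current`, or None."""
    earlier = [d for d in datasets if d.time < current.time]
    return max(earlier, key=lambda d: d.time) if earlier else None


def find_duplicate_segments(datasets, current):
    """List (path, segment index, duplicate data-segment, original data-segment)."""
    prev = previous_dataset(datasets, current)
    if prev is None:
        return []
    matches = []
    for path, segments in current.files.items():
        old_segments = prev.files.get(path)  # file with an identical traversal path
        if old_segments is None:
            continue
        # Compare each file segment with the corresponding segment of the previous file.
        for index, (new, old) in enumerate(zip(segments, old_segments)):
            if new.fingerprint == old.fingerprint:
                matches.append((path, index, new.data_segment, old.data_segment))
    return matches


if __name__ == "__main__":
    monday = Dataset(time=1, files={
        "/etc/hosts": [FileSegment(fingerprint(b"a"), 10),
                       FileSegment(fingerprint(b"b"), 10)],
    })
    tuesday = Dataset(time=2, files={
        "/etc/hosts": [FileSegment(fingerprint(b"a"), 20),
                       FileSegment(fingerprint(b"changed"), 21)],
    })
    print(find_duplicate_segments([monday, tuesday], tuesday))
    # prints [('/etc/hosts', 0, 20, 10)]
```

Because only metadata is compared, the sketch never reads segment data while locating duplicates, which is in line with the reduced disk reads stated in the abstract.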

22. A system configured for data deduplication, the system comprising:
- means for receiving a plurality of backup datasets, each backup dataset comprising a plurality of data blocks;
  
  means for storing metadata in a plurality of metadata disk segments (meta-segment(s));
  
  means for storing the received data blocks in a plurality of data disk segments (data-segment(s));
  
  means for identifying one or more data-segment(s) comprising duplicate data, wherein the duplicate data in a data-segment is identical to data from one or more previous data-segment(s), and for each identified data-segment means for modifying metadata corresponding to duplicate data to correspond to the identical data and releasing the identified data-segment; and
  
  means for updating metadata for each data-segment checked for data deduplication.

23. A computer readable medium for data deduplication, the computer readable medium including program instructions for performing the steps of:
- receiving a plurality of backup datasets, each backup dataset comprising a plurality of data blocks;
  
  storing metadata in a plurality of metadata disk segments (meta-segment(s));
  
  storing the received data blocks in a plurality of data disk segments (data-segment(s));
  
  identifying one or more data-segment(s) comprising duplicate data, wherein the duplicate data in a data-segment is identical to data from one or more previous data-segment(s), and for each identified data-segment modifying metadata corresponding to duplicate data to correspond to the identical data, and releasing the identified data-segment; and
  
  updating metadata for each data-segment checked for data deduplication.
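
The remap-and-release step recited in the independent claims, and elaborated in claims 15 and 20, can be sketched in Python as follows. This is a hedged illustration only: `BlkEntry` borrows its name and field list from claims 4 and 5, but the field names, the in-memory `SegmentStore`, the `spill_id` parameter, and the policy of freeing a segment when its reference count reaches zero are assumptions made for this example. Compression-state handling (claim 16) is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class BlkEntry:
    """Span metadata with the fields listed in claim 4 (field names are illustrative)."""
    segment: int       # data-segment in which the span is stored
    offset: int        # location of the span inside that data-segment
    span_size: int     # size of the span of data
    block_size: int    # size of the data block the span belongs to
    compressed: bool   # compression state of the span


class SegmentStore:
    """Toy in-memory pool of reference-counted data-segments (hypothetical)."""

    def __init__(self):
        self.data = {}   # segment id -> bytearray
        self.refs = {}   # segment id -> reference count

    def allocate(self, seg_id, payload=b""):
        self.data[seg_id] = bytearray(payload)
        self.refs[seg_id] = 1

    def release(self, seg_id):
        # Claim 20: releasing decrements a reference to the disk segment.
        # Freeing the segment at zero is an assumption of this sketch.
        self.refs[seg_id] -= 1
        if self.refs[seg_id] == 0:
            del self.data[seg_id], self.refs[seg_id]


def dedupe_segment(store, seg_id, entries, duplicates, spill_id):
    """Remap or relocate every span stored in `seg_id`, then release it.

    entries    -- list of BlkEntry objects pointing into `seg_id`
    duplicates -- dict: entry index -> (previous segment id, offset) of identical data
    spill_id   -- existing data-segment that receives the non-duplicate spans
    """
    for index, entry in enumerate(entries):
        if index in duplicates:
            # Duplicate data: point the metadata at the identical earlier copy.
            entry.segment, entry.offset = duplicates[index]
        else:
            # Non-duplicate data: copy it elsewhere and update its location.
            payload = store.data[seg_id][entry.offset:entry.offset + entry.span_size]
            new_offset = len(store.data[spill_id])
            store.data[spill_id] += payload
            entry.segment, entry.offset = spill_id, new_offset
    store.release(seg_id)
```

After the call, no metadata refers to the identified data-segment any longer, so its space can be reclaimed without affecting the spans needed for a later restore.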

Current Assignee: Shivarama Narasimha Murthy Upadhyayula
Original Assignee: Shivarama Narasimha Murthy Upadhyayula
Inventors: Upadhyayula, Shivarama Narasimha Murthy
Application Number: US12/190,019
Publication Number: US 20090049260A1
Field of Search
US Class Current

711/162
CPC Class Codes

G06F 11/1453   using de-duplication of the...

G06F 11/1456   Hardware arrangements for b...

G06F 2201/83   the solution involving sign...

G06F 3/0608   Saving storage space on sto...

G06F 3/0641   De-duplication techniques

G06F 3/0686   Libraries, e.g. tape librar...
