Scalable segment-based data de-duplication system and method for incremental backups

US 8,397,080 B2
Filed: 07/29/2010
Issued: 03/12/2013
Est. Priority Date: 07/29/2010
Status: Active Grant

Abstract

A system in accordance with exemplary embodiments may provide a scalable segment-based data de-duplication for incremental backups. In the system, a master device on a secondary-storage node side may receive at least incremental changes, fingerprints, mapping entities, and distribute de-duplication functionality to at least a slave device, and performs data de-duplication on said plurality of segments via a way to cluster a plurality of fingerprints in a data locality unit called container for the incremental changes, varied sampling rates of a plurality of segments by having a fixed sampling rate for stable segments and by assigning a lower sampling rate for a plurality of unstable target files of de-duplication, and a per-segment summary structure to avoid unnecessary I/Os involved in de-duplication.

20 Claims

1. A scalable segment-based data de-duplication system for incremental backups, comprising:
- a master device on a secondary-storage node side that receives at least a plurality of incremental changes, a plurality of fingerprints of a plurality of segments to be de-duplicated, mapping entities from logical block address to physical location;
  
  wherein said master device further includes at least a distributer to distribute at least a de-duplication functionality to at least a slave device on a data node side, and performs data de-duplication on said plurality of segments via a way to cluster a plurality of fingerprints in a range and content-based locality-preserved fingerprint container for said plurality of incremental changes, varied sampling rates for said plurality of segments, and a per-segment summary structure to avoid unnecessary inputs or outputs involved in de-duplication.
  - 2. The system as claimed in claim 1, said system further includes said at least a slave device, and performs parallel de-duplication to leverage computing power of said at least a slave device.
  - 3. The system as claimed in claim 2, wherein each of said at least a slave device further includes:
    - a fingerprint manager for maintaining said fingerprint cache except the replacement of fingerprints; and
      
      a container manager for maintaining said container cache except the replacement of containers.
  - 4. The system as claimed in claim 3, wherein each of said at least a slave device further includes a container updater for writing of a plurality of sub-containers on a local data node.
  - 5. The system as claimed in claim 2, wherein the parallel de-duplication is segment-based de-duplication, and is based on consistent hashing of each of input fingerprints to a data node on said data node side.
  - 6. The system as claimed in claim 1, wherein a plurality of fingerprint containers are stored and retrieved from said secondary-storage node side in a distributed fashion, and a fingerprint cache and a container cache are also distributed and partitioned in different data nodes.
  - 7. The system as claimed in claim 1, wherein said container is based on a plurality of pseudo-physical block addresses to preserve data locality in per-volume logical block address and differentiate block addresses from different volumes.
  - 8. The system as claimed in claim 1, wherein said per-segment summary structure is accomplished by the followings:
    - each of said plurality of segments larger than a threshold having a whole-segment fingerprint; and
      
      matching of said whole-segment fingerprint indicating that de-duplication within the segment is done and no individual block fingerprint within the segment needs to be checked.
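
As an illustration of the per-segment summary in claim 8, the following is a minimal sketch, not the patented implementation. It assumes SHA-1 as the fingerprint function and measures segment size in blocks; the names `SEGMENT_THRESHOLD_BLOCKS`, `fingerprint`, and `blocks_to_check` are hypothetical, and the threshold value is arbitrary because the claim does not state one.

```python
import hashlib

# Assumed threshold: the claim only says "larger than a threshold".
SEGMENT_THRESHOLD_BLOCKS = 4


def fingerprint(data: bytes) -> str:
    """SHA-1 hex digest, used here as an illustrative fingerprint function."""
    return hashlib.sha1(data).hexdigest()


def blocks_to_check(segment_blocks, known_segment_fps):
    """Return the blocks whose individual fingerprints still need a lookup."""
    if len(segment_blocks) > SEGMENT_THRESHOLD_BLOCKS:
        whole_fp = fingerprint(b"".join(segment_blocks))
        if whole_fp in known_segment_fps:
            # Whole-segment match: no per-block fingerprint has to be checked.
            return []
    # Small segment or no whole-segment match: fall back to per-block checks.
    return list(segment_blocks)


# A stored 8-block segment and a copy with its last block modified.
stored = [bytes([i]) * 4096 for i in range(8)]
summary = {fingerprint(b"".join(stored))}
changed = stored[:-1] + [b"\xff" * 4096]

unchanged_work = blocks_to_check(stored, summary)   # empty list
changed_work = blocks_to_check(changed, summary)    # all 8 blocks
```

In this sketch a single set lookup replaces eight per-block lookups when the segment is unchanged, which is the kind of saving the claim attributes to the summary structure.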

9. A scalable segment-based data de-duplication system for incremental backups, comprising:
- a master device on a secondary-storage node side that receives at least a plurality of incremental changes, a plurality of fingerprints of a plurality of segments to be de-duplicated, mapping entities from logical block address to physical location;
  
  wherein said master device further includes at least a distributer to distribute at least a de-duplication functionality to at least a slave device on a data node side, and performs data de-duplication on said plurality of segments via a way to cluster a plurality of fingerprints in a data locality unit called container for said plurality of incremental changes, varied sampling rates for said plurality of segments, and a per-segment summary structure to avoid unnecessary inputs or outputs involved in de-duplication;
  
  wherein said varied sampling rates are based on a least recently used order.

10. A scalable segment-based data de-duplication system for incremental backups, comprising:
- a master device on a secondary-storage node side that receives at least a plurality of incremental changes, a plurality of fingerprints of a plurality of segments to be de-duplicated, mapping entities from logical block address to physical location;
  
  wherein said master device further includes at least a distributer to distribute at least a de-duplication functionality to at least a slave device on a data node side, and performs data de-duplication on said plurality of segments via a way to cluster a plurality of fingerprints in a data locality unit called container for said plurality of incremental changes, varied sampling rates for said plurality of segments, and a per-segment summary structure to avoid unnecessary inputs or outputs involved in de-duplication, and wherein said at least a distributer further includes;
  
  a fingerprint distributer that hashes each of said plurality of fingerprints for determining a destination data node, and waits for a response from said destination data node; and
  
  a container distributer that stripes a conceptual container as a plurality of sub-containers using a consistent hashing so that container stripes, a fingerprint cache and a container cache reside on a same data node.
  - 11. The system as claimed in claim 10, wherein said conceptual container is striped into N files, N is the number of data nodes on said data node side, and each of N data nodes is responsible to store one stripe of said conceptual container, and one stripe of said conceptual container is denoted as a sub-container.
  - 12. The system as claimed in claim 10, wherein said master device further includes a replacement engine to store at least a least recently used list for a fingerprint cache and a container cache to replace containers and fingerprints centrally on said master device.
  - 13. The system as claimed in claim 10, wherein said master device further includes a mapping updater to map at least a metadata from logical block address to physical location of a plurality of blocks based on output of the de-duplication.
  - 14. The system as claimed in claim 10, wherein said master device further includes a unit detector to detect said conceptual container of the input fingerprints on the fly or recognize the boundary of a de-duplication unit over time.
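
The fingerprint distributer and container distributer of claims 10 and 11 can be pictured with a generic consistent-hash ring. The sketch below is one common way to build such a ring and is not taken from the patent: the `HashRing` class, the `stripe_container` helper, the virtual-node count, and the node names are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, fp: str) -> str:
        """Return the data node responsible for a fingerprint."""
        idx = bisect.bisect_right(self._points, _hash(fp)) % len(self._ring)
        return self._ring[idx][1]


def stripe_container(ring, container_fps):
    """Split one conceptual container into per-node sub-containers."""
    stripes = {}
    for fp in container_fps:
        stripes.setdefault(ring.node_for(fp), []).append(fp)
    return stripes


ring = HashRing(["node-a", "node-b", "node-c"])
fps = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(200)]
stripes = stripe_container(ring, fps)
```

Because the same hash decides both where a fingerprint is looked up and where its stripe is stored, each sub-container lands on the node that will also hold the matching cache entries, which is the co-location property claim 10 describes. A node that receives no fingerprints simply has no stripe in this sketch.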

15. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
- receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
  
  clustering said plurality of fingerprints in a range and content-based locality-preserved fingerprint container for the incremental changes;
  
  assigning varied sampling rates for said plurality of segments; and
  
  constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication.
  - 16. The method as claimed in claim 15, wherein said method further includes:
    - performing parallel de-duplication by at least a distributer of said master device to leverage computing power of at least a participating node on a data node side.
  - 17. The method as claimed in claim 16, wherein said parallel de-duplication further includes:
    - computing a plurality of fingerprints of a plurality of incremental changed blocks using a cryptographic hashing;
      
      distributing said plurality of fingerprints of said plurality of incremental changed blocks based on a consistent hashing to store in a sampled fingerprint index cache and a plurality of sub-containers;
      
      distributing a per-container segment index based on said consistent hashing to at least a participating node; and
      
      distributing said per-segment summary structure based on said consistent hashing to said at least a participating node.
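
The first step of claim 17, computing fingerprints of the incrementally changed blocks with a cryptographic hash, might look like the sketch below. SHA-1, the 4 KiB block size, and the `fingerprint_changed_blocks` name are assumptions; the claim names no specific hash function or block size.

```python
import hashlib


def fingerprint_changed_blocks(changed_blocks):
    """Map each logical block address (LBA) to the SHA-1 of its new content."""
    return {
        lba: hashlib.sha1(data).hexdigest()
        for lba, data in changed_blocks.items()
    }


# Three changed blocks; LBAs 100 and 205 carry identical content.
changes = {100: b"A" * 4096, 101: b"B" * 4096, 205: b"A" * 4096}
block_fps = fingerprint_changed_blocks(changes)

# Blocks with equal fingerprints are de-duplication candidates.
duplicates = len(block_fps) - len(set(block_fps.values()))
```

The resulting fingerprints are what the later steps of the claim distribute across participating nodes.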

18. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
- receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
  
  clustering said plurality of fingerprints in a data locality unit called container for the incremental changes;
  
  assigning varied sampling rates for said plurality of segments; and
  
  constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication;
  
  wherein said method further includes;
  
  partitioning said plurality of incremental changes of a plurality of blocks as said plurality of input segments;
  
  querying an in-memory sampled fingerprint index for each fingerprint in each of said plurality of input segments, and returning with a pair of container ID and segment ID;
  
  when there is a stored segment the same as the input segment, then querying an in-memory segment-based summary with said pair of container ID and segment ID to determine if a corresponding container is fetched in, otherwise, query an in-memory container store cache to determine if said corresponding container is cached; and
  
  loading said corresponding container from an on-disk container store when it is not cached, and querying a per-container segment index for an offset of a particular fingerprint routed to said corresponding container for retrieving the segment information of said particular fingerprint.
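
The lookup sequence in claim 18 can be read as a chain of progressively more expensive checks. The following is a simplified sketch of one such reading, not the patented method: the `Store` class and its field names are hypothetical, plain dictionaries stand in for the in-memory and on-disk structures, and a fingerprint missing from the sampled index is simply treated as new.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    # fingerprint -> (container ID, segment ID); in-memory sampled index
    sampled_index: dict = field(default_factory=dict)
    # (container ID, segment ID) -> whole-segment fingerprint; in-memory summary
    segment_summary: dict = field(default_factory=dict)
    # container ID -> {fingerprint: offset}; in-memory container cache
    container_cache: dict = field(default_factory=dict)
    # container ID -> {fingerprint: offset}; stands in for the on-disk store
    on_disk: dict = field(default_factory=dict)
    disk_loads: int = 0

    def lookup(self, fp, segment_fp):
        """Return (container ID, segment ID, offset) or None if fp looks new."""
        hit = self.sampled_index.get(fp)
        if hit is None:
            return None  # simplification: unsampled fingerprints count as new
        container_id, segment_id = hit
        if self.segment_summary.get(hit) == segment_fp:
            # Stored segment equals the input segment: no container I/O needed.
            return (container_id, segment_id, None)
        if container_id not in self.container_cache:
            # Cache miss: load the container from the on-disk store.
            self.container_cache[container_id] = self.on_disk[container_id]
            self.disk_loads += 1
        offset = self.container_cache[container_id].get(fp)
        return (container_id, segment_id, offset)


store = Store(
    sampled_index={"fp1": (7, 0)},
    segment_summary={(7, 0): "seg-A"},
    on_disk={7: {"fp1": 128}},
)
same_segment = store.lookup("fp1", "seg-A")    # answered from the summary
other_segment = store.lookup("fp1", "seg-B")   # forces one container load
```

Only the second lookup touches the on-disk store, and a repeated lookup against the same container is then served from the cache.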

19. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
- receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
  
  clustering said plurality of fingerprints in a data locality unit called container for the incremental changes;
  
  assigning varied sampling rates for said plurality of segments; and
  
  constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication;
  
  wherein a basic sharing unit (BSU) segment is defined as a store segment that does not get modified after its creation, and said varied sampling rates of said plurality of input segments are accomplished by sampling only one fingerprint of each of BSU segments of said plurality of input segments, and assigning different sampling rates for other non-BSU segments.
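
The varied sampling in claim 19 can be sketched as follows. This is an assumption-laden illustration: the `sample_fingerprints` name, the choice of the first fingerprint as the single BSU sample, and the 1-in-4 default rate for non-BSU segments are not specified by the claim.

```python
def sample_fingerprints(segment_fps, is_bsu, non_bsu_rate=4):
    """Pick the fingerprints of one segment that go into the sampled index."""
    if not segment_fps:
        return []
    if is_bsu:
        # A BSU segment never changes after creation, so one sample suffices.
        return [segment_fps[0]]
    # Non-BSU segment: keep every `non_bsu_rate`-th fingerprint (assumed rate).
    return segment_fps[::non_bsu_rate]


segment = [f"fp{i}" for i in range(8)]
bsu_sample = sample_fingerprints(segment, is_bsu=True)
non_bsu_sample = sample_fingerprints(segment, is_bsu=False)
```

Sampling a single fingerprint per BSU segment keeps the in-memory index small for data that is known to be stable, while non-BSU segments can be given whatever rate suits their expected churn.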

20. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
- receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
  
  clustering said plurality of fingerprints in a data locality unit called container for the incremental changes;
  
  assigning varied sampling rates for said plurality of segments; and
  
  constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication;
  
  wherein said method further includes;
  
  performing parallel de-duplication by at least a distributer of said master device to leverage computing power of at least a participating node on a data node side;
  
  wherein said parallel de-duplication further includes;
  
  computing a plurality of fingerprints of a plurality of incremental changed blocks using a cryptographic hashing;
  
  distributing said plurality of fingerprints of said plurality of incremental changed blocks based on a consistent hashing to store in a sampled fingerprint index cache and a plurality of sub-containers;
  
  distributing a per-container segment index based on said consistent hashing to at least a participating node; and
  
  distributing said per-segment summary structure based on said consistent hashing to said at least a participating node; and
  
  wherein said per-segment summary structure is distributed and partitioned to said at least a participating node on said data node side, and said plurality of sub-containers are stored in said at least a participating node.

Current Assignee: Industrial Technology Research Institute
Original Assignee: Industrial Technology Research Institute
Inventors: Lu, Maohua; Chiueh, Tzi-Cker
Primary Examiner(s): Smithers, Matthew
Application Number: US12/846,817
Publication Number: US 20120030477A1
Time in Patent Office: 957 Days
Field of Search: 713/189
US Class Current: 713/189
CPC Class Codes:
G06F 11/1453 using de-duplication of the...
G06F 11/1458 Management of the backup or...
