Scalable segment-based data de-duplication system and method for incremental backups
Abstract
A system in accordance with exemplary embodiments may provide scalable segment-based data de-duplication for incremental backups. In the system, a master device on a secondary-storage node side may receive at least incremental changes, fingerprints, and mapping entities, and may distribute de-duplication functionality to at least a slave device. The master device performs data de-duplication on a plurality of segments by clustering a plurality of fingerprints in a data locality unit called a container for the incremental changes; by using varied sampling rates for the segments, with a fixed sampling rate for stable segments and a lower sampling rate for a plurality of unstable target files of de-duplication; and by using a per-segment summary structure to avoid unnecessary I/Os involved in de-duplication.
20 Claims
1. A scalable segment-based data de-duplication system for incremental backups, comprising:
a master device on a secondary-storage node side that receives at least a plurality of incremental changes, a plurality of fingerprints of a plurality of segments to be de-duplicated, mapping entities from logical block address to physical location;
wherein said master device further includes at least a distributer to distribute at least a de-duplication functionality to at least a slave device on a data node side, and performs data de-duplication on said plurality of segments via a way to cluster a plurality of fingerprints in a range and content-based locality-preserved fingerprint container for said plurality of incremental changes, varied sampling rates for said plurality of segments, and a per-segment summary structure to avoid unnecessary inputs or outputs involved in de-duplication.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A scalable segment-based data de-duplication system for incremental backups, comprising:
a master device on a secondary-storage node side that receives at least a plurality of incremental changes, a plurality of fingerprints of a plurality of segments to be de-duplicated, mapping entities from logical block address to physical location;
wherein said master device further includes at least a distributer to distribute at least a de-duplication functionality to at least a slave device on a data node side, and performs data de-duplication on said plurality of segments via a way to cluster a plurality of fingerprints in a data locality unit called container for said plurality of incremental changes, varied sampling rates for said plurality of segments, and a per-segment summary structure to avoid unnecessary inputs or outputs involved in de-duplication;
wherein said varied sampling rates are based on a least recently used order.
10. A scalable segment-based data de-duplication system for incremental backups, comprising:
a master device on a secondary-storage node side that receives at least a plurality of incremental changes, a plurality of fingerprints of a plurality of segments to be de-duplicated, mapping entities from logical block address to physical location;
wherein said master device further includes at least a distributer to distribute at least a de-duplication functionality to at least a slave device on a data node side, and performs data de-duplication on said plurality of segments via a way to cluster a plurality of fingerprints in a data locality unit called container for said plurality of incremental changes, varied sampling rates for said plurality of segments, and a per-segment summary structure to avoid unnecessary inputs or outputs involved in de-duplication,
and wherein said at least a distributer further includes:
a fingerprint distributer that hashes each of said plurality of fingerprints for determining a destination data node, and waits for a response from said destination data node; and
a container distributer that stripes a conceptual container as a plurality of sub-containers using a consistent hashing so that container stripes, a fingerprint cache and a container cache reside on a same data node.
Dependent claims: 11, 12, 13, 14
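The fingerprint and container distributers recited in claim 10 can be illustrated with a minimal consistent-hashing sketch. It is not the patented implementation: the `Ring` class, the `stripe_container` helper, the node names, the use of SHA-1 for ring placement, and the virtual-node count are all assumptions made for illustration. The point it shows is that routing a fingerprint and striping a container with the same ring places a fingerprint's cache entry and its sub-container on the same data node.

```python
import bisect
import hashlib


def _ring_point(key: bytes) -> int:
    """Map a key to a position on the ring (SHA-1 is only a stand-in here)."""
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")


class Ring:
    """Minimal consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns `vnodes` points on the ring to even out the load.
        self._points = sorted(
            (_ring_point(f"{node}#{i}".encode()), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def node_for(self, fingerprint: bytes) -> str:
        # Destination is the first ring point clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._keys, _ring_point(fingerprint)) % len(self._keys)
        return self._points[idx][1]


def stripe_container(ring, fingerprints):
    """Split one conceptual container into per-node sub-containers."""
    sub_containers = {}
    for fp in fingerprints:
        sub_containers.setdefault(ring.node_for(fp), []).append(fp)
    return sub_containers


# Usage: the same ring decides both the fingerprint's destination and its stripe.
ring = Ring(["node-a", "node-b", "node-c"])
container = [hashlib.sha1(bytes([i])).digest() for i in range(100)]
stripes = stripe_container(ring, container)
```

A property worth noting in this sketch is that removing a node only remaps the fingerprints that node owned; the others keep their destination, which is the usual reason for preferring consistent hashing over a plain modulo.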
15. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
clustering said plurality of fingerprints in a range and content-based locality-preserved fingerprint container for the incremental changes;
assigning varied sampling rates for said plurality of segments; and
constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication.
Dependent claims: 16, 17
18. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
clustering said plurality of fingerprints in a data locality unit called container for the incremental changes;
assigning varied sampling rates for said plurality of segments; and
constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication;
wherein said method further includes:
partitioning said plurality of incremental changes of a plurality of blocks as said plurality of input segments;
querying an in-memory sampled fingerprint index for each fingerprint in each of said plurality of input segments, and returning with a pair of container ID and segment ID;
when there is a stored segment the same as the input segment, then querying an in-memory segment-based summary with said pair of container ID and segment ID to determine if a corresponding container is fetched in, otherwise, query an in-memory container store cache to determine if said corresponding container is cached; and
loading said corresponding container from an on-disk container store when it is not cached, and querying a per-container segment index for an offset of a particular fingerprint routed to said corresponding container for retrieving the segment information of said particular fingerprint.
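The lookup path recited in claim 18 can be sketched as follows. This is one possible reading of the claim, not the patented implementation: the `Container` and `DedupStore` names, the dictionary standing in for the on-disk container store, the unbounded cache with no eviction, and the `disk_reads` counter are all assumptions added for illustration. The sketch shows how the per-segment summary and the container cache are consulted before any disk read is issued.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    container_id: int
    # Per-container segment index: fingerprint -> offset inside the container.
    segment_index: dict = field(default_factory=dict)


class DedupStore:
    """Hypothetical in-memory model of the claim 18 lookup path."""

    def __init__(self, on_disk):
        self.on_disk = on_disk    # container_id -> Container (stand-in for disk)
        self.sampled_index = {}   # fingerprint -> (container_id, segment_id)
        self.summary = set()      # (container_id, segment_id) pairs already fetched in
        self.cache = {}           # container_id -> Container (no eviction in this sketch)
        self.disk_reads = 0       # counts simulated on-disk container loads

    def lookup(self, fingerprint):
        """Return the stored offset for a fingerprint, or None if it is not sampled."""
        pair = self.sampled_index.get(fingerprint)
        if pair is None:
            return None
        container_id, _segment_id = pair
        if pair in self.summary:
            # The summary says this segment's container was already fetched in.
            container = self.cache[container_id]
        else:
            container = self.cache.get(container_id)
            if container is None:
                # Not cached: load from the (simulated) on-disk container store.
                container = self.on_disk[container_id]
                self.cache[container_id] = container
                self.disk_reads += 1
            self.summary.add(pair)
        return container.segment_index.get(fingerprint)


# Usage: the second lookup of the same fingerprint triggers no further disk read.
store = DedupStore({7: Container(7, {b"fp1": 0, b"fp2": 4096})})
store.sampled_index[b"fp1"] = (7, 0)
first = store.lookup(b"fp1")
second = store.lookup(b"fp1")
```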
19. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
clustering said plurality of fingerprints in a data locality unit called container for the incremental changes;
assigning varied sampling rates for said plurality of segments; and
constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication;
wherein a basic sharing unit (BSU) segment is defined as a store segment that does not get modified after its creation, and said varied sampling rates of said plurality of input segments are accomplished by sampling only one fingerprint of each of BSU segments of said plurality of input segments, and assigning different sampling rates for other non-BSU segments.
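The varied sampling in claim 19 can be sketched in a few lines. The claim states only that one fingerprint is sampled per BSU segment and that non-BSU segments receive different rates; the choice of the first fingerprint for a BSU segment and the "every n-th fingerprint" policy for non-BSU segments below are assumptions for illustration, as is the function name `sample_fingerprints`.

```python
def sample_fingerprints(segment_fps, is_bsu, non_bsu_rate=4):
    """Pick the fingerprints of one segment that go into the sampled index.

    BSU segment: exactly one fingerprint (the first one, an arbitrary choice).
    Non-BSU segment: every `non_bsu_rate`-th fingerprint (hypothetical policy).
    """
    if not segment_fps:
        return []
    if is_bsu:
        return [segment_fps[0]]
    return segment_fps[::non_bsu_rate]


# Usage: a stable (BSU) segment costs one index entry regardless of its size.
stable_sample = sample_fingerprints(list(range(10)), is_bsu=True)
unstable_sample = sample_fingerprints(list(range(10)), is_bsu=False, non_bsu_rate=4)
```

Because a BSU segment never changes after creation, a hit on its single sampled fingerprint is enough to locate the whole segment, which is why it can be indexed so sparsely.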
20. A scalable segment-based data de-duplication method for incremental backups, executed by a master device on a secondary-storage node side, and comprising:
receiving at least a plurality of incremental changes, a plurality of fingerprints of a plurality of input segments to be de-duplicated, mapping entities from logical block address to physical location;
clustering said plurality of fingerprints in a data locality unit called container for the incremental changes;
assigning varied sampling rates for said plurality of segments; and
constructing a per-segment summary structure to avoid unnecessary inputs or outputs involved in the data de-duplication;
wherein said method further includes:
performing parallel de-duplication by at least a distributer of said master device to leverage computing power of at least a participating node on a data node side;
wherein said parallel de-duplication further includes:
computing a plurality of fingerprints of a plurality of incremental changed blocks using a cryptographic hashing;
distributing said plurality of fingerprints of said plurality of incremental changed blocks based on a consistent hashing to store in a sampled fingerprint index cache and a plurality of sub-containers;
distributing a per-container segment index based on said consistent hashing to at least a participating node; and
distributing said per-segment summary structure based on said consistent hashing to said at least a participating node; and
wherein said per-segment summary structure is distributed and partitioned to said at least a participating node on said data node side, and said plurality of sub-containers are stored in said at least a participating node.
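The fingerprint computation step of claim 20 can be sketched as below. The claim says only "a cryptographic hashing"; SHA-256, the 4 KiB block size, the logical-block-address keys, and the function names are assumptions chosen for illustration. The sketch shows how identical changed blocks produce identical fingerprints, which is what makes them candidates for de-duplication before any distribution to participating nodes.

```python
import hashlib


def fingerprint_changed_blocks(changed_blocks):
    """Map each changed block's logical block address to its fingerprint.

    `changed_blocks` is a dict of LBA -> block bytes (hypothetical input shape).
    """
    return {
        lba: hashlib.sha256(data).hexdigest()
        for lba, data in changed_blocks.items()
    }


def group_duplicates(fingerprints):
    """Return fingerprint -> list of LBAs for fingerprints seen more than once."""
    groups = {}
    for lba, fp in fingerprints.items():
        groups.setdefault(fp, []).append(lba)
    return {fp: lbas for fp, lbas in groups.items() if len(lbas) > 1}


# Usage: blocks at LBA 0 and LBA 16 hold the same content and share a fingerprint.
blocks = {0: b"a" * 4096, 8: b"b" * 4096, 16: b"a" * 4096}
fingerprints = fingerprint_changed_blocks(blocks)
duplicates = group_duplicates(fingerprints)
```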
Specification