Probabilistic summary data structure based encoding for garbage collection in backup systems
First Claim
Patent Images
1. A backup system for performing garbage collection of an allocated address space in a set of one or more storage devices comprising:
- a set of one or more active storage trees each that are maintained within the set of storage devices, the set of active storage trees representing a version of multiple data, each leaf node of the set of active storage trees representing a block of data from the multiple data that has been backed up onto the set of storage devices; and
a garbage collection logic, executing on a processor, responsive to deletion of a version of the multiple data, to generate a probabilistic summary data structure, wherein the probabilistic summary data structure represents blocks of data that are referenced by the set of active storage trees, wherein the garbage collection logic includes an encoder to generate the probabilistic summary data structure, and wherein the encoder includes a hash module to perform a first plurality of hash functions on the referenced blocks to generate a set of offset values and to set bits within the probabilistic summary data structure that are at offsets equal to the set of offset values generated by the plurality of hash functions.
12 Assignments
0 Petitions
Accused Products
Abstract
A method and apparatus for different embodiments of probabilistic summary data structure based encoding for garbage collection are described. In one embodiment, a method comprises generating a probabilistic summary data structure that represents active blocks of data within a storage device based on identifications of the active blocks or the data within the active blocks. The method also includes performing garbage collection of at least a portion of the storage device based on the probabilistic summary data structure.
242 Citations
38 Claims
-
1. A backup system for performing garbage collection of an allocated address space in a set of one or more storage devices comprising:
-
a set of one or more active storage trees each that are maintained within the set of storage devices, the set of active storage trees representing a version of multiple data, each leaf node of the set of active storage trees representing a block of data from the multiple data that has been backed up onto the set of storage devices; and a garbage collection logic, executing on a processor, responsive to deletion of a version of the multiple data, to generate a probabilistic summary data structure, wherein the probabilistic summary data structure represents blocks of data that are referenced by the set of active storage trees, wherein the garbage collection logic includes an encoder to generate the probabilistic summary data structure, and wherein the encoder includes a hash module to perform a first plurality of hash functions on the referenced blocks to generate a set of offset values and to set bits within the probabilistic summary data structure that are at offsets equal to the set of offset values generated by the plurality of hash functions. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
-
-
14. A backup system for performing garbage collection of an allocated address space in a set of one or more storage devices comprising:
-
a set of one or more active storage trees each representing a version of multiple data, each leaf node of the set of active storage trees representing a block of data from the multiple data that has been backed up onto the set of storage devices; and a garbage collection logic, executing on a processor, responsive to deletion of a version of the multiple data, to walk at least a portion of the set of active storage trees and record in a probabilistic summary data structure based on a Bloom filter which of the blocks of data identified by the part of the set of active storage trees that was walked are still referenced within the multiple data, wherein the garbage collection logic includes an encoder for generating the probabilistic summary data structure based on a Bloom filter, and wherein the encoder includes a Bloom filter to perform a first set of one or more hash functions on the still referenced blocks to generate a set of offset values and to set bits within the probabilistic summary data structure that are at offsets equal to the set of offset values generated by the first set of hash functions. - View Dependent Claims (15, 16, 17, 18)
-
-
19. An apparatus for performing garbage collection of a range of addresses within an allocated address space of a set of one or more storage devices comprising:
a backup system on a computer to back up a file system, said backup system comprising; a backup logic to generate and maintain a plurality of active storage trees, each active storage tree comprising a plurality of leaf nodes each representing blocks of data stored in the set of storage devices; a determination module to determine which range of addresses is to be cleaned within the allocated address space; a first encoder to encode blocks of data within the range of addresses into a first probabilistic summary data structure, wherein the first encoder includes a first Bloom filter to perform a first set of hash functions on each block of data within the range of addresses to be cleaned to generate a set of offset values and to set bits within the first probabilistic summary data structure that are at offsets equal to the set of offset values generated by the first set of hash functions; a locator to identify blocks of data that are referenced by the plurality of active storage trees by walking the plurality of active storage trees; an evaluator to determine blocks of data that are both referenced by the set of active storage trees and within the range of addresses to be cleaned based on the first probabilistic summary data structure; a second encoder to encode the blocks of data that are both referenced and within the range of addresses to be cleaned into a second probabilistic summary data structure, wherein the second encoder includes a second Bloom filter to perform a second set of one or more hash functions on each of the referenced blocks of data that are both referenced and within the range of addresses to be cleaned to generate a set of offset values and to set bits in the second probabilistic summary data structure that are at offsets equal to the set of offset values generated by the second set of hash functions; and a collector to clean the range of addresses within the allocated address space based on the second probabilistic summary data structure to mark as unallocated blocks of data that are no longer referenced. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
-
30. An apparatus comprising:
a backup system on a computer to back up a file system, said backup system including, a backup logic to generate a set of trees each representing backup snapshots of said file system at different times by recording references to blocks of backed up data stored in a set of one or more storage devices; and a garbage collection logic coupled to access said set of trees to at least approximate garbage collection of unreferenced ones of said blocks of data by tracking, with a Bloom filter, unreferenced ones of said blocks, wherein the garbage collection logic is to generate a summary data structure based on a tracking of the unreferenced ones of said blocks, wherein a size of the summary data structure is less than a size of a local memory of the backup system, wherein the garbage collection logic includes a hash module to perform a first plurality of hash functions on different ones of the blocks to generate a set of offset values and to set bits within the probabilistic summary data structure that are at offsets equal to the set of offset values generated by the plurality of hash functions. - View Dependent Claims (31)
-
32. An apparatus for performing garbage collection in a range of addresses to be cleaned in an allocated address space of one or more storage devices, wherein the allocated address space has stored therein blocks of data that are referenced by a set of one or more active storage trees and blocks of data that are no longer referenced by the set of one or more active storage trees comprising:
-
a backup logic to generate the set of one or more active storage trees, each storage tree comprising one or more leaf nodes representing blocks of data stored within the one or more storage devices on a computer; and a first encoder to determine the range of addresses to be cleaned and to encode blocks of data within the range of addresses to be cleaned into a first probabilistic summary data structure, wherein the first encoder includes, a hash module to perform a first set of one or more hash functions on each block of data within the range of addresses to be cleaned, wherein each of the first set of one or more hash functions generates a set of one or more offset values, and a write logic to set bits within the first probabilistic summary data structure that are at offsets equal to the set of one or more offset values generated by the first set of one or more hash functions; a locator to identify blocks of data that are referenced by the set of one or more storage trees by walking the set of one or more active storage trees; an evaluator to determine blocks of data that are both referenced by the set of one or more storage trees and within the range of addresses to be cleaned based on the first probabilistic summary data structure; a second encoder to encode those blocks of data that are both referenced and within the range of addresses to be cleaned into a second probabilistic summary data structure, wherein the second encoder includes, a hash module to perform a second set of one or more hash functions on each of the blocks of data that are both referenced and within the range of addresses to be cleaned, wherein each of the second set of one or more hash functions generates a set of one or more offset values; and a write logic to set bits in the second probabilistic summary data structure that are at offsets equal to the set of one or more offset values generated by the second set of one or more hash functions; and a collector to clean the range of addresses to be cleaned within the allocated address space based on the second probabilistic summary data structure to mark as unallocated those blocks of data that are no longer referenced. - View Dependent Claims (33, 34, 35, 36, 37, 38)
-
Specification