System and method for storing and accessing data using a plurality of probabilistic data structures
First Claim
1. A computer-implemented method for storing and accessing data using a plurality of probabilistic data structures, the method comprising:
- identifying, by a processor, a dataset comprising a plurality of values stored on a storage device;
assembling, by a processor, a plurality of probabilistic data structures, each probabilistic data structure being associated with at least one other probabilistic data structure in the plurality of probabilistic data structures, wherein the plurality of probabilistic data structures are associated to form a tree structure comprising a plurality of ordered levels, each level being associated with at least one of the plurality of probabilistic data structures, wherein each probabilistic data structure corresponds to a portion of the storage device storing the dataset;
generating, by the processor, a plurality of keys corresponding to the plurality of values;
inserting, by the processor, each key of the plurality of keys into each probabilistic data structure that corresponds to the portion of the storage device storing the value the key was generated from;
storing, by the processor, the plurality of probabilistic data structures in memory; and
querying, by the processor, a first probabilistic data structure of the plurality of probabilistic data structures with an item to determine whether the item exists in the first probabilistic data structure, and if the item exists in the first probabilistic data structure, then querying, by the processor, a second probabilistic data structure of the plurality of probabilistic data structures with the item to determine whether the item exists in the second probabilistic data structure.
1 Assignment
0 Petitions
Accused Products
Abstract
A system and method are described for storing and accessing data using a plurality of probabilistic data structures. In one embodiment, a plurality of probabilistic data structures are identified, each probabilistic data structure being associated with at least one other probabilistic data structure. The plurality of probabilistic data structures each correspond to a portion of a storage device storing a dataset. The dataset may include a plurality of values, such as a plurality of data blocks. A plurality of keys may be generated from the plurality of values, such as a plurality of fingerprints. Each key may be inserted into the probabilistic data structures which correspond to the portion of the storage device storing the value the key was generated from. The plurality of probabilistic data structures are stored in a memory and may be queried with an item to determine if the item exists in the plurality of probabilistic data structures.
129 Citations
29 Claims
-
1. A computer-implemented method for storing and accessing data using a plurality of probabilistic data structures, the method comprising:
-
identifying, by a processor, a dataset comprising a plurality of values stored on a storage device; assembling, by a processor, a plurality of probabilistic data structures, each probabilistic data structure being associated with at least one other probabilistic data structure in the plurality of probabilistic data structures, wherein the plurality of probabilistic data structures are associated to form a tree structure comprising a plurality of ordered levels, each level being associated with at least one of the plurality of probabilistic data structures, wherein each probabilistic data structure corresponds to a portion of the storage device storing the dataset; generating, by the processor, a plurality of keys corresponding to the plurality of values; inserting, by the processor, each key of the plurality of keys into each probabilistic data structure that corresponds to the portion of the storage device storing the value the key was generated from; storing, by the processor, the plurality of probabilistic data structures in memory; and querying, by the processor, a first probabilistic data structure of the plurality of probabilistic data structures with an item to determine whether the item exists in the first probabilistic data structure, and if the item exists in the first probabilistic data structure, then querying, by the processor, a second probabilistic data structure of the plurality of probabilistic data structures with the item to determine whether the item exists in the second probabilistic data structure. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 26, 27)
-
-
13. A computer-implemented method for performing a deduplication operation using a plurality of probabilistic data structures to represent a fingerprint database, the method comprising:
-
identifying, by a processor, a plurality of probabilistic data structures representing a fingerprint database comprising of a plurality of fingerprints and a plurality of offsets, the plurality of offsets identifying a location of a plurality of data blocks from which the plurality of fingerprints were generated, each probabilistic data structure being associated with at least one other probabilistic data structure in the plurality of probabilistic data structures, wherein the plurality of probabilistic data structures are associated to form a tree structure comprising a plurality of ordered levels, each level being associated with at least one of the plurality of probabilistic data structures, wherein each probabilistic data structure corresponds to a portion of a storage device storing the plurality of data blocks; generating, by the processor, a fingerprint of a received data block to be deduplicated; querying, by the processor, a first probabilistic data structure of the plurality of probabilistic data structures with the generated fingerprint to determine if the generated fingerprint exists in the first probabilistic data structure, and if the item exists in the first probabilistic data structure, then querying, by the processor, a second probabilistic data structure of the plurality of probabilistic data structures with the generated fingerprint to determine whether the generated fingerprint exists in the second probabilistic data structure; receiving, by the processor, an offset identifying a location of a stored data block from the plurality of probabilistic data structures, if the generated fingerprint exists in the plurality of probabilistic data structures; retrieving, by the processor, the stored data block from the location identified by the offset; comparing, by the processor, the stored data block with the received data block to determine if the received data block is a duplicate of the stored data block; and deduplicating, by the processor, the received data block if the received data block is a duplicate of the stored data block. - View Dependent Claims (14, 15, 16, 17)
-
-
18. A system for storing and accessing data using a plurality of probabilistic data structures, the system comprising:
-
a storage device operative to store a dataset comprising of a plurality of values; a memory operative to store a plurality of probabilistic data structures, each probabilistic data structure being associated with at least one other probabilistic data structure in the plurality of probabilistic data structures, wherein the plurality of probabilistic data structures are associated to form a tree structure comprising a plurality of ordered levels, each level being associated with at least one of the plurality of probabilistic data structures, wherein each probabilistic data structure corresponds to a portion of the storage device storing the dataset; and a processor operatively connected to the memory and operative to; generate a plurality of keys corresponding to the plurality of values; insert each key of the plurality of keys into each probabilistic data structure that corresponds to the portion of the storage device storing the value the key was generated from; store the plurality of probabilistic data structures in the memory; and query a first probabilistic data structure of the plurality of probabilistic data structures with an item to determine whether the item exists in the first probabilistic data structure, and if the item exists in the first probabilistic data structure, then query the second probabilistic data structure of the plurality of probabilistic data structures with the item to determine whether the item exists in the second probabilistic data structure. - View Dependent Claims (19, 20, 21, 22, 23, 24, 25)
-
-
28. A computer-implemented method for storing and accessing data using a plurality of probabilistic data structures, the method comprising:
-
identifying, by a processor, a dataset comprising a plurality of values stored on a storage device; assembling, by a processor, a plurality of probabilistic data structures, each probabilistic data structure being associated with at least one other probabilistic data structure in the plurality of probabilistic data structures, wherein the plurality of probabilistic data structures are associated to form a tree structure comprising a plurality of ordered levels, each level being associated with at least one of the plurality of probabilistic data structures, wherein each probabilistic data structure corresponds to a portion of the storage device storing the dataset, wherein ordering of the probabilistic data structures corresponds to ordering of the values; generating, by the processor, a plurality of keys corresponding to the plurality of values; inserting, by the processor, each key of the plurality of keys into each probabilistic data structure that corresponds to the portion of the storage device storing the value the key was generated from; storing, by the processor, the plurality of probabilistic data structures in memory; and querying, by the processor, the plurality of probabilistic data structures with an item to determine whether the item exists in the first probabilistic data structure. - View Dependent Claims (29)
-
Specification