Systems and methods for byte-level or quasi byte-level single instancing

US 9,158,787 B2
Filed: 05/13/2014
Issued: 10/13/2015
Est. Priority Date: 11/26/2008
Status: Active Grant

Abstract

Described in detail herein are systems and methods for deduplicating data using byte-level or quasi byte-level techniques. In some embodiments, a file is divided into multiple blocks. A block includes multiple bytes. Multiple rolling hashes of the file are generated. For each byte in the file, a searchable data structure is accessed to determine if the data structure already includes an entry matching a hash of a minimum sequence length. If so, this indicates that the corresponding bytes are already stored. If one or more bytes in the file are already stored, then the one or more bytes in the file are replaced with a reference to the already stored bytes. The systems and methods described herein may be used for file systems, databases, storing backup data, or any other use case where it may be useful to reduce the amount of data being stored.
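
The flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the polynomial rolling hash, the `MIN_LEN` value, the dictionary used as the searchable data structure, and the `("lit", ...)` / `("ref", ...)` token format are all assumptions made for illustration.

```python
MIN_LEN = 4                      # hypothetical minimum sequence length
BASE, MOD = 257, (1 << 61) - 1   # hypothetical rolling-hash parameters


def rolling_hashes(data: bytes, n: int):
    """Yield (offset, hash) for every n-byte window of data."""
    if len(data) < n:
        return
    top = pow(BASE, n - 1, MOD)          # weight of the byte leaving the window
    h = 0
    for b in data[:n]:
        h = (h * BASE + b) % MOD
    yield 0, h
    for i in range(n, len(data)):
        h = ((h - data[i - n] * top) * BASE + data[i]) % MOD
        yield i - n + 1, h


def build_index(stored: bytes, n: int = MIN_LEN) -> dict:
    """Map window hash -> first offset in the already stored data."""
    index = {}
    for off, h in rolling_hashes(stored, n):
        index.setdefault(h, off)
    return index


def dedupe(new: bytes, stored: bytes, index: dict, n: int = MIN_LEN) -> list:
    """Replace byte runs already present in `stored` with references."""
    hashes = dict(rolling_hashes(new, n))        # offset -> window hash
    out, lit, i = [], bytearray(), 0
    while i < len(new):
        off = index.get(hashes.get(i))
        # Verify the bytes to rule out a hash collision.
        if off is not None and stored[off:off + n] == new[i:i + n]:
            length = n
            while (i + length < len(new) and off + length < len(stored)
                   and new[i + length] == stored[off + length]):
                length += 1                      # extend the match greedily
            if lit:
                out.append(("lit", bytes(lit)))
                lit.clear()
            out.append(("ref", off, length))
            i += length
        else:
            lit.append(new[i])
            i += 1
    if lit:
        out.append(("lit", bytes(lit)))
    return out


def restore(tokens: list, stored: bytes) -> bytes:
    """Rebuild the original bytes from literals and references."""
    parts = []
    for tok in tokens:
        parts.append(tok[1] if tok[0] == "lit" else stored[tok[1]:tok[1] + tok[2]])
    return b"".join(parts)
```

For example, with `stored = b"the quick brown fox"` and `new = b"a quick brown cat"`, the shared run `" quick brown "` is emitted as a single reference and only `a` and `cat` remain as literal bytes.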

20 Claims

1. A method of deduplicating data performed by one or more computing systems which are coupled to one or more storage devices via a network, wherein the one or more computing system each comprise at least one processor and memory, and the one or more storage devices include a searchable data structure and a first set of data, the method comprising:
- receiving a second set of data;
- dividing the second set of data into at least one block, wherein the one block includes a total number of bytes;
- accessing the searchable data structure;
- determining whether one or more bytes of the one block are included in a portion of the first set of data in the searchable data structure, wherein the number of the one or more bytes is less than the total number of bytes of the one block;
- replacing the one or more bytes with a reference to the portion of the first set of data when the one or more bytes of the one block are included in the portion of the first set of data in the searchable data structure; and
- causing the one block to be stored using the one or more storage devices.

Dependent claims (2, 3, 4):
- 2. The method of claim 1, further comprising generating powers of 2 rolling hashes for the second set of data.
- 3. The method of claim 1, wherein the searchable data structure includes a hierarchical data structure that includes multiple nodes, and wherein a first node can reference data in any other node excepting nodes that are descendants of the first node.
- 4. The method of claim 1, further comprising compressing the second set of data.
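
The dependent claims mention "powers of 2 rolling hashes" without defining the term in this listing. One plausible reading, assumed here, is a set of rolling hashes computed over window lengths that are powers of two. The sketch below follows that reading; the hash function, its parameters, and the `min_len` default are illustrative choices rather than details from the patent.

```python
def rolling_hashes(data: bytes, n: int, base: int = 257,
                   mod: int = (1 << 61) - 1) -> list:
    """Return the hash of every n-byte window of data, in offset order."""
    if n <= 0 or len(data) < n:
        return []
    top = pow(base, n - 1, mod)      # weight of the byte leaving the window
    h = 0
    for b in data[:n]:
        h = (h * base + b) % mod
    hashes = [h]
    for i in range(n, len(data)):
        h = ((h - data[i - n] * top) * base + data[i]) % mod
        hashes.append(h)
    return hashes


def power_of_two_hashes(data: bytes, min_len: int = 2) -> dict:
    """Map each window length (min_len, 2*min_len, ...) to its rolling hashes.

    min_len is assumed to be a power of two, so every key is one as well.
    """
    result = {}
    n = min_len
    while n <= len(data):
        result[n] = rolling_hashes(data, n)
        n *= 2
    return result
```

Keeping several window lengths lets a lookup try the longest window first and fall back to shorter ones, which is one way a matcher could find long duplicate runs cheaply while still catching short ones.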

5. A system for deduplicating data, the system comprising:
- at least one processor;
- memory communicatively coupled to the at least one processor;
- means for receiving a file including multiple bytes;
- means for accessing at least some of multiple blocks of data in a data structure;
  wherein the multiple blocks of data have a first size, wherein the multiple blocks of data represent a set of data having a second size that is greater than the first size, wherein a block of data is associated with multiple first identifiers, and wherein the multiple blocks of data are identified in the data structure by the associated multiple first identifiers;
- means for determining whether one or more of the multiple bytes are already stored based at least partly upon accessing of the data structure, wherein the number of the one or more bytes is less than a number of bytes in a block of data; and
- means for causing bytes that are not already stored using a storage device to be stored.

Dependent claims (6, 7, 8, 9):
- 6. The system of claim 5, further comprising means for generating multiple second identifiers for the file, wherein the multiple second identifiers for the file include powers of 2 rolling hashes.
- 7. The system of claim 5, further comprising means for generating multiple second identifiers for the file, wherein the multiple second identifiers include powers of 2 rolling hashes, and wherein means for determining whether the one or more of the multiple bytes are already stored includes means for comparing the multiple second identifiers for the file with the multiple first identifiers associated with the multiple blocks of data.
- 8. The system of claim 5, wherein the data structure includes a hierarchical data structure that includes multiple nodes, and wherein a first node can reference data in any other node excepting nodes that are descendants of the first node.
- 9. The system of claim 5, wherein at least some of the multiple blocks of data are compressed.
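
The hierarchical constraint recited in the dependent claims (a first node may reference data in any other node except its own descendants) can be checked with a small tree walk. The `Node` class and the function names below are hypothetical; the listing does not say how the hierarchy is stored.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)             # compare nodes by identity, not by content
class Node:
    name: str
    children: list = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def is_descendant(candidate: Node, ancestor: Node) -> bool:
    """True if candidate sits anywhere below ancestor in the tree."""
    stack = list(ancestor.children)
    while stack:
        node = stack.pop()
        if node is candidate:
            return True
        stack.extend(node.children)
    return False


def may_reference(first: Node, other: Node) -> bool:
    """A node may reference any *other* node that is not its descendant."""
    return other is not first and not is_descendant(other, first)
```

Under this rule a child may point back at its parent or at a sibling, while a parent may not point down into its own subtree, which rules out a node whose data depends on content stored beneath it.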

10. A method of deduplicating data performed by a computing system having a processor and memory, the method comprising:
- dividing a first set of data into at least one block, wherein the one block includes a total number of bytes;
- accessing a searchable data structure, wherein the searchable data structure includes a second set of data;
- determining whether one or more bytes of the one block are included in a portion of the second set of data in the searchable data structure, wherein the number of the one or more bytes is less than the total number of bytes of the one block;
- replacing the one or more bytes with a reference to the portion of the second set of data, when the one or more bytes of the one block are included in the portion of the second set of data in the searchable data structure; and
- causing the block to be stored.

Dependent claims (11, 12, 13):
- 11. The method of claim 10, further comprising generating powers of 2 rolling hashes.
- 12. The method of claim 10, wherein the searchable data structure includes a hierarchical data structure that includes multiple nodes, and wherein a first node can reference data in any other node excepting nodes that are descendants of the first node.
- 13. The method of claim 10, further comprising compressing data.

14. A method of building a search data structure for deduplicating data, wherein the method is performed by a computing system having a processor and memory, the method comprising:
- receiving a set of data;
- dividing the set of data into at least a first block, wherein the first block includes a number of bytes;
- accessing a search data structure containing zero or more nodes, wherein each node represents a block including the number of bytes or represents a portion of a block, and wherein each node contains a reference to a stored block, to a portion of a stored block, or to another node;
- first determining whether the first block is identical to a second block represented by a first node in the search data structure;
- when a result of the first determining is positive, creating a node in the search data structure representing the first block and containing a reference to the first node; and
- when a result of the first determining is negative;
  - second determining whether a portion of the first block is identical to a portion of a third block represented by a second node in the search data structure;
  - when a result of the second determining is positive;
    - storing a portion of the first block that is not identical to the portion of the third block;
    - creating a third node in the search data structure representing the portion of the third block and containing a reference to the stored portion of the third block; and
    - creating a node in the search data structure representing the first block and containing a reference to the third node and a reference to the stored portion of the first block; and
  - when a result of the second determining is negative, storing the first block;
    - creating a node in the search data structure representing the first block and containing a reference to the stored first block.

Dependent claims (15, 16, 17, 18, 19, 20):
- 15. The method of claim 14, wherein the search data structure is implemented as a skipped list or a red-black tree.
- 16. The method of claim 14, wherein a reference to a portion of a stored block comprises a reference to the stored block, an offset within the stored block, and a length.
- 17. The method of claim 14, wherein each node in the search data structure contains metadata, including a number of references to the node or a hash code.
- 18. The method of claim 17, wherein when a node contains a reference to a stored block or a portion of a stored block, the hash code is computed from the stored block or the portion of the stored block, and wherein when a node contains a reference to another node, the hash code is computed from a stored block or a portion of a stored block to which the reference is resolved.
- 19. The method of claim 18, wherein the first determining is performed based on the hash code contained in the first block and the hash code contained in the second block, and wherein the second determining is performed using the hash code contained in the first block and the hash code contained in the third block.
- 20. The method of claim 17, wherein the number of references to a third node is updated when a node that contains a reference to the third node is added to or deleted from the search data structure.
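
The branching in claim 14, together with the reference format of claim 16 and the metadata of claim 17, can be sketched as follows. This is an illustration under simplifying assumptions, not the claimed method itself: a plain Python list stands in for the skip list or red-black tree, a "portion" is limited to a shared prefix of at least `min_portion` bytes, the partial match compares bytes directly instead of hash codes, and SHA-256 is an arbitrary choice of hash. All class and method names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Ref:
    """Reference to a portion of a stored block: block id, offset, length."""
    store_id: int
    offset: int
    length: int


@dataclass(eq=False)
class Node:
    refs: list           # ordered Ref or Node entries that make up the data
    hash_code: str       # hash of the bytes the node resolves to
    ref_count: int = 0   # number of other nodes that reference this node


class SearchIndex:
    def __init__(self, min_portion: int = 4):
        self.store = []  # stored byte strings
        self.nodes = []  # stand-in for a skip list or red-black tree
        self.min_portion = min_portion

    def resolve(self, node: Node) -> bytes:
        """Follow references until only stored bytes remain."""
        out = b""
        for r in node.refs:
            if isinstance(r, Node):
                out += self.resolve(r)
            else:
                out += self.store[r.store_id][r.offset:r.offset + r.length]
        return out

    def _put(self, data: bytes) -> Ref:
        self.store.append(data)
        return Ref(len(self.store) - 1, 0, len(data))

    def _new_node(self, refs: list, data: bytes) -> Node:
        for r in refs:
            if isinstance(r, Node):
                r.ref_count += 1                 # keep reference counts current
        node = Node(refs, hashlib.sha256(data).hexdigest())
        self.nodes.append(node)
        return node

    def add_block(self, block: bytes) -> Node:
        h = hashlib.sha256(block).hexdigest()
        # First determining: is an identical block already represented?
        for node in self.nodes:
            if node.hash_code == h and self.resolve(node) == block:
                return self._new_node([node], block)
        # Second determining: does a directly stored block share a prefix?
        for node in list(self.nodes):
            if len(node.refs) != 1 or isinstance(node.refs[0], Node):
                continue
            ref, other = node.refs[0], self.resolve(node)
            n = 0
            while n < min(len(block), len(other)) and block[n] == other[n]:
                n += 1
            if n >= self.min_portion:
                shared = self._new_node(
                    [Ref(ref.store_id, ref.offset, n)], block[:n])
                refs = [shared]
                if n < len(block):
                    refs.append(self._put(block[n:]))  # store only new bytes
                return self._new_node(refs, block)
        # Neither matched: store the whole block.
        return self._new_node([self._put(block)], block)
```

Adding `b"hello world"` twice and then `b"hello there"` to a fresh `SearchIndex` stores only two byte strings, `b"hello world"` and `b"there"`; the second block becomes a node pointing at the first, and the third is a node made of a shared-prefix node plus a reference to the stored tail.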

Current Assignee: CommVault Systems Incorporated
Original Assignee: CommVault Systems Incorporated
Inventors: Klose, Michael F.
Primary Examiner(s): Coby, Frantz

Application Number: US14/276,622
Publication Number: US 20140250088A1
Time in Patent Office: 518 Days
Field of Search: 707/615-616, 707/634-635, 707/637, 707/692, 707/705, 711/103, 711/114, 711/129, 711/161, 711/165
US Class Current: 1/1

CPC Class Codes:
- G06F 11/1453   using de-duplication of the...
- G06F 16/113   Details of archiving lifecy...
- G06F 16/13   File access structures, e.g...
- G06F 16/1748   De-duplication implemented ...
- G06F 16/1752   based on file chunks
- G06F 16/18   File system types
- G06F 16/184   implemented as replicated f...
- G06F 16/185   Hierarchical storage manage...
