Systems and methods for byte-level or quasi byte-level single instancing

US 8,412,677 B2
Filed: 11/27/2009
Issued: 04/02/2013
Est. Priority Date: 11/26/2008
Status: Active Grant

First Claim

Patent Images

1. A method of deduplicating data, wherein the method is performed by a computing system having a processor and memory, the method comprising:

storing multiple blocks of data in a data structure, wherein;

the multiple blocks of data have an first size,the multiple blocks of data represent a set of data having a second size that is greater than the first size,a block of data is associated with multiple first identifiers, andthe multiple blocks of data are sorted in the data structure by the associated multiple first identifiers;

receiving a file, the file including multiple bytes;

accessing the data structure;

based upon accessing the data structure, determining, by the computing system, whether one or more of the multiple bytes are already stored, wherein the number of the one or more bytes is less than a number of bytes in a block of data; and

storing bytes that are not already stored using a storage device.

View all claims

4 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Described in detail herein are systems and methods for deduplicating data using byte-level or quasi byte-level techniques. In some embodiments, a file is divided into multiple blocks. A block includes multiple bytes. Multiple rolling hashes of the file are generated. For each byte in the file, a searchable data structure is accessed to determine if the data structure already includes an entry matching a hash of a minimum sequence length. If so, this indicates that the corresponding bytes are already stored. If one or more bytes in the file are already stored, then the one or more bytes in the file are replaced with a reference to the already stored bytes. The systems and methods described herein may be used for file systems, databases, storing backup data, or any other use case where it may be useful to reduce the amount of data being stored.

Citations

13 Claims

1. A method of deduplicating data, wherein the method is performed by a computing system having a processor and memory, the method comprising:
- storing multiple blocks of data in a data structure, wherein;
  
  the multiple blocks of data have an first size,the multiple blocks of data represent a set of data having a second size that is greater than the first size,a block of data is associated with multiple first identifiers, andthe multiple blocks of data are sorted in the data structure by the associated multiple first identifiers;
  
  receiving a file, the file including multiple bytes;
  
  accessing the data structure;
  
  based upon accessing the data structure, determining, by the computing system, whether one or more of the multiple bytes are already stored, wherein the number of the one or more bytes is less than a number of bytes in a block of data; and
  
  storing bytes that are not already stored using a storage device.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The method of claim 1, further comprising generating multiple second identifiers for the file, wherein the multiple second identifiers for the file include powers of 2 rolling hashes.
  - 3. The method of claim 1, further comprising generating multiple second identifiers for the file, wherein the multiple second identifiers include powers of 2 rolling hashes, and wherein determining whether the one or more of the multiple bytes are already stored includes comparing the multiple second identifiers for the file with the multiple first identifiers associated with the multiple blocks of data.
  - 4. The method of claim 1 wherein the data structure includes a hierarchical data structure that includes multiple nodes, and wherein a first node can reference data in any other node excepting nodes that are descendants of the first node.
  - 5. The method of claim 1 wherein multiple blocks of data are compressed.

6. A computing system for deduplicating data, the computing system comprising:
- a processor;
  
  a memory;
  
  means for receiving a first set of data;
  
  means for dividing the first set of data into at least one block, the block including a total number of bytes;
  
  means for accessing a searchable data structure, the searchable data structure including a second set of data;
  
  means for determining whether one or more bytes of the block are included in a portion of the second set of data in the searchable data structure, wherein the number of the one or more bytes is less than the total number of bytes of the block;
  
  means for replacing the one or more bytes with a reference to the portion of the second set of data, wherein the means for replacing replaces the one or more bytes with a reference to the portion of the second set of data if the one or more bytes of the block are included in the portion of the second set of data in the searchable data structure; and
  
  means for storing the block.
- View Dependent Claims (7, 8, 9)
- - 7. The computing system of claim 6, further comprising means for generating powers of 2 rolling hashes.
  - 8. The computing system of claim 6 wherein the searchable data structure includes a hierarchical data structure that includes multiple nodes, and wherein a first node can reference data in any other node excepting nodes that are descendants of the first node.
  - 9. The computing system of claim 6, further comprising means for compressing data.

10. A system for deduplicating data in a data storage network, wherein the data storage network includes one or more storage devices coupled via a network, and wherein the network also couples to one or more computing systems, the system comprising:
- one or more storage devices storing a searchable data structure, the searchable data structure including a first set of data; and
  
  a computing system configured to;
  
  receive a second set of data;
  
  divide the set of data into at least one block, the block including a total number of bytes;
  
  access the searchable data structure;
  
  determine whether one or more bytes of the block are included in a portion of the first set of data in the searchable data structure, wherein the number of the one or more bytes is less than the total number of bytes of the block;
  
  replace the one or more bytes with a reference to the portion of the second set of data if the one or more bytes of the block are included in the portion of the first set of data in the searchable data structure; and
  
  store the block using the one or more storage devices.
- View Dependent Claims (11, 12, 13)
- - 11. The system of claim 10 wherein the computing system is further configured to generate powers of 2 rolling hashes for the second set of data.
  - 12. The system of claim 10 wherein the searchable data structure includes a hierarchical data structure that includes multiple nodes, and wherein a first node can reference data in any other node excepting nodes that are descendants of the first node.
  - 13. The system of claim 10 wherein the computing system is further configured to compress the second set of data.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
CommVault Systems Incorporated
Original Assignee
CommVault Systems Incorporated
Inventors
Klose, Michael F.
Primary Examiner(s)
Coby, Frantz

Application Number

US12/626,839
Publication Number

US 20100169287A1
Time in Patent Office

1,222 Days
Field of Search

707/634, 707/692, 707/705, 711/114, 711/129, 711/161, 711/162, 711/103, 711/165
US Class Current

707/634
CPC Class Codes

G06F 11/1453   using de-duplication of the...

G06F 16/113   Details of archiving lifecy...

G06F 16/13   File access structures, e.g...

G06F 16/1748   De-duplication implemented ...

G06F 16/1752   based on file chunks

G06F 16/18   File system types

G06F 16/184   implemented as replicated f...

G06F 16/185   Hierarchical storage manage...

Systems and methods for byte-level or quasi byte-level single instancing

First Claim

4 Assignments

0 Petitions

Accused Products

Abstract

Citations

13 Claims

Specification

Solutions

Use Cases

Quick Links

Systems and methods for byte-level or quasi byte-level single instancing

First Claim

4 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

13 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links