Block-level single instancing

US 9,058,117 B2
Filed: 10/09/2013
Issued: 06/16/2015
Est. Priority Date: 05/22/2009
Status: Active Grant

Abstract

Described in detail herein are systems and methods for single instancing blocks of data in a data storage system. For example, the data storage system may include multiple computing devices (e.g., client computing devices) that store primary data. The data storage system may also include a secondary storage computing device, a single instance database, and one or more storage devices that store copies of the primary data (e.g., secondary copies, tertiary copies, etc.). The secondary storage computing device receives blocks of data from the computing devices and accesses the single instance database to determine whether the blocks of data are unique (meaning that no instances of the blocks of data are stored on the storage devices). If a block of data is unique, the single instance database stores it on a storage device. If not, the secondary storage computing device can avoid storing the block of data on the storage devices.
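
The uniqueness check described in the abstract can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class `SingleInstanceStore`, its in-memory `container` and `database`, and the use of SHA-256 as the block identifier are all assumptions made for the example, since the abstract does not say how identifiers are computed.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory stand-in for a single instance database plus one container file."""

    def __init__(self):
        self.container = bytearray()  # hypothetical container file contents
        self.database = {}            # identifier -> (offset, length) in the container

    def store_block(self, block: bytes):
        """Single instance one block and return (identifier, was_unique)."""
        # Assumed identifier scheme: a SHA-256 digest of the block contents.
        identifier = hashlib.sha256(block).hexdigest()
        if identifier in self.database:
            # An instance is already stored, so nothing new is written.
            return identifier, False
        offset = len(self.container)
        self.container.extend(block)
        self.database[identifier] = (offset, len(block))
        return identifier, True
```

Storing the same block twice leaves a single copy in `container`; the second call only reports that the block was already known.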

17 Claims

1. A system for storing a single instance of a block of data in a data storage network, wherein the data storage network includes multiple storage devices coupled via a computer network, and wherein the computer network also couples to one or more computing devices having file systems on which files are stored, the system comprising:
- one or more storage devices storing multiple blocks of data in one or more container files, wherein a file stored on a file system of a computing device is comprised of one or more blocks of data;
  
  one or more single instance databases storing, for at least some of the multiple blocks of data, an identifier of the stored block of data, and a location of the stored block of data in a container file;
  
  one or more index files storing, for at least some of the multiple blocks of data, a single flag indicating whether the stored block of data is referred to in one or more metadata files on the one or more storage devices; and
  
  a secondary storage computing device configured to—
  
  receive data corresponding to one or more data storage jobs from the one or more computing devices, wherein a data storage job is performed on one or more files stored on the file systems of the one or more computing devices, and wherein the received data includes multiple blocks of data; and
  
  for at least some of the multiple blocks of data in the received data—
  
  determine an identifier of the received block of data;
  
  determine if one of the single instance databases already stores the identifier;
  
  when one of the single instance databases already stores the identifier, determine the corresponding block of data in a container file, store a reference to the corresponding block of data in one of the metadata files, and updating the flag for the corresponding block of data in one of the index files; and
  
  when none of the single instance databases already stores the identifier, store the received block of data in a container file,
  
  wherein a container file includes stored blocks of data from more than one file stored on the one or more computing devices, storing a reference to the received block of data in one of the metadata files, and creating a new entry for the received block in one of the index files.
  - 2. The system of claim 1, wherein the secondary storage computing device receives the data corresponding to the one or more data storage jobs from the one or more computing devices in one or more data streams, and wherein a data stream includes:
    - multiple stream data items, wherein the multiple stream data items include multiple blocks of data indicated to be eligible for single instancing and multiple items of data not indicated to be eligible for single instancing;
      
      multiple stream header items, each stream header item containing metadata describing an associated stream data item, the metadata including an indication of whether the stream data item includes data indicated to be eligible for single instancing; and
      
      multiple identifiers of the multiple blocks of data indicated to be eligible for single instancing.
  - 3. The system of claim 1, further comprising the one or more computing devices, wherein the one or more computing devices are configured to:
    - access files on which the one or more data storage jobs are performed;
      
      determine a first set of files that are not eligible for single instancing and a second set of files that are eligible for single instancing, wherein the first set of files includes multiple items of data, and wherein the second set of files are comprised in multiple blocks of data;
      
      for at least some of the multiple blocks of data—
      
      generate the identifier for the block of data; and
      
      associate the identifier with the block of data; and
      
      provide the first and second sets of files to the secondary storage computing device.
  - 4. The system of claim 1, wherein the one or more single instance databases either store a reference count of a number of references to blocks of data or information from which the reference count can be derived, wherein the one or more storage devices store the multiple blocks of data in the one or more container files on one or more physical media, and wherein the secondary storage computing device is further configured to:
    - receive an indication to delete one or more blocks of data stored in the one or more container files; and
      
      for at least some of the blocks of data indicated to be deleted—
      
      determine the reference count of the block of data; and
      
      when the reference count of the block of data is zero, update one of the index files to indicate that the block of data is not referred to; and
      
      when a threshold number of contiguous blocks of data in a container file that are not referred to is reached, make available for storage portions of the one or more physical media corresponding to the threshold number of contiguous blocks of data.
  - 5. The system of claim 1, wherein the one or more single instance databases either store a reference count of a number of references to locations of blocks of data or information from which the reference count can be derived, and wherein the secondary storage computing device is further configured to:
    - receive an indication to delete one or more blocks of data stored in the one or more container files; and
      
      for at least some of the blocks of data indicated to be deleted—
      
      determine the reference count of the block of data; and
      
      when the reference count of the block of data is zero, update one of the index files to indicate that the block of data is not referred to; and
      
      when a threshold number of contiguous blocks of data at an extremity of a container file that are not referred to is reached, delete a portion of the container file corresponding to the threshold number of contiguous blocks of data.
  - 6. The system of claim 1, wherein the one or more single instance databases either store a reference count of a number of references to locations of blocks of data or information from which the reference count can be derived, and wherein the secondary storage computing device is further configured to:
    - receive an indication to delete one or more blocks of data stored in the one or more container files; and
      
      for at least some of the blocks of data indicated to be deleted—
      
      determine the reference count of the block of data; and
      
      when the reference count of the block of data is zero, update one of the index files to indicate that the block of data is not referred to; and
      
      when none of the blocks of data in a container file is referred to, delete the container file.
  - 7. The system of claim 1, wherein the secondary storage computing device includes one or more memory buffers, wherein each of the memory buffers has a size that is greater than a size of a block of data but is less than ten times the size of the block of data, and wherein the secondary storage computing device is further configured to store multiple blocks of data indicated to be eligible for single instancing in the one or more memory buffers.
  - 8. The system of claim 1, wherein the secondary storage computing device is further configured to, for at least some of the multiple blocks of data indicated to be eligible for single instancing, generate the identifier for the block of data.
  - 9. The system of claim 1, wherein the one or more single instance databases maintain:
    - a first data structure storing, for at least some of the blocks of data, the identifier of the block of data and the location of the block of data in a container file; and
      
      a second data structure storing, for at least some of the blocks of data, a location of a reference to the block of data.
  - 10. The system of claim 1, wherein the secondary storage computing device is further configured to:
    - determine if a container file contains any referenced blocks of data; and
      
      when the container file does not contain any referenced blocks of data, delete the container file.
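
The deletion behaviour in claims 4 to 6 and 10 (reference counts, a per-block "referred to" flag, reclaiming runs of unreferenced blocks, and deleting a container once nothing in it is referenced) can be sketched as below. The class `ContainerIndex` and its method names are hypothetical, and keeping the counts and flags in one in-memory object is a simplification: the claims place reference counts in the single instance databases and flags in the index files.

```python
class ContainerIndex:
    """Toy bookkeeping for one container file: a count and a flag per block slot."""

    def __init__(self, block_count: int):
        self.ref_counts = [0] * block_count    # stand-in for database reference counts
        self.referred = [False] * block_count  # stand-in for index-file flags

    def add_reference(self, slot: int) -> None:
        self.ref_counts[slot] += 1
        self.referred[slot] = True

    def remove_reference(self, slot: int) -> None:
        if self.ref_counts[slot] == 0:
            raise ValueError("block has no references to remove")
        self.ref_counts[slot] -= 1
        if self.ref_counts[slot] == 0:
            # A count of zero marks the block as no longer referred to.
            self.referred[slot] = False

    def reclaimable_runs(self, threshold: int):
        """Return (start, length) for each run of >= threshold unreferenced blocks."""
        runs, start = [], None
        # The trailing True is a sentinel that closes a run ending at the last slot.
        for slot, flag in enumerate(self.referred + [True]):
            if not flag and start is None:
                start = slot
            elif flag and start is not None:
                if slot - start >= threshold:
                    runs.append((start, slot - start))
                start = None
        return runs

    def is_deletable(self) -> bool:
        """True when no block in the container is referred to."""
        return not any(self.referred)
```

A caller would free the media behind each run returned by `reclaimable_runs` and delete the whole container when `is_deletable` returns true; how space is actually released is outside the scope of this sketch.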

11. A method of single instancing multiple blocks of data, wherein the method is performed by a first computing device having a processor and memory, the method comprising:
- for at least some of multiple blocks of data included in data received from a set of one or more computing devices distinct from the first computing device, wherein the one or more computing devices have file systems storing files, and wherein each file comprises at least one block of data—
  
  determining an identifier of each received block of data;
  
  accessing, by the first computing device, one or more data structures that store, for each block of data stored in one or more logical containers on one or more storage devices, an identifier of the stored block of data, a location of the stored block of data in a logical container, and a single indicator of whether the stored block of data is referred to in one or more metadata containers on the one or more storage devices, wherein the logical container includes stored blocks of data from more than one file received from the set of one or more computing devices;
  
  determining, based upon the identifier of the received block of data and based upon access to the one or more data structures, if the received block of data should be stored;
  
  when the received block of data should not be stored, then determining an already stored instance of the received block of data in a logical container, storing a reference to that instance in one of the metadata containers, and updating the indicator for that instance in the one or more data structures; and
  
  when the received block of data should be stored, then storing the received block of data in one of the logical containers, storing a reference to the received block of data in one of the metadata containers, and creating a new entry for the received block in the one or more data structures.
  - 12. The method of claim 11, wherein the received data includes multiple items of data not indicated to be eligible for single instancing, and wherein the method further comprises, for at least some of the multiple items of data not indicated to be eligible for single instancing, storing the items of data in the one or more metadata containers.
  - 13. The method of claim 11, wherein the data includes the identifiers for at least some of the multiple blocks of data indicated to be eligible for single instancing.
  - 14. The method of claim 11, further comprising for at least some of the multiple blocks of data indicated to be eligible for single instancing:
    - storing the block of data in a buffer;
      
      determining if the block of data should be stored;
      
      when the block of data should be stored, copying the block of data from the buffer to one of the logical containers; and
      
      when the block of data should not be stored, making the buffer available for storage of another block of data.
  - 15. The method of claim 11, further comprising:
    - receiving an indication to delete one or more blocks of data stored in the one or more logical containers;
      
      for at least some of the blocks of data in the one or more logical containers—
      
      determining if the block of data is referred to; and
      
      when the block of data is not referred to, updating one of the data structures to indicate that the block of data is not referred to; and
      
      upon reaching a threshold number of contiguous blocks of data in a logical container that are not referred to, making available storage space corresponding to the contiguous blocks of data in the logical container that are not referred to.
  - 16. The method of claim 11, further comprising:
    - receiving an indication to delete one or more blocks of data stored in the one or more logical containers;
      
      for at least some of the blocks of data in the one or more logical containers—
      
      determining if the block of data is referred to; and
      
      when the block of data is not referred to, updating one of the data structures to indicate that the block of data is not referred to; and
      
      when none of the blocks of data in a logical container is referred to, deleting the logical container.
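
The buffering step of claim 14, combined with the reference bookkeeping of claim 11, might look roughly like the sketch below. Everything here is illustrative: the function name `single_instance_with_buffers`, the small pool of `bytearray` buffers, the list used as a logical container, and SHA-256 as the identifier are assumptions, not details taken from the claims.

```python
import hashlib
from collections import deque


def single_instance_with_buffers(blocks, buffer_count=2):
    """Stage each block in a buffer, then either copy it to the container or drop it."""
    free_buffers = deque(bytearray() for _ in range(buffer_count))
    known = {}        # identifier -> slot of the stored instance in the container
    container = []    # stand-in for a logical container of unique blocks
    references = []   # stand-in for a metadata container: one reference per block

    for block in blocks:
        buf = free_buffers.popleft()   # take a free buffer
        buf[:] = block                 # store the received block in the buffer
        identifier = hashlib.sha256(bytes(buf)).hexdigest()
        if identifier not in known:
            # The block should be stored: copy it from the buffer to the container.
            known[identifier] = len(container)
            container.append(bytes(buf))
        # Either way, record a reference to the stored instance.
        references.append(known[identifier])
        buf.clear()                    # make the buffer available for another block
        free_buffers.append(buf)

    return container, references
```

Because this loop is sequential, a single buffer would suffice; a pool is shown only to mirror the idea of releasing a buffer for reuse once the storage decision has been made.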

17. A method for reducing duplication of stored data, wherein the method is performed by a computing system having a processor and memory, the method comprising:
- receiving an indication to perform a data storage operation on data from one or more files stored at one or more computing devices, wherein the data includes one or more blocks;
  
  for at least some of the blocks—
  
  determining whether a block is eligible for single instancing;
  
  if the block is not eligible for single instancing, then storing the block in a first container file, wherein the first container file stores blocks that are not eligible for single instancing, and wherein the first container file also includes at least one data structure that stores references to blocks that are eligible for single instancing;
  
  if the block is eligible for single instancing, then determining if an instance of the block has already been stored on a storage device distinct from the computing device;
  
  if an instance of the block has already been stored on the storage device, then storing in the first container file a reference to the already stored instance of the block in the at least one data structure; and
  
  if an instance of the block has not already been stored on the storage device, then—
  
  storing the block in a second container file, wherein the second container file stores only a single instance of each block, wherein the second container file stores blocks from more than one file stored at one or more computing devices, wherein the second container file includes multiple portions available for storing blocks, and wherein the block is stored in one or more portions;
  
  storing in the first container file a reference to the block in the second container file, wherein the reference to the block is stored in the at least one data structure; and
  
  storing in the at least one data structure an indication that the one or more portions in the second container file are not available for storing blocks.
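
The two-container flow of claim 17 can be sketched as follows, under several assumptions: `store_blocks` and the caller-supplied `is_eligible` predicate are hypothetical names, each "portion" of the second container is modelled as one list slot, and SHA-256 stands in for whatever test decides that an instance is already stored.

```python
import hashlib


def store_blocks(blocks, is_eligible):
    """Route blocks to a first (per-job) container or a second (single instance) container."""
    first_container = {
        "blocks": [],            # blocks not eligible for single instancing
        "references": [],        # data structure of references into the second container
        "used_portions": set(),  # portions of the second container no longer available
    }
    second_container = []        # holds only one instance of each eligible block
    seen = {}                    # identifier -> portion (slot) in the second container

    for block in blocks:
        if not is_eligible(block):
            first_container["blocks"].append(block)
            continue
        identifier = hashlib.sha256(block).hexdigest()
        if identifier not in seen:
            # No instance stored yet: store it and mark its portion as in use.
            seen[identifier] = len(second_container)
            second_container.append(block)
            first_container["used_portions"].add(seen[identifier])
        # In both cases the first container records a reference to the instance.
        first_container["references"].append(seen[identifier])

    return first_container, second_container
```

For example, calling `store_blocks` with a predicate that rejects small metadata items keeps those items in the first container while repeated data blocks collapse to one instance in the second.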

Current Assignee: CommVault Systems Incorporated
Original Assignee: CommVault Systems Incorporated
Inventors: Attarde, Deepak Raghunath; Kottomtharayil, Rajiv; Vijayan, Manoj Kumar
Primary Examiner(s): BIRKHIMER, CHRISTOPHER D
Application Number: US14/049,463
Publication Number: US 20140040582A1
Time in Patent Office: 615 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 11/1435   using file system or storag...
G06F 11/1453   using de-duplication of the...
G06F 11/1464   for networked environments
G06F 11/1469   Backup restoration techniques
G06F 16/1752   based on file chunks
G06F 2201/80   Database-specific techniques
G06F 2201/84   Using snapshots, i.e. a log...
G06F 3/0617   in relation to availability
G06F 3/064   Management of blocks
G06F 3/067   Distributed or networked st...
