System and method for storing redundant information

US 7,953,706 B2
Filed: 03/28/2008
Issued: 05/31/2011
Est. Priority Date: 12/22/2006
Status: Expired due to Fees

Abstract

A method and system for reducing storage requirements and speeding up storage operations by reducing the storage of redundant data includes receiving a request that identifies one or more data objects to which to apply a storage operation. For each data object, the storage system determines if the data object contains data that matches another data object to which the storage operation was previously applied. If the data objects do not match, then the storage system performs the storage operation in a usual manner. However, if the data objects do match, then the storage system may avoid performing the storage operation.
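The decision described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `store_objects`, the `seen` set, and the `store` callback are hypothetical, and matching on a SHA-256 digest alone is an assumption made for brevity (the claims also compare size and security information).

```python
import hashlib

def store_objects(objects, seen, store):
    """Apply the storage operation only to objects whose content is new.

    objects: mapping of name -> bytes (insertion order is preserved)
    seen:    set of digests of content already stored (hypothetical index)
    store:   callable(name, data) that performs the actual storage operation
    Returns the names that were actually stored.
    """
    stored = []
    for name, data in objects.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            # Content matches an earlier object: skip the storage operation.
            continue
        store(name, data)
        seen.add(digest)
        stored.append(name)
    return stored
```

With three objects of which two share the same content, only the first copy and the distinct object reach `store`; the duplicate is skipped.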

197 Citations

23 Claims

1. A method in a source computer for reducing redundant storage of a data object, the method comprising:
- receiving at the source computer a first request from a server computer to perform a storage operation on multiple data objects;
  
  processing at the source computer the data objects specified by the first request to produce a hash, a size, and security information of each data object, wherein the hash of each data object provides an identifier of the data object, and wherein the identifier, the size, and the security information is compared with identifiers, sizes, and security information of other data objects to determine if the data objects match;
  
  sending from the source computer in response to the first request the hash, the size, and the security information of each data object produced by the source computer; and
  
  receiving at the source computer a second request from the server computer to send each data object for which the hash, the size and the security information sent does not identify a data object previously processed by the server computer and not to send each data object for which the hash, the size and the security information sent identifies a data object previously processed by the server computer,

  wherein, for every data object, the server computer utilizes the hash, the size, and the security information to determine whether the server computer previously processed the data object,

  wherein the likelihood of collisions, which occur when two data objects containing different data have the same hash, is reduced.
  - 2. The method of claim 1 wherein the processing comprises detecting that the data object has changed and computing a new hash of the data object.
  - 3. The method of claim 1 wherein receiving the first request comprises receiving a request to create a single instanced copy of the data object by copying those data objects that differ from data objects already copied and not copying those data objects that are the same as data objects already copied.
  - 4. The method of claim 1 wherein receiving the first request comprises receiving a request to perform a storage operation from a storage manager that manages copying files from multiple source computer systems to one or more storage servers.
  - 5. The method of claim 1 wherein the hash is computed using a SHA hashing algorithm, and wherein the server determines whether the data object is an instance of a previously processed data object by comparing the hash values.
  - 6. The method of claim 1 wherein the second request includes a request to send a reference for each data object for which the hash sent identifies a data object previously processed by the server computer.
  - 7. The method of claim 1 wherein the processing at the source computer of the data object to produce a hash of the data object is performed before the first request is received to spread out the resource load of responding the first request.
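The two-request exchange of claim 1 can be sketched as follows. Everything named here (`Signature`, `describe`, `source_first_response`, `server_second_request`) is hypothetical, SHA-256 is one possible choice of SHA algorithm, and representing security information as a plain string is an assumption made only for illustration.

```python
import hashlib
from typing import NamedTuple

class Signature(NamedTuple):
    digest: str    # hash that identifies the content
    size: int      # size of the data object in bytes
    security: str  # security information (assumed here to be an ACL-like string)

def describe(data: bytes, security: str) -> Signature:
    """Produce the hash, size, and security information for one data object."""
    return Signature(hashlib.sha256(data).hexdigest(), len(data), security)

def source_first_response(objects):
    """Source side: answer the first request with one Signature per object.

    objects: mapping of name -> (data, security)
    """
    return {name: describe(data, sec) for name, (data, sec) in objects.items()}

def server_second_request(signatures, processed):
    """Server side: list the objects the source must send.

    processed: set of Signatures the server has already handled.
    An object is requested unless all three fields match a known Signature.
    """
    return [name for name, sig in signatures.items() if sig not in processed]
```

In this sketch, two objects with identical bytes but different security information produce different signatures, so the second one is still requested. Comparing all three fields is what the claim relies on to reduce the impact of hash collisions.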

8. A system for reducing redundant copies of files in a storage environment, the system comprising:
- a hash receiving component configured to receive digest values, file sizes, and security information from client computer systems, wherein the digest values provide a summary of one or more files stored on the client computer systems, wherein the digest values are computed before a request to perform a data storage operation on the one or more files is received;
  
  a hash indexing component configured to maintain an index of digest values, file sizes, and security information for files managed by the system;
  
  a hash comparison component configured to compare received digest values, file sizes, and security information from a client computer with digest values, file sizes, and security information maintained by the index; and
  
  a storage operation component configured to perform storage operations based on the result of the comparison of the digest values,

  wherein the hash comparison component compares a received digest value, a file size, and security information with an index digest value, a file size, and security information to determine whether a file stored on a client computer is an instance of a file tracked within the index,

  wherein each comparison performed by the hash comparison component utilizes the digest value, the file size, and the security information,

  wherein the hash comparison component determines that a file stored on a client computer is an instance of a file tracked within the index if the received digest value, file size, and security information match the index digest value, file size, and security information, and

  wherein the hash comparison component determines that a file stored on a client computer is not an instance of a file tracked within the index if the received digest value matches the index digest value but the received file size does not match the index file size or the received security information does not match the index security information,

  wherein the likelihood of collisions, which occur when two files containing different data have the same digest value, is reduced, and

  wherein the storage operation component stores data based at least in part on the result of the comparison of the digest values.
  - 9. The system of claim 8 wherein the storage operation component performs a copy operation and wherein when the digest value, file size, and security information for a file to be copied matches a digest value, file size, and security information in the index, the copy operation only creates a reference to the file.
  - 10. The system of claim 8 wherein the hash receiving component precomputes the digest values as each file is modified on the client computer systems.
  - 11. The system of claim 8 wherein the hash indexing component receives the digest values produced by a client computer system when the hash comparison component indicates that a particular file does not match any file within the index.
  - 12. The system of claim 8 wherein the hash indexing component maintains an index of digest values for files stored by at least one of a server managed by the system, a tape media, a client computer system, and an offsite storage location.
  - 19. The system of claim 8, wherein a reference count of the number of instances of a file is stored with the index or with a stored instance of the file.
  - 20. The system of claim 8, further comprising a sequential media drive configured to accept one or more sequential media, and wherein the storage operation component is further configured to:
    - store files on which storage operations have been performed on the one or more sequential media; and
      
      store the index of digest values, file sizes, and security information with the files on the one or more sequential media.
  - 21. The system of claim 8, further comprising a storage device, and wherein the storage operation component is further configured to:
    - store files on which storage operations have been performed on the storage device; and
      
      store the index of digest values, file sizes, and security information with the files on the storage device.
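The indexing and comparison components of claim 8, together with the reference-only copy of claim 9 and the reference count of claim 19, might be modeled roughly as below. `HashIndex`, `IndexEntry`, and the string return values are hypothetical names chosen for this sketch; the in-memory dictionary stands in for whatever index storage a real system would use.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class IndexEntry:
    location: str       # where the first stored instance lives (assumed path-like)
    ref_count: int = 1  # number of instances that refer to this content

class HashIndex:
    """Index keyed on (digest, size, security), so all three must match."""

    def __init__(self):
        self._entries = {}

    def lookup(self, digest, size, security):
        """Return the tracked entry, or None if any of the three fields differ."""
        return self._entries.get((digest, size, security))

    def copy(self, digest, size, security, location):
        """Return "reference" if an instance is already tracked, else "stored"."""
        entry = self.lookup(digest, size, security)
        if entry is not None:
            entry.ref_count += 1  # only a reference is created
            return "reference"
        self._entries[(digest, size, security)] = IndexEntry(location)
        return "stored"
```

Because the dictionary key is the full triple, a matching digest with a different file size or different security information is treated as a new file, which mirrors the "not an instance" condition in the claim.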

13. A non-transitory computer-readable storage medium containing instructions for controlling a computer system to reduce redundant data, by a method comprising:
- receiving a list of files from a client computer, wherein the list contains information for each file for determining if other instances of the file are stored within the system, wherein the information comprises at least a hash value, a file size, and security information;
  
  comparing the list of files and the hash values, file sizes, and security information received with an index of files stored by the system, wherein the index contains hash values, file sizes, and security information for the first instance of each of the files stored by the system, wherein the comparison utilizes the hash value, file size, and security information for each file in the list;
  
  for each file in the list of files for which the hash value, the file size, and the security information of the file matches a hash value, a file size, and security information in the index, storing a reference to the file at a destination location; and
  
  for each file in the list of files for which the hash value, the file size, and the security information of the file does not match any hash value, file size, and security information in the index, storing the file at the destination location and updating the index,

  wherein the likelihood of collisions, which occur when two files containing different data have the same digest value, is reduced.
  - 14. The non-transitory computer-readable storage medium of claim 13 wherein updating the index comprises adding the hash value of the file and information describing the destination location where the file can be located to the index.
  - 15. The non-transitory computer-readable storage medium of claim 13 wherein receiving a list of files from a client computer comprises receiving a list of files that have changed after a specified event.
  - 16. The non-transitory computer-readable storage medium of claim 13 including storing the files at the destination location on sequential media.
  - 17. The non-transitory computer-readable storage medium of claim 13 wherein receiving a list of files comprises receiving hash values for portions of each file and wherein comparing the list of files and the hash values received with an index of files stored by the system comprises comparing hash values for portions of each file with hash values for portions of the files stored by the system to determine if at least part of two files match.
  - 18. The non-transitory computer-readable storage medium of claim 13 wherein storing a reference to the file at a destination location comprises determining that the file is stored on sequential media and storing an identifier of the sequential media and offset within the sequential media to the file.
  - 22. The non-transitory computer-readable storage medium of claim 13, wherein each file stored at the destination location has associated header information that includes:
    - a reference count of the number of identical files;
      
      the date the file was first stored; and
      
      an expiration date after which the file can be deleted.
  - 23. The non-transitory computer-readable storage medium of claim 13, wherein each item stored at the destination location has associated header information that includes an indication of the type of the item, either file or reference.
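The store-or-reference loop of claim 13, with the header fields named in claims 22 and 23, could look something like the following. This is a sketch under stated assumptions: `Header`, `store_list`, the tuple layout of `files`, the dictionary used as the destination, and the 30-day default retention are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Header:
    item_type: str      # "file" or "reference" (claim 23)
    first_stored: date  # date the item was first stored
    expires: date       # date after which the item can be deleted
    ref_count: int = 0  # tracked on "file" items only (an assumption of this sketch)

def store_list(files, index, destination, today, retention_days=30):
    """Store each listed file, or a reference to it, at the destination.

    files:       iterable of (name, digest, size, security, data)
    index:       dict mapping (digest, size, security) -> name of first instance
    destination: dict mapping name -> (Header, payload)
    """
    expires = today + timedelta(days=retention_days)
    for name, digest, size, security, data in files:
        key = (digest, size, security)
        if key in index:
            first = index[key]
            destination[first][0].ref_count += 1
            # Payload of a reference is the name of the first stored instance.
            destination[name] = (Header("reference", today, expires), first)
        else:
            destination[name] = (Header("file", today, expires, ref_count=1), data)
            index[key] = name  # update the index with the new first instance
```

On sequential media, the reference payload would more plausibly be a media identifier plus an offset, as claim 18 describes; a name is used here only to keep the example short.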

Current Assignee
CommVault Systems Incorporated
Original Assignee
CommVault Systems Incorporated
Inventors
Retnamma, Manoj Kumar Vijayan; Gokhale, Parag; Attarde, Deepak R.; Prahlad, Anand; Kottomtharayil, Rajiv
Primary Examiner(s)
Mahmoudi; Tony
Assistant Examiner(s)
Hu; Jensen

Application Number
US12/058,178
Publication Number
US 20080243957A1
Time in Patent Office
1,159 Days
Field of Search
707/204, 707/665
US Class Current
707/665
CPC Class Codes
G06F 11/1453   using de-duplication of the...
G06F 16/1748   De-duplication implemented ...
G06F 3/061   Improving I/O performance
G06F 3/0638   Organizing or formatting or...
G06F 3/065   Replication mechanisms
G06F 3/067   Distributed or networked st...
G11B 5/86   Re-recording, i.e. transcri...
