PERFORMING DATA STORAGE OPERATIONS WITH A CLOUD ENVIRONMENT, INCLUDING CONTAINERIZED DEDUPLICATION, DATA PRUNING, AND DATA TRANSFER
First Claim
1. A method for storing, on a cloud storage site, a secondary copy of an original data set, the method comprising:
receiving a primary copy of an original data set;
updating a content index to reflect at least some of the data content in the original data set;
identifying a target cloud storage site on which to store a secondary copy of the original data set, wherein a network connection is to be established between the target cloud storage site and a media file system agent, and wherein the established network connection has an associated latency and bandwidth;
determining a size for a container file to utilize when deduplicating the primary copy of the original data set, wherein the container file size is determined based at least in part on the latency, bandwidth, or both, associated with the network connection to be established;
deduplicating at least some of the data content in the primary copy in order to create one or more container files containing deduplicated data, wherein at least one of the container files has the determined size;
establishing the network connection between the target cloud storage site and the media file system agent; and
transferring the one or more container files to the target cloud storage site.
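The claim above sizes container files from the latency and bandwidth of the connection that will carry them. A minimal sketch in Python (the document specifies no language) of one such sizing heuristic; the target-transfer window, the default bounds, and the function name are illustrative assumptions, not taken from the patent:

```python
def choose_container_size(latency_s, bandwidth_bps,
                          min_size=1 * 1024 * 1024,
                          max_size=512 * 1024 * 1024,
                          target_transfer_s=30.0):
    """Pick a container size large enough to amortize connection
    latency, but small enough to finish within a target window."""
    # Seconds of useful transfer time once the connection is up.
    usable_s = max(target_transfer_s - latency_s, 0.0)
    # Bytes deliverable in that window (bandwidth given in bits/s).
    size = int(bandwidth_bps / 8 * usable_s)
    # Clamp to the allowed container-size range.
    return max(min_size, min(size, max_size))
```

On a high-latency, low-bandwidth link this clamps to the minimum size, so each transfer still amortizes the round-trip cost; on a fast link it caps at the maximum so a single failed transfer does not lose too much work.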
Abstract
Systems and methods are disclosed for performing data storage operations, including content indexing, containerized deduplication, and policy-driven storage, within a cloud environment. The systems support a variety of clients and cloud storage sites that may connect over wide area networks, such as the Internet, which may have appreciable latency and/or packet loss, using various network protocols, including HTTP and FTP. Methods are disclosed for content indexing data stored within a cloud environment to facilitate later searching, including collaborative searching. Methods are also disclosed for performing containerized deduplication to reduce the strain on a system namespace, effectuate cost savings, and so forth. Methods are disclosed for identifying suitable storage locations, including suitable cloud storage sites, for data files subject to a storage policy. Further, systems and methods are disclosed for providing a cloud gateway and a scalable data object store within a cloud environment, along with other features.
523 Citations
27 Claims
1. A method for storing, on a cloud storage site, a secondary copy of an original data set, the method comprising:
receiving a primary copy of an original data set;
updating a content index to reflect at least some of the data content in the original data set;
identifying a target cloud storage site on which to store a secondary copy of the original data set, wherein a network connection is to be established between the target cloud storage site and a media file system agent, and wherein the established network connection has an associated latency and bandwidth;
determining a size for a container file to utilize when deduplicating the primary copy of the original data set, wherein the container file size is determined based at least in part on the latency, bandwidth, or both, associated with the network connection to be established;
deduplicating at least some of the data content in the primary copy in order to create one or more container files containing deduplicated data, wherein at least one of the container files has the determined size;
establishing the network connection between the target cloud storage site and the media file system agent; and
transferring the one or more container files to the target cloud storage site.
Dependent claims: 2–13
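The deduplication step of claim 1 can be sketched as single-instancing blocks while packing the unique ones into containers capped at the determined size. A minimal Python illustration; SHA-256 digests, the packing strategy, and the return shape are assumptions for the sketch, not details from the patent:

```python
import hashlib

def deduplicate_into_containers(blocks, container_size):
    """Single-instance each input block, packing unique blocks into
    container byte strings no larger than container_size."""
    containers, current = [], b""
    index = {}  # digest -> index of the container holding the block
    refs = []   # per input block: (container_index, digest)
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in index:
            # Close the current container if this block won't fit.
            if current and len(current) + len(block) > container_size:
                containers.append(current)
                current = b""
            index[digest] = len(containers)
            current += block
        refs.append((index[digest], digest))
    if current:
        containers.append(current)
    return containers, refs
```

Duplicate blocks contribute only a reference, so repeated content in the primary copy costs no extra container space before transfer.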
14. A system for storing, on a cloud storage site, a secondary copy of an original data set, the system comprising:
means for identifying a cloud storage site on which to store a secondary copy of a primary data set;
means for updating an index of content to reflect at least some data content in the primary data set;
means for deduplicating at least some of the data content in the primary data set;
means for creating one or more container files containing the deduplicated data; and
means for transferring the one or more container files to the cloud storage site.
15. A tangible computer-readable storage medium whose contents cause a data storage system to perform a method of migrating data from local primary storage to secondary storage located on a remote cloud storage site, the method comprising:
identifying no more than n−1 data blocks, located within the local primary storage, that satisfy a criterion, wherein the n−1 data blocks represent a portion of a data file consisting of n blocks and the n blocks contain data written by a file system associated with the local primary storage; and
determining a size for a container file in which to store some or all of the no more than n−1 data blocks;
transferring data contained by the identified no more than n−1 data blocks from the primary storage to the secondary storage located on a cloud storage site, wherein transferring data comprises writing data first to a container file of the determined size; and
updating an index with information associating the transferred data with information identifying blocks within the secondary storage that contain the transferred data, wherein the information includes at least one uniform resource locator or logical address that identifies at least one logical location from which the transferred data may be accessed.
Dependent claims: 16
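Claim 15 migrates only a proper subset of a file's blocks and records a locator per migrated block. A minimal Python sketch, assuming a changed-block flag as the selection criterion and a hypothetical URL scheme of container object plus byte offset (neither is specified by the patent):

```python
def migrate_blocks(blocks, changed, site_url):
    """Select only the blocks flagged as changed (at most n-1 of the
    n file blocks), pack them into one container, and record, per
    migrated block, a URL from which it can later be retrieved."""
    selected = [i for i, flag in enumerate(changed) if flag]
    container = b"".join(blocks[i] for i in selected)
    index, offset = {}, 0
    for i in selected:
        # Hypothetical locator: container object plus byte range.
        index[i] = f"{site_url}/container0?offset={offset}&len={len(blocks[i])}"
        offset += len(blocks[i])
    return container, index
```

The returned index plays the role of the claim's "information associating the transferred data" with its cloud location.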
17. A computer system for indexing and searching multiple content items, the computer system comprising:
a processor;
a memory;
a secondary copy component configured to select or access at least one secondary copy of the multiple content items, wherein the secondary copy of the multiple content items is a copy of the multiple content items and is not a primary copy of the multiple content items, wherein the primary copy is available to the computer system over a local area network, and wherein the at least one secondary copy is stored at a cloud storage site located geographically remote from the computer system;
a content indexing component configured to, for at least some of the multiple content items included in the secondary copy:
analyze content of a content item, including analyzing a summary of the content item as well as analyzing additional content of the content item;
based upon the analysis, generate metadata corresponding to the content item, wherein the metadata includes at least a logical address to the cloud storage site for accessing the content item; and
store in a content index the generated metadata of the content, wherein the content index is not stored at the cloud storage site, but is locally accessible by the computer system; and
an index searching component configured to identify one or more indexed content items based on a search query and the metadata stored within the content index.
Dependent claims: 18–21
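The indexing and searching components of claim 17 can be sketched as a small local index whose entries carry a cloud address for each remotely stored item. A minimal Python illustration; keyword extraction stands in for the claim's content analysis, and the class and field names are assumptions:

```python
class ContentIndex:
    """Local index over secondary-copy content items; each entry
    keeps generated metadata, including a cloud address from which
    the item itself can be retrieved."""

    def __init__(self):
        self.entries = []

    def index_item(self, name, summary, body, cloud_address):
        # Stand-in "analysis": keyword extraction over summary + body.
        keywords = set((summary + " " + body).lower().split())
        self.entries.append({"name": name,
                             "keywords": keywords,
                             "address": cloud_address})

    def search(self, query):
        terms = set(query.lower().split())
        return [e["name"] for e in self.entries
                if terms & e["keywords"]]
```

Because the index lives locally while the items live at the cloud site, a query never touches the wide-area link; only retrieval of a matched item does.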
22. A computer-implemented method for copying multiple files at a cloud storage site, wherein the cloud storage site is coupled to a computer executing a file system for accessing a secondary storage computing device, the method comprising:
receiving a copy operation request to copy n number of files at the cloud storage site, wherein each of the n number of files includes metadata and data, and wherein the n number of files exceeds a threshold;
establishing a container size reflecting one or more factors, wherein the factors include:
a latency associated with a network connection to the secondary storage computing device;
a bandwidth associated with a network connection to the secondary storage computing device;
whether the cloud storage site imposes a restriction on a namespace associated with the computer or the file system;
whether the cloud storage site permits sparsification of data files;
a pricing structure associated with the cloud storage site;
a maximum specified container file size; and
a minimum specified container file size;
processing the n number of files by:
copying the metadata of each of the n number of files to a first container;
copying at least a portion of the data for the n number of files into a second container, wherein the second container is separate from the first container; and
updating a data structure, wherein the data structure:
tracks, for each of the n number of files, a location of the metadata for that file in the first container, and
tracks, for the at least a portion of the data for the n number of files, a location of the data in the second container,
and wherein the size of at least one of the first and second containers is no greater than the established container size.
Dependent claims: 23, 24
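The processing step of claim 22 splits each file's metadata and data into separate containers while a data structure tracks both locations. A minimal Python sketch; the container labels "c1"/"c2" and the (container, offset, length) record layout are assumptions for illustration:

```python
def pack_files(files, container_size):
    """Copy each file's metadata into a first container and its data
    into a second, recording (container, offset, length) per file.

    `files` maps file name -> (metadata bytes, data bytes)."""
    meta_container, data_container = b"", b""
    locations = {}
    for name, (meta, data) in files.items():
        locations[name] = {
            "meta": ("c1", len(meta_container), len(meta)),
            "data": ("c2", len(data_container), len(data)),
        }
        meta_container += meta
        data_container += data
        if max(len(meta_container), len(data_container)) > container_size:
            raise ValueError("established container size exceeded")
    return meta_container, data_container, locations
```

Grouping all metadata into one object keeps the cloud-side namespace small: n files cost two stored objects plus a lookup table rather than n objects.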
25. A non-transitory computer-readable medium storing instructions that when executed by a processor perform a method of deduplicating multiple data objects that is performed by one or more computing systems, each computing system including a processor and memory, the method comprising:
receiving an indication to perform a storage operation to store data to at least one cloud storage location;
receiving a set of data objects involved in the storage operation;
for at least some of the data objects in the set, by the one or more computing systems:
determining if an instance of the data object has already been stored at the at least one cloud storage location;
if an instance of the data object has already been stored, then:
determining the location of the instance of the data object; and
storing a reference to the location of the instance of the data object in a first file in a chunk folder, wherein the first file stores multiple references, each reference referring to a location of an instance of a data object, and wherein a reference may comprise a universal resource locator or logical address to the cloud storage location; and
if an instance of the data object has not already been stored, then storing the data object in a second file in the chunk folder, wherein the second file stores only a single instance of each data object; and
instructing the storage of the first and second files at the cloud storage location.
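The chunk-folder layout of claim 25 pairs a single-instance data file with a reference file. A minimal Python sketch that models both as in-memory structures; the SHA-256 lookup and the logical-address string format are assumptions, not details from the patent:

```python
import hashlib

def build_chunk_folder(objects):
    """Write unique objects into a single-instance 'data' file and,
    for every incoming object (duplicate or not), append a logical
    address to a list of references."""
    data_file = b""
    refs = []    # one reference per incoming object
    stored = {}  # digest -> (offset, length) in data_file
    for obj in objects:
        digest = hashlib.sha256(obj).hexdigest()
        if digest not in stored:
            # First instance: append to the single-instance file.
            stored[digest] = (len(data_file), len(obj))
            data_file += obj
        offset, length = stored[digest]
        # Hypothetical logical-address form for the reference entry.
        refs.append(f"chunk0/data?offset={offset}&len={length}")
    return data_file, refs
```

Both structures would then be uploaded together, so the cloud site stores each object's bytes once while the reference file preserves the original object sequence.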
26. A method of pruning files containing data that is performed by one or more computing systems, each computing system including a processor and memory, the method comprising:
receiving an indication to delete a first file, wherein the first file includes a first set of data, and wherein the first file is stored at a cloud storage location;
determining, by the one or more computing systems, if the first set of data references a second set of data included in a second file located at the cloud storage location;
if the first set of data references the second set of data, then:
causing to be deleted any references to the second set of data by the first set of data at the cloud storage location; and
causing to be deleted the second file at the cloud storage location;
determining, by the one or more computing systems, if the first set of data is referenced by at least a third set of data included in a third file at the cloud storage location; and
if the first set of data is referenced by at least the third set of data, then:
deleting any references to the first set of data by the third set of data at the cloud storage location; and
storing an indication to delete the first file at the cloud storage location.
Dependent claims: 27
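The pruning steps of claim 26 can be sketched as a reference-graph walk: remove the target's outgoing references (deleting what they point to), strip any incoming references, and defer the target's own deletion while referrers exist. A minimal Python illustration; the dict-of-sets graph representation and the deferred-delete set are assumptions for the sketch:

```python
def prune(references, target):
    """Prune `target` from a reference graph.

    `references` maps file -> set of files it references; it is
    mutated in place. Returns (deleted_files, deferred_deletes)."""
    deleted, deferred = set(), set()
    # Outgoing: delete the files the target references.
    for dep in references.pop(target, set()):
        deleted.add(dep)
    # Incoming: strip references to the target from other files.
    referrers = [f for f, deps in references.items() if target in deps]
    for f in referrers:
        references[f].discard(target)
    if referrers:
        # Others referenced the target: store an indication to
        # delete it later rather than deleting it now.
        deferred.add(target)
    else:
        deleted.add(target)
    return deleted, deferred
```

Deferring the delete while referrers exist mirrors the claim's "storing an indication to delete the first file" rather than deleting it outright.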
Specification