PERFORMING DATA STORAGE OPERATIONS WITH A CLOUD ENVIRONMENT, INCLUDING CONTAINERIZED DEDUPLICATION, DATA PRUNING, AND DATA TRANSFER
First Claim
1. A method for storing, on a cloud storage site, a secondary copy of an original data set, the method comprising:
receiving a primary copy of an original data set;
updating a content index to reflect at least some of the data content in the original data set;
identifying a target cloud storage site on which to store a secondary copy of the original data set, wherein a network connection is to be established between the target cloud storage site and a media file system agent, and wherein the established network connection has an associated latency and bandwidth;
determining a size for a container file to utilize when deduplicating the primary copy of the original data set, wherein the container file size is determined based at least in part on the latency, bandwidth, or both, associated with the network connection to be established;
deduplicating at least some of the data content in the primary copy in order to create one or more container files containing deduplicated data, wherein at least one of the container files has the determined size;
establishing the network connection between the target cloud storage site and the media file system agent; and
transferring the one or more container files to the target cloud storage site.
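The claim above sizes container files from the latency and bandwidth of the connection that will carry them. A minimal sketch in Python (the document specifies no language) of one such sizing heuristic; the target-transfer window, the default bounds, and the function name are illustrative assumptions, not taken from the patent:

```python
def choose_container_size(latency_s, bandwidth_bps,
                          min_size=1 * 1024 * 1024,
                          max_size=512 * 1024 * 1024,
                          target_transfer_s=30.0):
    """Pick a container size large enough to amortize connection
    latency, but small enough to finish within a target window."""
    # Seconds of useful transfer time once the connection is up.
    usable_s = max(target_transfer_s - latency_s, 0.0)
    # Bytes deliverable in that window (bandwidth given in bits/s).
    size = int(bandwidth_bps / 8 * usable_s)
    # Clamp to the allowed container-size range.
    return max(min_size, min(size, max_size))
```

On a high-latency, low-bandwidth link this clamps to the minimum size, so each transfer still amortizes the round-trip cost; on a fast link it caps at the maximum so a single failed transfer does not lose too much work.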
Abstract
Systems and methods are disclosed for performing data storage operations, including content indexing, containerized deduplication, and policy-driven storage, within a cloud environment. The systems support a variety of clients and cloud storage sites that may connect over wide area networks, such as the Internet, which may have appreciable latency and/or packet loss, using various network protocols, including HTTP and FTP. Methods are disclosed for content indexing data stored within a cloud environment to facilitate later searching, including collaborative searching. Methods are also disclosed for performing containerized deduplication to reduce the strain on a system namespace, effectuate cost savings, and so forth. Methods are disclosed for identifying suitable storage locations, including suitable cloud storage sites, for data files subject to a storage policy. Further, systems and methods are disclosed for providing a cloud gateway and a scalable data object store within a cloud environment, along with other features.
523 Citations
27 Claims
1. A method for storing, on a cloud storage site, a secondary copy of an original data set, the method comprising:
receiving a primary copy of an original data set;
updating a content index to reflect at least some of the data content in the original data set;
identifying a target cloud storage site on which to store a secondary copy of the original data set, wherein a network connection is to be established between the target cloud storage site and a media file system agent, and wherein the established network connection has an associated latency and bandwidth;
determining a size for a container file to utilize when deduplicating the primary copy of the original data set, wherein the container file size is determined based at least in part on the latency, bandwidth, or both, associated with the network connection to be established;
deduplicating at least some of the data content in the primary copy in order to create one or more container files containing deduplicated data, wherein at least one of the container files has the determined size;
establishing the network connection between the target cloud storage site and the media file system agent; and
transferring the one or more container files to the target cloud storage site.
Dependent claims: 2–13
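The deduplication step of claim 1 can be sketched as single-instancing blocks while packing the unique ones into containers capped at the determined size. A minimal Python illustration; SHA-256 digests, the packing strategy, and the return shape are assumptions for the sketch, not details from the patent:

```python
import hashlib

def deduplicate_into_containers(blocks, container_size):
    """Single-instance each input block, packing unique blocks into
    container byte strings no larger than container_size."""
    containers, current = [], b""
    index = {}  # digest -> index of the container holding the block
    refs = []   # per input block: (container_index, digest)
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in index:
            # Close the current container if this block won't fit.
            if current and len(current) + len(block) > container_size:
                containers.append(current)
                current = b""
            index[digest] = len(containers)
            current += block
        refs.append((index[digest], digest))
    if current:
        containers.append(current)
    return containers, refs
```

Duplicate blocks contribute only a reference, so repeated content in the primary copy costs no extra container space before transfer.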
14. A system for storing, on a cloud storage site, a secondary copy of an original data set, the system comprising:
means for identifying a cloud storage site on which to store a secondary copy of a primary data set;
means for updating an index of content to reflect at least some data content in the primary data set;
means for deduplicating at least some of the data content in the primary data set;
means for creating one or more container files containing the deduplicated data; and
means for transferring the one or more container files to the cloud storage site.
15. A tangible computer-readable storage medium whose contents cause a data storage system to perform a method of migrating data from local primary storage to secondary storage located on a remote cloud storage site, the method comprising:
identifying no more than n−1 data blocks, located within the local primary storage, that satisfy a criterion, wherein the n−1 data blocks represent a portion of a data file consisting of n blocks and the n blocks contain data written by a file system associated with the local primary storage; and
determining a size for a container file in which to store some or all of the no more than n−1 data blocks;
transferring data contained by the identified no more than n−1 data blocks from the primary storage to the secondary storage located on a cloud storage site, wherein transferring data comprises writing data first to a container file of the determined size; and
updating an index with information associating the transferred data with information identifying blocks within the secondary storage that contain the transferred data, wherein the information includes at least one uniform resource locator or logical address that identifies at least one logical location from which the transferred data may be accessed.
Dependent claims: 16
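Claim 15 migrates only a proper subset of a file's blocks and records a locator per migrated block. A minimal Python sketch, assuming a changed-block flag as the selection criterion and a hypothetical URL scheme of container object plus byte offset (neither is specified by the patent):

```python
def migrate_blocks(blocks, changed, site_url):
    """Select only the blocks flagged as changed (at most n-1 of the
    n file blocks), pack them into one container, and record, per
    migrated block, a URL from which it can later be retrieved."""
    selected = [i for i, flag in enumerate(changed) if flag]
    container = b"".join(blocks[i] for i in selected)
    index, offset = {}, 0
    for i in selected:
        # Hypothetical locator: container object plus byte range.
        index[i] = f"{site_url}/container0?offset={offset}&len={len(blocks[i])}"
        offset += len(blocks[i])
    return container, index
```

The returned index plays the role of the claim's "information associating the transferred data" with its cloud location.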
17. A computer system for indexing and searching multiple content items, the computer system comprising:
a processor;
a memory;
a secondary copy component configured to select or access at least one secondary copy of the multiple content items, wherein the secondary copy of the multiple content items is a copy of the multiple content items and is not a primary copy of the multiple content items, wherein the primary copy is available to the computer system over a local area network, and wherein the at least one secondary copy is stored at a cloud storage site located geographically remote from the computer system;
a content indexing component configured to, for at least some of the multiple content items included in the secondary copy:
analyze content of a content item, including analyzing a summary of the content item as well as analyzing additional content of the content item;
based upon the analysis, generate metadata corresponding to the content item, wherein the metadata includes at least a logical address to the cloud storage site for accessing the content item; and
store in a content index the generated metadata of the content, wherein the content index is not stored at the cloud storage site, but is locally accessible by the computer system; and
an index searching component configured to identify one or more indexed content items based on a search query and the metadata stored within the content index.
Dependent claims: 18–21
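The indexing and searching components of claim 17 can be sketched as a small local index whose entries carry a cloud address for each remotely stored item. A minimal Python illustration; keyword extraction stands in for the claim's content analysis, and the class and field names are assumptions:

```python
class ContentIndex:
    """Local index over secondary-copy content items; each entry
    keeps generated metadata, including a cloud address from which
    the item itself can be retrieved."""

    def __init__(self):
        self.entries = []

    def index_item(self, name, summary, body, cloud_address):
        # Stand-in "analysis": keyword extraction over summary + body.
        keywords = set((summary + " " + body).lower().split())
        self.entries.append({"name": name,
                             "keywords": keywords,
                             "address": cloud_address})

    def search(self, query):
        terms = set(query.lower().split())
        return [e["name"] for e in self.entries
                if terms & e["keywords"]]
```

Because the index lives locally while the items live at the cloud site, a query never touches the wide-area link; only retrieval of a matched item does.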
22. A computer-implemented method for copying multiple files at a cloud storage site, wherein the cloud storage site is coupled to a computer executing a file system for accessing a secondary storage computing device, the method comprising:
receiving a copy operation request to copy n number of files at the cloud storage site, wherein each of the n number of files includes metadata and data, and wherein the n number of files exceeds a threshold;
establishing a container size reflecting one or more factors, wherein the factors include:
a latency associated with a network connection to the secondary storage computing device;
a bandwidth associated with a network connection to the secondary storage computing device;
whether the cloud storage site imposes a restriction on a namespace associated with the computer or the file system;
whether the cloud storage site permits sparsification of data files;
a pricing structure associated with the cloud storage site;
a maximum specified container file size; and
a minimum specified container file size;
processing the n number of files by:
copying the metadata of each of the n number of files to a first container;
copying at least a portion of the data for the n number of files into a second container, wherein the second container is separate from the first container; and
updating a data structure, wherein the data structure:
tracks, for each of the n number of files, a location of the metadata for that file in the first container, and
tracks, for the at least a portion of the data for the n number of files, a location of the data in the second container,
and wherein the size of at least one of the first and second containers is no greater than the established container size.
Dependent claims: 23, 24
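The processing step of claim 22 splits each file's metadata and data into separate containers while a data structure tracks both locations. A minimal Python sketch; the container labels "c1"/"c2" and the (container, offset, length) record layout are assumptions for illustration:

```python
def pack_files(files, container_size):
    """Copy each file's metadata into a first container and its data
    into a second, recording (container, offset, length) per file.

    `files` maps file name -> (metadata bytes, data bytes)."""
    meta_container, data_container = b"", b""
    locations = {}
    for name, (meta, data) in files.items():
        locations[name] = {
            "meta": ("c1", len(meta_container), len(meta)),
            "data": ("c2", len(data_container), len(data)),
        }
        meta_container += meta
        data_container += data
        if max(len(meta_container), len(data_container)) > container_size:
            raise ValueError("established container size exceeded")
    return meta_container, data_container, locations
```

Grouping all metadata into one object keeps the cloud-side namespace small: n files cost two stored objects plus a lookup table rather than n objects.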
25. A non-transitory computer-readable medium storing instructions that when executed by a processor perform a method of deduplicating multiple data objects that is performed by one or more computing systems, each computing system including a processor and memory, the method comprising:
receiving an indication to perform a storage operation to store data to at least one cloud storage location;
receiving a set of data objects involved in the storage operation;
for at least some of the data objects in the set, by the one or more computing systems:
determining if an instance of the data object has already been stored at the at least one cloud storage location;
if an instance of the data object has already been stored, then:
determining the location of the instance of the data object; and
storing a reference to the location of the instance of the data object in a first file in a chunk folder, wherein the first file stores multiple references, each reference referring to a location of an instance of a data object, and wherein a reference may comprise a universal resource locator or logical address to the cloud storage location; and
if an instance of the data object has not already been stored, then storing the data object in a second file in the chunk folder, wherein the second file stores only a single instance of each data object; and
instructing the storage of the first and second files at the cloud storage location.
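The chunk-folder layout of claim 25 pairs a single-instance data file with a reference file. A minimal Python sketch that models both as in-memory structures; the SHA-256 lookup and the logical-address string format are assumptions, not details from the patent:

```python
import hashlib

def build_chunk_folder(objects):
    """Write unique objects into a single-instance 'data' file and,
    for every incoming object (duplicate or not), append a logical
    address to a list of references."""
    data_file = b""
    refs = []    # one reference per incoming object
    stored = {}  # digest -> (offset, length) in data_file
    for obj in objects:
        digest = hashlib.sha256(obj).hexdigest()
        if digest not in stored:
            # First instance: append to the single-instance file.
            stored[digest] = (len(data_file), len(obj))
            data_file += obj
        offset, length = stored[digest]
        # Hypothetical logical-address form for the reference entry.
        refs.append(f"chunk0/data?offset={offset}&len={length}")
    return data_file, refs
```

Both structures would then be uploaded together, so the cloud site stores each object's bytes once while the reference file preserves the original object sequence.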
26. A method of pruning files containing data that is performed by one or more computing systems, each computing system including a processor and memory, the method comprising:
receiving an indication to delete a first file, wherein the first file includes a first set of data, and wherein the first file is stored at a cloud storage location;
determining, by the one or more computing systems, if the first set of data references a second set of data included in a second file located at the cloud storage location;
if the first set of data references the second set of data, then:
causing to be deleted any references to the second set of data by the first set of data at the cloud storage location; and
causing to be deleted the second file at the cloud storage location;
determining, by the one or more computing systems, if the first set of data is referenced by at least a third set of data included in a third file at the cloud storage location; and
if the first set of data is referenced by at least the third set of data, then:
deleting any references to the first set of data by the third set of data at the cloud storage location; and
storing an indication to delete the first file at the cloud storage location.
Dependent claims: 27
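The pruning steps of claim 26 can be sketched as a reference-graph walk: remove the target's outgoing references (deleting what they point to), strip any incoming references, and defer the target's own deletion while referrers exist. A minimal Python illustration; the dict-of-sets graph representation and the deferred-delete set are assumptions for the sketch:

```python
def prune(references, target):
    """Prune `target` from a reference graph.

    `references` maps file -> set of files it references; it is
    mutated in place. Returns (deleted_files, deferred_deletes)."""
    deleted, deferred = set(), set()
    # Outgoing: delete the files the target references.
    for dep in references.pop(target, set()):
        deleted.add(dep)
    # Incoming: strip references to the target from other files.
    referrers = [f for f, deps in references.items() if target in deps]
    for f in referrers:
        references[f].discard(target)
    if referrers:
        # Others referenced the target: store an indication to
        # delete it later rather than deleting it now.
        deferred.add(target)
    else:
        deleted.add(target)
    return deleted, deferred
```

Deferring the delete while referrers exist mirrors the claim's "storing an indication to delete the first file" rather than deleting it outright.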
Specification