REDUCING DATA DUPLICATION IN CLOUD STORAGE

US 20120136834A1
Filed: 11/29/2010
Published: 05/31/2012
Est. Priority Date: 11/29/2010
Status: Active Grant

Abstract

Data duplication may be reduced in cloud storage. First snapshots of one or more remote volumes may be received via a network. The first snapshots may be copies of the one or more remote volumes at a first instant in time. Responsive to and/or based on the first snapshots, unique clusters and duplicate clusters may be identified among the valid clusters of the remote volumes. The unique clusters and single instances of the duplicate clusters may be stored in a backup file, such that the backup file is devoid of duplicate clusters. Second snapshots of the one or more remote volumes may be received via the network. The second snapshots may be copies of the one or more remote volumes at a second instant in time, wherein the second instant in time is after the first instant in time. Responsive to the second snapshots, the clusters in the backup file that are no longer valid may be utilized to store the valid clusters in the one or more remote volumes not yet stored in the backup file.
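
The flow described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation. The `BackupFile` class, its `ingest` method, the choice of SHA-256, the representation of a snapshot as a list of byte strings with `None` marking an invalid cluster, and the rule that a stored cluster becomes "no longer valid" once it is absent from the latest snapshot are all assumptions made for the example.

```python
import hashlib


class BackupFile:
    """Hypothetical in-memory stand-in for the backup file (a list of slots)."""

    def __init__(self):
        self.slots = []   # slot index -> cluster bytes
        self.index = {}   # cluster hash -> slot index
        self.free = []    # slots holding clusters that are no longer valid

    def ingest(self, snapshot):
        """Store a snapshot given as a list of clusters (None = invalid)."""
        # Collapse duplicate valid clusters to a single instance per hash.
        digests = {
            hashlib.sha256(c).hexdigest(): c for c in snapshot if c is not None
        }
        # Assumption: a stored cluster absent from this snapshot is no longer valid.
        for digest in [d for d in self.index if d not in digests]:
            self.free.append(self.index.pop(digest))
        # Valid clusters not yet stored reuse freed slots before the file grows.
        for digest, cluster in digests.items():
            if digest in self.index:
                continue
            if self.free:
                slot = self.free.pop()
                self.slots[slot] = cluster
            else:
                slot = len(self.slots)
                self.slots.append(cluster)
            self.index[digest] = slot


backup = BackupFile()
backup.ingest([b"A", b"B", b"A", None])   # first snapshot
backup.ingest([b"A", b"C", None, None])   # second snapshot
```

In this run the first snapshot stores one copy each of `A` and `B`. The second snapshot frees the slot that held `B` and writes `C` into it, so the backup file does not grow.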

22 Claims

1. A method for reducing data duplication in cloud storage, the method comprising:
- receiving at least one first snapshot of one or more remote volumes via a network, the at least one first snapshot including at least one copy of the one or more remote volumes at a first instant in time, individual ones of the one or more remote volumes including a plurality of clusters, individual ones of the plurality of clusters being identified as valid or invalid, valid clusters containing data to be backed up, and invalid clusters being devoid of data to be backed up;
  
  identifying, responsive to and based on the at least one first snapshot, unique clusters and duplicate clusters among the valid clusters, the duplicate clusters being valid clusters in the one or more remote volumes containing identical data;
  
  storing, in a backup file, the unique clusters and single instances of the duplicate clusters such that the backup file is devoid of duplicate clusters;
  
  receiving at least one second snapshot of the one or more remote volumes via the network, the at least one second snapshot including at least one copy of the one or more remote volumes at a second instant in time, the second instant in time being after the first instant in time;
  
  identifying, responsive to and based on the at least one second snapshot, valid clusters in the one or more remote volumes not yet stored in the backup file and clusters in the backup file that are no longer valid; and
  
  utilizing, responsive to the at least one second snapshot, the clusters in the backup file that are no longer valid to store the valid clusters in the one or more remote volumes not yet stored in the backup file.
  - 2. The method of claim 1, further comprising recording, in a map file associated with a given remote volume, identifiers associated with individual ones of the plurality of clusters included in the given remote volume, the sequence of the identifiers in the map file corresponding to the sequence of the associated clusters in the given remote volume, identifiers associated with valid clusters being assigned incrementally increasing values along the map file, and identifiers associated with invalid clusters being assigned a static value.
  - 3. The method of claim 2, further comprising updating the map file responsive to identification of the duplicate clusters such that identifiers in the map file that are associated with duplicate clusters are assigned the same value.
  - 4. The method of claim 2, further comprising compressing the map file.
  - 5. The method of claim 2, wherein the map file associated with the given remote volume has the same number of units as the number of clusters in the given remote volume.
  - 6. The method of claim 1, wherein identifying the duplicate clusters based on the at least one first snapshot includes determining hash values for individual ones of the valid clusters such that valid clusters having identical hash values are identified as duplicate clusters.
  - 7. The method of claim 6, wherein a dynamic multilevel index is used to determine the hash values.
  - 8. The method of claim 7, wherein the dynamic multilevel index is a B+ tree.
  - 9. The method of claim 6, further comprising storing the hash values in an index file.
  - 10. The method of claim 1, further comprising:
    - receiving at least one third snapshot of the one or more remote volumes via the network, the at least one third snapshot including at least one copy of the one or more remote volumes at a third instant in time, the third instant in time being after the second instant in time;
      
      identifying, responsive to and based on the at least one third snapshot, valid clusters in the one or more remote volumes not yet stored in the backup file and clusters in the backup file that are no longer valid; and
      
      utilizing, responsive to the at least one third snapshot, the clusters in the backup file that are no longer valid to store the valid clusters in the one or more remote volumes not yet stored in the backup file.
  - 11. The method of claim 1, further comprising managing a purge file configured to track clusters in the backup file that are no longer valid such that the clusters in the backup file that are no longer valid are available for storing valid clusters from the one or more remote volumes.
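
The map file of claims 2 through 5 can be illustrated with a small Python sketch. It is a hedged example under stated assumptions, not the format used by the patent: the function names, the static value `0` for invalid clusters, SHA-256 for detecting identical clusters, the 32-bit little-endian encoding, and zlib as the compressor are all hypothetical choices.

```python
import hashlib
import struct
import zlib

INVALID_ID = 0  # hypothetical static value assigned to invalid clusters


def initial_map(clusters):
    """One identifier per cluster, in volume order (None = invalid cluster)."""
    ids, next_id = [], 1
    for cluster in clusters:
        if cluster is None:
            ids.append(INVALID_ID)
        else:
            ids.append(next_id)   # valid clusters get increasing values
            next_id += 1
    return ids


def merge_duplicates(clusters, ids):
    """Give every duplicate cluster the identifier of its first occurrence."""
    first_seen = {}   # cluster hash -> identifier of first occurrence
    merged = []
    for cluster, ident in zip(clusters, ids):
        if cluster is None:
            merged.append(ident)
            continue
        digest = hashlib.sha256(cluster).digest()
        merged.append(first_seen.setdefault(digest, ident))
    return merged


def compress_map(ids):
    """Pack identifiers as unsigned 32-bit integers and compress them."""
    return zlib.compress(struct.pack(f"<{len(ids)}I", *ids))


def decompress_map(blob):
    """Inverse of compress_map."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


volume = [b"A", None, b"B", b"A"]
ids = merge_duplicates(volume, initial_map(volume))
blob = compress_map(ids)
```

For the four-cluster volume above, the map keeps one entry per cluster, and the two `A` clusters end up sharing an identifier. Long runs of the static invalid value are what make such a map compress well.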

12. A system for reducing data duplication in cloud storage, the system comprising:
- one or more processors configured to execute computer program modules, the computer program modules comprising:
  
  a snapshot retrieval module configured to receive at least one first snapshot of one or more remote volumes via a network, the at least one first snapshot including at least one copy of the one or more remote volumes at a first instant in time, individual ones of the one or more remote volumes including a plurality of clusters, individual ones of the plurality of clusters being identified as valid or invalid, valid clusters containing data to be backed up, and invalid clusters being devoid of data to be backed up;
  
  a cluster identification module configured to identify, responsive to and based on the at least one first snapshot, unique clusters and duplicate clusters among the valid clusters, the duplicate clusters being valid clusters in the one or more remote volumes containing identical data; and
  
  a backup module configured to store, in a backup file, the unique clusters and single instances of the duplicate clusters such that the backup file is devoid of duplicate clusters;
  
  wherein the snapshot retrieval module is further configured to receive at least one second snapshot of the one or more remote volumes via the network, the at least one second snapshot including at least one copy of the one or more remote volumes at a second instant in time, the second instant in time being after the first instant in time;
  
  wherein the cluster identification module is further configured to identify, responsive to and based on the at least one second snapshot, valid clusters in the one or more remote volumes not yet stored in the backup file and clusters in the backup file that are no longer valid; and
  
  wherein the backup module is further configured to utilize, responsive to the at least one second snapshot, the clusters in the backup file that are no longer valid to store the valid clusters in the one or more remote volumes not yet stored in the backup file.
  - 13. The system of claim 12, wherein the computer program modules further comprise a mapping module configured to record, in a map file associated with a given remote volume, identifiers associated with individual ones of the plurality of clusters included in the given remote volume, the sequence of the identifiers in the map file corresponding to the sequence of the associated clusters in the given remote volume, identifiers associated with valid clusters being assigned incrementally increasing values along the map file, and identifiers associated with invalid clusters being assigned a static value.
  - 14. The system of claim 13, wherein the mapping module is further configured to update the map file responsive to identification of the duplicate clusters such that identifiers in the map file that are associated with duplicate clusters are assigned the same value.
  - 15. The system of claim 13, wherein the mapping module is further configured to compress the map file.
  - 16. The system of claim 13, wherein the map file associated with the given remote volume has the same number of units as the number of clusters in the given remote volume.
  - 17. The system of claim 12, wherein the cluster identification module is configured to identify the duplicate clusters based on the at least one first snapshot by determining hash values for individual ones of the valid clusters such that valid clusters having identical hash values are identified as duplicate clusters.
  - 18. The system of claim 17, wherein the cluster identification module utilizes a dynamic multilevel index to determine the hash values.
  - 19. The system of claim 18, wherein the dynamic multilevel index is a B+ tree.
  - 20. The system of claim 17, wherein the cluster identification module is further configured to store the hash values in an index file.
  - 21. The system of claim 12, wherein the backup module is further configured to manage a purge file configured to track clusters in the backup file that are no longer valid such that the clusters in the backup file that are no longer valid are available for storing valid clusters from the one or more remote volumes.
  - 22. The system of claim 12, wherein the snapshot retrieval module is further configured to receive at least one third snapshot of the one or more remote volumes via the network, the at least one third snapshot including at least one copy of the one or more remote volumes at a third instant in time, the third instant in time being after the second instant in time;
    - wherein the cluster identification module is further configured to identify, responsive to and based on the at least one third snapshot, valid clusters in the one or more remote volumes not yet stored in the backup file and clusters in the backup file that are no longer valid; and
      
      wherein the backup module is further configured to utilize, responsive to the at least one third snapshot, the clusters in the backup file that are no longer valid to store the valid clusters in the one or more remote volumes not yet stored in the backup file.
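
The purge file of claims 11 and 21 can be pictured as a small persistent record of reusable slots in the backup file. The sketch below is one possible reading, not the patent's own design: the `PurgeFile` class, its `release` and `claim` methods, the JSON encoding, and the first-in, first-out reuse order are all assumptions introduced for illustration.

```python
import json
import os
import tempfile


class PurgeFile:
    """Hypothetical on-disk list of backup-file slots that may be overwritten."""

    def __init__(self, path):
        self.path = path
        self.slots = []
        if os.path.exists(path):
            with open(path) as fh:
                self.slots = json.load(fh)

    def _save(self):
        with open(self.path, "w") as fh:
            json.dump(self.slots, fh)

    def release(self, slot):
        """Record that the cluster stored in `slot` is no longer valid."""
        if slot not in self.slots:
            self.slots.append(slot)
            self._save()

    def claim(self):
        """Return a slot to overwrite, or None if the backup file must grow."""
        if not self.slots:
            return None
        slot = self.slots.pop(0)
        self._save()
        return slot


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "purge.json")
    purge = PurgeFile(path)
    purge.release(3)
    purge.release(7)
    reopened = PurgeFile(path)   # state survives a restart
    first = reopened.claim()
```

Persisting the list on every change keeps the set of reusable slots available across backup runs, at the cost of one small write per update.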

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Computer Associates Think Inc. (Broadcom, Inc.)
Inventors: ZHAO, Hui
Granted Patent: US 8,583,599 B2
US Class Current: 707/649
CPC Class Codes:
- G06F 11/1453   using de-duplication of the...
- G06F 11/1464   for networked environments
- G06F 2201/84   Using snapshots, i.e. a log...
