Eliminating duplicate data by sharing file system extents
First Claim
1. A method performed by a computer system, comprising:
receiving a data set written to a virtual storage device;
partitioning the received data set into multiple sections of data, the partitioning including identifying at least one anchor point surrounded by a data region identified for potential deduplication within the received data set for partitioning into one or more variable sized sections of data, and partitioning remaining portions of the received data set that do not identify an anchor into one or more fixed sized sections of data;
for each section of data, determining whether the section of data includes the duplicate data, wherein the duplicate data is data that was previously written to another virtual storage device and stored on a physical storage device at a storage location;
in response to determining that a section of data includes the duplicate data, performing deduplication in relation to the duplicate data; and
generating a descriptor, for the received data set, that identifies the storage location of each section of data previously written to another virtual storage device and stored on the physical storage device, rather than storing the corresponding duplicate data at another storage location on the physical storage device.
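The steps recited in this claim can be illustrated with a small sketch. It is one possible reading, not the patented implementation: the anchor pattern `ANCHOR`, the region width `REGION`, the fixed section size `FIXED`, the SHA-256 fingerprint lookup, and the `CommonStore` class are all assumptions introduced for illustration. The region around each anchor becomes a variable sized section, the remaining data is cut into fixed sized sections, and the descriptor records a storage location for every section instead of storing duplicates again.

```python
import hashlib
from typing import Dict, List, Tuple

ANCHOR = b"HDR"  # hypothetical anchor pattern (assumption, not from the claim)
REGION = 2       # assumed number of bytes kept on each side of an anchor
FIXED = 4        # assumed fixed section size, tiny for demonstration


def partition(data: bytes) -> List[Tuple[int, bytes]]:
    """Split data into (offset, section) pairs that cover it exactly once."""
    sections: List[Tuple[int, bytes]] = []
    pos = 0
    hit = data.find(ANCHOR)
    while hit != -1:
        # Region surrounding the anchor, clipped to the unprocessed data.
        lo = max(pos, hit - REGION)
        hi = min(len(data), hit + len(ANCHOR) + REGION)
        # Data before the region holds no anchor: fixed sized sections
        # (the last piece of a gap may be shorter than FIXED).
        for off in range(pos, lo, FIXED):
            sections.append((off, data[off:min(off + FIXED, lo)]))
        # The anchored region itself is one variable sized section.
        sections.append((lo, data[lo:hi]))
        pos = hi
        hit = data.find(ANCHOR, pos)
    # Tail after the last anchored region: fixed sized sections.
    for off in range(pos, len(data), FIXED):
        sections.append((off, data[off:off + FIXED]))
    return sections


class CommonStore:
    """Hypothetical physical store shared by all virtual storage devices."""

    def __init__(self) -> None:
        self.blocks: List[bytes] = []      # stored sections, indexed by location
        self.index: Dict[bytes, int] = {}  # fingerprint -> storage location

    def write(self, data: bytes) -> List[int]:
        """Store a data set and return its descriptor (list of locations)."""
        descriptor: List[int] = []
        for _, section in partition(data):
            fingerprint = hashlib.sha256(section).digest()
            location = self.index.get(fingerprint)
            if location is None:
                # New data: store it once and remember where it lives.
                location = len(self.blocks)
                self.blocks.append(section)
                self.index[fingerprint] = location
            # Duplicate data: only the existing location is recorded.
            descriptor.append(location)
        return descriptor

    def read(self, descriptor: List[int]) -> bytes:
        """Rebuild a data set from its descriptor."""
        return b"".join(self.blocks[location] for location in descriptor)
```

Writing `b"aaaabbbb12HDR34ccccdddd"` and then `b"zz12HDR34ccccdddd"` to the same `CommonStore` shows why the anchor matters in this sketch: the two-byte prefix shifts the second data set, yet the anchored region and the sections after it still line up with stored data, so only the prefix is stored as new data.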
Abstract
A hardware and/or software facility to enable emulated storage devices to share data stored on physical storage resources of a storage system. The facility may be implemented on a virtual tape library (VTL) system configured to back up data sets that have a high level of redundancy on multiple virtual tapes. The facility organizes all or a portion of the physical storage resources according to a common store data layout. By enabling emulated storage devices to share data stored on physical storage resources, the facility enables deduplication across the emulated storage devices irrespective of the emulated storage device to which the data is or was originally written, thereby eliminating duplicate data on the physical storage resources and improving the storage consumption of the emulated storage devices on the physical storage resources.
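The effect described in the abstract, several emulated storage devices sharing one copy of identical data, can be sketched as follows. This is a minimal illustration under stated assumptions: `SharedStore`, `VirtualTape`, the fixed `SECTION` size, and the SHA-256 keys are hypothetical names and choices, not details taken from the patent.

```python
import hashlib
from typing import Dict, List

SECTION = 4  # assumed fixed section size in bytes, tiny for demonstration


class SharedStore:
    """Hypothetical common store: one copy of each distinct section."""

    def __init__(self) -> None:
        self.sections: Dict[bytes, bytes] = {}  # fingerprint -> stored section

    def put(self, section: bytes) -> bytes:
        """Store the section if it is new and return its fingerprint."""
        key = hashlib.sha256(section).digest()
        self.sections.setdefault(key, section)
        return key

    def physical_bytes(self) -> int:
        """Bytes actually consumed on the physical storage resources."""
        return sum(len(s) for s in self.sections.values())


class VirtualTape:
    """Hypothetical emulated device that keeps references, not data."""

    def __init__(self, store: SharedStore) -> None:
        self.store = store
        self.refs: List[bytes] = []

    def write(self, data: bytes) -> None:
        for off in range(0, len(data), SECTION):
            self.refs.append(self.store.put(data[off:off + SECTION]))

    def read(self) -> bytes:
        return b"".join(self.store.sections[key] for key in self.refs)

    def logical_bytes(self) -> int:
        """Bytes the tape appears to hold from the host's point of view."""
        return sum(len(self.store.sections[key]) for key in self.refs)
```

If two tapes backed by the same `SharedStore` receive `b"AAAABBBBCCCC"` and `b"AAAABBBBDDDD"`, each tape reads back its own data in full, while the store holds the shared `AAAA` and `BBBB` sections only once, regardless of which tape wrote them first.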
19 Claims
1. A method performed by a computer system, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A virtual tape library (VTL) system comprising:
a processor;
a component coupled to the processor to:
receive a data set written to a virtual storage device;
partition the received data set into multiple sections of data, the partitioning including identifying at least one anchor point surrounded by a data region identified for potential deduplication within the received data set for partitioning into one or more variable sized sections of data, and partitioning remaining portions of the received data set that do not identify an anchor into one or more fixed sized sections of data;
for each section of data, determine whether the section of data includes the duplicate data, wherein the duplicate data is data that was previously written to another virtual storage device and stored on a physical storage device at a storage location;
in response to determining that a section of data includes the duplicate data, perform deduplication in relation to the duplicate data; and
generate a descriptor, for the received data set, that identifies the storage location of each section of data previously written to another virtual storage device and stored on the physical storage device, rather than storing the corresponding duplicate data at another storage location on the physical storage device.
Dependent claims: 9, 10, 11, 12, 13
14. A system comprising:
a physical storage device to store a data set written to a virtual storage device; and
the virtual storage device coupled to the physical storage device to facilitate backup operations associated with the data set, the virtual storage device configured to:
receive a data set written to a virtual storage device;
partition the received data set into multiple sections of data, the partitioning including to identify at least one anchor point surrounded by a data region identified for potential deduplication within the received data set for partitioning into one or more variable sized sections of data, and to partition remaining portions of the received data set that do not identify an anchor into one or more fixed sized sections of data;
for each section of data, determine whether the section of data includes the duplicate data, wherein the duplicate data is data that was previously written to another virtual storage device and stored on the physical storage device at a storage location;
in response to determining that a section of data includes the duplicate data, to perform deduplication in relation to the duplicate data; and
generate a descriptor, for the received data set, that identifies the storage location of each section of data previously written to another virtual storage device and stored on the physical storage device, rather than storing the corresponding duplicate data at another storage location on the physical storage device.
Dependent claims: 15, 16, 17, 18, 19
Specification