Efficient content meta-data collection and trace generation from deduplicated storage
Abstract
The method and apparatus collect file recipes from deduplicated data storage systems; each file recipe consists of a list of fingerprints of the data chunks of a file. Detailed meta-data for each unique data chunk is also collected. In an offline process, research and analysis can be performed either on the meta-data itself or on a full meta-data trace reconstructed by matching recipe fingerprints to the corresponding meta-data. The method and system can generate the full meta-data trace efficiently in an online or offline process. Because typical deduplicated storage systems achieve deduplication rates of 10× or higher, collecting the meta-data is faster than processing all of the original files, and the resulting meta-data is compact to store.
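The trace reconstruction described in the abstract can be illustrated with a minimal Python sketch. The data shapes here are assumptions made for illustration only: a recipe is modeled as an ordered list of fingerprint strings, and per-chunk meta-data as a dictionary holding just a hypothetical `size` field. The function name `reconstruct_trace` is likewise hypothetical and not taken from the patent.

```python
from typing import Dict, Iterator, List, Tuple

# Assumed shapes (not specified by the source):
#   recipes:    file id -> ordered list of chunk fingerprints
#   chunk_meta: fingerprint -> meta-data for that unique chunk
Recipes = Dict[str, List[str]]
ChunkMeta = Dict[str, Dict[str, int]]


def reconstruct_trace(
    recipes: Recipes, chunk_meta: ChunkMeta
) -> Iterator[Tuple[str, int, str, int]]:
    """Yield (file id, offset, fingerprint, size) for every chunk reference.

    Each fingerprint in a recipe is matched to the meta-data stored once
    per unique chunk, so duplicate chunks are expanded back into a full
    per-file trace without touching the original file contents.
    """
    for file_id, fingerprints in recipes.items():
        offset = 0
        for fp in fingerprints:
            size = chunk_meta[fp]["size"]
            yield (file_id, offset, fp, size)
            offset += size


# Toy data: chunk "a" appears twice in f1, chunk "b" is shared by f1 and f2.
recipes = {"f1": ["a", "b", "a"], "f2": ["b"]}
chunk_meta = {"a": {"size": 4096}, "b": {"size": 8192}}

trace = list(reconstruct_trace(recipes, chunk_meta))
logical_bytes = sum(size for _, _, _, size in trace)
unique_bytes = sum(meta["size"] for meta in chunk_meta.values())
dedup_ratio = logical_bytes / unique_bytes
```

In this toy data set the trace has four entries although only two unique chunks were described, which is the reason the collected meta-data stays compact relative to the full trace.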
17 Claims
1. A computer-implemented method for collecting meta-data from a deduplication data storage system, the method comprising:
collecting a set of file recipes for a set of files stored in the deduplication data storage system, each file recipe in the set of file recipes including a fingerprint for each unique data chunk that constitutes a file, wherein each fingerprint identifies each corresponding unique data chunk;
collecting meta-data for a set of unique data chunks for the collected set of files by a data collection engine, wherein the meta-data describes the unique data chunks;
anonymizing the collected set of file recipes and the meta-data by an anonymizing engine; and
storing the anonymized set of file recipes and the anonymized meta-data in a data collection storage unit for content data set analysis without the content data set.
(Dependent claims: 2, 3, 4, 5, 6)
7. A non-transitory computer-readable storage medium having instructions stored therein, which when executed by a computer, cause the computer to perform a method, the method for collecting meta-data from a deduplication data storage system, the method comprising:
collecting a set of file recipes for a set of files stored in the deduplication data storage system, each file recipe in the set of file recipes including a fingerprint for each unique data chunk that constitutes a file, wherein each fingerprint identifies each corresponding unique data chunk;
collecting meta-data for a set of unique data chunks for the collected set of files by a data collection engine, wherein the meta-data describes the unique data chunks;
anonymizing the collected set of file recipes and the meta-data by an anonymizing engine; and
storing the anonymized set of file recipes and the anonymized meta-data in a data collection storage unit for content data set analysis without the content data set.
(Dependent claims: 8, 9, 10, 11, 12)
13. A deduplication storage system, comprising:
a processing system configured to execute (1) a data collection engine to collect a set of file recipes for a set of files stored in the deduplication data storage system, each file recipe in the set of file recipes including a fingerprint for each unique data chunk that constitutes a file, wherein each fingerprint identifies each corresponding unique data chunk, the data collection engine to collect meta-data for a set of unique data chunks for the collected set of files, wherein the meta-data describes the unique data chunks, and (2) an anonymizing engine that is communicatively coupled to the data collection engine, the anonymizing engine to anonymize the collected set of file recipes and the meta-data; and
a data collection storage unit communicatively coupled to the processing system and the data collection engine, the data collection storage unit to store the anonymized set of file recipes and the anonymized meta-data for content data set analysis without the content data set.
(Dependent claims: 14, 15, 16, 17)
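The collect, anonymize, and store steps recited in the independent claims can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the claims do not say how anonymization is performed, so the sketch swaps in a keyed hash (HMAC-SHA256) for fingerprints and sequential placeholders for file names. The function names `anonymize_fingerprint`, `anonymize_collection`, and `store`, the `size` meta-data field, and the JSON output format are all hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, List, Tuple

Recipes = Dict[str, List[str]]
ChunkMeta = Dict[str, Dict[str, int]]


def anonymize_fingerprint(fp: str, key: bytes) -> str:
    """Map a fingerprint to a keyed hash (assumed anonymization scheme)."""
    return hmac.new(key, fp.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_collection(
    recipes: Recipes, chunk_meta: ChunkMeta, key: bytes
) -> Tuple[Recipes, ChunkMeta]:
    """Anonymize recipes and chunk meta-data with the same key.

    Using one key for both keeps the mapping consistent, so anonymized
    recipe fingerprints can still be matched to anonymized meta-data.
    """
    anon_recipes: Recipes = {}
    for n, (_path, fps) in enumerate(sorted(recipes.items())):
        # File names are replaced by placeholders such as "file-0".
        anon_recipes[f"file-{n}"] = [anonymize_fingerprint(fp, key) for fp in fps]
    anon_meta: ChunkMeta = {
        anonymize_fingerprint(fp, key): dict(meta)
        for fp, meta in chunk_meta.items()
    }
    return anon_recipes, anon_meta


def store(anon_recipes: Recipes, anon_meta: ChunkMeta) -> str:
    """Serialize the anonymized collection (stand-in for the storage unit)."""
    return json.dumps(
        {"recipes": anon_recipes, "chunk_meta": anon_meta}, sort_keys=True
    )


# Toy input standing in for data gathered by a data collection engine.
recipes = {"/home/alice/report.doc": ["fp-aaa", "fp-bbb", "fp-aaa"]}
chunk_meta = {"fp-aaa": {"size": 4096}, "fp-bbb": {"size": 8192}}

anon_recipes, anon_meta = anonymize_collection(recipes, chunk_meta, b"secret-key")
stored = store(anon_recipes, anon_meta)
```

The stored record contains neither the original path nor the original fingerprints, yet every anonymized recipe entry still resolves to a meta-data record, which is what allows analysis without the content data set.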
Specification