Method and system to scan data from a system that supports deduplication
Abstract
An interface is disclosed that makes information obtained from a file deduplication process available to an application for the efficient operation thereof. A data deduplication repository is scanned to determine a plurality of file segments and respective checksum values associated with the segments. A data structure is generated that allows shared segments to be identified by indexing using a common checksum value. The segments also indicate the file to which they belong and may also include a timestamp value. This data structure is updated as files are modified, etc. The data structure is accessible to an application program so that the application program can readily determine which segments are shared between multiple files. With this information, the application can efficiently process the segment once rather than multiple times. Timestamps can be used by the application to efficiently identify only those segments that were accessed after a given time.
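As a rough illustration of the data structure the abstract describes, the following sketch groups segments by checksum so that a shared segment can be looked up once. The `Segment` fields, the function names, and the sample data are all assumptions made for this example; the source does not prescribe a concrete layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical record layout: owning file, position, checksum, last-altered time.
    file: str
    offset: int
    checksum: str
    altered: float


def build_checksum_index(segments):
    """Associate segments that share a checksum; the checksum is the lookup key."""
    index = defaultdict(list)
    for seg in segments:
        index[seg.checksum].append(seg)
    return dict(index)


def shared_checksums(index):
    """Return checksums whose segments belong to more than one file."""
    return {c for c, segs in index.items() if len({s.file for s in segs}) > 1}


# Illustrative data only.
segments = [
    Segment("a.doc", 0, "c1", 100.0),
    Segment("b.doc", 0, "c1", 100.0),
    Segment("b.doc", 1, "c2", 250.0),
]
index = build_checksum_index(segments)
shared = shared_checksums(index)
```

With an index of this shape, an application could process the segment behind checksum `c1` once and apply the result to both files that contain it.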
20 Claims
1. A method of providing file information relating to data deduplication in a computer system, comprising:
accessing a plurality of files, wherein each file comprises a plurality of segments;
accessing a data repository storing data resultant from a file deduplication process including accessing a plurality of checksum values associated with said plurality of segments, wherein a first application program performs said file deduplication process;
identifying segments of said plurality of segments having a same checksum value;
generating a data association structure by associating said segments of said plurality of segments having said same checksum value, wherein a first checksum value is operable as an index into said data association structure for obtaining segments having said first checksum value;
storing, using a deduplication database, said data association structure;
storing a plurality of respective timestamps associated with said plurality of segments, wherein each respective timestamp indicates a last time an associated segment was altered;
accessing said stored data association structure in said computer memory by a second application program using a received application timestamp and accessing one or more segments of said plurality of segments having an associated timestamp of said plurality of respective timestamps that is newer than the received application timestamp, wherein the second application program identifies, using only said associated timestamp, one or more segments of said plurality of segments that are not processed by the second application program;
receiving, by the second application program, a listing of files to which a segment of said one or more segments of said plurality of segments belongs; and
comparing said listing of files against a defined subset of files to exclude from processing to further determine whether said segment needs to be processed.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
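The last three steps of claim 1 (timestamp filtering, receiving the file listing, and comparing it against an exclusion subset) can be sketched as follows. This is one possible reading, not the claimed implementation: the dictionary layout, the name `segments_to_process`, and the rule that a segment is skipped only when every owning file is excluded are assumptions made for illustration.

```python
def segments_to_process(index, app_timestamp, excluded_files):
    """Return checksums of segments the second application still needs to handle.

    index: checksum -> {"altered": last-altered time, "files": set of owning files}
    (a hypothetical layout chosen for this sketch).
    """
    pending = []
    for checksum, entry in index.items():
        # Step 1: keep only segments altered after the application's timestamp.
        if entry["altered"] <= app_timestamp:
            continue
        # Step 2: compare the file listing against the exclusion subset.
        # Assumption: skip the segment only if all of its files are excluded.
        if entry["files"] <= excluded_files:
            continue
        pending.append(checksum)
    return sorted(pending)


# Illustrative data only.
index = {
    "c1": {"altered": 300.0, "files": {"a.doc", "b.doc"}},
    "c2": {"altered": 100.0, "files": {"a.doc"}},
    "c3": {"altered": 400.0, "files": {"tmp.log"}},
}
result = segments_to_process(index, app_timestamp=200.0, excluded_files={"tmp.log"})
```

Here `c2` is dropped by the timestamp test alone, while `c3` passes the timestamp test but is dropped because its only owning file is in the exclusion subset.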
9. A non-transitory computer readable storage medium having stored thereon computer executable instructions that, if executed by a computer system, cause the computer system to perform a method of providing file information relating to data deduplication, said method comprising:
accessing a plurality of files, wherein each file comprises a plurality of segments;
accessing a data repository storing data resultant from a file deduplication process including accessing a plurality of checksum values associated with said plurality of segments, wherein a first application program performs said file deduplication process;
identifying segments of said plurality of segments having a same checksum value;
generating a data association structure by associating said segments of said plurality of segments having said same checksum value, wherein a first checksum value is operable as an index into said data association structure for obtaining segments having said first checksum value, and wherein, within said data association structure, each segment indicates a respective file to which each segment is associated;
storing, using a deduplication database, said data association structure;
storing a plurality of respective time stamps associated with said plurality of segments, wherein each respective time stamp indicates a last time an associated segment was altered;
accessing said stored data association structure in said computer memory by a second application program using a received application timestamp and accessing one or more segments of said plurality of segments having an associated timestamp of said plurality of respective time stamps that is newer than the received application timestamp, wherein the second application program identifies, using only said respective timestamps, one or more segments of said plurality of segments that are not processed by the second application program;
receiving, by the second application program, a listing of files to which a segment of said one or more segments of said plurality of segments belongs; and
comparing said listing of files against a defined subset of files to exclude from processing to further determine whether said segment needs to be processed.
Dependent claims: 10, 11, 12, 13, 14, 15
16. A method of providing information relating to a deduplication process in a computer system, said method comprising:
generating a segment index data structure responsive to performing a data deduplication process on a plurality of files of a file system, wherein each file of said plurality of files comprises a plurality of segments and wherein said segment index data structure comprises a listing of unique segments within said plurality of segments, and wherein a first application performs said data deduplication process;
storing, using a deduplication database, a plurality of respective time stamps associated with said plurality of segments, wherein each respective time stamp indicates a last time an associated segment was altered;
receiving a request from a requesting application for segment information;
responsive to said request, scanning said segment index data structure;
responsive to said scanning, supplying segment information to said requesting application, wherein said request includes an application timestamp indicating a last time said requesting application processed data within said file system, and wherein said scanning comprises scanning only respective segments of said segment index data structure having respective timestamps of said plurality of respective time stamps having a time after said application timestamp, wherein the requesting application identifies, using only said respective timestamps, one or more segments of said respective segments that are not processed by the requesting application;
receiving, by the second application program, a listing of files to which a segment of said one or more segments of said plurality of segments belongs; and
comparing said listing of files against a defined subset of files to exclude from processing to further determine whether said segment needs to be processed.
Dependent claims: 17, 18, 19, 20
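The request-and-scan flow of claim 16 might look like the sketch below: a segment index holds one entry per unique segment, and a requesting application passes its last-processed timestamp to receive only newer entries. The class and method names (`SegmentIndex`, `record`, `scan`) and the `SegmentInfo` fields are hypothetical, chosen only to make the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentInfo:
    # Hypothetical shape of the segment information supplied to the requester.
    checksum: str
    altered: float
    files: tuple


class SegmentIndex:
    """Listing of unique segments, keyed by checksum (assumed layout)."""

    def __init__(self):
        self._entries = {}

    def record(self, checksum, altered, file):
        """Add a file reference to a unique segment, keeping the newest timestamp."""
        info = self._entries.get(checksum)
        if info is None:
            self._entries[checksum] = SegmentInfo(checksum, altered, (file,))
            return
        files = info.files if file in info.files else info.files + (file,)
        self._entries[checksum] = SegmentInfo(checksum, max(info.altered, altered), files)

    def scan(self, app_timestamp):
        """Supply only segments altered after the requester's timestamp."""
        return [e for e in self._entries.values() if e.altered > app_timestamp]


# Illustrative data only.
idx = SegmentIndex()
idx.record("c1", 100.0, "a.doc")
idx.record("c1", 100.0, "b.doc")
idx.record("c2", 250.0, "b.doc")
reply = idx.scan(app_timestamp=200.0)
```

Each returned `SegmentInfo` carries the file listing, so the requester could then apply an exclusion comparison without a second lookup.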
Specification