Method and system for automatically merging files into a single instance store

US 6,389,433 B1
Filed: 07/16/1999
Issued: 05/14/2002
Est. Priority Date: 07/16/1999
Status: Expired due to Term


Abstract

A method and system that operate as a background process to automatically identify and merge duplicate files into single instance files, wherein the duplicate files become independent links to the single instance files. A groveler maintains a database of information about the files on a volume, including a file size and checksum (signature) based on the file contents. The groveler periodically acts in the background to scan the USN log, a log that dynamically records file system activity. New or modified files detected in the USN log are queued as work items, each work item representing a file. The volume may also be scanned to add work items to the queue, which takes place initially or when there is a potential problem with the USN log. The groveler periodically removes items from the queue, calculates the signature of the corresponding file contents, and uses the signature and file size to query the database for matching files. The groveler then compares any matching files with the file corresponding to the work item for an exact duplicate, and if one is found, calls a single instance store facility to merge the files and create independent links to those files.
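The work-item flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Groveler` class, the dictionary standing in for a volume, the in-memory index standing in for the database, and the choice of SHA-256 as the signature are all assumptions made for illustration, and the merge step is only recorded rather than handed to a real single instance store facility.

```python
import hashlib
from collections import defaultdict, deque


def signature(data):
    """Checksum of the file contents (SHA-256 is an assumption, not from the patent)."""
    return hashlib.sha256(data).hexdigest()


class Groveler:
    """Hypothetical in-memory model of the queue / database / compare loop."""

    def __init__(self, volume):
        self.volume = volume             # assumed stand-in for a volume: path -> bytes
        self.queue = deque()             # work items, one per new or modified file
        self.index = defaultdict(list)   # "database": (size, signature) -> known paths
        self.merged = []                 # (duplicate, original) pairs found so far

    def enqueue(self, path):
        self.queue.append(path)

    def process_one(self):
        path = self.queue.popleft()
        data = self.volume[path]
        key = (len(data), signature(data))
        for candidate in self.index[key]:
            # Equal size and signature is only a hint; confirm with a full comparison.
            if candidate != path and self.volume[candidate] == data:
                # Stand-in for the call to the single instance store facility.
                self.merged.append((path, candidate))
                break
        if path not in self.index[key]:
            self.index[key].append(path)


g = Groveler({"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"})
for p in ("a.txt", "b.txt", "c.txt"):
    g.enqueue(p)
while g.queue:
    g.process_one()
print(g.merged)  # [('b.txt', 'a.txt')]
```

The sketch never removes stale index entries when a file changes; a real database would need to update or delete the old record for a modified file.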

518 Citations

41 Claims

1. A computer-readable medium having computer-executable instructions, comprising, automatically identifying at least two files having duplicate data, automatically merging the duplicate data of the files into a single instance representation of that data, converting each of the files into logically separate links to the single instance representation, each link comprising a logically separate link file that provides logically separate file system access to the single instance representation of the file data, and reclaiming storage space that was occupied by the duplicate data of at least one of the files.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
  - 2. The computer-readable medium having computer-executable instructions of claim 1 wherein automatically identifying at least two files having duplicate data includes, adding file identifiers to a work item queue.
  - 3. The computer-readable medium having computer-executable instructions of claim 2 wherein adding file identifiers to a work item queue includes scanning a volume for file identifiers.
  - 4. The computer-readable medium having computer-executable instructions of claim 3 wherein scanning the volume for file identifiers occurs for a limited time.
  - 5. The computer-readable medium having computer-executable instructions of claim 2 wherein adding file identifiers to a work item queue includes extracting file information from a log of file activity.
  - 6. The computer-readable medium of claim 5 having further computer-executable instructions for calculating a time for extracting file information from the log.
  - 7. The computer-readable medium having computer-executable instructions of claim 6 wherein the time calculated is based on an amount of file information previously extracted from the log.
  - 8. The computer-readable medium having computer-executable instructions of claim 1 wherein automatically identifying at least two files having duplicate data includes, dequeuing a file identifier from a work item queue.
  - 9. The computer-readable medium of claim 8 having further computer-executable instructions for, querying a database of file information for a set of at least one file having properties that match properties of a file corresponding to the identifier dequeued from the work item queue.
  - 10. The computer-readable medium having computer-executable instructions of claim 9 wherein querying the database of file information includes providing a file size of the file corresponding to the identifier dequeued from the work item queue to a database manager.
  - 11. The computer-readable medium of claim 9 having further computer-executable instructions for, calculating a signature of the file corresponding to the identifier dequeued from the work item queue.
  - 12. The computer-readable medium having computer-executable instructions of claim 11 wherein querying the database of file information includes providing the signature to a database manager.
  - 13. The computer-readable medium having computer-executable instructions of claim 12 wherein querying the database of file information further includes providing a file size of the file corresponding to the identifier dequeued from the work item queue to the database manager.
  - 14. The computer-readable medium of claim 9 having further computer-executable instructions for, receiving the set of at least one file having properties that match properties of the file corresponding to the identifier dequeued from the work item queue, and comparing the data of at least one file in the set with the data of the file corresponding to the identifier dequeued from the work item queue.
  - 15. The computer-readable medium having computer-executable instructions of claim 14 wherein comparing the data determines if each file is an exact duplicate of the other.
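Claims 6 and 7 recite calculating a time for extracting file information from the log based on the amount previously extracted, without giving a formula on this page. The sketch below shows one plausible policy and is purely an assumption: the function name, the `target_entries` goal, the doubling back-off, and the clamping bounds are all hypothetical.

```python
def next_extraction_delay(entries_extracted, previous_delay,
                          target_entries=100, min_delay=1.0, max_delay=3600.0):
    """Return the number of seconds to wait before reading the activity log again.

    Assumed policy: aim for roughly `target_entries` records per read, so a busy
    log is read sooner and a quiet log is read less often.
    """
    if entries_extracted <= 0:
        # Nothing was extracted last time: back off by doubling the delay.
        delay = previous_delay * 2
    else:
        # Scale the delay in proportion to how far the last read was from the target.
        delay = previous_delay * target_entries / entries_extracted
    # Clamp so the delay never collapses to zero or grows without bound.
    return max(min_delay, min(max_delay, delay))


print(next_extraction_delay(200, 60.0))  # 30.0
print(next_extraction_delay(0, 60.0))    # 120.0
```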

16. A method of identifying files having similar properties on a file system volume, comprising, in a first operation, adding file information to a queue, in a second operation distinct from the first operation, removing file information from the queue, querying a database with at least one property of a file corresponding to the file information removed from the queue, and receiving a set of at least one file identifier, each file identifier in the set corresponding to a file having at least one similar property of the file corresponding to the file information removed from the queue.
- Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 39, 40, 41)
  - 17. The method of claim 16 wherein adding file information to a work item queue includes scanning a volume for file identifier information.
  - 18. The method of claim 17 further comprising limiting the time for scanning the volume.
  - 19. The method of claim 17 wherein adding file information to a work item queue includes extracting file information from a log of file activity.
  - 20. The method of claim 19 further comprising calculating a time for extracting file information from the log.
  - 21. The method of claim 20 further comprising, returning an amount of file information extracted from the log, and using the amount to calculate a next time for extracting file information from the log.
  - 22. The method of claim 16 further comprising calculating a signature of the file corresponding to the file information removed from the queue, and wherein querying the database includes providing the signature to a database manager.
  - 23. The method of claim 22 wherein querying the database further includes providing a file size corresponding to the file information removed from the queue to the database manager.
  - 24. The method of claim 16 wherein querying the database includes providing a file size corresponding to the file information removed from the queue.
  - 25. The method of claim 16 further comprising, comparing data in the file that corresponds to the file information removed from the queue to the data in at least one file corresponding to file identifier information in the set, and if sufficiently similar, merging the files into a single instance representation thereof having independent links thereto.
  - 26. The method of claim 25 wherein comparing the data determines if each file is an exact duplicate of the other.
  - 39. The method of claim 16 wherein the first operation alternates with the second operation.
  - 40. The method of claim 39 wherein the first operation operates for a limited time, and the second operation operates after the time to remove each set of file information added to the queue in the first operation.
  - 41. A computer-readable medium having computer-executable instruction for performing the method of claim 16.
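Claims 39 and 40 describe a first operation that adds to the queue for a limited time, alternating with a second operation that removes what was added. A minimal sketch of one such cycle follows; `run_cycle`, its parameters, and the injectable `clock` are hypothetical names chosen for illustration, not terms from the patent.

```python
import time
from collections import deque


def run_cycle(source, queue, handle, scan_budget_s, clock=time.monotonic):
    """Run one add-then-drain cycle and return the number of items added.

    source        -- iterator yielding file identifiers (assumed input)
    queue         -- deque used as the work queue
    handle        -- callable applied to each identifier removed from the queue
    scan_budget_s -- time limit, in seconds, for the first operation
    """
    deadline = clock() + scan_budget_s
    added = 0
    # First operation: enqueue until the time budget is spent or the source is empty.
    while clock() < deadline:
        item = next(source, None)
        if item is None:
            break
        queue.append(item)
        added += 1
    # Second operation: remove and process everything the first operation added.
    while queue:
        handle(queue.popleft())
    return added


seen = []
files = iter(["f1", "f2", "f3"])
run_cycle(files, deque(), seen.append, scan_budget_s=5.0)
print(seen)
```

Calling `run_cycle` repeatedly with the same `source` gives the alternation: each call resumes the scan where the previous one stopped.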

27. A system for identifying files having similar properties on a file system volume, comprising, a database including file property information, a database manager for querying the database, a work queue, a first component for adding file identifiers to the work queue, and a second component for removing file identifiers from the queue, the second component providing a query to the database manager, the query including property information corresponding to a file identified by a file identifier removed from the queue, the second component receiving a set of file identifiers in response to the query, each identifier in the set corresponding to a file having property information that matches the file property information identified in the query.
- Dependent Claims (28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38)
  - 28. The system of claim 27 wherein the second component compares the data of the file corresponding to the file identifier removed from the queue with the data of at least one file corresponding to a file identifier returned in response to the query.
  - 29. The system of claim 28 wherein the second component performs a byte comparison of the data in each file to determine if the file data matches exactly.
  - 30. The system of claim 27 wherein if the comparison indicates the file data is similar, the second component calls a facility for merging the files, the facility providing a single instance representation of the file data and logically separate links thereto.
  - 31. The system of claim 27 further comprising a log for recording file activity, and wherein the first component extracts at least some of the file identifiers for adding to the work queue from the log.
  - 32. The system of claim 27 further comprising a third component for scanning a volume to add file identifiers to the queue.
  - 33. The system of claim 27 wherein the file property information includes a file size.
  - 34. The system of claim 27 wherein the file property information includes a signature.
  - 35. The system of claim 34 wherein the second component computes a signature of the file corresponding to the file removed from the queue.
  - 36. The system of claim 27 wherein the first and second components are functions within a single process, and wherein a partition controller corresponding to a file system volume calls the functions.
  - 37. The system of claim 36 including a plurality of partition controllers, each partition controller corresponding to a file system volume, and further comprising a central controller for controlling the operation of the partition controllers.
  - 38. The system of claim 37 wherein the central controller operates the partition controllers as a background process.
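Claim 30 refers to a facility that merges files into a single instance representation with logically separate links, and claim 1 adds that storage space is reclaimed. The toy model below sketches that idea in memory only. The `SingleInstanceStore` class, its tuple-based link encoding, and its copy-on-write behaviour on `write` are assumptions for illustration and say nothing about how the actual facility is built.

```python
class SingleInstanceStore:
    """Toy in-memory model of merged files with logically separate links."""

    def __init__(self):
        self.blobs = {}     # blob id -> shared bytes (single instance representation)
        self.files = {}     # path -> bytes (ordinary file) or ("link", blob id)
        self._next_id = 0

    def _link_target(self, path):
        entry = self.files[path]
        return entry[1] if isinstance(entry, tuple) else None

    def _collect(self):
        # Drop blobs that no link refers to any more.
        live = {e[1] for e in self.files.values() if isinstance(e, tuple)}
        self.blobs = {k: v for k, v in self.blobs.items() if k in live}

    def write(self, path, data):
        # Writing replaces only this path's entry, so other links are unaffected.
        self.files[path] = bytes(data)
        self._collect()

    def read(self, path):
        target = self._link_target(path)
        return self.files[path] if target is None else self.blobs[target]

    def bytes_used(self):
        plain = sum(len(e) for e in self.files.values() if isinstance(e, bytes))
        return plain + sum(len(b) for b in self.blobs.values())

    def merge(self, path_a, path_b):
        """Turn two identical files into links; return the bytes reclaimed."""
        data = self.read(path_a)
        if data != self.read(path_b):
            raise ValueError("files differ; refusing to merge")
        before = self.bytes_used()
        blob_id = self._link_target(path_a)
        if blob_id is None:
            blob_id = self._link_target(path_b)
        if blob_id is None:
            blob_id = self._next_id
            self._next_id += 1
            self.blobs[blob_id] = data
        self.files[path_a] = ("link", blob_id)
        self.files[path_b] = ("link", blob_id)
        self._collect()
        return before - self.bytes_used()


store = SingleInstanceStore()
store.write("a", b"12345")
store.write("b", b"12345")
print(store.merge("a", "b"))  # 5
store.write("b", b"xy")       # "a" still reads b"12345"
```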

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Bolosky, William J.; Douceur, John R.; Cutshall, Scott M.
Primary Examiner(s): Alam, Hosain T.
Assistant Examiner(s): Truong, Cam Y T
Application Number: US09/354,660
Time in Patent Office: 1,033 Days
Field of Search: 707/1, 707/3, 707/10, 707/200-205
US Class Current: 707/749

CPC Class Codes
G06F 11/1453   using de-duplication of the...
G06F 16/174   Redundancy elimination perf...
Y10S 707/968   Partitioning
Y10S 707/99931   Database or file accessing
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99952   Coherency, e.g. same view t...
Y10S 707/99953   Recoverability
Y10S 707/99956   File allocation
