Method and system for automatically merging files into a single instance store
Abstract
A method and system that operate as a background process to automatically identify and merge duplicate files into single instance files, wherein the duplicate files become independent links to the single instance files. A groveler maintains a database of information about the files on a volume, including a file size and a checksum (signature) based on the file contents. The groveler periodically acts in the background to scan the USN log, a log that dynamically records file system activity. New or modified files detected in the USN log are queued as work items, each work item representing a file. The volume may also be scanned to add work items to the queue, which takes place initially or when there is a potential problem with the USN log. The groveler periodically removes items from the queue, calculates the signature of the corresponding file contents, and uses the signature and file size to query the database for matching files. The groveler then compares any matching files with the file corresponding to the work item for an exact duplicate and, if one is found, calls a single instance store facility to merge the files and create independent links to those files.
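The queue-and-lookup cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the USN log scan is replaced by a pre-filled queue, the database is an in-memory dictionary keyed by (size, signature), SHA-256 stands in for the unspecified checksum, and the names `signature`, `same_contents`, `grovel`, and `demo` are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict, deque


def signature(path, chunk_size=65536):
    """Checksum of the file contents (SHA-256 is an assumption of this sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b, chunk_size=65536):
    """Byte-for-byte comparison that confirms an exact duplicate."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a, block_b = a.read(chunk_size), b.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True


def grovel(work_queue, database, merge):
    """Drain the queue; call merge(existing, new) for each confirmed duplicate."""
    while work_queue:
        path = work_queue.popleft()
        key = (os.path.getsize(path), signature(path))
        for candidate in database[key]:
            if same_contents(candidate, path):
                merge(candidate, path)  # stand-in for the single instance store call
                break
        else:
            # No exact duplicate known yet: record this file for later queries.
            database[key].append(path)


def demo():
    """Queue three small files, two of them identical, and report the merges."""
    merged = []
    with tempfile.TemporaryDirectory() as root:
        queue = deque()
        for name, data in [("a.txt", b"hello"), ("b.txt", b"hello"), ("c.txt", b"world")]:
            path = os.path.join(root, name)
            with open(path, "wb") as f:
                f.write(data)
            queue.append(path)
        grovel(queue, defaultdict(list),
               lambda keep, dup: merged.append((os.path.basename(keep), os.path.basename(dup))))
    return merged


print(demo())  # [('a.txt', 'b.txt')]
```

Size and signature only narrow the candidate set; the byte-for-byte comparison is what guards against two different files that happen to share a checksum.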
518 Citations
41 Claims
- 1. A computer-readable medium having computer-executable instructions, comprising, automatically identifying at least two files having duplicate data, automatically merging the duplicate data of the files into a single instance representation of that data, converting each of the files into logically separate links to the single instance representation, each link comprising a logically separate link file that provides logically separate file system access to the single instance representation of the file data, and reclaiming storage space that was occupied by the duplicate data of at least one of the files.
- 16. A method of identifying files having similar properties on a file system volume, comprising, in a first operation, adding file information to a queue, in a second operation distinct from the first operation, removing file information from the queue, querying a database with at least one property of a file corresponding to the file information removed from the queue, and receiving a set of at least one file identifier, each file identifier in the set corresponding to a file having at least one similar property of the file corresponding to the file information removed from the queue.
- 27. A system for identifying files having similar properties on a file system volume, comprising, a database including file property information, a database manager for querying the database, a work queue, a first component for adding file identifiers to the work queue, and a second component for removing file identifiers from the queue, the second component providing a query to the database manager, the query including property information corresponding to a file identified by a file identifier removed from the queue, the second component receiving a set of file identifiers in response to the query, each identifier in the set corresponding to a file having property information that matches the file property information identified in the query.
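The merge step recited in claim 1 can be modeled with a toy in-memory store. This is only a sketch under stated assumptions: the class `SingleInstanceStore` and its methods are hypothetical, the "link file" is a tuple pointing at a shared blob, and treating a write through one link as detaching that link is this sketch's reading of "logically separate" access, not a detail stated in the claims.

```python
import hashlib


class SingleInstanceStore:
    """Toy model: link files point at shared blobs held in a common store."""

    def __init__(self):
        self.blobs = {}  # blob id -> [data, reference count]
        self.files = {}  # file name -> bytes (ordinary file) or ("link", blob id)

    def write(self, name, data):
        """Create an ordinary (unmerged) file."""
        self.files[name] = bytes(data)

    def read(self, name):
        """Read through a link or directly from an ordinary file."""
        entry = self.files[name]
        if isinstance(entry, tuple):
            return self.blobs[entry[1]][0]
        return entry

    def merge(self, name_a, name_b):
        """Convert two ordinary duplicate files into links; return bytes reclaimed."""
        data = self.files[name_a]
        if isinstance(data, tuple) or data != self.files[name_b]:
            raise ValueError("merge expects two ordinary files with identical data")
        blob_id = hashlib.sha256(data).hexdigest()
        is_new = blob_id not in self.blobs
        blob = self.blobs.setdefault(blob_id, [data, 0])
        blob[1] += 2
        self.files[name_a] = ("link", blob_id)
        self.files[name_b] = ("link", blob_id)
        # Two private copies collapse into one shared copy (or zero new copies
        # if the common store already held this data).
        return len(data) if is_new else 2 * len(data)

    def overwrite(self, name, data):
        """Replace one file's contents without disturbing any other link."""
        entry = self.files[name]
        if isinstance(entry, tuple):
            blob = self.blobs[entry[1]]
            blob[1] -= 1
            if blob[1] == 0:
                del self.blobs[entry[1]]
        self.files[name] = bytes(data)


store = SingleInstanceStore()
store.write("report.doc", b"same bytes")
store.write("copy.doc", b"same bytes")
reclaimed = store.merge("report.doc", "copy.doc")
store.overwrite("copy.doc", b"edited")
print(reclaimed, store.read("report.doc"), store.read("copy.doc"))
# 10 b'same bytes' b'edited'
```

The reference count is what lets the shared data be freed once no link refers to it, while each link keeps behaving like its own file.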
Specification