De-duplication of data in a data processing system
DC CAFCFirst Claim
1. A computer-implemented method of eliminating a particular data item at a given storage location in a distributed file system when a copy of said particular data item can be obtained from at least one other storage location in the file system,wherein said file system comprises (i) a plurality of distinct storage locations, and (ii) at least one computer distinct from said plurality of storage locations, and (iii) at least one database, andwherein each data item in the file system is stored at multiple distinct storage locations in the file system, based, at least in part, on a predetermined degree of redundancy for said file system, andwherein said at least one database comprises mapping data that identifies where data items are stored in the file system, andwherein the file system further includes a table including a plurality of records indicating changes made to said file system, said table being accessible by said at least one computer,the method comprising the steps of:
- (A) obtaining, at said at least one computer, a request to delete a particular data item from said given storage location; and
(B) based on the request, ascertaining a particular substantially unique data identifier for the particular data item, wherein said particular data item consists of a particular sequence of bits and wherein said particular data identifier is based, at least in part, on a given function of all of the bits in the particular sequence of bits of the particular data item; and
(C) determining, using at least said particular substantially unique data identifier for the particular data item, whether deletion of said particular data item from said given storage location will leave a sufficient number of copies of said data item present at one or more other storage locations in the file system to satisfy said predetermined degree of redundancy for said file system with respect to that particular data item; and
(D) based at least in part on said determining, if it is determined that deletion of the particular data item from the given storage location will leave at least a sufficient number of copies of said data item at one or more other storage locations in said file system to satisfy said predetermined degree of redundancy for said file system with respect to that particular data item, then(d1) adding an entry to said table to indicate deletion of said particular file from the given storage location in the file system, said entry including said particular substantially unique identifier for said particular data item.
9 Assignments
Litigations
1 Petition
Reexaminations
Accused Products
Abstract
In a data processing system, a method includes deleting a particular copy of a data item when at least one other copy of the data item is available. The presence of another copy of the data item is determined, at least in part, based on an identifier for the data item, the identifier having been computed using all of the data in the data item and only the data in the data item, wherein two identical data items in the data processing system will have identical identifiers. The particular copy of the data item may be deleted if another copy of the data is determined to be present on another processor in the system or on the same processor. The identifier of the data item is computed using a function such as a message digest or hash function which may be: MD4, MD5, or SHA.
-
Citations
35 Claims
-
1. A computer-implemented method of eliminating a particular data item at a given storage location in a distributed file system when a copy of said particular data item can be obtained from at least one other storage location in the file system,
wherein said file system comprises (i) a plurality of distinct storage locations, and (ii) at least one computer distinct from said plurality of storage locations, and (iii) at least one database, and wherein each data item in the file system is stored at multiple distinct storage locations in the file system, based, at least in part, on a predetermined degree of redundancy for said file system, and wherein said at least one database comprises mapping data that identifies where data items are stored in the file system, and wherein the file system further includes a table including a plurality of records indicating changes made to said file system, said table being accessible by said at least one computer, the method comprising the steps of: -
(A) obtaining, at said at least one computer, a request to delete a particular data item from said given storage location; and (B) based on the request, ascertaining a particular substantially unique data identifier for the particular data item, wherein said particular data item consists of a particular sequence of bits and wherein said particular data identifier is based, at least in part, on a given function of all of the bits in the particular sequence of bits of the particular data item; and (C) determining, using at least said particular substantially unique data identifier for the particular data item, whether deletion of said particular data item from said given storage location will leave a sufficient number of copies of said data item present at one or more other storage locations in the file system to satisfy said predetermined degree of redundancy for said file system with respect to that particular data item; and (D) based at least in part on said determining, if it is determined that deletion of the particular data item from the given storage location will leave at least a sufficient number of copies of said data item at one or more other storage locations in said file system to satisfy said predetermined degree of redundancy for said file system with respect to that particular data item, then (d1) adding an entry to said table to indicate deletion of said particular file from the given storage location in the file system, said entry including said particular substantially unique identifier for said particular data item. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A computer-implemented method of deleting a particular instance of a given file in a file system, wherein the file system comprises (i) a plurality of servers to store data as fixed-size chunks;
- (ii) a database;
(iii) at least one computer connected to the servers, and (iv) a list of records indicating statuses of files in the file system,wherein each file in the file system has a corresponding digital file identifier, each file in the file system consisting of a corresponding sequence of bits, and the corresponding digital file identifier for each file in the file system being based, at least in part, on given function of all of the bits of the file, said given function comprising a hash function or message digest function, and wherein two identical files in the file system have the same digital file identifier as determined using said given function; and wherein the list of records includes digital file identifiers for at least some of the records on the list, wherein each file in the file system consists of one or more non-overlapping chunks, each chunk being replicated on multiple servers of said plurality of servers according to a predetermined degree of redundancy for said file system, the database comprising;
(i) first data that includes digital file identifiers for files for which the file data are stored as chunks; and
(ii) second data that maps the file identifiers to the chunks to which the digital file identifiers correspond; and
(iii) location data that identifies which of the plurality of servers stores which of the chunks,the method comprising the steps of; (A) obtaining a request to delete said particular instance of said given file in the file system; and (B) based at least in part on said request, ascertaining a digital file identifier corresponding to the given file, the given file consisting of a corresponding sequence of bits, and the corresponding digital file identifier for the given file being based, at least in part, on the given function of all of the bits of the given file; and (C) determining whether deletion of said particular instance of said given file will violate said predetermined degree of redundancy for said file system; and (D) based at least in part on said determining in (C), when it is determined that deletion of said particular instance of said given file will not violate said predetermined degree of redundancy for said file system, then;
(d1) updating said list of records to indicate deletion of said particular instance of said given file; and
(d2) deleting said particular instance of said given file. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
- (ii) a database;
-
17. A method of deleting duplicate data files in a file system, the file system comprising a plurality of servers to store data as fixed-size chunks, and wherein each file in the file system consists of one or more non-overlapping chunks, wherein each chunk is replicated on multiple servers of said plurality of servers, and
wherein each file in the file system has a corresponding digital file identifier, each file in the file system consisting of a corresponding sequence of bits, and the corresponding digital file identifier for each file in the file system being based, at least in part, on given function of all of the bits of the file, and wherein two identical files in the file system have the same digital file identifier as determined using said given function, and wherein the file system includes mapping data to map the digital file identifier of each file in the file system to corresponding location data that identifies which servers in the file system store the corresponding chunks of said file, the method comprising: -
determining whether at least some of the chunks of at least some of the data files have a sufficient number of duplicates stored on said servers in the file system, said sufficient number of duplicates being based, at least in part, on a predetermined degree of redundancy for said file system, said determining being based, at least in part, on said digital file identifiers for said data files; and
,based at least in part on said determining, updating a list to indicate the deletion of at least some of the chunks of data files that are determined to have more than said sufficient number of duplicates. - View Dependent Claims (18, 19)
-
-
20. A method, operable in a file system comprising (a) a plurality of servers to store file data, and (b) at least one computer distinct from said plurality of servers, and (c) at least one database, and
(i) wherein each file in the file system has a corresponding digital file identifier, each file in the file system consisting of a corresponding sequence of bits, and the corresponding digital file identifier for each file in the file system being based, at least in part, on given function of all of the bits of the corresponding file, and (ii) wherein each file in the file system consists of a corresponding one or more non-overlapping parts, each part being replicated on multiple servers of said plurality of servers in said file system, and (iii) wherein said at least one database comprises a map, for files in the file system, between file identifiers of said files in the file system and corresponding location information, said location information for a specific file identifying which of the plurality of servers stores each part of the specific file, and (iv) wherein the file system further includes a list of entries, each entry indicating statuses of file in said file system, the method comprising: -
(A) in response to a request to delete a particular file, locating an entry corresponding to a particular digital file identifier for said particular file in said at least one database, wherein said particular file consists of a particular sequence of bits, said particular digital file identifier for said particular file being based, at least in part, on said given function of all of the bits in said particular sequence of bits of the particular file; and (B) adding an entry to said list to indicate that a status said particular file is deleted, said entry including said particular digital file identifier of said particular file. - View Dependent Claims (21, 22, 23, 24)
-
-
25. A computer-implemented deletion method operable in a file system comprising:
- (i) a plurality of servers to store file data as segments; and
(ii) first data that includes file identifiers for files for which the file data are stored as segments; and
(iii) second data that maps the file identifiers to the segments to which the file identifiers correspond;
(iv) location data that identifies which of the plurality of servers stores which of the segments, and (v) a list indicating statuses of files in said file system,the method comprising the steps of; (A) receiving a request to delete a particular data item in the file system, said request including digital data item identifier corresponding to the particular data item, said particular data item consisting of an arbitrary sequence of bits consisting of a first sequence of non-overlapping segments, each of said segments in said first sequence being stored on multiple servers of the plurality of servers in the file system, said digital data item identifier being based at least in part on a hash function of the data comprising the particular data item; (B) hardware in combination with software, attempting to match the digital data item identifier of the particular data item with a digital data item identifier in a database, said database comprising (i) said first data that includes file identifiers for files for which the file data are stored as segments; and
(ii) said second data that maps the file identifiers to the segments to which the file identifiers correspond, and (iii) said location data that identifies which of the plurality of servers stores which of the segments; and(C) based at least in part on said attempting to match in step (B), updating said list to reflect deletion of said particular data item. - View Dependent Claims (26)
- (i) a plurality of servers to store file data as segments; and
-
27. A computer-implemented deletion method operable in a file system comprising (i) a plurality of servers;
- (ii) a database; and
(iii) at least one computer connected to the servers, and (v) a list indicating statuses of files in said file system, the method comprising;obtaining, at said at least one computer, a request to delete a first data item, said request including first data item identifier for a first data item, said first data item consisting of a first plurality of non-overlapping segments, each of said segments consisting of a corresponding sequence of bits, and each of said segments being stored on multiple servers of said plurality of servers in the file system, said first data item identifier being based at least in part on the data comprising the first data item; and determining, using hardware in combination with software, at least one matching record in the database for the first data item based at least in part on the first data item identifier, the database comprising a plurality of records, where the records in the database correspond to data items, and where the records in the database include;
(i) first data that includes data item identifiers for data items for which the data are stored in the file system as segments; and
(ii) second data, keyed on data item identifiers, that maps the data item identifiers to the segments to which the data item identifiers correspond, and (iii) location data, keyed on segment identifiers, that identifies which of the plurality of servers in the file system stores which of the segments; andupdating said list to reflect that said first data item is deleted from the file system; and in response to subsequent processing of said list, deleting at least one segment of the first data item at least one of the plurality of servers in the file system. - View Dependent Claims (28, 29)
- (ii) a database; and
-
30. A computer-implemented deletion method operable in a file system comprising (i) a plurality of servers;
- (ii) a list indicating, for each of a plurality of files in the file system, a corresponding status,
wherein, for each of a plurality of data items in the file system, said data items each consisting of a corresponding sequence of one or more parts; and wherein each data item has a corresponding digital data item identifier, said digital data item identifier for the data item being based, at least in part, on the contents of the data item, wherein two identical data items in the file system have the same digital data item identifier; and wherein each part is replicated on multiple servers of said plurality of servers; and wherein said list includes digital data item identifiers for data items for which changes are to be made in the file system, the method comprising the steps of; (A) obtaining a particular digital data item identifier of a particular data item, said particular digital data item identifier of said particular data item being obtained in response to an attempt to delete said particular data item in said file system; (B) updating a record in said list to reflect deletion of said particular data item from the file system, said record including the particular digital data item identifier to the list. - View Dependent Claims (31, 32)
- (ii) a list indicating, for each of a plurality of files in the file system, a corresponding status,
-
33. A file system comprising:
-
(i) a plurality of servers to store file data as segments; and (ii) first data that includes file identifiers for files for which the file data are stored as segments; and (iii) second data that maps the file identifiers to the segments to which the file identifiers correspond; and (iv) location data that identifies which of the plurality of servers stores which of the segments; and (v) a table including file identifiers for files in the file system, said table including a corresponding status for at least some of the files in the file system, (vi) at least one computer comprising hardware in combination with software and connected to the plurality of servers, the at least one computer programmed; (A) to receive a request to delete a particular data item in the file system; (B) to ascertain, in response to said request, a digital data item identifier corresponding to said particular data item, said particular data item consisting of an arbitrary sequence of bits consisting of a sequence of non-overlapping segments, each of said segments in said sequence being stored on multiple servers of the plurality of servers in the file system, said digital data item identifier being based at least in part on a given function of the data comprising the particular data item, said given function comprising a hash function; (C) to update an entry in said table corresponding to said particular data item to reflect deletion of said particular data item in the file system, said entry including at least said digital data item identifier of said particular data item.
-
-
34. A device comprising hardware including at least one processor and memory, said device operable in a file system,
wherein each file in the file system has a corresponding digital file identifier, each file in the file system consisting of a corresponding sequence of bits, and the corresponding digital file identifier for each file in the file system being based, at least in part, on given function of all of the bits of the file, said given function comprising a hash or message digest function, and wherein two identical files in the file system have the same digital file identifier as determined using said given function; - and
wherein the file system comprises a plurality of servers to store data as fixed-size chunks, and wherein each file in the file system consists of one or more non-overlapping chunks, and wherein each chunk is replicated on multiple servers of said plurality of servers, said memory of the device storing at least; a list including file identifiers for files for which changes are made in the file system, said device comprising software, in combination with said hardware; (A) to receive at said device a request to delete a particular file in the file system; (B) responsive to said request, to update a record in said list, said record including a particular digital file identifier for said particular file, said record indicating that said particular file is deleted. - View Dependent Claims (35)
- and
Specification