Method and apparatus for storing information in a data processing system
First Claim
1. In a computer system including at least one data source wherein data is stored in source allocation units and a data repository having access to the data source and including a storage device for storing data in repository allocation units, a method for storing data from the data source in the storage device of the data repository, comprising the steps of:
- (a) reading data from the source allocation units and restructuring the data into data unit having a size corresponding to the repository allocation units;
(b) for each data unit read from the data source, generating a hash value for the data of each data unit;
(c) for each data unit read from the data source, searching a data table for a table entry having a hash value matching a hash value of the data unit read from the data source, wherein each table entry contains the hash value of a data unit stored in a repository allocation unit and a repository allocation unit pointer to the corresponding repository allocation unit;
(d) when the hash value of a data unit does not match any hash value of any table entry in the data table, writing the data of the data unit into a newly allocated repository allocation unit, generating a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit, and writing the new table entry containing the hash value and a repository allocation unit pointer to the newly allocated repository allocation unit to the data table;
(e) when the hash value of a data unit matches the hash value of a data entry in the data table, accessing the table entry having a matching hash value and using the repository allocation unit pointer therein to read the data of the corresponding repository allocation unit, and comparing the data of the data unit and the data of the corresponding repository allocation unit, if the data of the data unit matches the data of the corresponding repository allocation unit, discarding the data unit, and if the data of the data unit does not match the data of the corresponding repository allocation unit, writing the data of the data unit into a newly allocated repository allocation unit, generating a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit, and inserting the new table entry into the data table; and
, (f) repeating steps (a) through (e) until all source allocation units have been read.
2 Assignments
0 Petitions
Accused Products
Abstract
A method for storing data from a data source in a storage device of a data repository by reading all source allocation units, restructuring the data into data units having a size corresponding to the repository allocation units, and generating a hash value for the data of each data unit read from the data source. For each data unit, a data table is searched for a table entry having a matching hash value wherein each table entry contains the hash value of a data unit stored in a repository allocation unit and a repository allocation unit pointer to the corresponding repository allocation unit. When the hash value of a data unit does not match any hash value of any table entry in the data table, the data of the data unit is written into a newly allocated repository allocation unit a new table entry is written to the data table. When the hash value of a data unit matches the hash value of a data entry in the data table, the data of the corresponding repository allocation unit and is compared with the data of the data unit. If the data of the data unit matches the repository allocation unit, the data unit is discarded. If the data of the data unit does not match the corresponding repository allocation unit, the data unit is written into a newly allocated repository allocation unit and a new table entry is inserted into the data table.
-
Citations
21 Claims
-
1. In a computer system including at least one data source wherein data is stored in source allocation units and a data repository having access to the data source and including a storage device for storing data in repository allocation units, a method for storing data from the data source in the storage device of the data repository, comprising the steps of:
-
(a) reading data from the source allocation units and restructuring the data into data unit having a size corresponding to the repository allocation units;
(b) for each data unit read from the data source, generating a hash value for the data of each data unit;
(c) for each data unit read from the data source, searching a data table for a table entry having a hash value matching a hash value of the data unit read from the data source, wherein each table entry contains the hash value of a data unit stored in a repository allocation unit and a repository allocation unit pointer to the corresponding repository allocation unit;
(d) when the hash value of a data unit does not match any hash value of any table entry in the data table, writing the data of the data unit into a newly allocated repository allocation unit, generating a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit, and writing the new table entry containing the hash value and a repository allocation unit pointer to the newly allocated repository allocation unit to the data table;
(e) when the hash value of a data unit matches the hash value of a data entry in the data table, accessing the table entry having a matching hash value and using the repository allocation unit pointer therein to read the data of the corresponding repository allocation unit, and comparing the data of the data unit and the data of the corresponding repository allocation unit, if the data of the data unit matches the data of the corresponding repository allocation unit, discarding the data unit, and if the data of the data unit does not match the data of the corresponding repository allocation unit, writing the data of the data unit into a newly allocated repository allocation unit, generating a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit, and inserting the new table entry into the data table; and
,(f) repeating steps (a) through (e) until all source allocation units have been read. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 21)
the source allocation units are clusters of a storage device.
-
-
3. The method of claim 1 for storing data from the data source in the storage device of the data repository, wherein:
the source allocation units are sectors of a storage device.
-
4. The method of claim 1 for storing data from the data source in the storage device of the data repository, wherein:
a repository allocation unit pointer is a byte offset of the location of a repository allocation unit from the beginning of the repository allocation units in the storage device of the data repository.
-
5. The method of claim 1 for storing data from the data source in the storage device of the data repository, wherein:
the data table is partitioned into data records wherein each data record contains an array of table entries containing at least one table entry.
-
6. The method of claim 5 for storing data from the data source in the storage device of the data repository, wherein step (c) includes the steps of:
-
(c1) fetching a first/next data record;
(c2) determining whether the fetched first/next data record is the last data record of a linked list of one or more data records in the data table wherein the last data record of the data table is not sorted according to the hash values represented therein, and going to step (c3) when the fetched first/next data record is not the last data record of the data table, and going to step (c5) when the fetched data record is the last data record of the data table;
(c3) determining whether the hash value of the newly received data unit is smaller than the hash value of the first table entry of the data record, and when the hash value of the newly received data unit is smaller than the hash value of the first table entry, going to step (c1), and when the hash value of the newly received data unit is not smaller than the hash value of the first table entry, going to step (c4);
(c4) determining whether the hash value of the newly received data unit is larger than the hash value of the last table entry of the data record, and when the hash value of the newly received data unit is larger than the hash value of the last table entry of the data record, going to step (c1), and when the hash value of the newly received data unit is not larger than the hash value of the last table entry of the data record, going to step (c6);
(c5) performing a linear search to find a match between the hash value of the newly received data unit and the hash value of a table entry in the data record, and when a match between the hash value of the newly received data unit and the hash value of a table entry in the data record is not found, returning to step (d), and when a match between the hash value of the newly received data unit and the hash value of a table entry in the data record is found, returning to step (e). (c6) performing a binary search to find a match between the hash value of the newly received data unit and the hash value of a table entry in the data record, and when a match between the hash value of the newly received data unit and the hash value of a table entry in the data record is not found, returning to step (d), and when a match between the hash value of the newly received data unit and the hash value of a table entry in the data record is found, returning to step (e).
-
-
7. The method of claim 5 for storing data from the data source in the storage device of the data repository, wherein step (d) includes the steps of:
-
(d1) determining whether a data record exists to receive the table entry of the newly received data unit, and if a data record does not exist to receive the table entry, going to step (d2), and if a data record exists to receive the table entry, going to step (d3);
(d2) creating a new data record to receive the table entry of the newly received data unit, (d3) determining whether the data record has space to receive a new table entry, and if the data record has space to receive a new table entry, going to step (d6), and if the data record does not have space to receive a new table entry, going to step (d4);
(d4) sorting the last data record according to the hash values of the record entries appearing therein;
(d5) creating a new data record to be a new last data record of the data table, and linking the new last data record to the chain of one or more data records of the data table;
(d6) inserting the new table entry into the last data record of the data table, and returning to step (d).
-
-
8. The method of claim 1 for storing data from the data source in the storage device of the data repository, wherein step (e) includes the steps of:
-
(e1) upon identifying a match between the hash value of a new data unit and a hash value already represented in the data table, determining whether there is sufficient room in a suspense array to insert a new suspense element wherein the suspense array includes one or more suspense elements, and a suspense element contains the data of a new data unit having a hash value that matches the hash value of a table entry residing in the data table, and (e2) if there is room in the suspense array to insert a new suspense element, going to step (e3), and if there is not room in the suspense buffer to insert a new suspense element, going to step (e4);
(e3) inserting the new suspense element into the suspense array, and returning to step (e);
(e4) flushing the suspense buffer by reading each suspense element of the suspense array and, for each suspense element, accessing the table entry having a matching hash value and using the repository allocation unit pointer therein to read the data of the corresponding repository allocation unit, and comparing the data of the data unit represented by the suspense element and the data of the corresponding repository allocation unit, and if the data of the suspense element matches the data of the corresponding repository allocation unit, discarding the data unit, and if the data of the suspense element does not match the data of the corresponding repository allocation unit, writing the data of the data unit into a newly allocated repository allocation unit, generating a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit, and inserting the new table entry into the data table, and returning to step (e).
-
-
9. The method of claim 8 for flushing the suspense buffer, wherein step (e3) includes the steps of:
-
(e3-a) sorting the suspense array by the repository allocation unit pointers to the repository allocation units of the suspense elements in the suspense buffer;
(e3-b) allocating a flushing buffer to store at least one of the suspense elements stored in the suspense array;
(e3-c) setting a suspense array index pointer to point to the first sorted suspense element;
(e3-d) reading suspense elements from the repository into the flushing buffer, starting with the suspense element indicated by the suspense array index pointer;
(e3-e) processing each allocation unit corresponding to each suspense element in the flushing buffer by comparing the data of the data unit represented by a suspense element and the data of the corresponding repository allocation unit, and if the data of the suspense element matches the data of the corresponding repository allocation unit, discarding the data unit, and if the data of the suspense element does not match the data of the corresponding repository allocation unit, writing the data of the data unit into a newly allocated repository allocation unit, generating a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit, and inserting the new table entry into the data table;
(e3-f) after processing each suspense element in the flushing buffer, advancing the suspense array index pointer to the next suspense array entry representing a suspense element in the suspense buffer that has not been processed, and return to step (e3-d);
(e3-g) when there are no more suspense elements in the flushing buffer to be processed, returning to step e.
-
-
10. The method of claim 9 for flushing the suspense array, wherein the method for determining whether the data of a suspense element matches the data of a data unit already represented in a table entry of a data table and residing in a repository allocation unit of step (e3-3) includes the steps of:
-
(e3-3a) comparing the contents of a data unit represented in a suspense element with the contents of a data unit already residing in a repository allocation unit and represented in a table entry of a data table, if the contents of the data unit in the suspense element match the contents of the data unit already residing in a repository allocation unit and represented in a table entry, discarding the data unit in the suspense element, if the contents of the data unit in the suspense element do not match the contents of the data unit already residing in a repository allocation unit and represented in a data record, writing the data of the data unit to a newly allocated repository allocation unit and adding a corresponding table entry containing the hash value of the data unit and a repository allocation unit pointer to the location of the data unit in the newly allocated repository allocation unit to the data table, and returning to step (e3-f).
-
-
11. The method of claim 1 for storing data from the data source in the storage device of the data repository, wherein:
-
the repository allocation units are organized into one or more containers, each container is organized into one or more compartments, and each compartment includes one or more repository allocation unit.
-
-
12. The method of claim 11 for storing data from the data source in the storage device of the data repository, further including:
-
a compartment set file associated with a container, the compartment set file containing a list of compartments that are to be treated as a single file.
-
-
21. The method of claim 1, further comprising the steps of:
(g) mounting the contents of the repository allocation units of a data repository into a system as a restored disk volume having a directory structure identical to that of the data source, and accessing files on the restored disk volume from a software application using file system input/output calls.
-
13. In a computer system including at least one data source wherein data is stored in source allocation units in clusters of the data source and a data repository having access to the data source and including a storage device for storing data in repository allocation units, a data compression mechanism for storing data from the data source in the storage device of the data repository in compressed form by eliminating duplicate clusters of the data source, comprising:
-
a restructing mechanism for reading data from the source allocation units and restructuring the data into data units having a size corresponding to the repository allocation units;
a hash generator for generating a hash value for the data of each data unit read from the data source;
a table search mechanism responsive to each new data unit read from the data source for searching a data table for a data record having a record entry having a hash value matching a hash value of the data unit read from the data source, wherein the data table includes at least one data record and each data record contains the hash value of a data unit stored in a repository allocation unit and a repository allocation unit pointer to the corresponding repository allocation unit;
a storage manager responsive to operation of the table search mechanism for writing a newly received data unit into a newly allocated repository allocation unit when the hash value of the newly received data unit does not match a hash value of a record entry in a data record, a table generator responsive to operation of the table search mechanism for generating a new record entry containing the hash value of the newly received data unit and a repository allocation unit pointer to the newly allocated repository allocation unit and writing the new record entry into a data record;
a suspense processor responsive to operation of the table search mechanism for writing the data of the newly received data unit and the corresponding hash value into a suspense element of a suspense buffer when the hash value of a newly received data unit matches the hash value of a record entry in a data record, and for each suspense element, accessing the record entry having a matching hash value and using the repository allocation unit pointer therein to read the data of the corresponding repository allocation unit, comparing the data of the data unit and the data of the corresponding repository allocation, and if the data of the data unit matches the data of the corresponding repository allocation unit, discarding the data unit, and if the data of the data unit does not match the data of the corresponding repository allocation unit, indicating the mismatch to the storage manager and the table generator, wherein the storage manager is responsive to the suspense processor for writing the data of the data unit into a newly allocated repository allocation unit, the table generator is responsive to the suspense processor for generating a new record entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit and writing the new record entry into a data record. - View Dependent Claims (14, 15)
the repository allocation units are organized into one or more containers, each container is organized into one or more compartments, and each compartment includes one or more repository allocation units.
-
-
15. The data compression mechanism of claim 14, further including:
-
a compartment set file associated with a container, the compartment set file containing a list of compartments that are to be treated as a single file.
-
-
16. In a mass storage device for storing data unit received from at least one data source and including a storage element for storing the data and a controller for controlling the storing of data in storage allocation units of the storage element, a data compression mechanism for storing the data in compressed form by eliminating the storing of duplicate data units, comprising:
-
a hash generator for generating a hash value for the data of each data unit received by the mass storage device;
a table search mechanism responsive to each new data unit for searching a data table for a table entry having a hash value matching the hash value of the new data unit, wherein each table entry contains the hash value of a data unit stored in the storage element and an indicator of the storage allocation unit containing the data unit;
a storage manager responsive to operation of the table search mechanism for writing a newly received data unit into a newly allocated storage allocation unit of the storage element when the hash value of the newly received data unit does not match a hash value of a table entry and discarding the newly received data unit when the hash value of the newly received data unit matches a hash value of a table entry, and a table generator responsive to operation of the table search mechanism when the hash value of the newly received data unit does not match a hash value of a table entry for generating a new table entry containing the hash value of the newly received data unit and an indicator of the newly allocated storage allocation unit containing the newly received data unit and writing the new record entry into the data table. - View Dependent Claims (17, 18, 19, 20)
a suspense processor responsive to operation of the table search mechanism for writing the data of the newly received data unit and the corresponding hash value into a suspense element of a suspense buffer when the hash value of a newly received data unit matches the hash value of a table entry, and for each suspense element, accessing the table entry having a matching hash value and using the indicator of the storage allocation unit containing the data unit therein to read the data of the corresponding repository allocation unit, comparing the data of the data unit and the data of the corresponding repository allocation, and if the data of the data unit matches the data of the corresponding repository allocation unit, discarding the data unit, and if the data of the data unit does not match the data of the corresponding repository allocation unit, indicating the mismatch to the storage manager and the table generator, wherein the storage manager is responsive to the suspense processor for writing the data of the data unit into a newly allocated repository allocation unit, the table generator is responsive to the suspense processor for generating a new record entry containing the hash value of the data unit and an indicator of the newly allocated storage allocation unit containing the data unit and writing the new table entry into the data table.
-
-
18. The mass storage device of claim 16 wherein the mass storage device is a disk drive including at least one magnetic disk storage element or solid-state mechanism or similar mass storage device and the repository allocation units are sectors of the magnetic disk storage element.
-
19. The mass storage device of claim 16 wherein the mass storage device includes at least one solid-state mass storage mechanism having storage space organized as repository allocation units.
-
20. The mass storage device of claim 16 wherein the hash generator and table lookup mechanisms are replaced by associative array hardware.
Specification