Synchronized data deduplication
First Claim
1. A computer-implemented method for performing data deduplication for data used by a plurality of computing systems, the method comprising:
receiving, over a computer network and at a shared deduplicated storage repository, data from a first computing system of a plurality of computing systems, each of the plurality of computing systems physically separate from the shared deduplicated storage repository and including application software executing thereon, the application software of the first computing system generating the received data;
with a processing system which comprises computer hardware, is physically separate from the plurality of computing systems, and is located at the shared deduplicated storage repository, performing a data deduplication operation on the received data, the deduplication operation comprising:
defining a segment of the received data;
applying an algorithm to the defined data segment to generate a signature for the defined data segment;
comparing the signature for the defined data segment with one or more signatures stored in a central reference table for one or more previously defined data segments to determine whether the defined segment is already stored in the shared deduplicated storage repository; and
updating the central reference table to include the signature for the defined data segment and a reference for the defined data segment if the defined data segment is not in the shared deduplicated storage repository;
subsequent to said performing the data deduplication operation, analyzing data traffic received from the plurality of computing systems;
based on said analyzing the data traffic, determining at least a second computing system of the plurality of computing systems to which to transmit an updated partial instantiation of the central reference table, the partial instantiation including the signature for the defined data segment;
transmitting the updated partial instantiation of the central reference table from the shared deduplicated storage repository to the determined second computing system of the plurality of computing systems, such that the partial instantiation of the central reference table local to the second computing system includes the at least one signature and a partial instantiation of the central reference table local to a third computing system of the plurality of computing systems does not include the at least one signature; and
with the second computing system, subsequent to said transmitting the partial instantiation of the central reference table:
generating a signature for a first data segment generated by the application software executing on the second computing system, the first data segment matching the defined data segment and scheduled for storage in the shared deduplicated storage repository;
comparing the signature for the first data segment with one or more signatures stored in the partial instantiation of the central reference table local to the second computing system;
determining that an entry exists in the partial instantiation of the central reference table local to the second computing system that corresponds to the signature for the first data segment; and
transmitting the signature for the first data segment over the network from the second computing system to the shared deduplicated storage repository without transmitting the first data segment itself.
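The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the fixed segment size, the use of SHA-256 as the signature algorithm, and the names `Repository`, `Client`, `ingest`, `receive_update`, and `prepare_upload` are all assumptions made for illustration. The traffic analysis step is omitted here; the repository simply sends the update to one chosen system.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical fixed segment size in bytes, tiny for illustration


def signature(segment: bytes) -> str:
    """Generate a signature for a segment (SHA-256 is an assumed algorithm)."""
    return hashlib.sha256(segment).hexdigest()


def segments(data: bytes, size: int = SEGMENT_SIZE):
    """Define fixed-size segments of the received data (an assumed scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Repository:
    """Shared deduplicated store holding the central reference table."""

    def __init__(self):
        self.central_table = {}  # signature -> reference (index into blocks)
        self.blocks = []         # unique segments actually stored

    def ingest(self, data: bytes):
        """Deduplicate received data and return the newly added signatures."""
        added = []
        for seg in segments(data):
            sig = signature(seg)
            if sig not in self.central_table:
                self.blocks.append(seg)
                self.central_table[sig] = len(self.blocks) - 1
                added.append(sig)
        return added

    def accept_signature(self, sig: str) -> int:
        """Handle a signature-only upload by returning the existing reference."""
        return self.central_table[sig]


class Client:
    """Computing system holding a partial instantiation of the table."""

    def __init__(self):
        self.partial_table = set()

    def receive_update(self, sigs):
        self.partial_table.update(sigs)

    def prepare_upload(self, segment: bytes):
        """Send only the signature when the local table already lists it."""
        sig = signature(segment)
        if sig in self.partial_table:
            return ("signature", sig)
        return ("data", segment)


repo = Repository()
second, third = Client(), Client()
new_sigs = repo.ingest(b"ABCDABCD")        # data from the first system
second.receive_update(new_sigs)            # only the second system is updated
kind, payload = second.prepare_upload(b"ABCD")
third_kind, _ = third.prepare_upload(b"ABCD")
```

In this sketch the second system uploads only a signature for the matching segment, while the third system, whose local table was not updated, would still send the full segment.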
Abstract
A system and method for data deduplication is presented. Data received from one or more computing systems is deduplicated, and the results of the deduplication process are stored in a reference table. A representative subset of the reference table is shared among a plurality of systems that utilize the data deduplication repository. This representative subset of the reference table can be used by the computing systems to deduplicate data locally before it is sent to the repository for storage. Likewise, it can be used to allow deduplicated data to be returned from the repository to the computing systems. In some cases, the representative subset can be a proper subset, wherein a portion of the reference table is identified and shared among the computing systems to reduce bandwidth requirements for reference-table synchronization.
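As a rough illustration of sharing a proper subset of the reference table, the sketch below keeps only the most frequently referenced signatures. The abstract does not say how the subset is chosen; ranking by reference count, the function name `representative_subset`, and the sample signatures are assumptions.

```python
from collections import Counter


def representative_subset(reference_counts: Counter, max_entries: int) -> set:
    """Pick the most frequently referenced signatures (an assumed heuristic)."""
    return {sig for sig, _ in reference_counts.most_common(max_entries)}


# Hypothetical reference counts: how often each stored segment is referenced.
counts = Counter({"sig-a": 40, "sig-b": 3, "sig-c": 17, "sig-d": 1})
subset = representative_subset(counts, max_entries=2)
```

Capping the number of shared entries is one way to bound the bandwidth spent on synchronization, at the cost of missing some local matches.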
12 Claims
7. A computing system, comprising:
a deduplicated data storage repository in communication with a plurality of computing devices over a network, the deduplicated data storage repository shared by the plurality of computing devices, the plurality of computing devices each physically separate from the shared deduplicated storage repository and having application software executing thereon; and
a data deduplication computing device comprising computing hardware, physically separate from the plurality of computing systems, communicatively coupled to the data storage repository, and configured to:
perform data deduplication on a defined data segment received from a first computing system of the plurality of computing systems, the data segment generated by the application software executing on the first computing system;
generate a central reference table that includes signatures corresponding to deduplicated data contained in the data storage repository;
update the central reference table to include a signature for the defined data segment;
subsequent to updating the central reference table, analyze data traffic received from the plurality of computing systems;
based on the analysis of the data traffic, determine at least a second computing system of the plurality of computing systems to which to transmit an updated partial instantiation of the central reference table, the partial instantiation including the signature for the defined data segment; and
transmit the partial updated instantiation of the central reference table to the second computing system, such that the partial instantiation of the central reference table local to the second computing system is updated to include the at least one signature while a partial instantiation of the central reference table local to the third computing system does not include the at least one signature;
the second computing system configured, subsequent to receiving the updated partial instantiation of the central reference table, to:
generate a signature for a first data segment generated by the application software executing on the second computing system, the first data segment matching the defined data segment and scheduled for storage in the shared deduplicated storage repository;
compare the signature for the first data segment with one or more signatures stored in the partial instantiation of the central reference table local to the second computing system;
determine that an entry exists in the partial instantiation of the central reference table local to the second computing system that corresponds to the signature for the first data segment; and
transmit the signature for the first data segment over the network from the second computing system to the deduplicated data storage repository without transmitting the first data segment itself.
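The claim does not specify how the data traffic is analyzed to choose which systems receive the updated partial table. One possible heuristic, shown purely as an assumption, is to send the update to systems whose recently uploaded signatures overlap strongly with those of the originating system. The function name `choose_recipients`, the `min_overlap` threshold, and the sample traffic data are hypothetical.

```python
def choose_recipients(traffic, source, min_overlap=0.5):
    """Select systems whose recent traffic resembles the source system's.

    traffic maps a system id to the set of signatures it recently uploaded.
    Similarity is the Jaccard overlap of signature sets (an assumed metric).
    """
    source_sigs = traffic[source]
    recipients = []
    for system, sigs in traffic.items():
        if system == source or not sigs:
            continue
        overlap = len(sigs & source_sigs) / len(sigs | source_sigs)
        if overlap >= min_overlap:
            recipients.append(system)
    return recipients


# Hypothetical recent traffic observed at the repository.
traffic = {
    "first": {"s1", "s2", "s3"},
    "second": {"s1", "s2"},
    "third": {"s9"},
}
recipients = choose_recipients(traffic, source="first")
```

With this sample traffic the second system is selected and the third is not, which matches the outcome the claim describes: only the second system's local partial table gains the new signature.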
Specification