Data duplication detection system and method for controlling data duplication detection system
First Claim
1. A data duplication detection system for detecting a duplication of data, the data duplication detection system comprising:
a data duplication determination part configured to determine whether each of a plurality of pieces of chunk data formed by dividing a received data is a duplicate of chunk data that has already been stored;
a storage part configured to store chunk data that has been determined not to be duplicative by the data duplication determination part;
a first management table configured to manage, for each piece of chunk data stored in the storage part, identity guarantee data that indicates a data identity, with the identity guarantee data being associated with storage-destination information that indicates a data storage destination;
a second management table created on the basis of the identity guarantee data for each piece of chunk data stored in the storage part, the second management table being configured to indicate with prescribed reliability that a piece of chunk data is stored in the storage part, the prescribed reliability being a probability equal to or greater than a probability threshold value, the threshold value calculated based on a number of prescribed hash functions used to determine hash values for the piece of chunk data and a number of bits of a bit string that indicates the hash values at positions of the bit string corresponding to the hash values; and
a third management table configured to manage a plurality of chunk data sets formed by grouping together the pieces of chunk data stored in the storage part, the third management table being configured to manage the identity guarantee data for prescribed chunk data that represents each of the plurality of chunk data sets, wherein the data duplication determination part:
in a case where the second management table indicates that a target chunk data included in the received data is stored in the storage part, and, in addition, in a case where determination is made that the identity guarantee data for the target chunk data is not stored in the third management table, temporarily stores the target chunk data in a temporary storage part;
in a case where the second management table indicates that a second target chunk data that differs from the target chunk data is stored in the storage part, and, in addition, in a case where determination is made that the identity guarantee data for the second target chunk data is stored in the third management table, determines whether the identity guarantee data for the target chunk data stored in the temporary storage part is stored in the first management table;
in a case where determination is made that the identity guarantee data for the target chunk data stored in the temporary storage part is stored in the first management table, determines that the target chunk data stored in the temporary storage part is already stored in the storage part; and
in a case where determination is made that the identity guarantee data for the target chunk data stored in the temporary storage part is not stored in the first management table, determines that the target chunk data stored in the temporary storage part is not stored in the storage part.
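The second management table recited above (a bit string whose positions are set according to several hash values) reads like a Bloom filter, although the claim does not use that name. The following is a minimal sketch under that assumption. The class name `SecondManagementTable`, the use of salted SHA-256 to derive positions, and the stored-item count `n` are all hypothetical choices for illustration. The reliability is estimated with the standard Bloom filter approximation, 1 - (1 - e^(-kn/m))^k, where k is the number of hash functions and m is the number of bits.

```python
import hashlib
import math


class SecondManagementTable:
    """Bit string of m bits probed by k hash functions (Bloom-filter-style sketch)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits        # m: length of the bit string
        self.num_hashes = num_hashes    # k: number of prescribed hash functions
        self.bits = bytearray(num_bits) # one byte per bit, for readability
        self.count = 0                  # n: fingerprints registered (assumed bookkeeping)

    def _positions(self, fingerprint: bytes) -> list:
        # Assumption: derive k positions by salting SHA-256 with the hash index.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fingerprint).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, fingerprint: bytes) -> None:
        # Set the bit at every position selected by the hash values.
        for pos in self._positions(fingerprint):
            self.bits[pos] = 1
        self.count += 1

    def might_contain(self, fingerprint: bytes) -> bool:
        # True means "probably stored"; False means "definitely not registered".
        return all(self.bits[pos] for pos in self._positions(fingerprint))

    def reliability(self) -> float:
        # 1 minus the approximate false-positive probability (1 - e^(-kn/m))^k.
        k, m, n = self.num_hashes, self.num_bits, self.count
        return 1.0 - (1.0 - math.exp(-k * n / m)) ** k


table = SecondManagementTable(num_bits=1024, num_hashes=3)
fp = hashlib.sha256(b"example chunk").digest()
table.add(fp)
print(table.might_contain(fp), table.reliability())
```

Because a positive answer from such a table can be a false positive, the claim follows it with checks against the third and first management tables before a chunk is treated as a duplicate.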
Abstract
Data duplication is detected accurately and efficiently. A data duplication determination part (1A) of a storage system (1) temporarily stores target chunk data in a pool (4) when a second management table (T2) indicates that the target chunk data, which is included in received data, has already been stored and, in addition, data for guaranteeing the identity of the target chunk data (a fingerprint (FP)) is not stored in a third management table (T3). When the second management table indicates that other target chunk data, different from the target chunk data, has already been stored and, in addition, the FP of the other target chunk data is determined to be stored in the third management table, the data duplication determination part (1A) redetermines whether the chunk data stored in the pool has already been stored.
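The deferred determination described in the abstract can be sketched roughly as follows. This is an illustrative model, not the patented implementation: the class and method names are hypothetical, T2 is modeled as a plain set of fingerprints that the filter would report as "stored" (with one false positive injected by hand), and two behaviours the abstract does not spell out are assumptions marked in the comments.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    # Assumption: SHA-256 stands in for the identity guarantee data (FP).
    return hashlib.sha256(chunk).digest()


class DuplicationDeterminationPart:
    def __init__(self, t1: dict, t2_maybe: set, t3: set) -> None:
        self.t1 = t1              # T1: fingerprint -> storage destination
        self.t2_maybe = t2_maybe  # T2 stand-in: fingerprints reported as "stored"
        self.t3 = t3              # T3: fingerprints of representative chunks
        self.pool = []            # temporary storage for undecided chunks
        self.results = {}         # chunk -> "duplicate" or "new"

    def receive(self, chunk: bytes) -> None:
        fp = fingerprint(chunk)
        if fp not in self.t2_maybe:
            # Assumption: a negative answer from T2 means the chunk is new.
            self.results[chunk] = "new"
        elif fp not in self.t3:
            # T2 says "stored" but T3 has no entry: defer the decision.
            self.pool.append(chunk)
        else:
            # Assumption: a representative chunk found in T3 is itself a duplicate.
            self.results[chunk] = "duplicate"
            self._redetermine_pool()

    def _redetermine_pool(self) -> None:
        # Check every pooled chunk against T1, then empty the pool.
        for pooled in self.pool:
            in_t1 = fingerprint(pooled) in self.t1
            self.results[pooled] = "duplicate" if in_t1 else "new"
        self.pool.clear()


# b"A" and b"R" are already stored; b"R" represents their chunk data set.
t1 = {fingerprint(b"A"): "loc0", fingerprint(b"R"): "loc1"}
t3 = {fingerprint(b"R")}
t2_maybe = set(t1) | {fingerprint(b"X")}  # b"X" is an injected false positive

part = DuplicationDeterminationPart(t1, t2_maybe, t3)
for chunk in (b"A", b"X", b"N", b"R"):
    part.receive(chunk)
print(part.results)
```

In this run, b"A" and b"X" both wait in the pool until b"R" arrives. The lookup in T1 then separates the genuine duplicate (b"A") from the false positive (b"X").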
11 Citations
14 Claims
1. A data duplication detection system for detecting a duplication of data (set out in full under First Claim above).
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
10. A method for controlling a data duplication detection system for detecting a duplication of data, the data duplication detection system including:
a data duplication determination part configured to determine whether each of a plurality of pieces of chunk data formed by dividing a received data is a duplicate of chunk data that has already been stored;
a storage part configured to store chunk data that has been determined not to be duplicative by the data duplication determination part;
a first management table configured to manage, for each piece of chunk data stored in the storage part, identity guarantee data that indicates a data identity, with the identity guarantee data being associated with storage-destination information that indicates a data storage destination;
a second management table created on the basis of the identity guarantee data for each piece of chunk data stored in the storage part, the second management table being configured to indicate with prescribed reliability that a piece of chunk data is stored in the storage part, the prescribed reliability being a probability equal to or greater than a probability threshold value, the threshold value calculated based on a number of prescribed hash functions used to determine hash values for the piece of chunk data and a number of bits of a bit string that indicates the hash values at positions of the bit string corresponding to the hash values; and
a third management table configured to manage a plurality of chunk data sets formed by grouping together the pieces of chunk data stored in the storage part, the third management table being configured to manage the identity guarantee data for prescribed chunk data that represents each of the plurality of chunk data sets,
the data duplication detection system control method comprising operating the data duplication determination part:
in a case where the second management table indicates that a target chunk data included in the received data is stored in the storage part, and, in addition, in a case where determination is made that the identity guarantee data for the target chunk data is not stored in the third management table, to store the target chunk data in a temporary storage part for temporarily storing the target chunk data;
in a case where the second management table indicates that a second target chunk data that differs from the target chunk data is stored in the storage part, and, in addition, in a case where determination is made that the identity guarantee data for the second target chunk data is stored in the third management table, to determine whether the identity guarantee data for the target chunk data stored in the temporary storage part is stored in the first management table;
in a case where the identity guarantee data for the target chunk data stored in the temporary storage part is stored in the first management table, to determine that the target chunk data stored in the temporary storage part is already stored in the storage part; and
in a case where determination is made that the identity guarantee data for the target chunk data stored in the temporary storage part is not stored in the first management table, to determine that the target chunk data stored in the temporary storage part is not stored in the storage part.
View Dependent Claims (11, 12, 13, 14)
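The claims start from chunk data "formed by dividing a received data" and from identity guarantee data for each chunk (called a fingerprint in the abstract). As a rough illustration only, the sketch below splits a byte string into fixed-size chunks and fingerprints each one. Fixed-size chunking, the SHA-256 hash, and the function names `split_into_chunks` and `fingerprint` are assumptions made for the example; the text above does not specify a chunking scheme or a hash function.

```python
import hashlib


def split_into_chunks(data: bytes, chunk_size: int = 4096) -> list:
    # Assumption: fixed-size chunking; the last chunk may be shorter.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-256 hex digest as the identity guarantee data.
    return hashlib.sha256(chunk).hexdigest()


received = b"abcd" * 3 + b"xy"
chunks = split_into_chunks(received, chunk_size=4)
fingerprints = [fingerprint(c) for c in chunks]

# Chunks with equal content share a fingerprint, which is what allows
# duplicates to be recognised without comparing the chunk bodies directly.
print(len(chunks), len(set(fingerprints)))
```

Equal fingerprints are treated here as equal content, which is the usual premise behind fingerprint-based deduplication.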
Specification