Data processing method and apparatus in cluster system
First Claim
1. A method of data de-duplication performed by a first processing node in a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
receiving a data stream to be stored after de-duplication;
dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
deriving a first super-chunk identification (SID) for a super-chunk of the segment;
identifying a second processing node of the storage system that corresponds to the first SID;
querying the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
obtaining fingerprints of data chunks stored in the first data container that corresponds to the first SID;
identifying, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose signatures are not found in the obtained fingerprints;
storing the new data chunks in a local buffer of the first processing node;
selecting, according to a preset storage policy, a second data container of the storage system to write data in the local buffer;
deriving a second SID for data of the local buffer;
identifying, by the same way for identifying the second processing node, a fourth processing node of the storage system that corresponds to the second SID for data of the local buffer; and
storing correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node.
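The steps of claim 1 can be sketched in a few lines of Python. This is only an illustrative, single-process model and not the patented implementation. The following are assumptions that do not come from the claim text: SHA-1 is used for fingerprints, the SID is taken to be the smallest chunk fingerprint in the super-chunk, a SID is mapped to a node by a modulo operation, and the "preset storage policy" is reduced to a container ID passed in by the caller. The names Cluster, derive_sid, node_for and dedup_super_chunk are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-1 stands in for whatever fingerprint function is used.
    return hashlib.sha1(chunk).hexdigest()


def derive_sid(chunks):
    # Assumption: the SID is the smallest fingerprint among the chunks.
    return min(fingerprint(c) for c in chunks)


def node_for(sid: str, num_nodes: int) -> int:
    # Assumption: SID-to-node mapping is a plain modulo over the node count.
    return int(sid, 16) % num_nodes


class Cluster:
    """Toy model: per-node SID indexes plus a shared container store."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        # One index per node: SID -> container ID.
        self.sid_index = [dict() for _ in range(num_nodes)]
        # Container ID -> {fingerprint: chunk}. A real system would spread
        # containers over nodes; one dict keeps the sketch short.
        self.containers = {}

    def dedup_super_chunk(self, chunks, target_container):
        # Derive the first SID and find the node that indexes it.
        sid = derive_sid(chunks)
        index_node = node_for(sid, self.num_nodes)
        # Query that node for the container recorded under the SID.
        container_id = self.sid_index[index_node].get(sid)
        # Obtain the fingerprints already stored in that container.
        known = set(self.containers.get(container_id, {}))
        # New chunks are those whose fingerprints were not found.
        buffer = [c for c in chunks if fingerprint(c) not in known]
        if buffer:
            # Write the buffer to the container chosen by the caller.
            store = self.containers.setdefault(target_container, {})
            for c in buffer:
                store[fingerprint(c)] = c
            # Derive a second SID for the buffer and record the mapping on
            # the node found the same way as before.
            new_sid = derive_sid(buffer)
            owner = node_for(new_sid, self.num_nodes)
            self.sid_index[owner][new_sid] = target_container
        return buffer


cluster = Cluster(num_nodes=4)
chunks = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
first = cluster.dedup_super_chunk(chunks, "c0")   # every chunk is new
second = cluster.dedup_super_chunk(chunks, "c1")  # every chunk is a duplicate
```

In this sketch the second call finds the SID on the index node, loads the fingerprints of container "c0", and therefore writes nothing.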
Abstract
In embodiments of the present invention, when a duplicate data query is performed on a received data stream, a first physical node in the cluster system that corresponds to each first sketch value representing the data stream is identified, and the first sketch value is then sent to the identified physical node for the duplicate data query. The procedure of the duplicate data query does not change as the number of nodes in the cluster system increases; therefore, the calculation amount of each node does not increase as the number of nodes increases.
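The scaling property described in the abstract can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the method of the patent: the sketch value is assumed to be a SHA-1 digest read as an integer, and the sketch-to-node mapping is assumed to be a modulo operation. The function names sketch_value and nodes_to_query are hypothetical.

```python
import hashlib


def sketch_value(data: bytes) -> int:
    # Assumption: a SHA-1 digest read as an integer serves as the sketch value.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def nodes_to_query(sketches, num_nodes: int):
    # Each sketch value maps to exactly one node (assumed modulo mapping),
    # so the set of nodes contacted is bounded by the number of sketches.
    return {s % num_nodes for s in sketches}


# Three sketch values representing one received data stream.
sketches = [sketch_value(bytes([i]) * 8) for i in range(3)]

# The number of nodes contacted never exceeds the number of sketch values,
# whether the cluster has 4 nodes or 1024.
contacted = {n: len(nodes_to_query(sketches, n)) for n in (4, 64, 1024)}
```

Because each sketch value is routed to a single node, the query is never broadcast to the whole cluster, which is the property the abstract relies on.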
13 Claims
7. A data processing apparatus for performing data de-duplication applied to a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
a memory configured to store instructions; and
a processor coupled to the memory and configured to execute the instructions to:
receive a data stream to be stored after de-duplication;
divide a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
derive a first super-chunk identification (SID) for a super-chunk of the segment;
identify a second processing node of the storage system that corresponds to the first SID;
query the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
obtain fingerprints of data chunks stored in the first data container that corresponds to the first SID;
identify, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose signatures are not found in the obtained fingerprints;
store the new data chunks in a local buffer of the first processing node;
select, according to a preset storage policy, a second data container of the storage system to write data in the local buffer;
derive a second SID for data of the local buffer;
identify, by the same way for identifying the second processing node, a fourth processing node of the storage system that corresponds to the second SID for data of the local buffer; and
store correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node.
13. A method of data de-duplication performed by a first processing node in a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
receiving a data stream to be stored after de-duplication;
dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
deriving a first SID for each of the super-chunks and identifying a first processing node of the storage system that corresponds to said each super-chunk based on the SID of said each super-chunk;
sending the first SIDs of the super-chunks to respective corresponding processing nodes;
receiving responses from at least a subgroup of the corresponding processing nodes, wherein each response identifies container IDs that correspond to first SIDs sent to the corresponding processing node;
selecting, from the container IDs in the responses from the subgroup of the processing nodes, a subset of container IDs based on times of the container IDs being identified in the responses;
identifying, based on querying containers corresponding to the subset of container IDs and using fingerprint comparisons, new data chunks in the super-chunks of the segment;
storing the new data chunks in a local buffer of the first processing node;
selecting, according to a preset storage policy, a second data container of the storage system to write data in the local buffer;
deriving a second SID for data of the local buffer;
identifying, by the same way for identifying the corresponding processing node, a second processing node of the storage system that corresponds to the second SID for data of the local buffer; and
storing correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node.
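The container selection step of claim 13 can be read as a frequency count over the responses. The Python sketch below follows that reading and is an interpretation only. It assumes that "times of the container IDs being identified" means the number of responses entries naming a container, and that a fixed number of the most frequently named containers is kept. The function name select_containers and the parameter top_k are hypothetical.

```python
from collections import Counter


def select_containers(responses, top_k: int):
    # Count how often each container ID is named across all node responses.
    counts = Counter(cid for response in responses for cid in response)
    # Keep the top_k most frequently named container IDs (assumed policy).
    return [cid for cid, _ in counts.most_common(top_k)]


# Hypothetical responses from three index nodes, one list per node.
responses = [["c7", "c2"], ["c7"], ["c2", "c7", "c9"]]
selected = select_containers(responses, top_k=2)
```

Here "c7" is named three times and "c2" twice, so those two containers would be queried for fingerprints while "c9" is skipped. Limiting the query to the most frequently named containers keeps the number of fingerprint lookups small.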
Specification