DATA PROCESSING METHOD AND APPARATUS IN CLUSTER SYSTEM
First Claim
1. A method of data de-duplication performed by a first processing node in a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
receiving a data stream to be stored after de-duplication;
dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
deriving a first super-chunk identification (SID) for a super-chunk of the segment;
identifying a second processing node of the storage system that corresponds to the first SID;
querying the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
obtaining fingerprints of data chunks stored in the first data container that corresponds to the first SID;
identifying, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose signatures are not found in the obtained fingerprints; and
storing the new data chunks in a local buffer of the first processing node.
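The steps of claim 1 can be illustrated with a minimal in-process sketch. This is not the patented implementation: the `Node` class, the `super_chunk_id` rule (smallest chunk fingerprint), the modulo mapping in `owner_of`, and the use of SHA-1 are all assumptions made for illustration, since the claim does not fix how the SID is derived or how it maps to a node.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-1 hex digest as the chunk fingerprint.
    return hashlib.sha1(chunk).hexdigest()


def super_chunk_id(chunks) -> str:
    # Assumption: the SID is the smallest chunk fingerprint in the super-chunk.
    return min(fingerprint(c) for c in chunks)


def owner_of(sid: str, num_nodes: int) -> int:
    # Assumption: the node that corresponds to a SID is chosen by modulo.
    return int(sid, 16) % num_nodes


class Node:
    """Hypothetical processing node holding an SID index and data containers."""

    def __init__(self):
        self.sid_index = {}     # SID -> (holder node index, container ID)
        self.containers = {}    # container ID -> {fingerprint: chunk}
        self.local_buffer = []  # new chunks waiting to be written out


def dedup_super_chunk(first: Node, nodes, chunks):
    sid = super_chunk_id(chunks)
    # Identify the second node, the one that corresponds to the SID.
    second = nodes[owner_of(sid, len(nodes))]
    known = set()
    # Query the second node for the container that corresponds to the SID.
    location = second.sid_index.get(sid)
    if location is not None:
        holder, container_id = location
        third = nodes[holder]  # the container lives on a third node
        known = set(third.containers[container_id])  # obtained fingerprints
    new_chunks = []
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp not in known:
            # Also skips repeats inside the same super-chunk (a choice of
            # this sketch, not something the claim requires).
            known.add(fp)
            new_chunks.append(chunk)
    first.local_buffer.extend(new_chunks)
    return new_chunks
```

In this sketch a super-chunk whose SID is unknown to the second node is treated as entirely new, which is one possible policy among several.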
Abstract
In embodiments of the present invention, when a duplicate data query is performed on a received data stream, a first physical node in a cluster system that corresponds to each first sketch value representing the data stream is identified, and the first sketch value is then sent to the identified physical node for the duplicate data query. Because the procedure of the duplicate data query does not change as the number of nodes in the cluster system increases, the calculation amount of each node does not increase with the number of nodes.
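The routing idea in the abstract can be sketched as follows. The names `sketch_value`, `node_for_sketch`, and `route_queries` are hypothetical, and both the truncated SHA-1 sketch and the modulo mapping are assumptions; the abstract states only that each sketch value corresponds to one physical node. The point of the sketch is that each sketch value produces exactly one query, whatever the cluster size.

```python
import hashlib


def sketch_value(data: bytes) -> int:
    # Assumption: first 8 bytes of a SHA-1 digest stand in for a sketch value.
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def node_for_sketch(sketch: int, num_nodes: int) -> int:
    # Assumption: modulo mapping from sketch value to physical node index.
    return sketch % num_nodes


def route_queries(sketches, num_nodes: int):
    # Group sketch values by the single node each one is sent to.
    per_node = {}
    for s in sketches:
        per_node.setdefault(node_for_sketch(s, num_nodes), []).append(s)
    return per_node
```

The total number of queries equals the number of sketch values for any `num_nodes`, so adding nodes spreads the same work over more machines instead of adding work per node.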
23 Claims
12. A data processing apparatus for performing data de-duplication applied to a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
a memory configured to store instructions; and
a processor coupled to the memory and configured to execute the instructions to:
receive a data stream to be stored after de-duplication;
divide a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
derive a first super-chunk identification (SID) for a super-chunk of the segment;
identify a second processing node of the storage system that corresponds to the first SID;
query the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
obtain fingerprints of data chunks stored in the first data container that corresponds to the first SID;
identify, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose signatures are not found in the obtained fingerprints; and
store the new data chunks in a local buffer of the first processing node.
23. A method of data de-duplication performed by a first processing node in a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
receiving a data stream to be stored after de-duplication;
dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
deriving a SID for each of the super-chunks and identifying a processing node of the storage system that corresponds to said each super-chunk based on the SID of said each super-chunk;
sending the SIDs of the super-chunks to respective corresponding processing nodes;
receiving responses from at least a subgroup of the corresponding processing nodes, wherein each response identifies container IDs that correspond to SIDs sent to the corresponding processing node;
selecting, from the container IDs in the responses from the subgroup of the processing nodes, a subset of container IDs based on the number of times each container ID is identified in the responses;
identifying, based on querying containers corresponding to the subset of container IDs and using fingerprint comparisons, new data chunks in the super-chunks of the segment; and
storing the new data chunks in a local buffer of the first processing node.
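The container selection step of claim 23 can be sketched as a frequency count over the responses. The function name `select_containers` and the `top_k` cutoff are hypothetical; the claim says only that the subset is chosen based on how often container IDs are identified, not that a fixed number is kept.

```python
from collections import Counter


def select_containers(responses, top_k: int):
    """Pick the container IDs named most often across node responses.

    responses: one list of container IDs per responding node.
    top_k: assumed cutoff for the size of the selected subset.
    """
    counts = Counter()
    for container_ids in responses:
        counts.update(container_ids)
    # most_common orders by descending count; ties keep first-seen order.
    return [cid for cid, _ in counts.most_common(top_k)]
```

For example, given responses `[["c1", "c2"], ["c2", "c3"], ["c2", "c1"]]` and `top_k=2`, the sketch selects `c2` (named three times) and `c1` (named twice), so only those two containers would be queried for fingerprints.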
Specification