DATA PROCESSING METHOD AND APPARATUS IN CLUSTER SYSTEM

US 20140201169A1
Filed: 12/24/2013
Published: 07/17/2014
Est. Priority Date: 12/12/2012
Status: Active Grant

Abstract

In embodiments of the present invention, when a duplicate data query is performed on a received data stream, a first physical node in the cluster system is identified for each first sketch value representing the data stream, and that first sketch value is sent to the identified physical node for the duplicate data query. Because the procedure of the duplicate data query does not change as the number of nodes in the cluster system increases, the calculation amount of each node does not increase with the number of nodes in the cluster system.
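
The routing idea in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the sketch value is taken to be an integer, the node is picked with the modulus rule recited in claim 6, and the function name `route_sketch` and the node names are hypothetical.

```python
def route_sketch(sketch_value: int, nodes: list[str]) -> str:
    """Pick the single node responsible for a sketch value.

    Assumption: the mapping is sketch_value mod (number of nodes), as in claim 6.
    """
    if not nodes:
        raise ValueError("cluster has no nodes")
    return nodes[sketch_value % len(nodes)]


small_cluster = ["node-0", "node-1", "node-2"]
large_cluster = [f"node-{i}" for i in range(64)]

# Each sketch value is sent to exactly one node, whatever the cluster size.
owner_small = route_sketch(1000, small_cluster)   # 1000 % 3 == 1
owner_large = route_sketch(1000, large_cluster)   # 1000 % 64 == 40
```

In both clusters the query for a given sketch value goes to one node only, which is why the per-node work need not grow as nodes are added.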

23 Claims

1. A method of data de-duplication performed by a first processing node in a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
- receiving a data stream to be stored after de-duplication;
  
  dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
  
  deriving a first super-chunk identification (SID) for a super-chunk of the segment;
  
  identifying a second processing node of the storage system that corresponds to the first SID;
  
  querying the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
  
  obtaining fingerprints of data chunks stored in the first data container that corresponds to the first SID;
  
  identifying, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose fingerprints are not found in the obtained fingerprints; and
  
  storing the new data chunks in a local buffer of the first processing node.
- - 2. The method according to claim 1, further comprising:
    - selecting, according to a preset storage policy, a second data container of the storage system into which data in the local buffer is to be written;
      
      deriving a second SID for the data of the local buffer;
      
      identifying, in the same way as the second processing node is identified, a fourth processing node of the storage system that corresponds to the second SID for the data of the local buffer; and
      
      storing, in the fourth processing node, a correspondence between the second SID for the data of the local buffer and the second data container.
  - 3. The method according to claim 2, wherein virtual nodes are logically obtained by dividing each of the plurality of processing nodes in the storage system, and a correspondence between virtual nodes and processing nodes in the storage system is held in each processing node;
    - wherein identifying the second processing node of the storage system that corresponds to the first SID comprises:
      
      identifying a first virtual node of the storage system that corresponds to the first SID; and
      
      obtaining, by querying the correspondence between virtual nodes and processing nodes, the second processing node corresponding to the first virtual node.
  - 4. The method according to claim 3, wherein identifying, in the same way as the second processing node is identified, the fourth processing node of the storage system that corresponds to the second SID for the data of the local buffer comprises:
    - identifying, in the same way as the first virtual node is identified, a second virtual node; and
      
      obtaining, by querying the correspondence between virtual nodes and processing nodes, the fourth processing node;
      
      and wherein storing the correspondence between the second SID for the data of the local buffer and the second data container in the fourth processing node comprises:
      
      storing the correspondence between the second SID for the data of the local buffer and the second data container in the second virtual node corresponding to the fourth processing node.
  - 5. The method according to claim 3, further comprising:
    - when a data migration condition is met for the first processing node, migrating, as a whole, a virtual node of the first processing node whose data needs to be migrated to a target processing node; and
      
      updating the correspondence between the migrated virtual node and the target processing node, and notifying another processing node in the storage system to update the correspondence between the migrated virtual node and the target processing node.
  - 6. The method according to claim 1, wherein identifying the second processing node of the storage system that corresponds to the first SID comprises:
    - performing a modulus operation on the first SID with respect to the number of all processing nodes in the storage system to obtain the second processing node that corresponds to the first SID.
  - 7. The method according to claim 2, wherein selecting, according to the preset storage policy, the second data container of the storage system comprises:
    - storing the data in the local buffer in the second data container of a fifth processing node when a preset storage condition is met, wherein the fifth processing node has the least load in the storage system.
  - 8. The method according to claim 2, wherein selecting, according to the preset storage policy, the second data container of the storage system comprises:
    - storing the data in the local buffer in the second data container of the first processing node when a preset storage condition is met.
  - 9. The method according to claim 1, wherein deriving the first SID for the super-chunk of the segment comprises:
    - obtaining multiple fingerprints corresponding to multiple chunks of the super-chunk; and
      
      selecting, from the multiple fingerprints corresponding to the multiple chunks of the super-chunk, the smallest fingerprint as the first SID for the super-chunk of the segment.
  - 10. The method according to claim 1, wherein identifying the new data chunks based on the comparison between the fingerprints of data chunks in the super-chunk and the obtained fingerprints comprises:
    - loading the obtained fingerprints into a local cache of the first processing node; and
      
      comparing the fingerprints of data chunks in the super-chunk with the fingerprints in the local cache to identify new data chunks whose fingerprints are not found in the local cache.
  - 11. The method according to claim 1, wherein identifying the new data chunks based on the comparison between the fingerprints of data chunks in the super-chunk and the obtained fingerprints comprises:
    - sending a query instruction to a processing node in which the first data container is stored, the query instruction carrying the fingerprints of data chunks in the super-chunk and instructing that processing node to identify new data chunks whose fingerprints are not found in the obtained fingerprints; and
      
      receiving a query result returned by the processing node in which the first data container is stored.
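
Two steps of the method claims, deriving the SID as the smallest chunk fingerprint (claim 9) and finding new chunks by comparing against fingerprints held in a local cache (claim 10), can be sketched as below. This is a minimal illustration under assumptions: SHA-1 as the fingerprint function, in-memory sets standing in for containers, and the names `fingerprint`, `derive_sid`, and `find_new_chunks` are all hypothetical and do not come from the patent.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Fingerprint of one data chunk (SHA-1 is an assumption, not from the claims)."""
    return hashlib.sha1(chunk).hexdigest()


def derive_sid(super_chunk: list[bytes]) -> str:
    """Claim 9: the smallest chunk fingerprint serves as the super-chunk's SID."""
    return min(fingerprint(c) for c in super_chunk)


def find_new_chunks(super_chunk: list[bytes], container_fps: set[str]) -> list[bytes]:
    """Claim 10: compare chunk fingerprints with those loaded into a local cache."""
    local_cache = set(container_fps)
    return [c for c in super_chunk if fingerprint(c) not in local_cache]


super_chunk = [b"alpha", b"beta", b"gamma"]
sid = derive_sid(super_chunk)

# Pretend the container located via the SID already holds "alpha" and "beta".
stored = {fingerprint(b"alpha"), fingerprint(b"beta")}
local_buffer = find_new_chunks(super_chunk, stored)   # only b"gamma" is new
```

Because the SID is one of the super-chunk's own fingerprints, two super-chunks that share most of their chunks are likely to produce the same SID and so be routed to the same container.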

12. A data processing apparatus for performing data de-duplication, applied to a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
- a memory configured to store instructions; and
  
  a processor coupled to the memory and configured to execute the instructions to:
  
  receive a data stream to be stored after de-duplication;
  
  divide a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
  
  derive a first super-chunk identification (SID) for a super-chunk of the segment;
  
  identify a second processing node of the storage system that corresponds to the first SID;
  
  query the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
  
  obtain fingerprints of data chunks stored in the first data container that corresponds to the first SID;
  
  identify, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose fingerprints are not found in the obtained fingerprints; and
  
  store the new data chunks in a local buffer of the first processing node.
  - - 13. The apparatus according to claim 12, wherein the processor is further configured to execute the instructions to:
    - select, according to a preset storage policy, a second data container of the storage system into which data in the local buffer is to be written;
      
      derive a second SID for the data of the local buffer;
      
      identify, in the same way as the second processing node is identified, a fourth processing node of the storage system that corresponds to the second SID for the data of the local buffer; and
      
      store, in the fourth processing node, a correspondence between the second SID for the data of the local buffer and the second data container.
  - 14. The apparatus according to claim 13, wherein virtual nodes are logically obtained by dividing each of the plurality of processing nodes in the storage system, and a correspondence between virtual nodes and processing nodes in the storage system is held in each processing node;
    - wherein identifying the second processing node of the storage system that corresponds to the first SID comprises the steps of:
      
      identifying a first virtual node of the storage system that corresponds to the first SID; and
      
      obtaining, by querying the correspondence between virtual nodes and processing nodes, the second processing node corresponding to the first virtual node.
  - 15. The apparatus according to claim 14, wherein identifying, in the same way as the second processing node is identified, the fourth processing node of the storage system that corresponds to the second SID for the data of the local buffer comprises the steps of:
    - identifying, in the same way as the first virtual node is identified, a second virtual node; and
      
      obtaining, by querying the correspondence between virtual nodes and processing nodes, the fourth processing node;
      
      and wherein storing the correspondence between the second SID for the data of the local buffer and the second data container in the fourth processing node comprises:
      
      storing the correspondence between the second SID for the data of the local buffer and the second data container in the second virtual node corresponding to the fourth processing node.
  - 16. The apparatus according to claim 14, wherein the processor is further configured to execute the instructions to:
    - when a data migration condition is met for the first processing node, migrate, as a whole, a virtual node of the first processing node whose data needs to be migrated to a target processing node; and
      
      update the correspondence between the migrated virtual node and the target processing node, and notify another processing node in the storage system to update the correspondence between the migrated virtual node and the target processing node.
  - 17. The apparatus according to claim 12, wherein identifying the second processing node of the storage system that corresponds to the first SID comprises the step of:
    - performing a modulus operation on the first SID with respect to the number of all processing nodes in the storage system to obtain the second processing node that corresponds to the first SID.
  - 18. The apparatus according to claim 13, wherein selecting, according to the preset storage policy, the second data container of the storage system comprises the step of:
    - storing the data in the local buffer in the second data container of a fifth processing node when a preset storage condition is met, wherein the fifth processing node has the least load in the storage system.
  - 19. The apparatus according to claim 13, wherein selecting, according to the preset storage policy, the second data container of the storage system comprises the step of:
    - storing the data in the local buffer in the second data container of the first processing node when a preset storage condition is met.
  - 20. The apparatus according to claim 12, wherein deriving the first SID for the super-chunk of the segment comprises the steps of:
    - obtaining multiple fingerprints corresponding to multiple chunks of the super-chunk; and
      
      selecting, from the multiple fingerprints corresponding to the multiple chunks of the super-chunk, the smallest fingerprint as the first SID for the super-chunk of the segment.
  - 21. The apparatus according to claim 12, wherein identifying the new data chunks based on the comparison between the fingerprints of data chunks in the super-chunk and the obtained fingerprints comprises the steps of:
    - loading the obtained fingerprints into a local cache of the first processing node; and
      
      comparing the fingerprints of data chunks in the super-chunk with the fingerprints in the local cache to identify new data chunks whose fingerprints are not found in the local cache.
  - 22. The apparatus according to claim 12, wherein identifying the new data chunks based on the comparison between the fingerprints of data chunks in the super-chunk and the obtained fingerprints comprises the steps of:
    - sending a query instruction to a processing node in which the first data container is stored, the query instruction carrying the fingerprints of data chunks in the super-chunk and instructing that processing node to identify new data chunks whose fingerprints are not found in the obtained fingerprints; and
      
      receiving a query result returned by the processing node in which the first data container is stored.
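
The virtual-node indirection and whole-node migration recited in claims 3 and 5 (and their apparatus counterparts, claims 14 and 16) can be sketched as follows. Everything here is an illustrative assumption: the class `VirtualNodeMap`, the round-robin initial placement, the integer SID, and the rule "SID mod number of virtual nodes" are hypothetical and are not stated in the claims.

```python
class VirtualNodeMap:
    """Correspondence between virtual nodes and processing nodes (hypothetical layout)."""

    def __init__(self, processing_nodes: list[str], vnodes_per_node: int) -> None:
        self.num_vnodes = len(processing_nodes) * vnodes_per_node
        # Assumption: virtual nodes are dealt out round-robin at start-up.
        self.owner = {v: processing_nodes[v % len(processing_nodes)]
                      for v in range(self.num_vnodes)}

    def vnode_for(self, sid: int) -> int:
        # Assumption: SID mod (number of virtual nodes); this count stays fixed.
        return sid % self.num_vnodes

    def node_for(self, sid: int) -> str:
        """Resolve SID -> virtual node -> processing node."""
        return self.owner[self.vnode_for(sid)]

    def migrate(self, vnode: int, target: str) -> None:
        """Move one virtual node as a whole; only its mapping entry changes."""
        self.owner[vnode] = target


vmap = VirtualNodeMap(["node-A", "node-B"], vnodes_per_node=4)
sid = 13
before = vmap.node_for(sid)                  # vnode 13 % 8 == 5, owned by node-B
vmap.migrate(vmap.vnode_for(sid), "node-C")
after = vmap.node_for(sid)                   # same virtual node, new owner
```

The SID-to-virtual-node step is unaffected by the migration, so existing SID entries stay valid. In a real cluster every processing node would hold a copy of this map and would have to be notified of the update, which this single-process sketch leaves out.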

23. A method of data de-duplication performed by a first processing node in a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
- receiving a data stream to be stored after de-duplication;
  
  dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
  
  deriving a SID for each of the super-chunks and identifying a processing node of the storage system that corresponds to said each super-chunk based on the SID of said each super-chunk;
  
  sending the SIDs of the super-chunks to respective corresponding processing nodes;
  
  receiving responses from at least a subgroup of the corresponding processing nodes, wherein each response identifies container IDs that correspond to the SIDs sent to the corresponding processing node;
  
  selecting, from the container IDs in the responses from the subgroup of the processing nodes, a subset of container IDs based on the number of times each container ID is identified in the responses;
  
  identifying, based on querying containers corresponding to the subset of container IDs and using fingerprint comparisons, new data chunks in the super-chunks of the segment; and
  
  storing the new data chunks in a local buffer of the first processing node.
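
The container selection step of claim 23 can be read as a vote: container IDs named most often across the responses are kept. The sketch below illustrates that reading only. The claim does not say how large the subset is, so the `top_k` cutoff, the function name `select_containers`, and the sample node and container names are all assumptions.

```python
from collections import Counter


def select_containers(responses: dict[str, list[str]], top_k: int) -> list[str]:
    """Keep the container IDs named most often across all node responses."""
    votes = Counter(cid for cids in responses.values() for cid in cids)
    return [cid for cid, _ in votes.most_common(top_k)]


# Hypothetical responses: node name -> container IDs matching the SIDs sent to it.
responses = {
    "node-0": ["c7", "c2"],
    "node-1": ["c7"],
    "node-2": ["c7", "c2", "c9"],
}
chosen = select_containers(responses, top_k=2)
```

Here `c7` is named three times and `c2` twice, so those two containers would be queried for fingerprints, while `c9`, named once, is skipped.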

Current Assignee
Huawei Technologies Co., Ltd. (Huawei Investment & Holding Co., Ltd.)
Original Assignee
Huawei Technologies Co., Ltd. (Huawei Investment & Holding Co., Ltd.)
Inventors
LIU, Qiang, SUN, Quancheng, LIU, Xiaobo, YOU, Jun, YANG, Huadi, ZHOU, Dan, HUANG, Yan

Granted Patent

US 8,892,529 B2
US Class Current

707/692
CPC Class Codes

G06F 16/00   Information retrieval; Data...

G06F 16/1748   De-duplication implemented ...

G06F 3/0608   Saving storage space on sto...

G06F 3/0641   De-duplication techniques

G06F 3/067   Distributed or networked st...
