Data processing method and apparatus in cluster system

US 8,892,529 B2
Filed: 12/24/2013
Issued: 11/18/2014
Est. Priority Date: 12/12/2012
Status: Active Grant

Abstract

In embodiments of the present invention, when a duplicate data query is performed on a received data stream, a first physical node in the cluster system that corresponds to each first sketch value representing the data stream is identified, and the first sketch value is then sent to the identified physical node for the duplicate data query. The procedure of the duplicate data query does not change as the number of nodes in the cluster system increases; therefore, the calculation amount of each node does not increase with the number of nodes in the cluster system.

10 Citations

13 Claims

1. A method of data de-duplication performed by a first processing node in storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
- receiving a data stream to be stored after de-duplication;
  
  dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
  
  deriving a first super-chunk identification (SID) for a super-chunk of the segment;
  
  identifying a second processing node of the storage system that corresponds to the first SID;
  
  querying the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
  
  obtaining fingerprints of data chunks stored in the first data container that corresponds to the first SID;
  
  identifying, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose signatures are not found in the obtained fingerprints;
  
  storing the new data chunks in a local buffer of the first processing node;
  
  selecting, according to a preset storage policy, a second data container of the storage system to write data in the local buffer;
  
  deriving a second SID for data of the local buffer;
  
  identifying, by the same way for identifying the second processing node, a fourth processing node of the storage system that corresponds to the second SID for data of the local buffer; and
  
  storing correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method according to claim 1, wherein a virtual node is logically obtained through dividing each of the plurality of processing nodes in the storage system, and correspondence between a virtual node and a processing node in the storage system is comprised in each processing node;
    - the identifying a second processing node of the storage system that corresponds to the first SID comprises:
      
      identifying a first virtual node of the storage system that corresponds to the first SID;
      
      obtaining, by querying correspondence between a virtual node and a processing node, the second processing node corresponding to the first virtual node.
  - 3. The method according to claim 2, wherein the identifying, by the same way for identifying the second processing node, a fourth processing node of the storage system that corresponds to the second SID for data of the local buffer comprises:
    - identifying, by the same way for identifying the first virtual node, a second virtual node;
      
      obtaining, by querying the correspondence between a virtual node and a processing node, a fourth processing node;
      
      the storing the correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node comprises:
      
      storing the correspondence between the second SID for data of the local buffer and the second data container in the second virtual node corresponding to the fourth processing node.
  - 4. The method according to claim 2, further comprising:
    - when a data migration condition is met to the first processing node, integrally migrating a virtual node in the first processing node whose data needs to be migrated to a target processing node; and
      
      updating correspondence between the migrated virtual node and the target processing node, and notifying another processing node in the storage system of updating the correspondence between the migrated virtual node and the target processing node.
  - 5. The method according to claim 1, wherein selecting, according to a preset storage policy, a second data container of the storage system to write data in the local buffer comprises:
    - storing data in the local buffer in the second data container of a fifth processing node when a preset storage condition is met, wherein the fifth processing node has least load in the storage system.
  - 6. The method according to claim 1, wherein selecting, according to a preset storage policy, a second data container of the storage system to write data in the local buffer comprises:
    - storing data in the local buffer in the second data container of the first processing node when a preset storage condition is met.
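
The flow of claims 1 and 2 can be sketched roughly as follows. This is a minimal in-memory illustration, not the patented implementation: the use of SHA-1 fingerprints, the choice of the smallest fingerprint as the SID, the fixed count of 8 virtual nodes, the `Cluster` class, and the caller-supplied target container (standing in for the preset storage policy) are all assumptions made for the example.

```python
import hashlib

NUM_VIRTUAL_NODES = 8  # assumed fixed count; independent of cluster size


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-1 hex digest serves as the chunk fingerprint.
    return hashlib.sha1(chunk).hexdigest()


def super_chunk_id(chunks) -> str:
    # Assumption: the SID is the smallest fingerprint in the super-chunk.
    return min(fingerprint(c) for c in chunks)


def virtual_node_for(sid: str) -> int:
    # The SID maps to a virtual node, never directly to a physical node.
    return int(sid, 16) % NUM_VIRTUAL_NODES


class Cluster:
    """Hypothetical stand-in for the storage system, held in memory."""

    def __init__(self, node_names):
        # Correspondence between virtual nodes and processing nodes.
        self.vnode_to_node = {
            v: node_names[v % len(node_names)] for v in range(NUM_VIRTUAL_NODES)
        }
        # Per node: SID -> container ID index.
        self.sid_index = {name: {} for name in node_names}
        # Container ID -> {fingerprint: chunk}.
        self.containers = {}

    def node_for(self, sid: str) -> str:
        return self.vnode_to_node[virtual_node_for(sid)]

    def store_super_chunk(self, chunks, target_container: str) -> int:
        # Derive the first SID and query the node that indexes it.
        sid = super_chunk_id(chunks)
        container_id = self.sid_index[self.node_for(sid)].get(sid)
        # Obtain the fingerprints already stored in that container.
        known = set(self.containers.get(container_id, {}))
        # Keep only chunks whose fingerprints are not yet known.
        buffer = {}
        for chunk in chunks:
            fp = fingerprint(chunk)
            if fp not in known:
                buffer[fp] = chunk
        if buffer:
            # Write the buffer, then index it under a second SID.
            self.containers.setdefault(target_container, {}).update(buffer)
            second_sid = min(buffer)
            self.sid_index[self.node_for(second_sid)][second_sid] = target_container
        return len(buffer)
```

Storing the same super-chunk twice returns 3 new chunks on the first call and 0 on the second, because the second SID recorded after the first write matches the first SID derived on the second pass.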

7. A data processing apparatus for performing data de-duplication applied to a storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
- a memory configured to store instructions; and
  
  a processor coupled to the memory and configured to execute the instructions to:
  
  receive a data stream to be stored after de-duplication;
  
  divide a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
  
  derive a first super-chunk identification (SID) for a super-chunk of the segment;
  
  identify a second processing node of the storage system that corresponds to the first SID;
  
  query the second processing node for a first data container that corresponds to the first SID, wherein the first data container is maintained by a third processing node of the storage system;
  
  obtain fingerprints of data chunks stored in the first data container that corresponds to the first SID;
  
  identify, based on a comparison between fingerprints of data chunks in the super-chunk and the obtained fingerprints, new data chunks whose signatures are not found in the obtained fingerprints;
  
  store the new data chunks in a local buffer of the first processing node;
  
  select, according to a preset storage policy, a second data container of the storage system to write data in the local buffer;
  
  derive a second SID for data of the local buffer;
  
  identify, by the same way for identifying the second processing node, a fourth processing node of the storage system that corresponds to the second SID for data of the local buffer; and
  
  store correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The apparatus according to claim 7, wherein a virtual node is logically obtained through dividing each of the plurality of processing nodes in the storage system, and correspondence between a virtual node and a processing node in the storage system is comprised in each processing node;
    - wherein the identify a second processing node of the storage system that corresponds to the first SID comprises the steps of:
      
      identifying a first virtual node of the storage system that corresponds to the first SID;
      
      obtaining, by querying correspondence between a virtual node and a processing node, the second processing node corresponding to the first virtual node.
  - 9. The apparatus according to claim 8, wherein the identify, by the same way for identifying the second processing node, a fourth processing node of the storage system that corresponds to the second SID for data of the local buffer comprises the steps of:
    - identifying, by the same way for identifying the first virtual node, a second virtual node;
      
      obtaining, by querying the correspondence between a virtual node and a processing node, a fourth processing node;
      
      the storing the correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node comprises:
      
      storing the correspondence between the second SID for data of the local buffer and the second data container in the second virtual node corresponding to the fourth processing node.
  - 10. The apparatus according to claim 8, wherein the processor is further configured to execute the instructions to:
    - when a data migration condition is met to the first processing node, integrally migrate a virtual node in the first processing node whose data needs to be migrated to a target processing node; and
      
      update correspondence between the migrated virtual node and the target processing node, and notify another processing node in the storage system of updating the correspondence between the migrated virtual node and the target processing node.
  - 11. The apparatus according to claim 7, wherein select, according to a preset storage policy, a second data container of the storage system to write data in the local buffer comprises the steps of:
    - storing data in the local buffer in the second data container of a fifth processing node when a preset storage condition is met, wherein the fifth processing node has least load in the storage system.
  - 12. The apparatus according to claim 7, wherein select, according to a preset storage policy, a second data container of the storage system to write data in the local buffer comprises the steps of:
    - storing data in the local buffer in the second data container of the first processing node when a preset storage condition is met.
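
The migration step of claims 4 and 10 and the least-load selection of claims 5 and 11 might look like the sketch below. The data layout is hypothetical: `stores` maps each processing node to the SID indexes of the virtual nodes it hosts, `maps` holds every node's local copy of the virtual-node correspondence table, and load is given as a plain number per node because the claims do not define how load is measured.

```python
def least_loaded(load_by_node):
    # Pick the node with the smallest load; ties break by node name.
    return min(sorted(load_by_node), key=lambda name: load_by_node[name])


def migrate_vnode(vnode, source, target, stores, maps):
    # Move the virtual node's whole SID index in one piece.
    stores[target][vnode] = stores[source].pop(vnode)
    # Update the correspondence table copy held by every node,
    # which models notifying the other nodes of the change.
    for vnode_to_node in maps:
        vnode_to_node[vnode] = target
```

Because an SID is resolved to a virtual node first, the SID-to-virtual-node step is untouched by a migration; only the table from virtual node to processing node changes.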

13. A method of data de-duplication performed by a first processing node in storage system having a plurality of processing nodes each maintaining multiple data containers for storing de-duplicated data chunks, comprising:
- receiving a data stream to be stored after de-duplication;
  
  dividing a segment of the data stream into a plurality of super-chunks, each super-chunk including multiple data chunks;
  
  deriving a first SID for each of the super-chunks and identifying a first processing node of the storage system that corresponds to said each super-chunk based on the SID of said each super-chunk;
  
  sending the first SIDs of the super-chunks to respective corresponding processing nodes;
  
  receiving responses from at least a subgroup of the corresponding processing nodes, wherein each response identifies container IDs that correspond to first SIDs sent to the corresponding processing node;
  
  selecting, from the container IDs in the responses from the subgroup of the processing nodes, a subset of container IDs based on the number of times the container IDs are identified in the responses;
  
  identifying, based on querying containers corresponding to the subset of container IDs and using fingerprint comparisons, new data chunks in the super-chunks of the segment;
  
  storing the new data chunks in a local buffer of the first processing node;
  
  selecting, according to a preset storage policy, a second data container of the storage system to write data in the local buffer;
  
  deriving a second SID for data of the local buffer;
  
  identifying, by the same way for identifying the corresponding processing node, a second processing node of the storage system that corresponds to the second SID for data of the local buffer; and
  
  storing correspondence between the second SID for data of the local buffer and the second data container in the fourth processing node.
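
The container selection in claim 13 can be read as a vote count over the responses. The sketch below assumes that reading; the function names, the `top_k` cutoff, and the representation of a container as a set of fingerprints are hypothetical choices for illustration.

```python
from collections import Counter


def select_containers(responses, top_k):
    # Each response is the list of container IDs one node returned.
    votes = Counter()
    for container_ids in responses:
        votes.update(container_ids)
    # Keep the container IDs that were identified most often.
    return [cid for cid, _ in votes.most_common(top_k)]


def new_chunks(chunk_fps, containers, selected):
    # Gather fingerprints only from the selected containers.
    known = set()
    for cid in selected:
        known |= containers[cid]
    # Chunks absent from those containers are treated as new.
    return [fp for fp in chunk_fps if fp not in known]
```

Note that a chunk stored in a container that was not selected is still treated as new, so under this reading the de-duplication is approximate rather than exact.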


Current Assignee: Huawei Technologies Co., Ltd. (Huawei Investment & Holding Co., Ltd.)
Original Assignee: Huawei Technologies Co., Ltd. (Huawei Investment & Holding Co., Ltd.)
Inventors: Huang, Yan; Liu, Qiang; Sun, Quancheng; Liu, Xiaobo; You, Jun; Yang, Huadi; Zhou, Dan
Primary Examiner(s): Vy, Hung T
Application Number: US14/140,403
Publication Number: US 20140201169A1
Time in Patent Office: 329 Days
Field of Search: 707/692, 707/696, 707/698, 707/813, 707/610, 711/154
US Class Current: 707/692
CPC Class Codes:
G06F 16/00   Information retrieval; Data...
G06F 16/1748   De-duplication implemented ...
G06F 3/0608   Saving storage space on sto...
G06F 3/0641   De-duplication techniques
G06F 3/067   Distributed or networked st...
