Method for de-duplicating data and apparatus therefor

US 9,851,917 B2
Filed: 03/07/2014
Issued: 12/26/2017
Est. Priority Date: 03/07/2013
Status: Active Grant

Abstract

Disclosed are a method for data de-duplication and an apparatus for the same. The method may comprise obtaining access property of data based on input request or output request for the data, determining de-duplication unit of the data based on the access property, and performing de-duplication on the data based on the de-duplication unit. Thus, data de-duplication rate may be determined adaptively based on input/output characteristics of data. Also, data de-duplication may be performed based on the determined data de-duplication rate so as to provide low input/output latency.
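
The first step of the abstract, obtaining the access property from input and output requests, can be illustrated with a small tracker. This is only a sketch under stated assumptions: the names `AccessProperty` and `record_request` are hypothetical, an input request is treated as a write that also updates the modification time, and a request is counted as sequential when it starts at the offset where the previous request ended (the first request is counted as random). The patent text does not prescribe any of these details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessProperty:
    """Per-data access statistics (field names are illustrative, not from the patent)."""
    access_time: Optional[float] = None        # time of the most recent access
    prev_access_time: Optional[float] = None   # time of the access before that
    modification_time: Optional[float] = None  # time of the most recent input (write)
    sequential_accesses: int = 0
    random_accesses: int = 0
    next_offset: Optional[int] = None          # offset a sequential request would start at


def record_request(prop: AccessProperty, kind: str, offset: int,
                   length: int, now: float) -> AccessProperty:
    """Update `prop` for one request; `kind` is "input" (write) or "output" (read)."""
    if kind not in ("input", "output"):
        raise ValueError("kind must be 'input' or 'output'")

    # Shift the access times so both the current and previous access are known.
    prop.prev_access_time = prop.access_time
    prop.access_time = now
    if kind == "input":
        prop.modification_time = now

    # Assumption: "continuity" means the request begins where the last one ended.
    if prop.next_offset is not None and offset == prop.next_offset:
        prop.sequential_accesses += 1
    else:
        prop.random_accesses += 1
    prop.next_offset = offset + length
    return prop
```

A caller that needs the previous modification time for a later comparison would read `prop.modification_time` before calling `record_request` for a new input request, since the call overwrites it.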

14 Claims

1. A method for data de-duplication, performed in an apparatus for data de-duplication, comprising:
- obtaining access property including access time on data, modification on the data, a number of sequential accesses on the data, and a number of random accesses on the data based on input request or output request for the data;
  
  calculating a first difference between a current access time on the data and a previous modification time on the data;
  
  determining a fourth de-duplication unit having a lowest de-duplication probability as the de-duplication unit of the data when the first difference is equal to or less than a predefined first threshold;
  
  calculating a second difference between the current access time on the data and the previous access time on the data when the first difference is in excess of the first threshold;
  
  determining a first de-duplication unit having a highest de-duplication probability as the de-duplication unit of the data when the second difference is in excess of a predefined second threshold;
  
  determining a second de-duplication unit having a lower de-duplication probability than the first de-duplication unit as the de-duplication unit of the data when the second difference is equal to or less than the second threshold and the number of random accesses on the data is equal to and more than the number of sequential accesses on the data;
  
  determining a third de-duplication unit having a lower probability of being de-duplicated than the second de-duplication unit as the de-duplication unit of the data when the second difference is equal to or less than the second threshold and the number of random accesses on the data is less than the number of sequential accesses on the data;
  
  generating at least one data block of the data based on the determined de-duplication unit according to the access property, wherein the determined de-duplication unit is one of the first de-duplication unit, second de-duplication unit, third de-duplication unit, and fourth de-duplication unit;
  
  generating unique identifier for the at least one data block; and
  
  performing de-duplication on the data based on whether the unique identifier is in an index table or not.
- Dependent Claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, wherein the obtaining access property comprises:
    - obtaining the access time and the modification time based on time information of the input request when the input request is received; and
      
      obtaining the number of sequential accesses on the data or the number of random accesses on the data based on continuity of the input request.
  - 3. The method of claim 1, wherein the obtaining access property comprises:
    - obtaining the access time based on time information of the output request when the output request is received; and
      
      obtaining the number of sequential accesses on the data or the number of random accesses on the data based on continuity of the output request.
  - 4. The method of claim 1, wherein de-duplication is not performed for the fourth de-duplication unit.
  - 5. The method of claim 1, wherein the de-duplication unit is classified into at least one de-duplication units each of which has different de-duplication probability.
  - 6. The method of claim 1, wherein the performing de-duplication on the data comprises:
    - determining whether the unique identifier is in an index table or not;
      
      removing a data block corresponding to the unique identifier when the unique identifier is in the index table; and
      
      storing the unique identifier and the data block corresponding to the unique identifier when the unique identifier is not in the index table.
  - 7. The method of claim 6, wherein the unique identifier is generated using a hash algorithm.
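
The unit-selection steps of claim 1 amount to a short decision procedure, sketched below. The names `DedupUnit` and `select_unit` and the numeric time representation are hypothetical, and the phrase "equal to and more than" is read here as "greater than or equal to"; that reading is an assumption, not something the claim text confirms.

```python
from enum import IntEnum


class DedupUnit(IntEnum):
    """The four de-duplication units of claim 1 (enum names are illustrative)."""
    FIRST = 1   # highest de-duplication probability
    SECOND = 2  # lower probability than FIRST
    THIRD = 3   # lower probability than SECOND
    FOURTH = 4  # lowest probability; claim 4: no de-duplication performed


def select_unit(now: float, prev_modification_time: float,
                prev_access_time: float, sequential_accesses: int,
                random_accesses: int, first_threshold: float,
                second_threshold: float) -> DedupUnit:
    """Pick a de-duplication unit following the order of tests in claim 1."""
    # First difference: current access time minus previous modification time.
    first_difference = now - prev_modification_time
    if first_difference <= first_threshold:
        return DedupUnit.FOURTH  # recently modified data

    # Second difference: current access time minus previous access time.
    second_difference = now - prev_access_time
    if second_difference > second_threshold:
        return DedupUnit.FIRST   # data not accessed for a long time

    # Recently accessed data: compare random and sequential access counts.
    # Assumption: "equal to and more than" is treated as >=.
    if random_accesses >= sequential_accesses:
        return DedupUnit.SECOND
    return DedupUnit.THIRD
```

For example, with a first threshold of 10 and a second threshold of 20 time units, data modified 5 units ago maps to the fourth unit regardless of its access counts, because the first test short-circuits the rest.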

8. An apparatus for data de-duplication, comprising:
- a processing part configured to:
  
  obtain access property including access time on data, modification on the data, a number of sequential accesses on the data, and a number of random accesses on the data based on input request or output request for the data;
  
  calculate a first difference between a current access time on the data and a previous modification time on the data;
  
  determine a fourth de-duplication unit having a lowest de-duplication probability as the de-duplication unit of the data when the first difference is equal to or less than a predefined first threshold;
  
  calculate a second difference between the current access time on the data and the previous access time on the data when the first difference is in excess of the first threshold;
  
  determine a first de-duplication unit having a highest de-duplication probability as the de-duplication unit of the data when the second difference is in excess of a predefined second threshold;
  
  determine a second de-duplication unit having a lower de-duplication probability than the first de-duplication unit as the de-duplication unit of the data when the second difference is equal to or less than the second threshold and the number of random accesses on the data is equal to and more than the number of sequential accesses on the data;
  
  determine a third de-duplication unit having a lower de-duplication probability than the second de-duplication unit as the de-duplication unit of the data when the second difference is equal to or less than the second threshold and the number of random accesses on the data is less than the number of sequential accesses on the data;
  
  generate at least one data block of the data based on the determined de-duplication unit according to the access property wherein the determined de-duplication unit is one of the first de-duplication unit, second de-duplication unit, third de-duplication unit, and fourth de-duplication unit;
  
  generate unique identifier for the at least one data block; and
  
  perform de-duplication on the data based on the de-duplication unit; and
  
- a storage part configured to store information which is processed or has been processed in the processing part.
- Dependent Claims (9, 10, 11, 12, 13, 14)
  - 9. The apparatus of the claim 8, wherein the processing part is further configured to obtain the access time and the modification time based on time information of the input request when the input request is received, and obtain the number of sequential accesses on the data or the number of random accesses on the data based on continuity of the input request.
  - 10. The apparatus of the claim 8, wherein the processing part is further configured to obtain the access time based on time information of the output request when the output request is received, and obtain the number of sequential accesses on the data or the number of random accesses on the data based on continuity of the output request.
  - 11. The apparatus of the claim 8, wherein de-duplication is not performed for the fourth de-duplication unit.
  - 12. The apparatus of the claim 8, wherein the de-duplication unit is classified into at least one de-duplication units each of which has different de-duplication probability.
  - 13. The apparatus of the claim 8, wherein the processing part is further configured to determine whether the unique identifier is in an index table or not, remove a data block corresponding to the unique identifier when the unique identifier is in the index table, and store the unique identifier and the data block corresponding to the unique identifier when the unique identifier is not in the index table.
  - 14. The apparatus of the claim 13, wherein the unique identifier is generated using a hash algorithm.
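
The index-table lookup described in claims 6, 7, 13 and 14 can be sketched as follows. Several details are assumptions rather than anything the claims state: the determined de-duplication unit is represented as a fixed `block_size` in bytes, SHA-256 stands in for the unspecified hash algorithm, the index table is an in-memory dictionary, and the returned list of identifiers (a "recipe" for rebuilding the data) is an illustrative addition. The function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut `data` into consecutive blocks of at most `block_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def deduplicate(data: bytes, block_size: int,
                index_table: Dict[str, bytes]) -> List[str]:
    """Store only unseen blocks in `index_table`; return the identifiers in order."""
    recipe = []
    for block in split_blocks(data, block_size):
        # Unique identifier generated with a hash algorithm (SHA-256 assumed).
        uid = hashlib.sha256(block).hexdigest()
        if uid not in index_table:
            # Identifier not in the index table: store identifier and block.
            index_table[uid] = block
        # Identifier already present: the duplicate block is dropped and
        # only its identifier is kept in the recipe.
        recipe.append(uid)
    return recipe
```

Usage: calling `deduplicate(b"aaaabbbbaaaa", 4, index)` with an empty `index` stores two blocks and returns three identifiers, the first and third being equal; joining `index[uid]` over the returned identifiers reproduces the original bytes.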

Current Assignee: Postech Academy-Industry Foundation
Original Assignee: Postech Academy-Industry Foundation
Inventors: Park, Chan Ik; Park, Se Jin
Primary Examiner(s): Elmore, Reba I
Application Number: US14/201,606
Publication Number: US 20140258655A1
Time in Patent Office: 1,390 Days
Field of Search: 711/162, 711/166, 711/170, 707/634, 707/637, 707/664, 707/692, 707/749, 707/785, 710/19, 710/36, 710/56, 710/60
US Class Current
CPC Class Codes

G06F 12/123   with age lists, e.g. queue,...

G06F 12/127   using additional replacemen...

G06F 3/061   Improving I/O performance

G06F 3/0628   making use of a particular ...

G06F 3/0632   by initialisation or re-ini...

G06F 3/0641   De-duplication techniques

G06F 3/0668   adopting a particular infra...

G06F 3/0673   Single storage device
