Out-of-core similarity matching

  • US 8,914,338 B1
  • Filed: 12/22/2011
  • Issued: 12/16/2014
  • Est. Priority Date: 12/22/2011
  • Status: Active Grant
First Claim

1. A computer-implemented method for data deduplication, the method comprising:

    in response to a request for compressing data in a data storage system, partitioning the data into a plurality of data chunks, including a target data chunk and a base data chunk;

    generating representative data for the target data chunk and the base data chunk by applying a predetermined algorithm to the target data chunk and the base data chunk;

    sorting the representative data for the target data chunk and the base data chunk based on similarity of bit patterns of the target data chunk and the base data chunk to form a sorted representative data list, wherein sorting the representative data further includes dividing representative data of the plurality of data chunks into a plurality of bin files where directly adjacent representative data of the plurality of data chunks are placed in a same bin file, each of the plurality of bin files sized to fit within main memory of a data storage system, reading into main memory each of the plurality of bin files, and comparing and sorting each of the plurality of bin files according to a first feature defined in representative data of the plurality of data chunks;

    generating a delta data chunk as the difference between the target data chunk and the base data chunk where the representative data of the target chunk is directly adjacent to the representative data of the base data chunk in the sorted representative data list; and

    storing the delta data chunk and the base data chunk in the data storage system, wherein the delta data chunk and the base data chunk represent the target data chunk.
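
The steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation. Everything concrete in it is an assumption made for illustration: the fixed 64-byte chunk size, the max-hash "feature" used as representative data, the range-partitioned bin files, the rule that the first chunk in a run of equal features serves as the base, and the copy/add delta format built on Python's `difflib`. All function names are hypothetical.

```python
import hashlib
import os
import struct
import tempfile
from difflib import SequenceMatcher

CHUNK_SIZE = 64  # assumed fixed chunk size; the claim does not specify one


def partition(data, size=CHUNK_SIZE):
    """Split the data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def sketch(chunk, window=8):
    """Assumed 'representative data': the largest 32-bit hash over all sliding windows."""
    last = max(1, len(chunk) - window + 1)
    return max(
        int.from_bytes(hashlib.blake2b(chunk[i:i + window], digest_size=4).digest(), "big")
        for i in range(last)
    )


def sort_out_of_core(records, n_bins, workdir):
    """Sort (feature, chunk_index) records via bin files that each fit in memory."""
    paths = [os.path.join(workdir, f"bin_{b}.dat") for b in range(n_bins)]
    files = [open(p, "wb") for p in paths]
    for feature, idx in records:
        # Range partitioning: neighbouring feature values land in the same bin.
        files[(feature * n_bins) >> 32].write(struct.pack("<II", feature, idx))
    for f in files:
        f.close()
    ordered = []
    for path in paths:  # bins are visited in ascending feature range
        with open(path, "rb") as f:
            raw = f.read()  # one bin in memory at a time
        ordered.extend(sorted(struct.iter_unpack("<II", raw)))
    return ordered


def make_delta(base, target):
    """Encode target as copy-from-base and add-literal operations."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # 'replace' or 'insert'; 'delete' contributes nothing
            ops.append(("add", target[j1:j2]))
    return ops


def apply_delta(base, ops):
    """Rebuild the target chunk from the base chunk and its delta."""
    out = bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


def deduplicate(data, n_bins=4):
    """Return {chunk_index: ('base', bytes) or ('delta', base_index, ops)}."""
    chunks = partition(data)
    with tempfile.TemporaryDirectory() as workdir:
        records = ((sketch(c), i) for i, c in enumerate(chunks))
        ordered = sort_out_of_core(records, n_bins, workdir)
    store, base = {}, None
    for feature, idx in ordered:
        if base is not None and base[0] == feature:
            # Adjacent in the sorted list with the same feature: store a delta.
            store[idx] = ("delta", base[1], make_delta(chunks[base[1]], chunks[idx]))
        else:
            store[idx] = ("base", chunks[idx])
            base = (feature, idx)
    return store


def restore(store):
    """Reassemble the original data from base chunks and delta chunks."""
    parts = []
    for idx in sorted(store):
        entry = store[idx]
        if entry[0] == "base":
            parts.append(entry[1])
        else:
            parts.append(apply_delta(store[entry[1]][1], entry[2]))
    return b"".join(parts)
```

Calling `restore(deduplicate(data))` on any byte string should return the original bytes. A production system would presumably use several features per chunk and a compact binary delta format rather than this single-feature sketch.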
