System and method for data driven de-duplication

  • US 8,495,028 B2
  • Filed: 09/08/2010
  • Issued: 07/23/2013
  • Est. Priority Date: 01/25/2010
  • Status: Active Grant
First Claim

1. A computer implemented method of locating redundancy within data, the method comprising:

    recording target locations within target data where a summary that identifies a particular pattern within the target data equals a predetermined value;

    recording reference locations within reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;

    determining a reference set of summaries of the reference data, each member of the reference set of summaries including one or more summaries indicative of one or more patterns of reference data located at one or more recorded reference locations within the reference data;

    determining a target set of summaries of the target data, each member of the target set of summaries including one or more summaries indicative of one or more patterns of target data located at one or more recorded target locations within the target data;

    prioritizing at least one member from the reference set of summaries relative to at least one other member of the reference set of summaries by comparing at least one summary included in the at least one member from the reference set of summaries to at least one summary included in the at least one other member of the reference set of summaries;

    prioritizing at least one member from the target set of summaries relative to at least one other member of the target set of summaries by comparing at least one summary included in the at least one member from the target set of summaries to at least one summary included in the at least one other member of the target set of summaries;

    identifying a subset of the reference data that is likely to match a subset of the target data by comparing the at least one summary included in the at least one member from the reference set of summaries to the at least one summary included in the at least one member from the target set of summaries; and

    de-duplicating the subset of the target data with reference to the subset of the reference data, wherein recording the target locations includes recording target locations within the target data where a subset of a rolling hash value calculated for a region of the target data equals the predetermined value.
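
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the window size, polynomial base, bit mask, predetermined value, summary span, the choice of truncated SHA-1 as the "summary", and the "keep the smallest summaries" prioritization rule are all assumptions made for illustration, and every function name is hypothetical. The final de-duplication step (verifying candidate regions byte for byte and replacing target data with references) is omitted.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the patent.
WINDOW = 16      # bytes covered by the rolling hash
BASE = 257       # polynomial base of the rolling hash
MOD = 1 << 32    # keep the hash within 32 bits
MASK = 0x1F      # the "subset" of the hash that is tested: its low 5 bits
MAGIC = 0        # the predetermined value


def record_locations(data: bytes) -> list[int]:
    """Return start offsets of windows whose masked rolling hash equals MAGIC."""
    if len(data) < WINDOW:
        return []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    h = 0
    for b in data[:WINDOW]:
        h = (h * BASE + b) % MOD
    locations = []
    for start in range(len(data) - WINDOW + 1):
        if (h & MASK) == MAGIC:
            locations.append(start)
        if start + WINDOW < len(data):
            # Slide the window: drop data[start], append data[start + WINDOW].
            h = ((h - data[start] * top) * BASE + data[start + WINDOW]) % MOD
    return locations


def summarize(data: bytes, locations: list[int], span: int = 64) -> dict[int, bytes]:
    """Map each recorded location to a short digest of the bytes that follow it."""
    return {loc: hashlib.sha1(data[loc:loc + span]).digest()[:8] for loc in locations}


def prioritize(summaries: dict[int, bytes], keep: int = 4) -> list[tuple[bytes, int]]:
    """Rank members by comparing their summaries; keep the `keep` smallest."""
    return sorted((s, loc) for loc, s in summaries.items())[:keep]


def likely_matches(ref: list[tuple[bytes, int]],
                   tgt: list[tuple[bytes, int]]) -> list[tuple[int, int]]:
    """Return (reference_offset, target_offset) pairs whose summaries agree."""
    ref_by_summary = {s: loc for s, loc in ref}
    return [(ref_by_summary[s], loc) for s, loc in tgt if s in ref_by_summary]
```

Because the recorded locations depend only on the bytes inside each window, inserting data ahead of a region shifts its locations without changing which windows are selected. Calling `likely_matches(prioritize(summarize(ref, record_locations(ref))), prioritize(summarize(tgt, record_locations(tgt))))` then yields candidate offset pairs that a de-duplicator would still need to verify before replacing target data with references.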

  • 6 Assignments