×

Systems and methods of locating redundant data using patterns of matching fingerprints

  • US 9,766,832 B2
  • Filed: 01/22/2014
  • Issued: 09/19/2017
  • Est. Priority Date: 03/15/2013
  • Status: Active Grant
First Claim
Patent Images

1. A method of computing match potential between first data and second data, the method comprising:

  • identifying, by one or more processors, a first sequence of fingerprints characterizing a first plurality of sections of the first data, the first sequence being ordered according to an order of the first plurality of sections within the first data;

    identifying a second sequence of fingerprints comprising fingerprints that match fingerprints within the first sequence, the second sequence of fingerprints characterizing a second plurality of sections of the second data, the second sequence being ordered according to an order of the second plurality of sections within the second data, wherein the first sequence includes a plurality of first subsequences, each first subsequence comprising one or more of the fingerprints of the first sequence, and the second sequence includes a plurality of second subsequences, each second subsequence comprising one or more of the fingerprints of the second sequence;

    quantifying a similarity between the first sequence and the second sequence by quantifying a similarity between at least one first subsequence of the plurality of first subsequences and at least one second subsequence of the plurality of second subsequences by computing at least one score for at least one first section of the first plurality of sections, the at least one first section being characterized by one or more fingerprints within the at least one first subsequence;

    adjusting the match potential between the first data and the second data at least partially based on the quantified similarity; and

    performing de-duplication of at least a portion of the first plurality of sections at least partially based on the match potential, wherein;

    the at least one first section is associated with a matching range that identifies at least one first ordinal position within the second data, the at least one first ordinal position being one or more ordinal positions of one or more of the second plurality of sections, the one or more of the second plurality of sections being characterized by one or more fingerprints that match one or more fingerprints characterizing the at least one first section; and

    computing the at least one score for the at least one first section includes;

    selecting a plurality of combinations of sections from a subset of the first data, the subset being characterized by fingerprints included in the at least one first subsequence, each of the plurality of combinations including the at least one first section and at least one second section of the subset, the at least one second section being associated with one or more matching sections within the second plurality of sections, the one or more matching sections being characterized by at least one fingerprint that matches at least one fingerprint characterizing the second section;

    computing one or more adjusted ordinal positions for the one or more matching sections based on an ordinal position of the first section, an ordinal position of the second section, and one or more ordinal positions of the one or more matching sections; and

    increasing the score for each adjusted ordinal position of the one or more adjusted ordinal positions disposed within the matching range of the at least one first section.

View all claims
  • 6 Assignments
Timeline View
Assignment View
    ×
    ×