Finding duplicate passages of text in a collection of text

  • US 10,585,975 B2
  • Filed: 03/02/2012
  • Issued: 03/10/2020
  • Est. Priority Date: 03/02/2012
  • Status: Active Grant
First Claim

1. A computer-implemented method for duplicate text detection, the method comprising:

    producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments;

    calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file;

    determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text;

    creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions;

    determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1;

    until all pairs of adjacent hash values in the partitions have been found;

    finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, and for each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and

    after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication.
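
The steps recited in the claim can be illustrated with a minimal Python sketch. It is only one possible reading of the claim language, not the patented implementation. The word-level tokenizer, the SHA-1 hash, the segment length `k`, the union-find structure used for merging, and all function names (`tokenize`, `segment_hashes`, `find_duplicates`) are assumptions made for illustration. Hash collisions are ignored.

```python
import hashlib
import re
from collections import defaultdict


def tokenize(text):
    """Lowercased word tokens (the tokenization rule is an assumption)."""
    return re.findall(r"\w+", text.lower())


def segment_hashes(tokens, k):
    """One hash per overlapping k-token segment, in file order."""
    return [
        hashlib.sha1(" ".join(tokens[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - k + 1)
    ]


def find_duplicates(files, k=3):
    """Return sorted (duplicate_text, file_names) pairs for a dict name -> text."""
    tokens = {name: tokenize(text) for name, text in files.items()}
    seqs = {name: segment_hashes(toks, k) for name, toks in tokens.items()}

    # Record, for every hash, the files it occurs in and every successor and
    # predecessor it has anywhere in the corpus (None marks a sequence end).
    files_with = defaultdict(set)
    succ = defaultdict(set)
    pred = defaultdict(set)
    for name, seq in seqs.items():
        for i, h in enumerate(seq):
            files_with[h].add(name)
            succ[h].add(seq[i + 1] if i + 1 < len(seq) else None)
            pred[h].add(seq[i - 1] if i > 0 else None)

    # Duplicate hash values: those occurring in more than one sequence.
    dup = {h for h, names in files_with.items() if len(names) > 1}

    # Start with one partition per duplicate hash (union-find forest).
    parent = {h: h for h in dup}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path halving
            h = parent[h]
        return h

    # H1 and H2 are adjacent iff every H1 is followed by H2 and every H2 is
    # preceded by H1; merge the partitions of each adjacent pair.
    for h1 in dup:
        if len(succ[h1]) != 1:
            continue
        (h2,) = succ[h1]
        if h2 in dup and pred[h2] == {h1}:
            parent[find(h2)] = find(h1)

    partitions = defaultdict(set)
    for h in dup:
        partitions[find(h)].add(h)

    # Each remaining partition is a chain of n hashes covering n + k - 1 tokens.
    report = []
    for members in partitions.values():
        names = sorted(set().union(*(files_with[h] for h in members)))
        seq = seqs[names[0]]
        start = next(i for i, h in enumerate(seq) if h in members)
        text = " ".join(tokens[names[0]][start:start + len(members) + k - 1])
        report.append((text, names))
    return sorted(report)


corpus = {
    "a.txt": "The quick brown fox jumps over the lazy dog today",
    "b.txt": "Yesterday the quick brown fox jumps high and the lazy dog sleeps",
}
for text, names in find_duplicates(corpus, k=3):
    print(names, text)
# ['a.txt', 'b.txt'] the lazy dog
# ['a.txt', 'b.txt'] the quick brown fox jumps
```

In this sketch the three duplicate segments covering "the quick brown fox jumps" end up in one merged partition because each is always followed by the next, while "the lazy dog" stays in a partition of its own because its successors differ between the two files.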
