FINDING DUPLICATE PASSAGES OF TEXT IN A COLLECTION OF TEXT

  • US 20130232160A1
  • Filed: 03/02/2012
  • Published: 09/05/2013
  • Est. Priority Date: 03/02/2012
  • Status: Active Grant
First Claim

1. A computer-implemented method for duplicate text detection, the method comprising:

    accessing a corpus of text;

    accessing a tokenization algorithm;

    producing, according to the tokenization algorithm, a tokenized version of at least a portion of the corpus of text, whereby the portion of the corpus of text is broken up into a set of one or more overlapping segments, where each segment is a specified number of adjacent tokens;

    calculating a rolling hash over each of the overlapping segments to produce a collection of one or more hash values, with one hash value per each overlapping segment;

    identifying, from the collection of hash values, any hash values that two or more segments have in common;

    computing an equivalence relation over the hash values, such that hashes in a same piece of duplicative text are equivalent; and

    producing, based on the equivalence relation, a report of pieces of duplicate text found in the corpus of text.
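
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is only one possible reading of the claim, not the patented implementation. Everything specific in it is an assumption made for illustration: the word-level regex tokenizer, the segment length `k`, the Rabin-Karp style polynomial hash with constants `BASE` and `MOD`, the union-find structure used to build the equivalence relation, and the helper names `tokenize`, `rolling_hashes`, and `find_duplicates`.

```python
import re
import zlib
from collections import defaultdict

BASE = 1_000_003          # assumed polynomial base
MOD = (1 << 61) - 1       # assumed large prime modulus


def tokenize(text):
    """Assumed tokenization algorithm: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def rolling_hashes(tokens, k):
    """One polynomial hash per overlapping segment of k adjacent tokens."""
    vals = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    if len(vals) < k:
        return []
    top = pow(BASE, k - 1, MOD)   # weight of the token leaving the window
    h = 0
    for v in vals[:k]:
        h = (h * BASE + v) % MOD
    out = [h]
    for i in range(k, len(vals)):
        # Drop the oldest token, shift, and add the newest token.
        h = ((h - vals[i - k] * top) * BASE + vals[i]) % MOD
        out.append(h)
    return out


def find_duplicates(text, k=4):
    """Return equivalence classes of duplicate passages as
    lists of (start_token, end_token_exclusive, passage_text)."""
    tokens = tokenize(text)
    hashes = rolling_hashes(tokens, k)

    # Hash values that two or more segments have in common.
    positions = defaultdict(list)
    for i, h in enumerate(hashes):
        positions[h].append(i)
    shared = {h for h, ps in positions.items() if len(ps) > 1}

    # Equivalence relation (union-find): shared hashes of adjacent
    # segments are treated as part of the same piece of duplicate text.
    parent = {h: h for h in shared}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]
            h = parent[h]
        return h

    for a, b in zip(hashes, hashes[1:]):
        if a in shared and b in shared:
            parent[find(a)] = find(b)

    classes = defaultdict(set)
    for h in shared:
        classes[find(h)].update(positions[h])

    # Report: merge contiguous segment starts into token ranges.
    report = []
    for starts in classes.values():
        runs = []
        for s in sorted(starts):
            if runs and s == runs[-1][1] + 1:
                runs[-1][1] = s
            else:
                runs.append([s, s])
        report.append([(a, b + k, " ".join(tokens[a:b + k])) for a, b in runs])
    return sorted(report)


if __name__ == "__main__":
    sample = ("The quick brown fox jumps over the lazy dog. "
              "And then the quick brown fox jumps over a fence.")
    for cls in find_duplicates(sample, k=4):
        print(cls)
```

In this sketch a hash collision would be reported as a duplicate, so a practical system would likely compare the underlying tokens before reporting. The adjacency rule used for the equivalence relation is also a simplification: two unrelated duplicate passages that happen to sit next to each other would be merged into one class.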
