FINDING DUPLICATE PASSAGES OF TEXT IN A COLLECTION OF TEXT
First Claim
Patent Images
1. A computer-implemented method for duplicate text detection, the method comprising:
- accessing a corpus of text;
accessing a tokenization algorithm;
producing, according to the tokenization algorithm, a tokenized version of at least a portion of the corpus of text, whereby the portion of the corpus of text is broken up into a set of one or more overlapping segments, where each segment is a specified number of adjacent tokens;
calculating a rolling hash over each of the overlapping segments to produce a collection of one or more hash values, with one hash value per each overlapping segment;
identifying, from the collection of hash values, any hash values that two or more segments have in common;
computing an equivalence relation over the hash values, such that hashes in a same piece of duplicative text are equivalent; and
producing, based on the equivalence relation, a report of pieces of duplicate text found in the corpus of text.
3 Assignments
0 Petitions
Accused Products
Abstract
A novel system and computer-implemented method for quickly and efficiently finding and reporting all clones with a large corpus of text. This is achieved by tokenizing the corpus, computing a rolling hash, filtering for hashes that occur more than once, and constructing an equivalence relation over these hashes in which hashes are equated if they are part of the same instance of duplication. The equivalence relation is then used to report all detected clones.
17 Citations
18 Claims
-
1. A computer-implemented method for duplicate text detection, the method comprising:
-
accessing a corpus of text; accessing a tokenization algorithm; producing, according to the tokenization algorithm, a tokenized version of at least a portion of the corpus of text, whereby the portion of the corpus of text is broken up into a set of one or more overlapping segments, where each segment is a specified number of adjacent tokens; calculating a rolling hash over each of the overlapping segments to produce a collection of one or more hash values, with one hash value per each overlapping segment; identifying, from the collection of hash values, any hash values that two or more segments have in common; computing an equivalence relation over the hash values, such that hashes in a same piece of duplicative text are equivalent; and producing, based on the equivalence relation, a report of pieces of duplicate text found in the corpus of text. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A system comprising:
-
memory; at least one processor communicatively coupled to the memory, and together configured for; accessing a corpus of text; accessing a tokenization algorithm; producing, according to the tokenization algorithm, a tokenized version of at least a portion of the corpus of text, whereby the portion of the corpus of text is broken up into a set of one or more overlapping segments, where each segment is a specified number of adjacent tokens; calculating a rolling hash over each of the overlapping segments to produce a collection of one or more hash values, with one hash value per each overlapping segment; identifying, from the collection of hash values, any hash values that two or more segments have in common; computing an equivalence relation over the hash values, such that hashes in a same piece of duplicative text are equivalent; and producing, based on the equivalence relation, a report of pieces of duplicate text found in the corpus of text. - View Dependent Claims (8, 9, 10, 11, 12)
-
-
13. A computer program product comprising a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured for:
-
accessing a corpus of text; accessing a tokenization algorithm; producing, according to the tokenization algorithm, a tokenized version of at least a portion of the corpus of text, whereby the portion of the corpus of text is broken up into a set of one or more overlapping segments, where each segment is a specified number of adjacent tokens; calculating a rolling hash over each of the overlapping segments to produce a collection of one or more hash values, with one hash value per each overlapping segment; identifying, from the collection of hash values, any hash values that two or more segments have in common; computing an equivalence relation over the hash values, such that hashes in a same piece of duplicative text are equivalent; and producing, based on the equivalence relation, a report of pieces of duplicate text found in the corpus of text. - View Dependent Claims (14, 15, 16, 17, 18)
-
Specification