Finding duplicate passages of text in a collection of text
First Claim
Patent Images
1. A computer-implemented method for duplicate text detection, the method comprising:
- producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments;
calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file;
determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text;
creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions;
determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1;
until all pairs of adjacent hash values in the partitions have been found;
finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, andfor each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and
after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication.
3 Assignments
0 Petitions
Accused Products
Abstract
A novel system and computer-implemented method for quickly and efficiently finding and reporting all clones with a large corpus of text. This is achieved by tokenizing the corpus, computing a rolling hash, filtering for hashes that occur more than once, and constructing an equivalence relation over these hashes in which hashes are equated if they are part of the same instance of duplication. The equivalence relation is then used to report all detected clones.
3 Citations
19 Claims
-
1. A computer-implemented method for duplicate text detection, the method comprising:
-
producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments; calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file; determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text; creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions; determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1; until all pairs of adjacent hash values in the partitions have been found; finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, and for each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A system comprising:
-
memory; at least one processor communicatively coupled to the memory, and together configured for; producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments; calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file; determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text; creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions; determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1, in the sequence of hash values; until all pairs of adjacent hash values in the partitions have been found; finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, and for each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication. - View Dependent Claims (9, 10, 11, 12, 13)
-
-
14. A computer program product comprising a hardware storage unit having computer readable program code embodied therewith, the computer readable program code configured for:
-
producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments; calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file; determining duplicate hash values that each occur more than once in the sequence of hash values, wherein each duplicate hash value is different than each of the other duplicate hash values; creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions; determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1, in the sequence of hash values; until all pairs of adjacent hash values in the partitions have been found; finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, and for each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication. - View Dependent Claims (15, 16, 17, 18, 19)
-
Specification