Finding duplicate passages of text in a collection of text

US 10,585,975 B2
Filed: 03/02/2012
Issued: 03/10/2020
Est. Priority Date: 03/02/2012
Status: Active Grant

First Claim

Patent Images

1. A computer-implemented method for duplicate text detection, the method comprising:

producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments;

calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file;

determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text;

creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions;

determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1;

until all pairs of adjacent hash values in the partitions have been found;

finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, andfor each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and

after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication.

View all claims

3 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A novel system and computer-implemented method for quickly and efficiently finding and reporting all clones with a large corpus of text. This is achieved by tokenizing the corpus, computing a rolling hash, filtering for hashes that occur more than once, and constructing an equivalence relation over these hashes in which hashes are equated if they are part of the same instance of duplication. The equivalence relation is then used to report all detected clones.

3 Citations

19 Claims

1. A computer-implemented method for duplicate text detection, the method comprising:
- producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments;
  
  calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file;
  
  determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text;
  
  creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions;
  
  determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1;
  
  until all pairs of adjacent hash values in the partitions have been found;
  
  finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, andfor each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and
  
  after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The method of claim 1, where producing the report of pieces of duplicate text found in the corpus of text includes using the hash values in the remaining partitions to map the pieces of duplicate text back to their respective locations in the corpus of text.
  - 3. The method of claim 1, where merging respective partitions having the adjacent hash values to form a respective merged partition comprises merging respective partitions using a union-find abstract data type.
  - 4. The method of claim 1, wherein calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text comprises calculating a rolling hash over each of the overlapping segments of each file of the plurality of files of the corpus of text.
  - 5. The method of claim 4, wherein calculating a rolling hash over each of the overlapping segments is done according to:
  - 6. The method of claim 1, wherein producing a respective tokenized version of at least the portion of each file in the plurality of files of the corpus of text comprises preliminarily converting words to a lower case form and discarding punctuation in at least the portion of each file in the plurality of files of the corpus of text.
  - 7. The method of claim 1, wherein merging respective partitions having the adjacent hash values to form the respective merged partition comprises merging respective partitions having the adjacent hash values to form the respective merged partition that includes only hash values that represent different portions of a contiguous passage of duplicate text.

8. A system comprising:
- memory;
  
  at least one processor communicatively coupled to the memory, and together configured for;
  
  producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments;
  
  calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file;
  
  determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text;
  
  creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions;
  
  determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1, in the sequence of hash values;
  
  until all pairs of adjacent hash values in the partitions have been found;
  
  finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, andfor each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and
  
  after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication.
- View Dependent Claims (9, 10, 11, 12, 13)
- - 9. The system of claim 8, where producing the report of pieces of duplicate text found in the corpus of text includes using the hash values in the remaining partitions to map the pieces of duplicate text back to their respective locations in the corpus of text.
  - 10. The system of claim 8, where merging respective partitions having the adjacent hash values to form a respective merged partition comprises merging respective partitions using a union-find abstract data type.
  - 11. The system of claim 8, wherein calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text comprises calculating a rolling hash over each of the overlapping segments of each file of the plurality of files of the corpus of text.
  - 12. The system of claim 11, wherein calculating a rolling hash over each of the overlapping segments is done according to:
  - 13. The system of claim 8, wherein producing a tokenized version of at least the portion of each file in the plurality of files of the corpus of text comprises preliminarily converting words to a lower case form and discarding punctuation in at least the portion of each file in the plurality of files of the corpus of text.

14. A computer program product comprising a hardware storage unit having computer readable program code embodied therewith, the computer readable program code configured for:
- producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments;
  
  calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file;
  
  determining duplicate hash values that each occur more than once in the sequence of hash values, wherein each duplicate hash value is different than each of the other duplicate hash values;
  
  creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions;
  
  determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1, in the sequence of hash values;
  
  until all pairs of adjacent hash values in the partitions have been found;
  
  finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, andfor each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and
  
  after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication.
- View Dependent Claims (15, 16, 17, 18, 19)
- - 15. The computer program product of claim 14, where producing the report of pieces of duplicate text found in the corpus of text includes using the hash values in the remaining partitions to map the pieces of duplicate text back to their respective locations in the corpus of text.
  - 16. The computer program product of claim 14, where merging respective partitions having the adjacent hash values to form a respective merged partition comprises merging respective partitions using a union-find abstract data type.
  - 17. The computer program product of claim 14, wherein calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text comprises calculating a rolling hash over each of the overlapping segments of each file of the plurality of files of the corpus of text.
  - 18. The computer program product of claim 17, wherein calculating a rolling hash over each of the overlapping segments is done according to:
  - 19. The computer program product of claim 14, wherein producing a tokenized version of at least the portion of each file in the plurality of files of the corpus of text comprises preliminarily converting words to a lower case form and discarding punctuation in at least the portion of each file in the plurality of files of the corpus of text.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Github Software UK Limited
Inventors
Tibble, Julian David
Primary Examiner(s)
Almani, Mohsen

Application Number

US13/411,234
Publication Number

US 20130232160A1
Time in Patent Office

2,930 Days
Field of Search

None
US Class Current
CPC Class Codes

G06F 40/194 Calculation of difference b...

Finding duplicate passages of text in a collection of text

First Claim

3 Assignments

0 Petitions

Accused Products

Abstract

3 Citations

19 Claims

Specification

Solutions

Use Cases

Quick Links

Finding duplicate passages of text in a collection of text

First Claim

3 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

3 Citations

19 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links