Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments

US 6,978,419 B1
Filed: 11/15/2000
Issued: 12/20/2005
Est. Priority Date: 11/15/2000
Status: Expired due to Fees

Abstract

Disclosed is a computer-assisted method for finding duplicate or near-duplicate documents or text spans within a document collection by using high-discriminability text fragments. Distinctive features of the documents or text spans are identified. For each pair of documents or text spans with at least one distinctive feature in common, the distinctive features of each document or text span are compared to determine whether the pair is duplicates or near-duplicates. An apparatus for performing this computer-assisted method is also disclosed.

34 Claims

1. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:
- initially, selecting distinctive features contained in the collection of documents,
- then, for each document, identifying the distinctive features contained in the document, and
- then, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents,
- wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection,
- wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents,
- wherein for each sequence of at least two words, a distinctiveness score is calculated, and the highest scoring sequences that are found in at least two documents in the document collection are considered distinctive text fragments,
- wherein the distinctiveness score is the reciprocal of the number of documents containing the text fragment multiplied by a monotonic function of the number of words in the text fragment.
- Dependent claims (2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14):
- - 2. The computer-assisted method according to claim 1, wherein the method is applied to:
    - removing duplicates in document collections;

      detecting plagiarism, detecting copyright infringement;

      determining the authorship of a document;

      clustering successive versions of a document from among a collection of documents;

      seeding a text classification or text clustering algorithm with sets of duplicate or near-duplicate documents;

      matching an e-mail message with responses to the e-mail message;

      matching responses to an e-mail message with the e-mail message;

      creating a document index for use with a query system to efficiently find documents in response to a query which contain a particular phrase or excerpt, or any combination thereof.
  - 3. The computer-assisted method according to claim 2, wherein the document index can be utilized even if the particular phrase or excerpt was not recorded correctly in the document or in the query.
  - 5. The computer-assisted method according to claim 1, wherein the method is applied to information retrieval methods.
  - 6. The computer-assisted method according to claim 5, wherein a text classification method is applied to the information retrieval method.
  - 7. The computer-assisted method according to claim 5, wherein:
    - the information retrieval method assumes word independence, and
    - the distinctive text fragments are added to an index set.
  - 8. The computer-assisted method according to claim 1, wherein if one distinctive text fragment is contained within another distinctive text fragment within the same document, only the longest distinctive text fragment is considered as a distinctive feature.
  - 9. The computer-assisted method according to claim 1, wherein the sequences of at least two words are considered as appearing in a document when the document contains the sequence of a user-specified minimum frequency.
  - 10. The computer-assisted method according to claim 1, wherein the monotonic function is the number of words in the text fragment.
  - 11. The computer-assisted method according to claim 1, wherein the limited number is selected by a user.
  - 12. The computer-assisted method according to claim 1, wherein the limited number is defined by a linear function of the number of documents in the document collection.
  - 13. The computer-assisted method according to claim 1, wherein the distinctive text fragments include glue words.
  - 14. The computer-assisted method according to claim 13, wherein the glue words do not appear at either extreme of the distinctive text fragments.
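
The scoring rule in claims 1 and 10 can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the function names, the whitespace tokenizer, the n-gram length cap, and the `max_docs` and `top_k` parameters are all assumptions made for the example. `max_docs` stands in for the "limited number" of documents, and the monotonic function is taken to be the word count, as in claim 10.

```python
from collections import defaultdict


def word_ngrams(text, min_len=2, max_len=4):
    """Yield every contiguous word sequence of min_len..max_len words."""
    words = text.lower().split()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def distinctive_fragments(docs, max_docs=3, top_k=10):
    """Rank candidate fragments by (1 / document frequency) * word count.

    max_docs, top_k and the n-gram length cap are illustrative parameters,
    not values taken from the claims.
    """
    # Map each fragment to the set of document ids that contain it.
    containing = defaultdict(set)
    for doc_id, text in docs.items():
        for frag in word_ngrams(text):
            containing[frag].add(doc_id)

    scored = []
    for frag, ids in containing.items():
        df = len(ids)
        # Keep fragments shared by at least two documents, but by no more
        # than the assumed "limited number" of documents.
        if 2 <= df <= max_docs:
            score = (1.0 / df) * len(frag)  # claim 10: f(n) = n
            scored.append((score, frag))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]
```

For example, given the three documents "the quick brown fox jumps", "a quick brown fox sleeps" and "the lazy dog", the three-word fragment shared by the first two documents scores 3 × 1/2 = 1.5 and outranks both of its two-word sub-fragments, which score 1.0 each.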

4. The computer-assisted method according to claim 1, wherein the distinctive features appear in a different order in each of the documents.
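
Claims 8, 13 and 14 constrain which fragments count as features: a fragment nested inside a longer distinctive fragment is dropped, and glue words may appear inside a fragment but not at either end. The sketch below illustrates both filters under stated assumptions. The claims listed here do not define a glue-word list, so `GLUE_WORDS` is a hypothetical set of short function words, and containment is tested on word sequences rather than on positions within a particular document.

```python
# Hypothetical glue-word list; the claims do not enumerate one.
GLUE_WORDS = {"a", "an", "and", "of", "the", "to", "in"}


def has_glue_at_edge(fragment):
    """True when the first or last word of the fragment is a glue word."""
    return fragment[0] in GLUE_WORDS or fragment[-1] in GLUE_WORDS


def contains(longer, shorter):
    """True when `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def keep_longest(fragments):
    """Drop any fragment that is contained in a longer retained fragment."""
    kept = []
    # Visit longer fragments first so that shorter ones can be tested
    # against everything already retained.
    for frag in sorted(set(fragments), key=len, reverse=True):
        if not any(contains(longer, frag) for longer in kept):
            kept.append(frag)
    return kept
```

Applied to one document's candidate fragments, a caller would typically discard those for which `has_glue_at_edge` is true and then pass the remainder through `keep_longest`.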

15. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:
- initially, selecting distinctive features contained in the collection of documents,
- then, for each document, identifying the distinctive features contained in the document, and
- then, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents,
- wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection,
- wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents,
- further including the step of, for each pair of documents having at least one distinctive feature in common, counting the number of distinctive features in common,
- wherein determining whether the pair of documents is duplicates or near-duplicates includes the steps of:
  - for each pair of documents, calculating an overlap ratio by dividing the number of distinctive features in common by the smaller of the number of distinctive features per document, and
  - comparing the overlap ratio to a threshold and if the overlap ratio is greater than the threshold, then the pair of documents are duplicates or near-duplicates, otherwise the pair of documents is not duplicates or near-duplicates,
- building a document index that maps each document to its associated distinctive features, wherein if one distinctive feature is repeated within one document the index maps the document to the distinctive feature once, and
- building a feature index that maps each distinctive feature to its associated document, wherein if one distinctive feature is repeated within one document, the index maps the distinctive feature to the document once,
- wherein determining whether the pair of documents are duplicates or near-duplicates further includes the steps of:
  - creating a list of unique distinctive features from the document index,
  - for each unique distinctive feature, creating a list of documents which contain the unique distinctive feature, and
  - for each document, creating a list of documents that have at least one feature in common with the document and the number of features in common with the document.
- Dependent claims (16, 17, 18, 19, 20):
- - 16. The computer-assisted method according to claim 15, wherein the distinctive features include distinctive phrases.
  - 17. The computer-assisted method according to claim 15, wherein the distinctive features appear in a different order in each of the documents.
  - 18. The computer-assisted method according to claim 15, wherein the distinctive features include text spans.
  - 19. The computer-assisted method according to claim 18, wherein the text spans include sentences.
  - 20. The computer-assisted method according to claim 18, wherein the text spans include lines of text.
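
The index-building and overlap-ratio steps of claim 15 might look like the following sketch. It assumes the distinctive features of each document have already been identified and arrive as a mapping from document id to a list of features. The function name `find_near_duplicates` and the default threshold of 0.5 are illustrative choices, since the claim does not fix a threshold value.

```python
from collections import defaultdict


def find_near_duplicates(doc_features, threshold=0.5):
    """Return pairs whose overlap ratio exceeds the threshold.

    doc_features maps a document id to the distinctive features found in it.
    The threshold value is illustrative; the claim does not fix one.
    """
    # Document index: each document maps to each of its features once.
    doc_index = {doc: set(feats) for doc, feats in doc_features.items()}

    # Feature index: each feature maps to each containing document once.
    feature_index = defaultdict(set)
    for doc, feats in doc_index.items():
        for feat in feats:
            feature_index[feat].add(doc)

    # Count shared features only for pairs that co-occur under some feature,
    # so documents with nothing in common are never compared.
    shared = defaultdict(int)
    for docs in feature_index.values():
        ordered = sorted(docs)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    pairs = {}
    for (a, b), common in shared.items():
        # Overlap ratio: features in common over the smaller feature count.
        smaller = min(len(doc_index[a]), len(doc_index[b]))
        ratio = common / smaller
        if ratio > threshold:
            pairs[(a, b)] = ratio
    return pairs
```

Dividing by the smaller feature count means a short document wholly contained in a longer one still reaches a ratio of 1.0. Because candidate pairs are generated from the feature index, the work grows with the number of co-occurring pairs rather than with all possible document pairs.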

21. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:
- initially, selecting distinctive features contained in the collection of documents,
- then, for each document, identifying the distinctive features contained in the document, and
- then, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents,
- wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection,
- wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents,
- wherein for each sequence of at least two words, a distinctiveness score is calculated, and the highest scoring sequences that are found in at least two documents in the document collection are considered distinctive text fragments,
- wherein the distinctiveness score is the percentage of documents not containing the phrase multiplied by a monotonic function of the number of words in the text fragment.
- Dependent claims (22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34):
- - 22. The computer-assisted method according to claim 21, wherein the monotonic function is the number of words in the text fragment.
  - 23. The computer-assisted method according to claim 21, wherein the method is applied to:
    - removing duplicates in document collections;
      
      detecting plagiarism, detecting copyright infringement;
      
      determining the authorship of a document;
      
      clustering successive versions of a document from among a collection of documents;
      
      seeding a text classification or text clustering algorithm with sets of duplicate or near-duplicate documents;
      
      matching an e-mail message with responses to the e-mail message;
      
      matching responses to an e-mail message with the e-mail message;
      
      creating a document index for use with a query system to efficiently find documents in response to a query which contain a particular phrase or excerpt, or any combination thereof.
  - 24. The computer-assisted method according to claim 23, wherein the document index can be utilized even if the particular phrase or excerpt was not recorded correctly in the document or in the query.
  - 26. The computer-assisted method according to claim 21, wherein the method is applied to information retrieval methods.
  - 27. The computer-assisted method according to claim 26, wherein a text classification method is applied to the information retrieval method.
  - 28. The computer-assisted method according to claim 26, wherein:
    - the information retrieval method assumes word independence, and
    - the distinctive text fragments are added to an index set.
  - 29. The computer-assisted method according to claim 21, wherein if one distinctive text fragment is contained within another distinctive text fragment within the same document, only the longest distinctive text fragment is considered as a distinctive feature.
  - 30. The computer-assisted method according to claim 21, wherein the sequences of at least two words are considered as appearing in a document when the document contains the sequence of a user-specified minimum frequency.
  - 31. The computer-assisted method according to claim 21, wherein the limited number is selected by a user.
  - 32. The computer-assisted method according to claim 21, wherein the limited number is defined by a linear function of the number of documents in the document collection.
  - 33. The computer-assisted method according to claim 21, wherein the distinctive text fragments include glue words.
  - 34. The computer-assisted method according to claim 33, wherein the glue words do not appear at either extreme of the distinctive text fragments.

25. The computer-assisted method according to claim 21, wherein the distinctive features appear in a different order in each of the documents.
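
Claim 21 differs from claim 1 only in the distinctiveness score: the share of documents that do not contain the fragment replaces the reciprocal of the document count. The sketch below places the two formulas side by side, taking the monotonic function to be the word count (claims 10 and 22). The function names are hypothetical, and the "percentage" is expressed as a fraction between 0 and 1, which leaves the ranking of fragments unchanged.

```python
def score_reciprocal(doc_freq, n_words):
    """Claim 1 style score: (1 / documents containing it) * word count."""
    return (1.0 / doc_freq) * n_words


def score_complement(doc_freq, n_docs, n_words):
    """Claim 21 style score: share of documents lacking it * word count."""
    return (1.0 - doc_freq / n_docs) * n_words
```

In a collection of 1,000 documents, a three-word fragment found in 2 documents and one found in 4 documents score 2.994 and 2.988 under the complement form, against 1.5 and 0.75 under the reciprocal form. For rare fragments in a large collection, the complement form is therefore driven mostly by fragment length.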

Current Assignee: Justsystems Corporation
Original Assignee: Justsystems Corporation
Inventors: Kantrowitz, Mark
Primary Examiner(s): HUYNH, CONG LAC T
Application Number: US09/713,733
Time in Patent Office: 1,861 Days
Field of Search: 715/511, 715/500, 715/530, 707/3-6
US Class Current: 715/209
CPC Class Codes: G06F 16/3346 using probabilistic model
