Detecting duplicate documents using classification

US 8,180,773 B2
Filed: 05/27/2009
Issued: 05/15/2012
Est. Priority Date: 05/27/2009
Status: Expired due to Fees

Abstract

Systems, methods, and articles of manufacture are disclosed for detecting a duplicate document. A plurality of documents may be assigned to categories, each category corresponding to a collection of duplicate or near-duplicate documents. A new document may be received. The new document may be evaluated against each category to determine a similarity score between the new document and each category. The new document may be identified as a duplicate based on the similarity scores and the thresholds for each category. An action may then be performed on the duplicate based on duplication rules. The thresholds and duplication rules may be customized by a user.

21 Claims

1. A computer-implemented method for managing a collection of documents, comprising:
- determining, by operation of one or more computer processors, a similarity score between a received document and each of a plurality of categories, wherein each category is assigned one or more documents, and wherein each category is associated with a first similarity threshold for determining duplicates and a second similarity threshold for determining near-duplicates, wherein the similarity thresholds for each category are specified by one or more duplication rules associated with the respective category, and wherein the one or more duplication rules for each category further specify, for the respective category, a first action to be performed on a duplicate and a second, different action to be performed on a near-duplicate;
  
  determining, based on the plurality of similarity scores and the similarity thresholds, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
  
  upon determining that the received document is not a duplicate and not a near duplicate of a document assigned to any of the plurality of categories:
  
  training a classifier associated with a new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
  
  upon determining that the received document is a duplicate or near duplicate of a document assigned to at least one of the plurality of categories:
  
  assigning the document to the determined category, and training a classifier associated with the determined category using the received document.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The computer-implemented method of claim 1, wherein determining the similarity score between the received document and each respective category comprises supplying the received document as input to a classifier associated with each respective category, wherein each classifier is configured to generate the measure of similarity between the received document and the documents assigned to a respective category.
  - 3. The computer-implemented method of claim 2, wherein the documents assigned to each respective category are duplicates, or near duplicates, of one another.
  - 4. The computer-implemented method of claim 1, wherein the operation further comprises, upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories, assigning the received document to the new category.
  - 5. The computer-implemented method of claim 1, wherein determining, based on the plurality of similarity scores, whether the received document is a near-duplicate of one of the documents assigned to one of the plurality of categories comprises whether the similarity score exceeds the second similarity threshold.
  - 6. The computer-implemented method of claim 1, wherein the operation further comprises excluding, based on a common template, at least some content in the received document from the determination of the similarity score between the received document and each of the plurality of categories.
  - 7. The computer-implemented method of claim 1, wherein the operation further comprises, upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories, performing a user-specified action selected from at least one of, deprecating the duplicate, or near duplicate document, removing the received document from the collection of documents, removing at least the duplicate, or near duplicate document from the collection of documents, and notifying a user that the received document is a duplicate, or near duplicate, of at least one document in the collection of documents, wherein the first similarity threshold for each category is different from the second similarity threshold for the respective category, wherein the first similarity threshold for at least a first category is different from the first similarity threshold for at least a second category.
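
The flow recited in claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Category` class, the `process` function, the Jaccard token overlap used as a stand-in "classifier", and the threshold values 0.9 and 0.6 are all assumptions introduced here. It also assumes that scores lie in [0, 1], that a higher score means more similar, and that the duplicate threshold is the higher of the two.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def tokens(text: str) -> frozenset[str]:
    """Lowercased word set; a deliberately crude document representation."""
    return frozenset(text.lower().split())


@dataclass
class Category:
    """One category with its two thresholds and a toy stand-in classifier."""
    name: str
    duplicate_threshold: float = 0.9       # first threshold (assumed value)
    near_duplicate_threshold: float = 0.6  # second threshold (assumed value)
    members: list[frozenset[str]] = field(default_factory=list)

    def train(self, doc: str) -> None:
        # "Training" here only memorises the document's token set.
        self.members.append(tokens(doc))

    def score(self, doc: str) -> float:
        # Similarity score: best Jaccard overlap with any member document.
        t = tokens(doc)
        return max(
            (len(t & m) / len(t | m) for m in self.members if t | m),
            default=0.0,
        )


def process(doc: str, categories: list[Category]) -> tuple[str, str]:
    """Return (label, category name) and update the categories in place."""
    best, best_score = None, 0.0
    for cat in categories:
        s = cat.score(doc)
        # Keep the highest-scoring category that clears its own threshold.
        if s > cat.near_duplicate_threshold and s > best_score:
            best, best_score = cat, s

    if best is None:
        # Not a duplicate or near-duplicate of anything: start a new category.
        best = Category(name=f"category-{len(categories) + 1}")
        categories.append(best)
        label = "new"
    elif best_score > best.duplicate_threshold:
        label = "duplicate"
    else:
        label = "near-duplicate"

    # In both branches the chosen category's classifier is trained on the document.
    best.train(doc)
    return label, best.name
```

Calling `process` repeatedly on an initially empty list grows the set of categories: the first document always opens a new category, and later documents either join an existing category or open another one.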

8. A computer program product, the computer program product comprising a computer usable storage medium having computer usable program code for managing a collection of documents, the code being configured for:
- determining, by operation of one or more computer processors when executing the computer usable program code, a similarity score between a received document and each of a plurality of categories, wherein each category is assigned one or more documents, and wherein each category is associated with a first similarity threshold for determining duplicates and a second similarity threshold for determining near-duplicates, wherein the similarity thresholds for each category are specified by one or more duplication rules associated with the respective category, and wherein the one or more duplication rules for each category further specify, for the respective category, a first action to be performed on a duplicate and a second, different action to be performed on a near-duplicate;
  
  determining, based on the plurality of similarity scores and the similarity thresholds, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
  
  upon determining that the received document is not a duplicate and not a near duplicate of a document assigned to any of the plurality of categories:
  
  training a classifier associated with a new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
  
  upon determining that the received document is a duplicate or near duplicate of a document assigned to at least one of the plurality of categories:
  
  assigning the document to the determined category, and training a classifier associated with the determined category using the received document.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The computer program product of claim 8, wherein determining the similarity score between the received document and each respective category comprises supplying the received document as input to a classifier associated with each respective category, wherein each classifier is configured to generate the measure of similarity between the received document and the documents assigned to a respective category.
  - 10. The computer program product of claim 9, wherein the documents assigned to each respective category are duplicates, or near duplicates, of one another.
  - 11. The computer program product of claim 8, wherein the code is further configured for, upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories, assigning the received document to the new category.
  - 12. The computer program product of claim 8, wherein determining, based on the plurality of similarity scores, whether the received document is a near-duplicate of one of the documents assigned to one of the plurality of categories comprises whether the similarity score exceeds the second similarity threshold.
  - 13. The computer program product of claim 8, wherein the code is further configured for excluding, based on a common template, at least some content in the received document from the determination of the similarity score between the received document and each of the plurality of categories.
  - 14. The computer program product of claim 8, wherein the code is further configured for, upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories, performing a user-specified action selected from at least one of, deprecating the duplicate, or near duplicate document, removing the received document from the collection of documents, removing at least the duplicate, or near duplicate document from the collection of documents, and notifying a user that the received document is a duplicate, or near duplicate, of at least one document in the collection of documents, wherein the first similarity threshold for each category is different from the second similarity threshold for the respective category, wherein the first similarity threshold for at least a first category is different from the first similarity threshold for at least a second category.
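
Claims 6, 13, and 20 describe excluding content that comes from a common template before scoring. One way to picture this, purely as an assumption-laden sketch, is to drop every line that also appears verbatim in the template. The helper names `strip_template` and `jaccard`, the line-level matching, and the sample memo text are hypothetical and are not taken from the patent.

```python
def strip_template(document: str, template: str) -> str:
    """Drop lines of `document` that also appear verbatim in `template`."""
    template_lines = {ln.strip() for ln in template.splitlines() if ln.strip()}
    kept = [
        ln for ln in document.splitlines()
        if ln.strip() and ln.strip() not in template_lines
    ]
    return "\n".join(kept)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a stand-in similarity score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical shared letterhead that every memo starts with.
TEMPLATE = "ACME Corp Internal Memo\nConfidential - do not distribute"
doc_a = TEMPLATE + "\nQuarterly sales rose in Europe"
doc_b = TEMPLATE + "\nThe cafeteria menu changes Monday"

# Without exclusion, the shared letterhead inflates the score.
raw_score = jaccard(doc_a, doc_b)

# With exclusion, only the memo bodies are compared.
clean_score = jaccard(
    strip_template(doc_a, TEMPLATE),
    strip_template(doc_b, TEMPLATE),
)
```

In this toy case the two memos share nothing except the letterhead, so the score on the stripped text drops to zero while the raw score stays well above it. That gap is the kind of false similarity that template exclusion is meant to avoid.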

15. A system, comprising:
- a processor; and
  
  a memory containing an application program configured to manage a collection of documents, which, when executed on the processor, is configured to perform an operation comprising:
  
  determining a similarity score between a received document and each of a plurality of categories, wherein each category is assigned one or more documents, and wherein each category is associated with a first similarity threshold for determining duplicates and a second similarity threshold for determining near-duplicates, wherein the similarity thresholds for each category are specified by one or more duplication rules associated with the respective category, and wherein the one or more duplication rules for each category further specify, for the respective category, a first action to be performed on a duplicate and a second, different action to be performed on a near-duplicate;
  
  determining, based on the plurality of similarity scores and the similarity thresholds, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
  
  upon determining that the received document is not a duplicate and not a near duplicate of a document assigned to any of the plurality of categories:
  
  training a classifier associated with a new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
  
  upon determining that the received document is a duplicate or near duplicate of a document assigned to at least one of the plurality of categories:
  
  assigning the document to the determined category; and
  
  training a classifier associated with the determined category using the received document.
- Dependent claims (16, 17, 18, 19, 20, 21):
  - 16. The system of claim 15, wherein determining the similarity score between the received document and each respective category comprises supplying the received document as input to a classifier associated with each respective category, wherein each classifier is configured to generate the measure of similarity between the received document and the documents assigned to a respective category.
  - 17. The system of claim 15, wherein the documents assigned to each respective category are duplicates, or near duplicates, of one another.
  - 18. The system of claim 15, wherein the operation further comprises, upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories, assigning the received document to the new category.
  - 19. The system of claim 15, wherein determining, based on the plurality of similarity scores, whether the received document is a near-duplicate of one of the documents assigned to one of the plurality of categories comprises whether the similarity score exceeds the second similarity threshold.
  - 20. The system of claim 15, wherein the operation further comprises excluding, based on a common template, at least some content in the received document from the determination of the similarity score between the received document and each of the plurality of categories.
  - 21. The system of claim 15, wherein the operation further comprises, upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories, performing a user-specified action selected from at least one of, deprecating the duplicate, or near duplicate document, removing the received document from the collection of documents, removing at least the duplicate, or near duplicate document from the collection of documents, and notifying a user that the received document is a duplicate, or near duplicate, of at least one document in the collection of documents, wherein the first similarity threshold for each category is different from the second similarity threshold for the respective category, wherein the first similarity threshold for at least a first category is different from the first similarity threshold for at least a second category.
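
The per-category duplication rules recited in the independent claims and in claims 7, 14, and 21 pair two thresholds with two different actions, and the thresholds may differ from one category to the next. A minimal sketch of such a rule table follows. The `DuplicationRule` class, the `action_for` function, the category names, the action labels, and all numeric values are hypothetical. The sketch also assumes that the duplicate threshold is the higher one and that a score must exceed a threshold to trigger its action.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DuplicationRule:
    """User-customisable rule for one category (all values are illustrative)."""
    duplicate_threshold: float
    near_duplicate_threshold: float
    duplicate_action: str
    near_duplicate_action: str

    def __post_init__(self) -> None:
        # The claims require two different thresholds and two different actions.
        if not self.near_duplicate_threshold < self.duplicate_threshold:
            raise ValueError("near-duplicate threshold must be below duplicate threshold")
        if self.duplicate_action == self.near_duplicate_action:
            raise ValueError("duplicate and near-duplicate actions must differ")


def action_for(score: float, rule: DuplicationRule) -> Optional[str]:
    """Pick the action a rule prescribes for a similarity score, if any."""
    if score > rule.duplicate_threshold:
        return rule.duplicate_action
    if score > rule.near_duplicate_threshold:
        return rule.near_duplicate_action
    return None  # neither duplicate nor near-duplicate for this category


# Hypothetical rule table: thresholds and actions differ between categories.
RULES = {
    "invoices": DuplicationRule(0.95, 0.80, "remove_received", "notify_user"),
    "press-releases": DuplicationRule(0.85, 0.60, "deprecate_existing", "notify_user"),
}
```

With this table, the same score of 0.9 counts as a near-duplicate for `invoices` and as a duplicate for `press-releases`, so `action_for` returns a different action for each category.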

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chitiveli, Srinivas V.; Emanuel, Barton W.; Holt, Alexander W.; Moran, Michael E.
Primary Examiner(s): Lovel, Kimberly
Application Number: US12/472,758
Publication Number: US 20100306204A1
Time in Patent Office: 1,084 Days
Field of Search: 707/752, 707/999.102, 707/737
US Class Current: 707/737
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 16/355 Class or cluster creation o...
