Detecting duplicate documents using classification
Abstract
Systems, methods and articles of manufacture are disclosed for detecting a duplicate document. A plurality of documents may be assigned to categories, each category corresponding to a collection of duplicate or near-duplicate documents. A new document may be received. The new document may be evaluated against each category to determine a similarity score between the new document and each category. The new document may be identified as a duplicate based on the similarity scores and thresholds for each category. An action may then be performed on the duplicate based on duplication rules. The thresholds and duplication rules may be customized by a user.
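The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the similarity score is computed, so the code below swaps in a simple stand-in: Jaccard similarity over word shingles, with a category's score taken as the best match among its member documents. The function names (`shingles`, `jaccard`, `category_scores`) and the shingle size are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a text (assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def category_scores(new_doc: str, categories: Dict[str, List[str]]) -> Dict[str, float]:
    """Score a new document against every category.

    Assumption: a category's score is the highest similarity between the new
    document and any document already assigned to that category.
    """
    new_sh = shingles(new_doc)
    return {
        name: max((jaccard(new_sh, shingles(d)) for d in docs), default=0.0)
        for name, docs in categories.items()
    }
```

Each category receives its own score, so per-category thresholds can then be applied to decide whether the new document counts as a duplicate.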
21 Claims
1. A computer-implemented method for managing a collection of documents, comprising:
determining, by operation of one or more computer processors, a similarity score between a received document and each of a plurality of categories, wherein each category is assigned one or more documents, and wherein each category is associated with a first similarity threshold for determining duplicates and a second similarity threshold for determining near-duplicates, wherein the similarity thresholds for each category are specified by one or more duplication rules associated with the respective category, and wherein the one or more duplication rules for each category further specify, for the respective category, a first action to be performed on a duplicate and a second, different action to be performed on a near-duplicate;
determining, based on the plurality of similarity scores and the similarity thresholds, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
upon determining that the received document is not a duplicate and not a near duplicate of a document assigned to any of the plurality of categories;
training a classifier associated with a new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
upon determining that the received document is a duplicate or near duplicate of a document assigned to at least one of the plurality of categories;
assigning the document to the determined category, and
training a classifier associated with the determined category using the received document.
Dependent claims: 2, 3, 4, 5, 6, 7
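The flow recited in claim 1 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the "classifier" is replaced by a stand-in (cosine similarity against a category's accumulated term counts), the best-scoring category is chosen when several qualify, and every name (`DuplicationRule`, `Category`, `process`, the action strings, the `category-N` naming) is hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DuplicationRule:
    duplicate_threshold: float       # first similarity threshold
    near_duplicate_threshold: float  # second threshold, assumed <= the first
    duplicate_action: str            # first action, e.g. "discard"
    near_duplicate_action: str       # second, different action


@dataclass
class Category:
    rule: DuplicationRule
    documents: List[str] = field(default_factory=list)
    term_counts: Counter = field(default_factory=Counter)  # stand-in classifier state

    def train(self, doc: str) -> None:
        """Assign the document to the category and update the stand-in classifier."""
        self.documents.append(doc)
        self.term_counts.update(doc.lower().split())

    def score(self, doc: str) -> float:
        """Cosine similarity between the document and the category's term counts."""
        vec = Counter(doc.lower().split())
        dot = sum(count * self.term_counts[term] for term, count in vec.items())
        doc_norm = math.sqrt(sum(v * v for v in vec.values()))
        cat_norm = math.sqrt(sum(v * v for v in self.term_counts.values()))
        return dot / (doc_norm * cat_norm) if doc_norm and cat_norm else 0.0


def process(doc: str, categories: Dict[str, Category],
            default_rule: DuplicationRule) -> Tuple[str, str, Optional[str]]:
    """Return (category name, status, action) for a received document."""
    best: Optional[Tuple[float, str]] = None
    for name, cat in categories.items():
        s = cat.score(doc)
        # A category qualifies only if the score meets its own near-duplicate threshold.
        if s >= cat.rule.near_duplicate_threshold and (best is None or s > best[0]):
            best = (s, name)
    if best is None:
        # Neither duplicate nor near-duplicate: start a new category and train on the document.
        name = f"category-{len(categories) + 1}"
        categories[name] = Category(rule=default_rule)
        categories[name].train(doc)
        return name, "new", None
    s, name = best
    cat = categories[name]
    if s >= cat.rule.duplicate_threshold:
        status, action = "duplicate", cat.rule.duplicate_action
    else:
        status, action = "near-duplicate", cat.rule.near_duplicate_action
    cat.train(doc)  # the matched category's classifier is also trained on the document
    return name, status, action
```

A caller would keep one `categories` dictionary for the whole collection and pass every incoming document through `process`, dispatching on the returned action string. Because each `Category` carries its own `DuplicationRule`, thresholds and actions can differ from one category to the next.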
8. A computer program product, the computer program product comprising a computer usable storage medium having computer usable program code for managing a collection of documents, the code being configured for:
determining, by operation of one or more computer processors when executing the computer usable program code, a similarity score between a received document and each of a plurality of categories, wherein each category is assigned one or more documents, and wherein each category is associated with a first similarity threshold for determining duplicates and a second similarity threshold for determining near-duplicates, wherein the similarity thresholds for each category are specified by one or more duplication rules associated with the respective category, and wherein the one or more duplication rules for each category further specify, for the respective category, a first action to be performed on a duplicate and a second, different action to be performed on a near-duplicate;
determining, based on the plurality of similarity scores and the similarity thresholds, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
upon determining that the received document is not a duplicate and not a near duplicate of a document assigned to any of the plurality of categories;
training a classifier associated with a new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
upon determining that the received document is a duplicate or near duplicate of a document assigned to at least one of the plurality of categories;
assigning the document to the determined category, and
training a classifier associated with the determined category using the received document.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A system, comprising:
a processor; and
a memory containing an application program configured to manage a collection of documents, which, when executed on the processor is configured to perform an operation comprising;
determining a similarity score between a received document and each of a plurality of categories, wherein each category is assigned one or more documents, and wherein each category is associated with a first similarity threshold for determining duplicates and a second similarity threshold for determining near-duplicates, wherein the similarity thresholds for each category are specified by one or more duplication rules associated with the respective category, and wherein the one or more duplication rules for each category further specify, for the respective category, a first action to be performed on a duplicate and a second, different action to be performed on a near-duplicate;
determining, based on the plurality of similarity scores and the similarity thresholds, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
upon determining that the received document is not a duplicate and not a near duplicate of a document assigned to any of the plurality of categories;
training a classifier associated with a new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
upon determining that the received document is a duplicate or near duplicate of a document assigned to at least one of the plurality of categories;
assigning the document to the determined category; and
training a classifier associated with the determined category using the received document.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification