DETECTING DUPLICATE DOCUMENTS USING CLASSIFICATION
First Claim
1. A computer-implemented method for managing a collection of documents, comprising configuring one or more processors to perform an operation comprising:
receiving a document;
determining, by operation of the one or more computer processors, a similarity score between the received document and each of a plurality of categories, wherein each category is assigned one or more documents;
determining, based on the plurality of similarity scores, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories:
creating a new category for the received document, and
training a classifier associated with the new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories:
assigning the document to the determined category, and
training a classifier associated with the determined category using the received document.
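The claimed flow can be illustrated with a minimal sketch. Nothing below comes from the patent itself: the names `CategoryClassifier` and `DocumentCollection`, the whitespace tokenizer, the Jaccard overlap used as the similarity measure, and the single 0.8 threshold are all assumptions chosen to keep the example short. The claim does not specify which kind of classifier is trained.

```python
from dataclasses import dataclass, field


def tokens(text: str) -> frozenset:
    """Lowercased whitespace tokens; a stand-in for real feature extraction."""
    return frozenset(text.lower().split())


@dataclass
class CategoryClassifier:
    """Toy per-category classifier (assumption): stores token sets of its training documents."""
    examples: list = field(default_factory=list)

    def train(self, document: str) -> None:
        self.examples.append(tokens(document))

    def similarity(self, document: str) -> float:
        """Best Jaccard overlap between the input and any training document."""
        doc = tokens(document)
        best = 0.0
        for example in self.examples:
            union = doc | example
            if union:
                best = max(best, len(doc & example) / len(union))
        return best


class DocumentCollection:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # assumed: one cut-off shared by all categories
        self.categories = []        # list of (documents, classifier) pairs

    def receive(self, document: str) -> int:
        """Score the document against every category; return the index of its category."""
        scores = [clf.similarity(document) for _, clf in self.categories]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= self.threshold:
            # Duplicate or near-duplicate: assign to the determined category.
            docs, clf = self.categories[best]
            docs.append(document)
        else:
            # Not a duplicate of anything known: create a new category.
            docs, clf = [document], CategoryClassifier()
            self.categories.append((docs, clf))
            best = len(self.categories) - 1
        # In both branches the category's classifier is trained on the document.
        clf.train(document)
        return best
```

In this sketch both branches end by training the category's classifier on the received document, mirroring the two "training a classifier" steps in the claim.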
Abstract
Systems, methods and articles of manufacture are disclosed for detecting a duplicate document. A plurality of documents may be assigned to categories, each category corresponding to a collection of duplicates, or near duplicate documents. A new document may be received. The new document may be evaluated against each category to determine a similarity score between the new document and each category. The new document may be identified as a duplicate based on the similarity scores and thresholds for each category. An action may then be performed on the duplicate based on duplication rules. The thresholds and duplication rules may be customized by a user.
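As a rough illustration of the per-category thresholds and user-customizable duplication rules mentioned in the abstract, consider the sketch below. The function names, the `DEFAULT_THRESHOLD` value, and the action strings ("keep", "flag", "discard") are hypothetical; the abstract does not say how thresholds or rules are represented.

```python
from typing import Dict, Optional

DEFAULT_THRESHOLD = 0.9  # assumed fallback when a category has no custom threshold


def find_duplicate_category(scores: Dict[str, float],
                            thresholds: Dict[str, float]) -> Optional[str]:
    """Return the best-scoring category whose own threshold is met, or None."""
    hits = {
        name: score
        for name, score in scores.items()
        if score >= thresholds.get(name, DEFAULT_THRESHOLD)
    }
    return max(hits, key=hits.get) if hits else None


def choose_action(category: Optional[str], rules: Dict[str, str]) -> str:
    """Look up the user-defined duplication rule for the matched category."""
    if category is None:
        return "keep"                    # not a duplicate: keep the document
    return rules.get(category, "flag")   # assumed default action for duplicates
```

For example, with scores `{"invoices": 0.95, "memos": 0.85}` and thresholds `{"invoices": 0.97, "memos": 0.8}`, only "memos" meets its own threshold, so a rule such as `{"memos": "discard"}` would apply to the new document.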
21 Claims
1. A computer-implemented method for managing a collection of documents, comprising configuring one or more processors to perform an operation comprising:
receiving a document;
determining, by operation of the one or more computer processors, a similarity score between the received document and each of a plurality of categories, wherein each category is assigned one or more documents;
determining, based on the plurality of similarity scores, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories:
creating a new category for the received document, and
training a classifier associated with the new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories:
assigning the document to the determined category, and
training a classifier associated with the determined category using the received document.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer program product, the computer program product comprising a computer usable medium having computer usable program code for managing a collection of documents, the code being configured for:
receiving a document;
determining, by operation of the one or more computer processors, a similarity score between the received document and each of a plurality of categories, wherein each category is assigned one or more documents;
determining, based on the plurality of similarity scores, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;
upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories:
creating a new category for the received document, and
training a classifier associated with the new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and
upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories:
assigning the document to the determined category, and
training a classifier associated with the determined category using the received document.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A system, comprising:
a processor; and
a memory containing an application program configured to manage a collection of documents, which, when executed on the processor is configured to perform an operation comprising:
receiving a document,
determining, by operation of the one or more computer processors, a similarity score between the received document and each of a plurality of categories, wherein each category is assigned one or more documents,
determining, based on the plurality of similarity scores, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories,
upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories:
creating a new category for the received document; and
training a classifier associated with the new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document, and
upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories:
assigning the document to the determined category; and
training a classifier associated with the determined category using the received document.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification