DETECTING DUPLICATE DOCUMENTS USING CLASSIFICATION

  • US 20100306204A1
  • Filed: 05/27/2009
  • Published: 12/02/2010
  • Est. Priority Date: 05/27/2009
  • Status: Active Grant
First Claim

1. A computer-implemented method for managing a collection of documents, comprising configuring one or more processors to perform an operation comprising:

    receiving a document;

    determining, by operation of the one or more computer processors, a similarity score between the received document and each of a plurality of categories, wherein each category is assigned one or more documents;

    determining, based on the plurality of similarity scores, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;

    upon determining that the received document is not a duplicate, or near duplicate, to a document assigned to any of the plurality of categories:

    creating a new category for the received document, and training a classifier associated with the new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and

    upon determining that the received document is a duplicate, or near duplicate, to a document assigned to at least one of the plurality of categories:

    assigning the document to the determined category, and training a classifier associated with the determined category using the received document.
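
The control flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not specify a classifier type, a similarity measure, or a decision rule, so everything here is an assumption made for illustration: the names `CategoryClassifier`, `DocumentCollection`, and `add`, the use of cosine similarity over accumulated term counts as the "classifier", and the `threshold` value that decides when a document counts as a duplicate or near duplicate.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CategoryClassifier:
    """Hypothetical per-category classifier: accumulated term counts (assumption)."""

    def __init__(self):
        self.term_counts = Counter()

    def train(self, text):
        # "Training" here just folds the document's terms into the category profile.
        self.term_counts.update(tokenize(text))

    def similarity(self, text):
        # Cosine similarity between the input document and the category profile.
        doc = Counter(tokenize(text))
        dot = sum(count * self.term_counts[term] for term, count in doc.items())
        doc_norm = math.sqrt(sum(c * c for c in doc.values()))
        cat_norm = math.sqrt(sum(c * c for c in self.term_counts.values()))
        norm = doc_norm * cat_norm
        return dot / norm if norm else 0.0


class DocumentCollection:
    """Hypothetical collection manager following the steps of claim 1."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cut-off for duplicate / near duplicate
        self.categories = []        # each entry: {"documents": [...], "classifier": ...}

    def add(self, text):
        # Determine a similarity score between the document and each category.
        scores = [cat["classifier"].similarity(text) for cat in self.categories]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        is_duplicate = best is not None and scores[best] >= self.threshold

        if is_duplicate:
            # Duplicate or near duplicate: assign to the determined category.
            index = best
        else:
            # Otherwise: create a new category for the received document.
            self.categories.append(
                {"documents": [], "classifier": CategoryClassifier()}
            )
            index = len(self.categories) - 1

        # In both branches, train the category's classifier with the document.
        category = self.categories[index]
        category["documents"].append(text)
        category["classifier"].train(text)
        return index, is_duplicate


collection = DocumentCollection(threshold=0.8)
first = collection.add("the quick brown fox jumps over the lazy dog")
repeat = collection.add("the quick brown fox jumps over the lazy dog")
near = collection.add("the quick brown fox jumped over the lazy dog")
other = collection.add("quarterly revenue figures for the finance team")
```

In this sketch the first document opens category 0, the exact repeat and the one-word variant are assigned to that same category, and the unrelated document opens category 1. A single threshold on the best score is only one possible reading of "determining, based on the plurality of similarity scores"; the claim leaves that decision rule open.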
