Detecting duplicate documents using classification

  • US 8,180,773 B2
  • Filed: 05/27/2009
  • Issued: 05/15/2012
  • Est. Priority Date: 05/27/2009
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method for managing a collection of documents, comprising:

  • determining, by operation of one or more computer processors, a similarity score between a received document and each of a plurality of categories, wherein each category is assigned one or more documents, and wherein each category is associated with a first similarity threshold for determining duplicates and a second similarity threshold for determining near-duplicates, wherein the similarity thresholds for each category are specified by one or more duplication rules associated with the respective category, and wherein the one or more duplication rules for each category further specify, for the respective category, a first action to be performed on a duplicate and a second, different action to be performed on a near-duplicate;

    determining, based on the plurality of similarity scores and the similarity thresholds, whether the received document is one of a duplicate and a near-duplicate of one of the documents assigned to one of the plurality of categories;

    upon determining that the received document is not a duplicate and not a near-duplicate of a document assigned to any of the plurality of categories:

    training a classifier associated with a new category using the received document, wherein the trained classifier is configured to determine a measure of similarity between the received document and an input document; and

    upon determining that the received document is a duplicate or near-duplicate of a document assigned to at least one of the plurality of categories:

    assigning the document to the determined category, and training a classifier associated with the determined category using the received document.
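
The control flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `DuplicationRule`, `Category`, and `process_document`, the threshold values, and the action labels are all hypothetical, and a bag-of-words term count with cosine similarity stands in for whatever classifier an actual system would train. When several categories match, this sketch picks the highest-scoring one, which is also an assumption.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DuplicationRule:
    # Hypothetical per-category rule: two thresholds and two distinct actions.
    duplicate_threshold: float = 0.95
    near_duplicate_threshold: float = 0.70
    duplicate_action: str = "discard"
    near_duplicate_action: str = "flag_for_review"


@dataclass
class Category:
    name: str
    rule: DuplicationRule
    documents: list = field(default_factory=list)
    # Stand-in "classifier": accumulated term counts of assigned documents.
    term_counts: Counter = field(default_factory=Counter)

    def train(self, text: str) -> None:
        self.term_counts.update(text.lower().split())

    def similarity(self, text: str) -> float:
        # Cosine similarity between the document and the category term counts.
        doc = Counter(text.lower().split())
        dot = sum(count * self.term_counts[term] for term, count in doc.items())
        doc_norm = math.sqrt(sum(v * v for v in doc.values()))
        cat_norm = math.sqrt(sum(v * v for v in self.term_counts.values()))
        norm = doc_norm * cat_norm
        return dot / norm if norm else 0.0


def process_document(text: str, categories: list) -> tuple:
    """Return (category name, verdict, action) for a received document."""
    matches = []
    for category in categories:
        score = category.similarity(text)
        rule = category.rule
        # Each category applies its own thresholds and actions.
        if score >= rule.duplicate_threshold:
            matches.append((score, category, "duplicate", rule.duplicate_action))
        elif score >= rule.near_duplicate_threshold:
            matches.append(
                (score, category, "near-duplicate", rule.near_duplicate_action)
            )

    if not matches:
        # Neither duplicate nor near-duplicate: train a classifier for a new category.
        new_category = Category(f"category-{len(categories) + 1}", DuplicationRule())
        new_category.documents.append(text)
        new_category.train(text)
        categories.append(new_category)
        return new_category.name, "new", None

    # Assumption: the highest-scoring matching category wins.
    _, category, verdict, action = max(matches, key=lambda m: m[0])
    category.documents.append(text)
    category.train(text)  # retrain the matched category with the received document
    return category.name, verdict, action
```

Calling `process_document` repeatedly on the same `categories` list grows the collection: an unmatched document opens a new category, while a duplicate or near-duplicate is assigned to its matching category and the caller receives the name of the action that category's rule prescribes. Carrying out that action is left to the caller in this sketch.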
