
Identifying potential duplicates of a document in a document corpus

  • US 9,195,714 B1
  • Filed: 02/17/2011
  • Issued: 11/24/2015
  • Est. Priority Date: 12/06/2007
  • Status: Active Grant
First Claim

1. A method, comprising:

  • performing, by one or more computers, initiating, based on receiving a source document, a routine for identifying one or more candidate duplicate documents of the source document from a document corpus, said identifying comprising:

    receiving the source document, wherein a document type of the source document is the same as a document type of at least some of a plurality of reference documents in the document corpus;

    determining a plurality of different queries from content of the source document;

    executing the plurality of different queries on the document corpus in response to receiving the source document, wherein the plurality of different queries differ from one another by at least one search term, wherein individual queries of the plurality of different queries produce a respective list of documents that identifies at least some of the plurality of reference documents of the document corpus, wherein the respective reference documents of the respective lists of documents are scored at least in part with respect to the source document;

    based, at least in part, on scores for the reference documents from at least two of the respective lists, selecting one or more of the reference documents having a respective score that meets a score threshold as potential duplicates of the same received source document that initiated the routine; and

    storing an identification of the one or more potential duplicate documents.
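
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how queries are derived, how documents are scored, or how scores from several lists are combined. The sliding-window query construction, the Jaccard score, the averaging rule, the in-memory `corpus` and `store` dictionaries, and every function name below are assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def determine_queries(source_text, terms_per_query=3, num_queries=3):
    """Derive several queries from the source content (assumed strategy).

    Terms are ranked by frequency, then alphabetically, and each query is a
    sliding window over that ranking, so any two queries differ by at least
    one search term.
    """
    counts = Counter(tokenize(source_text))
    ranked = [t for t, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    queries = []
    for start in range(num_queries):
        query = frozenset(ranked[start:start + terms_per_query])
        if query and query not in queries:
            queries.append(query)
    return queries


def execute_query(query, source_terms, corpus):
    """Return {doc_id: score} for reference documents matching any query term.

    The score is the Jaccard similarity between the reference document and
    the source document, i.e. it is computed with respect to the source.
    """
    results = {}
    for doc_id, text in corpus.items():
        doc_terms = set(tokenize(text))
        if query & doc_terms:
            results[doc_id] = len(source_terms & doc_terms) / len(source_terms | doc_terms)
    return results


def find_potential_duplicates(source_id, source_text, corpus, store, threshold=0.8):
    """Run the whole routine for one received source document."""
    source_terms = set(tokenize(source_text))
    queries = determine_queries(source_text)
    result_lists = [execute_query(q, source_terms, corpus) for q in queries]

    # Gather each reference document's scores across the result lists.
    scores_by_doc = defaultdict(list)
    for result in result_lists:
        for doc_id, score in result.items():
            scores_by_doc[doc_id].append(score)

    duplicates = []
    for doc_id, scores in scores_by_doc.items():
        # Assumed combination rule: the document must appear in at least two
        # lists, and its score summed over lists and divided by the number
        # of queries must meet the threshold.
        if len(scores) >= 2 and sum(scores) / len(queries) >= threshold:
            duplicates.append(doc_id)

    duplicates.sort()
    store[source_id] = duplicates  # store the identification of the candidates
    return duplicates
```

In this sketch a reference document that matches more of the queries accumulates a higher combined score, which is one plausible reading of selecting "based, at least in part, on scores for the reference documents from at least two of the respective lists"; other combination rules would fit the claim language equally well.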
