Representative document selection for sets of duplicate documents in a web crawler system

  • US 8,260,781 B2
  • Filed: 07/19/2011
  • Issued: 09/04/2012
  • Est. Priority Date: 07/03/2003
  • Status: Expired due to Term
First Claim

1. A method of detecting duplicate documents, comprising:

  • at a server having one or more processors and memory;

    receiving documents from one or more databases of documents, wherein each received document is associated with a respective query independent score;

    generating a document content identifier for each received document, each document content identifier comprising an identifier of a respective document's content;

    indexing at least a subset of the received documents to produce a document index that maps terms to documents in the one or more databases of documents; and

    while performing the indexing,

      identifying respective sets of received documents having the same content identifier,

      selecting a single document in each respective set of documents, in accordance with the query independent scores associated with the documents in the respective set of documents, as a representative document for the respective set of received documents,

      indexing the representative document, and

      with respect to each respective set of received documents, including only the representative document in the document index.
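
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not state: the content identifier is taken to be a SHA-256 digest of the document text, the document with the highest query independent score is chosen as the representative (the claim says only "in accordance with" the scores), and the index is a simple in-memory map from terms to URLs. For brevity the sketch groups duplicates before indexing, whereas the claim performs the grouping while indexing. The names `Document`, `content_id`, `select_representatives`, and `build_index` are hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    url: str
    content: str
    score: float  # query independent score; assumed here to be a single number


def content_id(doc):
    # Assumption: a SHA-256 digest of the text stands in for the content identifier.
    return hashlib.sha256(doc.content.encode("utf-8")).hexdigest()


def select_representatives(docs):
    # Group documents that share the same content identifier.
    groups = defaultdict(list)
    for doc in docs:
        groups[content_id(doc)].append(doc)
    # Assumption: the highest score wins; ties are broken by URL for determinism.
    return {
        cid: max(group, key=lambda d: (d.score, d.url))
        for cid, group in groups.items()
    }


def build_index(docs):
    # Map each term to the URLs of representative documents only.
    index = defaultdict(set)
    for rep in select_representatives(docs).values():
        for term in rep.content.lower().split():
            index[term].add(rep.url)
    return dict(index)


docs = [
    Document("http://a.example/page", "web crawler duplicates", 0.9),
    Document("http://b.example/mirror", "web crawler duplicates", 0.4),
    Document("http://c.example/other", "inverted index terms", 0.7),
]
index = build_index(docs)
```

In this example the two documents with identical text form one set, and only the higher-scored `http://a.example/page` appears in the index; the mirror is left out.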
