
Representative document selection for sets of duplicate documents in a web crawler system

  • US 7,984,054 B2
  • Filed: 12/01/2009
  • Issued: 07/19/2011
  • Est. Priority Date: 07/03/2003
  • Status: Expired due to Term
First Claim

1. A method of detecting duplicate documents, comprising:

  • at a server having one or more processors and memory;

    receiving a first document, the received document characterized by a document content identifier;

    selecting, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;

    wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;

    updating the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents;

    determining a representative document for the updated set of documents in accordance with the score information;

    indexing the received document when the received document is the representative document for the updated set of documents; and

    repeating the receiving, selecting, updating, determining and indexing operations with respect to a plurality of received documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the received documents are determined to be representative documents and are indexed.
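
The steps of claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation; it is one possible reading under stated assumptions. The names `Document`, `content_id`, and `DuplicateDetector` are hypothetical, a SHA-256 digest of the content stands in for the "document content identifier", a single numeric `score` stands in for the "score information", and "representative" is assumed to mean the highest-scoring document in the set, with ties going to the earlier arrival.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    url: str
    content: str
    score: float  # assumed form of the claim's "score information"


def content_id(content: str) -> str:
    """Assumed content identifier: identical content yields an identical digest."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class DuplicateDetector:
    def __init__(self):
        self.sets = {}   # content id -> list of documents sharing that id
        self.index = {}  # content id -> currently indexed representative

    def receive(self, doc: Document) -> bool:
        """Process one received document; return True if it was indexed."""
        cid = content_id(doc.content)
        # Select the set of previously received documents with the same id.
        dup_set = self.sets.setdefault(cid, [])
        # Update the set with the newly received document.
        dup_set.append(doc)
        # Determine the representative from the scores (assumed: highest wins;
        # max() keeps the earliest document on ties).
        representative = max(dup_set, key=lambda d: d.score)
        # Index the received document only if it is the representative.
        if representative is doc:
            self.index[cid] = doc
            return True
        return False


if __name__ == "__main__":
    detector = DuplicateDetector()
    # Repeat the operations over a plurality of received documents.
    for d in [
        Document("http://a.example/page", "same text", 0.2),
        Document("http://b.example/page", "same text", 0.9),
        Document("http://c.example/page", "same text", 0.5),
    ]:
        print(d.url, "indexed" if detector.receive(d) else "skipped")
```

In this sketch a later duplicate displaces the indexed entry only when its score is strictly higher, so a low-scoring copy of already-indexed content is recorded in its set but never indexed.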
