Representative Document Selection for Sets of Duplicate Documents in a Web Crawler System

  • US 20100076954A1
  • Filed: 12/01/2009
  • Published: 03/25/2010
  • Est. Priority Date: 07/03/2003
  • Status: Active Grant
First Claim

1. A method of detecting duplicate documents, comprising:

  • at a server having one or more processors and memory;

    receiving a first document, the received document characterized by a document content identifier;

    selecting, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;

    wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;

    updating the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents; and

    identifying a representative document for the updated set of documents in accordance with the score information.
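
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Document`, `DuplicateIndex`, `receive`, and `duplicates` are hypothetical, and the policy of treating the score information as a single number and picking the highest-scoring document as the representative is an assumption made for illustration; the claim only requires that the update and the selection be made "in accordance with the score information".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    url: str
    content_id: str  # document content identifier; equal IDs mean equal content
    score: float     # score information (assumed here to be one number)


class DuplicateIndex:
    """Groups received documents by content identifier (hypothetical helper)."""

    def __init__(self):
        # content_id -> list of documents sharing that identifier
        self._sets = {}

    def receive(self, doc):
        """Add doc to its duplicate set and return the set's representative."""
        # Select the previously received documents with the same content ID.
        dup_set = self._sets.setdefault(doc.content_id, [])
        # Update the set: a re-received URL replaces its older entry.
        dup_set[:] = [d for d in dup_set if d.url != doc.url]
        dup_set.append(doc)
        # Assumed policy: keep the set ordered by descending score.
        dup_set.sort(key=lambda d: d.score, reverse=True)
        # Identify the representative: the highest-scoring document.
        return dup_set[0]

    def duplicates(self, content_id):
        """Return a copy of the duplicate set for the given content ID."""
        return list(self._sets.get(content_id, []))


if __name__ == "__main__":
    index = DuplicateIndex()
    index.receive(Document("http://a.example/page", "hash-1", 0.3))
    rep = index.receive(Document("http://b.example/page", "hash-1", 0.7))
    print(rep.url)
```

In this sketch a lower-scoring duplicate that arrives later joins the set without displacing the current representative, and documents with different content identifiers never share a set.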

  • 1 Assignment