
Duplicate document detection in a web crawler system

  • US 7,627,613 B1
  • Filed: 07/03/2003
  • Issued: 12/01/2009
  • Est. Priority Date: 07/03/2003
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method of detecting duplicate documents in a network crawling system, comprising, at a server having one or more processors and memory:

  • constructing a plurality of tables, each table corresponding to a portion of a document address space, storing information identifying documents having a same document content identifier and each identified document having an associated document rank;

    wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;

  • receiving a newly crawled document, such document characterized by a document content identifier and a document rank;

  • reading information stored in the plurality of tables to identify a set of documents sharing the document content identifier of the newly crawled document, and ascertaining an original representative document for the identified set of documents;

  • updating the information stored in at least one of the tables in accordance with the document ranks of the identified set of documents and the newly crawled document;

  • determining a representative document for the newly crawled document and the identified set of documents;

  • indexing the representative document when the representative document is the newly crawled document; and

  • repeating the receiving, reading, updating, determining and indexing operations with respect to a plurality of newly crawled documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the newly crawled documents are determined to be representative documents and are indexed.
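
The steps of claim 1 can be illustrated with a short Python sketch. This is only one possible reading of the claim, not the patented implementation. Several details are assumptions made for illustration: the names `Doc`, `DupDetector`, and `process` are hypothetical; the "portion of a document address space" is modeled as a CRC32 hash of the URL modulo the number of tables; and the representative is taken to be the highest-ranked document, with ties broken by URL. The claim itself does not specify how rank determines the representative.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for a crawled document."""
    url: str          # document address
    content_id: str   # same content_id means same content
    rank: float       # document rank


class DupDetector:
    """Illustrative sketch only; names and policies are assumptions."""

    def __init__(self, num_tables=4):
        # Assumption: each table covers one slice of the URL hash space and
        # maps content_id -> {url: rank} for the URLs that fall in that slice.
        self.tables = [dict() for _ in range(num_tables)]
        self.indexed = []  # URLs handed to the indexer, in order

    def _table_for(self, url):
        # Deterministic partitioning of the address space (assumed scheme).
        slot = zlib.crc32(url.encode("utf-8")) % len(self.tables)
        return self.tables[slot]

    def _read(self, content_id):
        # Read every table to collect the documents sharing this content_id.
        found = {}
        for table in self.tables:
            found.update(table.get(content_id, {}))
        return found

    @staticmethod
    def _representative(ranks):
        # Assumed policy: highest rank wins; ties are broken by URL.
        if not ranks:
            return None
        return max(ranks, key=lambda url: (ranks[url], url))

    def process(self, doc):
        """Handle one newly crawled document; return (original, new) representative URLs."""
        same_content = self._read(doc.content_id)
        original = self._representative(same_content)

        # Update the table that owns this document's address.
        self._table_for(doc.url).setdefault(doc.content_id, {})[doc.url] = doc.rank
        same_content[doc.url] = doc.rank

        representative = self._representative(same_content)
        # Index only when the newly crawled document is the representative.
        if representative == doc.url:
            self.indexed.append(doc.url)
        return original, representative
```

Calling `process` once per crawled document corresponds to the repeating step: a document is indexed only when it outranks every previously seen document with the same content identifier, so lower-ranked duplicates are recorded in the tables but not indexed.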
