Method and system for detecting duplicate documents in web crawls

  • US 6,547,829 B1
  • Filed: 06/30/1999
  • Issued: 04/15/2003
  • Est. Priority Date: 06/30/1999
  • Status: Expired due to Term
First Claim

1. A computer-based method for use in crawling a computer-readable document store, and particularly for detecting duplicate documents during a crawl so as to avoid unnecessarily retrieving and processing such duplicates, comprising the following acts:

  • (a) obtaining from the document store a content identifier (CID) corresponding to a particular document, wherein the CID is characterized in that:

    (1) the CID can be fetched independently of the document itself,
    (2) the CID uniquely identifies the physical document in that no two different documents would have equal CIDs, and
    (3) the same document accessible through different URLs would have the same CID;

  • (b) determining whether the value of the CID is the same as the value of a previously obtained CID corresponding to another document; and

  • (c) if the value of the CID is not the same as the value of a previously obtained CID, fetching the particular document from the document store.
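
The three acts of the claim can be illustrated with a minimal sketch. This is not the patent's implementation: the `InMemoryStore` class, its `get_cid` and `fetch` methods, and the `crawl` function are hypothetical names introduced only for illustration. The sketch also assumes the store derives each CID from a SHA-256 hash of the document content, which the claim does not specify; a hash only approximates property (2), since collisions are improbable rather than impossible.

```python
import hashlib


class InMemoryStore:
    """Hypothetical document store; the claim does not prescribe this interface."""

    def __init__(self, docs):
        self._docs = dict(docs)  # maps URL -> document content (bytes)
        self.fetch_count = 0     # counts full document retrievals

    def get_cid(self, url):
        # Assumption: the store computes the CID on its side as a content hash,
        # so the crawler obtains it without retrieving the document itself.
        return hashlib.sha256(self._docs[url]).hexdigest()

    def fetch(self, url):
        self.fetch_count += 1
        return self._docs[url]


def crawl(store, urls):
    """Fetch each distinct document once, skipping URLs whose CID was already seen."""
    seen_cids = set()
    fetched = {}
    for url in urls:
        cid = store.get_cid(url)         # act (a): obtain the CID
        if cid in seen_cids:             # act (b): compare with earlier CIDs
            continue                     # duplicate content: do not fetch
        seen_cids.add(cid)
        fetched[url] = store.fetch(url)  # act (c): fetch only unseen documents
    return fetched


if __name__ == "__main__":
    store = InMemoryStore({
        "http://a.example/doc": b"hello",
        "http://mirror.example/doc": b"hello",  # same content, different URL
        "http://a.example/other": b"world",
    })
    urls = [
        "http://a.example/doc",
        "http://mirror.example/doc",
        "http://a.example/other",
    ]
    result = crawl(store, urls)
    print(sorted(result), store.fetch_count)
```

In this toy setup the mirrored URL is skipped because its CID matches one already recorded, so only two of the three URLs trigger a full fetch.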
