
System for automatically managing duplicate documents when crawling dynamic documents

  • US 7,680,773 B1
  • Filed: 03/31/2005
  • Issued: 03/16/2010
  • Est. Priority Date: 03/31/2005
  • Status: Active Grant
First Claim

1. A method of grouping document identifiers by their document contents, comprising:

  • partitioning a plurality of document identifiers into multiple clusters, wherein the document identifiers in each cluster comprising universal resource locators (URLs) are selected so as to have the same hostname, wherein each document identifier is a text string that identifies exactly one document, each cluster has a cluster name, and each cluster has a set of URL parameters, wherein each URL parameter is a text string contained within the URLs in the cluster;

    generating an equivalence rule for at least one cluster, the equivalence rule specifying which of the URL parameters associated with the cluster are content-relevant; and

    grouping a respective cluster into a plurality of equivalence classes in accordance with its equivalence rule, each equivalence class including URLs that correspond to a document content associated with the equivalence class, wherein all the URLs in a respective equivalence class of the plurality of equivalence classes have the same hostname and reference documents having substantially the same content;

    identifying a single one of the URLs within each equivalence class as its representative document identifier; and

    with respect to a respective equivalence class having a plurality of URLs, performing a particular computer-implemented operation on only the single representative URL of the equivalence class.
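
The steps recited in the claim can be illustrated with a minimal sketch. This is not the patented implementation. It assumes the equivalence rule for a hostname is simply the set of query-parameter names treated as content-relevant, and that this set is supplied by hand rather than generated. The function names, the example URLs, and the "shortest URL wins" choice of representative are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def partition_by_host(urls):
    """Partition URLs into clusters so each cluster shares one hostname."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[urlsplit(url).hostname].append(url)
    return dict(clusters)


def equivalence_key(url, relevant_params):
    """Reduce a URL to its path plus only the content-relevant parameters."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in relevant_params
    )
    return (parts.path, tuple(kept))


def group_cluster(cluster, relevant_params):
    """Group one cluster into equivalence classes under its rule."""
    classes = defaultdict(list)
    for url in cluster:
        classes[equivalence_key(url, relevant_params)].append(url)
    return list(classes.values())


def representative(equivalence_class):
    """Pick one URL per class (assumption: shortest, ties broken by sort order)."""
    return min(equivalence_class, key=lambda u: (len(u), u))


def urls_to_crawl(urls, rules):
    """Return one representative URL per equivalence class.

    `rules` maps hostname -> set of content-relevant parameter names.
    Clusters without a rule are left ungrouped (every URL is kept).
    """
    selected = []
    for host, cluster in partition_by_host(urls).items():
        relevant = rules.get(host)
        if relevant is None:
            selected.extend(cluster)
            continue
        for eq_class in group_cluster(cluster, relevant):
            selected.append(representative(eq_class))
    return selected


if __name__ == "__main__":
    example_urls = [
        "http://example.com/item?id=1&sid=a",
        "http://example.com/item?id=1&sid=b",  # same content, different session id
        "http://example.com/item?id=2&sid=a",
        "http://other.example/page?x=1",       # no rule for this host
    ]
    example_rules = {"example.com": {"id"}}    # only `id` affects content
    for url in urls_to_crawl(example_urls, example_rules):
        print(url)
```

In this sketch the "particular computer-implemented operation" of the final step would be applied only to the URLs returned by `urls_to_crawl`, so the two `id=1` URLs cost one fetch instead of two.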
