System for automatically managing duplicate documents when crawling dynamic documents
Abstract
A system of reducing the possibility of crawling duplicate document identifiers partitions a plurality of document identifiers into multiple clusters, each cluster having a cluster name and a set of document parameters. The system generates an equivalence rule for each cluster of document identifiers, the rule specifying which document parameters associated with the cluster are content-relevant. Next, the system groups each cluster of document identifiers into one or more equivalence classes in accordance with its associated equivalence rule, each equivalence class including one or more document identifiers that correspond to a document content and having a representative document identifier identifying the document content.
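The flow in the abstract (partition into clusters, apply a per-cluster equivalence rule, form equivalence classes, pick a representative) can be sketched roughly as follows. This is a minimal illustration and not the patented implementation. It assumes that the URL parameters are query-string keys, that the hostname doubles as the cluster name, that the equivalence rule is simply a set of content-relevant parameter names, and that the representative is the shortest URL in its class. All function names and example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def partition_by_hostname(urls):
    """Partition URLs into clusters; the hostname serves as the cluster name (assumption)."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[urlsplit(url).hostname].append(url)
    return dict(clusters)


def equivalence_key(url, relevant_params):
    """Reduce a URL to its path plus only the content-relevant query parameters."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in relevant_params
    )
    return (parts.path, tuple(kept))


def group_cluster(cluster_urls, relevant_params):
    """Group one cluster into equivalence classes according to its rule."""
    classes = defaultdict(list)
    for url in cluster_urls:
        classes[equivalence_key(url, relevant_params)].append(url)
    return list(classes.values())


def representative(equivalence_class):
    """Pick a single URL per class; shortest-then-alphabetical is an arbitrary choice."""
    return min(equivalence_class, key=lambda u: (len(u), u))


# Hypothetical input: "sessionid" does not affect content, "id" does.
urls = [
    "http://shop.example.com/item?id=7&sessionid=abc",
    "http://shop.example.com/item?id=7&sessionid=xyz",
    "http://shop.example.com/item?id=8&sessionid=abc",
    "http://news.example.org/story?a=1",
]
rules = {"shop.example.com": {"id"}, "news.example.org": {"a"}}

clusters = partition_by_hostname(urls)
to_crawl = [
    representative(eq_class)
    for name, members in clusters.items()
    for eq_class in group_cluster(members, rules[name])
]
```

With this input, `to_crawl` holds three URLs instead of four, because the two `id=7` URLs fall into the same equivalence class and only one of them is kept for crawling.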
24 Claims
1. A method of grouping document identifiers by their document contents, comprising:
partitioning a plurality of document identifiers into multiple clusters, wherein the document identifiers in each cluster comprising universal resource locators (URLs) are selected so as to have the same hostname, wherein each document identifier is a text string that identifies exactly one document, each cluster has a cluster name, and each cluster has a set of URL parameters, wherein each URL parameter is a text string contained within the URLs in the cluster;
generating an equivalence rule for at least one cluster, the equivalence rule specifying which of the URL parameters associated with the cluster are content-relevant; and
grouping a respective cluster into a plurality of equivalence classes in accordance with its equivalence rule, each equivalence class including URLs that correspond to a document content associated with the equivalence class, wherein all the URLs in a respective equivalence class of the plurality of equivalence classes have the same hostname and reference documents having substantially the same content;
identifying a single one of the URLs within each equivalence class as its representative document identifier; and
with respect to a respective equivalence class having a plurality of URLs, performing a particular computer-implemented operation on only the single representative URL of the equivalence class.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
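Claim 1 requires an equivalence rule that names the content-relevant URL parameters, but the claim text itself does not say how that rule is derived. One plausible heuristic, offered purely as an assumption for illustration, is to sample a few URLs from a cluster, fingerprint their content, and mark a parameter as content-relevant when two sampled URLs differ only in that parameter yet return different content. The function name, the sample URLs, and the fingerprint strings below are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def infer_relevant_params(samples):
    """Infer content-relevant query parameters from sampled pages.

    samples maps URL -> content fingerprint (any comparable value).
    A parameter is marked relevant when two URLs on the same path differ
    only in that parameter and their fingerprints differ (assumed heuristic).
    """
    parsed = []
    for url, fingerprint in samples.items():
        parts = urlsplit(url)
        params = dict(parse_qsl(parts.query, keep_blank_values=True))
        parsed.append((parts.path, params, fingerprint))

    relevant = set()
    for i, (path_a, params_a, fp_a) in enumerate(parsed):
        for path_b, params_b, fp_b in parsed[i + 1:]:
            if path_a != path_b or fp_a == fp_b:
                continue  # different page family, or no content change to explain
            differing = [
                name
                for name in set(params_a) | set(params_b)
                if params_a.get(name) != params_b.get(name)
            ]
            if len(differing) == 1:
                relevant.add(differing[0])
    return relevant


# Hypothetical samples from one cluster; fingerprints stand in for content hashes.
samples = {
    "http://shop.example.com/item?id=7&sessionid=abc": "fp-A",
    "http://shop.example.com/item?id=7&sessionid=xyz": "fp-A",
    "http://shop.example.com/item?id=8&sessionid=abc": "fp-B",
}
rule = infer_relevant_params(samples)
```

Here only `id` ends up in `rule`: changing `sessionid` alone leaves the fingerprint unchanged, so it is treated as irrelevant to content. A real crawler would need more samples and some tolerance for pages whose content is only "substantially the same".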
14. A computer readable storage medium storing one or more programs for execution by one or more processors, the one or more programs comprising:
a partitioning module for partitioning a plurality of document identifiers into multiple clusters, the document identifiers comprising universal resource locators (URLs), wherein each document identifier is a text string that identifies exactly one document, each cluster has a cluster name, and each cluster has a set of URL parameters, wherein each URL parameter is a text string contained within the URLs in the cluster;
an equivalence rule generator for generating an equivalence rule for a cluster of URLs, the equivalence rule specifying which of the URL parameters associated with the cluster are content-relevant; and
a grouping module for grouping a respective cluster into a plurality of equivalence classes in accordance with its equivalence rule, each equivalence class including URLs that correspond to a document content associated with the equivalence class wherein all the URLs in a respective equivalence class of the plurality of equivalence classes have the same hostname and reference documents having substantially the same content;
an identifying module for identifying a single one of the URLs having the same hostname within each equivalence class as its representative document identifier; and
an operation module for performing, with respect to a respective equivalence class having a plurality of URLs, a particular computer-implemented operation on only the single representative URL of the equivalence class.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
Specification