System for automatically managing duplicate documents when crawling dynamic documents

US 7,680,773 B1
Filed: 03/31/2005
Issued: 03/16/2010
Est. Priority Date: 03/31/2005
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A system of reducing the possibility of crawling duplicate document identifiers partitions a plurality of document identifiers into multiple clusters, each cluster having a cluster name and a set of document parameters. The system generates an equivalence rule for each cluster of document identifiers, the rule specifying which document parameters associated with the cluster are content-relevant. Next, the system groups each cluster of document identifiers into one or more equivalence classes in accordance with its associated equivalence rule, each equivalence class including one or more document identifiers that correspond to a document content and having a representative document identifier identifying the document content.
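
The flow in the abstract (partition into clusters, apply a per-cluster equivalence rule, group into equivalence classes, keep one representative per class) can be sketched as follows. This is a minimal illustration, not the patented implementation: the example URLs, the hand-written `rules` table, the use of the hostname as the cluster name, and the shortest-URL choice of representative are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def cluster_by_hostname(urls):
    """Partition URLs into clusters; the hostname doubles as the cluster name."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[urlsplit(url).hostname].append(url)
    return dict(clusters)


def equivalence_key(url, relevant_params):
    """Reduce a URL to its path plus only the content-relevant parameters."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in relevant_params)
    return (parts.path, tuple(kept))


def group_cluster(urls, relevant_params):
    """Group one cluster into equivalence classes under its equivalence rule."""
    classes = defaultdict(list)
    for url in urls:
        classes[equivalence_key(url, relevant_params)].append(url)
    return list(classes.values())


def representative(equivalence_class):
    """Pick one URL per class (assumed policy: shortest, then alphabetical)."""
    return min(equivalence_class, key=lambda u: (len(u), u))


# Hypothetical input: `sid` is a session parameter that does not change content.
urls = [
    "http://shop.example.com/item?id=7&sid=abc",
    "http://shop.example.com/item?id=7&sid=xyz",
    "http://shop.example.com/item?id=8&sid=abc",
    "http://news.example.com/story?id=7",
]
# Hypothetical rules: the content-relevant parameters for each cluster.
rules = {"shop.example.com": {"id"}, "news.example.com": {"id"}}

to_crawl = sorted(
    representative(cls)
    for host, members in cluster_by_hostname(urls).items()
    for cls in group_cluster(members, rules[host])
)
```

Here the two `id=7` shop URLs fall into one equivalence class, so only one of them would be handed to the crawler.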

24 Claims

1. A method of grouping document identifiers by their document contents, comprising:
- partitioning a plurality of document identifiers into multiple clusters, wherein the document identifiers in each cluster comprising universal resource locators (URLs) are selected so as to have the same hostname, wherein each document identifier is a text string that identifies exactly one document, each cluster has a cluster name, and each cluster has a set of URL parameters, wherein each URL parameter is a text string contained within the URLs in the cluster;
  
  generating an equivalence rule for at least one cluster, the equivalence rule specifying which of the URL parameters associated with the cluster are content-relevant; and
  
  grouping a respective cluster into a plurality of equivalence classes in accordance with its equivalence rule, each equivalence class including URLs that correspond to a document content associated with the equivalence class, wherein all the URLs in a respective equivalence class of the plurality of equivalence classes have the same hostname and reference documents having substantially the same content;
  
  identifying a single one of the URLs within each equivalence class as its representative document identifier; and
  
  with respect to a respective equivalence class having a plurality of URLs, performing a particular computer-implemented operation on only the single representative URL of the equivalence class.
  - 2. The method of claim 1, wherein said generating an equivalence rule for at least one cluster includes:
    - performing one or more analysis procedures for each of the URL parameters associated with the cluster;
      
      deriving from said analysis procedures one or more values indicative of a relationship pattern between the URL parameter and its associated document contents; and
      
      classifying the URL parameter into one of multiple categories in accordance with its values.
  - 3. The method of claim 2, wherein performing the one or more analysis procedures includes performing an insignificance analysis of each URL parameter associated with the cluster.
  - 4. The method of claim 3, wherein performing the insignificance analysis of a particular URL parameter includes:
    - grouping document identifiers in the cluster into multiple sets, each set corresponding to a unique document content with respect to other sets; and
      
      computing an insignificance index for the particular URL parameter in accordance with the number of document identifiers in each set in which the particular URL parameter has at least two different parameter values.
  - 5. The method of claim 4, further including a numeric insignificance threshold, wherein the particular URL parameter is classified as content-relevant if its insignificance index is less than the insignificance threshold.
  - 6. The method of claim 2, wherein performing the one or more analysis procedures includes significance analysis of each URL parameter associated with the cluster.
  - 7. The method of claim 6, wherein the significance analysis of a respective URL parameter further includes:
    - removing the respective URL parameter from each document identifier associated with the cluster, each document identifier having a document identifier remainder;
      
      grouping the document identifiers into multiple sets, each set having a distinct document identifier remainder; and
      
      summing up the number of distinct document contents within each set that has at least two different document contents as the URL parameter's significance index.
  - 8. The method of claim 7, further including a numeric significance threshold, wherein the respective URL parameter is classified as content-irrelevant if its significance index is less than the significance threshold.
  - 9. The method of claim 1 further including:
    - selecting a set of validation URLs for the equivalence rule;
      
      checking if the equivalence rule correctly predicts a document content type for each of the validation URLs; and
      
      repeating said generating and grouping until said checking determines that the equivalence rule correctly predicts a document content type for each of the validation URLs.
  - 10. The method of claim 9, wherein a URL more different than another one from the selected validation URLs is given a higher priority in the selection of validation URLs.
  - 11. The method of claim 1 further including:
    - selecting a set of validation URLs for each equivalence class associated with the cluster, the equivalence class having a representative URL;
      
      checking if a document content referenced by the representative URL is substantially identical to a document content referenced by said validation URL; and
      
      replacing the representative URL with one of the validation URLs if the checking produces a negative result.
  - 12. The method of claim 11, wherein selecting a set of validation URLs for an equivalence class includes identifying, from among a set of URLs in the equivalence class, a first URL having a greater distance from the URL than a second URL in the set of URLs in the equivalence class, in accordance with a predefined distance metric.
  - 13. The method of claim 1, wherein performing the particular computer implemented operation includes:
    - performing a search engine operation on only the single representative URL of the equivalence class.
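
The insignificance analysis of claims 4 and 5 and the significance analysis of claims 7 and 8 can be illustrated with a small sketch. Several things here are assumptions, not taken from the claims: each crawled page is represented as a `(url, fingerprint)` pair where the fingerprint stands in for the document content, the insignificance index is normalised to a fraction of the URLs carrying the parameter (the claims only say it is computed "in accordance with" those counts), and the sample data is invented.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def params_of(url):
    """Query parameters of a URL as a dict."""
    return dict(parse_qsl(urlsplit(url).query))


def remainder(url, param):
    """The URL with `param` removed: path plus the other parameters, sorted."""
    parts = urlsplit(url)
    rest = sorted((k, v) for k, v in parse_qsl(parts.query) if k != param)
    return (parts.path, tuple(rest))


def insignificance_index(pages, param):
    """Fraction of URLs that sit in a same-content set where `param` varies."""
    by_content = defaultdict(list)
    for url, fingerprint in pages:
        by_content[fingerprint].append(url)
    varying = carrying = 0
    for urls in by_content.values():
        values = [params_of(u)[param] for u in urls if param in params_of(u)]
        carrying += len(values)
        if len(set(values)) >= 2:  # same content, at least two parameter values
            varying += len(values)
    return varying / carrying if carrying else 0.0


def significance_index(pages, param):
    """Sum of distinct contents over remainder sets holding two or more contents."""
    by_remainder = defaultdict(set)
    for url, fingerprint in pages:
        by_remainder[remainder(url, param)].add(fingerprint)
    return sum(len(fps) for fps in by_remainder.values() if len(fps) >= 2)


# Hypothetical crawl sample: `id` selects the page, `sid` is a session token.
pages = [
    ("http://h.example/item?id=1&sid=a", "A"),
    ("http://h.example/item?id=1&sid=b", "A"),
    ("http://h.example/item?id=2&sid=a", "B"),
    ("http://h.example/item?id=2&sid=c", "B"),
]
```

On this sample `sid` varies freely within each same-content set (insignificance index 1.0) and removing it never merges different contents (significance index 0), while `id` shows the opposite pattern (0.0 and 2). Comparing these values against the numeric thresholds of claims 5 and 8 would then classify `id` as content-relevant and `sid` as content-irrelevant; the threshold values themselves are not given in the claims.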

14. A computer readable storage medium storing one or more programs for execution by one or more processors, the one or more programs comprising:
- a partitioning module for partitioning a plurality of document identifiers into multiple clusters, the document identifiers comprising universal resource locators (URLs), wherein each document identifier is a text string that identifies exactly one document, each cluster has a cluster name, and each cluster has a set of URL parameters, wherein each URL parameter is a text string contained within the URLs in the cluster;
  
  an equivalence rule generator for generating an equivalence rule for a cluster of URLs, the equivalence rule specifying which of the URL parameters associated with the cluster are content-relevant; and
  
  a grouping module for grouping a respective cluster into a plurality of equivalence classes in accordance with its equivalence rule, each equivalence class including URLs that correspond to a document content associated with the equivalence class wherein all the URLs in a respective equivalence class of the plurality of equivalence classes have the same hostname and reference documents having substantially the same content;
  
  an identifying module for identifying a single one of the URLs having the same hostname within each equivalence class as its representative document identifier; and
  
  an operation module for performing, with respect to a respective equivalence class having a plurality of URLs, a particular computer-implemented operation on only the single representative URL of the equivalence class.
  - 15. The computer readable storage medium of claim 14, wherein said equivalence rule generator including instructions for:
    - performing one or more analysis procedures for each of the URL parameters associated with the cluster;
      
      deriving from said analysis procedures one or more values indicative of a relationship pattern between the URL parameter and its associated document contents; and
      
      classifying the URL parameter into one of multiple categories in accordance with its values.
  - 16. The computer readable storage medium of claim 15, wherein performing the one or more analysis procedures includes performing an insignificance analysis of each URL parameter associated with the cluster.
  - 17. The computer readable storage medium of claim 16, wherein performing the insignificance analysis of a particular URL parameter includes:
    - grouping document identifiers in the cluster into multiple sets, each set corresponding to a unique document content with respect to other sets; and
      
      computing an insignificance index for the particular URL parameter in accordance with the number of document identifiers in each set in which the particular URL parameter has at least two different parameter values.
  - 18. The computer readable storage medium of claim 17, further including a numeric insignificance threshold, wherein the particular URL parameter is classified as content-relevant if its insignificance index is less than the insignificance threshold.
  - 19. The computer readable storage medium of claim 15, wherein performing the one or more analysis procedures includes significance analysis of each URL parameter associated with the cluster.
  - 20. The computer readable storage medium of claim 19, wherein the significance analysis of a respective URL parameter further includes:
    - removing the respective URL parameter from each document identifier associated with the cluster, each document identifier having a document identifier remainder;
      
      grouping the document identifiers into multiple sets, each set having a distinct document identifier remainder; and
      
      summing up the number of distinct document contents within each set that has at least two different document contents as the URL parameter's significance index.
  - 21. The computer readable storage medium of claim 20, further including a numeric significance threshold, wherein the respective URL parameter is classified as content-irrelevant if its significance index is less than the significance threshold.
  - 22. The computer readable storage medium of claim 14, further including an equivalence rule validation module for:
    - selecting a set of validation URLs for the equivalence rule;
      
      checking if the equivalence rule correctly predicts a document content type for each of the validation URLs; and
      
      repeating said generating and grouping until said checking determines that the equivalence rule correctly predicts a document content type for each of the validation URLs.
  - 23. The computer readable storage medium of claim 22, wherein a URL more different than another one from the selected validation URLs is given a higher priority in the selection of validation URLs.
  - 24. The computer readable storage medium of claim 14, further including an equivalence rule validation module for:
    - selecting a set of validation URLs for each equivalence class associated with the cluster, the equivalence class having a representative URL;
      
      checking if a document content referenced by the representative URL is substantially identical to a document content referenced by said validation URL; and
      
      replacing the representative URL with one of the validation URLs if the checking produces a negative result.
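
The representative check of claims 11 and 24, combined with the distance-based selection of claim 12, might look like the sketch below. The distance metric (count of parameters whose values differ), the `fingerprint_of` callable standing in for fetching and fingerprinting a document, and the sample URLs are all hypothetical; the claims only require "a predefined distance metric" and a test for substantially identical content.

```python
from urllib.parse import urlsplit, parse_qsl


def param_distance(url_a, url_b):
    """Assumed metric: number of parameters whose values differ between two URLs."""
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return sum(1 for key in a.keys() | b.keys() if a.get(key) != b.get(key))


def pick_validation_urls(equivalence_class, representative, count):
    """Prefer the URLs that are farthest from the representative."""
    others = [u for u in equivalence_class if u != representative]
    others.sort(key=lambda u: (-param_distance(representative, u), u))
    return others[:count]


def validate_representative(equivalence_class, representative, fingerprint_of, count=2):
    """Return the representative to keep, swapping it out on a content mismatch."""
    expected = fingerprint_of(representative)
    for candidate in pick_validation_urls(equivalence_class, representative, count):
        if fingerprint_of(candidate) != expected:
            return candidate  # negative check: replace with the validation URL
    return representative


# Hypothetical equivalence class and content fingerprints (no network access).
cls = [
    "http://h.example/item?id=1&sid=a&ref=x",
    "http://h.example/item?id=1&sid=b&ref=x",
    "http://h.example/item?id=1&sid=c&ref=y",
]
fingerprints = {url: "A" for url in cls}
stale = dict(fingerprints)
stale[cls[2]] = "B"  # simulate a validation URL whose content no longer matches
```

Sampling the most distant URLs first is one reading of claim 12: a URL that differs from the representative in many parameters is the most likely place for a wrong equivalence rule to show up.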

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Mukherjee, Arup; Acharya, Anurag; Jain, Arvind
Primary Examiner(s): Pham; Khanh B
Assistant Examiner(s): Yen; Syling
Application Number: US11/097,687
Time in Patent Office: 1,811 Days
Field of Search: 707/5, 707/200, 707/3, 707/10, 707/9, 709/245, 709/217, 709/219
US Class Current: 707/737
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
