Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling

  • US 7,672,943 B2
  • Filed: 10/26/2006
  • Issued: 03/02/2010
  • Est. Priority Date: 10/26/2006
  • Status: Active Grant
First Claim

1. A computer-implemented targeted web crawling method comprising:

  • analyzing a downloaded web page that links to an outlinked web page having a URL and belonging to a sub-group of web pages;

    generating a domain density score in response to the sub-group, where the domain density score indicates relevance of the URL to a desired content-based web page type;

    deriving one or more anchor text tokens from the URL;

    generating an anchor text score based on the derived one or more anchor text tokens and information stored in a token model database, where the anchor text score indicates probability that a content of the outlinked web page is of the desired content-based web page type;

    deriving one or more URL string tokens from the URL;

    generating a URL string score based on the derived one or more URL string tokens and information stored in the token model database, where the URL string score indicates probability that the content of the outlinked web page is of the desired web page type;

    generating a category need score in response to characteristics of the downloaded web page, where the category need score is influenced by a current distribution of a plurality of categories associated with the desired content-based web page type, wherein generating the category need score comprises:

    identifying a category for the downloaded web page in response to the analyzing step, wherein the category is one of the plurality of categories;

    if representation of the category in the current distribution is relatively high, generating a relatively low category need score; and

    if representation of the category in the current distribution is relatively low, generating a relatively high category need score;

    generating a link proximity score for the URL, where the link proximity score indicates linking distance from the URL to a linking URL that corresponds to the desired content-based web page type, where the linking distance is measured by a number of links between the URL and the linking URL in a linking structure;

    calculating a downloading priority for the URL in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score; and

    downloading, in an order determined by the downloading priority, a second web page that corresponds to the URL.
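
The last steps of the claim (category need score, link proximity score, priority calculation, ordered download) can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the claim does not state how the five scores are combined, so the equal `WEIGHTS`, the `1 - share` formula for category need, the `1 / (1 + distance)` mapping for link proximity, the `Frontier` class, and the example URLs and score values are all assumptions introduced here.

```python
import heapq
from collections import Counter

# Assumed equal weights: the claim names the five inputs but not how they combine.
WEIGHTS = {
    "domain_density": 0.2,
    "anchor_text": 0.2,
    "url_string": 0.2,
    "category_need": 0.2,
    "link_proximity": 0.2,
}


def category_need_score(category, distribution):
    """Return a high score for under-represented categories, a low one otherwise."""
    total = sum(distribution.values())
    if total == 0:
        return 1.0  # nothing crawled yet, so every category is needed
    return 1.0 - distribution.get(category, 0) / total


def link_proximity_score(link_distance):
    """Map a link count to (0, 1]; assumes fewer links should score higher."""
    return 1.0 / (1.0 + link_distance)


def download_priority(scores):
    """Weighted sum of the five per-URL scores (a hypothetical combination rule)."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


class Frontier:
    """Crawl queue that yields the highest-priority URL first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def push(self, url, scores):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-download_priority(scores), self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]


# Hypothetical current distribution of already-crawled categories.
crawled = Counter({"sports": 8, "finance": 2})

frontier = Frontier()
frontier.push("http://example.com/sports/news", {
    "domain_density": 0.5, "anchor_text": 0.5, "url_string": 0.5,
    "category_need": category_need_score("sports", crawled),
    "link_proximity": link_proximity_score(3),
})
frontier.push("http://example.com/finance/markets", {
    "domain_density": 0.5, "anchor_text": 0.5, "url_string": 0.5,
    "category_need": category_need_score("finance", crawled),
    "link_proximity": link_proximity_score(1),
})

next_url = frontier.pop()  # the URL the crawler would download next
```

With the other three scores held equal, the finance URL is popped first because its category is less represented in the assumed distribution and it sits fewer links away from a page of the desired type. A weighted sum is only one possible combination rule; any monotonic function of the five scores would fit the claim language equally well.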

  • 2 Assignments