Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling

US 7,672,943 B2
Filed: 10/26/2006
Issued: 03/02/2010
Est. Priority Date: 10/26/2006
Status: Active Grant

Abstract

A web crawler system as described herein utilizes a targeted approach to increase the likelihood of downloading web pages of a desired type or category. The system employs a plurality of URL scoring metrics that generate individual scores for outlinked URLs contained in a downloaded web page. For each outlinked URL, the individual scores are combined using an appropriate algorithm or formula to generate an overall score that represents a downloading priority for the outlinked URL. The web crawler application can then download subsequent web pages in an order that is influenced by the downloading priorities.
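
The abstract leaves the combining "algorithm or formula" open. As a minimal sketch only, the Python below assumes a weighted sum of five scores, each normalized to [0, 1], and uses a heap as the download frontier. The `WEIGHTS` values, the `Frontier` class, and the example URLs are illustrative assumptions, not details taken from the patent.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical equal weights; the source does not say how scores are combined.
WEIGHTS: Dict[str, float] = {
    "domain_density": 0.2,
    "anchor_text": 0.2,
    "url_string": 0.2,
    "category_need": 0.2,
    "link_proximity": 0.2,
}


def downloading_priority(scores: Dict[str, float]) -> float:
    """Combine per-metric scores (assumed to lie in [0, 1]) into one priority.

    Missing metrics count as 0.0, which is an assumption of this sketch.
    """
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)


class Frontier:
    """Queue of outlinked URLs, popped in descending priority order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker: equal priorities pop in insertion order

    def push(self, url: str, scores: Dict[str, float]) -> None:
        priority = downloading_priority(scores)
        # heapq is a min-heap, so the priority is stored negated.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        """Return the URL that should be downloaded next."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    frontier.push("http://example.com/about", {"anchor_text": 0.1})
    frontier.push("http://example.com/product/42",
                  {"anchor_text": 0.9, "url_string": 0.8})
    while len(frontier):
        print(frontier.pop())
```

A weighted sum is only one possible choice; a product of scores or a learned model would fit the abstract's wording equally well.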

10 Claims

1. A computer-implemented targeted web crawling method comprising:
- analyzing a downloaded web page that links to an outlinked web page having a URL and belonging to a sub-group of web pages;
  
  generating a domain density score in response to the sub-group, where the domain density score indicates relevance of the URL to a desired content-based web page type;
  
  deriving one or more anchor text tokens from the URL;
  
  generating an anchor text score based on the derived one or more anchor text tokens and information stored in a token model database, where the anchor text score indicates probability that a content of the outlinked web page is of the desired content-based web page type;
  
  deriving one or more URL string tokens from the URL;
  
  generating a URL string score based on the derived one or more URL string tokens and information stored in the token model database, where the URL string score indicates probability that the content of the outlinked web page is of the desired web page type;
  
  generating a category need score in response to characteristics of the downloaded web page, where the category need score is influenced by a current distribution of a plurality of categories associated with the desired content-based web page type, wherein generating the category need score comprises:
  
  identifying a category for the downloaded web page in response to the analyzing step, wherein the category is one of the plurality of categories;
  
  if representation of the category in the current distribution is relatively high, generating a relatively low category need score; and
  
  if representation of the category in the current distribution is relatively low, generating a relatively high category need score;
  
  generating a link proximity score for the URL, where the link proximity score indicates linking distance from the URL to a linking URL that corresponds to the desired content-based web page type, the linking distance is measured by a number of links between the URL to the linking URL in a linking structure;
  
  calculating a downloading priority for the URL in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score; and
  
  downloading, in an order determined by the downloading priority, a second web page that corresponds to the URL.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. A method according to claim 1, further comprising:
    - downloading, in an order determined by the downloading priority, a second web page that corresponds to the URL, the second web page containing a second outlinked web page having a second URL; and
      
      processing the downloading priority when generating a second link proximity score for the second URL.
  - 3. A method according to claim 1, wherein generating the domain density score comprises:
    - obtaining a ratio of a number of indexed pages to a number of processed pages, where the number of indexed pages represents a number of web pages from the sub-group having the desired content-based web page type, and where the number of processed pages represents a total number of web pages from the sub-group processed by the web crawling method; and
      
      calculating the domain density score from the ratio.
  - 4. A method according to claim 1, wherein generating the anchor text score comprises:
    - extracting words from the anchor text of the URL; and
      
      for each word extracted from the anchor text of the URL, calculating a word score that indicates probability of relevance to the desired content-based web page type.
  - 5. A method according to claim 1, wherein generating the anchor text score comprises:
    - extracting words from the anchor text of the URL;
      
      identifying at least one combination of words extracted from the anchor text of the URL; and
      
      for each combination of words, calculating a combined word score that indicates probability of relevance to the desired content-based web page type.
  - 6. A method according to claim 1, wherein generating the URL string score comprises:
    - extracting strings from the URL; and
      
      for each string extracted from the URL, calculating a string score that indicates probability of relevance to the desired content-based web page type.
  - 7. A method according to claim 1, wherein generating the URL string score comprises:
    - extracting strings from the URL;
      
      identifying at least one combination of strings extracted from the URL; and
      
      for each combination of strings, calculating a combined string score that indicates probability of relevance to the desired content-based web page type.
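
Claims 3 through 7 describe a ratio-based domain density score and token-based scores for anchor text and URL strings. The Python sketch below is one hedged reading of those steps. The in-memory `TOKEN_MODEL` dictionary stands in for the claimed token model database, and its contents, the `UNKNOWN_SCORE` default, the use of adjacent pairs as "combinations", and the averaging of token scores are all assumptions made for illustration.

```python
import re
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical stand-in for the token model database: token -> probability
# that a page is of the desired type (here, imagined product pages).
TOKEN_MODEL: Dict[str, float] = {
    "buy": 0.9,
    "product": 0.8,
    "buy now": 0.95,  # a two-word combination, as in claims 5 and 7
    "about": 0.1,
}
UNKNOWN_SCORE = 0.5  # assumed neutral score when no token is in the model


def domain_density_score(indexed_pages: int, processed_pages: int,
                         default: float = 0.5) -> float:
    """Claim 3: ratio of indexed pages to processed pages for a sub-group.

    Using the raw ratio as the score, and the default for an unseen
    sub-group, are assumptions of this sketch.
    """
    if processed_pages <= 0:
        return default
    return indexed_pages / processed_pages


def anchor_tokens(anchor_text: str) -> List[str]:
    """Claim 4: extract lowercase words from a link's anchor text."""
    return re.findall(r"[a-z0-9]+", anchor_text.lower())


def url_tokens(url: str) -> List[str]:
    """Claim 6: extract strings from the host, path, and query of a URL."""
    parts = urlsplit(url)
    text = f"{parts.netloc}/{parts.path}/{parts.query}".lower()
    return re.findall(r"[a-z0-9]+", text)


def with_combinations(tokens: List[str]) -> List[str]:
    """Claims 5 and 7: add adjacent-pair combinations to the token list."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def token_score(tokens: List[str], model: Dict[str, float]) -> float:
    """Average the model probabilities of the tokens found in the model."""
    known = [model[t] for t in tokens if t in model]
    return sum(known) / len(known) if known else UNKNOWN_SCORE


if __name__ == "__main__":
    print(domain_density_score(30, 120))
    print(token_score(with_combinations(anchor_tokens("Buy now!")), TOKEN_MODEL))
    url = "http://shop.example.com/product/buy-now?id=42"
    print(token_score(with_combinations(url_tokens(url)), TOKEN_MODEL))
```

Tokens absent from the model are ignored here rather than penalized; a production scorer might instead smooth unseen tokens toward a prior.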

8. A computer-readable medium having computer-executable instructions for performing steps comprising:
- analyzing a downloaded web page that links to an outlinked web page having a URL and belonging to a sub-group of web pages;
  
  generating a domain density score in response to the sub-group, where the domain density score indicates relevance of the URL to a desired content-based web page type;
  
  deriving one or more anchor text tokens from the URL;
  
  generating an anchor text score based on the derived one or more anchor text tokens and information stored in a token model database, where the anchor text score indicates probability that a content of the outlinked web page is of the desired content-based web page type;
  
  deriving one or more URL string tokens from the URL;
  
  generating a URL string score based on the derived one or more URL string tokens and information stored in the token model database, where the URL string score indicates probability that the content of the outlinked web page is of the desired web page type;
  
  generating a category need score in response to characteristics of the downloaded web page, where the category need score is influenced by a current distribution of a plurality of categories associated with the desired content-based web page type, wherein generating the category need score comprises:
  
  identifying a category for the downloaded web page in response to the analyzing step, wherein the category is one of the plurality of categories;
  
  if representation of the category in the current distribution is relatively high, generating a relatively low category need score; and
  
  if representation of the category in the current distribution is relatively low, generating a relatively high category need score;
  
  generating a link proximity score for the URL, where the link proximity score indicates linking distance from the URL to a linking URL that corresponds to the desired content-based web page type, the linking distance is measured by a number of links between the URL to the linking URL in a linking structure;
  
  calculating a downloading priority for the URL in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score; and
  
  downloading, in an order determined by the downloading priority, a second web page that corresponds to the URL.
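
The category need and link proximity steps recited above can be sketched as follows. This is an illustration under stated assumptions, not the patented formulas: the need score is taken as one minus the category's share of the current distribution, and the proximity score is assumed to decay geometrically with the number of links to the nearest page of the desired type. The category names and the `decay` parameter are hypothetical.

```python
from collections import Counter
from typing import Optional


def category_need_score(category: str, distribution: Counter) -> float:
    """Score is low for well-represented categories, high for scarce ones.

    `distribution` counts pages already collected per category. Using
    1 - share as the score is an assumption of this sketch.
    """
    total = sum(distribution.values())
    if total == 0:
        return 1.0  # nothing collected yet, so every category is needed
    return 1.0 - distribution[category] / total


def link_proximity_score(distance: Optional[int], decay: float = 0.5) -> float:
    """Score shrinks as the link distance to a desired-type page grows.

    `distance` is the number of links between the URL and the nearest
    linking URL of the desired type; None means no such page is known.
    The geometric decay is an assumed functional form.
    """
    if distance is None:
        return 0.0
    return decay ** distance


if __name__ == "__main__":
    # Hypothetical distribution of categories collected so far.
    collected = Counter({"electronics": 80, "books": 15, "toys": 5})
    for name in ("electronics", "books", "toys"):
        print(name, round(category_need_score(name, collected), 2))
    print(link_proximity_score(1), link_proximity_score(3))
```

Because the claim identifies the category from the downloaded (linking) page, a crawler built this way would apply that page's need score to each of its outlinked URLs.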

9. A web crawler system comprising:
- a computer system comprising a processing unit comprising:
  
  a web crawler core module configured to download web pages in accordance with a downloading priority scheme;
  
  a web page classifier coupled to the web crawler core module, the web page classifier being configured to analyze a first web page downloaded by the web crawler core module, the first web page having an outgoing link to a second web page corresponding to a URL; and
  
  a URL scoring module coupled to the web page classifier and to the web crawler core module, the URL scoring module being configured to assign a downloading priority to the URL based upon a plurality of metrics, wherein the plurality of metrics comprises:
  
  a domain density metric that results in a domain density score for the URL;
  
  an anchor text metric that results in an anchor text score for the URL;
  
  a URL string score metric that results in a URL string score for the URL; and
  
  a link proximity metric that results in a link proximity score for the URL;
  
  wherein each of the plurality of metrics indicates a different measure of probability that the second web page is of a designated content type, wherein the designated content type is at least one of commercial product pages, customer reviews pages, news pages, blogs, personal blogs, political pages, sports pages, education pages or reference pages.
- Dependent claims (10)
  - 10. A web crawler system according to claim 9, wherein the URL scoring module is configured to assign the downloading priority to the URL based upon a category need metric that indicates a predicted category for the second web page.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Kim, Joon Young; Wong, Sandy; Thogersen, Michael D.; Huynh, Yet L.; Natarajan, Ramakrishnan; Yao, Tong
Primary Examiner(s): CHANNAVAJJALA, SRIRAMA T
Application Number: US11/586,779
Publication Number: US 20080104113A1
Time in Patent Office: 1,223 Days
Field of Search: 707/1-3, 707/5-7, 707/10, 707/100, 707/102, 707/104.1, 707/200, 709/220-225, 709/218-219, 715/733, 715/808, 715/839, 715/854, 715/205-206, 706/45-50
US Class Current: 707/709
CPC Class Codes: G06F 16/9535 Search customisation based ...
