Uniform resource locator scoring for targeted web crawling
Abstract
A web crawler system as described herein utilizes a targeted approach to increase the likelihood of downloading web pages of a desired type or category. The system employs a plurality of URL scoring metrics that generate individual scores for outlinked URLs contained in a downloaded web page. For each outlinked URL, the individual scores are combined using an appropriate algorithm or formula to generate an overall score that represents a downloading priority for the outlinked URL. The web crawler application can then download subsequent web pages in an order that is influenced by the downloading priorities.
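The combination step described above can be sketched as follows. This is a minimal illustration, not the patented formula: the abstract says only that the scores are combined "using an appropriate algorithm or formula", so the equal-weight sum, the assumption that each score lies in [0, 1], and the names `WEIGHTS`, `downloading_priority`, and `Frontier` are all hypothetical.

```python
import heapq

# Hypothetical equal weights; the source does not specify how scores are combined.
WEIGHTS = {
    "domain_density": 0.25,
    "anchor_text": 0.25,
    "url_string": 0.25,
    "category_need": 0.25,
}


def downloading_priority(scores):
    """Combine per-metric scores (each assumed to lie in [0, 1]) into one priority."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


class Frontier:
    """Queue of outlinked URLs; the highest downloading priority pops first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities pop in insertion order

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated on the way in.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)
```

A crawler built this way would push every outlinked URL with its computed priority and always download the result of `pop()` next, so the download order is influenced by the priorities rather than by discovery order alone.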
20 Claims
1. A targeted web crawling method comprising:
analyzing a downloaded web page that links to an outlinked web page having a URL and belonging to a sub-group of web pages;
generating a domain density score in response to the sub-group, where the domain density score indicates relevance of the URL to a desired web page type;
generating an anchor text score in response to anchor text of the URL, where the anchor text score indicates probability that the outlinked web page is of the desired web page type;
generating a URL string score in response to characters of the URL, where the URL string score indicates probability that the outlinked web page is of the desired web page type;
generating a category need score in response to characteristics of the downloaded web page, where the category need score is influenced by a current distribution of a plurality of categories associated with the desired web page type; and
calculating a downloading priority for the URL in response to the domain density score, the anchor text score, the URL string score, and the category need score.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
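One way to read the category need score in claim 1 is as a measure of how under-represented a category is among the pages downloaded so far. The sketch below assumes that reading. The function name, the target shares, and the shortfall formula are hypothetical and do not come from the claim, which states only that the score is influenced by the current distribution of categories.

```python
def category_need_score(page_category, downloaded_counts, target_shares):
    """Return a score in [0, 1] that is higher for under-represented categories.

    page_category     -- category assigned to the downloaded page (assumed input)
    downloaded_counts -- pages downloaded so far, keyed by category
    target_shares     -- hypothetical desired share of each category, summing to 1
    """
    target = target_shares.get(page_category, 0.0)
    if target <= 0.0:
        return 0.0  # category is not wanted at all
    total = sum(downloaded_counts.values())
    if total == 0:
        return 1.0  # nothing downloaded yet, so every wanted category is needed
    current = downloaded_counts.get(page_category, 0) / total
    shortfall = max(0.0, target - current)
    return shortfall / target  # normalised so a completely missing category scores 1
```

With 90 pages in one category and 10 in another against equal target shares, the smaller category scores well above zero while the over-represented one scores zero, which would steer the crawl toward the category that is lagging.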
12. A computer-readable medium having computer-executable instructions for performing steps comprising:
obtaining information about a first web page that includes a link to a second web page having a URL;
generating a plurality of scores for the URL, each of the plurality of scores indicating a different measure related to whether the URL corresponds to a desired web page type;
calculating a downloading priority for the URL in response to the plurality of scores; and
providing the URL along with the downloading priority to a web crawler application.
Dependent claims: 13, 14, 15, 16, 17
18. A web crawler system comprising:
a web crawler core module configured to download web pages in accordance with a downloading priority scheme;
a web page classifier coupled to the web crawler core module, the web page classifier being configured to analyze a first web page downloaded by the web crawler core module, the first web page having an outgoing link to a second web page corresponding to a URL; and
a URL scoring module coupled to the web page classifier and to the web crawler core module, the URL scoring module being configured to assign a downloading priority to the URL based upon a plurality of metrics;
wherein each of the plurality of metrics indicates a different measure of probability that the second web page is of a designated type.
Dependent claims: 19, 20
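The coupling of the three modules in claim 18 can be sketched as below. This is a toy arrangement under stated assumptions, not the claimed implementation: the class names mirror the claim language, but the regex-based link extraction, the two keyword metrics, the averaging rule, and the injected `fetch` callable are all hypothetical stand-ins chosen so the example runs without network access.

```python
import heapq
import re


class WebPageClassifier:
    """Stand-in classifier: only extracts outgoing links and their anchor text."""

    LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)

    def analyze(self, html):
        return self.LINK_RE.findall(html)  # list of (url, anchor_text) pairs


class UrlScoringModule:
    """Assigns a priority from several metrics (two toy keyword metrics here)."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def _hit(self, text):
        return 1.0 if any(k in text.lower() for k in self.keywords) else 0.0

    def priority(self, url, anchor_text):
        metrics = [self._hit(anchor_text), self._hit(url)]
        return sum(metrics) / len(metrics)  # hypothetical: plain average


class WebCrawlerCore:
    """Downloads pages in priority order, using the two modules above."""

    def __init__(self, fetch, classifier, scorer):
        self.fetch, self.classifier, self.scorer = fetch, classifier, scorer
        self.frontier, self.seen = [], set()

    def _enqueue(self, url, priority):
        if url not in self.seen:
            self.seen.add(url)
            # Negated priority: heapq pops the smallest tuple first.
            heapq.heappush(self.frontier, (-priority, len(self.seen), url))

    def crawl(self, seed, limit):
        order = []
        self._enqueue(seed, 1.0)
        while self.frontier and len(order) < limit:
            _, _, url = heapq.heappop(self.frontier)
            order.append(url)
            for link, anchor in self.classifier.analyze(self.fetch(url)):
                self._enqueue(link, self.scorer.priority(link, anchor))
        return order
```

Passing `fetch` in as a callable keeps the core module independent of any particular HTTP library, so the same wiring can be exercised against an in-memory dictionary of pages.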
Specification