Systems and methods for inferring uniform resource locator (URL) normalization rules
Abstract
Different URLs that reference the same web page or other web resource are detected, and that information is used to download only one instance of the web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once substantially identical web pages or web resources with different URLs are found, the URLs are analyzed to identify which portions are essential for identifying a particular web page or web resource and which portions are irrelevant. Rule learning proceeds in two steps: in step (1), it is learned for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein) which portions of the URLs in that class are relevant for selecting the page and which are not; in step (2), the per-equivalence-class rules constructed during step (1) are generalized to trans-equivalence-class rules that cover many equivalence classes. Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler in future crawls to avoid downloading duplicate web pages or web resources.
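The two rule-learning steps can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions: an exact SHA-256 match stands in for "substantially identical", the only "portions of the URL" considered are query parameters, and the function names, the `min_support` threshold, and the sample URLs are all hypothetical.

```python
import hashlib
from collections import Counter, defaultdict
from urllib.parse import parse_qsl, urlsplit


def equivalence_classes(pages):
    """Group URLs by content digest (assumed stand-in for 'substantially identical')."""
    by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    # Only classes with two or more URLs reveal anything about redundant URL parts.
    return [urls for urls in by_digest.values() if len(urls) > 1]


def irrelevant_params(urls):
    """Step (1): a parameter whose value varies inside one class does not select the page."""
    parsed = [dict(parse_qsl(urlsplit(u).query, keep_blank_values=True)) for u in urls]
    names = set().union(*parsed)
    # A missing parameter counts as the value None, so present-vs-absent is also variation.
    return {n for n in names if len({p.get(n) for p in parsed}) > 1}


def generalize(classes, min_support=2):
    """Step (2): keep parameters found irrelevant in at least min_support classes."""
    counts = Counter()
    for urls in classes:
        counts.update(irrelevant_params(urls))
    return {name for name, seen in counts.items() if seen >= min_support}


# Hypothetical crawl result: URL -> downloaded content.
sample_pages = {
    "http://example.com/item?id=1&sid=aaa": "Item one",
    "http://example.com/item?id=1&sid=bbb": "Item one",
    "http://example.com/item?id=2&sid=ccc": "Item two",
    "http://example.com/item?id=2&sid=ddd": "Item two",
    "http://example.com/item?id=3&sid=eee": "Item three",
}

classes = equivalence_classes(sample_pages)
candidate_rule = generalize(classes)
```

In this sample, `sid` varies within both equivalence classes while `id` stays fixed, so the generalized candidate rule is "ignore `sid`". A real crawler would likely need a fuzzier similarity test than an exact hash, since pages that differ only in a timestamp or advertisement would otherwise land in separate classes.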
20 Claims
1. A method for normalizing uniform resource locators (URLs) corresponding to a plurality of web resources, comprising:
analyzing the content of at least two web resources to determine whether the web resources are substantially identical; and
determining a rule for the URLs of the web resources if the web resources are substantially identical.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for normalizing uniform resource locators (URLs) corresponding to a plurality of web resources, comprising:
a web crawler for receiving the web resources from a web server; and
a processor for analyzing the content of at least two web resources to determine whether the web resources are substantially identical, and determining a rule for the URLs of the web resources if the web resources are substantially identical.
Dependent claims: 12, 13, 14, 15, 16
17. A method for normalizing uniform resource locators (URLs), comprising:
determining if a plurality of web resources are substantially identical;
identifying URLs addressing substantially identical web resources on a web server; and
constructing a URL normalization rule.
Dependent claims: 18, 19, 20
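Once a normalization rule has been constructed, the abstract describes applying it to the known pages to identify errors before activating it. The following Python sketch shows one way such a check might look. It assumes a rule is simply a set of query-parameter names to drop; `normalize`, `rule_errors`, the class labels, and the sample URLs are hypothetical and not taken from the claims.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url, drop_params):
    """Apply a rule: remove the listed query parameters and sort the remainder."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in drop_params
    )
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def rule_errors(drop_params, class_of):
    """Return URL pairs the rule would merge although their content classes differ."""
    first_seen = {}
    errors = []
    for url, label in class_of.items():
        key = normalize(url, drop_params)
        # Remember the first URL (and its class) that produced this normalized form.
        first_url, first_label = first_seen.setdefault(key, (url, label))
        if first_label != label:
            errors.append((first_url, url))
    return errors


# Hypothetical mapping: URL -> equivalence-class label from content comparison.
class_of = {
    "http://example.com/item?id=1&sid=aaa": "A",
    "http://example.com/item?id=1&sid=bbb": "A",
    "http://example.com/item?id=2&sid=aaa": "B",
}

good_rule_errors = rule_errors({"sid"}, class_of)
bad_rule_errors = rule_errors({"id"}, class_of)
```

Here, dropping `sid` merges only URLs from the same class, so that rule produces no errors and could be activated. Dropping `id` would collapse a class "A" URL and a class "B" URL into the same normalized form, so that rule is reported as erroneous.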
Specification