Systems and methods for inferring uniform resource locator (URL) normalization rules
Abstract
Different URLs that reference the same web page or other web resource are detected, and that information is used to download only one instance of the web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once substantially identical web pages or web resources with different URLs are found, the URLs are analyzed to identify which portions of the URL are essential for identifying a particular web page or web resource and which portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (referred to herein as an “equivalence class”), the per-equivalence-class rules are generalized to trans-equivalence-class rules. There are thus two rule-learning steps: step (1), in which it is learned, for each equivalence class, which portions of the URLs in that class are relevant for selecting the page and which are not; and step (2), in which the per-equivalence-class rules constructed during step (1) are generalized to rules that cover many equivalence classes. Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler in future crawls to avoid downloading duplicate web pages or web resources.
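The two rule-learning steps can be illustrated with a minimal Python sketch. It is not the patented method itself, and it rests on simplifying assumptions that the abstract does not state: "substantially identical" is reduced to an exact content hash, the only URL portions considered are query parameters, and a rule is simply a set of parameter names treated as irrelevant. The function names and the `min_support` threshold are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict
from urllib.parse import parse_qsl, urlsplit


def group_by_content(pages):
    """Group URLs into equivalence classes keyed by a content hash.

    Assumption: exact hash equality stands in for "substantially identical".
    """
    classes = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        classes[digest].append(url)
    # A class with a single URL says nothing about irrelevant URL portions.
    return [urls for urls in classes.values() if len(urls) > 1]


def per_class_rule(urls):
    """Step (1): a query parameter whose value varies (or is sometimes absent)
    within one equivalence class did not affect the content, so it is
    treated as irrelevant for that class."""
    parsed = [dict(parse_qsl(urlsplit(u).query)) for u in urls]
    names = set().union(*parsed)
    return {n for n in names if len({p.get(n) for p in parsed}) > 1}


def generalize(rules, min_support=2):
    """Step (2): keep parameters found irrelevant in at least `min_support`
    equivalence classes (a hypothetical threshold)."""
    counts = Counter(name for rule in rules for name in rule)
    return {name for name, n in counts.items() if n >= min_support}


# Illustrative data only; example.com and the parameter names are made up.
pages = {
    "http://example.com/item?id=1&sid=aaa": "Item one",
    "http://example.com/item?id=1&sid=bbb": "Item one",
    "http://example.com/item?id=2&sid=ccc": "Item two",
    "http://example.com/item?id=2&sid=ddd": "Item two",
    "http://example.com/item?id=3&sid=eee": "Item three",
}
rules = [per_class_rule(urls) for urls in group_by_content(pages)]
print(generalize(rules))  # → {'sid'}
```

In this toy data the `sid` parameter varies inside both multi-URL classes while `id` stays fixed, so only `sid` survives generalization.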
19 Claims
1. A method for determining a rule applicable to uniform resource locators (URLs) corresponding to a plurality of web resources, comprising:
analyzing the content of web resources from at least one web site;
grouping web resources by content so that each group comprises all of the web resources from the at least one web site that have substantially identical content, wherein each group of substantially identical web resources is referred to as an equivalence class;
analyzing URLs corresponding to all substantially identical web resources in an equivalence class to determine a per equivalence class URL rewrite rule applicable to the URLs;
analyzing the per equivalence class URL rewrite rule compared to at least one other per equivalence class URL rewrite rule for at least one different equivalence class to determine a trans-equivalence class URL rewrite rule; and
applying the trans-equivalence class URL rewrite rule to additional web resources from the at least one website to predict that different URLs reference substantially identical web resources, thereby avoiding a plurality of references to or downloads of substantially identical web resources.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system for determining a rule applicable to uniform resource locators (URLs) corresponding to a plurality of web resources, comprising:
a web crawler to receive the web resources from at least one web site on a web server; and
a processor to receive the content of the web resources, grouping web resources by content so that each group comprises all of the web resources from the at least one web site that have substantially identical content, wherein each group of substantially identical web resources is referred to as an equivalence class, analyzing URLs corresponding to all substantially identical web resources in an equivalence class to determine a per equivalence class URL rewrite rule applicable to the URLs;
analyzing the per equivalence class URL rewrite rule compared to at least one other per equivalence class URL rewrite rule for at least one different equivalence class to determine a trans-equivalence class URL rewrite rule; and
applying the trans-equivalence class URL rewrite rule to additional web resources from the at least one website to predict that different URLs reference substantially identical web resources, thereby avoiding receipt by the web crawler of substantially identical web resources.
Dependent claims: 11, 12, 13, 14, 15
16. A method for determining a rule applicable to uniform resource locators (URLs), comprising:
grouping web resources on a web server according to substantially identical content, wherein each group of substantially identical web resources is referred to as an equivalence class;
analyzing URLs addressing all substantially identical web resources in an equivalence class;
constructing a per equivalence class URL normalization rule applicable to the URLs corresponding to all substantially identical web resources in the equivalence class;
analyzing the per equivalence class URL normalization rule compared to at least one other per equivalence class URL normalization rule for at least one different equivalence class to determine a trans-equivalence class URL normalization rule; and
applying the trans-equivalence class URL normalization rule to additional web resources to predict that different URLs reference substantially identical web resources, thereby avoiding a plurality of references to or downloads of substantially identical web resources.
Dependent claims: 17, 18, 19
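The final "applying" step recited in the independent claims, together with the error check mentioned in the abstract, might look roughly like the Python sketch below. This is an illustration under assumptions, not the claimed implementation: a rule is modeled as a set of query-parameter names to drop, the remaining parameters are sorted as an extra normalization choice, and `normalize`, `rule_has_errors`, `crawl`, and the `fetch` callback are hypothetical names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url, irrelevant):
    """Rewrite a URL by dropping query parameters the rule deems irrelevant.

    Sorting the remaining parameters is an added assumption of this sketch.
    """
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in irrelevant)
    return urlunsplit(parts._replace(query=urlencode(kept)))


def rule_has_errors(irrelevant, pages):
    """Return True if the rule would merge already-downloaded URLs whose
    content is known to differ, i.e. the rule is too aggressive."""
    seen = {}
    for url, body in pages.items():
        key = normalize(url, irrelevant)
        if seen.setdefault(key, body) != body:
            return True
    return False


def crawl(urls, irrelevant, fetch):
    """Fetch each URL unless its normalized form was already visited."""
    visited, results = set(), {}
    for url in urls:
        key = normalize(url, irrelevant)
        if key in visited:
            continue  # predicted duplicate: skip the download
        visited.add(key)
        results[url] = fetch(url)
    return results
```

A rule would be activated only when `rule_has_errors` returns `False` on the pages already downloaded; after that, `crawl` skips any URL whose normalized form has been seen.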
Specification