System and method for detecting a web page
Abstract
An improved system and method are provided for detecting a web page template. A web page template detector may be provided for performing page-level template detection on a web page. In general, a web page template classifier may be trained using automatically generated training data and then applied to web pages to identify web page templates. A web page template may be detected by classifying segments of a web page as template structures, assigning classification scores to the segments classified as template structures, and then smoothing the classification scores assigned to the segments. Generalized isotonic regression may be applied to smooth the scores associated with the nodes of a hierarchy by minimizing an optimization function using dynamic programming.
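The smoothing step described in the abstract can be illustrated with a minimal sketch. It is not the claimed implementation, and it rests on several assumptions that the abstract does not state: the hierarchy is a DOM-like tree, the monotonicity constraint is that a child's templateness score is at least its parent's, the optimization function is the sum of absolute differences between raw and smoothed scores, and smoothed scores are restricted to the set of raw scores observed on the page. The names `smooth_scores`, `children`, and `raw` are hypothetical.

```python
from typing import Dict, List, Tuple


def smooth_scores(children: Dict[str, List[str]],
                  raw: Dict[str, float],
                  root: str) -> Dict[str, float]:
    """Isotonic smoothing on a tree by dynamic programming (illustrative).

    Assumed constraint: smoothed[child] >= smoothed[parent].
    Assumed objective: sum of |raw[v] - smoothed[v]| over all nodes.
    Assumed candidate values: the distinct raw scores.
    """
    levels = sorted(set(raw.values()))
    n = len(levels)

    # Pre-order walk; reversing it visits every child before its parent.
    order: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # best[node][k] = (minimum subtree cost, level index achieving it),
    # taken over all level indices j >= k for this node.
    best: Dict[str, List[Tuple[float, int]]] = {}
    for node in reversed(order):
        cost = [abs(raw[node] - level) for level in levels]
        for child in children.get(node, []):
            for k in range(n):
                cost[k] += best[child][k][0]
        suffix: List[Tuple[float, int]] = [(0.0, 0)] * n
        running = (float("inf"), n)
        for k in range(n - 1, -1, -1):
            if cost[k] <= running[0]:
                running = (cost[k], k)
            suffix[k] = running
        best[node] = suffix

    # Top-down pass: each node takes its cheapest level that is not
    # below the level already chosen for its parent.
    smoothed: Dict[str, float] = {}
    pending = [(root, 0)]
    while pending:
        node, floor = pending.pop()
        k = best[node][floor][1]
        smoothed[node] = levels[k]
        pending.extend((child, k) for child in children.get(node, []))
    return smoothed


# Hypothetical page: one list item inside a navigation block was scored low.
children = {"html": ["nav", "content"], "nav": ["ul"], "ul": ["li_a", "li_b"]}
raw = {"html": 0.0, "nav": 0.75, "ul": 0.75,
       "li_a": 0.25, "li_b": 1.0, "content": 0.0}
smoothed = smooth_scores(children, raw, "html")
```

On this hypothetical input, the only change is that `li_a` is raised from 0.25 to 0.75. Lowering both `nav` and `ul` to 0.25 would satisfy the assumed constraint as well, but at twice the total cost.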
20 Claims
1. A computer system for detecting a template, comprising:
a web page template detector for performing page-level template detection on a web page;
a web page template classifier operably coupled to the web page template detector for identifying a web page template of the web page; and
a storage operably coupled to the web page template detector for storing a plurality of templateness scores assigned by the web page template classifier for nodes of the web page.
Dependent claims: 2, 3, 4

5. A computer-implemented method for detecting a template, comprising:
receiving a plurality of web pages;
automatically generating training data; and
training a page-level classifier to identify a web page template.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14

15. A computer system for detecting a template, comprising:
means for receiving a web page;
means for classifying segments of the web page as template structures;
means for assigning classification scores to the segments of the web page classified as the template structures; and
means for smoothing classification scores assigned to the segments of the web page.
Dependent claims: 16, 17, 18, 19, 20

Specification