Annotating HTML segments with functional labels
First Claim
1. A method comprising:
processing a web page to determine a plurality of segments, wherein each segment from the plurality of segments includes one or more HTML elements;
each machine-based classifier of a plurality of machine-based classifiers generating, based at least upon metadata associated with two or more segments from the plurality of segments that indicates one or more presentation features in the HTML elements of the two or more segments from the plurality of segments, a probability output for each segment of the two or more segments from the plurality of segments, wherein each functional category from the plurality of functional categories corresponds to a functional role of HTML elements in the web page;
wherein each machine-based classifier from the plurality of machine-based classifiers corresponds to a functional category from the plurality of functional categories;
assigning, based on the plurality of probability output, one or more functional categories to each segment of the two or more segments;
a first application selecting a first set of functional categories from the plurality of functional categories;
a second application that is different than the first application selecting a second set of functional categories from the plurality of functional categories, wherein the second set of functional categories does not include functional categories from the first set of functional categories;
the first application selecting for processing, based upon the first set of functional categories and the functional categories assigned to the two or more segments, a first set of one or more segments from the two or more segments;
the second application selecting for processing, based upon the second set of functional categories and the functional categories assigned to the two or more segments, a second set of one or more segments from the two or more segments, wherein the second set of one or more segments includes at least one segment that is not in the first set of one or more segments and the first set of one or more segments includes at least one segment that is not in the second set of one or more segments;
the first application processing content contained in the first set of one or more segments and not processing content contained in the second set of one or more segments;
the second application processing content contained in the second set of one or more segments and not processing content contained in the first set of one or more segments; and
wherein the method is performed by one or more computing devices.
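The classification steps of the claim (one classifier per functional category, a probability output per segment, then category assignment) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the category names, the presentation features (`link_density`, `text_length`), the logistic scoring, the weights, and the 0.5 threshold with an arg-max fallback.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Segment:
    segment_id: str
    # Hypothetical presentation-feature metadata for the segment's HTML elements.
    features: dict
    categories: set = field(default_factory=set)


def make_classifier(weights, bias):
    """Build a toy logistic scorer standing in for one machine-based classifier."""
    def classify(features):
        z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
        return 1.0 / (1.0 + math.exp(-z))  # probability output in (0, 1)
    return classify


# One classifier per functional category (names and weights are illustrative).
CLASSIFIERS = {
    "navigation": make_classifier({"link_density": 6.0}, -3.0),
    "main_content": make_classifier({"text_length": 0.01, "link_density": -4.0}, -1.0),
}


def assign_categories(segments, classifiers, threshold=0.5):
    """Assign one or more categories to each segment from the probability outputs."""
    for seg in segments:
        probs = {cat: clf(seg.features) for cat, clf in classifiers.items()}
        seg.categories = {cat for cat, p in probs.items() if p >= threshold}
        if not seg.categories:
            # Assumed fallback: keep the most probable category so that every
            # segment receives at least one label.
            seg.categories = {max(probs, key=probs.get)}
    return segments


nav = Segment("s1", {"link_density": 0.9, "text_length": 40})
article = Segment("s2", {"link_density": 0.05, "text_length": 800})
assign_categories([nav, article], CLASSIFIERS)
```

Because each category has its own classifier, a segment whose features score highly for several categories would receive several labels, which matches the "one or more functional categories" wording of the claim.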
Abstract
A method and apparatus are described for assigning functional labels to segments of web pages in an application-independent way. In the approach described herein, one of a generic set of functional labels is automatically assigned to each segment of a web page, where the generic functional labels may be topic-independent and application-independent. Applications with different needs can determine which segments of the web page to process based on which functional labels correspond to the types of information needed by each application. Thus, the work of classifying the function of each segment of a web page is separated from the work of selecting which segments satisfy the needs of a particular application. The work of classification can be performed in an application-independent way, relieving every application developer of the burden of creating their own classifiers.
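The separation between labeling and selection can be sketched as follows. This is a hedged illustration only: the segment identifiers, the label names, and the two applications (a summarizer and a link crawler) are hypothetical and are not taken from the patent.

```python
# Application-independent labels, assumed to have been produced once by an
# upstream classification step (segment id -> assigned functional labels).
labeled_segments = {
    "s1": {"navigation"},
    "s2": {"main_content"},
    "s3": {"advertisement"},
    "s4": {"related_links"},
}


def select_segments(labeled, wanted_labels):
    """Return the ids of segments carrying at least one of the wanted labels."""
    return [seg_id for seg_id, labels in labeled.items() if labels & wanted_labels]


# Each application chooses only the labels that match its own needs.
SUMMARIZER_LABELS = {"main_content"}
CRAWLER_LABELS = {"navigation", "related_links"}

summarizer_segments = select_segments(labeled_segments, SUMMARIZER_LABELS)
crawler_segments = select_segments(labeled_segments, CRAWLER_LABELS)
```

Neither application needs its own classifier here. Each one only declares a set of labels, and the same labeled page serves both.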
20 Claims
11. One or more non-transitory computer-readable media storing instructions which, when processed by one or more processors, cause:
processing a web page to determine a plurality of segments, wherein each segment from the plurality of segments includes one or more HTML elements;
each machine-based classifier of a plurality of machine-based classifiers generating, based at least upon metadata associated with two or more segments from the plurality of segments that indicates one or more presentation features in the HTML elements of the two or more segments from the plurality of segments, a probability output for each segment of the two or more segments from the plurality of segments, wherein each functional category from the plurality of functional categories corresponds to a functional role of HTML elements in the web page;
wherein each machine-based classifier from the plurality of machine-based classifiers corresponds to a functional category from the plurality of functional categories;
assigning, based on the plurality of probability output, one or more functional categories to each segment of the two or more segments;
a first application selecting a first set of functional categories from the plurality of functional categories;
a second application that is different than the first application selecting a second set of functional categories from the plurality of functional categories, wherein the second set of functional categories does not include functional categories from the first set of functional categories;
the first application selecting for processing, based upon the first set of functional categories and the functional categories assigned to the two or more segments, a first set of one or more segments from the two or more segments;
the second application selecting for processing, based upon the second set of functional categories and the functional categories assigned to the two or more segments, a second set of one or more segments from the two or more segments, wherein the second set of one or more segments includes at least one segment that is not in the first set of one or more segments and the first set of one or more segments includes at least one segment that is not in the second set of one or more segments;
the first application processing content contained in the first set of one or more segments and not processing content contained in the second set of one or more segments; and
the second application processing content contained in the second set of one or more segments and not processing content contained in the first set of one or more segments.
Specification