Automatic classification of segmented portions of web pages
Abstract
Exemplary methods and apparatuses are provided which may be used for classifying and indexing segmented portions of web pages and providing related information for use in information extraction and/or information retrieval systems. In an embodiment, an index of segmented portions may be used by a search engine to respond to a search query. In an embodiment, one or more machine learned models may be used to identify one or more feature properties of a plurality of segmented portions within one or more files, or otherwise inferable from the one or more files. In an embodiment, one or more machine learned models may be used to classify one or more of a plurality of segmented portions as being at least one of a plurality of segment types.
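The flow described in the abstract (identify feature properties, classify each segmented portion, score it, and index it by segment type) can be sketched as follows. This is a minimal illustration only: the `Segment` type, the feature names, the segment type labels, and every threshold are assumptions made for the example, and the hand-written rules in `classify` and `quality_score` merely stand in for the machine learned models, which the source does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    page_url: str    # hypothetical identifier of the source web page
    segment_id: int  # position of the segment within the page
    text: str        # content to be displayed in the segment
    link_count: int  # number of hyperlinks inside the segment


def extract_features(seg: Segment) -> dict:
    """Identify simple feature properties of one segmented portion."""
    word_count = len(seg.text.split())
    return {
        "word_count": word_count,
        "link_density": seg.link_count / max(word_count, 1),
    }


def classify(features: dict) -> str:
    """Stand-in for a machine learned classifier (thresholds are invented)."""
    if features["link_density"] > 0.5:
        return "navigation"
    if features["word_count"] >= 20:
        return "main_content"
    return "other"


def quality_score(features: dict) -> float:
    """Toy quality score in [0, 1]: longer, less link-heavy text scores higher."""
    length_part = min(features["word_count"] / 50.0, 1.0)
    return round(length_part * (1.0 - min(features["link_density"], 1.0)), 3)


def build_index(segments: list) -> dict:
    """Group segments by segment type and record each content quality score."""
    index = defaultdict(list)
    for seg in segments:
        feats = extract_features(seg)
        index[classify(feats)].append(
            {
                "page": seg.page_url,
                "segment": seg.segment_id,
                "quality": quality_score(feats),
            }
        )
    return dict(index)


page = "http://example.com/article"
segments = [
    Segment(page, 0, "Home News Sports Contact", link_count=4),
    Segment(page, 1, " ".join(["word"] * 40), link_count=1),
]
index = build_index(segments)
```

A search engine built on such an index could, for instance, restrict matching to entries under `index["main_content"]` and rank them by their stored `quality` value; that usage is likewise an illustration rather than something the source prescribes.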
20 Claims
1. A method comprising:
with one or more special purpose computing devices coupled to a memory:
  accessing a plurality of segmented portions of at least one of a plurality of displayable web pages represented by one or more digital signals of one or more files stored in a memory, wherein a particular displayable web page of the plurality of displayable web pages comprises at least two of the plurality of segmented portions;
  using one or more machine learned models for:
    identifying one or more feature properties of the plurality of segmented portions within the one or more files, or otherwise inferable from the one or more files,
    classifying the at least two of the plurality of segmented portions as being at least one of a plurality of segment types based, at least in part, on the one or more identified feature properties, the one or more identified feature properties comprising at least language feature properties of a language model of content to be displayed in one or more of the at least two of the plurality of segmented portions, and
    determining content quality scores for at least two of the plurality of segmented portions of at least the particular displayable web page; and
  storing one or more digital signals in the memory as part of an index for the plurality of segmented portions, the index being based, at least in part, on the segment type, the index indicating the content quality scores.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. An apparatus comprising:
a memory having stored therein one or more digital signals to represent at least one file for a particular displayable web page to comprise at least two of a plurality of segmented portions;
at least one processing unit coupled to the memory and programmed with instructions to:
  access the plurality of segmented portions of the at least one displayable web page, and
  use one or more machine learned models to:
    identify one or more feature properties of the plurality of segmented portions within the one or more files, or otherwise to be inferable from the one or more files,
    classify the at least two of the plurality of segmented portions as at least one of a plurality of segment types to be based, at least in part, on the one or more feature properties to be identified, the one or more feature properties to be identified are to comprise at least language feature properties of language model of content to be displayed in one or more of the at least two of the plurality of segmented portions, and
    determine content quality scores for at least two of the plurality of segmented portions of at least the particular displayable web page; and
  establish an index in the memory, the index to be established for the plurality of segmented portions and to be based, at least in part, on the segment type, the index to indicate the content quality scores.
Dependent claims: 12, 13, 14, 15
16. An article comprising:
a non-transitory computer readable medium having computer implementable instructions stored thereon to be implemented by one or more processing units in a computing device to transform the computing device into a special purpose device to:
  access a plurality of segmented portions of at least one of a plurality of web pages to be displayable by one or more digital signals of one or more files stored in a memory, wherein a particular web page of the plurality of web pages is to comprise at least two of the plurality of segmented portions;
  use one or more machine learned models to:
    identify one or more feature properties of the plurality of segmented portions within the one or more files, or otherwise to be inferable from the one or more files,
    classify the at least two of the plurality of segmented portions as at least one of a plurality of segment types to be based, at least in part, on the one or more identified feature properties, the one or more feature properties to be identified are to comprise at least language feature properties of language model of content to be displayed in one or more of the at least two of the plurality of segmented portions, and
    determine content quality scores for at least two of the plurality of segmented portions of at least the particular web page; and
  establish one or more digital signals to represent an index within a memory to be coupled to the one or more processing units, the index to be established for the plurality of segmented portions and to be based, at least in part, on the segment type, the index to indicate the content quality scores.
Dependent claims: 17, 18, 19, 20
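Each independent claim recites language feature properties of a language model of the content to be displayed in a segment. The claims do not say which kind of language model is meant, so the following is only one possible reading: a unigram model with add-one smoothing, whose average per-token log probability serves as a single numeric feature. The function names and the tiny reference corpus are invented for illustration.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return token counts and the total token count of a reference corpus."""
    counts = Counter(corpus_tokens)
    return counts, sum(counts.values())


def avg_log_prob(tokens, counts, total):
    """Average per-token log probability under add-one smoothing."""
    if not tokens:
        return 0.0
    vocab_size = len(counts) + 1  # one extra slot reserves mass for unseen tokens
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return log_prob / len(tokens)


# Hypothetical reference corpus of ordinary prose.
reference = "the cat sat on the mat and the dog sat on the rug".split()
counts, total = train_unigram(reference)

prose_score = avg_log_prob("the dog sat on the mat".split(), counts, total)
nav_score = avg_log_prob("home login cart checkout".split(), counts, total)
```

Under this toy model the prose-like segment receives a higher (less negative) score than the navigation-like one, so the value could be passed to a classifier as one language feature among others.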
Specification