System and method for determining web page quality using collective inference based on local and global information
Abstract
An improved system and method is provided for determining web page quality using collective inference based on local and global web page information. A classification engine may be provided for classifying a web page using local features of a seed set of web pages and global web graph information about the seed set of web pages. A dual algorithm based on graph regularization formulated as a well-formed optimization solution may be used in an embodiment for applying collective inference for binary classification of the web page using the local web page information and global web graph information of a web page, the local web page information and global web graph information of an authoritative set of web pages, and the local web page information and global web graph information of a non-authoritative set of web pages.
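The idea of combining local scores with link structure can be illustrated with a minimal sketch. It is not the dual algorithm the abstract refers to: it runs plain gradient descent on a primal objective chosen here for illustration, with a logistic loss on the seed pages, a penalty tying each page's score to its local classifier score, and a squared-difference penalty on each hyperlink edge. The function name `collective_scores` and the parameters `lam`, `mu`, `lr`, and `steps` are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def collective_scores(local_scores, labels, edges,
                      lam=0.1, mu=1.0, lr=0.1, steps=2000):
    """Gradient descent on an illustrative graph-regularized objective:

        sum over seed pages i of log(1 + exp(-y_i * f_i))
        + lam * sum over pages i of (f_i - local_scores[i])**2
        + mu  * sum over edges (i, j) of (f_i - f_j)**2

    local_scores: per-page score from a local (text/click/domain) classifier
    labels:       dict page index -> +1 (known high quality) or -1 (known low)
    edges:        list of (i, j) hyperlink or co-citation pairs
    The fixed step size lr is an assumption; it must be reduced for
    graphs with high-degree nodes.
    """
    n = len(local_scores)
    f = list(local_scores)
    for _ in range(steps):
        # Gradient of the local-score penalty.
        grad = [2.0 * lam * (f[i] - local_scores[i]) for i in range(n)]
        # Gradient of the logistic loss on the labeled seed pages.
        for i, y in labels.items():
            grad[i] += -y * sigmoid(-y * f[i])
        # Gradient of the edge penalty, which pulls linked pages together.
        for i, j in edges:
            d = 2.0 * mu * (f[i] - f[j])
            grad[i] += d
            grad[j] -= d
        f = [f[i] - lr * grad[i] for i in range(n)]
    return f


# Toy graph: page 0 is a known good seed, page 4 a known bad seed.
# Page 1 links to page 0, page 3 links to page 4, page 2 is isolated.
local_scores = [0.0, 0.0, 0.0, 0.0, 0.0]
labels = {0: +1, 4: -1}
edges = [(0, 1), (3, 4)]
scores = collective_scores(local_scores, labels, edges)
```

In this toy graph the unlabeled page linked to the good seed ends up with a positive score, the page linked to the bad seed with a negative one, and the isolated page stays at its local score. Thresholding the scores at zero would give a binary quality label.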
24 Claims
1. A computer system for classifying a web page, comprising:
one or more processors to execute instructions;
a classification engine for determining a quality of the web page using local features of a seed set of web pages and global web graph information about the seed set of web pages, wherein:
each web page of the seed set of web pages is a web page of a known quality,
the local features of the seed set of web pages comprises text, clicking, domain, or time stamp information concerning the seed set of web pages, and
the global web graph information about the seed set of web pages comprises hyperlink or co-citation relationships among the seed set of web pages;
a binary classifier coupled to the classification engine for performing binary classification to provide a binary score for the web page; and
a collective inference engine coupled to the binary classifier for performing collective inference by applying collective inference for binary classification using the local features of the seed set of web pages and the global web graph information about the seed set of web pages, comprising finding a minimum value of a regularized convex dual of a logistic regression loss function for a node of a graph.
Dependent claims: 2, 3, 4
5. A computer-implemented method for classifying a web page, comprising:
accessing, by one or more computing devices, local web page information of and global web graph information about a plurality of authoritative web pages, local web page information of and global web graph information about a plurality of non-authoritative web pages, and local web page information of and global web graph information about the web page, wherein:
each authoritative web page of the plurality of authoritative web pages is a web page of known high quality,
each non-authoritative web page of the plurality of non-authoritative web pages is a web page of known low quality,
the local web page information of the web pages comprises text, clicking, domain, or time stamp information concerning the web pages, and
the global web graph information about the web pages comprises hyperlink or co-citation relationships among the web pages;
determining, by the one or more computing devices, a quality of the web page using collective inference by applying collective inference for binary classification of the web page using the local web page information of the web page and the global web graph information about the web page, the local web page information of the plurality of authoritative web pages and the global web graph information about the plurality of authoritative web pages, and the local web page information of the plurality of non-authoritative web pages and the global web graph information about the plurality of non-authoritative web pages, comprising finding a minimum value of a regularized convex dual of a logistic regression loss function for a node of a graph; and
outputting, by the one or more computing devices, an indication of the quality of the web page.
Dependent claims: 6, 7, 8, 9, 10, 11, 12
13. A computer-implemented method for classifying a web page, comprising:
accessing, by one or more computing devices, local web page information of and global web graph information about a plurality of authoritative web pages, local web page information of and global web graph information about a plurality of non-authoritative web pages, and local web page information of and global web graph information about the web page, wherein:
each authoritative web page of the plurality of authoritative web pages is a web page of known high quality,
each non-authoritative web page of the plurality of non-authoritative web pages is a web page of known low quality,
the local web page information of the web pages comprises text, clicking, domain, or time stamp information concerning the web pages, and
the global web graph information about the web pages comprises hyperlink or co-citation relationships among the web pages;
determining, by the one or more computing devices, a quality of the web page using collective inference by applying collective inference for binary classification of the web page using the local web page information of the web page and the global web graph information about the web page, the local web page information of the plurality of authoritative web pages and global web graph information about the plurality of authoritative web pages, and the local web page information of the plurality of non-authoritative web pages and global web graph information about the plurality of non-authoritative web pages, comprising finding a minimum value of a regularization function associated with an edge of a graph; and
outputting, by the one or more computing devices, an indication of the quality of the web page.
Dependent claims: 14
15. A computer system for classifying a web page, comprising:
means for accessing local web page information of and global web graph information about a plurality of first web pages, local web page information of and global web graph information about the web page, wherein:
each first web page of the plurality of first web pages is a web page of known first quality,
the local web page information of the web pages comprises text, clicking, domain, or time stamp information concerning the web pages, and
the global web graph information about the web pages comprises hyperlink or co-citation relationships among the web pages;
means for determining a quality of the web page using collective inference by applying collective inference for binary classification of the web page using the local web page information of the web page and the global web graph information about the web page and the local web page information of the plurality of first web pages and the global web graph information about the plurality of first web pages, comprising:
means for finding a minimum value of a regularized convex dual of a logistic regression loss function for a node of a graph; and
means for outputting an indication of the classification of the web page.
Dependent claims: 16, 17
18. A method comprising:
accessing, by one or more computing devices, local web page information of and global web graph information about a plurality of first web pages, local web page information of and global web graph information about a plurality of second web pages, and local web page information of and global web graph information about a third web page, wherein:
each first web page of the plurality of first web pages is of a known first quality;
each second web page of the plurality of second web pages is of a known second quality;
the local web page information of the web pages comprises text contained in the web pages, and
the global web graph information about the web pages comprises hyperlink relationships among the web pages;
determining, by the one or more computing devices, a quality of the third web page using collective inference by applying collective inference for binary classification of the third web page using the local web page information of and the global web graph information about the plurality of first web pages, the local web page information of and the global web graph information about the plurality of second web pages, and the local web page information of and the global web graph information about the third web page, wherein:
the collective inference is used to infer a quality score for the third web page; and
applying collective inference for binary classification of the third web page comprises:
computing at least one linear weight using a regularized linear prediction model based on the at least one dimensional vector of words;
determining a graph regularization condition for the graph;
deriving a regularization parameter from the graph regularization condition; and
estimating the quality of the third web page based on the at least one linear weight and the regularization parameter; and
outputting, by the one or more computing devices, an indication of the quality of the third web page.
Dependent claims: 19, 20, 21, 22, 23, 24
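Claims 1, 5, and 15 recite minimizing a regularized convex dual of a logistic regression loss function, but the claims as listed do not write that dual out. As background only, and assuming the usual logistic loss with labels folded into the margin $z$, the standard convex conjugate of that loss is shown below. The symbols $\phi$, $z$, and $\alpha$ are notation introduced here, not taken from the patent, and the dual the patent actually uses may add graph regularization terms that are not shown.

```latex
\phi(z) = \log\bigl(1 + e^{-z}\bigr),
\qquad
\phi^{*}(-\alpha)
  = \sup_{z}\,\bigl(-\alpha z - \phi(z)\bigr)
  = \alpha \log \alpha + (1 - \alpha)\log(1 - \alpha),
\quad \alpha \in [0, 1],
```

with the convention $0 \log 0 = 0$. For $\alpha \in (0, 1)$ the supremum is attained at $z = \log\bigl((1 - \alpha)/\alpha\bigr)$, and $\phi^{*}(-\alpha) = +\infty$ outside $[0, 1]$. A dual formulation of this kind typically has one such variable $\alpha$ per labeled node, which is consistent with the claim language "for a node of a graph".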
Specification