Query log mining for detecting spam hosts
Abstract
Disclosed are methods and apparatus for detecting spam hosts. In one embodiment, one or more graphs are generated using data obtained from a query log, where the one or more graphs include at least one of an anticlick graph or a view graph. Values of one or more syntactic features of the one or more graphs are ascertained. Values of one or more semantic features of the one or more graphs are determined by propagating categories from a web directory among nodes in each of the one or more graphs. Spam hosts are then detected based upon the values of the syntactic features and the semantic features.
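The graph-generation step can be illustrated with a minimal sketch. The log record layout `(query, shown_urls, clicked_urls)`, the sample data, and the function names below are assumptions made for illustration and are not taken from the patent. Node degree is used only as one plausible example of a syntactic feature, since the text here does not enumerate the features.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical query-log records: (query, URLs shown, URLs clicked).
LOG = [
    ("cheap flights", ["http://a.example/x", "http://spam.example/1"], ["http://a.example/x"]),
    ("hotel deals", ["http://b.example/y", "http://spam.example/2"], ["http://b.example/y"]),
    ("cheap flights", ["http://a.example/x", "http://spam.example/1"], []),
]


def build_anticlick_graph(log, host_based=True):
    """Return {query: {node: weight}} with edges only to results that were
    shown for the query but not clicked by the user who submitted it."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, shown, clicked in log:
        clicked_set = set(clicked)
        for url in shown:
            if url in clicked_set:
                continue  # clicked results are excluded from the anticlick graph
            # Host-based graph: collapse documents to their host name.
            node = urlparse(url).netloc if host_based else url
            graph[query][node] += 1
    return graph


def node_degree(graph):
    """Count the distinct queries linked to each node (an assumed
    example of a syntactic feature)."""
    degree = defaultdict(int)
    for nodes in graph.values():
        for node in nodes:
            degree[node] += 1
    return dict(degree)


if __name__ == "__main__":
    g = build_anticlick_graph(LOG)
    print({q: dict(n) for q, n in g.items()})
    print(node_degree(g))
```

Passing `host_based=False` keeps one node per document URL, which corresponds to the document-based variant of the graph.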
20 Claims
1. A method, comprising:
- generating by a network device one or more graphs using data obtained from a query log, the one or more graphs including an anticlick graph, wherein the anticlick graph represents information pertaining to documents in previously provided search results that, according to the data obtained from the query log, have not been clicked by a user that submitted a corresponding search query and does not represent information pertaining to documents in the previously provided search results that, according to the data obtained from the query log, have been clicked by the user that submitted the corresponding search query, wherein the anticlick graph includes one or more nodes representing or corresponding to documents that, according to the data obtained from the query log, have not been clicked by the user that submitted the corresponding search query;
- ascertaining by the network device values of one or more syntactic features of the one or more graphs;
- determining by the network device values of one or more semantic features of the one or more graphs by propagating categories from a web directory among nodes in each of the one or more graphs; and
- detecting by the network device spam hosts based upon the values of the syntactic features and the semantic features;
- wherein the anti-click graph includes a host-based graph or a document-based graph, wherein the nodes of the host-based graph includes one or more host nodes representing hosts corresponding to the documents that, according to the data obtained from the query log, have not been clicked by the user that submitted the corresponding search query, and wherein the nodes of the document-based graph includes one or more document nodes representing the documents that, according to the data obtained from the query log, have not been clicked by the user that submitted the corresponding search query.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

15. A non-transitory computer-readable medium storing thereon computer-readable instructions, comprising:
- instructions for generating one or more graphs using data obtained from a query log, the one or more graphs including an anticlick graph, wherein the anticlick graph represents information pertaining to search results that, according to the data obtained from the query log, have not been clicked by a user that submitted a corresponding search query and does not represent information pertaining to search results that, according to the data obtained from the query log, have been clicked by the user that submitted the corresponding search query, wherein the anticlick graph includes one or more edges, the one or more edges being between a query and the corresponding search results that, according to the data obtained from the query log, have not been clicked by the user that submitted the query;
- instructions for propagating categories from a web directory among nodes in each of the one or more graphs;
- instructions for determining values of one or more semantic features of the one or more graphs after propagating categories among the nodes; and
- instructions for detecting spam hosts based upon the values of the semantic features;
- wherein the anti-click graph includes a host-based graph or a document-based graph, wherein the nodes of the host-based graph includes one or more host nodes representing hosts corresponding to the documents that, according to the data obtained from the query log, have not been clicked by the user that submitted the corresponding search query, and wherein the nodes of the document-based graph includes one or more document nodes representing the documents that, according to the data obtained from the query log, have not been clicked by the user that submitted the corresponding search query.

16. An apparatus, comprising:
- a processor; and
- a memory, at least one of the processor or the memory being adapted for:
  - generating one or more graphs using data obtained from a query log, the one or more graphs including an anticlick graph, wherein the anticlick graph represents a search query and information pertaining to documents in corresponding search results that, according to the data obtained from the query log, have not been clicked by a user that submitted the search query, wherein the anticlick graph does not represent information pertaining to documents in the corresponding search results that, according to the data obtained from the query log, have been clicked by the user that submitted the search query, wherein the anticlick graph includes one or more edges, the one or more edges being between a query and the corresponding search results that, according to the data obtained from the query log, have not been clicked by the user that submitted the query;
  - ascertaining values of one or more syntactic features of the one or more graphs;
  - propagating categories from a web directory among nodes in each of the one or more graphs;
  - determining values of one or more semantic features of the one or more graphs; and
  - detecting spam hosts using the values of the syntactic features and the semantic features;
- wherein the anti-click graph includes a host-based graph or a document-based graph, wherein the nodes of the host-based graph includes one or more host nodes representing hosts corresponding to the documents that, according to the data obtained from the query log, have not been clicked by the user that submitted the corresponding search query, and wherein the nodes of the document-based graph includes one or more document nodes representing the documents that, according to the data obtained from the query log, have not been clicked by the user that submitted the corresponding search query.
Dependent claims: 17, 18, 19, 20
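The category-propagation step recited in the claims can be sketched as follows. This is one possible reading under stated assumptions, not the claimed implementation: the graph layout, the seed categories standing in for a web directory, the two-hop weighted averaging, and the use of entropy as a semantic feature are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical query -> node graph with edge weights (e.g. an anticlick graph).
GRAPH = {
    "cheap flights": {"travel.example": 3, "spam.example": 2},
    "python tutorial": {"code.example": 4, "spam.example": 1},
}
# Hypothetical directory categories known for a few seed hosts.
SEEDS = {"travel.example": {"Travel": 1.0}, "code.example": {"Computers": 1.0}}


def normalize(dist):
    """Scale category weights so they sum to 1 (empty if no weight)."""
    total = sum(dist.values())
    return {c: w / total for c, w in dist.items()} if total else {}


def propagate(graph, seeds):
    """Propagate categories from seed nodes to queries, then to unlabeled nodes."""
    # Step 1: each query takes the edge-weighted mix of its labeled neighbours.
    query_cats = {}
    for query, nodes in graph.items():
        mix = defaultdict(float)
        for node, weight in nodes.items():
            for cat, score in seeds.get(node, {}).items():
                mix[cat] += weight * score
        query_cats[query] = normalize(mix)
    # Step 2: each unlabeled node takes the edge-weighted mix of its queries.
    mixes = defaultdict(lambda: defaultdict(float))
    for query, nodes in graph.items():
        for node, weight in nodes.items():
            if node in seeds:
                continue  # seed nodes keep their directory categories
            for cat, score in query_cats[query].items():
                mixes[node][cat] += weight * score
    node_cats = dict(seeds)
    for node, mix in mixes.items():
        node_cats[node] = normalize(mix)
    return node_cats


def entropy(dist):
    """Shannon entropy in bits of a category distribution (assumed feature)."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)


if __name__ == "__main__":
    cats = propagate(GRAPH, SEEDS)
    for node, dist in cats.items():
        print(node, dist, round(entropy(dist), 3))
```

In this toy data, the unlabeled host receives categories from two unrelated queries, so its category distribution is more dispersed than that of the seed hosts. How such a value would be combined with the syntactic features in the detection step is not specified in the text above.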
Specification