Method and system for generating web pages for topics unassociated with a dominant URL
First Claim
1. A method for identifying a query topic unassociated with a dominant URL (uniform resource locator), the method comprising:
receiving an identification of a set of keywords associated with a query topic;
scanning a search log to identify search queries associated with the set of keywords;
grouping the identified search queries into clusters, wherein each of the clusters is associated with at least one URL returned by a search engine when performing a search for information using one or more of the identified search queries;
merging the clusters to generate an extended seed query string;
determining whether the extended seed query string is associated with a dominant URL by calculating a URL dominance score for queries of the extended seed query; and
if the calculated URL dominance score is within a threshold range, the extended seed query is unassociated with a dominant URL, generating a web page associated with the query topic, including:
retrieving documents associated with queries of the extended seed query string;
generating a compilation document that includes the retrieved documents;
calculating term frequency-inverse document frequency (TF-IDF) scores for n-grams (w) of phrase consisting of n consecutive words within the compilation document; and
outputting the n-grams (w) having TF-IDF scores above a predetermined threshold;
wherein the TF-IDF score for the n-grams (w) is calculated by multiplying term frequency (TF) and inverse document frequency (IDF), where TF represents the number of times the n-gram (w) is contained within the document d, and IDF represents the total number of documents (by URL) divided by the number of documents that contain the n-gram.
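The scoring step recited at the end of claim 1 can be illustrated with a minimal Python sketch. It follows the claim's wording literally: TF is a raw occurrence count and IDF is a plain ratio (total documents by URL divided by documents containing the n-gram), with no logarithm. Several details are assumptions not found in the claim: the function names `ngrams` and `tfidf_ngrams`, whitespace tokenization with lowercasing, treating "document d" as the whole compilation, and representing the retrieved documents as a dict keyed by URL.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words (lowercased) as a tuple."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def tfidf_ngrams(docs_by_url, n, threshold):
    """Score n-grams over a compilation of documents keyed by URL.

    TF  = occurrences of the n-gram across the compilation (assumed reading
          of "document d").
    IDF = total documents / documents containing the n-gram (plain ratio,
          as the claim words it).
    Returns only the n-grams whose TF * IDF exceeds the threshold.
    """
    tf = Counter()
    df = Counter()
    for text in docs_by_url.values():
        grams = ngrams(text, n)
        tf.update(grams)       # count every occurrence
        df.update(set(grams))  # count each document at most once
    total = len(docs_by_url)
    scores = {g: tf[g] * (total / df[g]) for g in tf}
    return {g: s for g, s in scores.items() if s > threshold}


# Hypothetical retrieved documents, keyed by URL.
docs = {
    "https://a.example/1": "solar oven plans solar oven tips",
    "https://b.example/2": "solar oven recipes",
    "https://c.example/3": "camping recipes",
}
top = tfidf_ngrams(docs, n=2, threshold=4.0)
```

With a plain-ratio IDF, an n-gram that occurs once in each document containing it always scores exactly the total document count, so the threshold mainly separates n-grams that repeat within documents.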
Abstract
Techniques are provided for identifying topics that are unassociated with a dominant URL. A set of keywords associated with a topic is identified. A search log is scanned to identify search queries associated with the set of keywords. The identified search queries are grouped into clusters. Clusters associated with similar URLs are merged to generate an extended seed query string. The extended seed query string is analyzed to determine whether it relates to an existing dominant URL. If the extended seed query string is determined to be unassociated with an existing dominant URL, a web page associated with the topic may be generated.
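The clustering, merging, and dominance test summarized in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented method: the search log is modeled as a list of (query, URL) pairs, "similar URLs" is read as "same hostname", and the URL dominance score is assumed to be the share of logged results going to the single most frequent URL. The source does not define the score or the similarity test, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def cluster_by_url(log, keywords):
    """Group logged queries that contain a keyword by the URL they led to."""
    clusters = defaultdict(set)
    for query, url in log:
        if any(k in query.lower().split() for k in keywords):
            clusters[url].add(query)
    return clusters


def merge_clusters(clusters):
    """Merge clusters whose URLs share a hostname (assumed similarity test)."""
    merged = defaultdict(set)
    for url, queries in clusters.items():
        merged[urlsplit(url).netloc] |= queries
    return list(merged.values())


def dominance_score(log, queries):
    """Assumed score: fraction of log entries for these queries on the top URL."""
    hits = Counter(url for query, url in log if query in queries)
    total = sum(hits.values())
    return max(hits.values()) / total if total else 0.0


# Hypothetical search log of (query, URL) pairs.
log = [
    ("solar oven plans", "https://a.example/plans"),
    ("solar oven diy", "https://a.example/diy"),
    ("solar oven recipes", "https://b.example/r"),
    ("solar oven plans", "https://c.example/x"),
    ("weather today", "https://w.example/"),
]
merged = merge_clusters(cluster_by_url(log, {"solar", "oven"}))
seed = set().union(*merged)          # assumed form of the extended seed query
score = dominance_score(log, seed)   # low score: no single URL dominates
```

Under these assumptions, a topic would qualify for page generation when `score` falls inside a chosen low range, for example below 0.5; the actual threshold range is not given in the source.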
17 Claims
8. A method for creating a web page associated with a topic, the method comprising:
receiving information associated with a query topic determined to be unassociated with a dominant URL (uniform resource locator);
generating content for a web page related to the query topic; and
creating a web page that includes the generated content,
receiving an identification of a set of keywords associated with the query topic;
scanning a log of search queries to identify search queries associated with the set of keywords;
grouping the identified search queries into clusters, wherein each of the clusters is associated with at least one URL returned by a search engine when performing a search for information using one or more of the identified search queries;
merging the clusters to generate an extended seed query string;
determining whether the extended seed query string is associated with a dominant URL by calculating a URL dominance score for queries of the extended seed query; and
if the calculated URL dominance score is within a threshold range, the extended seed query is unassociated with a dominant URL, creating the web page;
wherein generating content for a web page related to the query topic includes:
retrieving documents associated with queries of the extended seed query string;
generating a compilation document that includes the retrieved documents;
calculating TF-IDF scores for n-grams (w) of phrase consisting of n consecutive words within the compilation document; and
outputting the n-grams (w) having TF-IDF scores above a predetermined threshold;
wherein the TF-IDF score for the n-gram (w) is calculated by multiplying term frequency (TF) and inverse document frequency (IDF), where TF represents the number of times the n-gram (w) is contained within the document d, and IDF represents the total number of documents (by URL) divided by the number of documents that contain the n-gram.
12. A system for outputting content to be displayed by a web page, the system comprising:
one or more processors operable with computer program code to implement:
a topic identification module that is configured to identify one or more topics unassociated with a dominant URL (uniform resource locator); and
a content generation module that is configured to generate content for a web page associated with the identified one or more topics unassociated with the dominant URL and transmit the content for the webpage;
receiving an identification of a set of keywords associated with the query topic;
scanning a log of search queries to identify search queries associated with the set of keywords;
grouping the identified search queries into clusters, wherein each of the clusters is associated with at least one URL returned by a search engine when performing a search for information using one or more of the identified search queries;
merging the clusters to generate an extended seed query string;
determining whether the extended seed query string is associated with a dominant URL by calculating a URL dominance score for queries of the extended seed query; and
if the calculated URL dominance score is within a threshold range, the extended seed query is unassociated with a dominant URL, creating the web page;
wherein generating content for a web page related to the query topic includes:
retrieving documents associated with queries of the extended seed query string;
generating a compilation document that includes the retrieved documents;
calculating TF-IDF scores for n-grams (w) of phrase consisting of n consecutive words within the compilation document; and
outputting the n-grams (w) having TF-IDF scores above a predetermined threshold;
wherein the TF-IDF score for the n-grams (w) is calculated by multiplying term frequency (TF) and inverse document frequency (IDF), where TF represents the number of times the n-gram (w) is contained within the document d, and IDF represents the total number of documents (by URL) divided by the number of documents that contain the n-gram.
Specification