Method and system for generating web pages for topics unassociated with a dominant URL

US 8,799,260 B2
Filed: 12/17/2010
Issued: 08/05/2014
Est. Priority Date: 12/17/2010
Status: Expired due to Fees

Abstract

Techniques are provided for identifying topics that are unassociated with a dominant URL. A set of keywords associated with a topic is identified. A search log is scanned to identify search queries associated with the set of keywords. The identified search queries are grouped into clusters. Clusters associated with similar URLs are merged to generate an extended seed query string. The extended seed query string is analyzed to determine whether it relates to an existing dominant URL. If the extended seed query string is determined to be unassociated with an existing dominant URL, a web page associated with the topic may be generated.
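
The scan, cluster, and merge steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the log is assumed to be a list of (query, URL) pairs, a query "matches" when it shares a word with the keyword set, cluster similarity is assumed to be Jaccard overlap of URL sets with a 0.5 cutoff, and the extended seed is represented as the sorted queries of a merged cluster. All function names are hypothetical.

```python
from collections import defaultdict


def scan_log(log, keywords):
    """Keep (query, url) rows whose query shares a word with the keywords.

    The word-overlap match rule is an assumption made for this sketch.
    """
    kws = {k.lower() for k in keywords}
    return [(q, u) for q, u in log if kws & set(q.lower().split())]


def initial_clusters(rows):
    """Start with one cluster per query, tagged with the URLs returned for it."""
    urls_by_query = defaultdict(set)
    for query, url in rows:
        urls_by_query[query].add(url)
    return [(set(urls), {query}) for query, urls in urls_by_query.items()]


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_clusters(clusters, threshold=0.5):
    """Greedily merge clusters whose URL sets are similar enough."""
    merged = []
    for urls, queries in clusters:
        for m in merged:
            if jaccard(m[0], urls) >= threshold:
                m[0] |= urls
                m[1] |= queries
                break
        else:
            merged.append([set(urls), set(queries)])
    return merged


def extended_seed(cluster):
    """Represent the extended seed as the sorted queries of a merged cluster."""
    return sorted(cluster[1])


# Hypothetical search log used only to exercise the sketch.
log = [
    ("cheap flights paris", "a.com"),
    ("cheap flights paris", "b.com"),
    ("paris flights deals", "a.com"),
    ("paris flights deals", "b.com"),
    ("paris hotels", "c.com"),
    ("weather london", "d.com"),
]
merged = merge_clusters(initial_clusters(scan_log(log, ["paris"])))
print([extended_seed(c) for c in merged])
```

A greedy single pass keeps the sketch short; a production system would more likely cluster on a query-to-URL graph, as dependent claim 7 suggests.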

17 Claims

1. A method for identifying a query topic unassociated with a dominant URL (uniform resource locator), the method comprising:
- receiving an identification of a set of keywords associated with a query topic;
  
  scanning a search log to identify search queries associated with the set of keywords;
  
  grouping the identified search queries into clusters, wherein each of the clusters is associated with at least one URL returned by a search engine when performing a search for information using one or more of the identified search queries;
  
  merging the clusters to generate an extended seed query string;
  
  determining whether the extended seed query string is associated with a dominant URL by calculating a URL dominance score for queries of the extended seed query; and
  
  if the calculated URL dominance score is within a threshold range, the extended seed query is unassociated with a dominant URL, generating a web page associated with the query topic, including retrieving documents associated with queries of the extended seed query string;
  
  generating a compilation document that includes the retrieved documents;
  
  calculating term frequency-inverse document frequency (TF-IDF) scores for n-grams (w) of phrase consisting of n consecutive words within the compilation document; and
  
  outputting the n-grams (w) having TF-IDF scores above a predetermined threshold;
  
  wherein the TF-IDF score for the n-grams (w) is calculated by multiplying term frequency (TF) and inverse document frequency (IDF), where TF represents the number of times the n-gram (w) is contained within the document d, and IDF represents the total number of documents (by URL) divided by the number of documents that contain the n-gram.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The method of claim 1, further comprising:
    - generating a title and content for the identified query topic;
      
      and creating a web page for the identified query topic using the generated title and content.
  - 3. The method of claim 1 wherein receiving an identification of a set of keywords associated with a query topic includes receiving input from a user identifying keywords of the set of keywords.
  - 4. The method of claim 1 wherein receiving an identification of a set of keywords associated with a query topic includes:
    - scanning a bid query log that relates queries with bids associated with sponsorship of the queries;
      
      clustering queries of the bid query log; and
      
      selecting a set of keywords that includes at least one of the clustered queries.
  - 5. The method of claim 1, wherein determining whether the extended seed query string is associated with a dominant URL includes:
    - calculating a first score associated with a user interest in the extended seed query string;
      
      calculating a second score associated with an advertiser interest in the extended seed query string;
      
      calculating a third score associated with a dominance of one or more URLs associated with the extended seed query string; and
      
      if one or more of the first score, second score, and third score satisfy a predetermined criteria, determining that the extended seed query string is unassociated with a dominant URL.
  - 6. The method of claim 1, wherein determining whether the extended seed query string is associated with a dominant URL includes:
    - calculating a score associated with a dominance of one or more URLs associated with the extended seed query string; and
      
      if the calculated score satisfies a predetermined threshold, determining that the extended seed query string is unassociated with a dominant URL.
  - 7. The method of claim 1, wherein grouping the identified search queries into clusters includes generating an impression graph that associates search queries to URLs.
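
The TF-IDF scoring recited in claim 1 can be illustrated with a short sketch. It follows the claim's wording, where IDF is the plain ratio of total documents to documents containing the n-gram (no logarithm). Several details are assumptions made only for illustration: documents arrive as a mapping from URL to text, words are split on whitespace, TF is counted over the compilation of all retrieved documents, and n-grams do not cross document boundaries. The function names are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-grams (tuples of n consecutive words) of a text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def tfidf_ngrams(docs_by_url, n=2, threshold=0.0):
    """Score n-grams as TF * (N / DF) and keep those above the threshold.

    docs_by_url maps URL -> document text (assumed input shape).
    """
    # Count n-grams per document so none spans a document boundary.
    per_doc = {url: Counter(ngrams(text, n)) for url, text in docs_by_url.items()}

    # TF: occurrences across the compilation of all retrieved documents.
    tf = Counter()
    for counts in per_doc.values():
        tf.update(counts)

    total_docs = len(docs_by_url)
    scores = {}
    for gram, count in tf.items():
        # DF: number of documents (by URL) that contain the n-gram.
        df = sum(1 for counts in per_doc.values() if gram in counts)
        scores[gram] = count * (total_docs / df)
    return {gram: s for gram, s in scores.items() if s > threshold}


# Hypothetical documents used only to exercise the sketch.
docs = {
    "u1": "red apple pie red apple",
    "u2": "green apple pie",
    "u3": "blue sky",
}
print(tfidf_ngrams(docs, n=2, threshold=3.0))
```

With these three toy documents, only the bigram that is both frequent and confined to one document clears a threshold of 3.0; bigrams shared across documents are discounted by the larger document count in the IDF denominator.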

8. A method for creating a web page associated with a topic, the method comprising:
- receiving information associated with a query topic determined to be unassociated with a dominant URL (uniform resource locator);
  
  generating content for a web page related to the query topic;
  
  and creating a web page that includes the generated content, receiving an identification of a set of keywords associated with the query topic;
  
  scanning a log of search queries to identify search queries associated with the set of keywords;
  
  grouping the identified search queries into clusters, wherein each of the clusters is associated with at least one URL returned by a search engine when performing a search for information using one or more of the identified search queries;
  
  merging the clusters to generate an extended seed query string;
  
  determining whether the extended seed query string is associated with a dominant URL by calculating a URL dominance score for queries of the extended seed query; and
  
  if the calculated URL dominance score is within a threshold range, the extended seed query is unassociated with a dominant URL, creating the web page;
  
  wherein generating content for a web page related to the query topic includes;
  
  retrieving documents associated with queries of the extended seed query string;
  
  generating a compilation document that includes the retrieved documents;
  
  calculating TF-IDF scores for n-grams (w) of phrase consisting of n consecutive words within the compilation document; and
  
  outputting the n-grams (w) having TF-IDF scores above a predetermined threshold;
  
  wherein the TF-IDF score for the n-gram (w) is calculated by multiplying term frequency (TF) and inverse document frequency (IDF), where TF represents the number of times the n-gram (w) is contained within the document d, and IDF represents the total number of documents (by URL) divided by the number of documents that contain the n-gram.
- View Dependent Claims (9, 10, 11)
- - 9. The method of claim 8, wherein creating the web page comprises:
    - generating a title for the web page based on the generated content.
  - 10. The method of claim 8, wherein creating the web page comprises:
    - generating body text for the web page based on the generated content.
  - 11. The method of claim 8, wherein receiving information associated with a query topic determined to be unassociated with a dominant URL includes:
    - receiving information identifying a score associated with a user interest in the query topic;
      
      receiving information identifying a score associated with an advertiser interest in the query topic; and
      
      receiving information identifying a score associated with a dominance of one or more URLs associated with the query topic.

12. A system for outputting content to be displayed by a web page, the system comprising:
- one or more processors operable with computer program code to implement;
  
  a topic identification module that is configured to identify one or more topics unassociated with a dominant URL (uniform resource locator); and
  
  a content generation module that is configured to generate content for a web page associated with the identified one or more topics unassociated with the dominant URL and transmit the content for the webpage;
  
  receiving an identification of a set of keywords associated with the query topic;
  
  scanning a log of search queries to identify search queries associated with the set of keywords;
  
  grouping the identified search queries into clusters, wherein each of the clusters is associated with at least one URL returned by a search engine when performing a search for information using one or more of the identified search queries;
  
  merging the clusters to generate an extended seed query string;
  
  determining whether the extended seed query string is associated with a dominant URL by calculating a URL dominance score for queries of the extended seed query; and
  
  if the calculated URL dominance score is within a threshold range, the extended seed query is unassociated with a dominant URL, creating the web page;
  
  wherein generating content for a web page related to the query topic includes;
  
  retrieving documents associated with queries of the extended seed query string;
  
  generating a compilation document that includes the retrieved documents;
  
  calculating TF-IDF scores for n-grams (w) of phrase consisting of n consecutive words within the compilation document; and
  
  outputting the n-grams (w) having TF-IDF scores above a predetermined threshold;
  
  wherein the TF-IDF score for the n-grams (w) is calculated by multiplying term frequency (TF) and inverse document frequency (IDF), where TF represents the number of times the n-gram (w) is contained within the document d, and IDF represents the total number of documents (by URL) divided by the number of documents that contain the n-gram.
- View Dependent Claims (13, 14, 15, 16, 17)
- - 13. The system of claim 12, wherein the topic identification module includes:
    - a keyword receiver that is configured to receive a set of keywords associated with a query topic;
      
      a query identifier that is configured to scan a log of search queries to identify search queries associated with the set of keywords;
      
      a query grouper that is configured to group the identified search queries into clusters, wherein each of the clusters is associated with at least one URL returned by a search engine when performing a search for information using one or more of the identified search queries;
      
      a cluster merger that is configured to merge clusters associated with similar URLs to generate an extended seed query string; and
      
      a URL dominance determiner that is configured to determine whether the extended seed query string is associated with a dominant URL.
  - 14. The system of claim 12, wherein the content generation module includes:
    - a document retriever that is configured to retrieve one or more documents associated with the identified one or more topics;
      
      a content generator that is configured to generate content for a web page related to the identified one or more topics based on the retrieved one or more documents; and
      
      a page creator that is configured to create a web page that includes the generated content.
  - 15. The system of claim 14, wherein the content generator is configured to generate content from which a title of the web page is selected by page creator.
  - 16. The system of claim 14, wherein the content generator is configured to generate content from which body text of the web page is selected by page creator.
  - 17. The system of claim 13, wherein the URL dominance determiner includes:
    - a user interest score calculator that is configured to calculate a score associated with a user interest in the identified one or more topics;
      
      an advertiser interest score calculator that is configured to calculate a score associated with advertiser interest in the identified one or more topics; and
      
      a URL dominance score calculator that is configured to calculate a score associated with dominance of a single URL for the identified one or more topics.
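
The claims recite a URL dominance score and a threshold range but do not define either. The sketch below shows one plausible reading, purely as an assumption: dominance is taken to be the share of all clicks captured by the single most-clicked URL, and a topic is treated as lacking a dominant URL when that share falls in the range [0.0, 0.5). The click-count input, the range bounds, and the function names are hypothetical.

```python
from collections import Counter


def url_dominance(clicks):
    """Share of all clicks captured by the most-clicked URL (assumed definition).

    clicks is an iterable of (url, click_count) pairs gathered over the
    queries of the extended seed (assumed input shape).
    """
    totals = Counter()
    for url, count in clicks:
        totals[url] += count
    total = sum(totals.values())
    return max(totals.values()) / total if total else 0.0


def lacks_dominant_url(clicks, low=0.0, high=0.5):
    """True when the dominance score falls in the assumed range [low, high)."""
    return low <= url_dominance(clicks) < high


# Hypothetical click data: clicks spread across sites vs. one site taking most.
spread = [("a.com", 40), ("b.com", 35), ("c.com", 25)]
concentrated = [("a.com", 90), ("b.com", 10)]
print(lacks_dominant_url(spread), lacks_dominant_url(concentrated))
```

The user-interest and advertiser-interest scores named in claims 5, 11, and 17 could be combined with this score in a similar threshold test; how they are weighted is not stated in the claims.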

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Papadimitriou, Panagiotis; Krishnamurthy, Prabhakar; Schmidt, Frederick Kenneth
Primary Examiner(s): Uddin, Md. I
Application Number: US12/971,608
Publication Number: US 20120158693A1
Time in Patent Office: 1,327 Days
Field of Search: 707/706, 707/708, 707/736, 707/750, 707/776, 707/999.002
US Class Current: 707/708
CPC Class Codes:
G06F 16/951 Indexing; Web crawling tech...
G06F 16/9566 URL specific, e.g. using al...
