Method and apparatus for representing text using search engine, document collection, and hierarchal taxonomy

US 20070136256A1
Filed: 12/01/2006
Published: 06/14/2007
Est. Priority Date: 12/01/2005
Status: Active Grant


Abstract

Embodiments of a data representation system for describing specific data sets, such as documents, web pages, or search engine queries, based on data tokens, such as words or n-grams, contained in a collection of documents are described. Such a system can be used in any type of information retrieval application, such as a document, web page, or online advertisement serving process, based on an information request, such as a query executed through an Internet search engine. For example, when a search is performed at a search engine, a content provider uses the system to represent the search query and compares the query representation against representations of a set of content in order to identify, retrieve and aggregate the content from the set most relevant to the search query, in the form of a web page or other data unit for display or access through the web browser.

35 Claims

1. A method for obtaining a representation of one or more sets of one or more words from a collection of one or more documents, comprising:
- issuing the one or more sets of one or more words as a query to a search engine;
  
  receiving at least one result from the search engine in response to the query; and
  
  selecting a number of closest results from the search engine, each result of the closest results representing a document selected from the collection of documents, wherein for each selected closest result document some content is retrieved, the retrieved content comprising at least one of a document summary, document body, and document body with anchor text.
- - 2. The method of claim 1 further comprising parsing the retrieved content using one or more tokens, wherein the one or more tokens comprise one or more phrases, and wherein a phrase comprises an ordered set of one or more words.
  - 3. The method of claim 2 wherein the one or more phrases comprises one of ordered sets of one or more words that appear in order and consecutively in the retrieved content, and ordered sets of one or more words that jointly appear in order and consecutively in the retrieved content and in a previously selected list of phrases, wherein the previously selected list of phrases comprises all query logs issued to the search engine.
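
Claims 1 to 3 can be illustrated with a minimal Python sketch. It assumes the search engine has already returned a ranked list of results and that each result is a dictionary with a `summary` field; the helper names `select_closest` and `phrases`, the field name, and the `max_len` cutoff are hypothetical and do not come from the patent. The optional `allowed` set stands in for the previously selected list of phrases, such as a query log.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

Phrase = Tuple[str, ...]

def select_closest(results: List[Dict[str, str]], k: int) -> List[str]:
    """Keep the retrieved content of the k highest-ranked results.

    Assumption: `results` is already ordered from closest to farthest.
    """
    return [r["summary"] for r in results[:k]]

def phrases(text: str, max_len: int = 2,
            allowed: Optional[Set[Phrase]] = None) -> List[Phrase]:
    """Return ordered, consecutive word sequences of up to max_len words.

    If `allowed` is given, only phrases that also appear in that list
    (for example, phrases taken from a query log) are kept.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    found: List[Phrase] = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            if allowed is None or gram in allowed:
                found.append(gram)
    return found
```

Calling `phrases` on each string returned by `select_closest` yields the tokens from which a representation of the query can later be computed.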

4. The method of claim 3 wherein the retrieved content together with the one or more tokens present in the retrieved content is used to compute one or more vector representations of the one or more sets of one or more words, and the number of occurrences of the one or more tokens in the retrieved content is used to compute a token frequency, and wherein the number of occurrences of the one or more tokens in a previously selected collection of content is used to compute an inverse token frequency.
- - 5. The method of claim 4 further comprising computing a vector representation of the one or more sets of one or more words using the token frequency and the inverse token frequency.
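
The token frequency and inverse token frequency of claims 4 and 5 can be sketched as follows. The claims do not give a formula, so the smoothed logarithmic weighting below is an assumption, as are the function name `tf_idf_vector` and the representation of the previously selected collection as a list of token lists.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence

def tf_idf_vector(tokens: Sequence[Hashable],
                  background: List[Sequence[Hashable]]) -> Dict[Hashable, float]:
    """Weight each token by its frequency and inverse frequency.

    tokens:     tokens parsed from the retrieved content.
    background: previously selected collection, one token list per item.
    Assumed weighting: tf * log((N + 1) / (df + 1)), where N is the size
    of the background collection and df counts items containing the token.
    """
    tf = Counter(tokens)
    n_items = len(background)
    df: Counter = Counter()
    for item in background:
        df.update(set(item))  # count each token once per background item
    return {tok: count * math.log((n_items + 1) / (df[tok] + 1))
            for tok, count in tf.items()}
```

The result is a sparse vector, keyed by token, that serves as the representation of the original set of words.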

6. A method for obtaining a representation of one or more sets of one or more words at different levels of resolution from a base collection of documents, comprising:
- defining one or more hierarchical taxonomies of the base collection of documents;
  
  issuing the one or more sets of one or more words as one or more queries to one or more search engines;
  
  receiving results from the one or more search engines in response to the one or more queries, wherein the results comprise one or more ranked documents at the one or more search engines together with the assignment of the one or more ranked documents to the nodes of the one or more taxonomies at a selected resolution, wherein for each ranked document some content is retrieved, the content comprising at least one of the document title, document summary, document body, and document body together with anchor text; and
  
  aggregating the ranked documents to the selected nodes of the one or more taxonomies using the assignments of the ranked documents to the nodes of the one or more taxonomies at the selected resolution.
- - 7. The method of claim 6 wherein the aggregation of the one or more selected ranked documents to the selected nodes of the one or more taxonomies comprises concatenating the content of all the selected ranked documents that are assigned to the same node of the one or more taxonomies.
  - 8. The method of claim 7 wherein the one or more taxonomy aggregated retrieved content is parsed using one or more tokens.
  - 9. The method of claim 8 wherein the one or more tokens comprise one or more phrases, each phrase of the one or more phrases being an ordered set of one or more words.
  - 10. The method of claim 9 wherein the one or more phrases comprises one of ordered sets of one or more words that appear in order and consecutively in the retrieved content, and ordered sets of one or more words that jointly appear in order and consecutively in the retrieved content and in a previously selected list of phrases, wherein the previously selected list of phrases comprises all query logs issued to the search engine.
  - 11. The method of claim 8 further comprising computing one or more vector representations of the one or more sets of one or more words using the one or more taxonomy aggregated retrieved content together with the one or more tokens present in the taxonomy aggregated retrieved content.
  - 12. The method of claim 11 wherein the number of occurrences of the one or more tokens in the one or more retrieved content is used to compute a token frequency, and the number of occurrences of the one or more tokens in a previously selected collection of content is used to compute an inverse token frequency.
  - 13. The method of claim 12 wherein the token frequency and the inverse token frequency are used to compute a vector representation of the one or more sets of one or more words at each taxonomy aggregated retrieved content.
  - 14. The method of claim 13 wherein the vector representation of the one or more sets of one or more words is selected from the group consisting of:
    - the union of the vectors representing the one or more sets of one or more words at the one or more taxonomy aggregated retrieved content;
      
      the concatenation of the vectors representing the one or more sets of one or more words at the one or more taxonomy aggregated retrieved content; and
      
      the average of the vectors representing the one or more sets of one or more words at the one or more taxonomy aggregated retrieved content.
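
A rough sketch of the aggregation in claim 7 and two of the combinations in claim 14 follows. It assumes each ranked document arrives as a `(node, content)` pair and that per-node vectors are sparse dictionaries; all function names are hypothetical. The union option is left out because the claims do not say how overlapping token weights would be merged.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = Dict[str, float]

def aggregate_by_node(ranked_docs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Concatenate the content of all documents assigned to the same node."""
    by_node: Dict[str, List[str]] = defaultdict(list)
    for node, content in ranked_docs:
        by_node[node].append(content)
    return {node: " ".join(parts) for node, parts in by_node.items()}

def concatenate_vectors(per_node: Dict[str, Vector]) -> Dict[Tuple[str, str], float]:
    """Concatenation: keep every node's weights in its own coordinates."""
    return {(node, tok): w
            for node, vec in per_node.items()
            for tok, w in vec.items()}

def average_vectors(per_node: Dict[str, Vector]) -> Vector:
    """Average: mean weight of each token across all nodes."""
    if not per_node:
        return {}
    total: Dict[str, float] = defaultdict(float)
    for vec in per_node.values():
        for tok, w in vec.items():
            total[tok] += w
    return {tok: w / len(per_node) for tok, w in total.items()}
```

Concatenation preserves which taxonomy node each weight came from, while averaging collapses the nodes into a single vector over tokens.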

15. A method for computing the similarity of one or more source sets of one or more words to another set of one or more target sets of one or more words using one or more documents and one or more search engine indexes, comprising:
- computing the vector representation of the one or more source sets of one or more words;
  
  computing the vector representation of the one or more target sets of one or more words; and
  
  computing the similarity between the target and source vector representations using a vector similarity measure selected from the group consisting of cosine product, Hellinger distance, and Kullback-Leibler distance.

16. A method for computing the similarity of one or more source sets of one or more words to another set of one or more target sets of one or more words using one or more documents, one or more taxonomies and one or more search engine indexes, comprising:
- computing the vector representation of the one or more source sets of one or more words;
  
  computing the vector representation of the one or more target sets of one or more words; and
  
  computing the similarity between the target and source vector representations using a vector similarity measure selected from the group consisting of cosine product, Hellinger distance, and Kullback-Leibler distance.
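
The three similarity measures named in claims 15 and 16 can be sketched on sparse vectors as shown below. The claims do not specify normalization or smoothing, so two assumptions are made here: weights are non-negative and are rescaled to sum to one before the Hellinger and Kullback-Leibler computations, and a small `eps` is added in the Kullback-Leibler case to avoid division by zero. Both input vectors are assumed to be non-empty with a positive total weight.

```python
import math
from typing import Dict, Iterable

Vector = Dict[str, float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def _normalize(u: Vector, keys: Iterable[str], eps: float = 0.0) -> Vector:
    """Rescale weights over `keys` into a probability distribution."""
    keys = list(keys)
    total = sum(u.get(k, 0.0) + eps for k in keys)
    return {k: (u.get(k, 0.0) + eps) / total for k in keys}

def hellinger(u: Vector, v: Vector) -> float:
    """Hellinger distance, between 0 (identical) and 1 (disjoint support)."""
    keys = set(u) | set(v)
    p, q = _normalize(u, keys), _normalize(v, keys)
    return math.sqrt(0.5 * sum((math.sqrt(p[k]) - math.sqrt(q[k])) ** 2
                               for k in keys))

def kl_divergence(u: Vector, v: Vector, eps: float = 1e-9) -> float:
    """Kullback-Leibler divergence of v from u, with additive smoothing."""
    keys = set(u) | set(v)
    p, q = _normalize(u, keys, eps), _normalize(v, keys, eps)
    return sum(p[k] * math.log(p[k] / q[k]) for k in keys)
```

Cosine grows with similarity, whereas the Hellinger and Kullback-Leibler values grow with dissimilarity, so a caller ranking by similarity would need to invert the latter two.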

17. A method of categorizing one or more source sets of one or more words into one or more target categories using one or more documents and at least one search engine index, comprising:
- identifying each target category with one or more representative sets of one or more words for the category; and
  
  computing the similarity between the source sets of one or more words and the union of the one or more representative sets of one or more words for the one or more target categories.
- - 18. The method of claim 17 wherein the one or more representative sets of one or more words for the category are selected from the group consisting of manually identified representative sets of one or more words for the category, and automatically identified sets of one or more words for the category using one or more of entropy and mutual information.
  - 19. The method of claim 17 wherein the one or more source sets of one or more words are assigned to the target category with the highest similarity to the source sets of one or more words.
  - 20. The method of claim 17 wherein the one or more source sets of one or more words are probabilistically assigned to the target categories with the probabilities computed as a function of the similarities between the source sets of one or more words and the target categories.
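
The categorization of claims 17, 19 and 20 might look like the sketch below. It makes several simplifying assumptions: raw word counts stand in for the search-engine-derived vector representations, cosine is used as the similarity, and a softmax is chosen as the function that turns similarities into probabilities. The names `bag` and `categorize` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def bag(sets_of_words: List[List[str]]) -> Counter:
    """Union of several word sets, kept as a bag of word counts."""
    counts: Counter = Counter()
    for words in sets_of_words:
        counts.update(words)
    return counts

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(w * v[t] for t, w in u.items())  # Counter returns 0 if absent
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def categorize(source: List[List[str]],
               categories: Dict[str, List[List[str]]]
               ) -> Tuple[str, Dict[str, float]]:
    """Return the most similar category and a probability per category."""
    src = bag(source)
    sims = {name: cosine(src, bag(rep)) for name, rep in categories.items()}
    best = max(sims, key=sims.get)                 # highest similarity wins
    z = sum(math.exp(s) for s in sims.values())    # softmax (assumed choice)
    probs = {name: math.exp(s) / z for name, s in sims.items()}
    return best, probs
```

The first return value corresponds to assignment to the single most similar category; the second supports probabilistic assignment across all categories.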

21. A method of categorizing one or more source sets of one or more words into one or more target categories using one or more documents, one or more search engine indexes and one or more hierarchal taxonomies, comprising:
- identifying each target category with one or more representative sets of one or more words for the category; and
  
  computing the similarity between the source sets of the one or more words and the union of the one or more representative sets of one or more words for the one or more target categories.
- - 22. The method of claim 21 wherein the one or more source sets of one or more words are assigned to the target category with the highest similarity to the source sets of one or more words.
  - 23. The method of claim 21 wherein the one or more source sets of one or more words are probabilistically assigned to the target categories with the probabilities computed as a function of the similarities between the source sets of one or more words and the target categories.

24. A method of computing the similarity of one or more search engine queries to one or more advertising messages using one or more documents and at least one search engine index, comprising:
- associating the one or more advertising messages to a set of representative sets of one or more words;
  
  treating the one or more search engine queries as a set of one or more words; and
  
  defining the similarity between the one or more advertising messages and the one or more search engine queries to be the similarity between the set of representative sets of one or more words associated with the one or more advertising messages and the one or more search engine queries.

25. A method for identifying from a set of one or more advertising messages, a particular advertising message to display in response to one or more search engine queries using a collection of one or more documents and one or more search engine indexes, comprising:
- computing the similarity between the one or more advertising messages and the one or more search engine queries;
  
  ranking the advertising messages by their similarity to the search engine queries; and
  
  normalizing the vector of similarities between the advertising messages and the search engine queries to a probability distribution.
- - 26. The method of claim 25, wherein the one or more advertising messages are displayed in proportion to their normalized similarities to the search engine queries.
  - 27. The method of claim 26, wherein the advertising messages that have the highest similarity to the one or more search engine queries are displayed.
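
The ranking, normalization and proportional display of claims 25 and 26 can be sketched as below. It assumes the similarities have already been computed, are non-negative, and arrive as a dictionary keyed by a hypothetical ad identifier; dividing by the sum is one simple way to obtain a probability distribution and is not prescribed by the claims. `pick_ad` assumes a non-empty distribution.

```python
import random
from typing import Dict, List, Tuple

def rank_and_normalize(sims: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank ads by similarity and rescale the similarities to sum to 1."""
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(sims.values())
    if total <= 0:
        # No signal at all: fall back to a uniform distribution.
        return [(ad, 1.0 / len(ranked)) for ad, _ in ranked]
    return [(ad, s / total) for ad, s in ranked]

def pick_ad(dist: List[Tuple[str, float]], rng: random.Random) -> str:
    """Draw one ad so that each is shown in proportion to its probability."""
    ads, probs = zip(*dist)
    return rng.choices(ads, weights=probs, k=1)[0]
```

Taking the first few entries of the ranked list instead of sampling corresponds to displaying only the most similar advertising messages.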

28. A method of computing the similarity of one or more search engine queries to one or more “advertising messages” using one or more documents, one or more taxonomies, and one or more search engine indexes, comprising:
- associating the one or more advertising messages to a set of representative sets of one or more words;
  
  treating each of the one or more search engine queries as a set of one or more words;
  
  defining the similarity between the one or more advertising messages and the one or more search engine queries to be the similarity between the set of representative sets of one or more words associated with the one or more advertising messages and the one or more search engine queries; and
  
  computing the similarity between the set of representative sets of one or more words associated with the one or more advertising messages and the one or more search engine queries.

29. A method of identifying from a set of one or more advertising messages, a particular advertising message to display in response to one or more search engine queries using a collection of one or more documents, one or more taxonomies and one or more search engine indexes, comprising:
- computing the similarity between the one or more advertising messages and the one or more search engine queries;
  
  ranking the advertising messages by their similarity to the search engine queries; and
  
  normalizing the vector of similarities between the advertising messages and the search engine queries to a probability distribution.
- - 30. The method of claim 29 wherein the one or more advertising messages are displayed in proportion to their normalized similarities to the search engine queries.
  - 31. The method of claim 29 wherein the advertising messages that achieve the highest similarity to the one or more search engine queries are displayed.

32. A method for describing one or more sets of one or more words along one or more dimensions, comprising:
- identifying one or more dimensions for describing the one or more sets of one or more words;
  
  identifying one or more categorical values for each dimension; and
  
  identifying one or more representative sets of one or more words for each categorical value identified for each dimension, wherein a union of the one or more sets of one or more words is used to identify a list of terms associated with the categorical value.
- - 33. The method of claim 32 further comprising computing a similarity measure between the one or more sets of one or more words and the one or more list of terms associated with the one or more dimensions.
  - 34. The method of claim 33 wherein the one or more similarities between the one or more sets of one or more words and the one or more list of terms associated with the one or more categorical values is used to represent the one or more sets of one or more words as a vector of similarities.
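
Claims 32 to 34 describe a set of words by its similarity to the term list of each categorical value along each dimension. The sketch below assumes Jaccard overlap as the similarity measure, which the claims do not specify; the dimension and value names used with it are invented examples, and `describe` is a hypothetical function name.

```python
from typing import Dict, List, Set, Tuple

# dimension -> categorical value -> representative sets of words
Dimensions = Dict[str, Dict[str, List[List[str]]]]

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def describe(words: List[str],
             dimensions: Dimensions) -> Dict[Tuple[str, str], float]:
    """Represent `words` as a vector of similarities, one per (dimension, value)."""
    source = set(words)
    vector: Dict[Tuple[str, str], float] = {}
    for dim, values in dimensions.items():
        for value, rep_sets in values.items():
            # The union of the representative sets gives the value's term list.
            terms: Set[str] = set().union(*rep_sets)
            vector[(dim, value)] = jaccard(source, terms)
    return vector
```

Each coordinate of the returned vector is the similarity between the input words and one categorical value, so the vector length equals the total number of values across all dimensions.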

35. A method of computing one or more predictions of the performance of one or more target sets of one or more words using observed performance data for one or more source sets of one or more words, and one or more dimensions, comprising:
- computing one or more representations of the one or more source sets of one or more words, wherein the computed representations of the one or more source sets of one or more words together with the observed performance data of the one or more source sets of one or more words are used to train one or more predictive models; and
  
  computing the one or more representations of the one or more target sets of one or more words, wherein the one or more representations of the one or more target sets of one or more words and the one or more predictive models are used to compute one or more predictions of the performance for the one or more target sets of one or more words.
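
Claim 35 leaves the predictive model open. As one possible instance, the sketch below uses k-nearest-neighbor regression over dense representation vectors: training memorizes the source representations with their observed performance, and prediction averages the performance of the closest sources. The choice of model, the Euclidean distance, and the names `train` and `predict` are assumptions, and the performance figures in the usage example are made-up numbers.

```python
import math
from typing import List, Sequence, Tuple

Model = List[Tuple[Sequence[float], float]]

def train(source_reps: List[Sequence[float]],
          performance: List[float]) -> Model:
    """Pair each source representation with its observed performance."""
    if len(source_reps) != len(performance):
        raise ValueError("one performance value is needed per representation")
    return list(zip(source_reps, performance))

def predict(model: Model, target_rep: Sequence[float], k: int = 2) -> float:
    """Average the performance of the k sources closest to the target."""
    nearest = sorted(model, key=lambda pair: math.dist(pair[0], target_rep))[:k]
    return sum(perf for _, perf in nearest) / len(nearest)

# Usage with invented data: three source word sets and their performance.
model = train([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [0.10, 0.12, 0.02])
estimate = predict(model, [1.0, 0.05], k=2)
```

Any regression method that accepts the computed representations as features could be substituted for the nearest-neighbor step.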


Current Assignee: Walmart Apollo, LLC (WalMart Inc.)
Original Assignee: Adchemy Incorporated (WalMart Inc.)
Inventors: Kapur, Shyam; Farahat, Ayman; Chatwin, Richard
Granted Patent: US 7,580,926 B2
US Class Current: 1/1

CPC Class Codes
G06F 16/3347   using vector based model
G06F 16/367   Ontology
G06F 16/951   Indexing; Web crawling tech...
G06Q 30/02   Marketing; Price estimation...
Y10S 707/99933   Query processing, i.e. sear...
