Method and apparatus for representing text using search engine, document collection, and hierarchal taxonomy
First Claim
1. A method for obtaining a representation of one or more sets of one or more words from a collection of one or more documents, comprising:
issuing the one or more sets of one or more words as a query to a search engine;
receiving at least one result from the search engines in response to the one or more search queries; and
selecting a number of closest results from the search engine, each result of the closest results representing a document selected from the collection of documents, wherein for each selected closest results documents some content is retrieved, the retrieved content comprising at least one of a document summary, document body, and document body with anchor text.
Abstract
Embodiments of a data representation system for describing specific data sets, such as documents, web pages, or search engine queries, based on data tokens, such as words or n-grams, contained in a collection of documents are described. Such a system can be used in any type of information retrieval application, such as a document, web page, or online advertisement serving process, based on an information request, such as a query executed through an Internet search engine. For example, when a search is performed at a search engine, a content provider uses the system to represent the search query and compares the query representation against representations of a set of content in order to identify, retrieve and aggregate the content from the set most relevant to the search query, in the form of a web page or other data unit for display or access through the web browser.
35 Claims
-
1. A method for obtaining a representation of one or more sets of one or more words from a collection of one or more documents, comprising:
-
issuing the one or more sets of one or more words as a query to a search engine;
receiving at least one result from the search engines in response to the one or more search queries; and
selecting a number of closest results from the search engine, each result of the closest results representing a document selected from the collection of documents, wherein for each selected closest results documents some content is retrieved, the retrieved content comprising at least one of a document summary, document body, and document body with anchor text. - View Dependent Claims (2, 3)
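The steps of claim 1 can be sketched as follows. This is only an illustrative sketch under stated assumptions: `Document`, `toy_search`, and `retrieve_content` are hypothetical names, and the term-overlap ranking stands in for whatever search engine an actual embodiment would use.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical record for one document in the collection."""
    doc_id: str
    summary: str
    body: str
    anchor_text: str = ""


def toy_search(query_words, collection):
    """Stand-in search engine (assumption): rank by number of query-word hits in the body."""
    terms = {w.lower() for w in query_words}
    scored = []
    for doc in collection:
        hits = sum(1 for tok in doc.body.lower().split() if tok in terms)
        if hits > 0:
            scored.append((hits, doc))
    # Stable sort: ties keep their original collection order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def retrieve_content(query_words, collection, k=2, field="summary"):
    """Issue the words as a query, keep the k closest results, and pull content from each."""
    closest = toy_search(query_words, collection)[:k]
    if field == "summary":
        return [d.summary for d in closest]
    if field == "body":
        return [d.body for d in closest]
    if field == "body+anchor":
        return [f"{d.body} {d.anchor_text}".strip() for d in closest]
    raise ValueError(f"unknown field: {field}")
```

The `field` argument mirrors the three content options named in the claim (summary, body, body with anchor text); the cutoff `k` is the "number of closest results".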
-
4. The method of claim 3, wherein the retrieved content together with the one or more tokens present in the retrieved content is used to compute one or more vector representations of the one or more sets of one or more words, and the number of occurrences of the one or more tokens in the retrieved content is used to compute a token frequency, and wherein the number of occurrences of the one or more tokens in a previously selected collection of content is used to compute an inverse token frequency.
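One possible reading of the token frequency and inverse token frequency in claim 4 is a TF-IDF style weighting. The sketch below assumes whitespace tokenization and a smoothed logarithmic inverse frequency; neither choice is stated in the claim, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumption: lowercase whitespace tokens; n-grams would work the same way."""
    return text.lower().split()


def inverse_token_frequency(background_texts):
    """Compute a smoothed inverse frequency from a previously selected collection."""
    n = len(background_texts)
    doc_freq = Counter()
    for text in background_texts:
        doc_freq.update(set(tokenize(text)))
    weights = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    # Weight given to tokens never seen in the background collection.
    default = math.log(1 + n) + 1.0
    return weights, default


def vector_representation(retrieved_texts, weights, default):
    """Token frequency in the retrieved content times inverse token frequency."""
    token_freq = Counter()
    for text in retrieved_texts:
        token_freq.update(tokenize(text))
    return {tok: count * weights.get(tok, default) for tok, count in token_freq.items()}
```

The result is a sparse vector keyed by token, which is the form the similarity claims further down can consume.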
-
6. A method for obtaining a representation of one or more sets of one or more words at different levels of resolution from a base collection of documents, comprising:
-
defining one or more hierarchical taxonomies of the base collection of documents, comprising;
issuing the one or more sets of one or more words as one or more queries to one or more search engines;
receiving results from the one or more search engines in response to the one or more queries, wherein the results comprise one or more ranked documents at the one or more search engines together with the assignment of the one or more ranked documents to the nodes of the one or more taxonomies at a selected resolution, wherein for each ranked document some content is retrieved, the content comprising at least one of the document title, document summary, document body, and document body together with anchor text; and
aggregating the ranked documents to the selected nodes of the one or more taxonomies using the assignments of the ranked documents to the nodes of the one or more taxonomies at the selected resolution. - View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14)
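The aggregation step of claim 6 might look like the sketch below. It assumes each ranked document arrives with a root-to-leaf taxonomy path, treats the "selected resolution" as a depth in that path, and weights documents by reciprocal rank; the weighting scheme and the name `aggregate_to_nodes` are assumptions for illustration, not taken from the claim.

```python
from collections import defaultdict


def aggregate_to_nodes(ranked_docs, resolution):
    """Roll ranked documents up to taxonomy nodes at the given depth.

    ranked_docs: list of (doc_id, path) in rank order, where path is a
    tuple of node names from the taxonomy root down to the assigned node.
    """
    weights = defaultdict(float)
    for rank, (_doc_id, path) in enumerate(ranked_docs, start=1):
        node = path[:resolution]      # ancestor at the selected resolution
        weights[node] += 1.0 / rank   # assumption: reciprocal-rank weighting
    total = sum(weights.values())
    if total == 0:
        return {}
    # Normalize so the node weights form a distribution over the taxonomy level.
    return {node: w / total for node, w in weights.items()}
```

Calling the function with a smaller `resolution` gives a coarser representation of the same query, and a larger one gives a finer representation.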
-
-
15. A method for computing the similarity of one or more source sets of one or more words to another set of one or more target sets of one or more words using one or more documents and one or more search engine indexes, comprising:
-
computing the vector representation of the one or more source sets of one or more words;
computing the vector representation of the one or more target sets of one or more words; and
computing the similarity between the target and source vector representations using a vector similarity measure selected from the group consisting of cosine product, Hellinger distance, and Kullback-Leibler distance.
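The three measures named in claim 15 can be sketched over sparse vectors stored as dictionaries. The sketch assumes non-negative weights, and that the Hellinger and Kullback-Leibler functions receive vectors already normalized to sum to 1 (the `normalize` helper does this); the `eps` floor used to avoid division by zero is also an assumption.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def normalize(u):
    """Scale non-negative weights so they sum to 1."""
    total = sum(u.values())
    return {k: x / total for k, x in u.items()} if total else {}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return math.sqrt(s / 2.0)


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); q is floored at eps (assumption)."""
    return sum(x * math.log(x / max(q.get(k, 0.0), eps)) for k, x in p.items() if x > 0)
```

Note that cosine is a similarity (larger means closer), while Hellinger and Kullback-Leibler are distances (smaller means closer), so a ranking built on them needs the opposite sort order.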
-
-
16. A method for computing the similarity of one or more source sets of one or more words to another set of one or more target sets of one or more words using one or more documents, one or more taxonomies and one or more search engine indexes, comprising:
computing the vector representation of the one or more source sets of one or more words;
computing the vector representation of the one or more target sets of one or more words; and
computing the similarity between the target and source vector representations using a vector similarity measure selected from the group consisting of cosine product, Hellinger distance, and Kullback-Leibler distance.
-
-
17. A method of categorizing one or more source sets of one or more words into one or more target categories using one or more documents and at least one search engine index, comprising:
-
identifying each target category with one or more representative sets of one or more words for the category; and
computing the similarity between the source sets of one or more words and the union of the one or more representative sets of one or more words for the one or more target categories. - View Dependent Claims (18, 19, 20)
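A minimal sketch of the categorization in claim 17 follows. For brevity it swaps the search-engine-derived representation for a plain bag-of-words vector (passed in as the `represent` argument, so another representation can be substituted) and uses cosine similarity; both choices, and all names, are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(word_sets):
    """Simplified representation (assumption): token counts over the word sets."""
    counts = Counter()
    for words in word_sets:
        counts.update(w.lower() for w in words)
    return counts


def cosine(u, v):
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def categorize(source_sets, categories, represent=bag_of_words):
    """Score the source against the union of each category's representative sets."""
    source_vec = represent(source_sets)
    scores = {}
    for name, representative_sets in categories.items():
        union = set().union(*[{w.lower() for w in s} for s in representative_sets])
        scores[name] = cosine(source_vec, represent([union]))
    best = max(scores, key=scores.get)
    return best, scores
```

For example, with categories "footwear" (represented by "running shoes" and "leather boots") and "garden" (represented by "garden tools" and "lawn mower"), the source "trail running shoes" scores highest against "footwear".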
-
-
21. A method of categorizing one or more source sets of one or more words into one or more target categories using one or more documents, one or more search engine indexes and one or more hierarchal taxonomies, comprising:
-
identifying each target category with one or more representative sets of one or more words for the category;
computing the similarity between the source sets of the one or more words and the union of the one or more representative sets of one or more words for the one or more target categories. - View Dependent Claims (22, 23)
-
-
24. A method of computing the similarity of one or more search engine queries to one or more advertising messages using one or more documents and at least one search engine index, comprising:
-
associating the one or more advertising messages to a set of representative sets of one or more words;
treating the one or more search engine queries as a set of one or more words; and
defining the similarity between the one or more advertising messages and the one or more search engine queries to be the similarity between the set of representative sets of one or more words associated with the one or more advertising messages and the one or more search engine queries.
-
-
25. A method for identifying from a set of one or more advertising messages, a particular advertising message to display in response to one or more search engine queries using a collection of one or more documents and one or more search engine indexes, comprising:
-
computing the similarity between the one or more advertising messages and the one or more search engine queries;
ranking the advertising messages by their similarity to the search engine queries; and
normalizing the vector of similarities between the advertising messages and the search engine queries to a probability distribution. - View Dependent Claims (26, 27)
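The ranking and normalization steps of claim 25 can be sketched as below, assuming non-negative similarity scores. The uniform fallback when every score is zero, and the use of the resulting distribution to sample the ad to display, are assumptions; the claim itself only recites the normalization.

```python
import random


def rank_and_normalize(similarities):
    """Sort ads by similarity (highest first) and scale the scores to sum to 1.

    similarities: dict mapping a hypothetical ad id to a non-negative score.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(score for _, score in ranked)
    if total <= 0:
        # Assumption: fall back to a uniform distribution when nothing matches.
        return [(ad, 1.0 / len(ranked)) for ad, _ in ranked] if ranked else []
    return [(ad, score / total) for ad, score in ranked]


def pick_ad(distribution, rng=None):
    """One possible use of the distribution: sample the ad to display."""
    rng = rng or random.Random()
    ads = [ad for ad, _ in distribution]
    weights = [p for _, p in distribution]
    return rng.choices(ads, weights=weights, k=1)[0]
```

Sampling rather than always showing the top-ranked ad is one way a probability distribution could be used; always taking `distribution[0]` is the deterministic alternative.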
-
-
28. A method of computing the similarity of one or more search engine queries to one or more advertising messages using one or more documents, one or more taxonomies, and one or more search engine indexes, comprising:
associating the one or more advertising messages to a set of representative sets of one or more words;
treating each of the one or more search engine queries as a set of one or more words;
defining the similarity between the one or more advertising messages and the one or more search engine queries to be the similarity between the set of representative sets of one or more words associated with the one or more advertising messages and the one or more search engine queries; and
computing the similarity between the set of representative sets of one or more words associated with the one or more advertising messages and the one or more search engine queries.
-
29. A method of identifying from a set of one or more advertising messages, a particular advertising message to display in response to one or more search engine queries using a collection of one or more documents, one or more taxonomies and one or more search engine indexes, comprising:
-
computing the similarity between the one or more advertising messages and the one or more search engine queries;
ranking the advertising messages by their similarity to the search engine queries; and
normalizing the vector of similarities between the advertising messages and the search engine queries to a probability distribution. - View Dependent Claims (30, 31)
-
-
32. A method for describing one or more sets of one or more words along one or more dimensions, comprising:
-
identifying one or more dimensions for describing the one or more sets of one or more words;
identifying one or more categorical values for each dimension; and
identifying one or more representative sets of one or more words for each categorical value identified for each dimension, wherein a union of the one or more sets of one or more words is used to identify a list of terms associated with the categorical value. - View Dependent Claims (33, 34)
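The structure described in claim 32 maps naturally onto nested dictionaries: dimension, then categorical value, then representative word sets. The sketch below is illustrative only; the example dimension "intent" and its values are invented, and `terms_for_value` is a hypothetical helper.

```python
def terms_for_value(representative_sets):
    """Union of the representative word sets, as a sorted list of terms."""
    return sorted(set().union(*[{w.lower() for w in s} for s in representative_sets]))


def build_term_lists(dimensions):
    """Map each dimension's categorical values to their associated term lists."""
    return {
        dimension: {value: terms_for_value(sets) for value, sets in values.items()}
        for dimension, values in dimensions.items()
    }


# Invented example data: one dimension with two categorical values.
EXAMPLE_DIMENSIONS = {
    "intent": {
        "purchase": [["buy", "cheap"], ["order", "online", "buy"]],
        "research": [["review"], ["compare", "review"]],
    }
}
```

Here `build_term_lists(EXAMPLE_DIMENSIONS)["intent"]["purchase"]` yields the terms buy, cheap, online, and order.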
-
-
35. A method of computing one or more predictions of the performance of one or more target sets of one or more words using observed performance data for one or more source sets of one or more words, and one or more dimensions, comprising:
-
computing one or more representations of the one or more source sets of one or more words, wherein the computed representations of the one or more source sets of one or more words together with the observed performance data of the one or more source sets of one or more words are used to train one or more predictive models; and
computing the one or more representations of the one or more target sets of one or more words, wherein the one or more representations of the one or more target sets of one or more words and the one or more predictive models are used to compute one or more predictions of the performance for the one or more target sets of one or more words.
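Claim 35 does not name a model family, so the sketch below swaps in a deliberately simple one: similarity-weighted nearest-neighbor regression over sparse vectors. The model choice, the fallback to the mean when no source is similar, and all function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(source_vectors, observed_performance):
    """For a nearest-neighbor model, training just stores (vector, outcome) pairs."""
    if not source_vectors or len(source_vectors) != len(observed_performance):
        raise ValueError("need one observed value per source vector")
    return list(zip(source_vectors, observed_performance))


def predict(model, target_vector, k=2):
    """Predict performance as the similarity-weighted mean of the k nearest sources."""
    scored = sorted(
        ((cosine(target_vector, vec), y) for vec, y in model),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        # Assumption: with no similar source, fall back to the overall mean.
        return sum(y for _, y in model) / len(model)
    return sum(sim * y for sim, y in scored) / total
```

The observed performance could be any numeric outcome, for example a click-through rate recorded for each source keyword; the prediction for a new keyword then leans on the outcomes of the sources whose representations are closest to it.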
-
Specification