Retrieval of Documents Using Language Models
Abstract
Methods of retrieving documents using a language model are disclosed. A method may include preparing a language model of a plurality of documents, receiving a query, processing the query using the language model, and using the processed query to retrieve documents responsive to the query via a search engine. The methods may be implemented in software and/or hardware on computing devices, including personal computers, telephones, servers, and others.
25 Claims
1. A method of modeling documents comprising:
- receiving a plurality of documents,
- for each of the plurality of documents
  - tokenizing text included in the document
  - building a language model, including
    - identifying paragraph boundaries in the tokenized text
    - identifying word pairs in the paragraphs
    - calculating the frequency of the word pairs in the paragraphs
    - adding the word pairs and corresponding frequency information to the language model.
Dependent claims: 2, 3, 4, 5, 6, 7
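The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the tokenizer, the `<P>` paragraph marker, the blank-line paragraph rule, and the choice to count each unordered pair once per paragraph are all assumptions made for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

PARA = "<P>"  # hypothetical marker token for a paragraph boundary


def tokenize(text):
    """Split text into lowercase word tokens, keeping blank-line breaks as PARA."""
    tokens = []
    for match in re.finditer(r"[A-Za-z0-9]+|\n\s*\n", text):
        tok = match.group()
        tokens.append(PARA if tok.isspace() else tok.lower())
    return tokens


def paragraphs(tokens):
    """Identify paragraph boundaries in the token stream and yield each paragraph."""
    current = []
    for tok in tokens:
        if tok == PARA:
            if current:
                yield current
            current = []
        else:
            current.append(tok)
    if current:
        yield current


def build_language_model(documents):
    """Map each unordered word pair to the number of paragraphs containing both words."""
    pair_counts = Counter()
    for doc in documents:
        for para in paragraphs(tokenize(doc)):
            # Assumption: a pair is counted once per paragraph, wherever the words occur.
            pair_counts.update(combinations(sorted(set(para)), 2))
    return pair_counts


docs = ["search engine query\n\nquery language model", "search query"]
model = build_language_model(docs)
```

Pairs are stored as alphabetically sorted tuples, so `model[("query", "search")]` covers both word orders.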
8. A method of processing a query for documents comprising:
- receiving an original query for documents
- tokenizing the original query into a tokenized query
- extracting a group of associated terms from a language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
- appending a top group of associated terms to the original query to form an extended query
- computing a boost weight for each term in the extended query based on a probability of the terms in the extended query being related to the terms in the original query to create a weighted query
- submitting the weighted query to a search engine.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
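The query-processing steps of claim 8 can be sketched as follows. The claim does not define how the probability of relatedness or the boost weight is computed, so this sketch assumes a simple conditional estimate from pair counts, averages it over the query terms, and gives original terms a boost of 1.0. The function names and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def relatedness(pair_counts):
    """Estimate P(other | term) from pair counts (an assumed estimator)."""
    totals = Counter()
    for (a, b), n in pair_counts.items():
        totals[a] += n
        totals[b] += n
    probs = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        probs[a][b] = n / totals[a]
        probs[b][a] = n / totals[b]
    return probs


def expand_query(query, pair_counts, top_k=2):
    """Return {term: boost} for the original terms plus the top_k associated terms."""
    terms = query.lower().split()  # simplistic whitespace tokenization
    probs = relatedness(pair_counts)
    scores = Counter()
    for term in terms:
        for other, p in probs.get(term, {}).items():
            if other not in terms:
                # Average the relatedness over all original query terms.
                scores[other] += p / len(terms)
    weighted = {term: 1.0 for term in terms}  # assumption: original terms keep full weight
    for other, score in scores.most_common(top_k):
        weighted[other] = score
    return weighted


toy_counts = {("engine", "search"): 3, ("query", "search"): 1, ("engine", "index"): 1}
weighted_query = expand_query("search", toy_counts, top_k=1)
```

With these toy counts, "engine" is the term most strongly associated with "search", so it is the one appended. The resulting mapping could then be serialized into whatever weighted-query syntax the target search engine accepts.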
16. A method of retrieving documents comprising:
- receiving a plurality of documents,
- for each of the plurality of documents
  - tokenizing text included in the document
  - building a language model, including
    - identifying paragraph boundaries in the tokenized text
    - identifying word pairs in the paragraphs, wherein the word pairs comprise two words occurring in any location in a particular paragraph, including adjacent to one another
    - calculating the frequency of the word pairs in the paragraphs
    - adding the word pairs and corresponding frequency information to the language model
- receiving an original query for documents
- tokenizing the original query into a tokenized query
- extracting a group of associated terms from the language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
- appending a top group of associated terms to the original query to form an expanded query
- computing a boost weight for each term in the expanded query based on a probability of the terms in the expanded query being related to the terms in the original query to create a weighted query
- submitting the weighted query to the search engine
- receiving a list of documents from the plurality of documents that most closely correspond to the original query.
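The final two steps of claim 16, submitting the weighted query and receiving a ranked list, depend on a search engine that the claim does not specify. As a stand-in, the sketch below uses a toy in-memory scorer that ranks documents by the boost-weighted count of query terms they contain. The `search` function and its scoring rule are assumptions for illustration only.

```python
import re
from collections import Counter


def search(weighted_query, documents, limit=10):
    """Rank documents by the boost-weighted count of query terms they contain."""
    scored = []
    for doc_id, text in enumerate(documents):
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        score = sum(boost * counts[term] for term, boost in weighted_query.items())
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by the lower document index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:limit]]


corpus = [
    "search tips",
    "engine repair manual",
    "search engine internals",
    "cooking",
]
ranking = search({"search": 1.0, "engine": 0.75}, corpus)
```

Here the document containing both the original term and the expansion term ranks first, while a document matching only the lower-weighted expansion term is still returned, which is the intended effect of query expansion with boosts.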
17. A storage medium having instructions stored thereon which when executed by a processor cause the processor to perform actions comprising:
- receiving an original query for documents
- tokenizing the original query into a tokenized query
- extracting a group of associated terms from a language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
- appending a top group of associated terms to the original query to form an extended query
- computing a boost weight for each term in the extended query based on a probability of the terms in the extended query being related to the terms in the original query to create a weighted query
- submitting the weighted query to a search engine.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
25. A computing device to retrieve documents in response to receiving a query for documents, the computing device comprising:
a processor, a memory coupled with the processor, and a storage medium having instructions stored thereon which when executed cause the computing device to perform actions including:
- receiving a plurality of documents,
- for each of the plurality of documents
  - tokenizing text included in the document
  - building a language model, including
    - identifying paragraph boundaries in the tokenized text
    - identifying word pairs in the paragraphs, wherein the word pairs comprise two words occurring in any location in a particular paragraph, including adjacent to one another
    - calculating the frequency of the word pairs in the paragraphs
    - adding the word pairs and corresponding frequency information to the language model
- receiving the query for documents
- tokenizing the query for documents into a tokenized query
- extracting a group of associated terms from the language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
- appending a top group of terms associated with the query for documents to form an expanded query
- computing a boost weight for each term in the expanded query based on a probability of the terms in the expanded query being related to the terms in the query for documents to create a weighted query
- submitting the weighted query to the search engine
- receiving a list of documents from the plurality of documents that most closely correspond to the query for documents.
Specification