Retrieval of documents using language models
Abstract
Methods of retrieving documents using a language model are disclosed. A method may include preparing a language model of a plurality of documents, receiving a query, processing the query using the language model, and using the processed query to retrieve documents responsive to the query via a search engine. The methods may be implemented in software and/or hardware on computing devices, including personal computers, telephones, servers, and others.
26 Claims
1. A method of modeling documents implemented by a computing device comprising:
receiving a plurality of documents and building a language model, the building comprising, for each of the documents, tokenizing text included in the document;
defining paragraphs by identifying paragraph boundaries in the tokenized text;
identifying word pairs in each defined paragraph wherein the word pairs comprise two words occurring in any location in the same defined paragraph, including adjacent to one another;
calculating the frequency of the identified word pairs; and
adding the identified word pairs and corresponding frequency information to the language model.
Dependent claims: 2, 3, 4, 5, 6.
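As a rough illustration of the model-building steps recited in claim 1, the following Python sketch tokenizes each document, defines paragraphs from boundary markers in the token stream, and counts word pairs that share a paragraph. It is only one possible reading of the claim, not the patented implementation. The names `PARA`, `tokenize`, `split_paragraphs`, and `build_language_model` are hypothetical, and several choices are assumptions: blank lines mark paragraph boundaries, pairs are unordered, and each distinct pair is counted once per paragraph.

```python
import re
from collections import Counter
from itertools import combinations

PARA = "<p>"  # hypothetical paragraph-boundary marker placed in the token stream


def tokenize(text):
    """Lowercase word tokens; a blank line becomes a PARA marker (assumption)."""
    tokens = []
    for i, block in enumerate(re.split(r"\n\s*\n", text.strip())):
        if i > 0:
            tokens.append(PARA)
        tokens.extend(re.findall(r"[a-z0-9]+", block.lower()))
    return tokens


def split_paragraphs(tokens):
    """Define paragraphs by locating boundary markers in the tokenized text."""
    paragraphs, current = [], []
    for tok in tokens:
        if tok == PARA:
            if current:
                paragraphs.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        paragraphs.append(current)
    return paragraphs


def build_language_model(documents):
    """Count unordered word pairs that occur anywhere in the same paragraph."""
    pair_counts = Counter()
    for doc in documents:
        for paragraph in split_paragraphs(tokenize(doc)):
            # Any two distinct words in the paragraph form a pair, adjacent or not.
            for pair in combinations(sorted(set(paragraph)), 2):
                pair_counts[pair] += 1
    return pair_counts


docs = ["solar panel cost\n\nwind turbine cost", "solar panel efficiency"]
model = build_language_model(docs)
```

In this sketch the model is a `Counter` keyed by alphabetically sorted word pairs, so `model[("panel", "solar")]` gives the number of paragraphs in which both words appear. Words that never share a paragraph, such as "solar" and "wind" in the sample data, produce no entry.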
7. A method of processing a query for documents implemented by a computing device comprising:
providing a language model, the language model comprising a plurality of terms and a plurality of pair-values, each pair-value representing the relatedness of term pairs within a document;
receiving an original query for documents after the language model is prepared independent of the original query;
tokenizing the original query into a tokenized query;
extracting a group of associated terms from the language model, wherein the extracting comprises identifying pair-values in the language model corresponding to term pairs comprising a term in the tokenized query and another term that is different from any term in the tokenized query and identifying associated terms as the terms that are different from any term in the tokenized query and that are related to at least one term in the tokenized query and wherein the probability of relatedness is determined from the pair-values in the language model;
forming an expanded query comprising a top group of associated terms and the original query, wherein the top group of associated terms are a subset of the associated terms, the terms in the top group of associated terms having a higher probability of relatedness to terms in the tokenized query than associated terms not in the top group;
computing a boost weight for each term in the expanded query based on the probability of relatedness of the terms in the expanded query being related to the terms in the original query to create a weighted query; and
submitting the weighted query to a search engine.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15.
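The query-processing steps of claim 7 can be sketched in Python as follows. This is an illustrative reading under stated assumptions rather than the method the patent actually uses. The language model is represented as a plain dictionary of pair-values. The probability of relatedness is assumed to be a pair-value divided by the total pair-value mass of the query term. Original query terms are assumed to receive a boost of 1.0, and associated terms are boosted by their probability. All function names and the `top_k` parameter are hypothetical, and the `term^weight` output merely mirrors the caret boost notation found in Lucene-style query syntax.

```python
import re
from collections import defaultdict


def tokenize_query(query):
    """Split the original query into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", query.lower())


def associated_terms(pair_values, query_terms):
    """Map each non-query term to an assumed probability of relatedness."""
    query_set = set(query_terms)
    totals = defaultdict(float)  # total pair-value mass per query term
    joint = defaultdict(float)   # (query term, other term) -> pair-value
    for (a, b), value in pair_values.items():
        for q, other in ((a, b), (b, a)):
            if q in query_set:
                totals[q] += value
                if other not in query_set:  # associated terms differ from query terms
                    joint[(q, other)] += value
    scores = defaultdict(float)
    for (q, other), value in joint.items():
        # Related to at least one query term: keep the strongest relation.
        scores[other] = max(scores[other], value / totals[q])
    return dict(scores)


def expand_query(pair_values, query, top_k=2):
    """Return (term, boost) pairs: original terms plus the top associated terms."""
    terms = tokenize_query(query)
    scores = associated_terms(pair_values, terms)
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    weighted = [(t, 1.0) for t in dict.fromkeys(terms)]  # assumed boost of 1.0
    weighted += [(t, round(p, 3)) for t, p in top]
    return weighted


def to_query_string(weighted):
    """Render the weighted query using caret boosts (an assumed output format)."""
    return " ".join(f"{t}^{w}" for t, w in weighted)


pair_values = {("panel", "solar"): 6, ("cost", "solar"): 3,
               ("energy", "solar"): 1, ("turbine", "wind"): 5}
weighted = expand_query(pair_values, "solar")
```

With the sample pair-values, the query "solar" expands to "solar", "panel", and "cost"; "energy" falls outside the top group because its assumed probability is lowest, and "turbine" is never considered because it shares no pair with a query term.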
16. A method of retrieving documents implemented by a computing device comprising:
receiving a plurality of documents;
building a language model, the building comprising, for each of the documents, tokenizing text included in the document;
defining paragraphs by identifying paragraph boundaries in the tokenized text;
identifying word pairs in each defined paragraph wherein the word pairs comprise two words occurring in any location in the same defined paragraph, including adjacent to one another;
calculating the frequency of the identified word pairs; and
adding the identified word pairs and corresponding frequency information to the language model;
receiving an original query for documents;
tokenizing the original query into a tokenized query;
extracting a group of associated terms from the language model, wherein the associated terms have a highest probability of relatedness to each of the terms in the tokenized query; and wherein the probability of relatedness is calculated from the probabilities in the language model;
forming an expanded query comprising a top group of associated terms and the original query;
computing a boost weight for each term in the expanded query based on the probability of relatedness of the terms in the expanded query being related to the terms in the original query to create a weighted query;
submitting the weighted query to the search engine; and
receiving a list of documents from the plurality of documents that most closely correspond to the original query.
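The last two steps of claim 16, submitting the weighted query and receiving a ranked list of documents, might look like the Python sketch below. The claim does not say how the search engine scores documents, so the `search` function here is a hypothetical toy stand-in that sums each term's boost multiplied by its count in the document. The sample documents and weights are illustrative only.

```python
import re
from collections import Counter


def search(documents, weighted_query, limit=10):
    """Toy stand-in for a search engine: score = sum(boost * term count)."""
    results = []
    for doc_id, text in enumerate(documents):
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        score = sum(weight * counts[term] for term, weight in weighted_query)
        if score > 0:
            results.append((doc_id, score))
    # Highest score first; ties broken by document position.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:limit]


documents = [
    "solar power for homes",
    "panel installation cost guide",
    "wind turbine maintenance",
]
# A weighted query of the kind produced by expansion: original term plus boosted associates.
weighted_query = [("solar", 1.0), ("panel", 0.6), ("cost", 0.3)]
ranked = search(documents, weighted_query)
```

The second sample document never mentions "solar" yet is still returned, ranked below the first, because the expanded terms "panel" and "cost" match it. This is the practical effect of expanding the query before submission.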
17. A storage medium having instructions stored thereon which when executed by a processor cause the processor to perform actions comprising:
providing a language model, the language model comprising a plurality of terms and a plurality of pair-values, each pair-value representing the relatedness of term pairs within a document;
receiving an original query for documents after the language model is prepared independent of the original query;
tokenizing the original query into a tokenized query;
extracting a group of associated terms from the language model, wherein the extracting comprises identifying pair-values in the language model corresponding to term pairs comprising a term in the tokenized query and another term that is different from any term in the tokenized query and identifying associated terms as the terms that are different from any term in the tokenized query and that are related to at least one term in the tokenized query and wherein the probability of relatedness is determined from the pair-values in the language model;
forming an expanded query comprising a top group of associated terms and the original query, wherein the top group of associated terms are a subset of the associated terms, the terms in the top group of associated terms having a higher probability of relatedness to terms in the tokenized query than associated terms not in the top group;
computing a boost weight for each term in the expanded query based on the probability of relatedness of the terms in the expanded query being related to the terms in the original query to create a weighted query; and
submitting the weighted query to a search engine.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25.
26. A computing device to retrieve documents in response to receiving a query for documents, the computing device comprising:
a processor, a memory coupled with the processor, and a storage medium having instructions stored thereon which when executed cause the computing device to perform actions including:
receiving a plurality of documents;
building a language model, the building comprising, for each of the documents, tokenizing text included in the document;
defining paragraphs by identifying paragraph boundaries in the tokenized text;
identifying word pairs in each defined paragraph wherein the word pairs comprise two words occurring in any location in the same defined paragraph, including adjacent to one another;
calculating the frequency of the identified word pairs; and
adding the identified word pairs and corresponding frequency information to the language model;
receiving a query for documents;
tokenizing the query for documents into a tokenized query;
extracting a group of associated terms from the language model, wherein the associated terms have a highest probability of relatedness to each of the terms in the tokenized query and wherein the probability of relatedness is calculated from the probabilities in the language model;
forming an expanded query comprising a top group of associated terms and the query for documents;
computing a boost weight for each term in the expanded query based on the probability of relatedness of the terms in the expanded query being related to the terms in the query for documents to create a weighted query;
submitting the weighted query to the search engine; and
receiving a list of documents from the plurality of documents that most closely correspond to the query for documents.
Specification