Retrieval of Documents Using Language Models

US 20080059187A1
Filed: 08/30/2007
Published: 03/06/2008
Est. Priority Date: 08/31/2006
Status: Active Grant

11 Assignments

0 Petitions

Abstract

Methods of retrieving documents using a language model are disclosed. A method may include preparing a language model of a plurality of documents, receiving a query, processing the query using the language model, and using the processed query to retrieve documents responding to the query via the search engine. The methods may be implemented in software and/or hardware on computing devices, including personal computers, telephones, servers, and others.

129 Citations

25 Claims

1. A method of modeling documents comprising:
- receiving a plurality of documents,
- for each of the plurality of documents
  - tokenizing text included in the document
  - building a language model, including
    - identifying paragraph boundaries in the tokenized text
    - identifying word pairs in the paragraphs
    - calculating the frequency of the word pairs in the paragraphs
    - adding the word pairs and corresponding frequency information to the language model.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1 wherein the word pairs comprise two words occurring in any location in a particular paragraph, including adjacent to one another.
  - 3. The method of claim 1 further comprising:
    - identifying the language(s) used in the document.
  - 4. The method of claim 1 further comprising:
    - removing stopwords from the tokenized text.
  - 5. The method of claim 1 further comprising:
    - extracting text from the document before the tokenizing of the text from the document.
  - 6. The method of claim 1 wherein building the language model further comprises:
    - computing the probability of alternate terms.
  - 7. The method of claim 1 further comprising:
    - indexing the tokenized text.
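
The model-building steps of claims 1, 2, and 4 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the blank-line paragraph rule, the `<p>` marker token, the tiny `STOPWORDS` set, and the choice to count each unordered pair once per paragraph are all assumptions made here, since the claims do not fix those details.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stopword list and paragraph marker (assumptions for illustration).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}
PARA_BREAK = "<p>"


def tokenize(text):
    """Split text into lowercase word tokens, keeping a marker at blank lines."""
    tokens = []
    for match in re.finditer(r"\n\s*\n|[a-z0-9]+", text.lower()):
        tok = match.group()
        tokens.append(PARA_BREAK if tok.isspace() else tok)
    return tokens


def split_paragraphs(tokens):
    """Identify paragraph boundaries in the token stream."""
    paragraphs, current = [], []
    for tok in tokens:
        if tok == PARA_BREAK:
            if current:
                paragraphs.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        paragraphs.append(current)
    return paragraphs


def build_language_model(documents):
    """Map each unordered word pair to the number of paragraphs containing it."""
    model = Counter()
    for text in documents:
        tokens = [t for t in tokenize(text) if t not in STOPWORDS]
        for paragraph in split_paragraphs(tokens):
            # Any two distinct words in the same paragraph form a pair,
            # whether or not they are adjacent (claim 2).
            for pair in combinations(sorted(set(paragraph)), 2):
                model[pair] += 1
    return model


docs = ["The cat sat on the mat.\n\nThe dog sat.", "A cat sat."]
model = build_language_model(docs)
```

Here `model[("cat", "sat")]` is 2 because the pair appears in one paragraph of each document, while `("cat", "dog")` is absent because the two words never share a paragraph.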

8. A method of processing a query for documents comprising:
- receiving an original query for documents
- tokenizing the original query into a tokenized query
- extracting a group of associated terms from a language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
- appending a top group of associated terms to the original query to form an extended query
- computing a boost weight for each term in the extended query based on a probability of the terms in the extended query being related to the terms in the original query to create a weighted query
- submitting the weighted query to a search engine.
- Dependent claims (9, 10, 11, 12, 13, 14, 15):
  - 9. The method of claim 8 further comprising:
    - preparing the language model.
  - 10. The method of claim 9 wherein the preparing the language model comprises:
    - receiving a plurality of documents,
    - for each of the plurality of documents
      - tokenizing text included in the document
      - identifying paragraph boundaries in the tokenized text
      - identifying word pairs in the paragraphs, wherein the word pairs comprise two words occurring in any location in a particular paragraph, including adjacent to one another
      - calculating the frequency of the word pairs in the paragraphs
      - adding the word pairs and corresponding frequency information to the language model.
  - 11. The method of claim 10 wherein the preparing the language model further comprises:
    - identifying the language used in the document.
  - 12. The method of claim 10 wherein the preparing the language model further comprises:
    - removing stopwords from the tokenized text.
  - 13. The method of claim 10 wherein the preparing the language model further comprises:
    - extracting text from the document before the tokenizing of the text from the document.
  - 14. The method of claim 10 wherein the preparing the language model further comprises:
    - computing the probability of alternate terms.
  - 15. The method of claim 10 wherein the preparing the language model comprises:
    - indexing the tokenized text.
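
The query-processing steps of claim 8 can be illustrated with a small sketch. It assumes the language model is a mapping from unordered word pairs to counts, estimates relatedness as a pair's count divided by the total pair count of the query term, gives original terms a boost of 1.0, and boosts each appended term by its highest relatedness to any query term. None of these choices, nor the names `relatedness`, `expand_query`, and `top_n`, come from the claims.

```python
from collections import defaultdict


def relatedness(pair_counts):
    """Estimate P(other | term) from pair counts (an assumed estimator)."""
    totals = defaultdict(int)
    for (a, b), n in pair_counts.items():
        totals[a] += n
        totals[b] += n
    probs = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        probs[a][b] = n / totals[a]
        probs[b][a] = n / totals[b]
    return probs


def expand_query(query, pair_counts, top_n=2):
    """Return {term: boost weight} for the original plus appended terms."""
    terms = query.lower().split()  # simple whitespace tokenization
    probs = relatedness(pair_counts)
    weights = {t: 1.0 for t in terms}  # original terms keep full weight
    candidates = {}
    for t in terms:
        for other, p in probs.get(t, {}).items():
            if other not in weights:
                # Keep the strongest association with any query term.
                candidates[other] = max(candidates.get(other, 0.0), p)
    # Append only the top group of associated terms, ties broken by name.
    top = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    weights.update(top)
    return weights


pair_counts = {
    ("cat", "sat"): 2,
    ("cat", "mat"): 1,
    ("mat", "sat"): 1,
    ("dog", "sat"): 1,
}
weighted = expand_query("cat", pair_counts, top_n=1)
```

With these counts, "sat" is the term most related to "cat" (2 of the 3 pairs involving "cat"), so the weighted query holds "cat" at 1.0 and "sat" at about 0.67.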

16. A method of retrieving documents comprising:
- receiving a plurality of documents,
- for each of the plurality of documents
  - tokenizing text included in the document
  - building a language model, including
    - identifying paragraph boundaries in the tokenized text
    - identifying word pairs in the paragraphs, wherein the word pairs comprise two words occurring in any location in a particular paragraph, including adjacent to one another
    - calculating the frequency of the word pairs in the paragraphs
    - adding the word pairs and corresponding frequency information to the language model
- receiving an original query for documents
- tokenizing the original query into a tokenized query
- extracting a group of associated terms from the language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
- appending a top group of associated terms to the original query to form an expanded query
- computing a boost weight for each term in the expanded query based on a probability of the terms in the expanded query being related to the terms in the original query to create a weighted query
- submitting the weighted query to the search engine
- receiving a list of documents from the plurality of documents that most closely correspond to the original query.
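
The claims do not say how a weighted query is handed to the search engine. As one possible illustration, the sketch below renders term weights in the `term^boost` notation used by the Lucene classic query parser. The choice of Lucene-style syntax, the function name `to_boosted_query`, and the two-decimal rounding are assumptions made for this example, and terms are assumed to be plain alphanumeric words that need no escaping.

```python
def to_boosted_query(weights, precision=2):
    """Render {term: boost} as a Lucene-style boosted query string."""
    parts = []
    for term, weight in weights.items():
        # "^" attaches a boost factor to the preceding term.
        parts.append(f"{term}^{weight:.{precision}f}")
    return " ".join(parts)


query_string = to_boosted_query({"cat": 1.0, "sat": 2 / 3})
print(query_string)  # cat^1.00 sat^0.67
```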

17. A storage medium having instructions stored thereon which when executed by a processor cause the processor to perform actions comprising:
- receiving an original query for documents
- tokenizing the original query into a tokenized query
- extracting a group of associated terms from a language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
- appending a top group of associated terms to the original query to form an extended query
- computing a boost weight for each term in the extended query based on a probability of the terms in the extended query being related to the terms in the original query to create a weighted query
- submitting the weighted query to a search engine.
- Dependent claims (18, 19, 20, 21, 22, 23, 24):
  - 18. The storage medium of claim 17 having further instructions stored thereon which when executed by a processor cause the processor to perform further actions comprising:
    - preparing the language model.
  - 19. The storage medium of claim 17 wherein the preparing the language model comprises:
    - receiving a plurality of documents,
    - for each of the plurality of documents
      - tokenizing text included in the document
      - identifying paragraph boundaries in the tokenized text
      - identifying word pairs in the paragraphs, wherein the word pairs comprise two words occurring in any location in a particular paragraph, including adjacent to one another
      - calculating the frequency of the word pairs in the paragraphs
      - adding the word pairs and corresponding frequency information to the language model.
  - 20. The storage medium of claim 18 wherein the preparing the language model further comprises:
    - identifying the language used in the document.
  - 21. The storage medium of claim 18 wherein the preparing the language model further comprises:
    - removing stopwords from the tokenized text.
  - 22. The storage medium of claim 18 wherein the preparing the language model further comprises:
    - extracting text from the document before the tokenizing of the text from the document.
  - 23. The storage medium of claim 18 wherein the preparing of the language model comprises:
    - computing the probability of alternate terms.
  - 24. The storage medium of claim 18 wherein the preparing the language model comprises:
    - indexing the tokenized text.

25. A computing device to retrieve documents in response to receiving a query for documents, the computing device comprising:
- a processor, a memory coupled with the processor, and a storage medium having instructions stored thereon which when executed cause the computing device to perform actions including:
  - receiving a plurality of documents,
  - for each of the plurality of documents
    - tokenizing text included in the document
    - building a language model, including
      - identifying paragraph boundaries in the tokenized text
      - identifying word pairs in the paragraphs, wherein the word pairs comprise two words occurring in any location in a particular paragraph, including adjacent to one another
      - calculating the frequency of the word pairs in the paragraphs
      - adding the word pairs and corresponding frequency information to the language model
  - receiving the query for documents
  - tokenizing the query for documents into a tokenized query
  - extracting a group of associated terms from the language model, the associated terms having a highest probability of relatedness to each of the terms in the tokenized query
  - appending a top group of terms associated with the query for documents to form an expanded query
  - computing a boost weight for each term in the expanded query based on a probability of the terms in the expanded query being related to the terms in the query for documents to create a weighted query
  - submitting the weighted query to the search engine
  - receiving a list of documents from the plurality of documents that most closely correspond to the query for documents.

Current Assignee: Proofpoint Incorporated
Original Assignee: OrcaTec LLC
Inventors: Roitblat, Herbert L.; Golbere, Brian

Granted Patent: US 8,401,841 B2
US Class Current: 704/257
CPC Class Codes: G06F 16/3344 using natural language anal...
