Topic specific language models built from large numbers of documents

US 20060212288A1
Filed: 03/17/2006
Published: 09/21/2006
Est. Priority Date: 03/17/2005
Status: Active Grant

Abstract

A language model is formed and/or improved using data from a large collection of documents, such as web data. The collection of documents is queried using queries that are formed from the language model. The language model is then improved using the information obtained, and the improved model is used in turn to improve the queries. As data is received from the collection of documents, it is compared to a rejection model, which models what rejected documents typically look like. Any document that matches the rejection model is rejected. The documents that remain are characterized to determine whether they add information to the language model, whether they are relevant, and whether they should be independently rejected. Rejected documents are used to update the rejection model; accepted documents are used to update the language model. Each iteration improves the language model, and the documents may be analyzed again using the improved language model.
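
The loop described in the abstract can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the patented implementation: the models are add-one smoothed unigram models rather than full n-gram models, a document is accepted whenever the topic model scores it higher than the rejection model, and `fetch` is a hypothetical callable standing in for whatever search interface returns documents for a query. The names `UnigramModel`, `tokenize`, and `refine` are likewise invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


class UnigramModel:
    """Add-one smoothed unigram model, a stand-in for a full n-gram model."""

    def __init__(self, vocab_size=10_000):
        self.counts = Counter()
        self.total = 0
        self.vocab_size = vocab_size  # assumed fixed vocabulary size

    def update(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)

    def avg_logprob(self, tokens):
        """Average per-token log probability of `tokens` under this model."""
        if not tokens:
            return float("-inf")
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[t] + 1) / denom) for t in tokens) / len(tokens)


def refine(topic, rejection, fetch, rounds=3, n_terms=3):
    """Alternate between querying with the topic model and updating both models."""
    for _ in range(rounds):
        # Form the query from the current topic model (here: its most frequent terms).
        query = [word for word, _ in topic.counts.most_common(n_terms)]
        for doc in fetch(query):
            tokens = tokenize(doc)
            if topic.avg_logprob(tokens) > rejection.avg_logprob(tokens):
                topic.update(tokens)      # accepted: updates the language model
            else:
                rejection.update(tokens)  # rejected: updates the rejection model
    return topic, rejection
```

Both models need a small seed before the first round; in this sketch the caller would call `update` on each with a few in-topic and off-topic tokens. Because every round re-scores whatever `fetch` returns, earlier documents are analyzed again under the improved models, as the abstract describes.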

108 Citations

33 Claims

1. A computer system comprising:
- a query element which produces queries to a large database of documents; and
  
  a language model part, which receives information responsive to said queries, and uses said information for said language model, by using some, but not all, of said information, for said language model.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13):
  - 2. A system as in claim 1, wherein said query element uses the language model to form said queries.
  - 3. A system as in claim 2, wherein the query element forms new queries as the language model is improved based on new information from said database of documents.
  - 4. A system as in claim 1, wherein said language model part includes a rejection model, which compares said information from said queries to other information which has already been determined as not being used for said language model.
  - 5. A system as in claim 4, wherein said language model part also includes a background model which classifies said information as background, and a topic model which classifies said information as being specific to a specific topic of the language model.
  - 6. A system as in claim 1, wherein said language model part models the information at least at a document level which includes a collection of sentences, and at an utterance level which includes less than said collection of sentences.
  - 7. A system as in claim 1, wherein said language model part analyzes said information, and adds said information to a language model only if said information reduces a total relative entropy with respect to the data distribution.
  - 8. A system as in claim 7, wherein said language model compares new information to previously obtained information, and uses said new information only when it is not sufficiently similar to previously obtained information.
  - 9. A system as in claim 3, wherein said language model includes a background model which is not specific to a topic of the language model, a topic model which is specific to a specified topic of the language model, and wherein said query element uses both said background and topic models to form said queries.
  - 10. A system as in claim 9, wherein said language model further includes a rejection model which models information that should not be added to said language model, and wherein said rejection model forms said rejection model based on information that scores poorly using both said background model and said topic model.
  - 11. A system as in claim 1, wherein said large database of documents is the Internet, and said query element queries at least one part of the Internet.
  - 12. A system as in claim 1, wherein said large database of documents is a collection of documents within a company.
  - 13. A system as in claim 1, wherein said large database of documents is a database with more than 10,000 documents.
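
Claim 7 conditions acceptance on a reduction in relative entropy. One possible reading, sketched below purely as an assumption, is a greedy pass that keeps a candidate only if adding it lowers the Kullback-Leibler divergence between a reference (in-topic) unigram distribution and the distribution of the text selected so far. The function names, the unigram simplification, and the add-alpha smoothing are choices made for this example and are not taken from the claims.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, vocab, alpha=1.0):
    """KL(P || Q) over `vocab`, with add-alpha smoothing on both count tables."""
    p_total = sum(p_counts[w] for w in vocab) + alpha * len(vocab)
    q_total = sum(q_counts[w] for w in vocab) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def select(reference_tokens, candidates):
    """Greedily keep token lists that lower KL(reference || selected)."""
    reference = Counter(reference_tokens)
    # A shared, sorted vocabulary keeps the comparison deterministic.
    vocab = sorted(set(reference_tokens).union(*candidates))
    selected = Counter()
    kept = []
    current = kl_divergence(reference, selected, vocab)
    for tokens in candidates:
        trial = selected + Counter(tokens)
        score = kl_divergence(reference, trial, vocab)
        if score < current:  # the candidate moves the selection toward the reference
            selected, current = trial, score
            kept.append(tokens)
    return kept
```

A greedy pass depends on the order of the candidates, so a real system would likely revisit earlier decisions on later iterations; that refinement is left out here.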

14. A method, comprising:
- querying a large database of documents which includes more than 10,000 documents, and includes documents which are directed to a plurality of different topics;
  
  receiving information responsive to said querying; and
  
  using said information for a language model by classifying said documents, and using some, but not all, of said information, for said language model.
- Dependent claims (15, 16, 17, 18, 19, 20, 21):
  - 15. A method as in claim 14, wherein said querying comprises using the language model to form queries.
  - 16. A method as in claim 14, wherein said using comprises using said information to improve said language model, and using an improved language model to form new queries.
  - 17. A method as in claim 16, wherein said using further comprises using a background model which is generic to a plurality of topics, a topic specific model, which is specific to a specified topic.
  - 18. A method as in claim 14, further comprising using said information both at a document level and at an utterance level, wherein a document level includes a plurality of sentences, and said utterance level includes less than said plurality of sentences.
  - 19. A method as in claim 14, wherein said using comprises analyzing said information to determine if said information adds new information to said language model, and adding said information only if said information adds said new information to said language model.
  - 20. A method as in claim 14, wherein said large database of documents is the Internet.
  - 21. A method as in claim 14, wherein said using comprises using a rejection model, which models information received responsive to said querying, that has not been added to the language model, and rejecting subsequent information based on said rejection model.
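
Claims 15 to 17 describe forming queries from the language model, with a topic-specific model alongside a generic background model. The claims do not say how terms are chosen, so the sketch below assumes one plausible scoring: each topic word is ranked by its contribution to the relative entropy between the topic and background unigram distributions, and the top `k` words become the query. The function name `query_terms` and the smoothing constant are hypothetical.

```python
import math
from collections import Counter


def query_terms(topic_tokens, background_tokens, k=3, alpha=1.0):
    """Return the k topic words that most distinguish the topic from the background."""
    topic = Counter(topic_tokens)
    background = Counter(background_tokens)
    vocab = set(topic) | set(background)
    t_total = sum(topic.values()) + alpha * len(vocab)
    b_total = sum(background.values()) + alpha * len(vocab)

    scores = {}
    for word in topic:
        p_topic = (topic[word] + alpha) / t_total
        p_background = (background[word] + alpha) / b_total
        # Pointwise term of KL(topic || background): large when a word is
        # frequent in the topic model but rare in the background model.
        scores[word] = p_topic * math.log(p_topic / p_background)

    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Common function words score low or negative under this measure because the background model already predicts them, which keeps them out of the query without a separate stop-word list.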

22. A method comprising:
- using a language model to form queries to a plurality of documents, which documents include at least some documents that have information about a topic, and at least other documents which do not have information about said topic;
  
  receiving information from said documents, responsive to said queries; and
  
  classifying said information and using said information to modify said language model.
- Dependent claims (23):
  - 23. A method as in claim 22, wherein said using comprises using the information to improve the language model, and using an improved language model to form new queries.

24. A method, comprising:
- accessing a plurality of documents which includes some documents that include information about a topic and other documents that do not include information about said topic;
  
  comparing information from said documents to a rejection model which represents a model of information that is not sufficiently relevant to said topic to use as a language model for said topic;
  
  rejecting information which is not sufficiently relevant; and
  
  using information which is sufficiently relevant for said language model.
- Dependent claims (25, 26, 27, 28, 29, 30, 31, 32, 33):
  - 25. A method as in claim 24, further comprising using said language model to form a query which queries the Internet, and using responses from said query as documents used by said obtaining.
  - 26. A method as in claim 24, wherein said comparing compares both documents as a whole and also compares utterances within the documents.
  - 27. A method as in claim 24, further comprising weighting parts of the documents according to relevance to the topic.
  - 28. A method as in claim 24, wherein said language model includes a background language model, representative of a topic independent model, and a topic language model representative of the topic.
  - 29. A method as in claim 28, wherein said comparing comprises determining a relative entropy comparison of the background model and the topic model.
  - 30. A method as in claim 24, further comprising forming a topic specific language model using said information which is sufficiently relevant, and using said topic specific language model to update said language model.
  - 31. A method as in claim 30, further comprising using said language model for speech recognition.
  - 32. A method as in claim 25, further comprising forming a list of Internet sites based on said rejecting, and rejecting use of Internet sites that are on said list.
  - 33. A method as in claim 32, wherein said forming comprises identifying queries and web addresses URLS which provide data for improving the language model, by measuring the gains in model at each of a plurality of iterations.
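
Claim 32 refers to a list of Internet sites built from rejection decisions. A minimal sketch of such a list is shown below, assuming that a site is blocked once enough of its documents have been seen and a high enough fraction of them was rejected. The class name `SiteFilter` and both thresholds are hypothetical, and the per-iteration gain measurement mentioned in claim 33 is not modeled.

```python
from collections import defaultdict
from urllib.parse import urlparse


class SiteFilter:
    """Tracks per-host rejection rates and blocks hosts that are mostly rejected."""

    def __init__(self, min_seen=3, max_reject_rate=0.8):
        self.min_seen = min_seen                # assumed minimum evidence per host
        self.max_reject_rate = max_reject_rate  # assumed blocking threshold
        self.seen = defaultdict(int)
        self.rejected = defaultdict(int)

    def record(self, url, was_rejected):
        """Record the accept/reject decision made for a document at `url`."""
        host = urlparse(url).netloc
        self.seen[host] += 1
        if was_rejected:
            self.rejected[host] += 1

    def blocked(self, url):
        """True if the host of `url` is on the rejection list."""
        host = urlparse(url).netloc
        seen = self.seen.get(host, 0)
        if seen < self.min_seen:
            return False
        return self.rejected.get(host, 0) / seen >= self.max_reject_rate
```

Checking `blocked` before fetching a result lets the query loop skip sites that have repeatedly produced rejected text.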

Current Assignee: University of Southern California
Original Assignee: University of Southern California
Inventors: Georgiou, Panayiotis; Narayanan, Shrikanth; Sethy, Abhinav
Granted Patent: US 7,739,286 B2
US Class Current: 704/10

CPC Class Codes

G06F 40/216   using statistical methods

G10L 15/18   using natural language mode...

G10L 15/183   using context dependencies,...

G10L 15/197   Probabilistic grammars, e.g...
