Apparatus and method for building domain-specific language models
Abstract
Disclosed are a method and an apparatus for building a domain-specific language model for use in language processing applications, e.g., speech recognition. A reference language model is generated based on a relatively small seed corpus containing linguistic units relevant to the domain. An external corpus containing a large number of linguistic units is accessed. Using the reference language model, linguistic units that have a sufficient degree of relevance to the domain are extracted from the external corpus. The reference language model is then updated based on the seed corpus and the extracted linguistic units. The process may be repeated iteratively until the language model is of satisfactory quality. The language model building technique may be further enhanced by combining it with mixture modeling or class-based modeling.
21 Claims
1. A method for building a language model specific to a domain, comprising the steps of:
a) building a reference language model based on a seed corpus containing linguistic units relevant to said domain;
b) accessing an external corpus containing a large number of linguistic units;
c) using said reference language model, selectively extracting linguistic units from said external corpus that have a sufficient degree of relevance to said domain; and
d) updating said reference language model based on said seed corpus and said extracted linguistic units.
2. The method of claim 1, further comprising the steps of:
measuring quality of said updated language model; and
repeating steps b), c) and d) if the measured quality is determined to be below a quality threshold, otherwise defining said updated language model as a final language model.
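Steps a) through d) of claim 1 and the quality-gated repetition of claim 2 can be sketched as a loop. The following is a minimal illustration, not the patented implementation: the add-one-smoothed unigram model, the helper names (`build_model`, `perplexity`, `build_domain_model`), the fixed `ppl_threshold`, and the `max_iters` cap are all assumptions made for the example, and test-set perplexity stands in for the quality measure.

```python
import math
from collections import Counter

def build_model(sentences):
    """Add-one-smoothed unigram model (a hypothetical stand-in for the reference model)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(model, sentences):
    """Perplexity of the model over all words of the given non-empty sentences."""
    words = [w for s in sentences for w in s.split()]
    log_prob = sum(math.log(model(w)) for w in words)
    return math.exp(-log_prob / len(words))

def build_domain_model(seed, external, test, ppl_threshold, quality_threshold, max_iters=5):
    model = build_model(seed)                    # step a): model from the seed corpus
    relevant = []
    remaining = list(external)                   # step b): access the external corpus
    for _ in range(max_iters):
        # step c): keep units the current model scores as sufficiently relevant
        picked = [s for s in remaining if perplexity(model, [s]) < ppl_threshold]
        if not picked:
            break                                # nothing new to learn from
        relevant += picked
        remaining = [s for s in remaining if s not in picked]
        model = build_model(seed + relevant)     # step d): update from seed + extracted units
        if perplexity(model, test) <= quality_threshold:
            break                                # claim 2: quality is acceptable, stop
    return model, relevant
```

In this sketch lower perplexity on the test corpus means higher quality, so the loop repeats while the measured perplexity stays above `quality_threshold`.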
3. The method of claim 2 wherein the step of measuring quality comprises calculating perplexity of the updated reference language model using a test corpus containing linguistic units relevant to said domain.
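For reference, perplexity is conventionally defined as the exponentiated average negative log-probability that the model assigns to the words of the test corpus. The claim does not fix a formula; the symbols below ($N$ for the number of words in the test corpus, $h_i$ for the history preceding word $w_i$) are notation assumed here for illustration.

```latex
\mathrm{PP} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid h_i) \right)
```

A lower value indicates that the model predicts the in-domain test text better.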
4. The method of claim 2 wherein the step of measuring quality comprises:
providing a test corpus containing linguistic units relevant to said domain; and
evaluating speech recognition accuracy for said test corpus using said updated reference language model.
5. The method of claim 2 wherein the step of measuring quality comprises comparing the size of linguistic units extracted during a current linguistic unit extraction iteration to the size of linguistic units extracted during at least one prior extraction iteration.
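One way to read the size comparison of claim 5 is as a convergence test: when the latest iteration extracts far fewer units than the previous one, further iterations are unlikely to help. The sketch below assumes that reading; the function name `extraction_has_converged` and the `min_ratio` cutoff are hypothetical and do not come from the claim.

```python
def extraction_has_converged(extracted_sizes, min_ratio=0.1):
    """Return True when the latest iteration extracted few units relative to the prior one.

    extracted_sizes: number of units extracted in each iteration, oldest first.
    min_ratio: hypothetical cutoff for (current size / previous size).
    """
    if len(extracted_sizes) < 2:
        return False                  # nothing to compare against yet
    previous, current = extracted_sizes[-2], extracted_sizes[-1]
    if previous == 0:
        return True                   # the prior iteration already found nothing
    return current / previous < min_ratio
```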
6. The method of claim 1 wherein said step c) is performed by computing perplexity scores for individual linguistic units from said external corpus and selectively extracting those linguistic units having a perplexity score below a perplexity threshold.
7. The method of claim 6 wherein said perplexity threshold is computed dynamically, and corresponds to a percentile rank of perplexity measures of the linguistic units of said seed corpus, calculated according to the latest reference language model.
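The perplexity filter of claim 6 and the dynamically computed threshold of claim 7 can be illustrated as follows. This is a sketch under stated assumptions: the nearest-rank percentile method, the function names, and the `unit_perplexity` callback (which would score a unit with the latest reference model) are choices made for the example, not details given in the claims.

```python
import math

def percentile_threshold(seed_perplexities, percentile):
    """Perplexity at the given percentile rank of the seed corpus (nearest-rank method).

    seed_perplexities: per-unit perplexities of the seed corpus under the latest model.
    percentile: integer rank from 1 to 100 (a hypothetical tuning parameter).
    """
    ranked = sorted(seed_perplexities)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    return ranked[rank - 1]

def extract_relevant(units, unit_perplexity, threshold):
    """Keep the external-corpus units whose perplexity is below the threshold."""
    return [u for u in units if unit_perplexity(u) < threshold]
```

Because the threshold is recomputed from the seed corpus each time the reference model is updated, it tracks the model as it changes rather than staying fixed.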
8. The method of claim 1, further comprising the steps of:
forming N subcorpora of linguistic units from said linguistic units extracted from said external corpus, grouped according to degree of relevance to said domain;
building N language models based on said seed corpus and said N subcorpora, respectively; and
wherein said step of updating said reference language model includes mixing said N language models.
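The grouping and mixing steps of claim 8 can be sketched as bucketing extracted units by perplexity and then interpolating the per-bucket models. The claim does not say how the models are mixed; linear interpolation is assumed here as one common choice, and the names `group_by_relevance`, `mix_models`, the perplexity `boundaries`, and the fixed `weights` are hypothetical.

```python
from bisect import bisect_right

def group_by_relevance(scored_units, boundaries):
    """Split (unit, perplexity) pairs into len(boundaries) + 1 subcorpora.

    boundaries: ascending perplexity cut points; lower perplexity = more relevant.
    """
    subcorpora = [[] for _ in range(len(boundaries) + 1)]
    for unit, ppl in scored_units:
        subcorpora[bisect_right(boundaries, ppl)].append(unit)
    return subcorpora

def mix_models(models, weights):
    """Linearly interpolate N models: P(w) = sum_i weight_i * P_i(w)."""
    if len(models) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("need one weight per model, summing to 1")
    return lambda word: sum(lam * m(word) for lam, m in zip(weights, models))
```

Each of the N models would be built from the seed corpus plus one subcorpus before being passed to `mix_models`; in practice the interpolation weights are usually tuned on held-out in-domain data rather than fixed by hand.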
9. The method of claim 1 wherein said linguistic units of said seed corpus and said external corpus comprise sentences.
10. The method of claim 1, further comprising the step of generating word classes from said linguistic units of said seed corpus and said linguistic units extracted from said external corpus; and
wherein said step d) of updating said reference language model is performed in accordance with said word classes so as to construct said updated reference language model as a class-based language model.
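A class-based model of the kind named in claim 10 typically factors a word probability into a class-transition term and a word-given-class term, P(w | w_prev) = P(c | c_prev) * P(w | c). The sketch below assumes a bigram model with unsmoothed maximum-likelihood estimates and a hand-supplied `word_class` mapping; the claim instead generates word classes from the corpora, and it does not specify the model order or estimator.

```python
from collections import Counter

def train_class_bigram(sentences, word_class):
    """Class-based bigram: P(w | prev) = P(class(w) | class(prev)) * P(w | class(w))."""
    class_bigrams, class_context = Counter(), Counter()
    word_counts, class_counts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        classes = [word_class[w] for w in words]
        for w, c in zip(words, classes):
            word_counts[w] += 1
            class_counts[c] += 1
        for prev_c, cur_c in zip(classes, classes[1:]):
            class_bigrams[(prev_c, cur_c)] += 1
            class_context[prev_c] += 1

    def prob(word, prev_word):
        c, prev_c = word_class[word], word_class[prev_word]
        if class_context[prev_c] == 0:
            return 0.0  # class never seen as a context (no smoothing in this sketch)
        transition = class_bigrams[(prev_c, c)] / class_context[prev_c]
        emission = word_counts[word] / class_counts[c]
        return transition * emission

    return prob
```

Sharing statistics across the members of a class is what lets such a model cope with the sparse counts of a small seed corpus.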
11. The method of claim 1 wherein said step d) of updating said reference language model is performed after a predetermined number of linguistic units have been selectively extracted from said external corpus in step c).
12. An apparatus for building a language model for a specific domain, comprising:
a seed corpus containing linguistic units relevant to said domain;
a language model constructor for building a reference language model from said seed corpus;
a corpus extractor operative to access an external corpus and, using said reference language model, to selectively extract linguistic units which have a sufficient degree of relevance to said domain;
wherein said language model constructor updates said reference language model based on said seed corpus and said extracted linguistic units.
a test corpus containing linguistic units relevant to said domain; and
said model checker measuring quality of said updated reference language model with at least one of:
(i) a speech recognition engine to measure speech recognition accuracy of said reference language model using linguistic units of said test corpus;
(ii) a perplexity calculator to calculate perplexity of said updated reference language model using said test corpus; and
(iii) an incremental size evaluator for evaluating the number of linguistic units selectively extracted from said external corpus during a current language building iteration.
15. The apparatus of claim 12 wherein said sufficient degree of relevance is dynamically determined by a threshold parameter generator of said corpus extractor which computes a perplexity threshold corresponding to a percentile rank of perplexity measures of the linguistic units of said seed corpus according to the latest version of said reference language model.
16. The apparatus of claim 12, further including a relevant corpus for storing said selectively extracted linguistic units, said relevant corpus comprising a plurality N of subcorpora grouped according to relevance to said domain, each dedicated to storing plural of said selectively extracted linguistic units falling within a certain range of relevance to said domain;
wherein said language model constructor is operative to construct N reference language models based on said seed corpus and said N subcorpora, respectively; and
said apparatus further includes a language model mixer to mix said N reference language models to form said updated reference language model.
17. The apparatus of claim 16, further comprising:
a test corpus containing linguistic units relevant to said domain; and
a model checker for measuring quality of said updated reference language model with at least one of:
(i) a speech recognition engine to measure speech recognition accuracy of said reference language model using linguistic units of said test corpus;
(ii) a perplexity calculator to calculate perplexity of said updated reference language model using said test corpus; and
(iii) an incremental size evaluator for evaluating the number of linguistic units selectively extracted from said external corpus during a current language building iteration.
18. The apparatus of claim 12, further comprising a word class generator for generating word classes from said linguistic units of said seed corpus and said linguistic units extracted from said external corpus; and
wherein said language model constructor updates said reference language model in accordance with said word classes so as to construct said updated reference language model as a class-based language model.
19. The apparatus of claim 12 wherein said sufficient degree of relevance is higher for an initial iteration of external corpus linguistic unit extraction than for subsequent iterations.
20. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to provide method steps for building a language model specific to a domain, said method steps comprising:
a) building a reference language model based on a seed corpus containing linguistic units relevant to said domain;
b) accessing an external corpus containing a large number of linguistic units;
c) using said reference language model, selectively extracting linguistic units from said external corpus that have a sufficient degree of relevance to said domain; and
d) updating said reference language model based on said seed corpus and said extracted linguistic units.
Specification