Statistical language model for inflected languages

US 5,835,888 A
Filed: 06/10/1996
Issued: 11/10/1998
Est. Priority Date: 06/10/1996
Status: Expired due to Fees

Abstract

A statistical language model for inflected languages, which have very large vocabularies, is generated by splitting words into stems, prefixes and endings, and deriving trigrams for the stems, endings and prefixes. The statistical dependence of endings and prefixes on each stem is also obtained, and the resulting language model is a weighted sum of these scores.
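
As a rough illustration of the abstract, the sketch below counts trigrams over a stream of word components and combines several component scores as a weighted sum. The function names, the component names and all numeric scores and weights are hypothetical placeholders chosen for this example; the record does not say how the weights are chosen or how the probabilities are smoothed.

```python
from collections import Counter


def trigram_counts(stream):
    """Count consecutive triples in a sequence of word components."""
    return Counter(zip(stream, stream[1:], stream[2:]))


def weighted_score(component_scores, weights):
    """Combine per-component scores as a weighted sum (linear interpolation)."""
    if set(component_scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    return sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical stem stream; a real corpus would be far larger.
stems = ["walk", "fast", "walk", "fast", "walk"]
counts = trigram_counts(stems)

# Hypothetical component scores and weights, for illustration only.
score = weighted_score(
    {"stem_trigram": 0.2, "ending_trigram": 0.4, "ending_given_stem": 0.5},
    {"stem_trigram": 0.5, "ending_trigram": 0.25, "ending_given_stem": 0.25},
)
```

Keeping the weights in a dictionary keyed by component name makes it harder to pair a weight with the wrong score when further components, such as prefix trigrams, are added.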

21 Claims

1. A method for generating a language model, comprising:
- providing to a classifier a corpus of words having a given sequence;
  
  performing a plurality of transformations on the words in the corpus, while preserving the given sequence, to generate a plurality of classes, each of the plurality of classes having a different stream of word components associated therewith, said different stream of word components being generated in accordance with one of the transformations;
  
  estimating statistical data representing each of the classes; and
  
  weighting and combining the statistical data to form the language model.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20):
- - 2. The method of claim 1, wherein one of the transformations identifies a stem and ending, if present, in each word, and links the stem and ending via a linking character if an ending is present, and appends the linking character at the end of the stem if no ending is present.
  - 3. The method of claim 2, wherein the estimated statistical data includes a distribution of the occurrence of each of a plurality of endings, given each stem in the corpus.
  - 4. The method of claim 2, wherein the spellings for words in the model are split in the same place that words are split into stems and endings.
  - 5. The method of claim 1, wherein one of the transformations deletes the ending from each word in the corpus.
  - 6. The method of claim 5, wherein the estimated statistical data includes an n-gram distribution of each stem in the corpus.
  - 7. The method of claim 1, wherein one of the transformations removes the stem from all words in the corpus.
  - 8. The method of claim 7, wherein the estimated statistical data includes an n-gram distribution of each ending in the corpus.
  - 9. The method of claim 1, wherein one of the transformations separates the stem and ending, if present, of each word in the corpus.
  - 10. The method of claim 9, wherein the estimated statistical data includes an n-gram distribution of stems and endings, in their natural order, for each stem and ending in the corpus.
  - 11. The method of claim 1, wherein the plurality of the transformations comprises:
    - a first transformation that identifies a stem and ending, if present, in each word, and links the stem and ending with a linking character if an ending is present, and appends the linking character at the end of the stem if no ending is present;
      
      a second transformation that deletes the ending, if present, from each word in the corpus;
      
      a third transformation that deletes the stem from each word in the corpus;
      
      a fourth transformation that separates the stem and ending, if present, of each word in the corpus.
  - 12. The method of claim 11, wherein the estimated statistical data comprises:
    - a distribution of the occurrence of each of a plurality of endings, given each stem in the corpus;
      
      an n-gram distribution of each stem in the corpus;
      
      an n-gram distribution of each ending in the corpus;
      
      an n-gram distribution of stems and endings, in their natural order, for each stem and ending in the corpus.
  - 13. The method of claim 12, wherein a zero n-gram probability is assigned to stem-ending pairs not occurring in the language.
  - 14. The method of claim 1, further comprising identifying a stem and an ending, if present, in each word in the corpus by comparing a list of endings in the language with each word in the corpus.
  - 15. The method of claim 1, further comprising identifying a stem and an ending in each word in the corpus by reference to a set of rules.
  - 20. The method of claim 1, wherein the possible endings in a language are contained in a precompiled list.
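
The four transformations of claim 11, together with the ending-list lookup of claims 14 and 20, might be sketched as follows. This is only one reading of the claim language: the ending list, the `+` linking character, the longest-match rule and the example words are all assumptions made for illustration, and the empty-string placeholder in the third stream is a choice made here to keep the streams aligned with the original word sequence.

```python
ENDINGS = ("ami", "ou", "a", "y")  # hypothetical precompiled ending list
LINK = "+"                          # hypothetical linking character


def split_word(word, endings=ENDINGS):
    """Return (stem, ending); ending is '' when no listed ending matches."""
    for ending in sorted(endings, key=len, reverse=True):  # longest match first
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], ending
    return word, ""


def transform_linked(words):
    """First transformation: stem and ending joined by the linking character."""
    return [f"{stem}{LINK}{ending}" for stem, ending in map(split_word, words)]


def transform_stems(words):
    """Second transformation: delete the ending, keeping only the stem."""
    return [split_word(w)[0] for w in words]


def transform_endings(words):
    """Third transformation: delete the stem, keeping only the ending ('' if none)."""
    return [split_word(w)[1] for w in words]


def transform_separated(words):
    """Fourth transformation: stems and endings as separate tokens, in order."""
    stream = []
    for w in words:
        stem, ending = split_word(w)
        stream.append(stem)
        if ending:
            stream.append(ending)
    return stream


words = ["zena", "zenami", "hrad"]  # made-up example words
linked = transform_linked(words)        # ['zen+a', 'zen+ami', 'hrad+']
stems = transform_stems(words)          # ['zen', 'zen', 'hrad']
endings = transform_endings(words)      # ['a', 'ami', '']
separated = transform_separated(words)  # ['zen', 'a', 'zen', 'ami', 'hrad']
```

Each function walks the words in their given order, so every resulting stream preserves the sequence of the corpus, as claim 1 requires.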

16. A system for generating a language model, comprising:
- means for storing a corpus of words having a meaningful sequence;
  
  classification means for performing a plurality of transformations on each word in the corpus, while preserving the sequence, to generate a plurality of classes, each of the plurality of classes having a different stream of word components associated therewith, said different stream of word components being generated in accordance with one of the transformations;
  
  means for estimating statistical data representing each of the classes;
  
  means for weighting and combining the statistical data to form the language model.
- Dependent claims (17):
- - 17. A speech recognition system including the system of claim 16, comprising:
    - means for computing the likelihood of an input word of speech based on the weighted combination of statistical data forming the language model.

18. A method for generating a language model for a language, comprising the steps of:
- a) identifying a plurality of stems and endings in the language;
  
  b) extracting the identified stems and endings from a plurality of words in the language;
  
  c) given a set of text data in the language, converting the text to a text composed of stems and endings, and considering stems and endings as separate dictionary entries;
  
  d) constructing a vocabulary from the text generated in step c);
  
  e) for each stem, generating a set of probabilities for endings, the set of probabilities for endings having a size given by the number of endings;
  
  f) using the results of step b), generating a language model for any combination of stems and endings which would give the lowest recognition score.
- Dependent claims (19):
- - 19. The method of claim 18, wherein the possible endings in a language are identified using a known vocabulary set for the language.
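
Step e) of claim 18 calls for one set of ending probabilities per stem, sized by the number of endings. A minimal sketch, assuming plain relative-frequency estimates with no smoothing (the claims do not specify the estimator) and made-up (stem, ending) pairs, could look like this. The empty string stands for "no ending", which is a convention adopted here rather than anything stated in the claims.

```python
from collections import Counter, defaultdict


def ending_distributions(pairs, endings):
    """Estimate P(ending | stem) by relative frequency.

    Every stem gets one probability per entry in `endings`, so each
    distribution has a size given by the number of endings.
    """
    counts = defaultdict(Counter)
    for stem, ending in pairs:
        counts[stem][ending] += 1
    table = {}
    for stem, ending_counts in counts.items():
        total = sum(ending_counts.values())
        table[stem] = {e: ending_counts[e] / total for e in endings}
    return table


# Made-up (stem, ending) pairs; '' marks a word with no ending.
pairs = [("zen", "a"), ("zen", "ami"), ("zen", "a"), ("hrad", "")]
endings = ["", "a", "ami"]
table = ending_distributions(pairs, endings)
```

Stem-ending pairs that never occur receive probability zero under this estimate, which is in line with the zero probability that claim 13 assigns to pairs not occurring in the language. The distributions sum to one only if every ending seen in `pairs` is also listed in `endings`.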

21. A method for generating a language model for a language, comprising the steps of:
- a) identifying a plurality of stems, endings and prefixes in the language;
  
  b) extracting the identified stems, endings and prefixes from a plurality of words in the language;
  
  c) given a set of text data in the language, converting the text to a text composed of prefixes, stems and endings, and considering prefixes, stems and endings as separate dictionary entries;
  
  d) constructing a vocabulary from the text generated in step c);
  
  e) for each stem, generating a set of probabilities for endings, a set of probabilities for prefixes and a set of probabilities for stems, the set of probabilities for prefixes having a size given by the number of prefixes, the set of probabilities for endings having a size given by the number of endings, and the set of probabilities for stems having a size given by the number of stems;
  
  f) using the results of step b), generating a language model for any combination of prefixes, stems and endings which would give the lowest recognition score.
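
Steps c) and d) of claim 21, converting text into prefixes, stems and endings and building a vocabulary from the result, might look roughly like the sketch below. The prefix and ending lists, the longest-match rule and the example words are hypothetical; a purely list-based prefix strip like this one will mis-split any word that merely happens to begin with a listed prefix, so a practical system would need additional rules or a lexicon.

```python
PREFIXES = ("ne", "pre")  # hypothetical prefix list
ENDINGS = ("ami", "a")    # hypothetical ending list


def split3(word, prefixes=PREFIXES, endings=ENDINGS):
    """Return (prefix, stem, ending); missing parts are empty strings."""
    prefix = next(
        (p for p in sorted(prefixes, key=len, reverse=True)
         if word.startswith(p) and len(word) > len(p)),
        "",
    )
    rest = word[len(prefix):]
    ending = next(
        (e for e in sorted(endings, key=len, reverse=True)
         if rest.endswith(e) and len(rest) > len(e)),
        "",
    )
    stem = rest[: len(rest) - len(ending)]
    return prefix, stem, ending


def to_components(words):
    """Step c): rewrite the text as a stream of prefixes, stems and endings."""
    return [part for w in words for part in split3(w) if part]


def build_vocabulary(components):
    """Step d): each distinct component becomes a separate dictionary entry."""
    return sorted(set(components))


words = ["nezenami", "hrad"]  # made-up example words
components = to_components(words)     # ['ne', 'zen', 'ami', 'hrad']
vocabulary = build_vocabulary(components)
```

Because many surface forms share the same components, a vocabulary built this way can be much smaller than one that lists every inflected form separately, which is the motivation given in the abstract.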

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Kanevsky, Dimitri; Roukos, Salim Estephan; Sedivy, Jan
Primary Examiner(s): Thomas, Joseph
Application Number: US08/662,726
Time in Patent Office: 883 Days
Field of Search: 704/9, 704/1, 704/10, 704/243, 704/241, 704/244, 704/255, 704/257, 704/256, 704/240, 707/531, 707/532, 707/533, 707/534
US Class Current: 704/9
CPC Class Codes:
G10L 15/14   using statistical models, e...
G10L 15/18   using natural language mode...
G10L 15/19   Grammatical context, e.g. d...
