Statistical language model for inflected languages
Abstract
A statistical language model for inflected languages with very large vocabularies is generated by splitting words into stems, prefixes and endings, and deriving trigrams for the stems, endings and prefixes. The statistical dependence of endings and prefixes on each stem is also obtained, and the resulting language model is a weighted sum of these scores.
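One way to read the "weighted sum" is as a linear combination of the component scores. The notation below is assumed here for illustration and does not come from the patent text: word i is taken to decompose into a prefix p_i, a stem s_i and an ending e_i, the weights are written as lambda_k, and the constraint that they sum to one is likewise an assumption.

```latex
S(w_i) = \lambda_1 P(s_i \mid s_{i-2}, s_{i-1})
       + \lambda_2 P(e_i \mid e_{i-2}, e_{i-1})
       + \lambda_3 P(p_i \mid p_{i-2}, p_{i-1})
       + \lambda_4 P(e_i \mid s_i)
       + \lambda_5 P(p_i \mid s_i),
\qquad \sum_{k=1}^{5} \lambda_k = 1 .
```

The first three terms correspond to the trigrams over stems, endings and prefixes; the last two correspond to the dependence of endings and prefixes on the stem.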
21 Claims
1. A method for generating a language model, comprising:
providing to a classifier a corpus of words having a given sequence;
performing a plurality of transformations on the words in the corpus, while preserving the given sequence, to generate a plurality of classes, each of the plurality of classes having a different stream of word components associated therewith, said different stream of word components being generated in accordance with one of the transformations;
estimating statistical data representing each of the classes; and
weighting and combining the statistical data to form the language model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20
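The four steps of claim 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the ending list, the `split_ending` helper, the two transformations, the use of bigrams rather than trigrams (to keep the example short), and the linear weighting are all hypothetical choices.

```python
from collections import Counter

# Hypothetical ending inventory, chosen only for illustration.
ENDINGS = ("ing", "ed", "s")


def split_ending(word):
    """Return (stem, ending); ending is "" when no listed ending matches."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending) + 1:
            return word[: -len(ending)], ending
    return word, ""


# Each transformation maps one word to one component, so every stream
# has the same length and order as the corpus (the sequence is preserved).
TRANSFORMS = {
    "stem": lambda w: split_ending(w)[0],
    "ending": lambda w: split_ending(w)[1],
}


def bigram_model(stream):
    """Maximum-likelihood estimate of P(cur | prev) for one stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    prev_counts = Counter(stream[:-1])

    def prob(prev, cur):
        if prev_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, cur)] / prev_counts[prev]

    return prob


def build_model(corpus, weights):
    """Generate one stream per transformation, estimate, then combine."""
    streams = {name: [f(w) for w in corpus] for name, f in TRANSFORMS.items()}
    models = {name: bigram_model(s) for name, s in streams.items()}

    def score(prev_word, word):
        # Assumed combination rule: weighted sum of per-stream probabilities.
        return sum(
            weights[name] * models[name](f(prev_word), f(word))
            for name, f in TRANSFORMS.items()
        )

    return streams, score
```

Calling `build_model(corpus, {"stem": 0.7, "ending": 0.3})` on a tokenized corpus returns the per-class streams and a `score(prev_word, word)` function. The weights here are arbitrary; in practice they would be tuned on held-out data.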
16. A system for generating a language model, comprising:
means for storing a corpus of words having a meaningful sequence;
classification means for performing a plurality of transformations on each word in the corpus, while preserving the sequence, to generate a plurality of classes, each of the plurality of classes having a different stream of word components associated therewith, said different stream of word components being generated in accordance with one of the transformations;
means for estimating statistical data representing each of the classes;
means for weighting and combining the statistical data to form the language model.
Dependent claims: 17
18. A method for generating a language model for a language, comprising the steps of:
a) identifying a plurality of stems and endings in the language;
b) extracting the identified stems and endings from a plurality of words in the language;
c) given a set of text data in the language, converting the text to a text composed of stems and endings, and considering stems and endings as separate dictionary entries;
d) constructing a vocabulary from the text generated in step c);
e) for each stem, generating a set of probabilities for endings, the set of probabilities for endings having a size given by the number of endings;
f) using the results of step b), generating a language model for any combination of stems and endings which would give the lowest recognition score.
Dependent claims: 19
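Steps a) through e) of claim 18 can be sketched as follows. Everything named here is an assumption made for illustration: the ending inventory, the longest-match splitting rule, the "+" prefix used to mark ending tokens as separate dictionary entries, and the relative-frequency estimate. Step f) is not shown, since the claim does not say how the recognition score is computed.

```python
from collections import Counter, defaultdict

# Step a (assumed): a hypothetical ending inventory. A real system would
# derive it from the morphology of the target language.
ENDINGS = ("ami", "ov", "a", "u")


def split_word(word):
    """Step b: split off the longest listed ending, if any."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], "+" + ending
    return word, "+"  # a bare "+" marks an empty ending


def convert(text):
    """Step c: rewrite the text as alternating stem and ending tokens."""
    tokens = []
    for word in text.split():
        tokens.extend(split_word(word))
    return tokens


def build_vocabulary(tokens):
    """Step d: stems and endings are separate dictionary entries."""
    return sorted(set(tokens))


def ending_probabilities(tokens):
    """Step e: for each stem, one probability per ending in the vocabulary."""
    stems, endings = tokens[0::2], tokens[1::2]
    all_endings = sorted(set(endings))
    counts = defaultdict(Counter)
    for stem, ending in zip(stems, endings):
        counts[stem][ending] += 1
    return {
        stem: {e: c[e] / sum(c.values()) for e in all_endings}
        for stem, c in counts.items()
    }
```

For each stem the returned table has one entry per ending, including zero entries for endings never seen with that stem, which mirrors the claim's requirement that the set of probabilities have a size given by the number of endings. A practical model would smooth those zeros rather than keep them.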
21. A method for generating a language model for a language, comprising the steps of:
a) identifying a plurality of stems, endings and prefixes in the language;
b) extracting the identified stems, endings and prefixes from a plurality of words in the language;
c) given a set of text data in the language, converting the text to a text composed of prefixes, stems and endings, and considering prefixes, stems and endings as separate dictionary entries;
d) constructing a vocabulary from the text generated in step c);
e) for each stem, generating a set of probabilities for endings, a set of probabilities for prefixes and a set of probabilities for stems, the set of probabilities for prefixes having a size given by the number of prefixes, the set of probabilities for endings having a size given by the number of endings, and the set of probabilities for stems having a size given by the number of stems;
f) using the results of step b), generating a language model for any combination of prefixes, stems and endings which would give the lowest recognition score.
Specification