Word classing for language modeling
First Claim
1. In a language model employing a classing function defining classes of words, each of the classes grouping words sharing a similar likelihood of appearing in a production application context, a method of optimizing the classing function comprising:
identifying a language context corresponding to a production application, the language context based on usage encountered by a language model invoked by the production application;
defining a training corpus having a set of clusters indicative of expected usage, the clusters being n-grams having a sequence of n words for defining a probability that the first n−1 words in the sequence is followed by word n in the sequence; and
building a language model from a classing function applied to the training corpus, the classing function optimized to correspond to usage in the identified language context using class-based and word-based features by computing a likelihood of a word in an n-gram and a frequency of a word within a class of the n-gram, optimizing the classing function further comprising;
employing a word based classing approach;
backing off, if the word based approach indicates a null probability; and
employing a class based approach;
further comprising;
determining seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words;
employing the word based classification if the cluster has a previous occurrence, identifying a discount parameter, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster;
backing off using the discount parameter and employing a class based approach if the cluster is unseen, unseen clusters based on occurrence of any of the words in the cluster, the unseen cluster having a nonzero probability if any word in the class of words has occurred;
the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
the discount parameter defining an absolute discounting model, further comprising;
identify a discount parameter indicative of a reduction of a word count of words in a cluster;
determining if the cluster is to be pruned or retained in the corpus;
subtracting the discount parameter from a maximum count of the observed word based count of the cluster to compute a discounted count;
or defining the discount count of the cluster as zero if the cluster is pruned.
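The back-off steps recited above can be illustrated with a minimal bigram sketch. It is not the patented implementation. It assumes a hand-written word-to-class map, a discount of 0.5, and one reading of the discounting step in which the discounted count is the observed count minus the discount, floored at zero, or zero when the cluster is pruned. The names `BackoffModel`, `discounted_count`, `class_prob` and `prob` are hypothetical, as is the toy corpus.

```python
from collections import Counter


class BackoffModel:
    """Bigram sketch: word-based estimate first, class-based estimate on back-off."""

    def __init__(self, tokens, word_class, discount=0.5, pruned=()):
        self.word_class = word_class   # hypothetical word -> class map
        self.discount = discount       # absolute discount (assumed value)
        self.pruned = set(pruned)      # bigrams treated as pruned
        self.vocab = sorted(set(tokens))
        classes = [word_class[t] for t in tokens]
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.history = Counter(tokens[:-1])
        self.unigrams = Counter(tokens)
        self.class_bigrams = Counter(zip(classes, classes[1:]))
        self.class_history = Counter(classes[:-1])
        self.class_unigrams = Counter(classes)

    def discounted_count(self, h, w):
        """Zero for a pruned bigram, else the count minus the discount (floored at 0)."""
        if (h, w) in self.pruned:
            return 0.0
        return max(self.bigrams[(h, w)] - self.discount, 0.0)

    def class_prob(self, h, w):
        """Class transition probability times the word's frequency within its class."""
        ch, cw = self.word_class[h], self.word_class[w]
        if self.class_history[ch] == 0 or self.class_unigrams[cw] == 0:
            return 0.0
        p_class = self.class_bigrams[(ch, cw)] / self.class_history[ch]
        p_member = self.unigrams[w] / self.class_unigrams[cw]
        return p_class * p_member

    def prob(self, h, w):
        total = self.history[h]
        seen = {}
        if total:
            for v in self.vocab:
                d = self.discounted_count(h, v)
                if d > 0:
                    seen[v] = d
        if w in seen:
            # Seen, retained bigram: word-based estimate from the discounted count.
            return seen[w] / total
        # Unseen or pruned bigram: share the mass freed by discounting among
        # the unseen words, in proportion to the class-based estimate.
        reserved = 1.0 - sum(seen.values()) / total if total else 1.0
        unseen_mass = sum(self.class_prob(h, v) for v in self.vocab if v not in seen)
        if unseen_mass == 0:
            # No class evidence either; a fuller model would back off further.
            return 0.0
        return reserved * self.class_prob(h, w) / unseen_mass


tokens = "fly to boston fly to denver drive to boston".split()
word_class = {"fly": "VERB", "drive": "VERB", "to": "PREP",
              "boston": "CITY", "denver": "CITY"}
model = BackoffModel(tokens, word_class)
print(model.prob("to", "boston"))    # seen bigram, word-based estimate
print(model.prob("denver", "fly"))   # unseen bigram, class-based back-off
```

In this toy corpus "denver fly" never occurs, yet it receives a nonzero probability because another CITY word is followed by a VERB. Passing `pruned={("to", "denver")}` forces that bigram onto the class-based path as well.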
Abstract
A language processing application employs a classing function optimized for the underlying production application context for which it is expected to process speech. A combination of class based and word based features generates a classing function optimized for a particular production application, meaning that a language model employing the classing function uses word classes having a high likelihood of accurately predicting word sequences encountered by a language model invoked by the production application. The classing function optimizes word classes by aligning the objective of word classing with the underlying language processing task to be performed by the production application. The classing function is optimized to correspond to usage in the production application context using class-based and word-based features by computing a likelihood of a word in an n-gram and a frequency of a word within a class of the n-gram.
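For orientation, the two quantities named at the end of the abstract correspond to the two factors of the standard class-based n-gram approximation. The form below is the textbook factorization, offered as background under the assumption that each word belongs to exactly one class; the notation c(w) for the class of word w is introduced here, and the patent's exact model may differ.

```latex
P(w_n \mid w_1 \dots w_{n-1})
  \approx
  P\bigl(c(w_n) \mid c(w_1) \dots c(w_{n-1})\bigr)\,
  P\bigl(w_n \mid c(w_n)\bigr)
```

The first factor is estimated from class counts and the second from the frequency of the word within its class.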
22 Claims
7. In a language model defining a probability for sequences of words, the language model invoked by a production application responsive to an end user for performing statistical language recognition services, a method of assigning words to classes comprising:
defining a language model for predicting a likelihood of sequences of words received from the production application, the language model having a classing function, the classing function for assigning words to classes, the word classes grouping words for receiving similar treatment as other words in the class;
identifying a clustering, the clustering defining the number of words in sequence to which a prediction applies, in which an n-gram cluster defines a probability that, for an n−1 sequence of words, the successive nth word will be found;
identifying a language context corresponding to the usage of the language as received by the production application;
defining the classing function, the classing function for scanning a learning set and identifying the word classes by;
employing a word based classing approach;
backing off, if the word based approach indicates a null probability; and
employing a class based approach;
further comprising;
determining seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words;
employing the word based classification if the cluster has a previous occurrence, identifying a discount parameter, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
backing off using the discount parameter and employing a class based approach if the cluster is unseen, unseen clusters based on occurrence of any of the words in the cluster, the unseen cluster having a nonzero probability if any word in the class of words has occurred, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster;
the discount parameter defining an absolute discounting model, further comprising;
identifying a discount parameter indicative of a reduction of a word count of words in a cluster;
determining if the cluster is to be pruned or retained in the corpus;
subtracting the discount parameter from a maximum count of the observed word based count of the cluster to compute a discounted count;
or defining the discount count of the cluster as zero if the cluster is pruned;
applying the classing function to the learning set to generate the word classes, the word classes indicative of words statistically likely to be employed based on predetermined sequences of words in the learning set; and
optimizing the classing function by selecting the word classes based on an objective of the production application, optimizing further including;
analyzing word counts and class counts of the learning set; and
analyzing word frequency within an assigned class;
the objective of the production application defined by the identified language context.
16. A computer program product on a non-transitory computer readable storage medium having instructions for performing a method of defining a classing function employed for training a language model, the language model invoked for linguistic processing supporting a production application, the method comprising:
identifying trigrams representative of a language context, the language context derived from word sequences likely to be received from the production application for predicting a word;
defining a language model responsive to the production application for the linguistic processing, the language model employing class-based processing, the classes grouping words for receiving similar treatment as other words in the class;
defining a classing function for partitioning the words into the classes, the classing function for optimizing the classes such that a trigram including a word in a particular class are significant predictors of trigrams including other words in the particular class, the classing function further comprising;
employing a word based classing approach;
backing off, if the word based approach indicates a null probability; and
employing a class based approach;
further comprising;
determining seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words;
employing the word based classification if the cluster has a previous occurrence, identifying a discount parameter, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
backing off using the discount parameter and employing a class based approach if the cluster is unseen, unseen clusters based on occurrence of any of the words in the cluster, the unseen cluster having a nonzero probability if any word in the class of words has occurred, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
the discount parameter defining an absolute discounting model, further comprising;
identifying a discount parameter indicative of a reduction of a word count of words in a cluster;
determining if the cluster is to be pruned or retained in the corpus;
subtracting the discount parameter from a maximum count of the observed word based count of the cluster to compute a discounted count;
or defining the discount count of the cluster as zero if the cluster is pruned;
the trigrams denoting a frequency of a sequence of words, the classing function optimizing the model toward the production application by identifying, for a particular trigram,
a count of the first word in the trigram;
a count of the second word in the trigram;
a count of the words in the class of the first word of the trigram;
a count of the words in the class of the second word of the trigram;
a frequency of the first word relative to other words in the class of the first word; and
a frequency of the second word relative to the other words in the class of the second word;
applying the defined classing function to a corpus of words for generating a model; and
employing the model for identifying, for a first and second word of a trigram received from the production application, a likelihood of each of a set of candidate third words for the trigram.
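The six per-trigram quantities listed in claim 16 can be gathered as in the sketch below. It is illustrative only: the function name `trigram_features`, the dictionary keys, the toy corpus and the word-to-class map are all hypothetical, and "a count of the words in the class" is read here as the total number of corpus tokens belonging to that class, which is one plausible interpretation rather than a confirmed one.

```python
from collections import Counter


def trigram_features(trigram, tokens, word_class):
    """Return word counts, class counts and in-class frequencies for the
    first two words of a trigram (the third word is the one being predicted)."""
    w1, w2, _ = trigram
    word_counts = Counter(tokens)
    # Assumed reading: a class count is the number of tokens in that class.
    class_counts = Counter(word_class[t] for t in tokens)
    c1, c2 = word_class[w1], word_class[w2]

    def in_class_frequency(word, cls):
        # Guard against classes that never occur in the corpus.
        return word_counts[word] / class_counts[cls] if class_counts[cls] else 0.0

    return {
        "count_w1": word_counts[w1],
        "count_w2": word_counts[w2],
        "class_count_w1": class_counts[c1],
        "class_count_w2": class_counts[c2],
        "freq_w1_in_class": in_class_frequency(w1, c1),
        "freq_w2_in_class": in_class_frequency(w2, c2),
    }


corpus = "fly to boston fly to denver drive to boston".split()
classes = {"fly": "VERB", "drive": "VERB", "to": "PREP",
           "boston": "CITY", "denver": "CITY"}
features = trigram_features(("fly", "to", "denver"), corpus, classes)
print(features)
```

A classing function could compare these values across candidate class assignments, though how the patent weighs them is not stated in the claims.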
Specification