Word classing for language modeling

US 9,367,526 B1
Filed: 07/26/2011
Issued: 06/14/2016
Est. Priority Date: 07/26/2011
Status: Active Grant

Abstract

A language processing application employs a classing function optimized for the underlying production application context for which it is expected to process speech. A combination of class based and word based features generates a classing function optimized for a particular production application, meaning that a language model employing the classing function uses word classes having a high likelihood of accurately predicting word sequences encountered by a language model invoked by the production application. The classing function optimizes word classes by aligning the objective of word classing with the underlying language processing task to be performed by the production application. The classing function is optimized to correspond to usage in the production application context using class-based and word-based features by computing a likelihood of a word in an n-gram and a frequency of a word within a class of the n-gram.
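
The factorization the abstract describes, a class-level n-gram likelihood combined with the frequency of a word within its class, can be illustrated with a minimal bigram sketch. This is an illustration under stated assumptions, not the patented method: the toy corpus, the class labels, and the function name `class_bigram_prob` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: a tiny corpus and a hard word-to-class map.
corpus = "call mom call dad text mom".split()
word_to_class = {"call": "VERB", "text": "VERB", "mom": "PERSON", "dad": "PERSON"}


def class_bigram_prob(prev, word, corpus, word_to_class):
    """Estimate P(word | prev) as P(class(word) | class(prev)) * P(word | class(word))."""
    class_seq = [word_to_class[w] for w in corpus]
    class_unigrams = Counter(class_seq)
    class_bigrams = Counter(zip(class_seq, class_seq[1:]))
    word_unigrams = Counter(corpus)

    c_prev, c_word = word_to_class[prev], word_to_class[word]
    # Number of times the previous word's class appears as a history.
    history_count = sum(n for (a, _), n in class_bigrams.items() if a == c_prev)
    if history_count == 0 or class_unigrams[c_word] == 0:
        return 0.0
    p_class = class_bigrams[(c_prev, c_word)] / history_count
    # Frequency of the word relative to all words in its class.
    p_word_in_class = word_unigrams[word] / class_unigrams[c_word]
    return p_class * p_word_in_class


# "text dad" never occurs as a word bigram, yet the class model still scores it.
p_unseen = class_bigram_prob("text", "dad", corpus, word_to_class)
```

In this toy corpus the word bigram "text dad" is unseen, so a purely word-based estimate would be zero, while the class-based estimate is nonzero because other members of the same classes have occurred together.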

22 Claims

1. In a language model employing a classing function defining classes of words, each of the classes grouping words sharing a similar likelihood of appearing in a production application context, a method of optimizing the classing function comprising:
- identifying a language context corresponding to a production application, the language context based on usage encountered by a language model invoked by the production application;
  
  defining a training corpus having a set of clusters indicative of expected usage, the clusters being n-grams having a sequence of n words for defining a probability that the first n−1 words in the sequence is followed by word n in the sequence; and
  
  building a language model from a classing function applied to the training corpus, the classing function optimized to correspond to usage in the identified language context using class-based and word-based features by computing a likelihood of a word in an n-gram and a frequency of a word within a class of the n-gram, optimizing the classing function further comprising;
  
  employing a word based classing approach;
  
  backing off, if the word based approach indicates a null probability; and
  
  employing a class based approach;
  
  further comprising;
  
  determining seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words;
  
  employing the word based classification if the cluster has a previous occurrence, identifying a discount parameter, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster;
  
  backing off using the discount parameter and employing a class based approach if the cluster is unseen, unseen clusters based on occurrence of any of the words in the cluster, the unseen cluster having a nonzero probability if any word in the class of words has occurred;
  
  the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
  
  the discount parameter defining an absolute discounting model, further comprising;
  
  identify a discount parameter indicative of a reduction of a word count of words in a cluster;
  
  determining if the cluster is to be pruned or retained in the corpus;
  
  subtracting the discount parameter from a maximum count of the observed word based count of the cluster to compute a discounted count;
  
  or defining the discount count of the cluster as zero if the cluster is pruned.
- View Dependent Claims (2, 3, 4, 5, 6, 20, 21, 22)
- - 2. The method of claim 1 further comprising defining a classing function for generating a custom language model for customizing classes to a particular production application by invoking a smoothing function for learning classes customized to a final model in system, the custom language model being optimized toward a production system by backing off the word based approach and employing a class based approach by invoking a combination of word and class n-gram features.
  - 3. The method of claim 2 wherein the n-gram is a trigram and the production application is executable on a wireless communication device.
  - 4. The method of claim 2 further comprising a clustering, the clustering defining the number of words in sequence to which a prediction applies, the clustering being at least 3.
  - 5. The method of claim 4 further comprising identifying seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words, wherein the classing function identifies unseen clusters by:
    - employing the word based classification if the cluster has a previous occurrence, and employing the class based classification if the cluster is previously unseen.
  - 6. The method of claim 5 further comprising optimizing word classes by aligning the objective of the word classing function with the serviceable language processing task by selecting classes that directly optimize the likelihood of the language model predicting a particular cluster.
  - 20. The method of claim 1 further comprising applying the discount parameter to compute a reduction from an observed maximum word count of a cluster by the discount parameter to compute a discounted count of a cluster.
  - 21. The method of claim 20 wherein the discount parameter reduces a count based on a word count of the words in the cluster toward a class count of the words in the cluster.
  - 22. The method of claim 1 further comprising selecting the classes based on a production application defining a context in which the clusters are expected to be encountered.
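
The discounting and backoff steps recited in claim 1 can be sketched roughly as follows. This is one possible reading, not the patent's implementation: it uses an interpolated form of absolute discounting over bigrams (rather than strict backoff) so that the probabilities sum to one without extra normalization, and the names `discounted_count`, `smoothed_prob`, `class_prob`, and the toy counts are hypothetical.

```python
from collections import Counter


def discounted_count(count, discount, pruned=False):
    """Absolute discounting: subtract a fixed discount, or use zero if the n-gram is pruned."""
    if pruned:
        return 0.0
    return max(count - discount, 0.0)


def smoothed_prob(history, word, bigram_counts, class_prob, discount=0.5, pruned=frozenset()):
    """Word-based estimate with discounted counts, plus freed mass given to a class-based estimate.

    bigram_counts: Counter of (history, word) pairs.
    class_prob: callable (history, word) -> probability, assumed to sum to 1 over words.
    pruned: set of (history, word) pairs removed from the model.
    """
    followers = {w: n for (h, w), n in bigram_counts.items() if h == history}
    total = sum(followers.values())
    if total == 0:
        # History never observed: rely entirely on the class-based estimate.
        return class_prob(history, word)
    kept = {w: discounted_count(n, discount, (history, w) in pruned)
            for w, n in followers.items()}
    # Probability mass removed by discounting and pruning.
    freed_mass = 1.0 - sum(kept.values()) / total
    return kept.get(word, 0.0) / total + freed_mass * class_prob(history, word)


# Hypothetical toy counts and a stand-in class-based estimate (values sum to 1).
bigram_counts = Counter({("call", "mom"): 3, ("call", "dad"): 1})


def class_prob(history, word):
    return {"mom": 0.5, "dad": 0.25, "bob": 0.25}.get(word, 0.0)


p_unseen = smoothed_prob("call", "bob", bigram_counts, class_prob)
```

Here "call bob" has no word-based count, so its probability comes entirely from the mass that discounting moved toward the class-based estimate.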

7. In a language model defining a probability for sequences of words, the language model invoked by a production application responsive to an end user for performing statistical language recognition services, a method of assigning words to classes comprising:
- defining a language model for predicting a likelihood of sequences of words received from the production application, the language model having a classing function, the classing function for assigning words to classes, the word classes grouping words for receiving similar treatment as other words in the class;
  
  identifying a clustering, the clustering defining the number of words in sequence to which a prediction applies, in which an n-gram cluster defines a probability that, for an n−1 sequence of words, the successive nth word will be found;
  
  identifying a language context corresponding to the usage of the language as received by the production application;
  
  defining the classing function, the classing function for scanning a learning set and identifying the word classes by;
  
  employing a word based classing approach;
  
  backing off, if the word based approach indicates a null probability; and
  
  employing a class based approach;
  
  further comprising;
  
  determining seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words;
  
  employing the word based classification if the cluster has a previous occurrence, identifying a discount parameter, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
  
  backing off using the discount parameter and employing a class based approach if the cluster is unseen, unseen clusters based on occurrence of any of the words in the cluster, the unseen cluster having a nonzero probability if any word in the class of words has occurred, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster;
  
  the discount parameter defining an absolute discounting model, further comprising;
  
  identifying a discount parameter indicative of a reduction of a word count of words in a cluster;
  
  determining if the cluster is to be pruned or retained in the corpus;
  
  subtracting the discount parameter from a maximum count of the observed word based count of the cluster to compute a discounted count;
  
  or defining the discount count of the cluster as zero if the cluster is pruned;
  
  applying the classing function to the learning set to generate the word classes, the word classes indicative of words statistically likely to be employed based on predetermined sequences of words in the learning set; and
  
  optimizing the classing function by selecting the word classes based on an objective of the production application, optimizing further including;
  
  analyzing word counts and class counts of the learning set; and
  
  analyzing word frequency within an assigned class;
  
  the objective of the production application defined by the identified language context.
- View Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15, 18, 19)
- - 8. The method of claim 7 wherein the clustering is at least 3.
  - 9. The method of claim 7 further comprising defining the learning set by including word clusters employed in the language context encountered by the production application, the language context including sequences of words likely to be employed by the production application for performing the statistical recognition.
  - 10. The method of claim 9 further comprising defining a language model responsive to cluster requests of a predetermined cluster size, the cluster request for requesting a likelihood of a next word from a sequence of words defining all but the final word in the cluster.
  - 11. The method of claim 7 further comprising identifying seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words, wherein the classing function identifies unseen clusters by:
    - employing the word based classification if the cluster has a previous occurrence, and employing the class based classification if the cluster is previously unseen.
  - 12. The method of claim 7 wherein the classing function further generates a set of entries including each word and a corresponding class identifier, each word in the class of words sharing a common class identifier.
  - 13. The method of claim 12 wherein the word classes further comprise a set of tuples, each tuple including a word and corresponding class, the classing function assigning classes by receiving a word previously unseen in a cluster;
    - identifying a class of words appearing in a similar context;
      
      reading a class identifier of the identified class; and
      
      assigning the class identifier to the word.
  - 14. The method of claim 13 further comprising reassigning a group of words to another class by:
    - identifying, based on the received word, a set of words occurring in a similar language context; and
      
      reassigning the received word and the set of words to another class by assigning a common class identifier.
  - 15. The method of claim 14 wherein optimizing further comprises receiving, in an iterative manner, clusters of words and reassigning class identifiers for increasing the likelihood of the language model predicting a cluster in the production application.
  - 18. The method of claim 7 wherein the classing function groups a plurality of classes of words independent from others of the plurality of classes of words, the classes including words sharing a similar likelihood of appearing in a production application context.
  - 19. The method of claim 7 further comprising classing words according to a hard classing approach such that each word appears in only one class.
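
The word and class identifier entries of claims 12 and 13, together with the hard classing of claim 19, can be pictured with a small sketch. The similarity measure used here (which class most often follows the same preceding word) is an assumption chosen for illustration; the claims do not specify one. The function name `assign_class` and the toy data are hypothetical.

```python
from collections import Counter

# Hypothetical hard classing: each word maps to exactly one class identifier.
word_to_class = {"mom": 1, "dad": 1, "call": 2, "text": 2, "now": 3}

# Hypothetical bigram counts observed in a learning set.
bigram_counts = Counter({
    ("call", "mom"): 3,
    ("call", "dad"): 1,
    ("call", "now"): 1,
    ("text", "mom"): 2,
})


def assign_class(new_word, prev_word, bigram_counts, word_to_class):
    """Assign new_word the class whose members most often follow prev_word."""
    votes = Counter()
    for (history, word), n in bigram_counts.items():
        if history == prev_word and word in word_to_class:
            votes[word_to_class[word]] += n
    if not votes:
        return None  # no contextual evidence; leave the word unassigned
    class_id = votes.most_common(1)[0][0]
    # Read the identified class identifier and record it for the new word.
    word_to_class[new_word] = class_id
    return class_id


# A word not yet in the map, received after "call".
new_class = assign_class("grandma", "call", bigram_counts, word_to_class)
```

Because the map holds a single identifier per word, reassigning a word (as in claims 14 and 15) amounts to overwriting its entry.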

16. A computer program product on a non-transitory computer readable storage medium having instructions for performing a method of defining a classing function employed for training a language model, the language model invoked for linguistic processing supporting a production application, the method comprising:
- identifying trigrams representative of a language context, the language context derived from word sequences likely to be received from the production application for predicting a word;
  
  defining a language model responsive to the production application for the linguistic processing, the language model employing class-based processing, the classes grouping words for receiving similar treatment as other words in the class;
  
  defining a classing function for partitioning the words into the classes, the classing function for optimizing the classes such that a trigram including a word in a particular class are significant predictors of trigrams including other words in the particular class, the classing function further comprising;
  
  employing a word based classing approach;
  
  backing off, if the word based approach indicates a null probability; and
  
  employing a class based approach;
  
  further comprising;
  
  determining seen and unseen clusters, the unseen clusters having a previously unoccurring sequence of words;
  
  employing the word based classification if the cluster has a previous occurrence, identifying a discount parameter, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
  
  backing off using the discount parameter and employing a class based approach if the cluster is unseen, unseen clusters based on occurrence of any of the words in the cluster, the unseen cluster having a nonzero probability if any word in the class of words has occurred, the discount parameter reducing a count of word occurrences of a particular cluster in favor of a class count of words of the cluster; and
  
  the discount parameter defining an absolute discounting model, further comprising;
  
  identifying a discount parameter indicative of a reduction of a word count of words in a cluster;
  
  determining if the cluster is to be pruned or retained in the corpus;
  
  subtracting the discount parameter from a maximum count of the observed word based count of the cluster to compute a discounted count;
  
  or defining the discount count of the cluster as zero if the cluster is pruned;
  
  the trigrams denoting a frequency of a sequence of words, the classing function optimizing the model toward the production application by identifying, for a particular trigram,
  
  a count of the first word in the trigram;
  
  a count of the second word in the trigram;
  
  a count of the words in the class of the first word of the trigram;
  
  a count of the words in the class of the second word of the trigram;
  
  a frequency of the first word relative to other words in the class of the first word; and
  
  a frequency of the second word relative to the other words in the class of the second word;
  
  applying the defined classing function to a corpus of words for generating a model; and
  
  employing the model for identifying, for a first and second word of a trigram received from the production application, a likelihood of each of a set of candidate third words for the trigram.
- View Dependent Claims (17)
- - 17. The method of claim 16 further comprising identifying seen and unseen clusters of trigrams, the unseen clusters having a previously unoccurring sequence of words, wherein the classing function identifies unseen clusters by:
    - employing the word based classification if the cluster has a previous occurrence, and employing the class based classification if the cluster is previously unseen.
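
The six per-trigram quantities listed in claim 16 (word counts, class counts, and within-class frequencies for the first and second words) can be computed as in the sketch below. It is only an illustration: the function name `trigram_features`, the feature keys, and the toy corpus are hypothetical, and the claim does not say how the quantities are combined.

```python
from collections import Counter

# Hypothetical toy data: a tiny corpus and a hard word-to-class map.
corpus = "call mom call dad text mom".split()
word_to_class = {"call": "VERB", "text": "VERB", "mom": "PERSON", "dad": "PERSON"}


def trigram_features(trigram, corpus, word_to_class):
    """Collect word counts, class counts, and within-class frequencies for w1 and w2."""
    w1, w2, _ = trigram
    word_counts = Counter(corpus)
    class_counts = Counter(word_to_class[w] for w in corpus)
    c1, c2 = word_to_class[w1], word_to_class[w2]

    def relative_freq(word, cls):
        # Frequency of the word relative to all words in its class.
        return word_counts[word] / class_counts[cls] if class_counts[cls] else 0.0

    return {
        "count_w1": word_counts[w1],
        "count_w2": word_counts[w2],
        "class_count_w1": class_counts[c1],
        "class_count_w2": class_counts[c2],
        "freq_w1_in_class": relative_freq(w1, c1),
        "freq_w2_in_class": relative_freq(w2, c2),
    }


features = trigram_features(("call", "mom", "text"), corpus, word_to_class)
```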

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Chu, Stephen M., Vozila, Paul, Bisani, Maximilian, Su, Yi, Chen, Stanley F., Sarikaya, Ruhi, Ramabhadran, Bhuvana
Primary Examiner(s)
Hudspeth, David
Assistant Examiner(s)
Nguyen, Timothy

Application Number
US13/190,891
Time in Patent Office
1,785 Days
Field of Search
704/9
US Class Current
1/1
CPC Class Codes
G06F 16/3334   Selection or weighting of t...
G06F 16/3344   using natural language anal...
G06F 16/3346   using probabilistic model
G06F 16/35   Clustering; Classification
G06F 16/353   into predefined classes
G06F 40/117   Tagging; Marking up details...
G06F 40/289   Phrasal analysis, e.g. fini...
G06F 40/30   Semantic analysis
G06Q 10/107   Computer-aided management o...
