Speech recognition method and system using triphones, diphones, and phonemes
First Claim
1. A speech recognition method for recognizing a target vocabulary of words, phrases, or sentences, comprising the steps of:
(a) selecting a training vocabulary;
(b) listing in a table (8) all triphones, diphones, and phonemes occurring in said training vocabulary;
(c) obtaining spoken samples of said training vocabulary;
(d) reducing said spoken samples to training data comprising sequences of labels;
(e) identifying, in said training data, segments corresponding to the triphones, diphones, and phonemes in said table (8);
(f) using the labels obtained in step (d) and segments identified in step (e) to construct a triphone HMM for each triphone in said table (8), a diphone HMM for each diphone in said table (8), and a phoneme HMM for each phoneme in said table (8);
(g) storing each triphone HMM, diphone HMM, and phoneme HMM constructed in step (f) in a first dictionary (9) consisting of the HMMs thus stored;
(h) creating HMMs for the target vocabulary by concatenating HMMs from said first dictionary (9), using triphone HMMs if available in said first dictionary (9), using diphone HMMs when triphone HMMs are not available, and using phoneme HMMs when neither triphone nor diphone HMMs are available;
(i) storing the HMMs created in step (h) in a second dictionary (10); and
(j) recognizing an utterance by reducing the utterance to a sequence of labels, computing probabilities of producing said sequence of labels from each HMM in said second dictionary (10), and selecting an HMM giving maximum probability.
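The table of step (b) and the fallback order of step (h) can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `build_unit_table` and `decompose`, the representation of units as tuples of phoneme symbols, and the greedy left-to-right, non-overlapping split are all assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical representation: a unit is a tuple of 1, 2, or 3 phoneme symbols.
Unit = Tuple[str, ...]


def build_unit_table(training_items: Sequence[Sequence[str]]) -> Set[Unit]:
    """Step (b): list every phoneme, diphone, and triphone in the training vocabulary."""
    table: Set[Unit] = set()
    for phonemes in training_items:
        for n in (1, 2, 3):
            for i in range(len(phonemes) - n + 1):
                table.add(tuple(phonemes[i:i + n]))
    return table


def decompose(target: Sequence[str], table: Set[Unit]) -> List[Unit]:
    """Step (h), assumed greedy: prefer a triphone, then a diphone, then a phoneme."""
    units: List[Unit] = []
    i = 0
    while i < len(target):
        for n in (3, 2, 1):
            candidate = tuple(target[i:i + n])
            # Near the end of the item a slice can be shorter than n; skip it.
            if len(candidate) == n and candidate in table:
                units.append(candidate)
                i += n
                break
        else:
            # Not even the single phoneme was seen in training.
            raise KeyError(f"phoneme {target[i]!r} is not in the table")
    return units


# Toy training vocabulary of two items, written as phoneme sequences.
table = build_unit_table([["k", "a", "t"], ["d", "o", "g"]])
print(decompose(["k", "a", "t", "o", "g"], table))  # triphone, then diphone
print(decompose(["d", "o", "t"], table))            # diphone, then phoneme
```

In a full system each unit returned by `decompose` would select the corresponding HMM from the first dictionary (9), and those HMMs would be concatenated into the target-vocabulary HMM stored in the second dictionary (10).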
Abstract
A speech recognition system starts by training hidden Markov models for all triphones, diphones, and phonemes occurring in a small training vocabulary. Hidden Markov models of a target vocabulary are created by concatenating the triphone, diphone, and phoneme models, using triphone models if available, diphone models when triphone models are not available, and phoneme models when neither triphone nor diphone models are available. Utterances from the target vocabulary are recognized by choosing a model with maximum probability of reproducing quantized utterance features.
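The final recognition step, choosing the model most likely to have produced the quantized features, can be sketched as follows. This is an illustrative sketch only: it assumes discrete-output HMMs scored with the standard forward algorithm, which the text above does not specify, and the names `DiscreteHMM`, `forward_probability`, and `recognize` are hypothetical.

```python
from typing import Dict, List, Sequence


class DiscreteHMM:
    """Hypothetical container for a discrete-output HMM."""

    def __init__(self, initial: List[float],
                 transitions: List[List[float]],
                 emissions: List[List[float]]) -> None:
        self.initial = initial          # initial[s]: probability of starting in state s
        self.transitions = transitions  # transitions[r][s]: probability of moving r -> s
        self.emissions = emissions      # emissions[s][label]: probability of label in state s


def forward_probability(hmm: DiscreteHMM, labels: Sequence[int]) -> float:
    """Probability that the HMM produces the label sequence (forward algorithm)."""
    if not labels:
        return 0.0
    n = len(hmm.initial)
    alpha = [hmm.initial[s] * hmm.emissions[s][labels[0]] for s in range(n)]
    for label in labels[1:]:
        alpha = [
            sum(alpha[r] * hmm.transitions[r][s] for r in range(n))
            * hmm.emissions[s][label]
            for s in range(n)
        ]
    return sum(alpha)


def recognize(labels: Sequence[int], dictionary: Dict[str, DiscreteHMM]) -> str:
    """Return the name of the dictionary entry whose HMM gives the maximum probability."""
    return max(dictionary, key=lambda name: forward_probability(dictionary[name], labels))


# Two toy left-to-right models over the label set {0, 1}.
moves = [[0.5, 0.5], [0.0, 1.0]]
second_dictionary = {
    "A": DiscreteHMM([1.0, 0.0], moves, [[0.9, 0.1], [0.1, 0.9]]),
    "B": DiscreteHMM([1.0, 0.0], moves, [[0.1, 0.9], [0.9, 0.1]]),
}
print(recognize([0, 0, 1, 1], second_dictionary))  # "A" fits labels that go 0 then 1
```

For long utterances the plain products above underflow, so a practical implementation would work with log probabilities or rescale `alpha` at each step.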
294 Citations
22 Claims
16. A speech recognition system for recognizing words, phrases, or sentences in a target vocabulary, comprising:
a speech analyzer (1) for analyzing spoken utterances and producing feature vectors;
a vector quantizer (2) for mapping said feature vectors onto a discrete set of labels;
a text processor (3) for receiving training sequences of phoneme symbols, creating a table (8) of triphones, diphones, and phonemes occurring in said training sequences, receiving target sequences of phoneme symbols occurring in said target vocabulary, and dividing said target sequences into triphones, diphones, and phonemes occurring in said table (8), selecting triphones in preference to diphones, triphones in preference to phonemes, and diphones in preference to phonemes;
an HMM trainer (4) for using labels output by said vector quantizer (2) to construct a first dictionary (9) comprising HMMs of the triphones, diphones, and phonemes in said table (8), and concatenating HMMs selected from said first dictionary (9) to construct a second dictionary (10) of HMMs of items in the target vocabulary;
an HMM recognizer (5) for calculating probabilities that HMMs in said second dictionary (10) would produce a sequence of labels output by said vector quantizer (2), and selecting an HMM giving a maximum probability;
a memory (6) for storing said table (8), said first dictionary (9) and said second dictionary (10); and
a central control unit (7) coupled to control said speech analyzer (1), said vector quantizer (2), said text processor (3), said HMM trainer (4), said HMM system, and said memory (6).
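The role of the vector quantizer (2), mapping each feature vector onto one of a discrete set of labels, can be illustrated with a small sketch. It assumes a nearest-codeword rule under squared Euclidean distance and a codebook that has already been trained; the claim does not state either detail, and the names `squared_distance` and `quantize` are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index (label) of its nearest codeword."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(vec, codebook[k]))
        for vec in features
    ]


# Toy two-entry codebook and three two-dimensional feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
print(quantize(features, codebook))  # one label per feature vector
```

The resulting label sequence is what the HMM trainer (4) and the HMM recognizer (5) would consume in place of the raw feature vectors.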
Specification