Speech recognition method and system using triphones, diphones, and phonemes

US 5,502,790 A
Filed: 12/21/1992
Issued: 03/26/1996
Est. Priority Date: 12/24/1991
Status: Expired due to Term

Abstract

A speech recognition system starts by training hidden Markov models for all triphones, diphones, and phonemes occurring in a small training vocabulary. Hidden Markov models of a target vocabulary are created by concatenating the triphone, diphone, and phoneme models, using triphone models if available, diphone HMMs when triphone models are not available, and phoneme models when neither triphone nor diphone models are available. Utterances from the target vocabulary are recognized by choosing a model with maximum probability of reproducing quantized utterance features.
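
The recognition step at the end of the abstract can be sketched as follows. This is a minimal illustration, assuming discrete-output HMMs stored as plain Python lists and scored with the standard forward algorithm; the names `forward_log_prob`, `recognize`, and the toy models are hypothetical and do not come from the patent.

```python
import math

def forward_log_prob(labels, start, trans, emit):
    """Log-probability that a discrete HMM produces `labels` (forward algorithm).

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][k]  : probability that state i outputs label k
    """
    if not labels:
        raise ValueError("label sequence must not be empty")
    n = len(start)
    alpha = [start[i] * emit[i][labels[0]] for i in range(n)]
    log_scale = 0.0
    for label in labels[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][label]
            for j in range(n)
        ]
        total = sum(alpha)
        if total == 0.0:
            return -math.inf
        # Rescale at every step so long sequences do not underflow.
        alpha = [a / total for a in alpha]
        log_scale += math.log(total)
    total = sum(alpha)
    return log_scale + math.log(total) if total > 0.0 else -math.inf

def recognize(labels, dictionary):
    """Return the name of the HMM with maximum probability of producing `labels`."""
    return max(dictionary, key=lambda name: forward_log_prob(labels, *dictionary[name]))

# Two toy left-to-right models over the label set {0, 1} (illustrative values only).
TRANS = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "a": ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.2, 0.8]]),
    "b": ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.8, 0.2]]),
}
```

Calling `recognize([0, 0, 1, 1], MODELS)` scores the label sequence against every model in the dictionary and returns the best-scoring name, which mirrors selecting the HMM giving maximum probability.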

294 Citations

22 Claims

1. A speech recognition method for recognizing a target vocabulary of words, phrases, or sentences, comprising the steps of:
- (a) selecting a training vocabulary;
  
  (b) listing in a table (8) all triphones, diphones, and phonemes occurring in said training vocabulary;
  
  (c) obtaining spoken samples of said training vocabulary;
  
  (d) reducing said spoken samples to training data comprising sequences of labels;
  
  (e) identifying, in said training data, segments corresponding to the triphones, diphones, and phonemes in said table (8);
  
  (f) using the labels obtained in step (d) and segments identified in step (e) to construct a triphone HMM for each triphone in said table (8), a diphone HMM for each diphone in said table (8), and a phoneme HMM for each phoneme in said table (8);
  
  (g) storing each triphone HMM, diphone HMM, and phoneme HMM constructed in step (f) in a first dictionary (9) consisting of the HMMs thus stored;
  
  (h) creating HMMs for the target vocabulary by concatenating HMMs from said first dictionary (9), using triphone HMMs if available in said first dictionary (9), using diphone HMMs when triphone HMMs are not available, and using phoneme HMMs when neither triphone nor diphone HMMs are available;
  
  (i) storing the HMMs created in step (h) in a second dictionary (10); and
  
  (j) recognizing an utterance by reducing the utterance to a sequence of labels, computing probabilities of producing said sequence of labels from each HMM in said second dictionary (10), and selecting an HMM giving maximum probability.
- - 2. The method of claim 1, wherein the training vocabulary selected in step (a) includes examples of all phonemes occurring in the target vocabulary.
  - 3. The method of claim 1, wherein the step (d) of reducing the spoken samples comprises further steps of:
    - calculating feature vectors of each spoken sample; and
      
      mapping the calculated feature vectors onto a discrete set of labels.
  - 4. The method of claim 3, wherein the feature vectors comprise linear prediction coefficients.
  - 5. The method of claim 1, wherein the step (f) of constructing triphone, diphone, and phoneme HMMs is carried out using a forward-backward algorithm.
  - 6. The method of claim 1, wherein the step (h) of constructing HMMs for the target vocabulary comprises further steps of:
    - (h1) representing an item in the target vocabulary as a sequence of phonemes, starting from a first phoneme;
      
      (h2) initializing a target HMM to a state representing no utterance;
      
      (h3) setting a pointer position at said first phoneme;
      
      (h4) searching said first dictionary (9) for a triphone HMM corresponding to three phonemes starting at said pointer position;
      
      (h5) if a triphone HMM is found in step (h4), concatenating that triphone HMM to said target HMM and advancing said pointer position by three phoneme positions in said sequence of phonemes;
      
      (h6) if no triphone HMM is found in step (h4), searching said first dictionary (9) for a diphone HMM corresponding to two phonemes starting at said pointer position;
      
      (h7) if a diphone HMM is found in step (h6), concatenating that diphone HMM to said target HMM and advancing said pointer position by two phoneme positions in said sequence of phonemes;
      
      (h8) if no diphone HMM is found in step (h6), searching said first dictionary (9) for a phoneme HMM corresponding to one phoneme at said pointer position;
      
      (h9) if a phoneme HMM is found in step (h8), concatenating that phoneme HMM to said target HMM and advancing said pointer position by one phoneme position in said sequence of phonemes;
      
      (h10) if no phoneme HMM is found in step (h8), issuing an error message; and
      
      (h11) repeating steps (h4) to (h10) until said sequence of phonemes is exhausted.
  - 7. The method of claim 1, wherein said triphone HMMs, said diphone HMMs, and said phoneme HMMs are left-to-right HMMs.
  - 8. The method of claim 1, wherein said triphone HMMs have more states than said diphone HMMs, and said diphone HMMs have more states than said phoneme HMMs.
  - 9. The method of claim 8, wherein said triphone HMMs have three states, said diphone HMMs have two states, and said phoneme HMMs have one state.
  - 10. The method of claim 1, wherein said triphone HMMs, said diphone HMMs, and said phoneme HMMs are concatenated without overlap in step (h).
  - 11. The method of claim 1, wherein pairs of said triphone HMMs are concatenated with overlap, if possible, in step (h).
  - 12. The method of claim 11 wherein, when a first triphone HMM ending in a last state is concatenated with overlap with a second triphone HMM beginning with a first state, said last state and said first state are combined to form a new middle state.
  - 13. The method of claim 12, wherein transition probabilities of said middle state are computed by averaging transition probabilities of said first state and said last state.
  - 14. The method of claim 12, wherein output probabilities of said middle state are computed by averaging output probabilities of said first state and said last state.
  - 15. The method of claim 1, comprising the additional step of selecting three positive integers n₁, n₂, and n₃, wherein:
    - each phoneme HMM constructed in said step (f) has n₁ states;
      
      each diphone HMM constructed in said step (f) has n₂ states; and
      
      each triphone HMM constructed in said step (f) has n₃ states.
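
Steps (b) and (h3) to (h11) above can be sketched as follows. This is a minimal illustration, assuming phoneme sequences are given as tuples of strings and that the table is a plain set of phoneme tuples; the function names and the toy vocabulary are hypothetical. It returns the chosen units rather than concatenated HMMs, and it covers only the non-overlapping case of claim 10.

```python
def build_unit_table(training_items):
    """Step (b): list every triphone, diphone, and phoneme in the training vocabulary."""
    table = set()
    for phonemes in training_items:
        for size in (3, 2, 1):
            for i in range(len(phonemes) - size + 1):
                table.add(tuple(phonemes[i:i + size]))
    return table

def segment(phonemes, table):
    """Claim 6: greedy left-to-right split, trying triphone, then diphone, then phoneme."""
    units = []
    pos = 0  # (h3) pointer starts at the first phoneme
    while pos < len(phonemes):  # (h11) repeat until the sequence is exhausted
        for size in (3, 2, 1):  # (h4), (h6), (h8)
            unit = tuple(phonemes[pos:pos + size])
            if len(unit) == size and unit in table:
                units.append(unit)  # (h5), (h7), (h9): take the unit ...
                pos += size         # ... and advance by its length
                break
        else:
            # (h10) not even a single-phoneme model exists
            raise KeyError(f"no HMM for phoneme {phonemes[pos]!r}")
    return units

# Toy training vocabulary (illustrative only).
TABLE = build_unit_table([("k", "a", "t"), ("s", "o")])
```

With this table, `segment(("k", "a", "t", "s", "o"), TABLE)` picks the triphone `k-a-t` followed by the diphone `s-o`, while a sequence such as `("t", "a", "k")` falls back to three single phonemes because no longer unit matches.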

16. A speech recognition system for recognizing words, phrases, or sentences in a target vocabulary, comprising:
- a speech analyzer (1) for analyzing spoken utterances and producing feature vectors;
  
  a vector quantizer (2) for mapping said feature vectors onto a discrete set of labels;
  
  a text processor (3) for receiving training sequences of phoneme symbols, creating a table (8) of triphones, diphones, and phonemes occurring in said training sequences, receiving target sequences of phoneme symbols occurring in said target vocabulary, and dividing said target sequences into triphones, diphones, and phonemes occurring in said table (8), selecting triphones in preference to diphones, triphones in preference to phonemes, and diphones in preference to phonemes;
  
  an HMM trainer (4) for using labels output by said vector quantizer (2) to construct a first dictionary (9) comprising HMMs of the triphones, diphones, and phonemes in said table (8), and concatenating HMMs selected from said first dictionary (9) to construct a second dictionary (10) of HMMs of items in the target vocabulary;
  
  an HMM recognizer (5) for calculating probabilities that HMMs in said second dictionary (10) would produce a sequence of labels output by said vector quantizer (2), and selecting an HMM giving a maximum probability;
  
  a memory (6) for storing said table (8), said first dictionary (9) and said second dictionary (10); and
  
  a central control unit (7) coupled to control said speech analyzer (1), said vector quantizer (2), said text processor (3), said HMM trainer (4), said HMM recognizer (5), and said memory (6).
- - 17. The system of claim 16 wherein, as far as possible, said text processor (3) divides said target sequences into overlapping triphones.
  - 18. The system of claim 17 wherein, at least in cases of overlapping triphones, said HMM trainer (4) concatenates HMMs by combining a last state of one HMM with a first state of another HMM to create a new middle state.
  - 19. The system of claim 18, wherein said HMM trainer (4) computes transition probabilities of said new middle state by averaging transition probabilities of said last state and said first state.
  - 20. The system of claim 18, wherein said HMM trainer (4) computes output probabilities of said new middle state by averaging output probabilities of said last state and said first state.
  - 21. The system of claim 18 wherein, among the HMMs stored in said first dictionary (9), HMMs of triphones have more states than HMMs of diphones, and HMMs of diphones are longer than HMMs of phonemes.
  - 22. The system of claim 18, wherein lengths of the HMMs stored in said first dictionary (9) are selectable by a user of the system.
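
The overlapped concatenation of claims 12 to 14 and 18 to 20 can be sketched as follows. This is a minimal illustration, assuming a left-to-right HMM is stored as a list of states, each with a self-loop probability, an advance probability, and a discrete output distribution; the `State` class and function names are hypothetical, and the claims do not specify this representation.

```python
from dataclasses import dataclass

@dataclass
class State:
    stay: float      # probability of remaining in this state
    advance: float   # probability of moving to the next state
    output: list     # output[k] = probability of emitting label k

def merge_states(last, first):
    """Combine two states into a new middle state by averaging their probabilities."""
    return State(
        stay=(last.stay + first.stay) / 2,
        advance=(last.advance + first.advance) / 2,
        output=[(a + b) / 2 for a, b in zip(last.output, first.output)],
    )

def concatenate_with_overlap(hmm_a, hmm_b):
    """Join two HMMs so the last state of hmm_a and the first state of hmm_b overlap."""
    middle = merge_states(hmm_a[-1], hmm_b[0])
    return hmm_a[:-1] + [middle] + hmm_b[1:]

# Toy two-state models over two labels (illustrative values only).
HMM_A = [State(0.6, 0.4, [0.5, 0.5]), State(0.8, 0.2, [1.0, 0.0])]
HMM_B = [State(0.4, 0.6, [0.0, 1.0]), State(0.5, 0.5, [0.5, 0.5])]
JOINED = concatenate_with_overlap(HMM_A, HMM_B)
```

Because each input row sums to one, the averaged transition and output probabilities of the middle state also sum to one, so the joined model remains a valid HMM with one state fewer than plain concatenation would give.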

Current Assignee: OKI Electric Industry Company Limited
Original Assignee: OKI Electric Industry Company Limited
Inventors: Yi, Jie
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): ONKA, THOMAS
Application Number: US07/993,395
Time in Patent Office: 1,191 Days
Field of Search: 395/2.65, 395/2.63, 395/2.64, 395/2.52, 381/41, 381/43, 381/36
US Class Current: 704/256
CPC Class Codes:
G10L 15/144 Training of HMMs
G10L 2015/022 Demisyllables, biphones or ...
