Method and apparatus for a parameter sharing speech recognition system

US 6,006,186 A
Filed: 10/16/1997
Issued: 12/21/1999
Est. Priority Date: 10/16/1997
Status: Expired due to Fees

Abstract

A method and an apparatus for a parameter sharing speech recognition system are provided. Speech signals are received into a processor of a speech recognition system. The speech signals are processed using a speech recognition system hosting a shared hidden Markov model (HMM) produced by generating a number of phoneme models, some of which are shared. The phoneme models are generated by retaining as a separate phoneme model any triphone model having a number of trained frames available that exceeds a prespecified threshold. A shared phoneme model is generated to represent each of the groups of triphone phoneme models for which the number of trained frames having a common biphone exceed the prespecified threshold. A shared phoneme model is generated to represent each of the groups of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceed the prespecified threshold. A shared phoneme model is generated to represent each of the groups of triphone phoneme models having the same center context. The generated phoneme models are trained, and shared phoneme model states are generated that are shared among the phoneme models. Shared probability distribution functions are generated that are shared among the phoneme model states. Shared probability sub-distribution functions are generated that are shared among the phoneme model probability distribution functions. The shared phoneme model hierarchy is reevaluated for further sharing in response to the shared probability sub-distribution functions. Signals representative of the received speech signals are generated.
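The model-generation steps in the abstract can be read as a back-off cascade: keep a triphone model when it has enough training frames, otherwise pool it with progressively broader groups. The following is a minimal sketch of that reading, not the patented implementation. The function name `assign_models`, the use of the left biphone as the "common biphone", the `context_class` table of equivalent contexts, and the example frame counts are all assumptions made for illustration.

```python
from collections import defaultdict

def assign_models(frame_counts, threshold, context_class):
    """Map each triphone (left, center, right) to a model label (hypothetical scheme)."""
    assignment = {}
    remaining = set(frame_counts)

    def share(key_fn, label):
        # Group the still-unassigned triphones by key; a group whose pooled
        # frame count exceeds the threshold gets one shared model.
        groups = defaultdict(list)
        for tri in remaining:
            groups[key_fn(tri)].append(tri)
        for key, members in groups.items():
            if sum(frame_counts[t] for t in members) > threshold:
                for t in members:
                    assignment[t] = (label,) + key
                    remaining.discard(t)

    # Step 1: a triphone with enough frames keeps its own model.
    share(lambda t: t, "triphone")
    # Step 2: pool triphones with a common biphone (assumed: left + center).
    share(lambda t: (t[0], t[1]), "biphone")
    # Step 3: pool triphones whose contexts fall in the same assumed class.
    share(lambda t: (context_class.get(t[0], t[0]), t[1],
                     context_class.get(t[2], t[2])), "class")
    # Step 4: everything left shares one model per center phoneme.
    for t in remaining:
        assignment[t] = ("center", t[1])
    return assignment

# Illustrative data only: frame counts and broad classes are made up.
context_class = {"p": "stop", "b": "stop", "t": "stop", "d": "stop",
                 "k": "stop", "f": "fric", "s": "fric", "n": "nasal"}
frame_counts = {
    ("k", "ae", "t"): 500,
    ("b", "ae", "t"): 60,
    ("b", "ae", "d"): 70,
    ("p", "ae", "n"): 40,
    ("t", "ae", "n"): 70,
    ("f", "ae", "n"): 50,
    ("s", "ih", "t"): 10,
}
assignment = assign_models(frame_counts, threshold=100, context_class=context_class)
```

In this example the well-trained triphone keeps its own model, the two `b-ae` triphones pool into a biphone model, the two stop/nasal contexts pool into a class model, and the rest fall back to a model keyed on the center phoneme alone.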

60 Citations

45 Claims

1. A method for recognizing speech comprising the steps of:
- receiving speech signals into a processor;
  
  processing the received speech signals using a speech recognition system produced by generating a plurality of phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes, and at least a first one of the plurality of phoneme models are shared with at least a second one of the plurality of phoneme models; and
  
  generating signals representative of the received speech signals.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15):
  - 2. The method of claim 1, wherein the speech recognition system is produced by:
    - training the plurality of phoneme models;
      
      generating a plurality of shared probability sub-distribution functions from the trained plurality of phoneme models; and
      
      evaluating the plurality of phoneme models for further sharing in response to the plurality of shared probability sub-distribution functions.
  - 3. The method of claim 2, wherein the plurality of shared probability sub-distribution functions are generated by:
    - generating a plurality of phoneme model states, wherein at least one of the plurality of states are shared among the plurality of phoneme models;
      
      generating a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of states; and
      
      generating a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability distribution functions.
  - 4. The method of claim 3, wherein the plurality of shared probability sub-distribution functions are generated by:
    - generating a plurality of phoneme model states, wherein at least one of the plurality of phoneme model states are shared among the plurality of phoneme model states;
      
      generating a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of probability distribution functions; and
      
      generating a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability sub-distribution functions.
  - 5. The method of claim 2, wherein the plurality of phoneme models for further sharing are evaluated by:
    - generating a shared phoneme model probability distribution function to replace a plurality of probability distribution functions when each of the plurality of probability distribution functions has common probability sub-distribution functions;
      
      generating a shared phoneme model state to replace a plurality of states when each of the plurality of states has common phoneme model probability distribution functions; and
      
      generating a shared phoneme model to replace a plurality of models when each of the plurality of models has common phoneme model states.
  - 6. The method of claim 2, wherein the plurality of shared probability distribution functions for a discrete hidden Markov model are generated from a continuous distribution function of a continuous hidden Markov model.
  - 7. The method of claim 1, wherein the phoneme models are context dependent.
  - 8. The method of claim 1, wherein the speech recognition system is based on a statistical learning approach.
  - 9. The method of claim 8, wherein the statistical learning approach is a hidden Markov model.
  - 10. The method of claim 1, wherein a plurality of phoneme models are generated by:
    - retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.
  - 11. The method of claim 10, wherein at least one shared phoneme model is generated comprising at least one context, the at least one context having statistical properties representative of a plurality of context phonemes.
  - 12. The method of claim 1, wherein sharing occurs among a plurality of levels of a speech recognition model.
  - 13. The method of claim 1, wherein sharing occurs within at least one level of a speech recognition model.
  - 14. The method of claim 1, wherein the step of evaluating the plurality of phoneme models for further sharing is repeated at least one time.
  - 15. The method of claim 1, wherein the plurality of phoneme models integrate discrete observation modeling and continuous observation modeling.
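The evaluation for further sharing recited in claim 5 can be pictured as a bottom-up pass over the hierarchy of claim 3: distribution functions built from the same sub-distribution functions collapse into one, then states that point at the same distribution function collapse, then models with the same state sequence collapse. The sketch below is one possible reading under stated assumptions. It treats "common" as "identical", ignores mixture weights, and the names `merge_identical` and `reevaluate` and all identifiers in the example data are hypothetical.

```python
def merge_identical(items):
    """Collapse entries with equal contents; return (merged items, id remap)."""
    canonical = {}  # contents -> id of the first entry seen with those contents
    remap = {}      # every original id -> surviving id
    for item_id, contents in items.items():
        remap[item_id] = canonical.setdefault(contents, item_id)
    merged = {cid: contents for contents, cid in canonical.items()}
    return merged, remap

def reevaluate(pdfs, states, models):
    """One bottom-up sharing pass.

    pdfs:   pdf_id   -> frozenset of sub-PDF ids
    states: state_id -> pdf_id
    models: model_id -> tuple of state ids
    """
    # PDFs with the same sub-PDFs are replaced by one shared PDF.
    pdfs, pdf_map = merge_identical(pdfs)
    # Re-point states at surviving PDFs, then share states with the same PDF.
    states = {s: pdf_map[p] for s, p in states.items()}
    states, state_map = merge_identical(states)
    # Re-point models at surviving states, then share models with the same states.
    models = {m: tuple(state_map[s] for s in seq) for m, seq in models.items()}
    models, model_map = merge_identical(models)
    return pdfs, states, models, model_map

# Illustrative data only.
pdfs = {
    "pdf1": frozenset({"g1", "g2"}),
    "pdf2": frozenset({"g1", "g2"}),
    "pdf3": frozenset({"g3"}),
}
states = {"s1": "pdf1", "s2": "pdf2", "s3": "pdf3"}
models = {
    "a-b+c": ("s1", "s3"),
    "d-b+c": ("s2", "s3"),
    "e-b+f": ("s3", "s3"),
}
new_pdfs, new_states, new_models, model_map = reevaluate(pdfs, states, models)
```

Here `pdf2` folds into `pdf1`, which makes `s2` identical to `s1`, which in turn makes model `d-b+c` identical to `a-b+c`. Sharing found at the lowest level therefore propagates upward, which is why the pass runs from sub-distribution functions toward models.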

16. An apparatus for speech recognition comprising:
- an input for receiving speech signals into a processor;
  
  a processor configured to recognize the received speech signals using a speech recognition system to generate a signal representative of the received speech signal, the speech recognition system produced by generating and training a plurality of phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes, and at least a first one of the plurality of phoneme models are shared with at least a second one of the plurality of phoneme models; and
  
  an output for providing a signal representative of the received speech signal.
- Dependent claims (17, 18, 19, 20):
  - 17. The apparatus of claim 16, wherein the speech recognition system is produced by:
    - generating a plurality of phoneme model states, wherein at least one of the plurality of states are shared among the plurality of phoneme models;
      
      generating a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of states;
      
      generating a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability distribution functions; and
      
      evaluating the plurality of phoneme models for further sharing in response to the plurality of shared probability sub-distribution functions.
  - 18. The apparatus of claim 17, wherein the plurality of phoneme models are evaluated for further sharing by:
    - generating a shared phoneme model probability distribution function to replace a plurality of probability distribution functions when each of the plurality of probability distribution functions has common probability sub-distribution functions;
      
      generating a shared phoneme model state to replace a plurality of states when each of the plurality of states has common phoneme model probability distribution functions; and
      
      generating a shared phoneme model to replace a plurality of models when each of the plurality of models has common phoneme model states.
  - 19. The apparatus of claim 16, wherein sharing occurs among a plurality of levels of a speech recognition model, and wherein sharing occurs within at least one level of a speech recognition model.
  - 20. The apparatus of claim 16, wherein the plurality of phoneme models are generated by:
    - retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.

21. A speech recognition process comprising a statistical learning technique that uses a model, the model produced by:
- generating and training a plurality of phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes;
  
  generating a plurality of shared probability sub-distribution functions from the trained plurality of phoneme models; and
  
  evaluating the plurality of phoneme models for further sharing in response to the plurality of shared probability sub-distribution functions.
- Dependent claims (22, 23, 24, 25, 26):
  - 22. The speech recognition process of claim 21, wherein sharing occurs among a plurality of levels of a speech recognition model, and wherein sharing occurs within at least one level of a speech recognition model.
  - 23. The speech recognition process of claim 21, wherein the plurality of phoneme models are context dependent hidden Markov models, wherein the plurality of phoneme models integrate discrete observation modeling and continuous observation modeling.
  - 24. The speech recognition process of claim 21, wherein the plurality of phoneme models are generated by:
    - retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.
  - 25. The speech recognition process of claim 21, wherein the plurality of shared probability sub-distribution functions are generated by:
    - generating a plurality of phoneme model states, wherein at least one of the plurality of states are shared among the plurality of phoneme models;
      
      generating a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of states; and
      
      generating a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability distribution functions.
  - 26. The speech recognition process of claim 21, wherein the plurality of phoneme models are evaluated for further sharing by:
    - generating a shared phoneme model probability distribution function to replace a plurality of probability distribution functions when each of the plurality of probability distribution functions has common probability sub-distribution functions;
      
      generating a shared phoneme model state to replace a plurality of states when each of the plurality of states has common phoneme model probability distribution functions; and
      
      generating a shared phoneme model to replace a plurality of models when each of the plurality of models has common phoneme model states.

27. A method for generating a plurality of phoneme models for use in a speech recognition system, the method comprising the steps of:
- retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
  
  generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
  
  generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
  
  generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.
- Dependent claims (28):
  - 28. The method of claim 27, wherein the phoneme models are hidden Markov models.

29. A computer readable medium containing executable instructions which, when executed in a processing system, causes the system to perform the steps for recognizing speech comprising:
- receiving speech signals into a processor;
  
  processing the received speech signals using a speech recognition system comprising a plurality of phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes and at least a first one of the plurality of phoneme models are shared with at least a second one of the plurality of phoneme models; and
  
  generating signals representative of the received speech signals.
- Dependent claims (30, 31, 32):
  - 30. The computer readable medium of claim 29, wherein the speech recognition system is produced by:
    - generating a plurality of phoneme model states, wherein at least one of the plurality of states are shared among the plurality of phoneme models;
      
      generating a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of states;
      
      generating a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability distribution functions; and
      
      evaluating the plurality of phoneme models for further sharing in response to the plurality of shared probability sub-distribution functions.
  - 31. The computer readable medium of claim 29, wherein sharing occurs among a plurality of levels of a speech recognition model, and wherein sharing occurs within at least one level of a speech recognition model.
  - 32. The computer readable medium of claim 29, wherein the plurality of phoneme models are generated by:
    - retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.

33. A method for recognizing speech comprising the steps of:
- receiving speech signals into a processor;
  
  processing the received speech signals using a model comprising a plurality of phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes, and at least a first one of the plurality of phoneme models are shared with at least a second one of the plurality of phoneme models; and
  
  generating signals representative of the received speech signals.
- Dependent claims (34, 35, 36):
  - 34. The method of claim 33, wherein the model further comprises:
    - a plurality of phoneme model states, wherein at least one of the plurality of states are shared among the plurality of phoneme models;
      
      a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of states; and
      
      a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability distribution functions.
  - 35. The method of claim 33, wherein the plurality of phoneme models are generated by:
    - retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.
  - 36. The method of claim 33, wherein a plurality of shared phoneme models are evaluated for further sharing by:
    - generating a shared phoneme model probability distribution function to replace a plurality of probability distribution functions when each of the plurality of probability distribution functions has common probability sub-distribution functions;
      
      generating a shared phoneme model state to replace a plurality of states when each of the plurality of states has common phoneme model probability distribution functions; and
      
      generating a shared phoneme model to replace a plurality of models when each of the plurality of models has common phoneme model states.

37. An apparatus for speech recognition comprising:
- an input configured to receive speech signals into a processor;
  
  a processor configured to process the received speech signals using a model comprising a plurality of phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes, and at least a first one of the plurality of phoneme models are shared with at least a second one of the plurality of phoneme models; and
  
  an output configured to provide a signal representative of the received speech signal.
- Dependent claims (38, 39):
  - 38. The apparatus of claim 37, wherein the model further comprises:
    - a plurality of phoneme model states, wherein at least one of the plurality of states are shared among the plurality of phoneme models;
      
      a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of states; and
      
      a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability distribution functions.
  - 39. The apparatus of claim 37, wherein the plurality of phoneme models are generated by:
    - retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.

40. A computer readable medium containing executable instructions which, when executed in a processing system, causes the system to perform the steps for recognizing speech comprising:
- receiving speech signals into a processor;
  
  processing the received speech signals using a model comprising a plurality of context dependent phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes, and at least a first one of the plurality of phoneme models are shared with at least a second one of the plurality of phoneme models; and
  
  providing output signals representative of the received speech signals.
- Dependent claims (41, 42, 43):
  - 41. The computer readable medium of claim 40, wherein the model further comprises:
    - a plurality of phoneme model states, wherein at least one of the plurality of states are shared among the plurality of phoneme models;
      
      a plurality of phoneme model probability distribution functions, wherein at least one of the plurality of probability distribution functions are shared among the plurality of states; and
      
      a plurality of phoneme model probability sub-distribution functions, wherein at least one of the plurality of probability sub-distribution functions are shared among the plurality of phoneme model probability distribution functions.
  - 42. The computer readable medium of claim 40, wherein the plurality of phoneme models are generated by:
    - retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
      
      generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.
  - 43. The computer readable medium of claim 40, wherein a plurality of shared phoneme models are evaluated for further sharing by:
    - generating a shared phoneme model probability distribution function to replace a plurality of probability distribution functions when each of the plurality of probability distribution functions has common probability sub-distribution functions;
      
      generating a shared phoneme model state to replace a plurality of states when each of the plurality of states has common phoneme model probability distribution functions; and
      
      generating a shared phoneme model to replace a plurality of models when each of the plurality of models has common phoneme model states.

44. A system for recognizing speech comprising:
- means for receiving speech signals into a processor;
  
  means for processing the received speech signals using a speech recognition system produced by generating a plurality of phoneme models, wherein at least one of the plurality of phoneme models are shared among a plurality of phonemes, and at least a first one of the plurality of phoneme models are shared with at least a second one of the plurality of phoneme models; and
  
  means for generating signals representative of the received speech signals.

45. A system for generating a plurality of phoneme models for use in a speech recognition system, comprising:
- means for retaining as a separate phoneme model a triphone phoneme model for which a number of trained frames exceeds a threshold;
  
  means for generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having a common biphone exceeds the threshold;
  
  means for generating at least one shared phoneme model to represent a plurality of triphone phoneme models for which the number of trained frames having an equivalent effect on a phonemic context exceeds the threshold; and
  
  means for generating at least one shared phoneme model to represent a plurality of triphone phoneme models having the same center context.

Current Assignee: Sony Corporation (Sony Group Corp.), Sony Electronics Inc. (Sony Group Corp.)
Original Assignee: Sony Corporation (Sony Group Corp.), Sony Electronics Inc. (Sony Group Corp.)
Inventors: Olorenshaw, Lex S.; Wu, Duanpei; Chen, Ruxin; Tanaka, Miyuki
Primary Examiner(s): Dorvil, Richemond
Application Number: US08/953,026
Time in Patent Office: 796 Days
Field of Search: 704/256, 704/254, 704/255, 704/239, 704/240, 704/200, 704/249, 704/250
US Class Current: 704/254
CPC Class Codes:
G10L 15/142 Hidden Markov Models [HMMs]
G10L 15/148 Duration modelling in HMMs,...
