Hidden Markov model for speech processing with training method

US 9,020,816 B2
Filed: 08/13/2009
Issued: 04/28/2015
Est. Priority Date: 08/14/2008
Status: Active Grant

Abstract

A method, system and apparatus are shown for identifying non-language speech sounds in a speech or audio signal. An audio signal is segmented and feature vectors are extracted from the segments of the audio signal. The segment is classified using a hidden Markov model (HMM) that has been trained on sequences of these feature vectors. Post-processing components can be utilized to enhance classification. An embodiment is described in which the hidden Markov model is used to classify a segment as a language speech sound or one of a variety of non-language speech sounds. Another embodiment is described in which the hidden Markov model is trained using discriminative learning.
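
As a rough illustration of the classification step described in the abstract, the sketch below decodes a sequence of segments with the Viterbi algorithm and then maps each decoded state to a label. It is a minimal sketch under stated assumptions, not the patented implementation: the three-state layout, the `STATE_LABEL` table, and all probabilities are hypothetical, and the per-segment emission likelihoods are assumed to have been computed already from the feature vectors.

```python
import math

# Hypothetical 3-state model: state 0 models language speech, states 1 and 2
# model two kinds of non-language sound. All numbers are illustrative only.
STATE_LABEL = {0: "language", 1: "non-language", 2: "non-language"}


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path.

    log_init[s]     : log P(first state = s)
    log_trans[r][s] : log P(next state = s | current state = r)
    log_emit[t][s]  : log-likelihood of segment t's features under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at step t
    for t in range(1, len(log_emit)):
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def classify_segments(log_init, log_trans, log_emit):
    """Decode the state path, then apply the many-to-one state-to-label map."""
    return [STATE_LABEL[s] for s in viterbi(log_init, log_trans, log_emit)]


def to_log(rows):
    return [[math.log(p) for p in row] for row in rows]


# Illustrative inputs: four segments, the last two resembling non-language sounds.
init = [math.log(p) for p in (0.6, 0.2, 0.2)]
trans = to_log([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
emit = to_log([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1],
               [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]])
print(classify_segments(init, trans, emit))
# ['language', 'language', 'non-language', 'non-language']
```

Working in the log domain avoids numerical underflow on long segment sequences.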

27 Claims

1. A computerized method of detecting non-language speech sounds in an audio signal, comprising:
- realizing with a computer a hidden Markov model comprising a plurality of states, wherein at least one of the plurality of states is associated with a non-language speech sound;
  
  isolating a segment of the audio signal;
  
  extracting a first feature set consisting of mel-frequency cepstral coefficients (MFCCs), pitch confidence, cepstral stationarity, and cepstral variance from the segment;
  
  using the first feature set to associate the segment with one or more of the plurality of states of the hidden Markov model; and
  
  classifying the segment as a language speech sound or a non-language speech sound accordingly.
- Dependent claims (2 to 22):
  - 2. The method of claim 1, further comprising:
    - providing a second feature set comprising the first feature set and one or more other signs of non-language speech;
      
      determining a second feature from the segment, wherein the second feature belongs to the second feature set; and
      
      using the second feature to associate the segment with one or more of the plurality of states of the hidden Markov model.
  - 3. The method of claim 2, wherein the second feature belongs to the first feature set.
  - 4. The method of claim 1, wherein the segment is one of a plurality of segments isolated from the audio signal, and further comprising:
    - determining a plurality of first feature values, wherein each one of the plurality of first feature values is a determination of the first feature for one of the plurality of segments isolated from the audio signal; and
      
      using the plurality of first feature values to associate each one of the plurality of segments isolated from the audio signal with one or more of the plurality of states of the hidden Markov model.
  - 5. The method of claim 4, wherein the plurality of segments isolated from the audio signal and associated with one or more of the plurality of states of the hidden Markov model comprise at least two segments that are associated with different states of the hidden Markov model.
  - 6. The method of claim 4, further comprising:
    - providing a second feature set comprising the first feature set and one or more other signs of non-language speech;
      
      determining a plurality of second feature values, wherein each one of the plurality of second feature values is a determination of the second feature for one of the plurality of segments isolated from the audio signal; and
      
      using the plurality of second feature values to associate each one of the plurality of segments isolated from the audio signal with one or more of the plurality of states of the hidden Markov model.
  - 7. The method of claim 6, wherein each of the plurality of second feature values belongs to the first feature set.
  - 8. The method of claim 1, wherein the non-language speech sound associated with at least one of the plurality of states comprises at least one of:
    - silence, filled pause, cough, laugh, lipsmack, microphone, background speech, noise, and breath.
  - 9. The method of claim 1, wherein at least one of the plurality of states is associated with a non-language speech sound comprising at least one of silence, filled pause, cough, laugh, lipsmack, microphone, background speech, noise, and breath.
  - 10. The method of claim 1, wherein the at least one of the plurality of states is associated with a non-language speech sound comprising at least two of silence, filled pause, cough, laugh, lipsmack, microphone, background speech, noise, and breath.
  - 11. The method of claim 1, wherein the first feature is used to associate the segment with one of the plurality of states of the hidden Markov model.
  - 12. The method of claim 1, wherein a user-specifiable detection threshold is used to classify the segment as a language speech sound or a non-language speech.
  - 13. The method of claim 12 further comprising computing the probability that the segment belongs to each of the plurality of states of the hidden Markov model.
  - 14. The method of claim 1, further comprising training the hidden Markov model.
  - 15. The method of claim 14, further comprising adapting the trained hidden Markov model to the speech signal that is to be classified.
  - 16. The method of claim 14, wherein training the hidden Markov Model comprises re-estimating one or more parameters of the hidden Markov model based on one or more observation sequences and a label sequence for each of the one or more observation sequences.
  - 17. The method of claim 14, further comprising using discriminative learning to train the hidden Markov Model.
  - 18. The method of claim 14, further comprising using maximum mutual information optimization criteria to train the hidden Markov Model.
  - 19. The method of claim 14, further comprising evaluating all possible classification label sequences for a plurality of input observation sequences.
  - 20. The method of claim 19, further comprising formatting the plurality of input observation sequences into a plurality of shorter observation sequences.
  - 21. The method of claim 1, wherein the hidden Markov model is ergodic.
  - 22. The method of claim 1, further comprising outputting classification data.
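
Claims 12 and 13 combine a user-specifiable detection threshold with per-state probabilities for each segment. One way this could work is sketched below: a scaled forward-backward pass yields the posterior probability of every state for every segment, and a segment is flagged when the total posterior of the non-language states reaches the threshold. The claims do not say how the threshold is applied, so the decision rule, the function names, and the default of 0.5 are assumptions. The example transition matrix has no zero entries, which makes the model ergodic in the sense of claim 21.

```python
def state_posteriors(init, trans, emit):
    """Scaled forward-backward pass.

    init[s]     : P(first state = s)
    trans[r][s] : P(next state = s | current state = r)
    emit[t][s]  : likelihood of segment t's features under state s
    Returns gamma, where gamma[t][s] = P(state at t = s | all segments).
    """
    n, steps = len(init), len(emit)
    alpha, scale = [], []
    for t in range(steps):
        if t == 0:
            row = [init[s] * emit[0][s] for s in range(n)]
        else:
            row = [emit[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in range(n))
                   for s in range(n)]
        c = sum(row)  # scaling constant keeps each forward row summing to 1
        scale.append(c)
        alpha.append([v / c for v in row])
    beta = [[1.0] * n for _ in range(steps)]
    for t in range(steps - 2, -1, -1):
        for s in range(n):
            beta[t][s] = sum(trans[s][r] * emit[t + 1][r] * beta[t + 1][r]
                             for r in range(n)) / scale[t + 1]
    # With this scaling, alpha * beta is already the normalized posterior.
    return [[alpha[t][s] * beta[t][s] for s in range(n)] for t in range(steps)]


def detect_non_language(gamma, non_language_states, threshold=0.5):
    """Flag a segment when its non-language posterior mass reaches the threshold."""
    return [sum(row[s] for s in non_language_states) >= threshold for row in gamma]


# Illustrative ergodic model: state 0 is language, states 1 and 2 are non-language.
demo_init = [0.6, 0.2, 0.2]
demo_trans = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
demo_emit = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1],
             [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
demo_gamma = state_posteriors(demo_init, demo_trans, demo_emit)
demo_flags = detect_non_language(demo_gamma, non_language_states=(1, 2), threshold=0.5)
```

Lowering `threshold` makes the detector flag more segments as non-language, and raising it makes the detector more conservative.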

23. A computerized method of classifying sounds in an audio signal into language speech sounds and non-language speech sounds, the method comprising:
- realizing in a computer a hidden Markov model comprising a plurality of hidden Markov states;
  
  providing a plurality of classification labels such that there is a one-to-many mapping between each of the plurality of classification labels and the plurality of hidden Markov states;
  
  training the hidden Markov model, comprising:
  
  providing a plurality of input observation sequences, wherein each of the plurality of input observation sequences comprises a plurality of input observations;
  
  providing correct classification labels for each input observation sequence, such that one correct label is assigned to each of the plurality of input observations;
  
  determining an observation sequence associated with a plurality of segments isolated from the audio signal, wherein the observation sequence comprises at least one observation for each one of the plurality of segments isolated from the audio signal; and
  
  associating the observation sequence with a sequence of hidden Markov states, whereby the one-to-many mapping determines a classification label for each one of the plurality of segments isolated from the audio signal, wherein the plurality of classification labels comprises a label for non-language speech sounds, and wherein the at least one observation consists of:
  
  mel-frequency cepstral coefficients (MFCCs), a pitch confidence measurement, a cepstral stationarity measurement, and a cepstral variance measurement.
- Dependent claims (24, 25):
  - 24. The method of claim 23 further comprising computing label dependent forward and backward probabilities.
  - 25. The method of claim 23 further comprising:
    - formatting the plurality of input observation sequences into shorter observation sequences; and
      
      evaluating all possible classification label sequences for the shorter observation sequences.
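
Claims 24 and 25 refer to label-dependent forward and backward probabilities and to shorter observation sequences. The sketch below shows one plausible reading, offered as an assumption rather than the patent's formulation: a forward pass restricted to states whose label matches the correct label gives the joint probability of the observations and the labels, and dividing by the unrestricted forward probability gives the posterior of the label sequence. A discriminative criterion of the kind named in claims 17 and 18 would aim to increase that quantity. The helper names and the `split_sequence` chunking rule are hypothetical.

```python
import math


def forward_likelihood(init, trans, emit, allowed=None):
    """Unscaled forward pass, adequate for short sequences.

    With allowed=None this returns P(observations). When allowed[t] is the set
    of states permitted at step t, it returns the joint probability of the
    observations and of staying inside the permitted states at every step.
    """
    n = len(init)

    def ok(t, s):
        return allowed is None or s in allowed[t]

    alpha = [init[s] * emit[0][s] if ok(0, s) else 0.0 for s in range(n)]
    for t in range(1, len(emit)):
        alpha = [
            emit[t][s] * sum(alpha[r] * trans[r][s] for r in range(n))
            if ok(t, s) else 0.0
            for s in range(n)
        ]
    return sum(alpha)


def label_log_posterior(init, trans, emit, labels, state_label):
    """log P(label sequence | observations) under a many-to-one state-to-label map."""
    allowed = [{s for s, lab in state_label.items() if lab == want} for want in labels]
    joint = forward_likelihood(init, trans, emit, allowed)
    total = forward_likelihood(init, trans, emit)
    return math.log(joint) - math.log(total)


def split_sequence(seq, max_len):
    """Cut one long observation sequence into consecutive chunks of at most max_len."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]
```

Because the number of possible label sequences grows exponentially with sequence length, enumerating them all is only practical on short chunks such as those produced by `split_sequence`.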

26. An apparatus for detecting non-language speech sounds in an audio signal, comprising:
- a programmed processor; and
  
  computer-readable media storing instructions that, when executed on the programmed processor, provide a hidden Markov model comprising a plurality of states, wherein at least one of the plurality of states is associated with a non-language speech sound;
  
  isolate a segment of an audio signal;
  
  extract a first feature set consisting of mel-frequency cepstral coefficients (MFCCs), pitch confidence, cepstral stationarity, and cepstral variance from the segment;
  
  use the first feature set to associate the segment with one or more of the plurality of states of the hidden Markov model; and
  
  classify the segment as a language speech sound or a non-language speech sound accordingly.
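
The isolate and extract steps of claim 26 can be illustrated with a small sketch. The claims name pitch confidence as a feature but this listing does not define it, so the version below is an assumption: it takes the peak of the normalized autocorrelation over a range of candidate pitch lags. The frame length, hop size, and lag range are hypothetical parameters, and the MFCC, cepstral stationarity, and cepstral variance features are not shown.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Isolate fixed-length, possibly overlapping segments from the signal."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def pitch_confidence(frame, min_lag, max_lag):
    """Peak normalized autocorrelation over candidate pitch lags, in [0, 1].

    Assumes max_lag < len(frame). Values near 1 suggest a strongly periodic
    (voiced) segment; values near 0 suggest noise or silence.
    """
    mean = sum(frame) / len(frame)
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0  # silent frame: no periodicity evidence
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
        best = max(best, r)
    return best


# Illustrative input: a 100 Hz tone sampled at 8 kHz (period of 80 samples).
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]
tone_frames = frame_signal(tone, frame_len=400, hop=200)
tone_confidence = [pitch_confidence(f, min_lag=20, max_lag=160) for f in tone_frames]
```

A periodic tone scores high, while white noise or silence scores low, which is why a measure of this kind could help separate voiced speech from sounds such as breath or microphone noise.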

27. A computerized speech recognition system for detecting non-language speech sounds comprising:
- a pre-processor adapted to isolate a plurality of segments from an audio signal;
  
  a signal processor, the signal processor adapted to extract from each of the plurality of segments isolated from the audio signal the following feature set:
  
  mel-frequency cepstral coefficients (MFCCs), a pitch confidence measurement, a cepstral stationarity measurement, and a cepstral variance measurement;
  
  a computerized hidden Markov model comprising a plurality of hidden Markov states and many-to-one mappings between the plurality of hidden Markov states and a plurality of classification labels, at least one of the plurality of classification labels comprising at least one non-language speech sound, whereby the computerized hidden Markov model is adapted to use the feature set to associate each of the plurality of segments with one or more of the plurality of hidden Markov states and to classify each of the plurality of segments as a language speech sound or a non-language speech sound; and
  
  a post-processor coupled to the computerized hidden Markov model.
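
Claim 27 recites a post-processor coupled to the hidden Markov model without saying what it does, and the abstract says only that post-processing can enhance classification. Purely as an illustration, the sketch below shows one common kind of post-processing: removing implausibly short label runs from the per-segment output. The rule and the `min_run` parameter are assumptions and are not taken from the patent.

```python
from itertools import groupby


def smooth_labels(labels, min_run=3):
    """Relabel runs shorter than min_run with the label of the preceding run.

    A short run at the very start has no preceding run and is left unchanged.
    """
    runs = [(lab, len(list(group))) for lab, group in groupby(labels)]
    out = []
    for lab, n in runs:
        if n < min_run and out:
            lab = out[-1]  # absorb the short run into what came before it
        out.extend([lab] * n)
    return out
```

For example, a single segment labeled non-language in the middle of a long stretch of language speech would be relabeled as language, while a sustained non-language run of at least `min_run` segments would be kept.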

Current Assignee
21CT Incorporated
Original Assignee
21CT Incorporated
Inventors
McClain, Matthew
Primary Examiner(s)
Desir, Pierre-Louis
Assistant Examiner(s)
Sirjani, Fariba

Application Number
US13/059,048
Publication Number
US 20110208521A1
Time in Patent Office
2,084 Days
Field of Search
US Class Current
704/233
CPC Class Codes
G10L 15/142   Hidden Markov Models [HMMs]
G10L 17/26   Recognition of special voic...
G10L 25/24   the extracted parameters be...
