Speech detection fusing multi-class acoustic-phonetic, and energy features

US 20070033042A1
Filed: 08/03/2005
Published: 02/08/2007
Est. Priority Date: 08/03/2005
Status: Abandoned Application

Abstract

A speech detection system extracts a plurality of features from multiple input streams. In the acoustic model space, the tree of Gaussians in the model is pruned to include the active states. The Gaussians are mapped to Hidden Markov Model states for Viterbi phoneme alignment. Another feature space, such as the energy feature space, is combined with the acoustic feature space. In the feature space, the features are combined and principal component analysis decorrelates the features to fewer dimensions, thus reducing the number of features. The Gaussians are also mapped to silence, disfluent phoneme, or voiced phoneme classes. The silence class is true silence and the voiced phoneme class is speech. The disfluent class may be speech or non-speech. If a frame is classified as disfluent, then that frame is re-classified as the silence class or the voiced phoneme class based on adjacent frames.
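
The feature fusion step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function names `log_energy` and `fuse_and_reduce` are hypothetical, log mean-square energy is only one common choice of energy feature, and the projection matrix is assumed to have been estimated offline (for example by principal component analysis). The sketch only applies that matrix; it does not estimate it.

```python
import math
from typing import List, Sequence


def log_energy(samples: Sequence[int], floor: float = 1e-10) -> float:
    """Log of the mean squared amplitude of one frame of PCM samples.

    The floor avoids log(0) on an all-zero frame (an assumed convention).
    """
    if not samples:
        raise ValueError("frame must contain at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.log(max(mean_square, floor))


def fuse_and_reduce(
    energy_feats: Sequence[float],
    acoustic_feats: Sequence[float],
    projection: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate two feature frames, then project to fewer dimensions.

    `projection` has one row per output dimension; each row must be as
    long as the combined frame. It is assumed to come from an offline
    analysis such as PCA and is hypothetical here.
    """
    combined = list(energy_feats) + list(acoustic_feats)
    for row in projection:
        if len(row) != len(combined):
            raise ValueError("projection row length must match combined frame")
    return [sum(w * x for w, x in zip(row, combined)) for row in projection]


# Usage: one energy feature plus two made-up acoustic features,
# reduced from three dimensions to two by a toy projection.
frame_samples = [1200, -800, 400, -100]
energy = [log_energy(frame_samples)]
acoustic = [0.3, -1.2]
toy_projection = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
current_frame = fuse_and_reduce(energy, acoustic, toy_projection)
```

Keeping the projection as an explicit argument separates the per-frame work (concatenate and multiply) from the offline estimation, whichever analysis is used to obtain the matrix.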

20 Claims

1. A computer implemented method for speech detection, the computer implemented method comprising:
- receiving media input;
  
  generating a current frame of features from the media input;
  
  classifying the current frame of features as a class selected from a set including at least a silence class, a disfluent class, and a speech class;
  
  re-classifying the current frame as the speech class if the current frame is classified as the disfluent class and lies between a previous frame classified as the silence class and a next frame classified as the speech class; and
  
  re-classifying the current frame as the silence class if the current frame is classified as the disfluent class and does not lie between a previous frame classified as the silence class and a next frame classified as the speech class.
- - 2. The computer implemented method of claim 1, wherein generating a current frame of features from the media input comprises:
    - extracting a first frame of features from the media input in a first feature space;
      
      extracting a second frame of features from the media input in a second feature space;
      
      combining at least the first frame of features and the second frame of features to form a combined frame of features;
      
      reducing the number of features in the combined frame of features to form the current frame of features.
  - 3. The computer implemented method of claim 2, wherein the first feature space is an energy feature space.
  - 4. The computer implemented method of claim 2, wherein the second feature space is an acoustic feature space.
  - 5. The computer implemented method of claim 2, wherein reducing the number of features in the combined frame of features comprises performing linear discriminant analysis on the combined frame of features.
  - 6. The computer implemented method of claim 1, wherein classifying the current frame of features comprises classifying the current frame using a Gaussian Mixture Model.
  - 7. The computer implemented method of claim 1, wherein the media input is pulse coded modulated audio.
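
The re-classification rule of claim 1 could be sketched as below. The sketch assumes that "previous" and "next" mean the immediately adjacent frames, and that the neighbours' first-pass labels are the ones consulted; the claim does not spell out either point. The constant and function names are hypothetical.

```python
from typing import List

SILENCE, DISFLUENT, SPEECH = "silence", "disfluent", "speech"


def resolve_disfluent(labels: List[str]) -> List[str]:
    """Replace every disfluent label with speech or silence.

    A disfluent frame becomes speech only when the frame before it was
    labelled silence and the frame after it was labelled speech.
    In every other case, including the first and last frame of the
    sequence, it becomes silence.
    """
    resolved = list(labels)
    for i, label in enumerate(labels):
        if label != DISFLUENT:
            continue
        # Neighbours are read from the original labels, not from
        # `resolved`, so earlier decisions do not cascade (an assumption).
        prev_is_silence = i > 0 and labels[i - 1] == SILENCE
        next_is_speech = i + 1 < len(labels) and labels[i + 1] == SPEECH
        resolved[i] = SPEECH if (prev_is_silence and next_is_speech) else SILENCE
    return resolved


# Usage: the disfluent frame at a silence-to-speech boundary is kept
# as speech; the one in the middle of speech is turned into silence.
first_pass = [SILENCE, DISFLUENT, SPEECH, DISFLUENT, SPEECH]
final = resolve_disfluent(first_pass)
```

Under these assumptions a run of two or more consecutive disfluent frames is resolved entirely to silence, because no frame in the run has both a silence predecessor and a speech successor.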

8. A computer program product comprising:
- a computer usable medium having computer usable program code for speech detection, the computer usable program code comprising;
  
  computer usable program code for receiving media input;
  
  computer usable program code for generating a current frame of features from the media input;
  
  computer usable program code for classifying the current frame of features as a class selected from a set including at least a silence class, a disfluent class, and a speech class; and
  
  computer usable program code for re-classifying the current frame as the speech class if the current frame is classified as the disfluent class and lies between a previous frame classified as the silence class and a next frame classified as the speech class; and
  
  computer usable program code for re-classifying the current frame as the silence class if the current frame is classified as the disfluent class and does not lie between a previous frame classified as the silence class and a next frame classified as the speech class.
- - 9. The computer program product of claim 8, wherein the computer usable program code for generating a current frame of features from the media input comprises:
    - computer usable program code for extracting a first frame of features from the media input in a first feature space;
      
      computer usable program code for extracting a second frame of features from the media input in a second feature space;
      
      computer usable program code for combining at least the first frame of features and the second frame of features to form a combined frame of features;
      
      computer usable program code for reducing the number of features in the combined frame of features to form the current frame of features.
  - 10. The computer program product of claim 9, wherein the first feature space is an energy feature space.
  - 11. The computer program product of claim 9, wherein the second feature space is an acoustic feature space.
  - 12. The computer program product of claim 9, wherein the computer usable program code for reducing the number of features in the combined frame of features comprises computer usable program code for performing linear discriminant analysis on the combined frame of features.
  - 13. The computer program product of claim 8, wherein the computer usable program code for classifying the current frame of features comprises computer usable program code for classifying the current frame using a Gaussian Mixture Model.
  - 14. The computer program product of claim 8, wherein the media input is pulse coded modulated audio.

15. A data processing system for speech detection, the data processing system comprising:
- a memory having stored therein computer program code; and
  
  a processor coupled to the memory, wherein the processor operates under control of the computer program code to receive media input;
  
  generate a current frame of features from the media input;
  
  classify the current frame of features as a class selected from a set including at least a silence class, a disfluent class, and a speech class;
  
  re-classify the current frame as the speech class if the current frame is classified as the disfluent class and lies between a previous frame classified as the silence class and a next frame classified as the speech class; and
  
  re-classify the current frame as the silence class if the current frame is classified as the disfluent class and does not lie between a previous frame classified as the silence class and a next frame classified as the speech class.
- - 16. The data processing system of claim 15, wherein the processor operates under control of the computer program code to generate a current frame of features from the media input by extracting a first frame of features from the media input in a first feature space, extracting a second frame of features from the media input in a second feature space, combining at least the first frame of features and the second frame of features to form a combined frame of features, and reducing the number of features in the combined frame of features to form the current frame of features.
  - 17. The data processing system of claim 16, wherein the first feature space is an energy feature space.
  - 18. The data processing system of claim 16, wherein the second feature space is an acoustic feature space.
  - 19. The data processing system of claim 16, wherein the processor operates under control of the computer program code to reduce the number of features in the combined frame of features by performing linear discriminant analysis on the combined frame of features.
  - 20. The data processing system of claim 15, wherein the processor operates under control of the computer program code to classify the current frame of features using a Gaussian Mixture Model.
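
Classifying a frame with a Gaussian Mixture Model, as recited in claims 6, 13 and 20, might look like the sketch below. It assumes diagonal covariances, assumes that each class owns its own list of weighted Gaussians, and picks the class with the highest log-likelihood. The application itself maps the Gaussians of an acoustic model onto the classes; the toy models and all names here are hypothetical stand-ins.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, means, variances), diagonal covariance.
Gaussian = Tuple[float, List[float], List[float]]


def log_gaussian(x: Sequence[float], means: Sequence[float],
                 variances: Sequence[float]) -> float:
    """Log density of x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, means, variances)
    )


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def classify_frame(x: Sequence[float],
                   class_models: Dict[str, List[Gaussian]]) -> str:
    """Return the class whose mixture gives x the highest log-likelihood."""
    scores = {
        name: log_sum_exp(
            [math.log(w) + log_gaussian(x, m, v) for w, m, v in gaussians]
        )
        for name, gaussians in class_models.items()
    }
    return max(scores, key=scores.get)


# Usage with toy one-dimensional models (made-up parameters).
toy_models: Dict[str, List[Gaussian]] = {
    "silence": [(1.0, [0.0], [1.0])],
    "disfluent": [(1.0, [5.0], [1.0])],
    "speech": [(0.5, [9.0], [1.0]), (0.5, [11.0], [1.0])],
}
label = classify_frame([0.2], toy_models)
```

Frames that come out as disfluent would then go through the neighbour-based re-classification recited in the independent claims.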

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
International Business Machines Corporation
Inventors
Visweswariah, Karthik; Marcheret, Etienne

Application Number

US11/196,698
Publication Number

US 20070033042A1
US Class Current

704/255
CPC Class Codes

G10L 2015/025 Phonemes, fenemes or fenone...

G10L 25/78 Detection of presence or ab...
