Method and apparatus for detecting speech activity using cepstrum vectors

US 5,596,680 A
Filed: 12/31/1992
Issued: 01/21/1997
Est. Priority Date: 12/31/1992
Status: Expired due to Term

Abstract

A method and apparatus for detecting speech activity in an input signal. The present invention includes performing begin point detection using power/zero crossing. Once the begin point has been detected, the present invention uses the cepstrum of the input signal to determine the endpoint of the sound in the signal. After both the beginning and ending of the sound are detected, the present invention uses vector quantization distortion to classify the sound as speech or noise.
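
The abstract relies on the cepstrum of the input signal but does not say how it is computed. As a rough illustration only, the sketch below computes a real cepstrum for one frame as the inverse DFT of the log magnitude spectrum. The function names, the `num_coeffs` and `floor` parameters, and the choice of a DFT-based (rather than, say, LPC-derived) cepstrum are all assumptions made for this example, not details taken from the patent.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform; adequate for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, num_coeffs=8, floor=1e-10):
    """Return the first num_coeffs real-cepstrum coefficients of one frame.

    cepstrum = inverse DFT of log|DFT(frame)|
    `floor` (hypothetical parameter) avoids log(0) on empty spectral bins.
    """
    n = len(frame)
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    ceps = []
    for q in range(num_coeffs):
        # The log magnitude of a real frame is symmetric, so the inverse
        # DFT is real and only the cosine terms contribute.
        acc = sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n))
        ceps.append(acc / n)
    return ceps


# A unit impulse has a flat spectrum, so its log magnitude and cepstrum are zero.
impulse_cepstrum = real_cepstrum([1.0] + [0.0] * 15)
```

Scaling a frame by a constant only shifts the zeroth coefficient, which is one reason cepstral distances are often computed on the remaining coefficients.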

31 Claims

1. A method for detecting an endpoint of speech in an input signal, wherein the input signal is sampled, said method comprising the steps of:
- generating cepstrum vectors representing each spectrum of individual samples of the input signal;
  
  generating a cepstrum vector for a steady state portion of the input signal; and
  
  comparing the cepstrum vectors of individual samples with the cepstrum vector for the steady state portion of the input signal to identify the endpoint of speech as that portion of the input signal having a spectrum that converges to the steady state portion of the input signal.
  - 2. The method as defined in claim 1 wherein the endpoint of speech is located where the spectrum of said portion of the input signal begins to converge to the steady state portion of the input signal.
  - 3. The method as defined in claim 1 further comprising the steps of:
    - generating a measure of speech to silence for a current frame corresponding to a current cepstrum based on the current cepstrum and a cepstrum indicative of a steady state portion of the input sound; and
      
      determining if the measure exceeds a predetermined speech threshold for a predetermined number of frames, such that the beginning point of speech is detected when the measure exceeds the predetermined speech threshold for a first predetermined number of frames.
  - 4. The method as defined in claim 3 wherein the step of generating the measure comprises the steps of:
    - generating a plurality of speech to silence measures, wherein one of the plurality of speech to silence measures corresponds to each cepstrum; and
      
      averaging the speech to silence measure for the current frame with speech to silence measures of a predetermined number of previous frames to produce an average measure; and
      
      detecting when the average measure exceeds the predetermined speech threshold for a predetermined number of frames to identify speech.
  - 5. The method as defined in claim 3 further comprising the step of detecting the end of speech when the measure remains below a silence threshold for a second predetermined number of frames.
  - 6. The method as defined in claim 3 wherein the first predetermined number of frames comprises a plurality of consecutive frames.
  - 7. The method as defined in claim 3 wherein the step of generating the measure includes the steps of:
    - computing an average cepstrum vector representing steady state background noise of the speech activity; and
      
      computing a distance from the cepstrum to the average cepstrum vector as the measure of speech to silence for the current frame.
  - 8. The method as defined in claim 3 wherein the step of generating the average measure comprises averaging the current cepstrum with a number of cepstrums corresponding to a predetermined number of frames prior to the current frame.
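
One possible reading of claims 1, 2, 5 and 7 can be sketched as follows: average some background frames into a steady state cepstrum vector, measure each frame's distance from it, and report the endpoint at the start of the first run of frames that stays close to the steady state. The Euclidean distance, the function names, and the parameters `silence_threshold` and `hold_frames` are assumptions for illustration; the claims only speak of a "measure", a "silence threshold" and a "predetermined number of frames".

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length cepstrum vectors component by component."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two cepstrum vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_endpoint(cepstra, steady, silence_threshold, hold_frames):
    """Return the index of the first frame of the first run of `hold_frames`
    consecutive frames whose distance to `steady` is below
    `silence_threshold`, or None if no such run exists."""
    run = 0
    for i, c in enumerate(cepstra):
        if distance(c, steady) < silence_threshold:
            run += 1
            if run == hold_frames:
                return i - hold_frames + 1
        else:
            run = 0
    return None


# Made-up two-coefficient cepstra: two speech-like frames, then background.
steady = mean_vector([[0.0, 0.1], [0.1, 0.0]])
cepstra = [[1.0, 1.0], [0.9, 1.1], [0.06, 0.05], [0.05, 0.04], [0.05, 0.06]]
endpoint = find_endpoint(cepstra, steady, silence_threshold=0.1, hold_frames=3)
```

Returning the start of the run rather than its last frame follows claim 2, which places the endpoint where the signal begins to converge to the steady state.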

9. A method for detecting speech activity in an input signal comprising the steps of:
- detecting a beginning point of speech in the input signal;
  
  detecting an ending point of speech in the input signal, wherein the step of detecting an ending point of speech comprises the steps of
    - computing an average cepstrum vector for each frame to represent a steady state portion of the input signal,
    - comparing cepstrum vectors for individual speech samples with the average cepstrum vector, including the step of determining distance of a current cepstrum vector for an individual speech sample from the average cepstrum vector to determine a variance, and
    - identify the ending point of speech when the variance is at least at a predetermined variance indicative of whether the ending point of speech has been detected.
  - 10. The method as defined in claim 9 wherein the step of detecting the beginning point of speech comprises the steps of:
    - measuring energy contained in the input signal to determine the presence of voiced sound, wherein voiced sound occurs when the energy of the input signal is above a predetermined threshold; and
      
      measuring zero crossings, such that the beginning point of speech is located in the input signal where a total number of zero crossings is greater than a predetermined number of zero crossings.
  - 11. The method as defined in claim 9 further comprising the step of performing vector quantization to classify the input signal, such that the input signal is discriminated between speech and noise.
  - 12. The method as defined in claim 11 wherein said step of performing vector quantization includes the step of determining distortion between each input cepstrum vector and a plurality of representative cepstral vectors for each sound type being classified.
  - 13. The method as defined in claim 12 wherein each plurality of representative cepstral vectors for each sound type to be classified comprises a codebook.
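
Claim 10 locates the beginning point from signal energy and zero crossings. A minimal sketch of that idea is shown below. The claim does not spell out how the two tests are combined, so the rule used here (flag the first frame where either test fires) is an assumption, as are the function names and the threshold parameters.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossings(frame):
    """Count sign changes between consecutive samples (zero counts as positive)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def find_begin_point(frames, energy_threshold, zc_threshold):
    """Return the index of the first frame whose energy or zero-crossing
    count exceeds its threshold, or None if no frame qualifies.

    Combining the two tests with "or" is an assumption of this sketch."""
    for i, frame in enumerate(frames):
        if frame_energy(frame) > energy_threshold or zero_crossings(frame) > zc_threshold:
            return i
    return None


# Made-up frames: silence, a low-energy frame with many sign changes,
# and a high-energy frame with none.
frames = [
    [0.0] * 8,
    [0.01, -0.01] * 4,
    [0.5, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5, 0.4],
]
begin = find_begin_point(frames, energy_threshold=0.1, zc_threshold=5)
```

The zero-crossing test lets a low-energy but rapidly alternating frame, such as an unvoiced fricative, mark the start of sound before the energy test would.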

14. A method for detecting speech activity in an input signal having a beginning point and an ending point, said method comprising the steps of:
- detecting the beginning point of speech in the input signal;
  
  detecting the ending point of speech in the input signal using cepstrum vectors, wherein the step of detecting the ending point of speech comprises the step of comparing the cepstrum vectors of individual speech samples of the input signal with a cepstrum vector for a steady state portion of the input signal to identify the ending point of speech;
  
  classifying the sound as speech or noise, such that speech recognition occurs on the input signal when the sound is classified as speech and speech recognition does not occur on the input signal when the sound is classified as noise.
  - 15. The method as defined in claim 14 wherein the step of classifying comprises the steps of:
    - computing a first distortion between a current cepstral vector and a codebook for speech;
      
      computing a second distortion between the current cepstral vector and a codebook for noise;
      
      comparing a first ratio of the first distortion and second distortion to a first threshold, such that sound of the input signal is classified as speech if the first ratio is less than the first threshold at least a first predetermined number of times for a first predetermined number of windows; and
      
      comparing a second ratio of the second distortion and the first distortion to a second threshold, such that sound of the input signal is classified as noise if the second ratio is less than the second threshold at least a second predetermined number of times for a second predetermined number of windows.
  - 16. The method as defined in claim 15 further comprising the step of classifying sound in the input signal as neither speech or noise if the first ratio is greater than the first threshold and the second ratio is greater than the second threshold.
  - 17. The method as defined in claim 15 wherein the step computing the first distortion and the step of computing the second distortion each comprises determining average distortion over a predetermined number of frames of the input signal.
  - 18. The method as defined in claim 15 wherein the first threshold is an inverse proportion of the second threshold.
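
The ratio test of claims 15 to 17 can be illustrated with a small sketch. Distortion is taken here as the squared Euclidean distance to the nearest codeword, averaged over the frames of one window; that metric, the function names, the `eps` guard, and the reduction to a single window (claim 15 counts how often the test passes across several windows) are simplifying assumptions for this example.

```python
def vq_distortion(vector, codebook):
    """Squared distance from `vector` to its nearest codeword (assumed metric)."""
    return min(
        sum((x - c) ** 2 for x, c in zip(vector, code)) for code in codebook
    )


def average_distortion(vectors, codebook):
    """Average VQ distortion over the frames of one window (cf. claim 17)."""
    return sum(vq_distortion(v, codebook) for v in vectors) / len(vectors)


def classify_window(vectors, speech_codebook, noise_codebook,
                    speech_threshold, noise_threshold, eps=1e-12):
    """Label one window "speech", "noise", or "neither" using the two ratios.

    `eps` (hypothetical) guards against division by a zero distortion."""
    d_speech = average_distortion(vectors, speech_codebook)
    d_noise = average_distortion(vectors, noise_codebook)
    if d_speech / (d_noise + eps) < speech_threshold:
        return "speech"
    if d_noise / (d_speech + eps) < noise_threshold:
        return "noise"
    return "neither"


# Made-up two-dimensional codebooks and cepstral vectors.
speech_cb = [[1.0, 1.0], [2.0, 2.0]]
noise_cb = [[0.0, 0.0]]
label = classify_window([[1.1, 0.9], [1.9, 2.1]], speech_cb, noise_cb, 0.5, 0.5)
```

A window that lies about equally far from both codebooks fails both ratio tests and is labeled "neither", which corresponds to the case described in claim 16.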

19. A method for detecting speech activity in an input signal comprising the steps of:
- detecting the power and zero crossings of the input signal to determine a beginning point of sound in the input signal;
  
  detecting an end point of sound in the input signal, wherein the step of detecting an end point of sound comprises the steps of
    - generating cepstrum vectors representing each spectrum of individual samples of the input signal,
    - generating a cepstrum vector for a steady state portion of the input signal, and
    - comparing the cepstrum vectors of individual speech samples for each frame with the cepstrum vector representing a steady state portion of the input signal and identifying the end point of sound as the point of the input signal where the current cepstrum vector converges to the cepstrum vector representing the steady state; and
  
  comparing the current cepstral vector with a speech codebook and a noise codebook, such that the sound is classified as speech or noise according to the distortion between current cepstral vector and a speech codebook and a noise codebook.

20. A system for recognizing speech from an input signal comprising:
- speech activity detection means for detecting speech in the input signal, wherein said speech activity detection means comprises
  
  means for detecting power and zero crossings of the input signal to determine a beginning point of sound in the input signal;
  
  means for generating cepstral vectors representing each spectrum of individual samples of the input signal;
  
  means for generating a cepstral vector for a steady state portion of the input signal;
  
  means for comparing cepstral vectors of individual samples with the cepstral vector for the steady state portion of the input signal to identify the endpoint of speech as that portion of the input signal having a spectrum that converges to the steady state portion of the input signal; and
  
  means for comparing a current cepstral vector with a speech codebook and a noise codebook, such that sound in the input signal is classified as speech or noise according to a distortion between the current cepstral vector and a speech codebook and a noise codebook, wherein if the sound is classified as speech then the current cepstral vector is output as an output speech signal; and
  
  a recognition engine for receiving the output speech signal and recognizing the speech, such that at least one recognized word is generated.

21. A method of detecting speech activity in a data input stream comprising the steps of:
- (a) generating a set of spectral representation vectors to represent the data input stream, wherein each spectral representation vector of the set of spectral representation vectors represents a predetermined portion of the data input stream;
  
  (b) generating a steady state spectral representation vector indicative of the state of the data input stream at a first predetermined portion of the data input stream;
  
  (c) comparing a spectral representation vector corresponding to the first predetermined portion of the data input stream to the steady state spectral representation vector; and
  
  (d) determining a first end point of speech activity when the set of spectral representation vectors converges toward the steady state spectral representation vector.
  - 22. The method of claim 21, further comprising the step of:
    - (e) determining a second end point of speech activity when the set of spectral representation vectors diverges from the steady state spectral representation vector.
  - 23. The method of claim 22, wherein the step (e) comprises determining the second end point of speech activity when a predetermined number of spectral representation vectors of the set of spectral representation vectors are within a predetermined distance of the steady state spectral representation vector for a continuous predetermined period of time.
  - 24. The method of claim 22, further comprising the step of:
    - (f) determining whether the speech activity more closely resembles a speech codebook or a noise codebook.
  - 25. The method of claim 24, wherein the step (f) comprises:
    - calculating a first distortion for each of a plurality of spectral representation vectors of the set of spectral representation vectors between each of the plurality of spectral representation vectors and the speech codebook;
      
      calculating a second distortion for each of a plurality of spectral representation vectors of the set of spectral representation vectors between each of the plurality of spectral representation vectors and the noise codebook; and
      
      classifying the speech activity as speech, provided the first distortion is greater than a speech threshold for a first predetermined period of time, otherwise classifying the speech activity as noise, provided the second distortion is greater than a noise threshold for the first predetermined period of time.
  - 26. The method of claim 21, wherein the step (d) comprises determining the first end point of speech activity when a predetermined number of spectral representation vectors of the set of spectral representation vectors are a predetermined distance away from the steady state spectral representation vector for a continuous predetermined period of time.

27. An apparatus for detecting speech activity in a data input stream comprising:
- a memory unit;
  
  an input device for receiving the data input stream;
  
  a processor coupled to the memory unit and the input device, wherein the processor generates a set of spectral representation vectors to represent the data input stream and stores the set of spectral representation vectors in the memory unit, wherein each spectral representation vector of the set of spectral representation vectors represents a predetermined portion of the data input stream, wherein the processor also generates a steady state spectral representation vector indicative of the state of the data input stream at a first predetermined portion of the data input stream and compares a spectral representation vector corresponding to the first predetermined portion of the data input stream to the steady state spectral representation vector, and determines a first end point of speech activity when the set of spectral representation vectors converges toward the steady state spectral representation vector.
  - 28. The apparatus of claim 27, wherein the processor determines a second end point of speech activity when the set of spectral representation vectors diverges from the steady state spectral representation vector.
  - 29. The apparatus of claim 28, wherein the processor determines the second end point of speech activity when a predetermined number of spectral representation vectors of the set of spectral representation vectors are within a predetermined distance of the steady state spectral representation vector for a continuous predetermined period of time.
  - 30. The apparatus of claim 28, wherein the processor also calculates a first distortion for each of a plurality of spectral representation vectors of the set of spectral representation vectors between each of the plurality of spectral representation vectors and a speech codebook, calculates a second distortion for each of a plurality of spectral representation vectors of the set of spectral representation vectors between each of the plurality of spectral representation vectors and the noise codebook, classifies the speech activity as speech, provided the first distortion is greater than a speech threshold for a first predetermined period of time, and classifies the speech activity as noise, provided the second distortion is greater than a noise threshold for the first predetermined period of time.
  - 31. The apparatus of claim 27, wherein the processor determines the first end point of speech activity when a predetermined number of spectral representation vectors of the set of spectral representation vectors are a predetermined distance away from the steady state spectral representation vector for a continuous predetermined period of time.

Current Assignee: Apple Inc.
Original Assignee: Apple Computer Incorporated (Apple Inc.)
Inventors: Staats, Erik P.; Chow, Yen-Lu
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): Dorvil, Richemond
Application Number: US07/999,128
Time in Patent Office: 1,482 Days
Field of Search: 395/2, 395/2.5, 395/2.54, 395/2.57, 395/2.62, 395/2.22, 395/2.31, 395/2.59, 395/2.64
US Class Current: 704/248
CPC Class Codes:
G10L 25/09   the extracted parameters be...
G10L 25/24   the extracted parameters be...
G10L 25/87   Detection of discrete point...
