Detection of voice inactivity within a sound stream

US 7,756,709 B2
Filed: 02/02/2004
Issued: 07/13/2010
Est. Priority Date: 02/02/2004
Status: Active Grant

9 Assignments

0 Petitions

Abstract

A method for identifying end of voiced speech within an audio stream of a noisy environment employs a speech discriminator. The discriminator analyzes each window of the audio stream, producing an output corresponding to the window. The output is used to classify the window in one of several classes, for example, (1) speech, (2) silence, or (3) noise. A state machine processes the window classifications, incrementing counters as each window is classified: speech counter for speech windows, silence counter for silence, and noise counter for noise. If the speech counter indicates a predefined number of windows, the state machine clears all counters. Otherwise, the state machine appropriately weights the values in the silence and noise counters, adds the weighted values, and compares the sum to a limit imposed on the number of non-voice windows. When the non-voice limit is reached, the state machine terminates processing of the audio stream.
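The counter logic described in the abstract can be sketched as a small state machine. This is a minimal illustration, not the patented implementation: the class name `EndOfSpeechDetector`, the `Label` enum, and every default limit and weight below are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    SPEECH = 1
    SILENCE = 2
    NOISE = 3


@dataclass
class EndOfSpeechDetector:
    # All defaults are illustrative assumptions, not values from the patent.
    speech_limit: int = 5          # "first limit": speech windows that clear the counters
    non_voice_limit: float = 20.0  # "second limit": bound on the weighted non-voice sum
    silence_weight: float = 2.0
    noise_weight: float = 1.0
    speech: int = 0
    silence: int = 0
    noise: int = 0

    def update(self, label: Label) -> bool:
        """Process one window label; return True when end-of-speech is identified."""
        if label is Label.SPEECH:
            self.speech += 1
        elif label is Label.SILENCE:
            self.silence += 1
        else:
            self.noise += 1

        # Enough speech has been seen: clear all three counters and keep listening.
        if self.speech > self.speech_limit:
            self.speech = self.silence = self.noise = 0
            return False

        # Otherwise weight the non-voice counters, add them, and compare to the limit.
        result = self.silence_weight * self.silence + self.noise_weight * self.noise
        return result >= self.non_voice_limit


detector = EndOfSpeechDetector(speech_limit=2, non_voice_limit=6)
labels = [Label.SPEECH] * 3 + [Label.SILENCE] * 2 + [Label.NOISE] * 2
flags = [detector.update(label) for label in labels]
print(flags.index(True))  # index of the window at which end-of-speech is flagged
```

In this run the third speech window clears the counters, and the weighted sum 2·2 + 1·2 = 6 reaches the limit on the last window.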

34 Citations

27 Claims

1. A method of identifying end-of-speech within an audio stream, comprising:
- analyzing each window of the audio stream in a speech discriminator;
  
  assigning a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, a second classification label corresponding to silence within said each window, and a third classification label corresponding to noise in said each window;
  
  incrementing a speech counter when said each window is assigned the first classification label;
  
  incrementing a silence counter when said each window is assigned the second classification label;
  
  incrementing a noise counter when said each window is assigned the third classification label;
  
  clearing the speech counter, the silence counter, and the noise counter when the speech counter exceeds a first limit;
  
  weighting at least one of the silence counter and the noise counter to obtain weighted silence and noise values;
  
  combining the weighted silence and noise values in a result;
  
  comparing the result to a second limit; and
  
  identifying end-of-speech within the audio stream when the non-voice counter reaches a second limit;
  
  wherein the steps of analyzing, assigning, incrementing a speech counter, incrementing a silence counter, incrementing a noise counter, clearing, weighting, combining, comparing, and identifying are performed by at least one processor.
- - 2. A method according to claim 1, further comprising terminating recording of the audio stream when end-of-speech is identified.
  - 3. A method according to claim 1, further comprising terminating processing of the audio stream when end-of-speech is identified.
  - 4. A method according to claim 1, further comprising delimiting end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section.
  - 5. A method according to claim 4, further comprising processing the audio section using a speech recognizer.
  - 6. A method according to claim 4, further comprising segmenting the audio stream into the windows.
  - 7. A method according to claim 6, further comprising:
    - digitizing the audio stream to obtain a digitized audio stream; and
      
      dividing the digitized audio stream into digitized blocks;
      
      wherein the step of dividing is performed prior to the step of segmenting and the step of segmenting comprises a step of segmenting the digitized blocks.
  - 8. A method according to claim 7, wherein the windows are overlapping and the step of segmenting the digitized blocks comprises segmenting the digitized blocks into the overlapping windows.
  - 9. A method according to claim 6, wherein the windows are overlapping and the step of segmenting comprises segmenting the audio stream into the overlapping windows.
  - 10. A method according to claim 9, wherein the first limit corresponds to a time period between 0.7 and 2.5 seconds.
  - 11. A method according to claim 9, wherein said step of analyzing comprises observing energy content of sound in said each window.
  - 12. A method according to claim 11, wherein said step of observing energy content comprises comparing broadband energy content of the sound in said each window to a first sound energy threshold.
  - 13. A method according to claim 11, wherein said step of observing energy content comprises comparing band-limited energy content of the sound in said each window to a first sound energy threshold.
  - 14. A method according to claim 9, wherein said step of analyzing comprises observing zero crossings of the sound in said each window.
  - 15. A method according to claim 14, wherein said step of observing comprises determining zero-crossing rate of the sound in said each window.
  - 16. A method according to claim 14, wherein said step of observing comprises determining number of zero crossings of the sound in said each window.
  - 17. A method according to claim 14, wherein said step of analyzing further comprises observing energy content of the sound in said each window.
  - 18. A method according to claim 14, wherein said step of analyzing further comprises comparing band-limited energy content of the sound in said each block to a first sound energy threshold.
  - 19. A method according to claim 9, wherein said step of weighting comprises weighting the silence counter at about two times rate of weighting the noise counter.
  - 20. A method according to claim 4, wherein:
    - the audio stream comprises sound of a voice mail message; and
      
      said step of receiving comprises receiving the audio stream in digitized blocks from a computer telephony board.
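The dependent claims mention overlapping windows, energy comparison against a threshold, and zero-crossing observation, but they do not say how these measurements map to the three labels. The sketch below fills that gap with one assumed decision rule (low energy means silence, energetic with many zero crossings means noise, otherwise speech). The function names, the thresholds, and the rule itself are hypothetical.

```python
from typing import Iterator, Sequence


def overlapping_windows(samples: Sequence[float], size: int, hop: int) -> Iterator[Sequence[float]]:
    """Yield windows of `size` samples; a hop smaller than size makes them overlap."""
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]


def mean_energy(window: Sequence[float]) -> float:
    """Broadband energy of the window, averaged per sample."""
    return sum(x * x for x in window) / len(window)


def zero_crossings(window: Sequence[float]) -> int:
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))


def classify(window: Sequence[float], energy_floor: float, zc_limit: int) -> str:
    """Assumed rule: quiet -> silence; loud with many crossings -> noise; else speech."""
    if mean_energy(window) < energy_floor:
        return "silence"
    if zero_crossings(window) > zc_limit:
        return "noise"
    return "speech"


def windows_for_duration(seconds: float, sample_rate: int, hop: int) -> int:
    """Convert a time limit (such as the 0.7 to 2.5 s range of claim 10) to a window count."""
    return round(seconds * sample_rate / hop)


print(classify([1.0, -1.0, 1.0, -1.0], energy_floor=0.01, zc_limit=2))  # noise
print(windows_for_duration(1.0, sample_rate=8000, hop=80))              # 100
```

A band-limited variant (claims 13 and 18) would filter the window before calling `mean_energy`; that step is omitted here to keep the example dependency-free.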

21. A method of identifying end-of-speech within an audio stream, comprising:
- step for analyzing each window of the audio stream in a speech discriminator;
  
  step for assigning a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, a second classification label corresponding to silence within said each window, and a third classification label corresponding to noise in said each window;
  
  incrementing a speech counter in response to said each window being assigned the first classification label;
  
  incrementing a silence counter in response to said each window being assigned the second classification label;
  
  incrementing a noise counter in response to said each window being assigned the third classification label;
  
  step for determining when the speech counter exceeds a first limit;
  
  clearing the speech counter, the silence counter, and the noise counter in response to the speech counter exceeding a first limit;
  
  step for weighting at least one of the silence counter and the noise counter to obtain weighted silence and noise values;
  
  step for combining the weighted silence and noise values in a result;
  
  step for comparing the result to a second limit; and
  
  step for identifying end-of-speech within the audio stream in response to the result reaching the second limit;
  
  wherein the steps for analyzing, assigning are performed by at least one processor.
- - 22. A method according to claim 21, further comprising delimiting end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section.
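Delimiting the audio section amounts to mapping the window at which end-of-speech was identified back to a sample offset. The helper below is a hypothetical sketch that assumes fixed-size windows taken every `hop` samples from the start of the stream; the claims do not prescribe where exactly the delimiter is placed.

```python
from typing import List


def delimit_section(samples: List[float], end_window_index: int, size: int, hop: int) -> List[float]:
    """Return the samples up to the end of the window that triggered end-of-speech."""
    # Window k covers samples [k * hop, k * hop + size); clamp to the stream length.
    end_sample = min(len(samples), end_window_index * hop + size)
    return samples[:end_sample]


stream = [float(i) for i in range(20)]
section = delimit_section(stream, end_window_index=3, size=4, hop=2)
print(len(section))  # 10
```

An implementation might instead cut at the start of the trailing non-voice run; that is a design choice the source leaves open.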

23. Apparatus for processing an audio stream, comprising:
- a memory storing program code; and
  
  a digital processor under control of the program code;
  
  wherein the program code comprises:
  
  instructions to cause the processor to receive the audio stream in digitized blocks;
  
  instructions to segment the digitized blocks into windows;
  
  instructions to cause the processor to analyze each window in a speech discriminator;
  
  instructions to cause the processor to assign a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, a second classification label corresponding to silence in said each window, and a third classification label corresponding to noise in said each window;
  
  instructions to cause the processor to increment a speech counter in response to said each window being assigned the first classification label;
  
  instructions to cause the processor to increment a silence counter in response to said each window being assigned the second classification label;
  
  instructions to cause the processor to increment a noise counter in response to said each window being assigned the third classification label;
  
  instructions to cause the processor to clear the speech counter, the silence counter, and the noise counter in response to the speech counter exceeding a first limit;
  
  instructions to cause the processor to weight at least one of the silence counter and the noise counter to obtain weighted silence and noise values;
  
  instructions to cause the processor to combine the weighted silence and noise values in a result;
  
  instructions to cause the processor to compare the result to a second limit; and
  
  instructions to cause the processor to identify end-of-speech within the audio stream in response to the result reaching the second limit.
- - 24. Apparatus according to claim 23, further comprising a mass storage device, wherein:
    - the code further comprises instructions to cause the processor to record the audio stream on the mass storage device, and
      
      the code further comprises instructions to cause the processor to terminate recording of the audio stream when end-of-speech is identified.
  - 25. Apparatus according to claim 23, wherein the code further comprises instructions to cause the processor to terminate processing of the audio stream when end-of-speech is identified.
  - 26. Apparatus according to claim 23, further comprising a computer telephony subsystem capable of sending the digitized blocks to the processor.
  - 27. Apparatus according to claim 23, wherein the program code further comprises instructions to cause the processor to delimit end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section, and to process the digitized audio section using a speech recognizer.
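The apparatus claims describe receiving the stream in digitized blocks and segmenting those blocks into windows. Block boundaries rarely align with window boundaries, so one plausible approach is to carry leftover samples from one block into the next. The generator below sketches that idea; its name and the buffering strategy are assumptions, not details taken from the patent.

```python
from typing import Iterable, Iterator, List


def windows_from_blocks(blocks: Iterable[List[float]], size: int, hop: int) -> Iterator[List[float]]:
    """Re-frame incoming blocks of any length into fixed-size windows."""
    buffer: List[float] = []
    for block in blocks:
        buffer.extend(block)
        # Emit every complete window currently available, then advance by one hop.
        while len(buffer) >= size:
            yield buffer[:size]
            del buffer[:hop]


blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0]]
for window in windows_from_blocks(blocks, size=4, hop=2):
    print(window)
```

Because the function is a generator, a caller can stop pulling windows as soon as end-of-speech is identified, which corresponds to terminating recording or processing in claims 24 and 25.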

Current Assignee: Applied Voice & Speech Tech Incorporated
Original Assignee: Applied Voice & Speech Technologies Incorporated (Open Text Corporation)
Inventors: Gierach, Karl D.
Primary Examiner(s): Dorvil; Richemond
Assistant Examiner(s): Cyr; Leonard Saint
Application Number: US10/770,748
Publication Number: US 20050171768A1
Time in Patent Office: 2,353 Days
Field of Search: 704/231, 704/246, 704/208, 704/251, 704/253, 704/254
US Class Current: 704/253
CPC Class Codes: G10L 25/87 Detection of discrete point...