Detection of voice inactivity within a sound stream

US 20050171768A1
Filed: 02/02/2004
Published: 08/04/2005
Est. Priority Date: 02/02/2004
Status: Active Grant

Abstract

A method for identifying end of voiced speech within an audio stream of a noisy environment employs a speech discriminator. The discriminator analyzes each window of the audio stream, producing an output corresponding to the window. The output is used to classify the window in one of several classes, for example, (1) speech, (2) silence, or (3) noise. A state machine processes the window classifications, incrementing counters as each window is classified: speech counter for speech windows, silence counter for silence, and noise counter for noise. If the speech counter indicates a predefined number of windows, the state machine clears all counters. Otherwise, the state machine appropriately weights the values in the silence and noise counters, adds the weighted values, and compares the sum to a limit imposed on the number of non-voice windows. When the non-voice limit is reached, the state machine terminates processing of the audio stream.
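
The state machine described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `EndOfSpeechDetector`, the default limits, and the weights (1.0 for silence, 0.5 for noise) are hypothetical values chosen for the example, as is the choice to clear the counters when the speech counter exceeds, rather than merely reaches, its limit.

```python
from dataclasses import dataclass

SPEECH, SILENCE, NOISE = "speech", "silence", "noise"


@dataclass
class EndOfSpeechDetector:
    # All defaults below are illustrative assumptions, not values from the abstract.
    speech_limit: int = 7            # speech windows that trigger a reset
    non_voice_limit: float = 15.0    # weighted non-voice total that ends the stream
    silence_weight: float = 1.0
    noise_weight: float = 0.5
    speech: int = 0
    silence: int = 0
    noise: int = 0

    def update(self, label: str) -> bool:
        """Consume one window label; return True once end-of-speech is identified."""
        if label == SPEECH:
            self.speech += 1
        elif label == SILENCE:
            self.silence += 1
        elif label == NOISE:
            self.noise += 1
        else:
            raise ValueError(f"unknown window label: {label!r}")

        # Enough speech has been seen: clear all counters and keep listening.
        if self.speech > self.speech_limit:
            self.speech = self.silence = self.noise = 0
            return False

        # Otherwise weight the non-voice counters, add them, and compare to the limit.
        weighted = self.silence_weight * self.silence + self.noise_weight * self.noise
        return weighted >= self.non_voice_limit
```

A caller would feed one label per window and stop processing the audio stream the first time `update` returns `True`.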

69 Claims

1. A method of identifying end-of-speech within an audio stream, comprising:
- analyzing each window of the audio stream in a speech discriminator;
  
  assigning a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, and one or more classification labels corresponding to absence of speech in said each window;
  
  incrementing a speech counter when said each window is assigned the first classification label;
  
  incrementing a non-voice counter when said each window is assigned a classification label corresponding to absence of speech;
  
  clearing the speech counter and the non-voice counter when the speech counter exceeds a first limit; and
  
  identifying end-of-speech within the audio stream when the non-voice counter reaches a second limit.
- Dependent claims (2 to 34):
  - 2. A method according to claim 1, further comprising terminating recording of the audio stream when end-of-speech is identified.
  - 3. A method according to claim 1, further comprising terminating processing of the audio stream when end-of-speech is identified.
  - 4. A method according to claim 1, further comprising delimiting end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section.
  - 5. A method according to claim 4, further comprising processing the audio section using a speech recognizer.
  - 6. A method according to claim 4, further comprising segmenting the audio stream into the windows.
  - 7. A method according to claim 6, further comprising:
    - digitizing the audio stream to obtain a digitized audio stream; and
      
      dividing the digitized audio stream into digitized blocks;
      
      wherein the step of dividing is performed prior to the step of segmenting and the step of segmenting comprises a step of segmenting the digitized blocks.
  - 8. A method according to claim 7, wherein the windows are overlapping and the step of segmenting the digitized blocks comprises segmenting the digitized blocks into the overlapping windows.
  - 9. A method according to claim 6, wherein the windows are overlapping and the step of segmenting comprises segmenting the audio stream into the overlapping windows.
  - 10. A method according to claim 9, wherein the windows overlap by between 2 and 20 percent.
  - 11. A method according to claim 9, wherein the windows overlap by between 4 and 12 percent.
  - 12. A method according to claim 9, wherein the windows overlap by about 10 percent.
  - 13. A method according to claim 9, wherein the step of digitizing comprises digitizing at a rate of about 8000 samples per second.
  - 14. A method according to claim 13, wherein said each window is about 200 milliseconds in length.
  - 15. A method according to claim 9, wherein the first limit corresponds to a time period of about 1.3 seconds.
  - 16. A method according to claim 9, wherein the first limit corresponds to a time period between 0.7 and 2.5 seconds.
  - 17. A method according to claim 9, wherein the first limit corresponds to a time period between 1 and 1.8 seconds.
  - 18. A method according to claim 9, wherein the first limit corresponds to a time period between 1 and 1.5 seconds.
  - 19. A method according to claim 9, wherein the first limit is seven windows.
  - 20. A method according to claim 9, wherein the second limit corresponds to a time period between 1 and 4 seconds.
  - 21. A method according to claim 9, wherein the second limit corresponds to a time period between 2.5 and 3.5 seconds.
  - 22. A method according to claim 9, wherein the second limit corresponds to a time period of about 3 seconds.
  - 23. A method according to claim 9, wherein the second limit is 15 windows.
  - 24. A method according to claim 9, wherein said step of analyzing comprises observing energy content of sound in said each window.
  - 25. A method according to claim 24, wherein said step of observing energy content comprises comparing broadband energy content of the sound in said each window to a first sound energy threshold.
  - 26. A method according to claim 24, wherein said step of observing energy content comprises comparing band-limited energy content of the sound in said each window to a first sound energy threshold.
  - 27. A method according to claim 9, wherein said step of analyzing comprises observing zero crossings of sound in said each window.
  - 28. A method according to claim 27, wherein said step of observing comprises determining zero-crossing rate of sound in said each window.
  - 29. A method according to claim 27, wherein said step of observing comprises determining number of zero crossings of sound in said each window.
  - 30. A method according to claim 27, wherein said step of analyzing further comprises observing energy content of sound in said each window.
  - 31. A method according to claim 9, wherein the one or more classification labels corresponding to absence of speech comprise (1) a second classification label corresponding to silence, and (2) a third classification label corresponding to noise.
  - 32. A method according to claim 4, further comprising:
    - receiving the audio stream in digitized blocks from a computer telephony board; and
      
      segmenting the digitized blocks of the audio stream into the windows;
      
      wherein the audio stream comprises sound of a voice mail message.
  - 33. A method according to claim 9, wherein said step of analyzing comprises processing each window using endpointer algorithm.
  - 34. A method according to claim 9, wherein said step of analyzing comprises step for analyzing each window in a speech discriminator.
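
The segmenting and analyzing steps recited in these dependent claims (overlapping windows, energy content, zero crossings) can be illustrated with a short sketch. At 8000 samples per second a 200 millisecond window holds 1600 samples, and a 10 percent overlap leaves a hop of 1440 samples (180 milliseconds). The function names, both thresholds, and the decision rule (low energy means silence, high energy with few zero crossings means speech, high energy with many zero crossings means noise) are assumptions made for illustration; the claims do not specify how energy and zero crossings are combined.

```python
from typing import List, Sequence


def segment_windows(samples: Sequence[float], rate: int = 8000,
                    window_ms: int = 200, overlap: float = 0.10) -> List[List[float]]:
    """Split samples into fixed-length windows that overlap by the given fraction."""
    size = rate * window_ms // 1000       # 1600 samples with the defaults
    hop = size - round(size * overlap)    # 1440 samples with the defaults
    if size <= 0 or hop <= 0:
        raise ValueError("window size and hop must be positive")
    return [list(samples[i:i + size])
            for i in range(0, len(samples) - size + 1, hop)]


def mean_energy(window: Sequence[float]) -> float:
    """Broadband energy of the window, averaged per sample."""
    if not window:
        raise ValueError("empty window")
    return sum(x * x for x in window) / len(window)


def zero_crossings(window: Sequence[float]) -> int:
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))


def classify_window(window: Sequence[float],
                    energy_threshold: float = 1e-3,          # hypothetical threshold
                    max_speech_crossing_rate: float = 0.25   # hypothetical threshold
                    ) -> str:
    """Label a window as 'speech', 'silence', or 'noise' using an assumed heuristic."""
    if mean_energy(window) < energy_threshold:
        return "silence"
    crossing_rate = zero_crossings(window) / max(len(window) - 1, 1)
    return "speech" if crossing_rate <= max_speech_crossing_rate else "noise"
```

In practice the thresholds would need tuning against the actual line noise and signal levels; a band-limited energy measure (claim 26) would replace `mean_energy` with the energy of a filtered copy of the window.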

35. A method of identifying end-of-speech within an audio stream, comprising:
- analyzing each window in a speech discriminator;
  
  assigning a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, a second classification label corresponding to silence within said each window, and a third classification label corresponding to noise in said each window;
  
  incrementing a speech counter when said each window is assigned the first classification label;
  
  incrementing a silence counter when said each window is assigned the second classification label;
  
  incrementing a noise counter when said each window is assigned the third classification label;
  
  clearing the speech counter, the silence counter, and the noise counter when the speech counter exceeds a first limit;
  
  weighting at least one of the silence counter and the noise counter to obtain weighted silence and noise values;
  
  combining the weighted silence and noise values in a result;
  
  comparing the result to a second limit; and
  
  identifying end-of-speech within the audio stream when the result reaches the second limit.
- Dependent claims (36 to 54):
  - 36. A method according to claim 35, further comprising terminating recording of the audio stream when end-of-speech is identified.
  - 37. A method according to claim 35, further comprising terminating processing of the audio stream when end-of-speech is identified.
  - 38. A method according to claim 35, further comprising delimiting end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section.
  - 39. A method according to claim 38, further comprising processing the audio section using a speech recognizer.
  - 40. A method according to claim 38, further comprising segmenting the audio stream into the windows.
  - 41. A method according to claim 40, further comprising:
    - digitizing the audio stream to obtain a digitized audio stream; and
      
      dividing the digitized audio stream into digitized blocks;
      
      wherein the step of dividing is performed prior to the step of segmenting and the step of segmenting comprises a step of segmenting the digitized blocks.
  - 42. A method according to claim 41, wherein the windows are overlapping and the step of segmenting the digitized blocks comprises segmenting the digitized blocks into the overlapping windows.
  - 43. A method according to claim 40, wherein the windows are overlapping and the step of segmenting comprises segmenting the audio stream into the overlapping windows.
  - 44. A method according to claim 43, wherein the first limit corresponds to a time period between 0.7 and 2.5 seconds.
  - 45. A method according to claim 43, wherein said step of analyzing comprises observing energy content of sound in said each window.
  - 46. A method according to claim 45, wherein said step of observing energy content comprises comparing broadband energy content of the sound in said each window to a first sound energy threshold.
  - 47. A method according to claim 45, wherein said step of observing energy content comprises comparing band-limited energy content of the sound in said each window to a first sound energy threshold.
  - 48. A method according to claim 43, wherein said step of analyzing comprises observing zero crossings of the sound in said each window.
  - 49. A method according to claim 48, wherein said step of observing comprises determining zero-crossing rate of the sound in said each window.
  - 50. A method according to claim 48, wherein said step of observing comprises determining number of zero crossings of the sound in said each window.
  - 51. A method according to claim 48, wherein said step of analyzing further comprises observing energy content of the sound in said each window.
  - 52. A method according to claim 48, wherein said step of analyzing further comprises comparing band-limited energy content of the sound in said each block to a first sound energy threshold.
  - 53. A method according to claim 43, wherein said step of weighting comprises weighting the silence counter at about two times rate of weighting the noise counter.
  - 54. A method according to claim 38, wherein:
    - the audio stream comprises sound of a voice mail message; and
      
      said step of receiving comprises receiving the audio stream in digitized blocks from a computer telephony board.
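
The effect of the weighting in claim 53 (silence weighted at about two times the noise weighting) can be seen with a small worked calculation. The helper name `windows_until_limit`, the absolute weights, and the second limit of 30 are hypothetical; only the roughly 2:1 ratio comes from the claim.

```python
import math


def windows_until_limit(second_limit: float, weight: float) -> int:
    """Uninterrupted windows of one non-voice class needed for the weighted
    result to reach the second limit."""
    if weight <= 0:
        raise ValueError("weight must be positive")
    return math.ceil(second_limit / weight)


NOISE_WEIGHT = 1.0                      # assumed value
SILENCE_WEIGHT = 2.0 * NOISE_WEIGHT     # 2:1 ratio, per claim 53
SECOND_LIMIT = 30.0                     # assumed value

silence_only = windows_until_limit(SECOND_LIMIT, SILENCE_WEIGHT)
noise_only = windows_until_limit(SECOND_LIMIT, NOISE_WEIGHT)
```

Under these assumed numbers a run of pure silence ends the stream after 15 windows, while a run of pure noise takes 30, so a noisy line is given twice as long before end-of-speech is declared.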

55. A method of identifying end-of-speech within an audio stream, comprising:
- step for analyzing each window of the audio stream in a speech discriminator;
  
  step for assigning a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, and one or more classification labels corresponding to absence of speech in said each window;
  
  incrementing a speech counter when said each window is assigned the first classification label;
  
  incrementing a non-voice counter when said each window is assigned a classification label corresponding to absence of speech;
  
  step for determining when the speech counter exceeds a first limit;
  
  clearing the speech counter and the non-voice counter when the speech counter exceeds the first limit;
  
  step for determining when the non-voice counter reaches a second limit; and
  
  step for identifying end-of-speech within the audio stream when the non-voice counter reaches the second limit.
- Dependent claims (56):
  - 56. A method according to claim 55, further comprising delimiting end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section.

57. A method of identifying end-of-speech within an audio stream, comprising:
- step for analyzing each window of the audio stream in a speech discriminator;
  
  step for assigning a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, a second classification label corresponding to silence within said each window, and a third classification label corresponding to noise in said each window;
  
  incrementing a speech counter when said each window is assigned the first classification label;
  
  incrementing a silence counter when said each window is assigned the second classification label;
  
  incrementing a noise counter when said each window is assigned the third classification label;
  
  step for determining when the speech counter exceeds a first limit;
  
  clearing the speech counter, the silence counter, and the noise counter when the speech counter exceeds the first limit;
  
  step for weighting at least one of the silence counter and the noise counter to obtain weighted silence and noise values;
  
  step for combining the weighted silence and noise values in a result;
  
  step for comparing the result to a second limit; and
  
  step for identifying end-of-speech within the audio stream when the result reaches the second limit.
- Dependent claims (58):
  - 58. A method according to claim 57, further comprising delimiting end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section.

59. Apparatus for processing an audio stream, comprising:
- a memory storing program code; and
  
  a digital processor under control of the program code;
  
  wherein the program code comprises;
  
  instructions to cause the processor to receive the audio stream in digitized blocks;
  
  instructions to segment the digitized blocks into windows;
  
  instructions to cause the processor to analyze each window in a speech discriminator;
  
  instructions to cause the processor to assign a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, and one or more classification labels corresponding to absence of speech in said each window;
  
  instructions to cause the processor to increment a speech counter when said each window is assigned the first classification label;
  
  instructions to cause the processor to increment a non-voice counter when said each window is assigned a classification label corresponding to absence of speech;
  
  instructions to cause the processor to clear the speech counter and the non-voice counter when the speech counter exceeds a first limit; and
  
  instructions to cause the processor to identify end-of-speech within the audio stream when the non-voice counter reaches a second limit.
- Dependent claims (60 to 63):
  - 60. Apparatus according to claim 59, further comprising a mass storage device, wherein:
    - the code further comprises instructions to cause the processor to record the audio stream on the mass storage device, and the code further comprises instructions to cause the processor to terminate recording of the audio stream when end-of-speech is identified.
  - 61. Apparatus according to claim 60, further comprising a computer telephony subsystem capable of providing the digitized blocks to the processor.
  - 62. Apparatus according to claim 59, wherein the program code further comprises instructions to cause the processor to terminate processing of the audio stream when end-of-speech is identified.
  - 63. Apparatus according to claim 59, wherein the program code further comprises instructions to cause the processor to delimit end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section, and to process the audio section using a speech recognizer.
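
The apparatus of claim 59 receives the audio stream in digitized blocks and segments those blocks into windows. A sketch of that step, assuming block boundaries need not line up with window boundaries, is a generator that buffers incoming samples and emits a window whenever enough have accumulated. The function name `windows_from_blocks` and its parameters are hypothetical.

```python
from typing import Iterable, Iterator, List


def windows_from_blocks(blocks: Iterable[List[int]],
                        window_size: int, hop: int) -> Iterator[List[int]]:
    """Yield fixed-size windows from a stream of digitized blocks of any length.

    Consecutive windows start `hop` samples apart, so they overlap by
    `window_size - hop` samples.
    """
    if not 0 < hop <= window_size:
        raise ValueError("hop must be in the range 1..window_size")
    buffer: List[int] = []
    for block in blocks:
        buffer.extend(block)
        # Emit every complete window currently available in the buffer.
        while len(buffer) >= window_size:
            yield buffer[:window_size]
            del buffer[:hop]   # keep the overlapping tail for the next window
```

Because the generator is lazy, a caller that identifies end-of-speech can simply stop iterating, which corresponds to terminating recording or processing as in claims 60 and 62.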

64. Apparatus for processing an audio stream, comprising:
- a memory storing program code; and
  
  a digital processor under control of the program code;
  
  wherein the program code comprises;
  
  instructions to cause the processor to receive the audio stream in digitized blocks;
  
  instructions to segment the digitized blocks into windows;
  
  instructions to cause the processor to analyze each window in a speech discriminator;
  
  instructions to cause the processor to assign a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, a second classification label corresponding to silence in said each window, and a third classification label corresponding to noise in said each window;
  
  instructions to cause the processor to increment a speech counter when said each window is assigned the first classification label;
  
  instructions to cause the processor to increment a silence counter when said each window is assigned the second classification label;
  
  instructions to cause the processor to increment a noise counter when said each window is assigned the third classification label;
  
  instructions to cause the processor to clear the speech counter, the silence counter, and the noise counter when the speech counter exceeds a first limit;
  
  instructions to cause the processor to weight at least one of the silence counter and the noise counter to obtain weighted silence and noise values;
  
  instructions to cause the processor to combine the weighted silence and noise values in a result;
  
  instructions to cause the processor to compare the result to a second limit; and
  
  instructions to cause the processor to identify end-of-speech within the audio stream when the result reaches the second limit.
- Dependent claims (65 to 68):
  - 65. Apparatus according to claim 64, further comprising a mass storage device, wherein:
    - the code further comprises instructions to cause the processor to record the audio stream on the mass storage device, and the code further comprises instructions to cause the processor to terminate recording of the audio stream when end-of-speech is identified.
  - 66. Apparatus according to claim 64, wherein the code further comprises instructions to cause the processor to terminate processing of the audio stream when end-of-speech is identified.
  - 67. Apparatus according to claim 64, further comprising a computer telephony subsystem capable of sending the digitized blocks to the processor.
  - 68. Apparatus according to claim 64, wherein the program code further comprises instructions to cause the processor to delimit end of an audio section within the audio stream when end-of-speech is identified to obtain a delimited audio section, and to process the digitized audio section using a speech recognizer.

69. An article of manufacture comprising a machine-readable storage medium with instruction code stored in the medium, said instruction code, when executed by a data processing apparatus comprising a processor receiving an audio stream in digitized blocks, causes the processor to segment the digitized blocks into windows;
- analyze each window in a speech discriminator;
  
  assign a classification to said each window based on speech discriminator output corresponding to said each window, the classification being selected from a classification set comprising a first classification label corresponding to presence of speech within said each window, and one or more classification labels corresponding to absence of speech in said each window;
  
  increment a speech counter when said each window is assigned the first classification label;
  
  increment a non-voice counter when said each window is assigned a classification label corresponding to absence of speech;
  
  clear the speech counter and the non-voice counter when the speech counter exceeds a first limit; and
  
  identify end-of-speech within the audio stream when the non-voice counter reaches a second limit.

Current Assignee: Applied Voice & Speech Tech Incorporated
Original Assignee: Applied Voice & Speech Technologies Incorporated (Open Text Corporation)
Inventors: Gierach, Karl D.

Granted Patent: US 7,756,709 B2
US Class Current: 704/208
CPC Class Codes: G10L 25/87 Detection of discrete point...
