Primary transmission site switching in a multipoint videoconference environment based on human voice

US 5,991,277 A
Filed: 04/09/1998
Issued: 11/23/1999
Est. Priority Date: 10/20/1995
Status: Expired due to Term

Abstract

A method for determining a talk/listen state using voice detection includes receiving an audio sample and detecting whether the audio sample includes voiced sound. The audio sample represents sound measured during a sample time interval. The method further includes deriving an audio level from the audio sample and comparing the audio level to a threshold level. The audio level represents an average power level of the audio sample. The method further includes determining the talk/listen state depending on a relation of the audio level to the threshold level and depending on whether the audio sample includes voiced sound.
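The decision summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the names `State`, `average_power` and `talk_listen_state` are hypothetical, the average power is assumed to be the mean of the squared sample values, and a level exactly equal to the threshold is assumed to count as listening, since the text does not cover that case.

```python
from enum import Enum


class State(Enum):
    LISTEN = 0
    TALK = 1


def average_power(samples):
    """Mean of the squared sample values over one sample time interval (assumed definition)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def talk_listen_state(samples, threshold, voiced):
    """Return TALK only when the level exceeds the threshold and the sample is voiced."""
    level = average_power(samples)
    if level > threshold and voiced:
        return State.TALK
    # Below the threshold, or no voiced sound detected: listening.
    return State.LISTEN
```

The `voiced` flag is passed in by the caller, which keeps the voice detector separate from the level comparison.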

37 Claims

1. A method for determining a talk/listen state using voice detection, comprising:
- receiving an audio sample representing sound measured during a sample time interval;
  
  detecting whether the audio sample includes voiced sound;
  
  deriving an audio level from the audio sample, the audio level representing an average power level of the audio sample;
  
  comparing the audio level to a threshold level;
  
  determining the talk/listen state depending on a relation of the audio level to the threshold level and depending on whether the audio sample includes voiced sound.
  - 2. A method as recited in claim 1 wherein the determining the talk/listen state comprises:
    - determining the talk/listen state to be a listening state if audio level is below the threshold level;
      
      determining the talk/listen state to be the listening state if the audio sample does not include voiced sound; and
      
      determining the talk/listen state to be a talking state if the audio level is above the threshold level and the audio sample includes voiced sound.
  - 3. A method as recited in claim 2, wherein each of the determining the talk/listen state to be a listening state includes setting a talk/listen indication to a first value indicating listen;
    - and the determining the talk/listen state to be a talking state includes setting the talk/listen indication to a second value indicating talk.
  - 4. A method as recited in claim 1 wherein the detecting whether the audio sample includes voiced sound comprises performing a cepstral analysis on the audio sample.
  - 5. A method as recited in claim 4 wherein the performing the cepstral analysis on the audio sample comprises:
    - taking the Fourier transform of the audio sample to generate a first intermediate result;
      
      taking the logarithm of the first intermediate result to generate a second intermediate result; and
      
      taking the inverse Fourier transform of the second intermediate result to generate a cepstrum of the audio signal.
  - 6. A method as recited in claim 5 wherein the performing the cepstral analysis on the audio sample further comprises:
    - determining whether the cepstrum includes peak values greater than a cepstrum threshold value.
  - 7. A method as recited in claim 4 wherein the performing the cepstral analysis on the audio sample comprises:
    - multiplying the audio sample by a data window to generate a first intermediate result;
      
      taking the Fourier transform of the first intermediate result to generate a second intermediate result;
      
      taking the logarithm of the second intermediate result to generate a third intermediate result; and
      
      taking the inverse Fourier transform of the third intermediate result to generate a cepstrum of the audio signal.
  - 8. A method as recited in claim 7 wherein the data window is a Hamming window.
  - 9. A method as recited in claim 1, the method further comprising:
    - setting a voice indication to a first value if the audio sample includes voiced sound; and
      
      setting the voice indication to a second value if the audio sample does not include voiced sound.
  - 10. A method as recited in claim 1 wherein the threshold level is a dynamic threshold level;
    - and the method further includes setting the threshold level by processing the audio level to set and maintain a dynamic level and using the dynamic level to determine a value for the threshold level;
      
      repeating the steps of the method for each audio sample in a sequential stream of received audio samples such that the threshold level is dynamically maintained and used to determine the talk/listen state.
  - 11. A method as recited in claim 10 wherein the setting the threshold level comprises processing the audio level to set and maintain a background level and using the background level to determine the value for the threshold level.
  - 12. A method as recited in claim 11 wherein the processing the audio level to set and maintain the background level comprises:
    - weighting the background level with the audio level at a first weight if the audio level is greater than the threshold level and the audio sample includes voiced sound; and
      
      weighting the background level with the audio level at a second weight if the audio level is less than the threshold level.
  - 13. A method as recited in claim 12 wherein the first weight is 63:1; and the second weight is 511:1.
  - 14. A method as recited in claim 10 wherein the setting the threshold level comprises processing the audio level to set and maintain a foreground level and using the foreground level to determine the value for the threshold level.
  - 15. A method as recited in claim 14 wherein the processing the audio level to set and maintain the foreground level comprises:
    - weighting the foreground level with the audio level at a first weight if the audio level is greater than the threshold level and the audio sample includes voiced sound; and
      
      weighting the foreground level with the audio level at a second weight if the audio level is less than the threshold level.
  - 16. A method as recited in claim 15 wherein the first weight is 2047:1; and the second weight is 127:1.
  - 17. A method as recited in claim 14, wherein the setting the threshold level further comprises processing the audio level to set and maintain a background level and using the background level and the foreground level to determine the value for the threshold level.
  - 18. A method as recited in claim 17, wherein the setting the threshold level further comprises processing the audio level to set and maintain a long term background level and using the background level, the foreground level and the long term background level to determine a value for the dynamic threshold level.
  - 19. A method as recited in claim 18, wherein the setting the threshold level comprises setting the threshold level as a weighted sum of the background level, the foreground level and the long term background level.
  - 20. A method as recited in claim 19, wherein the threshold level comprises a weighted sum having a ratio of 1:2:4 of the long term background level, the background level and the foreground level.
  - 21. A method as recited in claim 10, wherein the setting the threshold level comprises processing the audio level to set and maintain a long term background level and using the long term background level to determine the value for the threshold level.
  - 22. A method as recited in claim 10, the method further comprising:
    - setting a long term background level;
      
      setting a foreground level by weighting the foreground level with the audio level;
      
      setting a background level by weighting the background level with the audio level;
      
      setting the threshold level equal to a weighted sum of the long term background level, the foreground level and the background level.
  - 23. A method as recited in claim 1 wherein the method is implemented in a multipoint control unit for a multipoint conference;
    - and the receiving an audio sample comprises receiving an audio sample from a conference site.
  - 24. A method as recited in claim 1 wherein the receiving an audio sample comprises receiving an audio data packet that is part of a data stream including audio and video information, the audio data packet representing sound measured over a time interval.
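The cepstral analysis recited in claims 4 to 8 (data window, Fourier transform, logarithm, inverse Fourier transform, peak test) can be sketched in plain Python as follows. Several details are assumptions rather than claim language: the logarithm is taken of the spectrum magnitude (the real cepstrum), the peak search is limited to quefrencies matching a 60 to 400 Hz pitch range, and `cep_threshold` is left to the caller. The function names are hypothetical, and the naive transform is only meant for short frames of more than one sample.

```python
import cmath
import math


def hamming(n):
    """Hamming data window of length n (claim 8)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for a short frame."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame):
    """Window, Fourier transform, log magnitude, inverse transform (claim 7 order)."""
    window = hamming(len(frame))
    spectrum = dft([s * w for s, w in zip(frame, window)])
    # Assumption: log of the magnitude, with a small floor to avoid log(0).
    log_mag = [math.log(abs(c) + 1e-12) for c in spectrum]
    return [c.real for c in dft(log_mag, inverse=True)]


def is_voiced(frame, sample_rate, cep_threshold, f_lo=60.0, f_hi=400.0):
    """Voiced if the cepstrum has a peak above cep_threshold in the assumed pitch range."""
    cep = real_cepstrum(frame)
    q_lo = int(sample_rate / f_hi)
    q_hi = min(int(sample_rate / f_lo), len(frame) // 2)
    return max(cep[q_lo:q_hi + 1]) > cep_threshold
```

A periodic frame, such as an impulse train with a 40-sample period, concentrates cepstral energy at quefrency 40, while a single click yields an essentially flat cepstrum away from index 0.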

25. An apparatus comprising:
- a voice detection unit detecting whether an audio signal includes voiced sound responsive to receiving the audio signal;
  
  a talk/listen determination unit coupled to the voice detection unit, the talk/listen determination unit deriving an average audio power level of the audio signal and deriving a dynamic threshold level based on the average audio power level and past average audio power levels responsive to receiving the audio signal, the talk/listen determination unit determining a talk/listen state depending on a comparison of the average audio power level and the dynamic threshold level and on whether the voice detection unit detects voiced sound.
  - 26. An apparatus as recited in claim 25 wherein the talk/listen determination unit is operable to update the dynamic threshold level as a first weighted average of the average audio power level and the past average audio power levels when the average audio power level is below the dynamic threshold level;
    - and the talk/listen determination unit is operable to update the dynamic threshold level as a second weighted average of the average audio power level and the past average audio power levels when the average audio power level is above the dynamic threshold and the audio signal includes voiced sound.
  - 27. An apparatus as recited in claim 26 wherein the talk/listen state indicates talking when the average audio power level is greater than the dynamic threshold level and the audio signal includes voiced sound.
  - 28. An apparatus as recited in claim 27 wherein the talk/listen state indicates listening when the average audio power level is less than the dynamic threshold level;
    - and the talk/listen state indicates listening when the average audio power level is greater than the dynamic threshold level and the audio signal includes voiced sound.
  - 29. An apparatus as recited in claim 25, the apparatus further comprising:
    - a first interface unit for receiving an audio signal representing sound measured during a sample time interval;
      
      an audio memory coupled to receive the audio signal from the first interface unit, the audio memory storing the audio signal responsive to receiving the audio signal, the audio memory coupled to provide the audio signal to the voice detection unit; and
      
      a second interface unit coupled to receive the talk/listen state from the talk/listen determination unit.
  - 30. An apparatus as recited in claim 25 wherein the voice detection unit and the talk/listen determination unit are comprised within a conference unit for receiving and transmitting audio and video information from and to a conference site.
  - 31. An apparatus as recited in claim 25, the apparatus further comprising:
    - a controller unit; and
      
      a conference unit, the conference unit including the voice detection unit; and
      
      the talk/listen determination unit, at least one other conference unit, each of the at least one other conference unit including at least a corresponding voice detection unit and a corresponding talk/listen determination unit.
  - 32. An apparatus as recited in claim 31 wherein each of the conference units transmits and receives audio and video information between a corresponding one of a plurality of conference sites and the controller unit within a multipoint conference system using human voice detection and the dynamic threshold level to minimize switching based on unvoiced sound.
  - 33. An apparatus as recited in claim 32, the apparatus further comprising:
    - a multipoint control unit, the multipoint control unit including the conference units, each of the conference units being coupled to receive and process each audio sample in a sequential stream of audio samples received from a corresponding conference site, each of the conference units being coupled to provide a respective one of a plurality of corresponding notification signals indicating a talk/listen state of a corresponding conference site; and
      
      the controller unit, the controller unit being coupled to the conference units to receive each of the plurality of notification signals, the control unit using the notification signals to select a primary transmission site from among conference sites corresponding to the conference units.
  - 34. An apparatus as recited in claim 33 wherein the multipoint control unit is coupled within a multipoint conference system, the multipoint conference system comprising:
    - a plurality of sets of conference equipment, each set of conference equipment for being located at a conference site to transmit a sequential stream of audio samples, each audio sample from a corresponding conference site representing sound measured from the corresponding conference site for a sampled interval of time;
      
      the multipoint control unit coupled to the plurality of sets of conference equipment via respective conference units, the multipoint control unit receiving each sequential stream of audio samples, the multipoint conference system operable to determine whether each audio sample received from each conference equipment includes voiced sound to determine a talk/listen state of each conference site and to control voice-activated switching of video between the conference sites using the determined talk/listen states of the conference sites.
  - 35. An apparatus as recited in claim 34 wherein the multipoint conference system is further operable to set and maintain a dynamic level associated with each site and to set a dynamic threshold level associated with each conference site by processing each dynamic level and audio sample in the sequential stream received from each set of equipment;
    - and compare an audio level of each audio sample to the dynamic threshold level to determine the talk/listen state of each conference site.
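The dynamic threshold maintained by the talk/listen determination unit (claims 25 and 26) could be modeled as below. This sketch borrows the weights from method claims 13, 16 and 20, and several points are assumptions: a ratio "w:1" is read as w parts previous level to 1 part new audio level, the 1:2:4 weighted sum is normalized by 7, the long term background level is held fixed, and nothing is updated for a loud but unvoiced sample. The class name `DynamicThreshold` and its methods are hypothetical.

```python
class DynamicThreshold:
    """Tracks background and foreground levels and derives a threshold from them."""

    def __init__(self, long_term_background, background, foreground):
        self.long_term_background = long_term_background
        self.background = background
        self.foreground = foreground
        self.threshold = self._weighted_sum()

    @staticmethod
    def _blend(level, audio_level, weight):
        # Assumed reading of "weight:1": weight parts old level, 1 part new level.
        return (weight * level + audio_level) / (weight + 1)

    def _weighted_sum(self):
        # 1:2:4 over long term background, background, foreground.
        # Dividing by 7 is an assumption; the claims only give the ratio.
        return (self.long_term_background
                + 2 * self.background
                + 4 * self.foreground) / 7

    def update(self, audio_level, voiced):
        """Process one audio level; return True if the state is talking."""
        talking = audio_level > self.threshold and voiced
        if talking:
            self.background = self._blend(self.background, audio_level, 63)
            self.foreground = self._blend(self.foreground, audio_level, 2047)
        elif audio_level < self.threshold:
            self.background = self._blend(self.background, audio_level, 511)
            self.foreground = self._blend(self.foreground, audio_level, 127)
        # Loud but unvoiced sound leaves both levels untouched (assumption).
        self.threshold = self._weighted_sum()
        return talking
```

Calling `update` once per received audio sample keeps the threshold current for the next comparison, in the spirit of the repeated processing described in claim 10.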

36. A voice activated switching device for selecting a primary transmission site from among a plurality of transmission devices, the voice activated switching device comprising:
- means for determining whether audio signals received from each of the transmission devices include voiced or unvoiced sound;
  
  means for repeatedly determining a dynamic threshold level for each of the transmission devices;
  
  means for comparing each of the audio signals received from each of the transmission devices to a corresponding dynamic threshold level;
  
  means for determining a talk/listen state for each transmission device based on whether each audio signal includes voiced sound and on whether a power level of each audio signal is greater than the dynamic threshold level.
  - 37. The voice activated switching device as recited in claim 36, wherein the means for determining a talk/listen state for each transmission device comprises:
    - means for indicating a listen state for each transmission device if the power level of each audio signal is below the dynamic threshold level;
      
      means for indicating a listen state for each transmission device if the corresponding audio sample does not include voiced sound; and
      
      means for indicating a talk state for each transmission device if the power level of each audio signal is above the corresponding dynamic threshold level and each audio sample includes voiced sound.
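Once each transmission device has a talk/listen state, a controller can choose the primary transmission site. The claims do not spell out how to choose among several talking sites, so the policy below is purely an assumption: the current primary site keeps the floor while it is talking, otherwise the first talking site in iteration order takes over, and nothing changes when no site is talking. The function name `select_primary_site` is hypothetical.

```python
def select_primary_site(current_primary, talk_states):
    """Pick the primary transmission site from per-site talk/listen states.

    talk_states maps a site identifier to True (talk) or False (listen).
    """
    if talk_states.get(current_primary, False):
        # Assumed policy: the current speaker keeps the floor.
        return current_primary
    for site, talking in talk_states.items():
        if talking:
            # Assumed tie-break: first talking site in iteration order.
            return site
    # Nobody is talking: no switch.
    return current_primary
```

For example, `select_primary_site("A", {"A": False, "B": True})` returns `"B"`, while a state map with no talking sites leaves `"A"` as the primary site.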

Current Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee: Vtel Corporation
Inventors: Tischler, Paul V.; Clements, Bill; Maeng, Joon
Primary Examiner(s): Pham, Chi H.
Assistant Examiner(s): Tran, Maikhanh
Application Number: US09/057,849
Time in Patent Office: 593 Days
Field of Search: 370/263, 370/210, 370/433, 370/435, 370/260, 370/264, 370/266, 370/270, 370/465, 379/157, 379/158, 379/202, 379/388-390
US Class Current: 370/263
CPC Class Codes:
H04M 2201/40   using speech recognition sp...
H04M 3/567   Multimedia conference systems
H04M 3/569   using the instant speaker's...
