Speech Signal Enhancement Using Visual Information

US 20140337016A1
Filed: 10/17/2011
Published: 11/13/2014
Est. Priority Date: 10/17/2011
Status: Active Grant

7 Assignments

0 Petitions

Abstract

Visual information is used to alter or set an operating parameter of an audio signal processor, other than a beamformer. A digital camera captures visual information about a scene that includes a human speaker and/or a listener. The visual information is analyzed to ascertain information about acoustics of a room. A distance between the speaker and a microphone may be estimated, and this distance estimate may be used to adjust an overall gain of the system. Distances among, and locations of, the speaker, the listener, the microphone, a loudspeaker and/or a sound-reflecting surface may be estimated. These estimates may be used to estimate reverberations within the room and adjust aggressiveness of an anti-reverberation filter, based on an estimated ratio of direct to indirect (reverberated) sound energy expected to reach the microphone. In addition, orientation of the speaker or the listener, relative to the microphone or the loudspeaker, can also be estimated, and this estimate may be used to adjust frequency-dependent filter weights to compensate for uneven frequency propagation of acoustic signals from a mouth, or to a human ear, about a human head.
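The ratio of direct to reverberated sound energy mentioned in the abstract can be illustrated with textbook room acoustics. The sketch below is not taken from the patent: it assumes the image analysis has already produced a room volume, a total absorption area, and a speaker-to-microphone distance, and it uses the Sabine formula and the standard critical-distance approximation for an omnidirectional source. All function names are hypothetical.

```python
import math


def sabine_rt60(volume_m3: float, absorption_m2: float) -> float:
    """Sabine estimate of reverberation time in seconds: RT60 = 0.161 * V / A."""
    return 0.161 * volume_m3 / absorption_m2


def critical_distance(volume_m3: float, rt60_s: float) -> float:
    """Distance (m) at which direct and reverberant energy are equal,
    for an omnidirectional source: d_c ~= 0.057 * sqrt(V / RT60)."""
    return 0.057 * math.sqrt(volume_m3 / rt60_s)


def direct_to_reverberant_db(distance_m: float, volume_m3: float, rt60_s: float) -> float:
    """Direct-to-reverberant energy ratio in dB at the given distance.

    Direct energy falls with 1/d^2 while the reverberant field is treated
    as uniform, so the ratio is (d_c / d)^2, i.e. 20*log10(d_c / d) dB.
    """
    d_c = critical_distance(volume_m3, rt60_s)
    return 20.0 * math.log10(d_c / distance_m)


if __name__ == "__main__":
    rt60 = sabine_rt60(volume_m3=100.0, absorption_m2=20.0)
    for d in (0.5, 1.0, 2.0):
        drr = direct_to_reverberant_db(d, 100.0, rt60)
        print(f"distance {d:.1f} m: DRR {drr:+.1f} dB")
```

A tuner could compare the resulting ratio against a threshold to decide how aggressively to apply an anti-reverberation filter; the threshold and the mapping are design choices this sketch leaves open.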

109 Citations

53 Claims

1-33. (canceled)

34. A method, comprising:
- providing a microphone to detect speech uttered by a speaker and generate audio signals from the speech received by the microphone;
- coupling an audio signal processor to the microphone to receive the audio signals and process the received audio signals;
- providing a camera that can be at least partially orientated toward the microphone to generate a scene image;
- coupling an image analyzer to the camera to automatically analyze the scene image for estimating a distance between the speaker and the microphone; and
- coupling a tuner to the image analyzer and to the audio signal processor to automatically alter an operating parameter of the audio signal processor, based at least in part on the estimated distance between the speaker and the microphone.
- Dependent claims (35-51):
  - 35. The method according to claim 34, wherein the operating parameter comprises gain and the tuner to cause the gain to be set based on the estimated distance between the speaker and the microphone, such that a larger distance produces a larger gain.
  - 36. The method according to claim 34, wherein the audio signal processor comprises an anti-reverberation filter and the tuner to reduce aggressiveness of the anti-reverberation filter when the estimated distance is greater than a calculated value.
  - 37. The method according to claim 34, further including analyzing the scene image to estimate an orientation of the speaker, relative to the microphone; and altering the operating parameter of the audio signal processor, based at least in part on the estimated orientation of the speaker.
  - 38. The method according to claim 37, wherein:
    - the operating parameter comprises a plurality of gains, wherein each of the plurality of gains is associated with a range of frequencies; and
    - causing at least one of the plurality of gains, associated with a high range of frequencies (“high-frequency gain”), to be set, relative to another at least one of the plurality of gains, associated with a low range of frequencies (“low-frequency gain”), based on the estimated orientation of the speaker, such that when the speaker is oriented away from the microphone, the high-frequency gain is set higher, relative to the low-frequency gain, than when the speaker is oriented toward the microphone.
  - 39. The method according to claim 34, further including:
    - detecting a sound-reflecting surface disposed proximate the microphone and the speaker and analyzing the scene image, so as to estimate a ratio of:
      - sound energy reaching the microphone directly from the speaker and
      - sound energy indirectly reaching the microphone from the speaker after being reflected from the sound-reflecting surface; and
    - altering the operating parameter of the audio signal processor, based at least in part on the estimated ratio.
  - 40. The method according to claim 39, wherein the audio signal processor comprises an anti-reverberation filter and the tuner to reduce aggressiveness of the anti-reverberation filter when the estimated ratio is less than a predetermined value.
  - 41. The method according to claim 34, further including:
    - detecting a sound-reflecting surface disposed proximate the microphone and the speaker and analyzing the scene image, so as to estimate at least one of:
      - a reverberation time influenced by the sound-reflecting surface and
      - a reverberation distance influenced by the sound-reflecting surface; and
    - altering the operating parameter of the audio signal processor, based at least in part on the at least one of the estimated reverberation time and the estimated reverberation distance.
  - 42. The method according to claim 34, further including:
    - generating a processed audio signal for amplification and thence for driving a loudspeaker;
    - detecting a listener proximate the loudspeaker; and
    - analyzing the scene image, so as to estimate a difference between arrival times at the detected listener of:
      - a direct acoustic signal from the speaker and
      - a corresponding indirect acoustic signal from the speaker, via the microphone, the audio signal processor and the loudspeaker; and
    - altering the operating parameter of the audio signal processor, based at least in part on the estimated difference in arrival times.
  - 43. The method according to claim 42, further including altering the operating parameter of the audio signal processor, so as to reduce volume of the loudspeaker, if the estimated difference in arrival times is greater than a predetermined value.
  - 44. The method according to claim 42, further including altering the operating parameter of the audio signal processor, so as to reduce processing by the audio signal processor, if the estimated difference in arrival times is greater than a predetermined value.
  - 45. The method according to claim 34, further including:
    - estimating at least one attribute of a room, within which the microphone and the speaker are disposed; and
    - analyzing the scene image, so as to estimate a reverberation time influenced by the at least one attribute; and
    - altering the operating parameter of the audio signal processor, based at least in part on the estimated reverberation time.
  - 46. The method according to claim 45, wherein the at least one attribute of the room comprises an estimate of at least one of: size of the room and amount of sound-absorbing material within the room.
  - 47. The method according to claim 34, wherein:
    - the microphone comprises a plurality of microphones, each of the plurality of microphones being associated with a respective potential speaker station;
    - and further including:
      - ascertaining absence of a respective speaker at each of the potential speaker stations; and
      - causing the audio signal processor to ignore audio signals from each microphone that is associated with a potential speaker station having an absent speaker.
  - 48. The method according to claim 34, further including detecting mouth movement by a speaker and estimating a distance between the speaker and the microphone, based at least in part on the detected mouth movement.
  - 49. The method according to claim 34, further including:
    - detecting a plurality of potential speakers;
    - detecting mouth movement by at least one of the plurality of speakers; and
    - causing the audio signal processor to preferentially process audio signals associated with the at least one of the plurality of speakers having detected mouth movement.
  - 50. The method according to claim 34, further including:
    - detecting an utterance based at least in part on sound signal energy exceeding a threshold value; and
    - adjusting the threshold value, based on the estimated distance between the speaker and the microphone.
  - 51. The method according to claim 34, further including altering the operating parameter of the audio signal processor before the audio signal processor receives the audio signals from the microphone.
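
As an illustration of how a tuner might act on claims 35, 38 and 50, the following sketch maps an estimated distance to a gain, maps an estimated head orientation to a high-frequency boost, and scales an utterance-detection threshold with distance. The claims fix only the direction of the gain and boost changes; the inverse-distance law, the cosine mapping, the direction of the threshold adjustment, and every constant and function name below are assumptions made for this example.

```python
import math

REF_DISTANCE_M = 0.5  # hypothetical distance at which no extra gain is applied
MAX_GAIN_DB = 18.0    # hypothetical ceiling to avoid amplifying noise without bound


def gain_db_for_distance(distance_m: float) -> float:
    """Larger distance produces larger gain (claim 35).

    Assumes free-field 1/d pressure decay: +6 dB per doubling of distance,
    clamped to the range [0, MAX_GAIN_DB].
    """
    raw = 20.0 * math.log10(max(distance_m, 1e-3) / REF_DISTANCE_M)
    return min(max(raw, 0.0), MAX_GAIN_DB)


def high_frequency_boost_db(yaw_deg: float, max_boost_db: float = 6.0) -> float:
    """High-frequency gain relative to low-frequency gain (claim 38).

    yaw_deg is 0 when the speaker faces the microphone and 180 when facing
    away. The smooth cosine mapping is an assumption, not the patent's.
    """
    yaw = min(abs(yaw_deg), 180.0)
    return max_boost_db * (1.0 - math.cos(math.radians(yaw))) / 2.0


def energy_threshold_for_distance(base_threshold: float, distance_m: float) -> float:
    """Utterance-detection energy threshold adjusted by distance (claim 50).

    Assumes received energy falls with 1/d^2, so the threshold is lowered
    by the same factor beyond the reference distance.
    """
    ratio = REF_DISTANCE_M / max(distance_m, REF_DISTANCE_M)
    return base_threshold * ratio ** 2


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0, 4.0):
        print(d, round(gain_db_for_distance(d), 1), energy_threshold_for_distance(1.0, d))
```

Because these values depend only on the image analysis, they can be computed before any audio arrives, which is the situation claim 51 describes.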

52. An audio system for use by a plurality of speakers, the system comprising:
- a microphone configured to detect speech uttered by at least one of the plurality of speakers and generate corresponding audio signals;
- an audio signal processor coupled to the microphone to receive the audio signals and configured to process the received audio signals;
- a camera orientable at least partially toward the microphone and configured to generate a scene image;
- an image analyzer coupled to the camera and configured to automatically analyze the scene image, so as to detect a gesture by at least one of the speakers; and
- a tuner coupled to the image analyzer and to the audio signal processor and configured to automatically alter an operating parameter of the audio signal processor, and process audio signals corresponding to the at least one of the speakers who gestured.
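
The gesture-driven selection in claim 52 can be pictured as a tuner that converts per-speaker gesture detections into per-speaker processing weights. This is a minimal sketch under stated assumptions: the claim does not say how audio signals are associated with individual speakers, so the per-speaker weight dictionary, the `SpeakerObservation` type, and the fallback when nobody gestures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SpeakerObservation:
    """Result of image analysis for one person in the scene (hypothetical type)."""
    speaker_id: str
    gesture_detected: bool


def channel_weights(
    observations: Iterable[SpeakerObservation],
    passive_weight: float = 0.0,
) -> Dict[str, float]:
    """Return a processing weight per speaker.

    Speakers who gestured get weight 1.0 and everyone else gets
    passive_weight. If nobody gestured, all speakers keep weight 1.0;
    that fallback is an assumption for this sketch.
    """
    obs = list(observations)
    if not any(o.gesture_detected for o in obs):
        return {o.speaker_id: 1.0 for o in obs}
    return {
        o.speaker_id: 1.0 if o.gesture_detected else passive_weight
        for o in obs
    }


if __name__ == "__main__":
    scene = [
        SpeakerObservation("alice", gesture_detected=True),
        SpeakerObservation("bob", gesture_detected=False),
    ]
    print(channel_weights(scene))
```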

53. A tangible non-transitory computer-readable storage medium with an executable program stored thereon for automatically processing speech uttered by a speaker into a microphone, wherein the program enables a machine to:
- detect the speech uttered by the speaker and generate corresponding audio signals;
- process the audio signals by an audio signal processor, other than a beamformer;
- generate a scene image with a camera;
- analyze the scene image, so as to estimate a distance between the speaker and the microphone; and
- automatically alter an operating parameter of the audio signal processor, based at least in part on the estimated distance between the speaker and the microphone.
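
The claims do not say how the scene image yields a speaker-to-microphone distance. One common approach, shown here purely as an assumption, is a pinhole-camera estimate: the apparent width of a detected face gives its depth, the face's pixel position is back-projected to a 3D point, and the distance to a microphone at a known position in the camera frame follows directly. The face-width constant, the function names, and the availability of a face bounding box and camera intrinsics are all hypothetical.

```python
import math
from typing import Tuple

ASSUMED_FACE_WIDTH_M = 0.15  # assumption: typical adult face width in metres

Point3 = Tuple[float, float, float]


def face_position_camera_frame(
    bbox_center_px: Tuple[float, float],
    bbox_width_px: float,
    focal_px: float,
    principal_point_px: Tuple[float, float],
) -> Point3:
    """Back-project a detected face to a 3D point in the camera frame.

    Pinhole model: depth = focal * real_width / pixel_width, then
    x = (u - cx) * depth / focal and y = (v - cy) * depth / focal.
    """
    depth = focal_px * ASSUMED_FACE_WIDTH_M / bbox_width_px
    u, v = bbox_center_px
    cx, cy = principal_point_px
    x = (u - cx) * depth / focal_px
    y = (v - cy) * depth / focal_px
    return (x, y, depth)


def speaker_to_microphone_distance(face_xyz: Point3, mic_xyz: Point3) -> float:
    """Euclidean distance between the face and a microphone whose position
    in the camera frame is known (for example, from calibration)."""
    return math.dist(face_xyz, mic_xyz)


if __name__ == "__main__":
    face = face_position_camera_frame((840.0, 360.0), 100.0, 1000.0, (640.0, 360.0))
    print(face, speaker_to_microphone_distance(face, (0.0, 0.0, 0.5)))
```

The accuracy of such an estimate is limited by the face-width assumption and by head rotation, which is one reason the result is best treated as a coarse input to parameter tuning rather than a precise measurement.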

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Tobias Herbig, Tobias Wolff, Markus Buck

Granted Patent: US 9,293,151 B2
US Class Current: 704/201

CPC Class Codes:
- G06T 2207/30196   Human being; Person
- G06T 7/73   using feature-based methods
- G10L 15/20   Speech recognition techniqu...
- G10L 17/00   Speaker identification or v...
- G10L 2021/02082   the noise being echo, rever...
- G10L 25/27   characterised by the analys...
- G10L 25/78   Detection of presence or ab...
- H04M 3/568   audio processing specific t...
- H04N 7/15   Conference systems
