Method and system for microphone array input type speech recognition using band-pass power distribution for sound source position/direction estimation

US 6,009,396 A
Filed: 03/14/1997
Issued: 12/28/1999
Est. Priority Date: 03/15/1996
Status: Expired due to Term

1 Assignment

0 Petitions

Abstract

A microphone array input type speech recognition scheme capable of realizing a high precision sound source position or direction estimation by a small amount of calculations, and thereby realizing a high precision speech recognition. A band-pass waveform, which is a waveform for each frequency bandwidth, is obtained from input signals of the microphone array, and a band-pass power of the sound source is directly obtained from the band-pass waveform. Then, the obtained band-pass power is used as the speech parameter. It is also possible to realize the sound source estimation and the band-pass power estimation at high precision while further reducing an amount of calculations, by utilizing a sound source position search processing in which a low resolution position estimation and a high resolution position estimation are combined.
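
The first step in the abstract, obtaining a band-pass waveform per frequency bandwidth from each microphone channel, can be sketched as follows. This is only an illustration: FFT masking stands in for a band-pass filter bank, and the function name `bandpass_waveforms`, the sample rate, the band edges, and the test tones are all assumptions that do not come from the patent.

```python
import numpy as np

def bandpass_waveforms(x, fs, band_edges):
    """Split a multichannel signal x (channels, samples) into band-pass waveforms.

    Returns an array of shape (bands, channels, samples). FFT masking is an
    illustrative stand-in for a filter bank, not the patent's design.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[1]
    spectrum = np.fft.rfft(x, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = []
    for lo, hi in band_edges:
        mask = (freqs >= lo) & (freqs < hi)  # keep only bins inside this band
        out.append(np.fft.irfft(spectrum * mask, n=n, axis=1))
    return np.stack(out)

# Hypothetical input: 2 identical channels, a 500 Hz tone plus a weaker 2000 Hz tone.
fs = 8000
t = np.arange(800) / fs
sig = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)
x = np.stack([sig, sig])
bands = bandpass_waveforms(x, fs, [(0, 1000), (1000, 4000)])
band_power = (bands ** 2).mean(axis=2)  # mean power, shape (bands, channels)
```

Here `band_power` is only a per-channel power; the scheme in the abstract goes further and estimates the power arriving from the sound source itself.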

117 Citations

26 Claims

1. A microphone array input type speech recognition system, comprising:
- a speech input unit for inputting speeches in a plurality of channels using a microphone array formed by a plurality of microphones;
  
  a frequency analysis unit for analyzing an input speech of each channel inputted by the speech input unit, and obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth;
  
  a sound source position search unit for calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the frequency analysis unit, synthesizing calculated band-pass power distributions for a plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution;
  
  a speech parameter extraction unit for extracting a speech parameter for speech recognition, from the band-pass power distribution for each frequency bandwidth calculated by the sound source position search unit, according to the sound source position or direction estimated by the sound source position search unit; and
  
  a speech recognition unit for obtaining a speech recognition result by matching the speech parameter extracted by the speech parameter extraction unit with a recognition dictionary.
  - 2. The system of claim 1, wherein the sound source position search unit includes:
    - a low resolution sound source position estimation unit for estimating a rough sound source position or direction, by minimizing an output power of the microphone array under constraints that responses of the microphone array for a plurality of directions or positions are to be maintained constant; and
      
      a high resolution sound source position estimation unit for estimating an accurate sound source position or direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound source position estimation unit, by minimizing the output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant, wherein the speech parameter extraction unit extracts the speech parameter for speech recognition according to the accurate sound source position or direction.
  - 3. The system of claim 1, wherein the frequency analysis unit obtains the band-pass waveforms for each channel by using a band-pass filter bank.
  - 4. The system of claim 1, wherein the sound source position search unit calculates the band-pass power distribution for each frequency bandwidth, by calculating a band-pass power for each frequency bandwidth, in each one of a plurality of assumed sound source positions or directions within a prescribed search range.
  - 5. The system of claim 1, wherein the sound source position search unit calculates the band-pass power distribution for each frequency bandwidth by using a filter function configuration having a plurality of delay line taps for each channel.
  - 6. The system of claim 1, wherein the sound source position search unit calculates the band-pass power distribution for each frequency bandwidth by using a minimum variance method for minimizing an output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant.
  - 7. The system of claim 1, wherein the speech parameter extraction unit extracts the band-pass power distribution for each frequency bandwidth calculated by the sound source position search unit for the sound source position or direction estimated by the sound source position search unit directly as the speech parameter.
  - 8. The system of claim 1, wherein the sound source position search unit synthesizes the calculated band-pass power distributions for a plurality of frequency bandwidths by weighting the calculated band-pass power distributions with respective weights, and summing weighted band-pass power distributions.
  - 9. The system of claim 1, wherein the sound source position search unit estimates the sound source position or direction by detecting a peak in the synthesized band-pass power distribution and setting a position or direction corresponding to a detected peak as the sound source position or direction.
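
Claims 4 and 6 describe computing, for one frequency bandwidth, a power value at each assumed direction by minimizing the array output power while holding the response toward that direction constant. A minimal sketch of that idea follows, under assumptions not stated in the claims: a far-field uniform linear array, complex narrowband snapshots in place of the delay-line tap filters of claim 5, and a small diagonal loading term. With the constraint w^H a(theta) = 1, the minimized output power has the well-known closed form P(theta) = 1 / (a^H R^-1 a). All names and numeric values are hypothetical.

```python
import numpy as np

def steering_vector(theta, n_mics, spacing, freq, c=343.0):
    """Far-field steering vector for a uniform linear array (assumed geometry)."""
    delays = np.arange(n_mics) * spacing * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq * delays)

def mv_power_distribution(snapshots, thetas, spacing, freq, loading=1e-3):
    """Minimum variance power for each assumed direction in one band.

    snapshots: complex array (n_mics, n_frames) of narrowband samples.
    Minimizing w^H R w subject to w^H a(theta) = 1 gives
    P(theta) = 1 / (a^H R^-1 a).
    """
    n_mics, n_frames = snapshots.shape
    R = snapshots @ snapshots.conj().T / n_frames
    # Diagonal loading keeps R well conditioned; the value is an arbitrary choice.
    R = R + loading * np.trace(R).real / n_mics * np.eye(n_mics)
    R_inv = np.linalg.inv(R)
    powers = []
    for theta in thetas:
        a = steering_vector(theta, n_mics, spacing, freq)
        powers.append(1.0 / np.real(a.conj() @ R_inv @ a))
    return np.array(powers)

# Hypothetical scene: 8 mics, 4 cm spacing, 2 kHz band, one source at +20 degrees.
rng = np.random.default_rng(0)
n_mics, n_frames, spacing, freq = 8, 400, 0.04, 2000.0
true_theta = np.deg2rad(20.0)
s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
noise = 0.1 * (rng.standard_normal((n_mics, n_frames))
               + 1j * rng.standard_normal((n_mics, n_frames)))
x = np.outer(steering_vector(true_theta, n_mics, spacing, freq), s) + noise

thetas = np.deg2rad(np.arange(-90, 91, 2))  # prescribed search range, 2 degree grid
power = mv_power_distribution(x, thetas, spacing, freq)
estimate_deg = np.rad2deg(thetas[np.argmax(power)])
```

The array `power` plays the role of the band-pass power distribution for a single band; repeating the call per band would give the set of distributions that the claims then synthesize.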

10. A microphone array input type speech analysis system, comprising:
- a speech input unit for inputting speeches in a plurality of channels using a microphone array formed by a plurality of microphones;
  
  a frequency analysis unit for analyzing an input speech of each channel inputted by the speech input unit, and obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth;
  
  a sound source position search unit for calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the frequency analysis unit, synthesizing calculated band-pass power distributions for a plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution; and
  
  a speech parameter extraction unit for extracting a speech parameter from the band-pass power distribution for each frequency bandwidth estimated by the sound source position search unit, according to the sound source position or direction estimated by the sound source position search unit.
  - 11. The system of claim 10, wherein the sound source position search unit includes:
    - a low resolution sound source position estimation unit for estimating a rough sound source position or direction, by minimizing an output power of the microphone array under constraints that responses of the microphone array for a plurality of directions or positions are to be maintained constant; and
      
      a high resolution sound source position estimation unit for estimating an accurate sound source position or direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound source position estimation unit, by minimizing the output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant, wherein the speech parameter extraction unit extracts the speech parameter according to the accurate sound source position or direction.
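
The two-stage search in claims 2 and 11 (a rough estimate first, then an accurate estimate only in its vicinity) can be illustrated with a generic coarse-to-fine grid search. This sketch simplifies the claims: both stages evaluate the same hypothetical power function, whereas the claimed low resolution stage uses constraints on several directions at once. The function names, the step sizes, and the toy power curve are assumptions.

```python
import math

def coarse_to_fine_search(power_fn, lo, hi, coarse_step, fine_step):
    """Two-stage peak search over a 1-D direction range (degrees)."""
    def argmax_on_grid(start, stop, step):
        n = int(round((stop - start) / step))
        grid = [start + i * step for i in range(n + 1)]
        return max(grid, key=power_fn)

    # Stage 1: rough estimate on a sparse grid over the whole range.
    rough = argmax_on_grid(lo, hi, coarse_step)
    # Stage 2: dense grid only within one coarse step of the rough estimate.
    fine_lo = max(lo, rough - coarse_step)
    fine_hi = min(hi, rough + coarse_step)
    return rough, argmax_on_grid(fine_lo, fine_hi, fine_step)

# Hypothetical smooth power curve with a single peak at 23 degrees.
def toy_power(theta_deg):
    return math.exp(-((theta_deg - 23.0) / 15.0) ** 2)

rough, accurate = coarse_to_fine_search(toy_power, -90.0, 90.0, 10.0, 1.0)
```

With these settings the search evaluates the power function at 19 coarse points and 21 fine points, 40 in total, where a full 1 degree grid over the same range would need 181 evaluations.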

12. A microphone array input type speech analysis system, comprising:
- a speech input unit for inputting speeches in a plurality of channels using a microphone array formed by a plurality of microphones;
  
  a frequency analysis unit for analyzing an input speech of each channel inputted by the speech input unit, and obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth; and
  
  a sound source position search unit for calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the frequency analysis unit, synthesizing calculated band-pass power distributions for a plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution.
  - 13. The system of claim 12, wherein the sound source position search unit includes:
    - a low resolution sound source position estimation unit for estimating a rough sound source position or direction, by minimizing an output power of the microphone array under constraints that responses of the microphone array for a plurality of directions or positions are to be maintained constant; and
      
      a high resolution sound source position estimation unit for estimating an accurate sound source position or direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound source position estimation unit, by minimizing the output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant.

14. A microphone array input type speech recognition method, comprising the steps of:
- inputting speeches in a plurality of channels using a microphone array formed by a plurality of microphones;
  
  analyzing an input speech of each channel inputted by the inputting step, and obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth;
  
  calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the analyzing step, synthesizing calculated band-pass power distributions for a plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution;
  
  extracting a speech parameter for speech recognition, from the band-pass power distribution for each frequency bandwidth calculated by the calculating step, according to the sound source position or direction estimated by the calculating step; and
  
  obtaining a speech recognition result by matching the speech parameter extracted by the extracting step with a recognition dictionary.
  - 15. The method of claim 14, wherein the calculating step includes the steps of:
    - a low resolution sound source position estimation step for estimating a rough sound source position or direction, by minimizing an output power of the microphone array under constraints that responses of the microphone array for a plurality of directions or positions are to be maintained constant; and
      
      a high resolution sound source position estimation step for estimating an accurate sound source position or direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound source position estimation step, by minimizing the output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant, wherein the extracting step extracts the speech parameter for speech recognition according to the accurate sound source position or direction.
  - 16. The method of claim 14, wherein the analyzing step obtains the band-pass waveforms for each channel by using a band-pass filter bank.
  - 17. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each frequency bandwidth, by calculating a band-pass power for each frequency bandwidth, in each one of a plurality of assumed sound source positions or directions within a prescribed search range.
  - 18. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each frequency bandwidth by using a filter function configuration having a plurality of delay line taps for each channel.
  - 19. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each frequency bandwidth by using a minimum variance method for minimizing an output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant.
  - 20. The method of claim 14, wherein the extracting step extracts the band-pass power distribution for each frequency bandwidth calculated by the calculating step for the sound source position or direction estimated by the calculating step directly as the speech parameter.
  - 21. The method of claim 14, wherein the calculating step synthesizes the calculated band-pass power distributions for a plurality of frequency bandwidths by weighting the calculated band-pass power distributions with respective weights, and summing weighted band-pass power distributions.
  - 22. The method of claim 14, wherein the calculating step estimates the sound source position or direction by detecting a peak in the synthesized band-pass power distribution and setting a position or direction corresponding to a detected peak as the sound source position or direction.
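
Claims 20 to 22 together describe a short chain: weight and sum the per-band power distributions, take the peak of the sum as the source direction, and read the per-band powers at that direction as the speech parameter. A small sketch with made-up numbers shows the arithmetic; the direction grid, power values, weights, and function names are all hypothetical.

```python
def synthesize(distributions, weights):
    """Weighted sum of per-band power distributions on a shared direction grid."""
    n = len(distributions[0])
    return [sum(w * d[i] for w, d in zip(weights, distributions)) for i in range(n)]

def peak_index(values):
    """Index of the largest value, used here as a simple peak detector."""
    return max(range(len(values)), key=values.__getitem__)

directions = [-40, -20, 0, 20, 40]       # assumed search grid in degrees
band_powers = [                          # one row per frequency bandwidth
    [0.1, 0.2, 0.9, 0.3, 0.1],
    [0.2, 0.1, 0.4, 0.8, 0.2],
    [0.1, 0.1, 0.5, 0.7, 0.1],
]
weights = [0.2, 0.4, 0.4]                # assumed per-band weights

total = synthesize(band_powers, weights)
idx = peak_index(total)
source_direction = directions[idx]
# Per-band powers at the estimated direction, used directly as the speech parameter.
speech_parameter = [band[idx] for band in band_powers]
```

In this example the first band alone would point to 0 degrees, but the weighted sum is about 0.54 at 0 degrees and about 0.66 at 20 degrees, so the synthesized distribution selects 20 degrees and the speech parameter is `[0.3, 0.8, 0.7]`.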

23. A microphone array input type speech analysis method, comprising the steps of:
- inputting speeches in a plurality of channels using a microphone array formed by a plurality of microphones;
  
  analyzing an input speech of each channel inputted by the inputting step, and obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth;
  
  calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the analyzing step, synthesizing calculated band-pass power distributions for a plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution; and
  
  extracting a speech parameter from the band-pass power distribution for each frequency bandwidth calculated by the calculating step, according to the sound source position or direction estimated by the calculating step.
  - 24. The method of claim 23, wherein the calculating step includes the steps of:
    - a low resolution sound source position estimation step for estimating a rough sound source position or direction, by minimizing an output power of the microphone array under constraints that responses of the microphone array for a plurality of directions or positions are to be maintained constant; and
      
      a high resolution sound source position estimation step for estimating an accurate sound source position or direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound source position estimation step, by minimizing the output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant, wherein the extracting step extracts the speech parameter according to the accurate sound source position or direction.

25. A microphone array input type speech analysis method, comprising the steps of:
- inputting speeches in a plurality of channels using a microphone array formed by a plurality of microphones;
  
  analyzing an input speech of each channel inputted by the inputting step, and obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth; and
  
  calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the analyzing step, synthesizing calculated band-pass power distributions for a plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution.
  - 26. The method of claim 25, wherein the calculating step includes the steps of:
    - a low resolution sound source position estimation step for estimating a rough sound source position or direction, by minimizing an output power of the microphone array under constraints that responses of the microphone array for a plurality of directions or positions are to be maintained constant; and
      
      a high resolution sound source position estimation step for estimating an accurate sound source position or direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound source position estimation step, by minimizing the output power of the microphone array under constraints that a response of the microphone array for one direction or position is to be maintained constant.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Nagata, Yoshifumi
Primary Examiner(s): Voeltz, Emanuel Todd
Assistant Examiner(s): SOFOCLEOUS, MICHAEL D
Application Number: US08/818,672
Time in Patent Office: 1,019 Days
Field of Search: 381/92, 704/233, 704/275, 367/119, 367/120-127
US Class Current: 704/270
CPC Class Codes:
G10L 15/26 Speech to text systems G10L...
G10L 2021/02166 Microphone arrays; Beamforming
