Method and apparatus including microphone arrays and neural networks for speech/speaker recognition systems

US 5,737,485 A
Filed: 03/07/1995
Issued: 04/07/1998
Est. Priority Date: 03/07/1995
Status: Expired due to Term

3 Assignments

0 Petitions

Abstract

A neural network is trained to transform distant-talking cepstrum coefficients, derived from a microphone array receiving speech from a speaker distant therefrom, into a form substantially similar to close-talking cepstrum coefficients that would be derived from a microphone close to the speaker, for providing robust hands-free speech and speaker recognition in adverse practical environments with existing speech and speaker recognition systems which have been trained on close-talking speech.
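The abstract refers to cepstrum coefficients without saying how they are computed. As a minimal sketch, assuming the real cepstrum (the inverse DFT of the log magnitude spectrum) is an acceptable stand-in, one frame of samples could be converted as follows. The function names, the frame contents, and the choice of 12 coefficients are illustrative assumptions, not taken from the patent.

```python
import cmath
import math


def dft(samples):
    """Plain O(N^2) discrete Fourier transform; adequate for one short frame."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, n_coeffs=12, floor=1e-12):
    """Return the first n_coeffs real-cepstrum coefficients of one frame.

    Assumption: the patent only says "cepstrum coefficients"; the real
    cepstrum is used here purely for illustration.
    """
    n = len(frame)
    # Log magnitude spectrum, floored to avoid log(0).
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    # Inverse DFT of the log magnitude spectrum; the result is real
    # because the log magnitude of a real signal is symmetric.
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n_coeffs)
    ]


# Hypothetical 64-sample frame standing in for a windowed speech segment.
frame = [math.sin(0.3 * t) + 0.5 * 0.9 ** t for t in range(64)]
ceps = real_cepstrum(frame)
```

A useful property of this representation is that a pure gain change in the signal shifts only the zeroth coefficient, which is one reason cepstral features are popular front ends for recognizers.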

153 Citations

16 Claims

1. A method for preprocessing speech input signals from a microphone array receiving speech from a distant-talking speaker, for converting features of the measured speech of distant-talking reverberant speech input to be substantially similar to those of features of close-talking speech input signals used to train a speech recognition system and/or a speaker recognition system, comprising the steps of:
- simultaneously recording the close-talking speech from said speaker positioned close to a microphone, and distant-talking reverberant speech from said speaker positioned a distance from said microphone array, for a predetermined number of sentences;
- extracting features of said close-talking speech and said distant-talking reverberant speech;
- connecting features of the distant-talking reverberant speech to input nodes of a neural network system;
- connecting features of the close-talking speech to output nodes of said neural network;
- training said neural network system to convert said features of distant-talking reverberant speech to a form substantially similar to said features of close-talking speech relative to said speaker;
- disconnecting the close-talking speech from said output nodes after said neural network is trained; and
- permitting said speaker to speak a distance from said microphone array unencumbered by a close-talking microphone, by providing corrected cepstrum coefficients of said features of the speech representative of close-talking speech from said output nodes of said neural network, for connection to either one or both of said speech recognition system and said speaker recognition system.

2. The method of claim 1, wherein said connecting step for said input nodes includes passing the distant-talking speech through a feature extractor, and connecting the extracted features in the form of cepstrum coefficients to said input nodes.

3. The method of claim 2, wherein said connecting step for said output nodes includes passing the close-talking speech through another feature extractor, and connecting the extracted features in the form of cepstrum coefficients to said output nodes.

4. The method of claim 1, wherein said connecting step for said output nodes includes passing the close-talking speech through a feature extractor, and connecting the extracted features in the form of cepstrum coefficients to said output nodes.

5. The method of claim 1, further including the steps of:
- said connecting step for said input nodes including passing the distant-talking speech through a first feature extractor, and connecting output signals of said first feature extractor to said input nodes, respectively; and
- said connecting step for said output nodes including passing the close-talking speech through a second feature extractor, and connecting output signals of said second feature extractor to said output nodes, respectively.
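The training procedure of claims 1 to 5 can be sketched in code. What follows is a minimal illustration, not the patented implementation: it assumes a single hidden layer with tanh units, per-frame gradient descent on squared error, and synthetic feature pairs in place of real simultaneous recordings. The class `MLP`, the helper `make_pairs`, the distortion model, and all sizes and learning rates are hypothetical.

```python
import math
import random


def make_pairs(n_frames, n_feats, rng):
    """Synthetic stand-in for simultaneously recorded (distant, close) features."""
    pairs = []
    for _ in range(n_frames):
        close = [rng.uniform(-1.0, 1.0) for _ in range(n_feats)]
        # Hypothetical distortion: attenuation, leakage from a neighbour, offset.
        distant = [0.6 * close[i] + 0.3 * close[i - 1] + 0.1 for i in range(n_feats)]
        pairs.append((distant, close))
    return pairs


class MLP:
    """One hidden tanh layer, linear output layer."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def train_step(self, x, target, lr):
        """One gradient step on 0.5 * squared error for a single frame."""
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # Back-propagate to the hidden layer before the output weights change.
        dh = [(1.0 - h[j] ** 2) * sum(err[k] * self.w2[k][j] for k in range(len(err)))
              for j in range(len(h))]
        for k, e in enumerate(err):
            for j in range(len(h)):
                self.w2[k][j] -= lr * e * h[j]
            self.b2[k] -= lr * e
        for j, d in enumerate(dh):
            for i in range(len(x)):
                self.w1[j][i] -= lr * d * x[i]
            self.b1[j] -= lr * d


def mean_squared_error(net, pairs):
    total = 0.0
    for x, t in pairs:
        _, y = net.forward(x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(pairs)


rng = random.Random(0)
pairs = make_pairs(200, 4, rng)          # distant features in, close features as targets
net = MLP(4, 8, 4, rng)
loss_before = mean_squared_error(net, pairs)
for _ in range(50):                       # training mode: targets are "connected"
    for distant, close in pairs:
        net.train_step(distant, close, lr=0.05)
loss_after = mean_squared_error(net, pairs)

# Recognition mode: no close-talking target is needed, only a forward pass.
_, corrected = net.forward(pairs[0][0])
```

In this sketch, "disconnecting the close-talking speech from said output nodes" corresponds to simply no longer calling `train_step`; the `corrected` vector is what would be handed to a recognizer trained on close-talking features.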

6. A system for converting "distant-talking" reverberant speech input signals from sound detecting apparatus at a distance from a speaker, to signals substantially similar to those obtained from "close-talking" where the speaker is close to a microphone, the converted speech features being connected to the input terminals of either one or both of a speech recognized system, and speaker recognized system, each of which was trained with close-talking speech, said conversion system comprising:
- said microphone for close-talking speech reception;
- said sound detecting apparatus for distant-talking reverberant speech reception;
- means for extracting multiple features of both said close-talking speech, and distant-talking reverberant speech, in the form of cepstrum coefficients, respectively;
- a neural network system having a plurality of input nodes and a plurality of output nodes;
- mode means selectively operable for placing said neural network into either a training mode or a recognition mode, wherein when in said training mode, features of said distant-talking reverberant speech outputted from said means for extracting are applied to said input nodes, and simultaneously close-talking speech outputted from said means for extracting is applied to said output nodes, with the speaker uttering a plurality of sentences selected for training, and when in said recognition mode said output nodes are disconnected from said microphone, thereby making said output nodes available for connection to an input port of either one or both of a speech recognition system and speaker recognition system, for providing speech features thereto that have been converted by said neural network from distant-talking reverberant form to corresponding close-talking form capable of being accurately recognized by said speech and speaker recognition systems, respectively.

7. The conversion system of claim 6, wherein said neural network consists of a multi-layer perceptron neural network including input, output, and hidden layers.

8. The conversion system of claim 6, wherein said extracting means includes a first feature extractor connected between said sound detecting apparatus and said input nodes, respectively, for inputting features of said output signals of said sound detecting apparatus to said input nodes, respectively, in the form of a plurality of cepstrum coefficients.

9. The conversion system of claim 8, wherein said extracting means further includes a second feature extractor selectively connectable via said mode means between said microphone and said output nodes, respectively, for extracting from an output signal from said microphone a plurality of cepstrum coefficients for connection to said output nodes, respectively, in the training mode.

10. The conversion system of claim 9, where said mode means consists of a plurality of single-pole-double-throw switching means having individual poles connected to individual ones of said plurality of output nodes, respectively, for selectively connecting each said output node individually to either one output terminal of a plurality of output terminals of said feature extractor, respectively, or to individual input terminals of either one or both of said speech and recognizer systems, respectively.

11. The conversion system of claim 6, wherein said sound detecting apparatus includes a microphone array.

12. The conversion system of claim 11, wherein said microphone array includes a one-dimensional beamforming line array of a plurality of microphones.

13. The conversion system of claim 12, wherein said plurality of microphones of said microphone array are non-uniformly positioned in a line for providing harmonical nesting over four octaves.

14. The conversion system of claim 6, wherein said mode means consists of a plurality of single-pole-double-throw switching means having individual poles connected to individual ones of said plurality of output nodes, respectively, for selectively connecting each said output node individually to either said microphone or to a plurality of input terminals of either or both of said speech and speaker recognizers, respectively.
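Claims 12 and 13 describe a line array whose microphones are non-uniformly spaced to give harmonic nesting over four octaves, but the claims give no positions. One common way to build such an array, assumed here purely for illustration, is to overlay sub-arrays that share a centre and double their element spacing per octave, so that widely spaced elements serve low frequencies and closely spaced ones serve high frequencies. The function name, the five elements per sub-array, and the 2 cm base spacing are hypothetical.

```python
def nested_line_array(n_per_subarray=5, n_octaves=4):
    """Return sorted microphone positions in units of the smallest spacing.

    Assumption: each octave uses its own centred sub-array whose spacing is
    twice that of the previous one; shared positions are counted once.
    """
    half = n_per_subarray // 2
    positions = set()
    for octave in range(n_octaves):
        spacing = 2 ** octave  # 1, 2, 4, 8 base units
        for m in range(-half, half + 1):
            positions.add(m * spacing)
    return sorted(positions)


units = nested_line_array()

# Hypothetical base spacing in metres; not specified in the claims.
base_spacing_m = 0.02
positions_m = [u * base_spacing_m for u in units]
```

With these assumed defaults the four five-element sub-arrays share enough positions that only 11 distinct microphones are needed, which is the practical appeal of nesting.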

15. A method for improving the speech recognition accuracy in speech and/or speaker recognition systems in which a speaker roams unencumbered while speaking to provide distant-talking speech in a reverberant environment, comprising the steps of:
- training speech and/or speaker recognizer systems with close-talking speech derived from the speaker speaking close to a microphone;
- extracting features of said close-talking speech, and said distant-talking reverberant speech in the form of cepstrum coefficients, respectively; and
- training a neural network to transform distant-talking reverberant speech features derived from sound pickup apparatus a distance from said speaker, into close-talking speech features that can be accurately detected by said speech and/or speaker recognizer systems, during normal recognition of speech from the roaming speaker.

16. The method of claim 15, further including the steps of:
- configuring a microphone array to provide said sound pickup apparatus.

Current Assignee: Rutgers University
Original Assignee: Rutgers University
Inventors: Rahim, Mazin; Che, Chiwei; Lin, Qiguang; Flanagan, James L.
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): Storm, Donald L.
Application Number: US08/399,445
Time in Patent Office: 1,127 Days
Field of Search: 395/2.09, 395/2.1, 395/2.11, 395/2.4, 395/2.41, 395/2.35, 395/2.36, 395/2.37, 395/2.42, 395/2.5, 395/2.56, 395/2.6, 395/2.61, 395/2.67, 395/2.68, 395/21, 395/23, 395/24, 395/2.53
US Class Current: 704/232
CPC Class Codes:
G10L 15/16 using artificial neural net...
G10L 25/24 the extracted parameters be...
