Method and apparatus including microphone arrays and neural networks for speech/speaker recognition systems
First Claim
1. A method for preprocessing speech input signals from a microphone array receiving speech from a distant-talking speaker, for converting features of the measured speech of distant-talking reverberant speech input to be substantially similar to those of features of close-talking speech input signals used to train a speech recognition system and/or a speaker recognition system, comprising the steps of:
simultaneously recording the close-talking speech from said speaker positioned close to a microphone, and distant-talking reverberant speech from said speaker positioned a distance from said microphone array, for a predetermined number of sentences;
extracting features of said close-talking speech and said distant-talking reverberant speech;
connecting features of the distant-talking reverberant speech to input nodes of a neural network system;
connecting features of the close-talking speech to output nodes of said neural network;
training said neural network system to convert said features of distant-talking reverberant speech to a form substantially similar to said features of close-talking speech relative to said speaker;
disconnecting the close-talking speech from said output nodes after said neural network is trained; and
permitting said speaker to speak a distance from said microphone array unencumbered by a close-talking microphone, by providing corrected cepstrum coefficients of said features of the speech representative of close-talking speech from said output nodes of said neural network, for connection to either one or both of said speech recognition system and said speaker recognition system.
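The training steps of claim 1 can be illustrated with a minimal sketch. It assumes synthetic paired feature vectors in place of simultaneously recorded speech, and a one-hidden-layer network trained by stochastic gradient descent; the distortion model, layer sizes, learning rate, and all names (`make_pairs`, `FeatureMapper`) are hypothetical and are not taken from the patent.

```python
import math
import random

def make_pairs(n, dim, rng):
    """Synthetic stand-in for simultaneously recorded feature frames."""
    pairs = []
    for _ in range(n):
        close = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Hypothetical distortion: scaling, leakage between coefficients, offset.
        distant = [0.6 * close[i] + 0.3 * close[(i + 1) % dim] + 0.2
                   for i in range(dim)]
        pairs.append((distant, close))
    return pairs

class FeatureMapper:
    """One-hidden-layer network: distant-talking features in, close-talking out."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
        self.b2 = [0.0] * dim

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def train_step(self, x, target, lr):
        """Training: close-talking features are presented at the output nodes."""
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]  # gradient of 0.5 * squared error
        # Backpropagate to the hidden layer before updating w2.
        grad_h = [sum(err[k] * self.w2[k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
                  for j in range(len(h))]
        for k in range(len(err)):
            for j in range(len(h)):
                self.w2[k][j] -= lr * err[k] * h[j]
            self.b2[k] -= lr * err[k]
        for j in range(len(h)):
            for i in range(len(x)):
                self.w1[j][i] -= lr * grad_h[j] * x[i]
            self.b1[j] -= lr * grad_h[j]

    def convert(self, x):
        """After training: no target is connected, only the forward pass runs."""
        return self.forward(x)[1]

def mean_squared_error(mapper, pairs):
    total = 0.0
    for distant, close in pairs:
        out = mapper.convert(distant)
        total += sum((o - c) ** 2 for o, c in zip(out, close))
    return total / len(pairs)

rng = random.Random(0)
train_pairs = make_pairs(200, 3, rng)
test_pairs = make_pairs(50, 3, rng)
mapper = FeatureMapper(dim=3, hidden=8, rng=rng)
mse_before = mean_squared_error(mapper, test_pairs)
for epoch in range(100):
    for distant, close in train_pairs:
        mapper.train_step(distant, close, lr=0.05)
mse_after = mean_squared_error(mapper, test_pairs)
```

Comparing `mse_before` with `mse_after` on the held-out pairs gives a rough indication of whether the mapping has been learned; `convert` corresponds to the stage in which the close-talking microphone is no longer connected.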
Abstract
A neural network is trained to transform distant-talking cepstrum coefficients, derived from a microphone array receiving speech from a speaker distant therefrom, into a form substantially similar to close-talking cepstrum coefficients that would be derived from a microphone close to the speaker, for providing robust hands-free speech and speaker recognition in adverse practical environments with existing speech and speaker recognition systems which have been trained on close-talking speech.
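For readers unfamiliar with cepstrum coefficients, the following sketch computes a real cepstrum (the inverse DFT of the log magnitude spectrum) for a single frame. This is one common definition and is only an assumption here: the text above does not state how the coefficients are derived. The frame length, Hamming window, number of coefficients, and the synthetic test frame are likewise illustrative choices.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform, adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, n_coeffs):
    """Return the first n_coeffs real-cepstrum coefficients of one frame."""
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, s in enumerate(frame)]
    spectrum = dft(windowed)
    # Small constant guards against log(0) for empty bins.
    log_mag = [math.log(abs(v) + 1e-10) for v in spectrum]
    # Inverse DFT of the log magnitude; the result is real for real input frames.
    cep = [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
           for q in range(n)]
    return cep[:n_coeffs]

# Synthetic 64-sample frame: a sinusoid plus noise (illustrative data only).
rng = random.Random(1)
frame = [math.sin(2 * math.pi * 5 * t / 64) + 0.5 * rng.uniform(-1.0, 1.0)
         for t in range(64)]
coeffs = real_cepstrum(frame, 12)
coeffs_louder = real_cepstrum([2.0 * s for s in frame], 12)
```

Doubling the amplitude of the frame adds log 2 to the zeroth coefficient and leaves the remaining coefficients essentially unchanged, because a gain change is a constant offset in the log spectrum.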
16 Claims
6. A system for converting "distant-talking" reverberant speech input signals from sound detecting apparatus at a distance from a speaker, to signals substantially similar to those obtained from "close-talking" where the speaker is close to a microphone, the converted speech features being connected to the input terminals of either one or both of a speech recognized system, and speaker recognized system, each of which was trained with close-talking speech, said conversion system comprising:
said microphone for close-talking speech reception;
said sound detecting apparatus for distant-talking reverberant speech reception;
means for extracting multiple features of both said close-talking speech, and distant-talking reverberant speech, in the form of cepstrum coefficients, respectively;
a neural network system having a plurality of input nodes and a plurality of output nodes;
mode means selectively operable for placing said neural network into either a training mode or a recognition mode, wherein when in said training mode, features of said distant-talking reverberant speech outputted from said means for extracting are applied to said input nodes, and simultaneously close-talking speech outputted from said means for extracting is applied to said output nodes, with the speaker uttering a plurality of sentences selected for training, and when in said recognition mode said output nodes are disconnected from said microphone, thereby making said output nodes available for connection to an input port of either one or both of a speech recognition system and speaker recognition system, for providing speech features thereto that have been converted by said neural network from distant-talking reverberant form to corresponding close-talking form capable of being accurately recognized by said speech and speaker recognition systems, respectively.
Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14)
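The "mode means" of claim 6 can be pictured as a switch between two data paths. The sketch below is an illustration under stated assumptions: a single linear layer trained with a least-mean-squares update stands in for the neural network system, the recognizer is a stub callback, and the names `Mode`, `ConversionSystem`, and `process` are hypothetical.

```python
import random
from enum import Enum

class Mode(Enum):
    TRAINING = "training"
    RECOGNITION = "recognition"

class ConversionSystem:
    """Routes features according to the selected mode."""

    def __init__(self, dim, recognizer, lr=0.2):
        self.mode = Mode.TRAINING
        self.recognizer = recognizer  # callable receiving converted features
        self.lr = lr
        # Linear layer initialised to the identity (stand-in for the network).
        self.weights = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.bias = [0.0] * dim

    def _convert(self, distant):
        return [sum(w * d for w, d in zip(row, distant)) + b
                for row, b in zip(self.weights, self.bias)]

    def set_mode(self, mode):
        self.mode = mode

    def process(self, distant, close=None):
        if self.mode is Mode.TRAINING:
            if close is None:
                raise ValueError("training mode needs close-talking features")
            # Close-talking features act as targets at the output nodes.
            out = self._convert(distant)
            for i, (o, c) in enumerate(zip(out, close)):
                err = o - c
                for j, d in enumerate(distant):
                    self.weights[i][j] -= self.lr * err * d
                self.bias[i] -= self.lr * err
            return None
        if close is not None:
            raise ValueError("close-talking input is disconnected in recognition mode")
        # Output nodes now feed the recognizer instead of receiving targets.
        return self.recognizer(self._convert(distant))

# Stub recognizer that simply returns the features it is given.
system = ConversionSystem(dim=2, recognizer=lambda features: features)

rng = random.Random(0)
for _ in range(3000):
    close = [rng.uniform(-1.0, 1.0) for _ in range(2)]
    distant = [0.5 * c + 0.1 for c in close]  # hypothetical room distortion
    system.process(distant, close)

system.set_mode(Mode.RECOGNITION)
received = system.process([0.35, -0.15])  # distant version of [0.5, -0.5]
```

In recognition mode the stub recognizer receives features close to `[0.5, -0.5]`, the close-talking values that the hypothetical distortion had mapped to `[0.35, -0.15]`.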
15. A method for improving the speech recognition accuracy in speech and/or speaker recognition systems in which a speaker roams unencumbered while speaking to provide distant-talking speech in a reverberant environment, comprising the steps of:
training speech and/or speaker recognizer systems with close-talking speech derived from the speaker speaking close to a microphone;
extracting features of said close-talking speech, and said distant-talking reverberant speech in the form of cepstrum coefficients, respectively; and
training a neural network to transform distant-talking reverberant speech features derived from sound pickup apparatus a distance from said speaker, into close-talking speech features that can be accurately detected by said speech and/or speaker recognizer systems, during normal recognition of speech from the roaming speaker.
Dependent Claims (16)
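The motivation behind claim 15, that a recognizer trained only on close-talking features can fail on distant-talking features unless they are first converted, can be shown with a toy example. Everything in it is an assumption made for illustration: the two-word nearest-template recognizer, the `room` distortion, and the hand-written `converter`, which stands in for a trained neural network.

```python
def nearest_template(features, templates):
    """Toy recognizer: return the label of the closest close-talking template."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda label: sq_dist(features, templates[label]))

# Recognizer "trained" on close-talking features only.
templates = {"A": [1.0, 0.0], "B": [0.0, 1.0]}

def room(close):
    """Hypothetical distant-talking distortion of the feature vector."""
    return [0.2 * close[0], 0.2 * close[1] + 0.6]

def converter(distant):
    """Stand-in for the trained network: inverts the distortion above."""
    return [distant[0] / 0.2, (distant[1] - 0.6) / 0.2]

def accuracy(front_end):
    """Fraction of words recognized when front_end preprocesses distant features."""
    hits = sum(nearest_template(front_end(room(templates[label])), templates) == label
               for label in templates)
    return hits / len(templates)

raw_accuracy = accuracy(lambda features: features)  # no conversion
converted_accuracy = accuracy(converter)            # with conversion
```

With these toy values `raw_accuracy` is 0.5, because word "A" is mistaken for "B" after distortion, while `converted_accuracy` is 1.0.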
Specification