Enhanced automatic speech recognition using multiple directional microphones
Abstract
A method and system for simultaneously controlling a plurality of automatic speech recognition (ASR) systems within the working volume of a room, or for controlling multiple devices as a single unified ASR system. Multiple microphones feed a signal processing system which determines or estimates both a user's location and a user's orientation within a room as the user issues voice commands, and further determines the microphone providing the best signal. Diversity noise cancellation may be further applied to the microphone signals. Based on the apparent direction the user is facing, the system may then enable one of the ASR systems for voice command recognition and/or execution of voice commands. The invention reduces the number of false command executions and improves the performance, accuracy, and ease of use of voice control.
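The "microphone providing the best signal" step the abstract mentions can be sketched as follows. The function name, the RMS-energy criterion, and the sample data are illustrative assumptions, not the patented method (which may use SNR or other measures):

```python
import math

# Sketch: pick the microphone with the strongest signal.
# "Best" is approximated here as highest RMS energy; a real system
# might use SNR or correlation-based quality measures instead.

def best_microphone(signals):
    """signals: dict of mic_name -> list of samples.
    Return the name of the mic with the highest RMS energy."""
    def rms(x):
        return math.sqrt(sum(s * s for s in x) / len(x))
    return max(signals, key=lambda name: rms(signals[name]))

signals = {
    "mic_door":  [0.01, -0.02, 0.015, -0.01],   # faint, user far away
    "mic_couch": [0.40, -0.35, 0.42, -0.38],    # user is nearby
}
print(best_microphone(signals))  # mic_couch
```

A selection like this would typically run continuously, with the winner's signal handed to the enabled ASR system.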
43 Claims
1. A voice activation system for user control of a plurality of automatic speech recognition (ASR) systems, comprising:
a plurality of ASR systems;
a plurality of microphones dispersed within a three-dimensional domain within audible range of a user, wherein each one of the plurality of microphones provides a respective audible voice signal responsive to speech from the user; and
a controller coupled to said plurality of microphones and to each of said plurality of ASR systems for selectively enabling one of said plurality of ASR systems;
wherein each one of the plurality of ASR systems is responsive to the controller, and wherein the controller is responsive to each one of the received respective audible voice signals. (Dependent claims: 2-29.)
wherein at least one of the plurality of ASR systems has associated with it at least one respective one of the plurality of microphones; and
wherein at least one of the plurality of microphones associated with a corresponding respective one of the plurality of ASR systems is co-located with the corresponding respective ASR system.
6. The voice activation system as in claim 1,
wherein each one of the plurality of ASR systems further comprises a respective audio input, wherein the voice activation system further comprises an audio delay storage means coupled to the controller for selectively storing and delaying an audio signal; and
wherein the stored and delayed audio signal is selectively coupled by the controller to the audio input of one of the ASR systems.
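Claim 6's delay storage can be read as a bounded buffer that holds audio while the controller decides which ASR system to enable, then replays the stored samples into that system's input. A minimal sketch; the class and method names are assumptions, not the claimed means:

```python
from collections import deque

# Sketch of claim 6's audio delay storage: buffer incoming audio
# during ASR selection, then drain it into the chosen system's input.

class DelayBuffer:
    def __init__(self, max_samples):
        # Oldest samples are discarded once the buffer is full.
        self.buf = deque(maxlen=max_samples)

    def store(self, samples):
        self.buf.extend(samples)

    def relay_to(self, asr_input):
        """Drain the buffered samples into the selected ASR's input callback."""
        while self.buf:
            asr_input(self.buf.popleft())

received = []
db = DelayBuffer(max_samples=16000)   # roughly 1 s at 16 kHz
db.store([1, 2, 3])                    # audio arrives while the controller decides
db.relay_to(received.append)           # controller picks an ASR, replays the audio
print(received)  # [1, 2, 3]
```

This is why the user's first words are not lost even though selection takes nonzero time.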
7. The voice activation system as in claim 1, wherein each one of the ASR systems further comprises a respective enable input coupled to the controller, wherein the respective ASR system is selectively enabled responsive to its respective enable input.
8. The voice activation system as in claim 1, wherein the controller is operable in at least a learning mode and an operating mode.
9. The voice activation system as in claim 8, wherein the operating mode is continuously active for at least one of the plurality of microphones.
10. The voice activation system as in claim 1, wherein the controller further comprises means for providing a database containing at least a respective location within the three-dimensional domain of each one of a plurality of possible targets of a user voice command.
11. The voice activation system as in claim 10, wherein the database further comprises an angular tolerance for each respective possible target of a user voice command.
12. The voice activation system as in claim 10, wherein the controller further comprises:
means to determine a position and an orientation of the user responsive to a received audible voice signal from the user; and
means to match the position and orientation of the user to the respective target in the database.
13. The voice activation system as in claim 12, further comprising:
means to select a target command subset responsive to the means to match; and
means to associate a particular command from the target command subset responsive to the audible voice signal.
14. The voice activation system as in claim 12, wherein the means to match further comprises a programmable angular tolerance with respect to the orientation of the user.
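The matching of claims 12-14, associating the user's position and orientation with a database target within a programmable angular tolerance, might look like this 2-D sketch. All names, the coordinate frame, and the default tolerance are illustrative assumptions:

```python
import math

# Sketch of claims 12-14: match the user's heading to the database
# target whose bearing from the user falls within an angular tolerance.

def select_asr(user_pos, user_heading_deg, targets, tolerance_deg=15.0):
    """targets: dict of name -> (x, y) location.
    Return the target whose bearing best matches the heading,
    or None if nothing lies within the tolerance."""
    best, best_err = None, tolerance_deg
    for name, (tx, ty) in targets.items():
        bearing = math.degrees(math.atan2(ty - user_pos[1], tx - user_pos[0]))
        # Wrap the angular difference into [-180, 180] before comparing.
        err = abs((bearing - user_heading_deg + 180.0) % 360.0 - 180.0)
        if err <= best_err:
            best, best_err = name, err
    return best

targets = {"tv": (3.0, 0.0), "thermostat": (0.0, 3.0)}
print(select_asr((0.0, 0.0), 85.0, targets))  # thermostat: user faces roughly +y
```

A per-target tolerance, as in claim 11, would simply replace the single `tolerance_deg` with a lookup per entry.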
15. The voice activation system as in claim 1, wherein the controller further comprises means to determine a position and an orientation of the user responsive to a received audible voice signal from the user.
16. The voice activation system as in claim 15, wherein the controller further comprises at least one of a digital signal processor (DSP) and a signal processing circuit.
17. The voice activation system as in claim 16,
wherein the controller further comprises a database containing at least a respective location within the three-dimensional domain of each one of the plurality of microphones; and
wherein the learning mode is automatically triggered by detection of a set of input signals inconsistent with the location of the microphones in the database.
18. The voice activation system as in claim 1, wherein the controller further comprises means for providing a database containing at least a respective location within the three-dimensional domain of each one of the plurality of microphones.
19. The voice activation system as in claim 18, wherein the means for providing a database further comprises:
a first graphical user interface interaction means to allow the user to interactively design at least one scale drawing of the three-dimensional domain; and
a second graphical user interface interaction means to allow the user to interactively locate each one of the plurality of microphones with respect to the scale drawing.
20. The voice activation system as in claim 18, further comprising:
triangulation means to detect a position and an orientation of the user within the three-dimensional domain, responsive to the plurality of audible voice signals;
checking means to determine if the detected position and orientation is inconsistent with the location within the three-dimensional domain of each one of the plurality of microphones in the database;
database selection means to determine the largest set of the plurality of microphone locations in the database that are consistent with the determination of the position and the orientation of the user within the three-dimensional domain; and
database modification means to modify the database to change the stored location within the three-dimensional domain of each one of the plurality of microphones not in the largest set, to be consistent with the determination of the position and the orientation of the user within the three-dimensional domain.
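One possible reading of claim 20's checking and selection means: a microphone's stored location predicts an acoustic distance to the triangulated user position, and mics whose measured propagation delay disagrees beyond a tolerance fall outside the "largest consistent set". A simplified 2-D sketch under that assumption; all names and the tolerance are illustrative:

```python
import math

# Sketch of claim 20's consistency check: compare the distance
# predicted by each mic's database location against the distance
# implied by its measured propagation delay.

SPEED_OF_SOUND = 343.0  # m/s

def consistent_set(user_pos, mic_db, measured_delays, tol=0.5):
    """Return the set of mic names whose database location is
    consistent (within tol meters) with the measured delay in seconds."""
    ok = set()
    for name, (mx, my) in mic_db.items():
        predicted = math.hypot(mx - user_pos[0], my - user_pos[1])
        measured = measured_delays[name] * SPEED_OF_SOUND
        if abs(predicted - measured) <= tol:
            ok.add(name)
    return ok

mic_db = {"m1": (0.0, 0.0), "m2": (4.0, 0.0), "m3": (0.0, 3.0)}
user = (1.0, 1.0)
delays = {                      # m3 was moved without updating the database
    "m1": math.hypot(1.0, 1.0) / SPEED_OF_SOUND,
    "m2": math.hypot(3.0, 1.0) / SPEED_OF_SOUND,
    "m3": 9.0 / SPEED_OF_SOUND,
}
print(consistent_set(user, mic_db, delays))  # m3 flagged as inconsistent
```

The claimed database modification means would then update the stored locations of the flagged mics to restore consistency.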
21. The voice activation system as in claim 18, further comprising:
timing means to determine an exact arrival time for each respective one of the plurality of audible voice signals;
cross correlation means to calculate a delay number for each respective one of the plurality of audible voice signals by cross correlating the exact arrival times for all of the plurality of audible voice signals;
boundary means to compute a spherical boundary on which the user's location must be present, relative to the location of each respective one of the plurality of microphones as stored in the database, responsive to the respective delay number for each respective one of the audible voice signals;
intersection means to compute the intersection of all the computed spherical boundaries, wherein said intersection defines the detected position of the user within the three-dimensional domain; and
wherein the controller is further responsive to the intersection means.
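Claim 21's cross correlation of the microphone signals is the classic time-difference-of-arrival (TDOA) building block: the lag that maximizes the correlation between two mics' signals is the relative delay. A brute-force, pure-Python sketch, not the claimed circuit:

```python
# Sketch of claim 21's timing/cross-correlation means: estimate the
# inter-microphone delay of the same utterance by exhaustive
# cross-correlation over a bounded lag range.

def best_lag(ref, sig, max_lag):
    """Return the lag (in samples) maximizing the cross-correlation
    of sig against ref; positive means sig arrives later than ref."""
    def corr(lag):
        return sum(ref[i] * sig[i + lag]
                   for i in range(len(ref))
                   if 0 <= i + lag < len(sig))
    return max(range(-max_lag, max_lag + 1), key=corr)

pulse = [0.0] * 20
pulse[5:8] = [1.0, 2.0, 1.0]            # utterance at the reference mic
delayed = [0.0] * 20
delayed[9:12] = [1.0, 2.0, 1.0]          # same utterance, 4 samples later
print(best_lag(pulse, delayed, max_lag=8))  # 4
```

Each such delay number, multiplied by the speed of sound, yields the range difference that defines one of the spherical boundaries in the claim.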
22. The voice activation system as in claim 21, further comprising:
amplitude and frequency analysis means to analyze the plurality of audible voice signals producing respective amplitude data and respective frequency data for each of the plurality of audible voice signals;
amplitude and frequency adjustment means to adjust the respective amplitude data and the respective frequency data for each of the plurality of audible voice signals responsive to the intersection means;
amplitude and frequency measurement means to measure the adjusted respective amplitude data and the adjusted respective frequency data for each of the plurality of audible voice signals, producing a measure of the attenuation of high frequency content for each respective audible voice signal;
frequency attenuation comparison means to compare all the measures of attenuation of high frequency content for all respective audible voice signals, to determine the audible voice signals having a smallest measure of high frequency attenuation;
database lookup means to determine the microphone location corresponding to the audible voice signal having the smallest measure of high frequency attenuation as determined by the frequency attenuation comparison means;
user direction determination means to determine the direction the user is facing, responsive to the database lookup means and the intersection means; and
wherein the controller is further responsive to the user direction determination means.
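Claim 22 exploits the fact that a talker radiates more high-frequency energy in the direction they face, so after distance compensation, the signal with the least high-frequency attenuation marks the facing direction. A toy sketch using first-difference energy as the high-frequency proxy; the proxy and all names are assumptions, not the claimed measurement means:

```python
# Sketch of claim 22's direction cue: the mic the user faces receives
# the least-attenuated high-frequency content. HF content is
# approximated here by first-difference energy over total energy.

def hf_ratio(signal):
    """High-frequency proxy: energy of the first difference,
    normalized by total signal energy."""
    total = sum(s * s for s in signal) or 1.0
    hf = sum((signal[i] - signal[i - 1]) ** 2 for i in range(1, len(signal)))
    return hf / total

def facing_microphone(signals):
    """Return the mic name with the least HF attenuation (highest ratio)."""
    return max(signals, key=lambda name: hf_ratio(signals[name]))

signals = {
    "mic_front": [0.0, 1.0, -1.0, 1.0, -1.0, 0.0],  # crisp, full HF content
    "mic_side":  [0.0, 0.5, 0.6, 0.5, 0.4, 0.3],    # dulled, HF attenuated
}
print(facing_microphone(signals))  # mic_front
```

Combining this mic's database location with the triangulated user position (the intersection means of claim 21) yields the facing direction.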
23. The voice activation system as in claim 1, wherein the plurality of microphones are arranged in a predefined pattern within the three-dimensional domain.
24. The voice activation system as in claim 23, wherein the predefined pattern comprises at least one microphone located at each location that may be a target of an audible voice signal from the user.
25. The voice activation system as in claim 23, wherein the predefined pattern comprises at least one microphone located at each corner of the three-dimensional domain.
26. The voice activation system as in claim 23, wherein the predefined pattern comprises at least one microphone located at the center of each wall in the three-dimensional domain.
27. The voice activation system as in claim 23,
wherein a set of significant architectural features are defined to include at least one of doors, windows, large furniture, internal obstructions, and partitions; and
wherein the predefined pattern comprises at least one microphone located in a relation to each of the set of significant architectural features.
28. The voice activation system as in claim 23, wherein all of the microphones in the plurality of microphones have a unidirectional pattern.
29. The voice activation system as in claim 28, wherein all of the microphones in the plurality of microphones have a same frequency response, a same dynamic range, and a same sensitivity.
30. A method for simultaneously controlling a plurality of automatic speech recognition (ASR) systems, the method comprising the steps:
accepting a plurality of voice input signals from a respective plurality of microphones located in a three-dimensional domain within audible range of a user;
determining a position and an orientation of the user within the three-dimensional domain responsive to the plurality of voice input signals;
selecting one of the plurality of ASR systems responsive to the determining of the user's position and orientation;
enabling the selected ASR system for voice command operation; and
disabling a set of ASR systems wherein the set comprises the plurality of ASR systems excluding the selected ASR system. (Dependent claims: 31-43.)
wherein the step of determining the position and the orientation of the user further comprises the method of storing samples of at least one voice input signal during the determining; and
wherein the step of enabling the selected ASR system further comprises relaying the stored samples of the voice input signal to the selected ASR system.
35. The method as in claim 30, wherein the step of selecting one of the plurality of ASR systems responsive to the determining of the user's position and orientation further comprises:
choosing the ASR system most likely desired by the user for voice command operation responsive to a programmable angular tolerance for determination of the user's orientation.
36. The method as in claim 30, further comprising the step of maintaining a database, the database containing at least a location within the three-dimensional domain of each possible target of a user voice command.
37. The method as in claim 36, wherein the step of maintaining the database further comprises the method of:
interactively designing a scale drawing of the three-dimensional domain; and
locating each possible target of a user voice command with respect to the scale drawing.
38. The method as in claim 37, wherein the step of maintaining a database further comprises the method of:
establishing each of the plurality of microphones at a respective unique user controlled location within the three-dimensional domain;
sounding a user test sound at each of a plurality of possible targets of a user voice command within the three-dimensional domain producing system training data; and
creating an initial database containing location information for each of the plurality of microphones, and location information for each of the plurality of possible targets of a user voice command, responsive to the system training data.
39. The method as in claim 30, further comprising the step of maintaining a database, the database containing at least the respective location within the three-dimensional domain of each one of the plurality of microphones.
40. The method as in claim 39, wherein the step of maintaining the database further comprises the method of:
interactively designing at least one scale drawing of the three-dimensional domain; and
locating each one of the plurality of microphones with respect to the scale drawing.
41. The method as in claim 39, wherein the step of maintaining the database further comprises the method of:
detecting a position and an orientation of the user within the three-dimensional domain responsive to the plurality of voice input signals inconsistent with the location within the three-dimensional domain of each one of the plurality of microphones in the database;
determining the largest set of the plurality of microphone locations in the database that are consistent with the determination of the position and the orientation of the user within the three-dimensional domain; and
modifying the database to change the stored location within the three-dimensional domain of each one of the plurality of microphones not in the largest set to be consistent with the determination of the position and the orientation of the user within the three-dimensional domain.
42. The method as in claim 39, wherein the step of determining the position and the orientation of the user within the three-dimensional domain further comprises the method of:
determining an exact arrival time for each respective one of the plurality of voice input signals;
calculating a delay number for each respective one of the plurality of voice input signals by cross correlating the exact arrival times for all of the plurality of voice input signals;
computing a spherical boundary on which the user must be present, relative to the location of each respective one of the plurality of microphones as stored in the database, responsive to the respective delay number for each respective one of the voice input signals;
intersecting all the spherical boundaries at a common minimal intersection, wherein said common minimal intersection defines the detected position of the user within the three-dimensional domain.
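Claim 42's sphere intersection is multilateration: given each microphone's location and the sound's propagation distance to it (delay times the speed of sound), the talker lies on a sphere around each mic, and the common intersection locates the talker. The sketch below finds that intersection by a coarse grid search rather than a closed-form solve, in 2-D for brevity; all names are assumptions:

```python
import math

# Sketch of claim 42's intersection step: minimize the total squared
# mismatch between grid-to-mic distance and measured distance.

def locate(mics, distances, extent=5.0, step=0.1):
    """mics: list of (x, y); distances: measured range to each mic.
    Return the grid point best fitting all the range spheres."""
    best, best_err = None, float("inf")
    steps = int(round(extent / step)) + 1
    for ix in range(steps):
        for iy in range(steps):
            x, y = ix * step, iy * step
            err = sum((math.hypot(x - mx, y - my) - d) ** 2
                      for (mx, my), d in zip(mics, distances))
            if err < best_err:
                best, best_err = (x, y), err
    return best

mics = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
true_user = (1.0, 2.0)
distances = [math.hypot(true_user[0] - mx, true_user[1] - my)
             for mx, my in mics]
print(locate(mics, distances))  # close to (1.0, 2.0)
```

A production system would use a least-squares or closed-form multilateration solve instead of a grid search, but the geometry is the same.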
43. The method as in claim 42, wherein the step of determining the position and the orientation of the user further comprises the method of:
analyzing the plurality of voice input signals producing respective amplitude data and respective frequency data for each of the plurality of voice input signals;
adjusting the respective amplitude data and the respective frequency data for each of the plurality of voice input signals responsive to the triangulated position of the user;
measuring the adjusted respective amplitude data and the adjusted respective frequency data for each of the plurality of voice input signals producing a measure of the attenuation of high frequency content for each respective voice input signal;
comparing all the measures of attenuation of high frequency content for all respective voice input signals to determine the voice input signal having a smallest measure of high frequency attenuation;
identifying a selected one of the plurality of microphones as corresponding to the voice input signal having the smallest measure of high frequency attenuation;
determining a selected one of the plurality of target locations responsive to looking up the location of the selected one of the plurality of microphones in the database; and
determining the orientation responsive to the direction from the triangulated position of the user to the selected location.
Specification