Distributed realtime speech recognition system
Abstract
A real-time system incorporating speech recognition and linguistic processing for recognizing a spoken query by a user, distributed between client and server, is disclosed. The system accepts user's queries in the form of speech at the client, where minimal processing extracts a sufficient number of acoustic speech vectors representing the utterance. These vectors are sent via a communications channel to the server, where additional acoustic vectors are derived. Using Hidden Markov Models (HMMs), and appropriate grammars and dictionaries conditioned by the selections made by the user, the speech representing the user's query is fully decoded into text (or some other suitable form) at the server. This text corresponding to the user's query is then simultaneously sent to a natural language engine and a database processor, where optimized SQL statements are constructed for a full-text search of a database for a recordset of several stored questions that best match the user's query. Further processing in the natural language engine narrows the search to a single stored question. The answer corresponding to this single stored question is then retrieved from the file path and sent to the client in compressed form. At the client, the answer to the user's query is articulated to the user using a text-to-speech engine in his or her native natural language. The system requires no training and can operate in several natural languages.
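The client-side "minimal processing" the abstract describes can be illustrated with a short sketch: each incoming audio frame is reduced to a small vector of cepstral-style coefficients and emitted as a packet immediately, without waiting for silence or for the utterance to end. The frame size, coefficient count, and packet layout below are assumptions for illustration, not values taken from the patent.

```python
import numpy as np

FRAME = 256        # samples per frame (hypothetical)
N_CEPSTRA = 13     # static coefficients kept on the client (hypothetical)

def cepstra(frame: np.ndarray) -> np.ndarray:
    """Static cepstral-style coefficients for one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    log_spec = np.log(spectrum + 1e-10)
    # DCT-II of the log spectrum; keep only the first N_CEPSTRA terms.
    n = len(log_spec)
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(N_CEPSTRA)[:, None])
    return basis @ log_spec

def stream_packets(samples: np.ndarray):
    """Yield (frame_index, feature_bytes) packets while speech is arriving."""
    for i in range(len(samples) // FRAME):
        feats = cepstra(samples[i * FRAME:(i + 1) * FRAME])
        yield i, feats.astype(np.float32).tobytes()
```

Each packet here is 13 float32 values (52 bytes) per 256-sample frame, which illustrates the bandwidth reduction the distributed design relies on: the client sends compact features rather than raw audio.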
57 Claims
1. A machine executable program for assisting a computing system to effectuate distributed voice query recognition comprising:
a first audio signal receiving routine for receiving user speech utterance signals representing speech utterances to be recognized, said speech utterances including sentences comprised of one or more words; and
a first signal processing routine adapted to generate representative speech data values from said speech utterance signals, said representative speech data values being characterized by a first data content that is substantially inadequate by itself for permitting recognition of words articulated in said speech utterance; and
a formatting routine for rendering said representative speech data values into a transmission format suitable for transmission over a communications channel to a second processing routine executing on a separate computing system wherein said representative speech data values are transmitted continuously during said speech utterances within streaming packets and without waiting for silence to be detected and/or said speech utterances to be completed; and
wherein said first data content in said representative speech data values is used by said second processing routine to compute additional data content that when combined with said first data content is sufficient for a speech recognition routine to complete recognition of words articulated in said speech utterance at said separate computing system and further wherein an amount of said first data content transmitted is configured and can be varied for said speech utterances based on signal processing capabilities of the computing system and/or transmission characteristics of said communications channel. (Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 31, 32)
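Claim 1's final limitation lets the amount of transmitted "first data content" vary with the client's processing power and the channel. A toy policy illustrating that idea might look like the following; the tiers, thresholds, and coefficient counts are invented for illustration and are not taken from the patent.

```python
def coefficients_to_send(client_mips: float, channel_kbps: float) -> int:
    """Pick how many coefficients per frame the client transmits.

    Hypothetical policy: a weak client or slow channel sends a minimal
    static set and lets the server derive the rest; a strong client on a
    fast channel sends the full observation vector.
    """
    if client_mips < 50 or channel_kbps < 16:
        return 8           # minimal static set; server derives more
    if client_mips < 200 or channel_kbps < 64:
        return 13          # typical static cepstra
    return 39              # full vector: statics + deltas + accelerations
```

The point of such a policy is that the split between client-side and server-side computation is a tunable parameter, not a fixed boundary.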
11. A distributed voice recognition system comprising:
a sound processing circuit adapted to receive a speech utterance and to generate associated speech utterance signals therefrom; and
a first signal processing circuit adapted to generate a first set of speech data values from said speech utterance signals, said first set of speech data values being insufficient by themselves for permitting recognition of words articulated in said speech utterance; and
a transmission circuit for formatting and transmitting said first set of speech data values over a communications channel to a second signal processing circuit;
wherein said first set of speech data values are sent in a streaming fashion over said channel before silence is detected and/or said speech utterance is completed; and
said second signal processing circuit being configured to generate a second set of speech data values based on receiving and processing said first set of speech data values during said speech utterance and before silence is detected, such that said second set of speech data values contains sufficient information to be usable by a word recognition engine for recognizing words in said speech utterance;
and further wherein at least some words are recognized in real-time and output as text before said speech utterance is completed. (Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
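The "second set of speech data values" computed server-side while speech is still arriving can be sketched as incremental derivation of delta (velocity) features from the streamed static vectors: as soon as one frame of lookahead exists, the server emits an enriched vector, so decoding can begin before the utterance ends. The single-frame central-difference window is an assumption for illustration.

```python
import numpy as np

def incremental_deltas(frames):
    """Yield (static, delta) pairs with one frame of lookahead.

    `frames` is an iterable of static feature vectors arriving in
    streaming order; a pair is produced as soon as both neighbours of a
    frame are available, i.e. while the utterance is still in progress.
    """
    buf = []
    for f in frames:
        buf.append(np.asarray(f, dtype=float))
        if len(buf) >= 3:
            prev, cur, nxt = buf[-3], buf[-2], buf[-1]
            yield cur, (nxt - prev) / 2.0   # simple central difference
```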
22. A system for recognizing speech information, comprising:
storage means for storing one or more words to be recognized by the system based on the speech information; and
means for capturing speech signals corresponding to the speech information; and
first processing means for generating partially recognized speech data from said speech signals, said first processing means performing a first signal processing operation on said speech signals, said first signal processing operation being insufficient to permit said partially recognized speech data to be correlated with said one or more words; and
second processing means for generating fully recognized speech data from said partially recognized speech data, using a second signal processing operation, such that said fully recognized speech data can be correlated with said one or more words, said second processing means being distinct and physically separated from said first processing means; and
third processing means for generating recognized sentence data from said one or more words using natural language processing operations including word phrase analysis performed on said one or more words;
a non-permanent data transmission connection coupling said first and second processing means;
transmitting means for transmitting said partially recognized speech data from said first processing means through said non-permanent data transmission connection to said second processing means, said transmitting means using a continuously generated byte stream that is transmitted while a speech utterance is occurring and before silence is detected;
wherein the system recognizes a complete sentence included in the speech information based on said recognized sentence data and determines a best response to said complete sentence in real-time. (Dependent claims: 23, 24, 25, 26, 27, 28, 29, 30, 33)
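The "best response" step of claim 22 matches the recognized sentence against stored questions. In the abstract this is done with optimized SQL full-text queries plus natural-language narrowing; the toy word-overlap scorer below only illustrates the matching idea, and its data structure is invented for the example.

```python
def best_response(query: str, qa_pairs: dict) -> str:
    """Return the stored answer whose question best overlaps the query.

    `qa_pairs` maps stored question text to its answer. Scoring by raw
    word overlap is a stand-in for the patent's full-text search plus
    natural-language narrowing.
    """
    q_words = set(query.lower().split())
    best = max(qa_pairs,
               key=lambda stored: len(q_words & set(stored.lower().split())))
    return qa_pairs[best]
```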
34. A method of performing speech recognition comprising the steps of:
(a) receiving user speech utterance signals representing speech utterances to be recognized, said speech utterances including sentences comprised of one or more words; and
(b) generating representative speech data values with a first computing device which by themselves are insufficient for completing recognition of said one or more words contained in said speech utterance signals; and
(c) formatting said representative speech data values into a transmission format suitable for transmission over a communications channel from said first computing device to a second computing device; and
wherein said representative speech data values are transmitted continuously during said speech utterances within streaming packets and without waiting for silence to be detected and/or said speech utterances to be completed; and
(d) performing a recognition of said one or more words at said second computing device using said representative speech data values and additional speech data values derived from said representative speech data values to generate recognized text;
(e) performing a natural language processing operation on said recognized text to determine a meaning associated with said sentences in real-time. (Dependent claims: 35, 36, 37)
38. A speech recognition program operating on a server coupled through a network to a client device, the program comprising:
a receiving routine for receiving speech data from the client device, said speech data being associated with a speech utterance from a user of the client device;
wherein said speech data has a data content that is adjusted based on signal processing capabilities of the client device and resources available to the server for processing speech data;
further wherein said speech data is transmitted continuously by the client device during said speech utterance within streaming packets and without waiting for silence to be detected and/or said speech utterance to be completed;
a speech recognition processing routine, for recognizing words contained in said speech utterance;
wherein when said speech data from the client device is insufficient for recognizing words, additional speech related data is computed by the server to generate additional speech data to be combined with said speech data as an input to said speech recognition processing routine;
a natural language processing routine, for recognizing word sentences contained in said speech utterance by performing one or more natural language processing operations on words contained in said speech utterance. (Dependent claims: 39, 40, 41, 42, 43, 44, 45, 46, 47)
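Claim 38's conditional augmentation, sketched: if the client's vector is already full-sized it goes straight to the recognizer; otherwise the server derives the remaining coefficients from neighbouring frames. The 13/39 dimensions and the difference-based delta/acceleration derivation are common conventions assumed for illustration, not specified by the claim.

```python
import numpy as np

STATIC, FULL = 13, 39   # hypothetical dimensions

def augment(prev_static, cur_static, next_static):
    """Build a FULL-dim observation from three consecutive static vectors."""
    delta = (next_static - prev_static) / 2.0              # velocity
    accel = next_static - 2.0 * cur_static + prev_static   # acceleration
    return np.concatenate([cur_static, delta, accel])

def to_observation(vec, context=None):
    """Pass a full vector through, or augment an insufficient one."""
    vec = np.asarray(vec, dtype=float)
    if vec.shape[0] == FULL:        # client already sent everything
        return vec
    prev_s, next_s = context        # neighbouring static frames
    return augment(np.asarray(prev_s, float), vec, np.asarray(next_s, float))
```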
48. A method of performing speech recognition at a server coupled to a client device through a network, the method comprising the steps of:
(a) determining signal processing capabilities of the client device available for processing speech data associated with a user speech utterance;
(b) determining resources available to the server for processing speech related data that is to be transmitted by the client device to the server;
(c) configuring a data content to be used for transmitting said speech related data through the network based on the results of steps (a) and (b);
(d) receiving said speech related data with said data content at the server continuously from the client device during said speech utterance within streaming packets and without waiting for silence to be detected and/or said speech utterance to be completed;
(e) processing said speech related data at the server so that additional speech related data is computed at the server from said speech data and is used to augment said speech related data when said speech related data contains acoustic feature data from said speech utterance that is insufficient for the server to perform accurate recognition of words contained in said speech utterance;
(f) generating a speech data observation vector suitable for processing by a speech recognition routine, said speech data observation vector being based on said speech related data as received from the client device, and/or said speech related data as augmented by the server;
(g) recognizing words contained in said speech utterance with a speech recognition engine using said speech data observation vector;
(h) recognizing word sentences contained in said speech utterance by performing one or more natural language processing operations on words contained in said speech utterance. (Dependent claims: 49, 50, 51, 52, 53, 54, 55, 56, 57)
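Steps (a) through (c) of claim 48 amount to a capability handshake before streaming begins: the configured data content reflects both the client's processing power and the server's available resources. The field names and the min-based negotiation rule below are invented for illustration.

```python
def configure_data_content(client_caps: dict, server_caps: dict) -> dict:
    """Agree on a per-frame data content before streaming starts.

    (a) client_caps describes client-side signal processing capability,
    (b) server_caps describes server resources for augmentation,
    (c) the richest feature set both sides can afford is selected.
    """
    n = min(client_caps["max_coeffs_per_frame"],
            server_caps["max_coeffs_per_frame"])
    return {"coeffs_per_frame": n,
            "stream": True,            # step (d): packets flow during speech
            "wait_for_silence": False}
```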
Specification