Systems and methods for responding to natural language speech utterance
First Claim
1. A system for processing multi-modal natural language inputs, comprising:
a multi-modal voice user interface configured to receive a multi-modal input, the multi-modal input including a natural language utterance and a non-speech input, wherein a transcription module coupled to the multi-modal voice user interface is configured to transcribe the non-speech input to create a non-speech-based transcription;
a multi-pass speech recognition module configured to transcribe the natural language utterance into text;
a merging module configured to merge the text of the transcribed utterance and the non-speech-based transcription to create a merged transcription;
a plurality of domain agents, wherein a context description grammar includes one or more grammar expression entries that one or more of the plurality of domain agents are configured to use to process requests in respective contexts;
a knowledge-enhanced speech recognition engine configured to determine a most likely context for the multi-modal input, the knowledge-enhanced speech recognition engine further configured to:
identify one or more contexts that completely or partially match one or more text combinations contained in the merged transcription, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
score each of the identified matching contexts; and
select the matching context having a highest score as the most likely context for the multi-modal input; and
a response generating module configured to identify one or more of the plurality of domain agents that are configured to process requests in the most likely context for the multi-modal input, the response generating module configured to:
communicate a request to the identified domain agents, the request formulated using at least one grammar expression entry in the context description grammar; and
generate a response to the multi-modal input using content gathered as a result of the identified domain agents processing the request.
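The identify/score/select sequence claimed above can be illustrated with a minimal sketch. This is not the patent's implementation: the function name, the dictionary-of-phrases stand-in for the context description grammar, the list stand-in for the context stack, and the additive scoring weights are all hypothetical, chosen only to make the three claimed steps concrete.

```python
# Illustrative sketch only -- a toy rendering of the claimed steps:
# identify matching contexts, score them, select the highest-scoring one.
# Data structures and scoring weights are hypothetical, not from the patent.

def select_context(merged_transcription, grammar_entries, context_stack):
    """Pick the most likely context for a merged transcription.

    grammar_entries: dict mapping context name -> list of phrases
                     (stand-in for context description grammar entries)
    context_stack:   list of recently active context names, most recent
                     first (stand-in for the claimed 'expected contexts')
    """
    words = set(merged_transcription.lower().split())
    scores = {}
    for context, phrases in grammar_entries.items():
        # Identify contexts whose grammar expression entries completely or
        # partially match text combinations in the merged transcription.
        matched = sum(
            1 for phrase in phrases
            if set(phrase.lower().split()) & words
        )
        if matched == 0:
            continue
        score = matched
        # Boost contexts that also appear in the context stack (expected
        # contexts), weighting more recent contexts higher.
        if context in context_stack:
            score += len(context_stack) - context_stack.index(context)
        scores[context] = score
    # Select the matching context having the highest score.
    return max(scores, key=scores.get) if scores else None
```

Under this toy scoring, an utterance mentioning both a grammar phrase and a recently active context wins over a context matched only on text.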
Abstract
Systems and methods are provided for receiving speech and non-speech communications of natural language questions and/or commands, transcribing the speech and non-speech communications to textual messages, and executing the questions and/or commands. The invention applies context, prior information, domain knowledge, and user-specific profile data to achieve a natural environment for one or more users presenting questions or commands across multiple domains. The systems and methods create, store, and use extensive personal profile information for each user, thereby improving the reliability of determining the context of the speech and non-speech communications and presenting the expected results for a particular question or command.
16 Claims
1. A system for processing multi-modal natural language inputs, comprising the elements set forth in the First Claim above. - View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A method for interpreting natural language utterances using multi-pass automatic speech recognition, comprising:
receiving a natural language utterance at a computer comprising a multi-pass speech recognition module configured to use at least one of a dictation grammar or a virtual dictation grammar to transcribe the utterance into text, wherein the dictation grammar includes an unconstrained large vocabulary of words, and wherein the virtual dictation grammar includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words;
dynamically generating the constrained vocabulary of words based on one or more previously successful transcriptions; and
transcribing the utterance using the multi-pass speech recognition module. - View Dependent Claims (10, 11)
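A virtual dictation grammar as claimed here can be sketched as a small class. This is a toy, not the patented recognizer: the class name, the decoy placeholder, and building the constrained vocabulary by splitting prior transcriptions into words are all illustrative assumptions.

```python
# Illustrative sketch only -- a toy virtual dictation grammar whose
# constrained vocabulary is dynamically generated from previously
# successful transcriptions, with decoy entries absorbing
# out-of-vocabulary words. All names here are hypothetical.

class VirtualDictationGrammar:
    # Stand-ins for the claimed decoy words: utility words, nonsense
    # words, isolated syllables, isolated distinct sounds.
    DECOYS = ["uh", "um", "ba", "da", "ka"]

    def __init__(self, successful_transcriptions):
        # Dynamically generate the constrained vocabulary from words
        # observed in previously successful transcriptions.
        self.vocabulary = set()
        for text in successful_transcriptions:
            self.vocabulary.update(text.lower().split())

    def transcribe(self, recognized_words):
        # In-vocabulary words pass through; out-of-vocabulary words match
        # a decoy instead of being forced onto the nearest real word.
        out = []
        for word in recognized_words:
            if word.lower() in self.vocabulary:
                out.append(word)
            else:
                out.append("<decoy>")  # placeholder for whichever decoy fired
        return " ".join(out)

g = VirtualDictationGrammar(["play the radio", "call home"])
g.transcribe(["play", "xylophone", "home"])  # -> "play <decoy> home"
```

The point of the decoys is that an out-of-vocabulary word is flagged rather than misrecognized as a similar-sounding in-vocabulary word, which is what makes the constrained grammar usable on platforms without full dictation support.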
9. A method for interpreting natural language utterances using multi-pass automatic speech recognition, comprising:
receiving a natural language utterance at a computer comprising a multi-pass speech recognition module configured to use at least one of a dictation grammar or a virtual dictation grammar to transcribe the utterance into text, wherein the dictation grammar includes an unconstrained large vocabulary of words, wherein the virtual dictation grammar includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words, and wherein the decoy words include utility words, nonsense words, isolated syllables, and isolated distinct sounds associated with a particular spoken language; and
transcribing the utterance using the multi-pass speech recognition module.
12. A method for interpreting natural language utterances using a knowledge-enhanced speech recognition engine, wherein the knowledge-enhanced speech recognition engine is configured to determine an intent and correct false recognitions of the natural language utterances, comprising:
receiving a transcription of a natural language utterance at a computer comprising the knowledge-enhanced speech recognition engine;
identifying one or more contexts that completely or partially match one or more text combinations contained in the transcription, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
scoring each of the identified matching contexts;
selecting the matching context having a highest score to determine a most likely context for the utterance; and
communicating a request to a domain agent configured to process requests in the most likely context for the utterance, the request formulated using at least one grammar expression entry in the context description grammar.
13. A method for processing natural language utterances, comprising:
receiving a natural language utterance at a computer comprising a multi-pass speech recognition module;
transcribing the utterance using the multi-pass speech recognition module, the multi-pass speech recognition module configured to transcribe the utterance into text;
identifying one or more contexts that completely or partially match one or more text combinations contained in the text of the transcribed utterance, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
scoring each of the identified matching contexts;
selecting the matching context having a highest score to determine a most likely context for the utterance; and
communicating a request to a domain agent configured to process requests in the most likely context for the utterance, the request formulated using at least one grammar expression entry in the context description grammar.
14. A system for interpreting natural language utterances, comprising:
a voice user interface configured to receive a natural language utterance; and
a platform that includes a multi-pass speech recognition module, wherein the multi-pass speech recognition module is configured to:
use a dictation grammar that includes an unconstrained large vocabulary of words to create a speech-based transcription of the natural language utterance if the dictation grammar is available on the platform;
use a virtual dictation grammar that includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words to create the speech-based transcription if the dictation grammar is not available on the platform; and
dynamically generate the constrained vocabulary of words based on one or more previously successful transcriptions.
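The platform check in this claim, preferring the full dictation grammar and falling back to the virtual dictation grammar, reduces to a simple selection rule. A minimal sketch, with a hypothetical function name and a plain dict standing in for the platform's available grammars:

```python
# Illustrative sketch only -- the claimed platform fallback: use an
# unconstrained dictation grammar when the platform provides one,
# otherwise fall back to a constrained virtual dictation grammar.
# The "dictation" key and function name are illustrative assumptions.

def choose_grammar(platform_grammars, virtual_grammar):
    """Return the grammar the multi-pass recognizer should transcribe with."""
    # Prefer the unconstrained large-vocabulary dictation grammar when
    # the platform makes one available.
    if "dictation" in platform_grammars:
        return platform_grammars["dictation"]
    # Otherwise use the virtual dictation grammar (constrained vocabulary
    # plus decoy words for out-of-vocabulary input).
    return virtual_grammar
```

The design point is graceful degradation: the same multi-pass recognizer runs on platforms with and without dictation support, with only the transcription quality differing.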
15. A system for interpreting natural language utterances, comprising:
a voice user interface configured to receive a natural language utterance;
a multi-pass speech recognition module configured to transcribe the natural language utterance into text; and
a knowledge-enhanced speech recognition engine configured to determine a most likely context for the natural language utterance, the knowledge-enhanced speech recognition engine further configured to:
receive the text of the transcribed natural language utterance;
identify one or more contexts that completely or partially match one or more text combinations contained in the text of the transcribed utterance, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
score each of the identified matching contexts;
select the matching context having a highest score to determine the most likely context for the utterance; and
communicate a request to a domain agent configured to process requests in the most likely context for the utterance, the request formulated using at least one grammar expression entry in the context description grammar.
16. A system for interpreting natural language utterances, comprising:
a voice user interface configured to receive a natural language utterance; and
a platform that includes a multi-pass speech recognition module, wherein the multi-pass speech recognition module is configured to:
use a dictation grammar that includes an unconstrained large vocabulary of words to create a speech-based transcription of the natural language utterance if the dictation grammar is available on the platform; and
use a virtual dictation grammar that includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words to create the speech-based transcription if the dictation grammar is not available on the platform, wherein the decoy words include utility words, nonsense words, isolated syllables, and isolated distinct sounds associated with a particular spoken language.
Specification