Systems and methods for responding to natural language speech utterance
First Claim
1. A system for processing multi-modal natural language inputs, comprising:
a multi-modal voice user interface configured to receive a multi-modal input, the multi-modal input including a natural language utterance and a non-speech input, wherein a transcription module coupled to the multi-modal voice user interface is configured to transcribe the non-speech input to create a non-speech-based transcription;
a multi-pass speech recognition module configured to transcribe the natural language utterance into text;
a merging module configured to merge the text of the transcribed utterance and the non-speech-based transcription to create a merged transcription;
a plurality of domain agents, wherein a context description grammar includes one or more grammar expression entries that one or more of the plurality of domain agents are configured to use to process requests in respective contexts;
a knowledge-enhanced speech recognition engine configured to determine a most likely context for the multi-modal input, the knowledge-enhanced speech recognition engine further configured to:
identify one or more contexts that completely or partially match one or more text combinations contained in the merged transcription, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
score each of the identified matching contexts; and
select the matching context having a highest score as the most likely context for the multi-modal input; and
a response generating module configured to identify one or more of the plurality of domain agents that are configured to process requests in the most likely context for the multi-modal input, the response generating module configured to:
communicate a request to the identified domain agents, the request formulated using at least one grammar expression entry in the context description grammar; and
generate a response to the multi-modal input using content gathered as a result of the identified domain agents processing the request.
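The identify/score/select sequence claimed above can be illustrated with a minimal sketch. This is not the patent's implementation: the function name, the dictionary-of-phrases stand-in for the context description grammar, the list stand-in for the context stack, and the additive scoring weights are all hypothetical, chosen only to make the three claimed steps concrete.

```python
# Illustrative sketch only -- a toy rendering of the claimed steps:
# identify matching contexts, score them, select the highest-scoring one.
# Data structures and scoring weights are hypothetical, not from the patent.

def select_context(merged_transcription, grammar_entries, context_stack):
    """Pick the most likely context for a merged transcription.

    grammar_entries: dict mapping context name -> list of phrases
                     (stand-in for context description grammar entries)
    context_stack:   list of recently active context names, most recent
                     first (stand-in for the claimed 'expected contexts')
    """
    words = set(merged_transcription.lower().split())
    scores = {}
    for context, phrases in grammar_entries.items():
        # Identify contexts whose grammar expression entries completely or
        # partially match text combinations in the merged transcription.
        matched = sum(
            1 for phrase in phrases
            if set(phrase.lower().split()) & words
        )
        if matched == 0:
            continue
        score = matched
        # Boost contexts that also appear in the context stack (expected
        # contexts), weighting more recent contexts higher.
        if context in context_stack:
            score += len(context_stack) - context_stack.index(context)
        scores[context] = score
    # Select the matching context having the highest score.
    return max(scores, key=scores.get) if scores else None
```

Under this toy scoring, an utterance mentioning both a grammar phrase and a recently active context wins over a context matched only on text.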
Abstract
Systems and methods are provided for receiving speech and non-speech communications of natural language questions and/or commands, transcribing the speech and non-speech communications to textual messages, and executing the questions and/or commands. The invention applies context, prior information, domain knowledge, and user-specific profile data to achieve a natural environment for one or more users presenting questions or commands across multiple domains. The systems and methods create, store, and use extensive personal profile information for each user, thereby improving the reliability of determining the context of the speech and non-speech communications and presenting the expected results for a particular question or command.
16 Claims
1. A system for processing multi-modal natural language inputs, comprising the elements set forth in the First Claim above. - View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A method for interpreting natural language utterances using multi-pass automatic speech recognition, comprising:
receiving a natural language utterance at a computer comprising a multi-pass speech recognition module configured to use at least one of a dictation grammar or a virtual dictation grammar to transcribe the utterance into text, wherein the dictation grammar includes an unconstrained large vocabulary of words, and wherein the virtual dictation grammar includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words;
dynamically generating the constrained vocabulary of words based on one or more previously successful transcriptions; and
transcribing the utterance using the multi-pass speech recognition module. - View Dependent Claims (10, 11)
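A virtual dictation grammar as claimed here can be sketched as a small class. This is a toy, not the patented recognizer: the class name, the decoy placeholder, and building the constrained vocabulary by splitting prior transcriptions into words are all illustrative assumptions.

```python
# Illustrative sketch only -- a toy virtual dictation grammar whose
# constrained vocabulary is dynamically generated from previously
# successful transcriptions, with decoy entries absorbing
# out-of-vocabulary words. All names here are hypothetical.

class VirtualDictationGrammar:
    # Stand-ins for the claimed decoy words: utility words, nonsense
    # words, isolated syllables, isolated distinct sounds.
    DECOYS = ["uh", "um", "ba", "da", "ka"]

    def __init__(self, successful_transcriptions):
        # Dynamically generate the constrained vocabulary from words
        # observed in previously successful transcriptions.
        self.vocabulary = set()
        for text in successful_transcriptions:
            self.vocabulary.update(text.lower().split())

    def transcribe(self, recognized_words):
        # In-vocabulary words pass through; out-of-vocabulary words match
        # a decoy instead of being forced onto the nearest real word.
        out = []
        for word in recognized_words:
            if word.lower() in self.vocabulary:
                out.append(word)
            else:
                out.append("<decoy>")  # placeholder for whichever decoy fired
        return " ".join(out)

g = VirtualDictationGrammar(["play the radio", "call home"])
g.transcribe(["play", "xylophone", "home"])  # -> "play <decoy> home"
```

The point of the decoys is that an out-of-vocabulary word is flagged rather than misrecognized as a similar-sounding in-vocabulary word, which is what makes the constrained grammar usable on platforms without full dictation support.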
9. A method for interpreting natural language utterances using multi-pass automatic speech recognition, comprising:
receiving a natural language utterance at a computer comprising a multi-pass speech recognition module configured to use at least one of a dictation grammar or a virtual dictation grammar to transcribe the utterance into text, wherein the dictation grammar includes an unconstrained large vocabulary of words, wherein the virtual dictation grammar includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words, and wherein the decoy words include utility words, nonsense words, isolated syllables, and isolated distinct sounds associated with a particular spoken language; and
transcribing the utterance using the multi-pass speech recognition module.
12. A method for interpreting natural language utterances using a knowledge-enhanced speech recognition engine, wherein the knowledge-enhanced speech recognition engine is configured to determine an intent and correct false recognitions of the natural language utterances, comprising:
receiving a transcription of a natural language utterance at a computer comprising the knowledge-enhanced speech recognition engine;
identifying one or more contexts that completely or partially match one or more text combinations contained in the transcription, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
scoring each of the identified matching contexts;
selecting the matching context having a highest score to determine a most likely context for the utterance; and
communicating a request to a domain agent configured to process requests in the most likely context for the utterance, the request formulated using at least one grammar expression entry in the context description grammar.
13. A method for processing natural language utterances, comprising:
receiving a natural language utterance at a computer comprising a multi-pass speech recognition module;
transcribing the utterance using the multi-pass speech recognition module, the multi-pass speech recognition module configured to transcribe the utterance into text;
identifying one or more contexts that completely or partially match one or more text combinations contained in the text of the transcribed utterance, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
scoring each of the identified matching contexts;
selecting the matching context having a highest score to determine a most likely context for the utterance; and
communicating a request to a domain agent configured to process requests in the most likely context for the utterance, the request formulated using at least one grammar expression entry in the context description grammar.
14. A system for interpreting natural language utterances, comprising:
a voice user interface configured to receive a natural language utterance; and
a platform that includes a multi-pass speech recognition module, wherein the multi-pass speech recognition module is configured to:
use a dictation grammar that includes an unconstrained large vocabulary of words to create a speech-based transcription of the natural language utterance if the dictation grammar is available on the platform;
use a virtual dictation grammar that includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words to create the speech-based transcription if the dictation grammar is not available on the platform; and
dynamically generate the constrained vocabulary of words based on one or more previously successful transcriptions.
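The platform check in this claim, preferring the full dictation grammar and falling back to the virtual dictation grammar, reduces to a simple selection rule. A minimal sketch, with a hypothetical function name and a plain dict standing in for the platform's available grammars:

```python
# Illustrative sketch only -- the claimed platform fallback: use an
# unconstrained dictation grammar when the platform provides one,
# otherwise fall back to a constrained virtual dictation grammar.
# The "dictation" key and function name are illustrative assumptions.

def choose_grammar(platform_grammars, virtual_grammar):
    """Return the grammar the multi-pass recognizer should transcribe with."""
    # Prefer the unconstrained large-vocabulary dictation grammar when
    # the platform makes one available.
    if "dictation" in platform_grammars:
        return platform_grammars["dictation"]
    # Otherwise use the virtual dictation grammar (constrained vocabulary
    # plus decoy words for out-of-vocabulary input).
    return virtual_grammar
```

The design point is graceful degradation: the same multi-pass recognizer runs on platforms with and without dictation support, with only the transcription quality differing.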
15. A system for interpreting natural language utterances, comprising:
a voice user interface configured to receive a natural language utterance;
a multi-pass speech recognition module configured to transcribe the natural language utterance into text; and
a knowledge-enhanced speech recognition engine configured to determine a most likely context for the natural language utterance, the knowledge-enhanced speech recognition engine further configured to:
receive the text of the transcribed natural language utterance;
identify one or more contexts that completely or partially match one or more text combinations contained in the text of the transcribed utterance, wherein identifying the matching contexts includes comparing the text combinations against the grammar expression entries in the context description grammar and against one or more expected contexts stored in a context stack;
score each of the identified matching contexts;
select the matching context having a highest score to determine the most likely context for the utterance; and
communicate a request to a domain agent configured to process requests in the most likely context for the utterance, the request formulated using at least one grammar expression entry in the context description grammar.
16. A system for interpreting natural language utterances, comprising:
a voice user interface configured to receive a natural language utterance; and
a platform that includes a multi-pass speech recognition module, wherein the multi-pass speech recognition module is configured to:
use a dictation grammar that includes an unconstrained large vocabulary of words to create a speech-based transcription of the natural language utterance if the dictation grammar is available on the platform; and
use a virtual dictation grammar that includes a constrained vocabulary of words and a plurality of decoy words for out-of-vocabulary words to create the speech-based transcription if the dictation grammar is not available on the platform, wherein the decoy words include utility words, nonsense words, isolated syllables, and isolated distinct sounds associated with a particular spoken language.
Specification