SYSTEM AND METHOD OF SUPPORTING ADAPTIVE MISRECOGNITION IN CONVERSATIONAL SPEECH

US 20110131036A1
Filed: 02/07/2011
Published: 06/02/2011
Est. Priority Date: 08/10/2005
Status: Active Grant

Abstract

A system and method are provided for receiving speech and/or non-speech communications of natural language questions and/or commands and executing the questions and/or commands. The invention provides a conversational human-machine interface that includes a conversational speech analyzer, a general cognitive model, an environmental model, and a personalized cognitive model to determine context, domain knowledge, and invoke prior information to interpret a spoken utterance or a received non-spoken message. The system and method creates, stores, and uses extensive personal profile information for each user, thereby improving the reliability of determining the context of the speech or non-speech communication and presenting the expected results for a particular question or command.

565 Citations

44 Claims

1. A system for processing natural language utterances, comprising:
- a multimodal device configured to receive a natural language utterance;
  
  a speech recognition engine configured to recognize one or more words from the natural language utterance;
  
  a parser configured to generate an interpretation of the natural language utterance from the one or more recognized words, and further configured to generate a request based on the interpretation of the natural language utterance;
  
  a domain agent configured to process the generated request; and
  
  an adaptive misrecognition engine configured to monitor one or more actions associated with the domain agent processing the request and determine whether the interpretation of the natural language utterance is correct or incorrect based on the one or more monitored actions.
  - 2. The system of claim 1, wherein the adaptive misrecognition engine is further configured to generate an unrecognized event in response to determining that the interpretation of the natural language utterance is incorrect.
  - 3. The system of claim 2, further comprising an analyzer configured to:
    - analyze the unrecognized event to determine how the natural language utterance was incorrectly interpreted; and
      
      determine one or more tuning parameters for at least one of the speech recognition engine or the parser based on how the natural language utterance was incorrectly interpreted, wherein the tuning parameters are used to improve interpretations of subsequent natural language utterances relating to the request.
  - 4. The system of claim 1, further comprising an analyzer configured to:
    - track an interaction pattern with the system over time for a user that provided the natural language utterance;
      
      generate a personalized cognitive model for the user based on the interaction pattern tracked for the user; and
      
      use the personalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 5. The system of claim 1, further comprising an analyzer configured to track interaction patterns with the system over time for a plurality of users.
  - 6. The system of claim 5, wherein the analyzer is further configured to generate a generalized cognitive model for the plurality of users based on the interaction patterns tracked for the plurality of users, wherein the generalized cognitive model includes a statistical abstract that corresponds to the tracked interaction patterns.
  - 7. The system of claim 6, wherein the analyzer is further configured to use the generalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 8. The system of claim 1, further comprising an analyzer configured to generate an environmental model that includes information associated with at least one of environmental conditions or surroundings associated with a user that provided the natural language utterance.
  - 9. The system of claim 8, wherein the environmental conditions or surroundings include one or more of a global position of the user, movement information associated with the user, quiet or noisy conditions associated with an environment of the user, or a vicinity to one or more voice-enabled devices.
  - 10. The system of claim 8, wherein the environmental model provides one or more of context, domain knowledge, preferences, or cognitive qualities to enhance the interpretation of the natural language utterance.
  - 11. The system of claim 1, further comprising:
    - a knowledge-enhanced speech recognition engine configured to determine a most likely context for the natural language utterance, wherein the knowledge-enhanced speech recognition engine is further configured to:
      
      compare one or more text combinations against one or more grammar expression entries in a context description grammar to identify one or more contexts that completely or partially match the one or more text combinations;
      
      provide a relevance score for each of the identified matching contexts; and
      
      select the matching context having a highest score as the most likely context for the natural language utterance, wherein the domain agent configured to process the generated request is associated with the selected context; and
      
      a response generating module configured to:
      
      communicate the request to the domain agent associated with the selected context; and
      
      generate a response to the natural language utterance using content gathered as a result of the domain agent processing the request, wherein the response arranges the content in an order based on the relevance scores for the identified matching contexts.
  - 12. The system of claim 11, wherein the response generated by the response generating module includes an aggregation of the content gathered as a result of the domain agent processing the request.
  - 13. The system of claim 11, further comprising a personality module configured to format the response.
  - 14. The system of claim 11, wherein the knowledge-enhanced speech recognition engine is further configured to compare the text combinations against a context stack that stores one or more expected contexts to identify the one or more contexts.
  - 15. The system of claim 11, wherein the knowledge-enhanced speech recognition engine is further configured to apply prior probabilities or fuzzy possibilities to at least one of keyword matching, user profiles, or a dialog history to identify the one or more contexts.
  - 16. The system of claim 11, wherein the domain agent is further configured to direct a query to at least one of a local information source or a network information source to process the request.
  - 17. The system of claim 16, wherein the domain agent is further configured to evaluate a plurality of responses to the query to process the request.
  - 18. The system of claim 11, wherein the domain agent is further configured to direct a command to at least one of a local device or a remote device to process the request.
  - 19. The system of claim 1, wherein the multimodal device includes at least one of a personal digital assistant, a cellular telephone, a portable computer, or a desktop computer.
  - 20. The system of claim 1, wherein the multimodal device is further configured to subsequently receive one or more follow-up multimodal inputs.
  - 21. The system of claim 20, wherein the speech recognition engine is further configured to recognize one or more words from a natural language utterance provided in the follow-up multimodal input, and wherein the parser is further configured to generate an interpretation of the follow-up multimodal input from the one or more words recognized from the natural language utterance provided in the follow-up multimodal input.
  - 22. The system of claim 20, wherein the follow-up multimodal input includes a follow-up request associated with a same context as the request being processed by the domain agent.
  - 23. The system of claim 1, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to a user providing a subsequent request to stop the request being processed by the domain agent.
  - 24. The system of claim 1, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to a user repeating the natural language utterance.
  - 25. The system of claim 1, wherein the multimodal device is further configured to receive a non-speech input relating to the natural language utterance, and wherein the system further comprises:
    - a transcription module configured to transcribe the non-speech input to create a non-speech-based transcription; and
      
      a merging module configured to merge the recognized words and the non-speech-based transcription to create a merged transcription, wherein the parser is further configured to generate the interpretation of the natural language utterance from the merged transcription.
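
Claims 1, 23, and 24 describe an adaptive misrecognition engine that monitors actions during request processing and treats a stop request or a repeated utterance as a sign that the interpretation was incorrect. The following is a minimal Python sketch of that decision only, not the patented implementation. The `Action` record, the `interpretation_incorrect` function, and the string-similarity threshold used to detect a repeat are all hypothetical choices made for illustration; the claims do not specify how a repetition is detected.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Action:
    """One monitored action (hypothetical shape, not defined in the claims)."""
    kind: str       # "stop" for a stop request, "utterance" for a new utterance
    text: str = ""  # recognized words, used only for "utterance" actions


def interpretation_incorrect(original_utterance: str,
                             actions: List[Action],
                             repeat_threshold: float = 0.9) -> bool:
    """Return True if the monitored actions suggest a misrecognition.

    Two signals are checked, following claims 23 and 24:
    a request to stop processing, or a near-verbatim repeat of the utterance.
    The similarity threshold is an assumed tuning value.
    """
    for action in actions:
        if action.kind == "stop":
            return True
        if action.kind == "utterance":
            similarity = SequenceMatcher(
                None, original_utterance.lower(), action.text.lower()
            ).ratio()
            if similarity >= repeat_threshold:
                return True
    return False


if __name__ == "__main__":
    print(interpretation_incorrect("play jazz", [Action("stop")]))
    print(interpretation_incorrect("play jazz", [Action("utterance", "Play jazz")]))
    print(interpretation_incorrect("play jazz", [Action("utterance", "what is the weather")]))
```

In a fuller system, a `True` result would presumably trigger the unrecognized event of claim 2, which the analyzer of claim 3 could then use to derive tuning parameters.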

26. A method for processing natural language utterances, comprising:
- receiving a natural language utterance at a multimodal device;
  
  recognizing one or more words from the natural language utterance using a speech recognition engine coupled to the multimodal device;
  
  generating an interpretation of the natural language utterance from the one or more recognized words using a parser coupled to the multimodal device, wherein the parser generates a request based on the interpretation of the natural language utterance;
  
  invoking a domain agent configured to process the generated request;
  
  monitoring one or more actions associated with the domain agent processing the request using an adaptive misrecognition engine; and
  
  determining, at the adaptive misrecognition engine, whether the interpretation of the natural language utterance is correct or incorrect based on the one or more monitored actions.
  - 27. The method of claim 26, wherein the adaptive misrecognition engine generates an unrecognized event in response to determining that the interpretation of the natural language utterance is incorrect.
  - 28. The method of claim 27, further comprising:
    - analyzing the unrecognized event to determine how the natural language utterance was incorrectly interpreted; and
      
      determining one or more tuning parameters for at least one of the speech recognition engine or the parser based on how the natural language utterance was incorrectly interpreted, wherein the tuning parameters are used to improve interpretations of subsequent natural language utterances relating to the request.
  - 29. The method of claim 26, further comprising:
    - tracking an interaction pattern over time for a user that provided the natural language utterance;
      
      generating a personalized cognitive model for the user based on the interaction pattern tracked for the user; and
      
      using the personalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 30. The method of claim 26, further comprising tracking interaction patterns over time for a plurality of users.
  - 31. The method of claim 30, further comprising generating a generalized cognitive model for the plurality of users based on the interaction patterns tracked for the plurality of users, wherein the generalized cognitive model includes a statistical abstract that corresponds to the tracked interaction patterns.
  - 32. The method of claim 31, further comprising using the generalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 33. The method of claim 26, further comprising generating an environmental model that includes information associated with at least one of environmental conditions or surroundings associated with a user that provided the natural language utterance.
  - 34. The method of claim 33, wherein the environmental conditions or surroundings include one or more of a global position of the user, movement information associated with the user, quiet or noisy conditions associated with an environment of the user, or a vicinity to one or more voice-enabled devices.
  - 35. The method of claim 33, wherein the environmental model provides one or more of context, domain knowledge, preferences, or cognitive qualities to enhance the interpretation of the natural language utterance.
  - 36. The method of claim 26, further comprising determining a most likely context for the natural language utterance using a knowledge-enhanced speech recognition engine, wherein determining the most likely context further includes:
    - comparing one or more text combinations against one or more grammar expression entries in a context description grammar to identify one or more contexts that completely or partially match the one or more text combinations;
      
      providing a relevance score for each of the identified matching contexts;
      
      selecting the matching context having a highest score as the most likely context for the natural language utterance, wherein the domain agent configured to process the generated request is associated with the selected context;
      
      communicating the request to the domain agent associated with the selected context; and
      
      generating a response to the natural language utterance using content gathered as a result of the domain agent processing the request, wherein the response arranges the content in an order based on the relevance scores for the identified matching contexts.
  - 37. The method of claim 36, wherein the response includes an aggregation of the content gathered as a result of the domain agent processing the request.
  - 38. The method of claim 36, further comprising formatting the response using a personality module.
  - 39. The method of claim 36, wherein the knowledge-enhanced speech recognition engine further compares the text combinations against a context stack that stores one or more expected contexts to identify the one or more contexts.
  - 40. The method of claim 36, wherein the knowledge-enhanced speech recognition engine further applies prior probabilities or fuzzy possibilities to at least one of keyword matching, user profiles, or a dialog history to identify the one or more contexts.
  - 41. The method of claim 26, further comprising receiving one or more follow-up multimodal inputs at the multimodal device.
  - 42. The method of claim 26, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to a user providing a subsequent request to stop the request being processed by the domain agent.
  - 43. The method of claim 26, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to a user repeating the natural language utterance.
  - 44. The method of claim 26, further comprising:
    - receiving a non-speech input relating to the natural language utterance at the multimodal device;
      
      transcribing the non-speech input to create a non-speech-based transcription; and
      
      merging the recognized words and the non-speech-based transcription to create a merged transcription, wherein the parser is further configured to generate the interpretation of the natural language utterance from the merged transcription.
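
The context selection recited in claims 11 and 36 (match text against a context description grammar, score each matching context, pick the highest score, and order response content by those scores) can be sketched as below. This is an illustrative sketch under stated assumptions: the grammar is modeled as a plain dictionary of keywords per context, and the relevance score is assumed to be the fraction of a context's grammar entries found in the utterance. The claims do not define the scoring formula, and all function names here are hypothetical.

```python
from typing import Dict, List, Optional


def score_contexts(words: List[str],
                   context_grammar: Dict[str, List[str]]) -> Dict[str, float]:
    """Score every context that completely or partially matches the words.

    Assumed score: matched grammar entries divided by total entries.
    Contexts with no matching entry are left out.
    """
    tokens = {w.lower() for w in words}
    scores: Dict[str, float] = {}
    for context, entries in context_grammar.items():
        matched = sum(1 for entry in entries if entry in tokens)
        if matched:
            scores[context] = matched / len(entries)
    return scores


def most_likely_context(scores: Dict[str, float]) -> Optional[str]:
    """Select the matching context with the highest relevance score."""
    return max(scores, key=scores.get) if scores else None


def order_content(content_by_context: Dict[str, List[str]],
                  scores: Dict[str, float]) -> List[str]:
    """Arrange gathered content in descending order of context relevance."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [item for ctx in ranked for item in content_by_context.get(ctx, [])]


if __name__ == "__main__":
    grammar = {
        "weather": ["weather", "forecast", "rain"],
        "music": ["play", "song", "jazz"],
    }
    scores = score_contexts(["play", "jazz", "rain"], grammar)
    print(most_likely_context(scores))
    print(order_content({"weather": ["w1"], "music": ["m1", "m2"]}, scores))
```

A production engine would likely combine such scores with the context stack, prior probabilities, user profiles, and dialog history mentioned in claims 39 and 40; none of that is modeled here.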

Current Assignee
Dialect, LLC
Original Assignee
VoiceBox Technologies, Inc. (Microsoft Corporation)
Inventors
Weider, Chris; DiCristo, Philippe; Kennewick, Robert A.

Granted Patent

US 8,620,659 B2
US Class Current

704/9
CPC Class Codes

G06F 40/232   Orthographic correction, e....

G10L 15/08   Speech classification or se...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/22   Procedures used during a sp...
