SYSTEM AND METHOD OF SUPPORTING ADAPTIVE MISRECOGNITION IN CONVERSATIONAL SPEECH

US 20100023320A1
Filed: 10/01/2009
Published: 01/28/2010
Est. Priority Date: 08/10/2005
Status: Active Grant

Abstract

A system and method are provided for receiving speech and/or non-speech communications of natural language questions and/or commands and executing the questions and/or commands. The invention provides a conversational human-machine interface that includes a conversational speech analyzer, a general cognitive model, an environmental model, and a personalized cognitive model to determine context, domain knowledge, and invoke prior information to interpret a spoken utterance or a received non-spoken message. The system and method creates, stores and uses extensive personal profile information for each user, thereby improving the reliability of determining the context of the speech or non-speech communication and presenting the expected results for a particular question or command.

50 Claims

1. A system for processing natural language utterances, comprising:
- a multimodal device configured to receive a natural language utterance;
  
  a speech recognition engine configured to recognize one or more words from the natural language utterance;
  
  a parser configured to generate an interpretation of the natural language utterance from the one or more recognized words, and further configured to generate a request based on the interpretation of the natural language utterance;
  
  a domain agent configured to process the generated request; and
  
  an adaptive misrecognition engine configured to:
  
  monitor one or more actions associated with the domain agent processing the request; and
  
  determine whether the interpretation of the natural language utterance is correct or incorrect based on the one or more monitored actions.
- - 2. The system of claim 1, wherein the adaptive misrecognition engine is further configured to generate an unrecognized event in response to determining that the interpretation of the natural language utterance is incorrect.
  - 3. The system of claim 2, further comprising an analyzer configured to:
    - track an interaction pattern with the system over time for a user that provided the natural language utterance;
      
      generate a personalized cognitive model for the user based on the interaction pattern tracked for the user; and
      
      use the personalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 4. The system of claim 3, wherein the analyzer is further configured to update the personalized cognitive model based on a frequency of incorrect interpretations for the request.
  - 5. The system of claim 3, wherein the parser is further configured to generate a plurality of interpretations of the natural language utterance, and wherein the analyzer is further configured to use the personalized cognitive model to select a next best interpretation from the plurality of interpretations in response to the adaptive misrecognition engine determining that the interpretation generated by the parser is incorrect.
  - 6. The system of claim 3, wherein the analyzer is further configured to track interaction patterns with the system over time for a plurality of users.
  - 7. The system of claim 6, wherein the analyzer is further configured to generate a generalized cognitive model for the plurality of users based on the interaction patterns tracked for the plurality of users, wherein the generalized cognitive model includes a statistical abstract that corresponds to the tracked interaction patterns.
  - 8. The system of claim 7, wherein the analyzer is further configured to use the generalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 9. The system of claim 7, wherein the analyzer is further configured to update the generalized cognitive model based on a frequency of incorrect interpretations for the request.
  - 10. The system of claim 3, wherein the analyzer is further configured to generate an environmental model that includes information associated with at least one of environmental conditions or surroundings associated with the user.
  - 11. The system of claim 10, wherein the environmental conditions or surroundings include one or more of a global position of the user, movement information associated with the user, quiet or noisy conditions associated with an environment of the user, or a vicinity to one or more voice-enabled devices.
  - 12. The system of claim 10, wherein the environmental model provides one or more of context, domain knowledge, preferences, or cognitive qualities to enhance the interpretation of the natural language utterance.
  - 13. The system of claim 2, further comprising an analyzer configured to:
    - analyze the unrecognized event to determine how the natural language utterance was incorrectly interpreted; and
      
      determine one or more tuning parameters for at least one of the speech recognition engine or the parser based on how the natural language utterance was incorrectly interpreted, wherein the tuning parameters are used to improve interpretations of subsequent natural language utterances relating to the request.
  - 14. The system of claim 1, further comprising:
    - a knowledge-enhanced speech recognition engine configured to determine a most likely context for the natural language utterance, wherein the knowledge-enhanced speech recognition engine is further configured to:
      
      identify one or more contexts that completely or partially match one or more text combinations, wherein identifying the one or more contexts includes comparing the text combinations against one or more grammar expression entries in a context description grammar;
      
      provide a relevance score for each of the identified matching contexts; and
      
      select the matching context having a highest score as the most likely context for the natural language utterance, wherein the domain agent configured to process the generated request is associated with the selected context; and
      
      a response generating module configured to:
      
      communicate the request to the domain agent associated with the selected context; and
      
      generate a response to the natural language utterance using content gathered as a result of the domain agent processing the request, wherein the response arranges the content in an order based on the relevance scores for the identified matching contexts.
  - 15. The system of claim 14, wherein the response generated by the response generating module includes an aggregation of the content gathered as a result of the domain agent processing the request.
  - 16. The system of claim 14, further comprising a personality module configured to format the response.
  - 17. The system of claim 14, wherein the knowledge-enhanced speech recognition engine identifying the one or more contexts further includes comparing the text combinations against a context stack that stores one or more expected contexts.
  - 18. The system of claim 14, wherein the knowledge-enhanced speech recognition engine identifying the one or more contexts further includes applying prior probabilities or fuzzy possibilities to at least one of keyword matching, user profiles, or a dialog history.
  - 19. The system of claim 14, wherein the domain agent is further configured to direct a query to at least one of a local information source or a network information source to process the request.
  - 20. The system of claim 19, wherein the domain agent is further configured to evaluate a plurality of responses to the query to process the request.
  - 21. The system of claim 14, wherein the domain agent is further configured to direct a command to at least one of a local device or a remote device to process the request.
  - 22. The system of claim 1, wherein the multimodal device includes at least one of a personal digital assistant, a cellular telephone, a portable computer, or a desktop computer.
  - 23. The system of claim 1, wherein the multimodal device is further configured to subsequently receive one or more follow-up multimodal inputs.
  - 24. The system of claim 23, wherein the speech recognition engine is further configured to recognize one or more words from a natural language utterance provided in the follow-up multimodal input, and wherein the parser is further configured to generate an interpretation of the follow-up multimodal input from the one or more words recognized from the natural language utterance provided in the follow-up multimodal input.
  - 25. The system of claim 24, wherein the follow-up multimodal input includes a follow-up request associated with a same context as the request being processed by the domain agent.
  - 26. The system of claim 1, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to a user providing a subsequent request to stop the request being processed by the domain agent.
  - 27. The system of claim 1, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to a user repeating the natural language utterance.
  - 28. The system of claim 1, wherein the multimodal device is further configured to receive a non-speech input relating to the natural language utterance, and wherein the system further comprises:
    - a transcription module configured to transcribe the non-speech input to create a non-speech-based transcription; and
      
      a merging module configured to merge the recognized words and the non-speech-based transcription to create a merged transcription, wherein the parser is further configured to generate the interpretation of the natural language utterance from the merged transcription.
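
The monitoring behavior recited in claims 2, 5, 26, and 27 can be illustrated with a minimal Python sketch. It is not taken from the specification: the class name `MisrecognitionMonitor`, the action labels `stop_request` and `repeat_utterance`, and the numeric `score` field are all hypothetical names chosen for illustration. The sketch assumes the parser has already produced several ranked interpretations, treats a stop or repeat action as evidence of an incorrect interpretation, records an unrecognized event, and falls back to the next-best candidate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interpretation:
    text: str      # parser's reading of the utterance
    score: float   # hypothetical confidence value; higher is better


@dataclass
class MisrecognitionMonitor:
    """Hypothetical stand-in for the adaptive misrecognition engine."""
    candidates: List[Interpretation]
    events: List[str] = field(default_factory=list)
    _index: int = 0

    def __post_init__(self) -> None:
        # Rank candidates once so that "next best" is well defined.
        self.candidates = sorted(
            self.candidates, key=lambda c: c.score, reverse=True
        )

    @property
    def current(self) -> Interpretation:
        return self.candidates[self._index]

    def observe(self, action: str) -> Optional[Interpretation]:
        """Return the next-best interpretation if the action signals an error."""
        # Assumed signals: the user stops the request or repeats the utterance.
        if action not in ("stop_request", "repeat_utterance"):
            return None  # no evidence of misrecognition
        # Record an unrecognized event for the interpretation judged incorrect.
        self.events.append(f"unrecognized:{self.current.text}")
        if self._index + 1 < len(self.candidates):
            self._index += 1
            return self.current
        return None  # no alternatives left
```

In this sketch a caller would feed each monitored action to `observe` and re-issue the request whenever a new interpretation is returned; a personalized cognitive model, as in claim 5, could replace the fixed score ordering.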

29. A method for processing natural language utterances, comprising:
- receiving a natural language utterance at a multimodal device;
  
  recognizing one or more words from the natural language utterance using a speech recognition engine coupled to the multimodal device;
  
  generating an interpretation of the natural language utterance from the one or more recognized words using a parser coupled to the multimodal device, wherein the parser further generates a request based on the interpretation of the natural language utterance;
  
  invoking a domain agent configured to process the generated request;
  
  monitoring one or more actions associated with the domain agent processing the request using an adaptive misrecognition engine; and
  
  determining whether the interpretation of the natural language utterance is correct or incorrect based on the actions monitored using the adaptive misrecognition engine.
- - 30. The method of claim 29, further comprising generating an unrecognized event in response to the adaptive misrecognition engine determining that the interpretation of the natural language utterance is incorrect.
  - 31. The method of claim 30, further comprising:
    - tracking an interaction pattern over time for a user that provided the natural language utterance;
      
      generating a personalized cognitive model for the user based on the interaction pattern tracked for the user; and
      
      using the personalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 32. The method of claim 31, further comprising updating the personalized cognitive model based on a frequency of incorrect interpretations for the request.
  - 33. The method of claim 31, further comprising:
    - generating a plurality of interpretations of the natural language utterance; and
      
      using the personalized cognitive model to select a next best interpretation from the plurality of interpretations in response to the adaptive misrecognition engine determining that the interpretation generated by the parser is incorrect.
  - 34. The method of claim 31, further comprising tracking interaction patterns over time for a plurality of users.
  - 35. The method of claim 34, further comprising generating a generalized cognitive model for the plurality of users based on the interaction patterns tracked for the plurality of users, wherein the generalized cognitive model includes a statistical abstract that corresponds to the tracked interaction patterns.
  - 36. The method of claim 35, further comprising using the generalized cognitive model to predict the one or more actions associated with the domain agent processing the request.
  - 37. The method of claim 35, further comprising updating the generalized cognitive model based on a frequency of incorrect interpretations for the request.
  - 38. The method of claim 30, further comprising generating an environmental model that includes information associated with at least one of environmental conditions or surroundings associated with the user.
  - 39. The method of claim 38, wherein the environmental conditions or surroundings include one or more of a global position of the user, movement information associated with the user, quiet or noisy conditions associated with an environment of the user, or a vicinity to one or more voice-enabled devices.
  - 40. The method of claim 38, wherein the environmental model provides one or more of context, domain knowledge, preferences, or cognitive qualities to enhance the interpretation of the natural language utterance.
  - 41. The method of claim 30, further comprising:
    - analyzing the unrecognized event to determine how the natural language utterance was incorrectly interpreted; and
      
      determining one or more tuning parameters for at least one of the speech recognition engine or the parser based on how the natural language utterance was incorrectly interpreted, wherein the tuning parameters are used to improve interpretations of subsequent natural language utterances relating to the request.
  - 42. The method of claim 29, further comprising determining a most likely context for the natural language utterance using a knowledge-enhanced speech recognition engine, wherein determining the most likely context further includes:
    - identifying one or more contexts that completely or partially match one or more text combinations, wherein identifying the one or more contexts includes comparing the text combinations against one or more grammar expression entries in a context description grammar;
      
      providing a relevance score for each of the identified matching contexts;
      
      selecting the matching context having a highest score as the most likely context for the natural language utterance, wherein the domain agent configured to process the generated request is associated with the selected context;
      
      communicating the request to the domain agent associated with the selected context; and
      
      generating a response to the natural language utterance using content gathered as a result of the domain agent processing the request, wherein the response arranges the content in an order based on the relevance scores for the identified matching contexts.
  - 43. The method of claim 42, wherein the response includes an aggregation of the content gathered as a result of the domain agent processing the request.
  - 44. The method of claim 42, further comprising formatting the response using a personality module.
  - 45. The method of claim 42, wherein identifying the one or more contexts further includes comparing the text combinations against a context stack that stores one or more expected contexts.
  - 46. The method of claim 45, wherein identifying the one or more contexts further includes applying prior probabilities or fuzzy possibilities to at least one of keyword matching, user profiles, or a dialog history.
  - 47. The method of claim 29, further comprising receiving one or more follow-up multimodal inputs at the multimodal device.
  - 48. The method of claim 29, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to the user providing a subsequent request to stop the request being processed by the domain agent.
  - 49. The method of claim 29, wherein the adaptive misrecognition engine determines that the interpretation of the natural language utterance was incorrect in response to the user repeating the natural language utterance.
  - 50. The method of claim 29, further comprising:
    - receiving a non-speech input relating to the natural language utterance at the multimodal device;
      
      transcribing the non-speech input to create a non-speech-based transcription; and
      
      merging the recognized words and the non-speech-based transcription to create a merged transcription, wherein the parser is further configured to generate the interpretation of the natural language utterance from the merged transcription.
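
The context-selection steps recited in claims 14 and 42 (match text against grammar entries, score each matching context, select the highest-scoring one) can be sketched as follows. The claims do not say how the relevance score is computed, so this sketch assumes a simple one: the fraction of a context's grammar entries that appear among the recognized words. The function names and the dictionary-based grammar are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def score_contexts(
    words: List[str], grammar: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Return (context, score) pairs for complete or partial matches, best first.

    Assumed scoring rule: share of a context's grammar entries found in the words.
    """
    tokens = {w.lower() for w in words}
    scored = []
    for context, entries in grammar.items():
        if not entries:
            continue  # skip contexts with no grammar entries
        score = len(tokens & entries) / len(entries)
        if score > 0:  # keep complete (1.0) and partial (0 < score < 1) matches
            scored.append((context, score))
    # Sort by descending score; break ties by name so the order is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def most_likely_context(
    words: List[str], grammar: Dict[str, Set[str]]
) -> Optional[str]:
    """Select the matching context with the highest relevance score."""
    ranked = score_contexts(words, grammar)
    return ranked[0][0] if ranked else None
```

The ranked list returned by `score_contexts` could also drive the ordering of response content, since the claims arrange content in an order based on the relevance scores.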

Current Assignee
Dialect, LLC
Original Assignee
VoiceBox Technologies, Inc. (Microsoft Corporation)
Inventors
Di Cristo, Philippe; Kennewick, Robert A.; Weider, Chris

Granted Patent

US 8,332,224 B2
US Class Current

704/9
CPC Class Codes

G06F 40/232   Orthographic correction, e....

G10L 15/08   Speech classification or se...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/22   Procedures used during a sp...
