System and method for a cooperative conversational voice user interface

US 8,073,681 B2
Filed: 10/16/2006
Issued: 12/06/2011
Est. Priority Date: 10/16/2006
Status: Active Grant

9 Assignments

1 Petition

Abstract

A cooperative conversational voice user interface is provided. The cooperative conversational voice user interface may build upon short-term and long-term shared knowledge to generate one or more explicit and/or implicit hypotheses about an intent of a user utterance. The hypotheses may be ranked based on varying degrees of certainty, and an adaptive response may be generated for the user. Responses may be worded based on the degrees of certainty and to frame an appropriate domain for a subsequent utterance. In one implementation, misrecognitions may be tolerated, and conversational course may be corrected based on subsequent utterances and/or responses.
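
The ranking and certainty-dependent wording described in the abstract can be illustrated with a minimal Python sketch. The `Hypothesis` class, the 0-to-1 certainty scale, the threshold values, and the reply templates are all assumptions made for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    intent: str        # hypothetical label for what the user may have meant
    certainty: float   # assumed scale: 0.0 (guess) to 1.0 (sure)


def rank(hypotheses):
    """Return hypotheses ordered from most to least certain."""
    return sorted(hypotheses, key=lambda h: h.certainty, reverse=True)


def word_response(best, high=0.8, low=0.5):
    """Word the reply according to certainty; both thresholds are illustrative."""
    if best.certainty >= high:
        # Confident: act without asking.
        return f"OK, {best.intent}."
    if best.certainty >= low:
        # Unsure: confirm, which also frames the domain of the next utterance.
        return f"I think you want to {best.intent}. Is that right?"
    # Very unsure: ask the user to rephrase.
    return "Sorry, could you say that another way?"


candidates = [Hypothesis("play some jazz", 0.62), Hypothesis("call Jess", 0.21)]
best = rank(candidates)[0]
reply = word_response(best)
```

Here the middle branch is taken, so the reply both states the leading hypothesis and invites a yes/no answer, which is one plausible way a response could frame the domain of the subsequent utterance.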

937 Citations

42 Claims

1. A method for providing a cooperative conversational voice user interface, comprising:
- receiving an utterance at a voice input device during a current conversation with a user, wherein the utterance includes one or more words that have different meanings in different contexts;
  
  accumulating short-term shared knowledge about the current conversation, wherein the short-term shared knowledge includes knowledge about the utterance received during the current conversation;
  
  accumulating long-term shared knowledge about the user, wherein the long-term shared knowledge includes knowledge about one or more past conversations with the user;
  
  determining an intended meaning for the utterance, wherein determining the intended meaning for the utterance includes:
  
  identifying, at a conversational speech engine, a context associated with the utterance from the short-term shared knowledge and the long-term shared knowledge; and
  
  establishing the intended meaning within the identified context, wherein the conversational speech engine establishes the intended meaning within the identified context to disambiguate an intent that the user had in speaking the one or more words that have the different meanings in the different contexts; and
  
  generating a response to the utterance, wherein the conversational speech engine grammatically or syntactically adapts the response based on the intended meaning established within the identified context.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The method of claim 1, wherein accumulating the short-term shared knowledge about the current conversation includes populating a short-term context stack with information about the utterance received during the current conversation.
  - 3. The method of claim 2, wherein accumulating the short-term shared knowledge about the current conversation further includes expiring the information about the utterance from the short-term context stack after a psychologically appropriate amount of time.
  - 4. The method of claim 3, wherein accumulating the long-term shared knowledge about the user includes updating one or more long-term profiles associated with the user to include information about the utterance received during the current conversation and relevant data associated with the information expired from the short-term context stack.
  - 5. The method of claim 1, wherein determining the intended meaning for the utterance further includes:
    - identifying a conversational goal associated with the utterance, roles associated with the user and one or more other participants in the current conversation, and an information allocation among the user and the one or more other participants in the current conversation; and
      
      classifying one or more of the utterance or the current conversation into a conversation type based on one or more of the identified conversational goal, the identified roles, or the identified information allocation, wherein the conversational speech engine further establishes the intended meaning based on the conversation type.
  - 6. The method of claim 5, wherein the established intended meaning comprises a hypothesis having a degree of certainty about the intent that the user had in speaking the one or more words in the utterance.
  - 7. The method of claim 6, further comprising generating a preliminary interpretation of the utterance at a speech recognition engine coupled to the voice input device and the conversational speech engine, wherein the conversational speech engine assigns the degree of certainty to the hypothesis based on one or more of the conversation type, information associated with the identified context, or a confidence level associated with the preliminary interpretation generated at the speech recognition engine.
  - 8. The method of claim 5, wherein the conversational speech engine further grammatically or syntactically adapts the response based on the conversation type.
  - 9. The method of claim 1, wherein the conversational speech engine grammatically or syntactically adapts the response to influence a subsequent reply utterance that the conversational speech engine expects from the user during the current conversation.
  - 10. The method of claim 1, further comprising:
    - generating multiple preliminary interpretations of the utterance at a speech recognition engine coupled to the voice input device and the conversational speech engine, wherein an initial interpretation of the utterance comprises one of the multiple preliminary interpretations having a highest confidence level; and
      
      updating the short-term shared knowledge about the current conversation to remove the initial interpretation from the multiple preliminary interpretations in response to determining that the initial interpretation was incorrect, wherein the conversational speech engine determines the intended meaning based on one of the multiple preliminary interpretations having a next highest confidence level.
  - 11. The method of claim 1, wherein the user speaks the utterance in a multi-modal input that further includes one or more non-voice inputs relating to the utterance.
  - 12. The method of claim 1, wherein the conversational speech engine generates the response in a multi-modal output that includes one or more non-voice outputs that relate to the utterance or one or more tasks executed to process a request identified from the intended meaning.
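
Claims 2 through 4 describe a short-term context stack whose entries expire and feed a long-term profile. A minimal Python sketch of that data flow follows. The class names, the 120-second `ttl` standing in for a "psychologically appropriate amount of time", and the list-based profile are assumptions for illustration only; the claims do not fix any of them.

```python
from dataclasses import dataclass, field


@dataclass
class ContextEntry:
    info: str
    added_at: float  # seconds, supplied by the caller


@dataclass
class SharedKnowledge:
    ttl: float = 120.0                               # assumed expiry window
    short_term: list = field(default_factory=list)   # context stack, newest last
    long_term: list = field(default_factory=list)    # simplified user profile

    def push(self, info, now):
        """Record information about an utterance on the short-term stack."""
        self.short_term.append(ContextEntry(info, now))

    def expire(self, now):
        """Move entries older than ttl from the stack into the long-term profile."""
        fresh = []
        for entry in self.short_term:
            if now - entry.added_at > self.ttl:
                self.long_term.append(entry.info)
            else:
                fresh.append(entry)
        self.short_term = fresh

    def current_context(self):
        """Return the most recent unexpired entry, or None if the stack is empty."""
        return self.short_term[-1].info if self.short_term else None


knowledge = SharedKnowledge(ttl=120.0)
knowledge.push("topic: traffic", now=0.0)
knowledge.push("topic: music", now=100.0)
knowledge.expire(now=150.0)
```

After the call to `expire`, the older entry has moved to `long_term` while the newer one still defines the current context. Passing `now` explicitly keeps the sketch deterministic; a real system would presumably read a clock.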

13. A non-transitory computer readable medium containing computer-executable instructions for providing a cooperative conversational voice user interface, the computer-executable instructions operable when executed to:
- receive an utterance at a voice input device during a current conversation with a user, wherein the utterance includes one or more words that have different meanings in different contexts;
  
  accumulate short-term shared knowledge about the current conversation, wherein the short-term shared knowledge includes knowledge about the utterance received at the voice input device during the current conversation;
  
  accumulate long-term shared knowledge about the user, wherein the long-term shared knowledge includes knowledge about one or more past conversations with the user;
  
  identify a context associated with the utterance, wherein a conversational speech engine identifies the context associated with the utterance from the short-term shared knowledge and the long-term shared knowledge;
  
  establish an intended meaning for the utterance within the identified context, wherein the conversational speech engine establishes the intended meaning within the identified context to disambiguate an intent that the user had in speaking the one or more words that have the different meanings in the different contexts; and
  
  generate a response to the utterance, wherein the conversational speech engine grammatically or syntactically adapts the response based on the intended meaning established within the identified context.
- Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
- - 14. The non-transitory computer readable medium of claim 13, wherein to accumulate the short-term shared knowledge about the current conversation, the computer-executable instructions are further operable when executed to populate a short-term context stack with information about the utterance received during the current conversation.
  - 15. The non-transitory computer readable medium of claim 14, wherein to accumulate the short-term shared knowledge about the current conversation, the computer-executable instructions are further operable when executed to expire the information about the utterance from the short-term context stack after a psychologically appropriate amount of time.
  - 16. The non-transitory computer readable medium of claim 15, wherein to accumulate the long-term shared knowledge about the user, the computer-executable instructions are further operable when executed to update one or more long-term profiles associated with the user to include information about the utterance received during the current conversation and relevant data associated with the information expired from the short-term context stack.
  - 17. The non-transitory computer readable medium of claim 13, wherein the computer-executable instructions are further operable when executed to:
    - identify a conversational goal associated with the utterance, roles associated with the user and one or more other participants in the current conversation, and an information allocation among the user and the one or more other participants in the current conversation; and
      
      classify one or more of the utterance or the current conversation into a conversation type based on one or more of the identified conversational goal, the identified roles, or the identified information allocation, wherein the conversational speech engine further establishes the intended meaning based on the conversation type.
  - 18. The non-transitory computer readable medium of claim 17, wherein the established intended meaning comprises a hypothesis having a degree of certainty about the intent that the user had in speaking the one or more words in the utterance.
  - 19. The computer-readable medium of claim 18, wherein the computer-executable instructions are further operable when executed to generate a preliminary interpretation of the utterance at a speech recognition engine, wherein the conversational speech engine assigns the degree of certainty to the hypothesis based on one or more of the conversation type, information associated with the identified context, or a confidence level associated with the preliminary interpretation generated at the speech recognition engine.
  - 20. The non-transitory computer readable medium of claim 17, wherein the conversational speech engine further grammatically or syntactically adapts the response based on the conversation type.
  - 21. The non-transitory computer readable medium of claim 13, wherein the conversational speech engine grammatically or syntactically adapts the response to influence a subsequent reply utterance that the conversational speech engine expects from the user during the current conversation.
  - 22. The non-transitory computer readable medium of claim 13, wherein the computer-executable instructions are further operable when executed to:
    - generate multiple preliminary interpretations of the utterance at a speech recognition engine, wherein an initial interpretation of the utterance comprises one of the multiple preliminary interpretations having a highest confidence level; and
      
      update the short-term shared knowledge about the current conversation to remove the initial interpretation from the multiple preliminary interpretations in response to determining that the initial interpretation was incorrect, wherein the conversational speech engine identifies the context associated with the utterance and establishes the intended meaning for the utterance based on one of the multiple preliminary interpretations having a next highest confidence level.
  - 23. The non-transitory computer readable medium of claim 13, wherein the user speaks the utterance in a multi-modal input that further includes one or more non-voice inputs relating to the utterance.
  - 24. The non-transitory computer readable medium of claim 13, wherein the conversational speech engine generates the response in a multi-modal output that includes one or more non-voice outputs that relate to the utterance or one or more tasks executed to process a request identified from the intended meaning.
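
Claims 10 and 22 describe falling back to the next most confident preliminary interpretation once the initial one is found to be incorrect. The Python sketch below shows one way to express that step. The `(text, confidence)` tuple representation, the function names, and the sample values are hypothetical, and how the system decides that an interpretation was incorrect is outside the sketch.

```python
def best_interpretation(n_best):
    """Return the (text, confidence) pair with the highest confidence."""
    return max(n_best, key=lambda pair: pair[1])


def reject_and_retry(n_best, rejected):
    """Drop a rejected interpretation; return the remaining list and the next best."""
    remaining = [pair for pair in n_best if pair != rejected]
    if not remaining:
        return remaining, None
    return remaining, best_interpretation(remaining)


# Hypothetical recognizer output for one utterance.
n_best = [
    ("traffic to Seattle", 0.71),
    ("tickets to Seattle", 0.64),
    ("trip to Seattle", 0.30),
]

initial = best_interpretation(n_best)
# Suppose a follow-up utterance shows the initial interpretation was wrong.
n_best, fallback = reject_and_retry(n_best, initial)
```

Removing the rejected entry from the stored list, rather than only skipping it once, mirrors the claim language about updating the short-term shared knowledge so the same misrecognition is not offered again.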

25. A system for providing a cooperative conversational voice user interface, comprising:
- a voice input device configured to receive an utterance during a current conversation with a user, wherein the utterance includes one or more words that have different meanings in different contexts; and
  
  a conversational speech engine, wherein the conversational speech engine includes one or more processors configured to:
  
  accumulate short-term shared knowledge about the current conversation, wherein the short-term shared knowledge includes knowledge about the utterance received during the current conversation;
  
  accumulate long-term shared knowledge about the user, wherein the long-term shared knowledge includes knowledge about one or more past conversations with the user;
  
  identify a context associated with the utterance from the short-term shared knowledge and the long-term shared knowledge;
  
  establish an intended meaning for the utterance within the identified context to disambiguate an intent that the user had in speaking the one or more words that have the different meanings in the different contexts; and
  
  generate a grammatically or syntactically adapted response to the utterance based on the intended meaning established within the identified context.
- Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36)
- - 26. The system of claim 25, wherein to accumulate the short-term shared knowledge about the current conversation, the one or more processors are further configured to populate a short-term context stack with information about the utterance received during the current conversation.
  - 27. The system of claim 26, wherein to accumulate the short-term shared knowledge about the current conversation, the one or more processors are further configured to expire the information about the utterance from the short-term context stack after a psychologically appropriate amount of time.
  - 28. The system of claim 27, wherein to accumulate the long-term shared knowledge about the user, the one or more processors are further configured to update one or more long-term profiles associated with the user to include information about the utterance received during the current conversation and relevant data associated with the information expired from the short-term context stack.
  - 29. The system of claim 25, wherein the one or more processors are further configured to:
    - identify a conversational goal associated with the utterance, roles associated with the user and one or more other participants in the current conversation, and an information allocation among the user and the one or more other participants in the current conversation; and
      
      classify one or more of the utterance or the current conversation into a conversation type based on one or more of the identified conversational goal, the identified roles, or the identified information allocation, wherein the one or more processors are further configured to establish the intended meaning based on the conversation type.
  - 30. The system of claim 29, wherein the established intended meaning comprises a hypothesis having a degree of certainty about the intent that the user had in speaking the one or more words in the utterance.
  - 31. The system of claim 30, further comprising a speech recognition engine configured to generate a preliminary interpretation of the utterance, wherein the one or more processors are further configured to assign the degree of certainty to the hypothesis based on one or more of the conversation type, information associated with the identified context, or a confidence level associated with the preliminary interpretation generated at the speech recognition engine.
  - 32. The system of claim 29, wherein the one or more processors are further configured to generate the grammatically or syntactically adapted response based on the conversation type.
  - 33. The system of claim 25, wherein the one or more processors are further configured to generate the grammatically or syntactically adapted response to influence a subsequent reply utterance expected from the user during the current conversation.
  - 34. The system of claim 25, further comprising a speech recognition engine configured to:
    - generate multiple preliminary interpretations of the utterance, wherein an initial interpretation of the utterance comprises one of the multiple preliminary interpretations having a highest confidence level; and
      
      update the short-term shared knowledge about the current conversation to remove the initial interpretation from the multiple preliminary interpretations in response to determining that the initial interpretation was incorrect, wherein the one or more processors are configured to identify the context associated with the utterance and establish the intended meaning for the utterance based on one of the multiple preliminary interpretations having a next highest confidence level.
  - 35. The system of claim 25, wherein the user speaks the utterance in a multi-modal input that further includes one or more non-voice inputs relating to the utterance.
  - 36. The system of claim 25, wherein the grammatically or syntactically adapted response comprises a multi-modal output that includes one or more non-voice outputs that relate to the utterance or one or more tasks executed to process a request identified from the intended meaning.
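
Claims 7, 19, and 31 assign a degree of certainty from one or more of three signals: the conversation type, information associated with the identified context, and the recognizer's confidence level. The claims do not say how the signals are combined. The Python sketch below assumes a simple weighted average purely as an illustration; the function name, the 0-to-1 scale of each signal, and the weights are all hypothetical.

```python
def degree_of_certainty(asr_confidence, context_match, type_prior,
                        weights=(0.5, 0.3, 0.2)):
    """Blend three signals in [0, 1] into one certainty score.

    asr_confidence: confidence reported for the preliminary interpretation
    context_match:  assumed score for how well the hypothesis fits the context
    type_prior:     assumed score for how well it fits the conversation type
    """
    signals = (asr_confidence, context_match, type_prior)
    if any(not 0.0 <= s <= 1.0 for s in signals):
        raise ValueError("signals must lie in [0, 1]")
    total = sum(weights)
    # Weighted average, normalized so the result also lies in [0, 1].
    return sum(w * s for w, s in zip(weights, signals)) / total


score = degree_of_certainty(asr_confidence=0.8, context_match=0.5, type_prior=1.0)
```

With these inputs the score is about 0.75. Because the claims say "one or more of", a variant could just as well use a single signal; setting a weight to zero has that effect here.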

37. A method for providing a cooperative conversational voice user interface, comprising:
- receiving an utterance at a voice input device during a current conversation with a user;
  
  accumulating short-term shared knowledge about the current conversation, wherein the short-term shared knowledge includes knowledge about the utterance received during the current conversation;
  
  accumulating long-term shared knowledge about the user, wherein the long-term shared knowledge includes knowledge about one or more past conversations with the user;
  
  determining an intended meaning for the utterance, wherein determining the intended meaning for the utterance includes:
  
  identifying, at a conversational speech engine, a context associated with the utterance from the short-term shared knowledge and the long-term shared knowledge;
  
  inferring additional information about the utterance from the short-term shared knowledge and the long-term shared knowledge in response to determining that the utterance contains insufficient information to complete a request in the identified context; and
  
  establishing the intended meaning within the identified context based on the additional information inferred about the utterance; and
  
  generating a response to the utterance based on the intended meaning established within the identified context.
- Dependent Claims (38)
- - 38. The method of claim 37, wherein the established intended meaning comprises an implicit hypothesis having a corresponding degree of certainty about an intent that the user had in speaking the utterance.
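
The inference step in claim 37, where missing information is drawn from short-term and long-term shared knowledge when the utterance alone cannot complete a request, can be sketched in Python as slot filling. The `REQUIRED_SLOTS` schema, the `get_traffic` request name, the dictionary-based knowledge stores, and the short-term-before-long-term lookup order are assumptions for illustration and are not taken from the claims.

```python
REQUIRED_SLOTS = {"get_traffic": ["destination"]}  # hypothetical request schema


def infer_missing(request, slots, short_term, long_term):
    """Fill unfilled slots from short-term knowledge first, then long-term.

    Returns the completed slots and the names of the slots that were inferred,
    so a caller can treat the result as an implicit, less certain hypothesis.
    """
    filled = dict(slots)
    inferred = []
    for name in REQUIRED_SLOTS.get(request, []):
        if name in filled:
            continue  # the utterance already supplied this slot
        for source in (short_term, long_term):
            if name in source:
                filled[name] = source[name]
                inferred.append(name)
                break
    return filled, inferred


short_term = {}                         # nothing relevant said yet in this conversation
long_term = {"destination": "work"}     # assumed habit learned from past conversations

# The user asks about traffic without naming a destination.
filled, inferred = infer_missing("get_traffic", {}, short_term, long_term)
```

Returning the list of inferred slot names alongside the filled request lets a caller lower the degree of certainty for the resulting hypothesis, in the spirit of the implicit hypothesis in claim 38.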

39. A non-transitory computer readable medium containing computer-executable instructions for providing a cooperative conversational voice user interface, the computer-executable instructions operable when executed to:
- receive an utterance at a voice input device during a current conversation with a user;
  
  accumulate short-term shared knowledge about the current conversation, wherein the short-term shared knowledge includes knowledge about the utterance received during the current conversation;
  
  accumulate long-term shared knowledge about the user, wherein the long-term shared knowledge includes knowledge about one or more past conversations with the user;
  
  identify a context associated with the utterance, wherein a conversational speech engine identifies the context associated with the utterance from the short-term shared knowledge and the long-term shared knowledge;
  
  infer additional information about the utterance from the short-term shared knowledge and the long-term shared knowledge in response to determining that the utterance contains insufficient information to complete a request in the identified context;
  
  establish an intended meaning for the utterance within the identified context based on the additional information inferred about the utterance; and
  
  generate a response to the utterance based on the intended meaning established within the identified context.
- Dependent Claims (40)
- - 40. The non-transitory computer readable medium of claim 39, wherein the established intended meaning comprises an implicit hypothesis having a corresponding degree of certainty about an intent that the user had in speaking the utterance.

41. A system for providing a cooperative conversational voice user interface, comprising:
- a voice input device configured to receive an utterance during a current conversation with a user; and
  
  a conversational speech engine, wherein the conversational speech engine includes one or more processors configured to:
  
  accumulate short-term shared knowledge about the current conversation, wherein the short-term shared knowledge includes knowledge about the utterance received during the current conversation;
  
  accumulate long-term shared knowledge about the user, wherein the long-term shared knowledge includes knowledge about one or more past conversations with the user;
  
  identify a context associated with the utterance from the short-term shared knowledge and the long-term shared knowledge;
  
  infer additional information about the utterance from the short-term shared knowledge and the long-term shared knowledge in response to determining that the utterance contains insufficient information to complete a request in the identified context;
  
  establish an intended meaning for the utterance within the identified context based on the additional information inferred about the utterance; and
  
  generate a response to the utterance based on the intended meaning established within the identified context.
- Dependent Claims (42)
- - 42. The system of claim 41, wherein the established intended meaning comprises an implicit hypothesis having a corresponding degree of certainty about an intent that the user had in speaking the utterance.

Current Assignee
VB Assets, LLC
Original Assignee
VoiceBox Technologies, Inc. (Microsoft Corporation)
Inventors
Baldwin, Larry; Freeman, Tom; Tjalve, Michael; Ebersold, Blane; Weider, Chris
Primary Examiner(s)
YEN, ERIC L

Application Number

US11/580,926
Publication Number

US 20080091406A1
Time in Patent Office

1,877 Days
Field of Search

704/270-275, 704/9
US Class Current

704/9
CPC Class Codes

G06F 3/167   Audio in a user interface, ...

G06F 40/30   Semantic analysis

G10L 15/18   using natural language mode...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/1822   Parsing for meaning underst...

G10L 15/183   using context dependencies,...

G10L 15/22   Procedures used during a sp...

G10L 17/22   Interactive procedures; Man...

G10L 2015/0631   Creating reference template...

G10L 2015/225   Feedback of the input speech

G10L 2015/228   of application context

G10L 2021/02166   Microphone arrays; Beamforming

G10L 25/51   for comparison or discrimin...

G10L 25/63   for estimating an emotional...
