REDUCING THE NEED FOR MANUAL START/END-POINTING AND TRIGGER PHRASES

US 20150348548A1
Filed: 09/30/2014
Published: 12/03/2015
Est. Priority Date: 05/30/2014
Status: Active Grant

First Claim

Patent Images

1. A method for operating a virtual assistant on an electronic device, the method comprising:

receiving, at the electronic device, an audio input;

monitoring the audio input to identify a first spoken user input;

identifying the first spoken user input in the audio input;

determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input;

in response to a determination to respond to the first spoken user input;

generating a response to the first spoken user input; and

monitoring the audio input to identify a second spoken user input; and

in response to a determination not to respond to the first spoken user input, monitoring the audio input to identify the second spoken user input without generating the response to the first spoken user input.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Systems and processes for selectively processing and responding to a spoken user input are provided. In one example, audio input containing a spoken user input can be received at a user device. The spoken user input can be identified from the audio input by identifying start and end-points of the spoken user input. It can be determined whether or not the spoken user input was intended for a virtual assistant based on contextual information. The determination can be made using a rule-based system or a probabilistic system. If it is determined that the spoken user input was intended for the virtual assistant, the spoken user input can be processed and an appropriate response can be generated. If it is instead determined that the spoken user input was not intended for the virtual assistant, the spoken user input can be ignored and/or no response can be generated.

Citations

38 Claims

1. A method for operating a virtual assistant on an electronic device, the method comprising:
- receiving, at the electronic device, an audio input;
  
  monitoring the audio input to identify a first spoken user input;
  
  identifying the first spoken user input in the audio input;
  
  determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input;
  
  in response to a determination to respond to the first spoken user input;
  
  generating a response to the first spoken user input; and
  
  monitoring the audio input to identify a second spoken user input; and
  
  in response to a determination not to respond to the first spoken user input, monitoring the audio input to identify the second spoken user input without generating the response to the first spoken user input.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36)
- - 2. The method of claim 1, wherein determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input excludes identifying one or more predetermined words at the start of the first spoken user input.
  - 3. The method of claim 1, wherein determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input excludes identifying a physical or virtual button input received prior to receiving the first spoken user input.
  - 4. The method of claim 1, wherein generating the response to the first spoken user input comprises one or more of:
    - performing speech-to-text conversion on the first spoken user input;
      
      determining a user intent based on the first spoken user input;
      
      determining a task to be performed based on the first spoken user input;
      
      determining a parameter for the task to be performed based on the first spoken user input;
      
      performing the task to be performed;
      
      displaying a text response to the first spoken user input; and
      
      outputting an audio response to the first spoken user input.
  - 5. The method of claim 1, wherein determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input comprises:
    - evaluating one or more conditional rules that depend on the contextual information associated with the first spoken user input.
  - 6. The method of claim 1, wherein determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input comprises:
    - calculating a likelihood score that the virtual assistant should respond to the first spoken user input based on the contextual information associated with the first spoken user input; and
      
      comparing the likelihood score to a threshold value.
  - 7. The method of claim 6, wherein the contextual information comprises an elapsed time between receiving the first spoken user input and a previous user input, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - decreasing the likelihood score in response to a value of the elapsed time being greater than a threshold duration; and
      
      increasing the likelihood score in response to the value of the elapsed time being less than the threshold duration.
  - 8. The method of any claim 6, wherein the contextual information comprises a previous spoken user input, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information based on contextual information comprises increasing the likelihood score in response to detecting a match between the previous spoken user input and the first spoken user input.
  - 9. The method of claim 6, wherein the contextual information comprises a distance between a user and the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - decreasing the likelihood score in response to the distance being greater than a threshold distance; and
      
      increasing the likelihood score in response to the distance being less than the threshold distance.
  - 10. The method of any of claim 6, wherein the contextual information comprises an orientation of the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - decreasing the likelihood score in response to the orientation of the device being facedown or upside down; and
      
      increasing the likelihood score in response to the orientation of the device being face up or upright.
  - 11. The method of claim 6, wherein the contextual information comprises an orientation between the user and the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the orientation being one in which a display of the electronic device is oriented towards the user; and
      
      decreasing the likelihood score in response to the orientation being one in which the display of the electronic device is oriented away from the user.
  - 12. The method of claim 6, wherein the contextual information comprises a direction of the user'"'"'s eyes when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the direction of the user'"'"'s eyes being pointed at the electronic device; and
      
      decreasing the likelihood score in response to the direction of the user'"'"'s eyes being pointed away from the electronic device.
  - 13. The method of claim 6, wherein the contextual information comprises an indication of whether the first spoken user input was recognized by an automatic speech recognizer, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the indication indicating that the first spoken user input was recognized by the automatic speech recognizer; and
      
      decreasing the likelihood score in response to the indication indicating that the first spoken user input was not recognized by the automatic speech recognizer.
  - 14. The method of claim 6, wherein the contextual information comprises a semantic relationship between the first spoken user input and the previous spoken user input, and wherein determining the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to a value of the semantic relationship being greater than a spoken user input semantic threshold value; and
      
      decreasing the likelihood score in response to the value of the semantic relationship being less than the spoken user input semantic threshold value.
  - 15. The method claim 6, wherein the contextual information comprises a length of the first spoken user input, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the length of the first spoken user input less than a threshold length; and
      
      decreasing the likelihood score in response to the length of the first spoken user input being greater than the threshold length.
  - 16. The method of claim 6, wherein the contextual information comprises an identification of a speaker of the first spoken user input.
  - 17. The method of claim 6, wherein the contextual information comprises a time the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the time being within a predetermined set of times; and
      
      decreasing the likelihood score in response to the time not being within the predetermined set of times.
  - 18. The method of claim 6, wherein the contextual information comprises an indication of whether the electronic device was outputting information to the user when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the indication indicating that the electronic device was outputting information to the user when the first spoken user input was received; and
      
      decreasing the likelihood score in response to the indication indicating that the electronic device was not outputting information to the user when the first spoken user input was received.
  - 19. The method of claim 6, wherein the contextual information comprises an expectation of receiving input from the user, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the expectation of receiving input from the user indicating that input was expected to be received from the user; and
      
      decreasing the likelihood score in response to the expectation of receiving input from the user indicating that input was not expected to be received from the user.
  - 20. The method of claim 6, wherein the contextual information comprises an indication of whether the electronic device is being held when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the indication indicating that the electronic device was being held when the first spoken user input was received; and
      
      decreasing the likelihood score in response to the indication indicating that the electronic device was not being held when the first spoken user input was received.
  - 21. The method of claim 6, wherein the contextual information comprises an operating state of the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the operating state of the electronic device being one of a set of predetermined operating states; and
      
      decreasing the likelihood score in response to the operating state of the electronic device not being one of the set of predetermined operating states.
  - 22. The method of claim 6, wherein the contextual information comprises a previous action performed by the electronic device, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the previous action performed by the electronic device being one of a set of predetermined actions; and
      
      decreasing the likelihood score in response to the previous action performed by the electronic device not being one of the set of predetermined actions.
  - 23. The method of claim 6, wherein the contextual information comprises an indication of whether the content was being displayed by the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises increasing the likelihood score in response to the indication indicating that content was being displayed by the electronic device when the first spoken user input was received.
  - 24. The method of claim 6, wherein the contextual information comprises a semantic relationship between the first spoken user input and content being displayed by the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to a value of the semantic relationship being greater than a content semantic threshold value; and
      
      decreasing the likelihood score in response to the value of the semantic relationship being less than the content semantic threshold value.
  - 25. The method of claim 6, wherein the contextual information comprises a position of the user when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the position of the user being one of a predetermined set of positions; and
      
      decreasing the likelihood score in response to the position of the user not being one of the predetermined set of positions.
  - 26. The method of claim 6, wherein the contextual information comprises a gesture being performed by the user when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the gesture being one of a predetermined set of gestures; and
      
      decreasing the likelihood score in response to the gesture not being one of the predetermined set of gestures.
  - 27. The method of claim 6, wherein the contextual information comprises a semantic relationship between the first spoken user input and the previous output of the electronic device, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to a value of the semantic relationship being greater than a previous output semantic threshold value; and
      
      decreasing the likelihood score in response to the value of the semantic relationship being less than the previous output semantic threshold value.
  - 28. The method of claim 6, wherein the contextual information comprises a location of the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - decreasing the likelihood score in response to the location being one of a predetermined set of locations; and
      
      increasing the likelihood score in response to the location not being one of the predetermined set of locations.
  - 29. The method of claim 6, wherein the contextual information comprises a semantic relationship between the first spoken user input an application being run by the electronic device when the first spoken user input was received, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input device based on contextual information comprises:
    - increasing the likelihood score in response to a value of the semantic relationship being greater than an application semantic threshold value; and
      
      decreasing the likelihood score in response to the value of the semantic relationship being less than the application semantic threshold value.
  - 30. The method of claim 6, wherein the contextual information comprises a semantic relationship between the first spoken user input and a previous contact, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to a value of the semantic relationship being greater than a previous contact semantic threshold value; and
      
      decreasing the likelihood score in response to the value of the semantic relationship being less than the previous contact semantic threshold value.
  - 31. The method of claim 6, wherein the contextual information comprises a semantic relationship between the first spoken user input and a previous email, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to a value of the semantic relationship being greater than a previous email semantic threshold value; and
      
      decreasing the likelihood score in response to the value of the semantic relationship being less than the previous email semantic threshold value.
  - 32. The method of claim 6, wherein the contextual information comprises a semantic relationship between the first spoken user input and a previous SMS message, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to a value of the semantic relationship being greater than a previous SMS message semantic threshold value; and
      
      decreasing the likelihood score in response to the value of the semantic relationship being less than the previous SMS semantic threshold value.
  - 33. The method of claim 6, wherein the contextual information comprises a movement of the electronic device, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the movement being one of a predetermined set of movements; and
      
      decreasing the likelihood score in response to the movement not being one of the predetermined set of movements.
  - 34. The method of claim 6, wherein the contextual information comprises a user setting, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the user setting being one of a predetermined set of user settings; and
      
      decreasing the likelihood score in response to the user setting not being one of the predetermined set of user settings.
  - 35. The method of claim 6, wherein the contextual information comprises an amount of light sensed by the electronic device, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises:
    - increasing the likelihood score in response to the amount of light being greater than a threshold amount of light; and
      
      decreasing the likelihood score in response to the amount of light being less than the threshold amount of light.
  - 36. The method of claim 6, wherein the contextual information comprises calendar data, and wherein calculating the likelihood score that the virtual assistant should respond to the first spoken user input based on contextual information comprises decreasing the likelihood score in response to the calendar data indicating that the user is occupied at the time that the first spoken user input was received.

37. A non-transitory computer-readable storage medium comprising instructions for:
- receiving an audio input;
  
  monitoring the audio input to identify a first spoken user input;
  
  identifying the first spoken user input in the audio input;
  
  determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input;
  
  responsive to a determination to respond to the first spoken user input;
  
  generating a response to the first spoken user input; and
  
  monitoring the audio input to identify a second spoken user input; and
  
  responsive to a determination not to respond to the first spoken user input, monitoring the audio input to identify the second spoken user input without generating the response to the first spoken user input.

38. A system comprising:
- one or more processors;
  
  memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for;
  
  receiving an audio input;
  
  monitoring the audio input to identify a first spoken user input;
  
  identifying the first spoken user input in the audio input;
  
  determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input;
  
  responsive to a determination to respond to the first spoken user input;
  
  generating a response to the first spoken user input; and
  
  monitoring the audio input to identify a second spoken user input; and
  
  responsive to a determination not to respond to the first spoken user input, monitoring the audio input to identify the second spoken user input without generating the response to the first spoken user input.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Apple Inc.
Original Assignee
Apple Inc.
Inventors
BINDER, Justin G., PIERNOT, Philippe P.

Granted Patent

US 9,715,875 B2
Time in Patent Office

Days
Field of Search
US Class Current

1/1
CPC Class Codes

G06F 2203/0381   Multimodal input, i.e. inte...

G06F 3/013   Eye tracking input arrangem...

G06F 3/167   Audio in a user interface, ...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/1822   Parsing for meaning underst...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 17/00   Speaker identification or v...

G10L 2015/223   Execution procedure of a sp...

G10L 2015/227   of the speaker; Human-fact...

G10L 2015/228   of application context

H04W 4/025   using location based inform...

REDUCING THE NEED FOR MANUAL START/END-POINTING AND TRIGGER PHRASES

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

38 Claims

Specification

Solutions

Use Cases

Quick Links

REDUCING THE NEED FOR MANUAL START/END-POINTING AND TRIGGER PHRASES

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

38 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links