Reducing the need for manual start/end-pointing and trigger phrases
First Claim
1. A method for operating a virtual assistant on an electronic device, the method comprising:
receiving, at the electronic device, an audio input;
monitoring the audio input to identify a first spoken user input, wherein the first spoken user input comprises a user request;
identifying the first spoken user input in the audio input;
determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input, wherein the contextual information comprises a direction of the user's gaze when the first spoken user input was received, wherein the determining comprises:
calculating a likelihood score that the virtual assistant should provide an audible response to the first spoken user input based on the contextual information associated with the first spoken user input, wherein the audible response at least partially satisfies the user request;
increasing the likelihood score in response to the direction of the user's gaze being pointed at the electronic device when the first spoken user input was received; and
decreasing the likelihood score in response to the direction of the user's gaze being pointed away from the electronic device when the first spoken user input was received;
in response to a determination to respond to the first spoken user input:
generating the audible response to the first spoken user input; and
monitoring the audio input to identify a second spoken user input; and
in response to a determination not to respond to the first spoken user input, monitoring the audio input to identify the second spoken user input without generating the audible response to the first spoken user input.
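The gaze-dependent likelihood-score step recited above can be sketched as follows. This is a hypothetical illustration only: the base score, the adjustment amounts, and the response threshold are assumed values for demonstration, not figures taken from the patent.

```python
# Illustrative sketch of claim 1's likelihood-score step: the score is
# increased when the user's gaze is pointed at the device and decreased
# when it is pointed away. All constants here are assumptions.

BASE_SCORE = 0.5            # assumed prior likelihood the input is for the assistant
GAZE_AT_DEVICE_BONUS = 0.3  # assumed increase for gaze at the device
GAZE_AWAY_PENALTY = 0.3     # assumed decrease for gaze away from the device
RESPONSE_THRESHOLD = 0.6    # assumed threshold for generating an audible response

def likelihood_score(gaze_at_device: bool) -> float:
    """Adjust a base likelihood using the gaze direction captured when
    the first spoken user input was received."""
    score = BASE_SCORE
    if gaze_at_device:
        score += GAZE_AT_DEVICE_BONUS   # increasing the likelihood score
    else:
        score -= GAZE_AWAY_PENALTY      # decreasing the likelihood score
    return score

def should_respond(gaze_at_device: bool) -> bool:
    """Determine whether to respond based on the adjusted score."""
    return likelihood_score(gaze_at_device) >= RESPONSE_THRESHOLD
```

With these assumed constants, gaze at the device yields a score of 0.8 (respond) and gaze away yields 0.2 (do not respond).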
Abstract
Systems and processes for selectively processing and responding to a spoken user input are provided. In one example, audio input containing a spoken user input can be received at a user device. The spoken user input can be identified from the audio input by identifying start and end-points of the spoken user input. It can be determined whether or not the spoken user input was intended for a virtual assistant based on contextual information. The determination can be made using a rule-based system or a probabilistic system. If it is determined that the spoken user input was intended for the virtual assistant, the spoken user input can be processed and an appropriate response can be generated. If it is instead determined that the spoken user input was not intended for the virtual assistant, the spoken user input can be ignored and/or no response can be generated.
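The abstract names two ways the intent determination can be made: a rule-based system and a probabilistic system. The sketch below contrasts minimal versions of each under assumed contextual signals; the specific rules, feature names, and weights are illustrative assumptions, not details from the specification.

```python
# Two toy intent-determination strategies, per the abstract: rule-based
# and probabilistic. Signal names and weights are assumptions.

def rule_based_intended(ctx: dict) -> bool:
    """Rule-based system: a fixed logical test over contextual signals.
    Example rule (assumed): intended if the user was looking at the
    device or had recently addressed the assistant."""
    return ctx.get("gaze_at_device", False) or ctx.get("recently_addressed", False)

def probabilistic_intended(ctx: dict, threshold: float = 0.5) -> bool:
    """Probabilistic system: sum assumed weights of the signals that
    are present and compare against a threshold."""
    weights = {
        "gaze_at_device": 0.6,
        "recently_addressed": 0.3,
        "device_held": 0.2,
    }
    score = sum(w for name, w in weights.items() if ctx.get(name, False))
    return score >= threshold
```

The probabilistic form degrades gracefully when signals conflict, whereas the rule-based form is easier to audit; the patent's abstract allows either.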
73 Claims
1. A method for operating a virtual assistant on an electronic device, the method comprising:
receiving, at the electronic device, an audio input;
monitoring the audio input to identify a first spoken user input, wherein the first spoken user input comprises a user request;
identifying the first spoken user input in the audio input;
determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input, wherein the contextual information comprises a direction of the user's gaze when the first spoken user input was received, wherein the determining comprises:
calculating a likelihood score that the virtual assistant should provide an audible response to the first spoken user input based on the contextual information associated with the first spoken user input, wherein the audible response at least partially satisfies the user request;
increasing the likelihood score in response to the direction of the user's gaze being pointed at the electronic device when the first spoken user input was received; and
decreasing the likelihood score in response to the direction of the user's gaze being pointed away from the electronic device when the first spoken user input was received;
in response to a determination to respond to the first spoken user input:
generating the audible response to the first spoken user input; and
monitoring the audio input to identify a second spoken user input; and
in response to a determination not to respond to the first spoken user input, monitoring the audio input to identify the second spoken user input without generating the audible response to the first spoken user input.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
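A notable feature of the claim's branch structure is that monitoring continues toward a second spoken user input on both branches; only the audible response is conditional. A minimal control-flow sketch, with hypothetical placeholder callables for the decision and response steps:

```python
# Control-flow sketch of claim 1's two branches: respond or ignore, but
# keep monitoring either way. decide_fn and respond_fn are hypothetical
# placeholders for the claim's determination and response-generation steps.

def handle_spoken_input(spoken_input, contextual_info, decide_fn, respond_fn):
    """Return the audible response (or None if the input is ignored).
    In both cases the caller resumes monitoring the audio input to
    identify the second spoken user input."""
    if decide_fn(spoken_input, contextual_info):
        response = respond_fn(spoken_input)   # generate the audible response
    else:
        response = None                       # ignored: no response generated
    return response
```

Because the function returns in both branches, the surrounding monitoring loop proceeds identically whether or not a response was produced.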
36. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the device to:
receive an audio input;
monitor the audio input to identify a first spoken user input, wherein the first spoken user input comprises a user request;
identify the first spoken user input in the audio input;
determine whether to respond to the first spoken user input based on contextual information associated with the first spoken user input, wherein the contextual information comprises a direction of the user's gaze when the first spoken user input was received, wherein the determining comprises:
calculating a likelihood score that the virtual assistant should provide an audible response to the first spoken user input based on the contextual information associated with the first spoken user input, wherein the audible response at least partially satisfies the user request;
increasing the likelihood score in response to the direction of the user's gaze being pointed at the electronic device when the first spoken user input was received; and
decreasing the likelihood score in response to the direction of the user's gaze being pointed away from the electronic device when the first spoken user input was received;
responsive to a determination to respond to the first spoken user input:
generate the audible response to the first spoken user input; and
monitor the audio input to identify a second spoken user input; and
responsive to a determination not to respond to the first spoken user input, monitor the audio input to identify the second spoken user input without generating the audible response to the first spoken user input.
- View Dependent Claims (38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 73)
37. A system comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
receiving an audio input;
monitoring the audio input to identify a first spoken user input, wherein the first spoken user input comprises a user request;
identifying the first spoken user input in the audio input;
determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input, wherein the contextual information comprises a direction of the user's gaze when the first spoken user input was received, wherein the determining comprises:
calculating a likelihood score that the virtual assistant should provide an audible response to the first spoken user input based on the contextual information associated with the first spoken user input, wherein the audible response at least partially satisfies the user request;
increasing the likelihood score in response to the direction of the user's gaze being pointed at the electronic device when the first spoken user input was received; and
decreasing the likelihood score in response to the direction of the user's gaze being pointed away from the electronic device when the first spoken user input was received;
responsive to a determination to respond to the first spoken user input:
generating the audible response to the first spoken user input; and
monitoring the audio input to identify a second spoken user input; and
responsive to a determination not to respond to the first spoken user input, monitoring the audio input to identify the second spoken user input without generating the audible response to the first spoken user input.
- View Dependent Claims (56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
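The title and abstract frame the invention as reducing manual start/end-pointing: the spoken input must be located within a continuous audio stream. A very simple approximation of that endpoint-identification step is an energy-based detector, sketched below; this toy threshold approach is an assumption for illustration, not the detector described in the specification.

```python
# Toy start/end-pointing sketch: the spoken input is taken to span the
# first through last audio frame whose energy exceeds a threshold.
# The threshold and the energy representation are assumptions.

def find_endpoints(frame_energies, threshold=0.1):
    """Return (start_index, end_index) of the voiced segment in a list
    of per-frame energies, or None if no frame exceeds the threshold."""
    voiced = [i for i, e in enumerate(frame_energies) if e > threshold]
    if not voiced:
        return None
    return voiced[0], voiced[-1]
```

For example, a frame-energy sequence that rises above the threshold only in its middle frames yields start and end indices bracketing just that middle span, so the assistant need not rely on a trigger phrase or a button press to delimit the utterance.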
Specification