Robust end-pointing of speech signals using speaker recognition
Abstract
Systems and processes for robust end-pointing of speech signals using speaker recognition are provided. In one example process, a stream of audio having a spoken user request can be received. A first likelihood that the stream of audio includes user speech can be determined. A second likelihood that the stream of audio includes user speech spoken by an authorized user can be determined. A start-point or an end-point of the spoken user request can be determined based at least in part on the first likelihood and the second likelihood.
45 Claims
1. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
- receiving a stream of audio comprising the spoken user request;
- determining a first likelihood that the stream of audio comprises user speech;
- determining a second likelihood that the stream of audio comprises user speech spoken by an authorized user of the electronic device;
- weighting the first likelihood and the second likelihood;
- identifying the start-point or the end-point of the spoken user request based at least in part on the weighted first likelihood and the weighted second likelihood; and
- identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
- processing the portion of the stream of audio to determine a corresponding task; and
- performing the task responsive to receiving the stream of audio.
Dependent claims: 2-13
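The weighting and end-pointing steps of claim 1 can be illustrated with a minimal sketch. It assumes the two likelihoods are available as per-frame values in [0, 1], that the weighting is a linear combination, and that an endpoint is the first or last frame whose combined score reaches a threshold. None of these choices is specified by the claim; the function names, weights, and threshold are hypothetical.

```python
from typing import List, Optional, Tuple


def combine_likelihoods(speech: List[float], speaker: List[float],
                        w_speech: float = 0.5,
                        w_speaker: float = 0.5) -> List[float]:
    """Weighted per-frame combination of the first and second likelihoods.

    A linear combination is an assumption made for illustration only.
    """
    if len(speech) != len(speaker):
        raise ValueError("likelihood sequences must have equal length")
    return [w_speech * a + w_speaker * b for a, b in zip(speech, speaker)]


def find_endpoints(scores: List[float],
                   threshold: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices, or None if no frame qualifies.

    start is the first frame whose score reaches the threshold,
    end is the last such frame.
    """
    active = [i for i, s in enumerate(scores) if s >= threshold]
    if not active:
        return None
    return active[0], active[-1]


# Hypothetical per-frame likelihoods: frames 1 and 4 contain speech from
# someone other than the authorized user (high first, low second likelihood).
speech_lik = [0.1, 0.9, 0.9, 0.9, 0.8, 0.1]
speaker_lik = [0.0, 0.1, 0.9, 0.8, 0.1, 0.0]

combined = combine_likelihoods(speech_lik, speaker_lik, 0.4, 0.6)
endpoints = find_endpoints(combined)           # (2, 3)
speech_only = find_endpoints(speech_lik)       # (1, 4)
```

In this toy input, using the first likelihood alone would place the endpoints at frames 1 and 4, while the weighted combination narrows the request to frames 2 and 3, where the authorized user is speaking.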
14. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
- receiving a stream of audio comprising the spoken user request;
- determining a first likelihood that the stream of audio comprises user speech based at least in part on an energy level of the stream of audio;
- in response to the energy level exceeding a threshold energy level for longer than a predetermined threshold duration, performing speaker authentication on the stream of audio to determine a second likelihood that the stream of audio comprises speech spoken by an authorized user of the electronic device;
- identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
- identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
- processing the portion of the stream of audio to determine a corresponding task; and
- performing the task responsive to receiving the stream of audio.
Dependent claims: 15-18
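The duration-gated trigger in claim 14 can be sketched as a run-length counter over frame energies. The sketch below assumes the audio is already split into fixed-length frames with one energy value each, and that the "predetermined threshold duration" is expressed as a number of frames. The `authenticate` callback stands in for whatever speaker-authentication model a system would use; it and all other names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def gated_speaker_likelihoods(
        energies: Sequence[float],
        energy_threshold: float,
        min_frames: int,
        authenticate: Callable[[int], float]) -> List[Optional[float]]:
    """Per-frame second likelihood; None where authentication was not run.

    Authentication runs on a frame only once the energy has stayed above
    energy_threshold for more than min_frames consecutive frames.
    """
    results: List[Optional[float]] = []
    run = 0  # length of the current run of above-threshold frames
    for i, energy in enumerate(energies):
        run = run + 1 if energy > energy_threshold else 0
        results.append(authenticate(i) if run > min_frames else None)
    return results


# Hypothetical stand-in authenticator that records which frames it scored.
calls: List[int] = []


def fake_authenticate(frame_index: int) -> float:
    calls.append(frame_index)
    return 0.9


frame_energies = [0.1, 0.8, 0.1, 0.7, 0.9, 0.8, 0.9, 0.2]
second_lik = gated_speaker_likelihoods(frame_energies, 0.5, 2,
                                       fake_authenticate)
```

With these inputs, the single loud frame at index 1 does not trigger authentication; only frames 5 and 6, which extend a run longer than two frames, are scored. Gating this way avoids running the more expensive speaker model on brief noise bursts.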
19. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
- receiving a signal to begin recording an audio input, wherein the audio input comprises the spoken user request;
- determining a baseline energy level of the audio input based on an energy level of a first portion of the audio input;
- determining a first likelihood that the audio input comprises user speech based on an energy level of a second portion of the audio input;
- in response to the baseline energy level exceeding a threshold energy level, performing speaker authentication on the second portion of the audio input to determine a second likelihood that the audio input comprises speech spoken by an authorized user of the electronic device;
- identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
- identifying a portion of the audio input including the spoken user request using the start-point or the end-point;
- processing the portion of the audio input to determine a corresponding task; and
- performing the task responsive to receiving the audio input.
Dependent claims: 20-23
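The baseline-energy condition in claim 19 can be sketched as follows. The sketch assumes raw samples in a Python list, mean squared amplitude as the energy measure, and a simple energy ratio as the first likelihood. The claim does not prescribe any of these; the `authenticate` callback and every name here are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def mean_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty block of samples."""
    return sum(s * s for s in samples) / len(samples)


def analyze_input(
        audio: Sequence[float],
        split: int,
        baseline_threshold: float,
        authenticate: Callable[[Sequence[float]], float],
) -> Tuple[float, Optional[float]]:
    """Return (first_likelihood, second_likelihood or None).

    audio[:split] is the first portion (baseline), audio[split:] the second.
    Speaker authentication runs only when the baseline energy exceeds
    baseline_threshold, i.e. when the environment is already noisy.
    """
    if not 0 < split < len(audio):
        raise ValueError("split must leave both portions non-empty")
    first, second = audio[:split], audio[split:]
    baseline = mean_energy(first)
    energy = mean_energy(second)
    total = energy + baseline
    # Illustrative mapping only: share of energy in the second portion.
    first_likelihood = energy / total if total > 0 else 0.0
    second_likelihood = (authenticate(second)
                         if baseline > baseline_threshold else None)
    return first_likelihood, second_likelihood


def fake_authenticate(samples: Sequence[float]) -> float:
    """Hypothetical stand-in for a speaker-authentication model."""
    return 0.7


quiet = analyze_input([0.0] * 4 + [0.5] * 4, 4, 0.01, fake_authenticate)
noisy = analyze_input([0.2] * 4 + [0.5] * 4, 4, 0.01, fake_authenticate)
```

In the quiet case the baseline is zero, so energy alone is treated as sufficient and authentication is skipped; in the noisy case the baseline exceeds the threshold and the second likelihood is computed.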
24. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive a stream of audio comprising the spoken user request;
- determine a first likelihood that the stream of audio comprises user speech;
- determine a second likelihood that the stream of audio comprises user speech spoken by an authorized user;
- weight the first likelihood and the second likelihood;
- identify a start-point or an end-point of the spoken user request based at least in part on the weighted first likelihood and the weighted second likelihood; and
- identify a portion of the stream of audio including the spoken user request using the start-point or the end-point;
- process the portion of the stream of audio to determine a corresponding task; and
- perform the task responsive to receiving the stream of audio.
Dependent claims: 25-32
33. An electronic device comprising:
- one or more processors;
- memory;
- one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
- receiving a stream of audio comprising the spoken user request;
- determining a first likelihood that the stream of audio comprises user speech;
- determining a second likelihood that the stream of audio comprises user speech spoken by an authorized user;
- weighting the first likelihood and the second likelihood;
- identifying a start-point or an end-point of the spoken user request based at least in part on the weighted first likelihood and the weighted second likelihood; and
- identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
- processing the portion of the stream of audio to determine a corresponding task; and
- performing the task responsive to receiving the stream of audio.
Dependent claims: 34-41
42. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive a stream of audio comprising the spoken user request;
- determine a first likelihood that the stream of audio comprises user speech based at least in part on an energy level of the stream of audio;
- in response to the energy level exceeding a threshold energy level for longer than a predetermined threshold duration, perform speaker authentication on the stream of audio to determine a second likelihood that the stream of audio comprises speech spoken by an authorized user of the electronic device;
- identify the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
- identify a portion of the stream of audio including the spoken user request using the start-point or the end-point;
- process the portion of the stream of audio to determine a corresponding task; and
- perform the task responsive to receiving the stream of audio.
43. An electronic device comprising:
- one or more processors;
- memory;
- one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
- receiving a stream of audio comprising the spoken user request;
- determining a first likelihood that the stream of audio comprises user speech based at least in part on an energy level of the stream of audio;
- in response to the energy level exceeding a threshold energy level for longer than a predetermined threshold duration, performing speaker authentication on the stream of audio to determine a second likelihood that the stream of audio comprises speech spoken by an authorized user of the electronic device;
- identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
- identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
- processing the portion of the stream of audio to determine a corresponding task; and
- performing the task responsive to receiving the stream of audio.
44. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive a signal to begin recording an audio input, wherein the audio input comprises the spoken user request;
- determine a baseline energy level of the audio input based on an energy level of a first portion of the audio input;
- determine a first likelihood that the audio input comprises user speech based on an energy level of a second portion of the audio input;
- in response to the baseline energy level exceeding a threshold energy level, perform speaker authentication on the second portion of the audio input to determine a second likelihood that the audio input comprises speech spoken by an authorized user of the electronic device;
- identify the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
- identify a portion of the audio input including the spoken user request using the start-point or the end-point;
- process the portion of the audio input to determine a corresponding task; and
- perform the task responsive to receiving the audio input.
45. An electronic device comprising:
- one or more processors;
- memory;
- one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
- receiving a signal to begin recording an audio input, wherein the audio input comprises the spoken user request;
- determining a baseline energy level of the audio input based on an energy level of a first portion of the audio input;
- determining a first likelihood that the audio input comprises user speech based on an energy level of a second portion of the audio input;
- in response to the baseline energy level exceeding a threshold energy level, performing speaker authentication on the second portion of the audio input to determine a second likelihood that the audio input comprises speech spoken by an authorized user of the electronic device;
- identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
- identifying a portion of the audio input including the spoken user request using the start-point or the end-point;
- processing the portion of the audio input to determine a corresponding task; and
- performing the task responsive to receiving the audio input.
Specification