Robust end-pointing of speech signals using speaker recognition

US 10,186,282 B2
Filed: 04/30/2015
Issued: 01/22/2019
Est. Priority Date: 06/19/2014
Status: Active Grant

Abstract

Systems and processes for robust end-pointing of speech signals using speaker recognition are provided. In one example process, a stream of audio having a spoken user request can be received. A first likelihood that the stream of audio includes user speech can be determined. A second likelihood that the stream of audio includes user speech spoken by an authorized user can be determined. A start-point or an end-point of the spoken user request can be determined based at least in part on the first likelihood and the second likelihood.
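
As a rough illustration of the process summarized above, the sketch below combines a per-frame speech likelihood with a per-frame authorized-speaker likelihood and reports the first run of frames whose combined score clears a threshold. The function name `find_endpoints`, the equal 0.5/0.5 weighting, and the 0.6 threshold are assumptions made for this example; the patent text does not specify them.

```python
from typing import List, Optional, Tuple


def find_endpoints(
    speech_likelihoods: List[float],
    speaker_likelihoods: List[float],
    threshold: float = 0.6,
) -> Optional[Tuple[int, int]]:
    """Return (start_index, end_index) of the first run of frames whose
    combined score is at or above the threshold, or None if there is none.

    Assumption: the two likelihoods are averaged with equal weights.
    """
    combined = [
        0.5 * speech + 0.5 * speaker
        for speech, speaker in zip(speech_likelihoods, speaker_likelihoods)
    ]
    start = None
    for index, score in enumerate(combined):
        if score >= threshold and start is None:
            start = index  # candidate start-point
        elif score < threshold and start is not None:
            return (start, index - 1)  # end-point is the last frame above threshold
    if start is not None:
        return (start, len(combined) - 1)
    return None


# Frames 1 and 4 contain speech from someone other than the authorized user.
speech = [0.1, 0.9, 0.9, 0.9, 0.9, 0.1]
speaker = [0.0, 0.1, 0.9, 0.8, 0.1, 0.0]
endpoints = find_endpoints(speech, speaker)
```

In this toy input, a detector that relied on the speech likelihood alone would place the request at frames 1 through 4, while the combined score narrows it to the frames attributed to the authorized user.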

15 Citations

45 Claims

1. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
  
  receiving a stream of audio comprising the spoken user request;
  
  determining a first likelihood that the stream of audio comprises user speech;
  
  determining a second likelihood that the stream of audio comprises user speech spoken by an authorized user of the electronic device;
  
  weighting the first likelihood and the second likelihood;
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the weighted first likelihood and the weighted second likelihood; and
  
  identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
  
  processing the portion of the stream of audio to determine a corresponding task; and
  
  performing the task responsive to receiving the stream of audio.
  - 2. The method of claim 1, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 3. The method of claim 2, wherein the duration of the segment of audio is at least five times longer than the duration of the frame of audio.
  - 4. The method of claim 1, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      determining the likelihood that the frame of audio comprises user speech is performed prior to determining the likelihood that any segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user.
  - 5. The method of claim 1, wherein the first likelihood is based at least in part on an energy level of the stream of audio.
  - 6. The method of claim 1, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the stream of audio.
  - 7. The method of claim 6, further comprising updating the speech model based at least in part on a second portion of the stream of audio.
  - 8. The method of claim 1, wherein the authorized user is one of a plurality of authorized users of the electronic device.
  - 9. The method of claim 1, wherein weighting the first likelihood and the second likelihood is based on comparing the first likelihood to the second likelihood.
  - 10. The method of claim 1, further comprising:
    - combining the weighted first likelihood and the weighted second likelihood to obtain a combined likelihood score, wherein identifying the start-point or the end-point of the spoken user request is based on the combined likelihood score.
  - 11. The method of claim 1, wherein weighting the first likelihood and the second likelihood comprises weighting the second likelihood greater than the first likelihood in accordance with the first likelihood contradicting the second likelihood.
  - 12. The method of claim 1, wherein weighting the first likelihood and the second likelihood comprises weighting the second likelihood less than the first likelihood in accordance with the first likelihood confirming the second likelihood.
  - 13. The method of claim 1, wherein weighting the first likelihood and the second likelihood comprises weighting the second likelihood equal to the first likelihood.
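
One possible reading of the weighting described in claims 9 through 12 can be sketched as follows. The claims do not define how "contradicting" or "confirming" is measured, so this example assumes the two likelihoods contradict each other when they fall on opposite sides of a 0.5 decision threshold. The weights 0.25 and 0.75 and the names `weight_likelihoods` and `combined_score` are likewise hypothetical.

```python
from typing import Tuple


def weight_likelihoods(
    first: float, second: float, decision_threshold: float = 0.5
) -> Tuple[float, float]:
    """Return the weighted first and second likelihoods.

    Assumption: the likelihoods "contradict" when one is at or above the
    decision threshold and the other is below it.
    """
    first_says_speech = first >= decision_threshold
    second_says_user = second >= decision_threshold
    if first_says_speech != second_says_user:
        # Contradiction: weight the speaker-based likelihood more (claim 11).
        w_first, w_second = 0.25, 0.75
    else:
        # Confirmation: weight the speaker-based likelihood less (claim 12).
        w_first, w_second = 0.75, 0.25
    return w_first * first, w_second * second


def combined_score(first: float, second: float) -> float:
    """Sum the weighted likelihoods into one score (as in claim 10)."""
    weighted_first, weighted_second = weight_likelihoods(first, second)
    return weighted_first + weighted_second
```

With these assumed weights, loud speech from an unrecognized talker (first likelihood 0.9, second 0.1) scores about 0.3, while the same speech level from the authorized user (0.9 and 0.9) scores about 0.9.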

14. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
  
  receiving a stream of audio comprising the spoken user request;
  
  determining a first likelihood that the stream of audio comprises user speech based at least in part on an energy level of the stream of audio;
  
  in response to the energy level exceeding a threshold energy level for longer than a predetermined threshold duration, performing speaker authentication on the stream of audio to determine a second likelihood that the stream of audio comprises speech spoken by an authorized user of the electronic device;
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
  
  identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
  
  processing the portion of the stream of audio to determine a corresponding task; and
  
  performing the task responsive to receiving the stream of audio.
  - 15. The method of claim 14, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 16. The method of claim 15, wherein the duration of the segment of audio is at least five times longer than the duration of the frame of audio.
  - 17. The method of claim 14, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the stream of audio.
  - 18. The method of claim 17, further comprising updating the speech model using a second portion of the stream of audio.
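
The gating condition in claim 14 (speaker authentication runs only after the energy level has exceeded a threshold for longer than a threshold duration) might be sketched as below. Measuring duration in consecutive frames, computing energy as the mean squared sample value, and the names `frame_energy` and `gated_speaker_scores` are all assumptions for illustration. The `authenticate` callback stands in for whatever speaker-recognition model a device would use.

```python
from typing import Callable, List, Optional, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty frame of samples."""
    return sum(sample * sample for sample in frame) / len(frame)


def gated_speaker_scores(
    frames: Sequence[Sequence[float]],
    authenticate: Callable[[Sequence[float]], float],
    energy_threshold: float,
    min_frames: int,
) -> List[Optional[float]]:
    """Return one entry per frame: a speaker score once the energy has stayed
    above the threshold for more than min_frames consecutive frames,
    otherwise None (authentication is skipped)."""
    scores: List[Optional[float]] = []
    run_length = 0
    for frame in frames:
        if frame_energy(frame) > energy_threshold:
            run_length += 1
        else:
            run_length = 0  # the run of high-energy frames was interrupted
        if run_length > min_frames:
            scores.append(authenticate(frame))
        else:
            scores.append(None)
    return scores
```

Deferring authentication this way avoids spending the more expensive speaker-recognition step on brief noise bursts.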

19. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
  
  receiving a signal to begin recording an audio input, wherein the audio input comprises the spoken user request;
  
  determining a baseline energy level of the audio input based on an energy level of a first portion of the audio input;
  
  determining a first likelihood that the audio input comprises user speech based on an energy level of a second portion of the audio input;
  
  in response to the baseline energy level exceeding a threshold energy level, performing speaker authentication on the second portion of the audio input to determine a second likelihood that the audio input comprises speech spoken by an authorized user of the electronic device;
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
  
  identifying a portion of the audio input including the spoken user request using the start-point or the end-point;
  
  processing the portion of the audio input to determine a corresponding task; and
  
  performing the task responsive to receiving the audio input.
  - 20. The method of claim 19, wherein:
    - the audio input comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the audio input comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 21. The method of claim 20, wherein the duration of the segment of audio is at least five times longer than the duration of the frame of audio.
  - 22. The method of claim 19, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the audio input.
  - 23. The method of claim 22, further comprising updating the speech model using a third portion of the audio input.
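
The baseline-energy variant in claim 19 could look roughly like the sketch below: the first portion of the input sets a baseline, the second portion yields the first likelihood, and speaker authentication runs only when the baseline exceeds a threshold. The ratio used for the first likelihood, the names `mean_energy` and `analyze_input`, and the `authenticate` callback are assumptions for this example and are not taken from the patent.

```python
from typing import Callable, Optional, Sequence, Tuple


def mean_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty run of samples."""
    return sum(sample * sample for sample in samples) / len(samples)


def analyze_input(
    first_portion: Sequence[float],
    second_portion: Sequence[float],
    authenticate: Callable[[Sequence[float]], float],
    baseline_threshold: float,
) -> Tuple[float, Optional[float]]:
    """Return (first_likelihood, second_likelihood).

    second_likelihood is None when the baseline is at or below the
    threshold, meaning speaker authentication was not performed.
    """
    baseline = mean_energy(first_portion)
    energy = mean_energy(second_portion)
    total = energy + baseline
    # Assumed mapping: share of the energy attributable to the second portion.
    first_likelihood = energy / total if total > 0 else 0.0
    second_likelihood = (
        authenticate(second_portion) if baseline > baseline_threshold else None
    )
    return first_likelihood, second_likelihood
```

Under this reading, a high baseline suggests a noisy environment in which energy alone is unreliable, so the speaker-based likelihood is computed only in that case.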

24. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive a stream of audio comprising the spoken user request;
  
  determine a first likelihood that the stream of audio comprises user speech;
  
  determine a second likelihood that the stream of audio comprises user speech spoken by an authorized user;
  
  weight the first likelihood and the second likelihood;
  
  identify a start-point or an end-point of the spoken user request based at least in part on the weighted first likelihood and the weighted second likelihood; and
  
  identify a portion of the stream of audio including the spoken user request using the start-point or the end-point;
  
  process the portion of the stream of audio to determine a corresponding task; and
  
  perform the task responsive to receiving the stream of audio.
  - 25. The non-transitory computer-readable storage medium of claim 24, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 26. The non-transitory computer-readable storage medium of claim 24, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      determining the likelihood that the frame of audio comprises user speech is performed prior to determining the likelihood that any segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user.
  - 27. The non-transitory computer-readable storage medium of claim 24, wherein the first likelihood is based at least in part on an energy level of the stream of audio.
  - 28. The non-transitory computer-readable storage medium of claim 24, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the stream of audio.
  - 29. The non-transitory computer-readable storage medium of claim 28, further comprising instructions for causing the one or more processors to:
    - update the speech model based at least in part on a second portion of the stream of audio.
  - 30. The non-transitory computer-readable storage medium of claim 24, wherein the authorized user is one of a plurality of authorized users of the electronic device.
  - 31. The non-transitory computer-readable storage medium of claim 24, wherein weighting the first likelihood and the second likelihood is based on comparing the first likelihood to the second likelihood.
  - 32. The non-transitory computer-readable storage medium of claim 24, further comprising instructions for causing the one or more processors to:
    - combine the weighted first likelihood and the weighted second likelihood to obtain a combined likelihood score, wherein identifying the start-point or the end-point of the spoken user request is based on the combined likelihood score.

33. An electronic device comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving a stream of audio comprising the spoken user request;
  
  determining a first likelihood that the stream of audio comprises user speech;
  
  determining a second likelihood that the stream of audio comprises user speech spoken by an authorized user;
  
  weighting the first likelihood and the second likelihood;
  
  identifying a start-point or an end-point of the spoken user request based at least in part on the weighted first likelihood and the weighted second likelihood; and
  
  identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
  
  processing the portion of the stream of audio to determine a corresponding task; and
  
  performing the task responsive to receiving the stream of audio.
  - 34. The electronic device of claim 33, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 35. The electronic device of claim 33, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      determining the likelihood that the frame of audio comprises user speech is performed prior to determining the likelihood that any segment of audio of the plurality of segments of audio comprises user speech spoken by an authorized user.
  - 36. The electronic device of claim 33, wherein the first likelihood is based at least in part on an energy level of the stream of audio.
  - 37. The electronic device of claim 33, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the stream of audio.
  - 38. The electronic device of claim 37, wherein the one or more programs further include instructions for:
    - updating the speech model based at least in part on a second portion of the stream of audio.
  - 39. The electronic device of claim 33, wherein the authorized user is one of a plurality of authorized users of the electronic device.
  - 40. The electronic device of claim 33, wherein weighting the first likelihood and the second likelihood is based on comparing the first likelihood to the second likelihood.
  - 41. The electronic device of claim 33, wherein the one or more programs further include instructions for:
    - combining the weighted first likelihood and the weighted second likelihood to obtain a combined likelihood score, wherein identifying the start-point or the end-point of the spoken user request is based on the combined likelihood score.

42. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive a stream of audio comprising the spoken user request;
  
  determine a first likelihood that the stream of audio comprises user speech based at least in part on an energy level of the stream of audio;
  
  in response to the energy level exceeding a threshold energy level for longer than a predetermined threshold duration, perform speaker authentication on the stream of audio to determine a second likelihood that the stream of audio comprises speech spoken by an authorized user of the electronic device;
  
  identify the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
  
  identify a portion of the stream of audio including the spoken user request using the start-point or the end-point;
  
  process the portion of the stream of audio to determine a corresponding task; and
  
  perform the task responsive to receiving the stream of audio.

43. An electronic device comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving a stream of audio comprising the spoken user request;
  
  determining a first likelihood that the stream of audio comprises user speech based at least in part on an energy level of the stream of audio;
  
  in response to the energy level exceeding a threshold energy level for longer than a predetermined threshold duration, performing speaker authentication on the stream of audio to determine a second likelihood that the stream of audio comprises speech spoken by an authorized user of the electronic device;
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
  
  identifying a portion of the stream of audio including the spoken user request using the start-point or the end-point;
  
  processing the portion of the stream of audio to determine a corresponding task; and
  
  performing the task responsive to receiving the stream of audio.

44. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive a signal to begin recording an audio input, wherein the audio input comprises the spoken user request;
  
  determine a baseline energy level of the audio input based on an energy level of a first portion of the audio input;
  
  determine a first likelihood that the audio input comprises user speech based on an energy level of a second portion of the audio input;
  
  in response to the baseline energy level exceeding a threshold energy level, perform speaker authentication on the second portion of the audio input to determine a second likelihood that the audio input comprises speech spoken by an authorized user of the electronic device;
  
  identify the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
  
  identify a portion of the audio input including the spoken user request using the start-point or the end-point;
  
  process the portion of the audio input to determine a corresponding task; and
  
  perform the task responsive to receiving the audio input.

45. An electronic device comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving a signal to begin recording an audio input, wherein the audio input comprises the spoken user request;
  
  determining a baseline energy level of the audio input based on an energy level of a first portion of the audio input;
  
  determining a first likelihood that the audio input comprises user speech based on an energy level of a second portion of the audio input;
  
  in response to the baseline energy level exceeding a threshold energy level, performing speaker authentication on the second portion of the audio input to determine a second likelihood that the audio input comprises speech spoken by an authorized user of the electronic device;
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood; and
  
  identifying a portion of the audio input including the spoken user request using the start-point or the end-point;
  
  processing the portion of the audio input to determine a corresponding task; and
  
  performing the task responsive to receiving the audio input.


Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Naik, Devang K.; Kajarekar, Sachin
Primary Examiner(s): Rudolph, Vincent
Assistant Examiner(s): Brinich, Stephen
Application Number: US14/701,147
Publication Number: US 20150371665A1
Time in Patent Office: 1,363 Days
Field of Search: 704/246-255, 704/215, 704/1-10, 704/E17.002, 704/E15.007
US Class Current
CPC Class Codes:
G10L 17/00   Speaker identification or v...
G10L 17/22   Interactive procedures; Man...
G10L 25/78   Detection of presence or ab...
G10L 25/87   Detection of discrete point...
