ROBUST END-POINTING OF SPEECH SIGNALS USING SPEAKER RECOGNITION

US 20150371665A1
Filed: 04/30/2015
Published: 12/24/2015
Est. Priority Date: 06/19/2014
Status: Active Grant

Abstract

Systems and processes for robust end-pointing of speech signals using speaker recognition are provided. In one example process, a stream of audio having a spoken user request can be received. A first likelihood that the stream of audio includes user speech can be determined. A second likelihood that the stream of audio includes user speech spoken by an authorized user can be determined. A start-point or an end-point of the spoken user request can be determined based at least in part on the first likelihood and the second likelihood.
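The abstract does not say how the two likelihoods are combined. As a minimal sketch, assume both are available per frame and are fused by a weighted average, with the start-point and end-point taken as the first and last frames whose fused score reaches a threshold. The function names, the `speaker_weight` parameter, and the threshold value are illustrative assumptions, not the patented method.

```python
from typing import List, Optional, Tuple


def combine_likelihoods(speech: List[float], speaker: List[float],
                        speaker_weight: float = 0.5) -> List[float]:
    """Fuse two per-frame likelihoods with a weighted average (assumed rule)."""
    if len(speech) != len(speaker):
        raise ValueError("likelihood sequences must have equal length")
    w = speaker_weight
    return [(1.0 - w) * a + w * b for a, b in zip(speech, speaker)]


def find_endpoints(scores: List[float],
                   threshold: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices of the first and last frame at or above threshold."""
    active = [i for i, p in enumerate(scores) if p >= threshold]
    if not active:
        return None
    return active[0], active[-1]


# Frames 3-4 contain speech, but not from the authorized user (e.g. background talk).
speech = [0.1, 0.9, 0.9, 0.9, 0.9, 0.1]
speaker = [0.0, 0.9, 0.8, 0.0, 0.0, 0.0]

speech_only = find_endpoints(speech)
fused = find_endpoints(combine_likelihoods(speech, speaker))
```

In this toy input, end-pointing on the speech likelihood alone would extend the request through the background talk, while the fused score ends it where the authorized user stops speaking.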

24 Claims

1. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
  
  receiving a stream of audio comprising the spoken user request;
  
  determining a first likelihood that the stream of audio comprises user speech;
  
  determining a second likelihood that the stream of audio comprises user speech spoken by an authorized user of the electronic device; and
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood.
  - 2. The method of claim 1, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 3. The method of claim 2, wherein the duration of the segment of audio is at least five times longer than the duration of the frame of audio.
  - 4. The method of claim 1, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      determining the likelihood that the frame of audio comprises user speech is performed prior to determining the likelihood that any segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user.
  - 5. The method of claim 1, wherein the first likelihood is based at least in part on an energy level of the stream of audio.
  - 6. The method of claim 1, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the stream of audio.
  - 7. The method of claim 6, further comprising updating the speech model based at least in part on a portion of the stream of audio.
  - 8. The method of claim 1, wherein the authorized user is one of a plurality of authorized users of the electronic device.
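Claims 2, 3, and 5 describe two time scales: a per-frame speech likelihood that can depend on energy, and a per-segment speaker likelihood over a span at least five times the frame duration. The sketch below illustrates only that framing. The logistic mapping from energy to likelihood, the non-overlapping frames, and all function names are assumptions made for illustration; the claims do not specify them.

```python
import math
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def speech_likelihood(energy: float, midpoint: float, slope: float = 1.0) -> float:
    """Map a frame energy to (0, 1) with a logistic curve (assumed mapping)."""
    return 1.0 / (1.0 + math.exp(-slope * (energy - midpoint)))


def segment_frames(n_frames: int, frames_per_segment: int = 5) -> List[range]:
    """Group frame indices into full segments.

    With non-overlapping frames, frames_per_segment=5 makes each segment
    five frame durations long, the lower bound given in claim 3.
    """
    if frames_per_segment < 1:
        raise ValueError("frames_per_segment must be positive")
    last_start = n_frames - frames_per_segment
    return [range(s, s + frames_per_segment)
            for s in range(0, last_start + 1, frames_per_segment)]


samples = [0.0] * 4 + [1.0] * 4
energies = frame_energies(samples, frame_len=4)
likelihoods = [speech_likelihood(e, midpoint=0.5, slope=10.0) for e in energies]
segments = segment_frames(n_frames=12)
```

A speaker model would then score each segment as a whole, which gives it more audio to work with than a single frame provides.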

9. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
  
  receiving a stream of audio comprising the spoken user request;
  
  determining a first likelihood that the stream of audio comprises user speech based at least in part on an energy level of the stream of audio;
  
  in response to the energy level exceeding a threshold energy level for longer than a threshold duration, performing speaker authentication on the stream of audio to determine a second likelihood that the stream of audio comprises speech spoken by an authorized user of the electronic device; and
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood.
  - 10. The method of claim 9, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 11. The method of claim 10, wherein the duration of the segment of audio is at least five times longer than the duration of the frame of audio.
  - 12. The method of claim 9, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the stream of audio.
  - 13. The method of claim 12, further comprising updating the speech model using a portion of the stream of audio.
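Claim 9 triggers speaker authentication only once the energy level has exceeded a threshold for longer than a threshold duration. One way to picture that gate, assuming per-frame energies and a duration measured in consecutive frames, is sketched below. The `authenticate` callable stands in for a speaker-authentication routine and is hypothetical, as are the other names.

```python
from typing import Callable, Optional, Sequence


def first_sustained_run(energies: Sequence[float], threshold: float,
                        min_frames: int) -> Optional[int]:
    """Index of the frame where energy has stayed above threshold
    for more than min_frames consecutive frames, or None."""
    run = 0
    for i, energy in enumerate(energies):
        run = run + 1 if energy > threshold else 0
        if run > min_frames:
            return i
    return None


def gated_speaker_likelihood(energies: Sequence[float], threshold: float,
                             min_frames: int,
                             authenticate: Callable[[int], float]) -> Optional[float]:
    """Run the (hypothetical) authenticator only after a sustained energy run."""
    trigger = first_sustained_run(energies, threshold, min_frames)
    if trigger is None:
        return None  # authentication is skipped entirely
    return authenticate(trigger)


calls = []

def fake_authenticate(frame_index: int) -> float:
    """Placeholder scorer that records when it was invoked."""
    calls.append(frame_index)
    return 0.8

brief = gated_speaker_likelihood([0.1, 0.9, 0.1], 0.5, 3, fake_authenticate)
sustained = gated_speaker_likelihood(
    [0.1, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9], 0.5, 3, fake_authenticate)
```

Gating in this way keeps the comparatively expensive speaker check from running on short bursts of noise.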

14. A method for identifying a start-point or an end-point of a spoken user request, the method comprising:
- at an electronic device:
  
  receiving a signal to begin recording an audio input, wherein the audio input comprises the spoken user request;
  
  determining a baseline energy level of the audio input based on an energy level of a first portion of the audio input;
  
  determining a first likelihood that the audio input comprises user speech based on an energy level of a second portion of the audio input;
  
  in response to the baseline energy level exceeding a threshold energy level, performing speaker authentication on the second portion of the audio input to determine a second likelihood that the audio input comprises speech spoken by an authorized user of the electronic device; and
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood.
  - 15. The method of claim 14, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 16. The method of claim 15, wherein the duration of the segment of audio is at least five times longer than the duration of the frame of audio.
  - 17. The method of claim 14, wherein the second likelihood is based at least in part on a speech model of the authorized user, and wherein the speech model is based at least in part on speech of the authorized user received prior to receiving the stream of audio.
  - 18. The method of claim 17, further comprising updating the speech model using a portion of the audio input.
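Claim 14 estimates a baseline energy from a first portion of the audio input and performs speaker authentication on the second portion only when that baseline exceeds a threshold, which suggests a noisy environment. The sketch below is one possible reading. The energy-ratio formula for the first likelihood, the `authenticate` callable, and every name in it are assumptions rather than details taken from the claims.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Likelihoods(NamedTuple):
    baseline_energy: float
    first: float              # likelihood that the input contains user speech
    second: Optional[float]   # speaker likelihood, or None if not computed


def mean_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude; 0.0 for an empty span."""
    return sum(x * x for x in samples) / len(samples) if samples else 0.0


def analyze_input(audio: Sequence[float], baseline_len: int,
                  noise_threshold: float,
                  authenticate: Callable[[Sequence[float]], float]) -> Likelihoods:
    """Split the input into a baseline portion and a second portion."""
    first_portion = audio[:baseline_len]
    second_portion = audio[baseline_len:]
    baseline = mean_energy(first_portion)
    energy = mean_energy(second_portion)
    # Assumed mapping: share of total energy carried by the second portion.
    total = energy + baseline
    first = energy / total if total > 0 else 0.0
    # Speaker authentication runs only when the baseline is above the threshold.
    second = authenticate(second_portion) if baseline > noise_threshold else None
    return Likelihoods(baseline, first, second)


quiet = analyze_input([0.1] * 4 + [1.0] * 4, 4, 0.05, lambda s: 0.7)
noisy = analyze_input([0.5] * 4 + [1.0] * 4, 4, 0.05, lambda s: 0.7)
```

When the baseline is low, energy alone may be enough to end-point the request, so the speaker check can be skipped.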

19. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive a stream of audio comprising the spoken user request;
  
  determine a first likelihood that the stream of audio comprises user speech;
  
  determine a second likelihood that the stream of audio comprises user speech spoken by an authorized user; and
  
  identify the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood.
  - 20. The non-transitory computer-readable storage medium of claim 19, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 21. The non-transitory computer-readable storage medium of claim 19, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      determining the likelihood that the frame of audio comprises user speech is performed prior to determining the likelihood that any segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user.

22. An electronic device comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving a stream of audio comprising the spoken user request;
  
  determining a first likelihood that the stream of audio comprises user speech;
  
  determining a second likelihood that the stream of audio comprises user speech spoken by an authorized user; and
  
  identifying the start-point or the end-point of the spoken user request based at least in part on the first likelihood and the second likelihood.
  - 23. The electronic device of claim 22, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      a duration of the segment of audio is longer than a duration of the frame of audio.
  - 24. The electronic device of claim 22, wherein:
    - the stream of audio comprises a plurality of frames of audio;
      
      determining the first likelihood comprises determining a likelihood that a frame of audio of the plurality of frames of audio comprises user speech;
      
      the stream of audio comprises a plurality of segments of audio;
      
      determining the second likelihood comprises determining a likelihood that a segment of audio of the plurality of segments of audio comprises user speech spoken by the authorized user; and
      
      determining the likelihood that the frame of audio comprises user speech is performed prior to determining the likelihood that any segment of audio of the plurality of segments of audio comprises user speech spoken by an authorized user.

Current Assignee
Apple Inc.
Original Assignee
Apple Inc.
Inventors
NAIK, Devang K., KAJAREKAR, Sachin

Granted Patent

US 10,186,282 B2
US Class Current

1/1
CPC Class Codes

G10L 17/00   Speaker identification or v...

G10L 17/22   Interactive procedures; Man...

G10L 25/78   Detection of presence or ab...

G10L 25/87   Detection of discrete point...
