ANCHORED SPEECH DETECTION AND SPEECH RECOGNITION
First Claim
1. A computer implemented method for identifying speech from a desired speaker for automatic speech recognition (ASR), the method comprising:
receiving audio data corresponding to speech, the audio data comprising a plurality of audio frames;
processing the plurality of audio frames to determine a first plurality of audio feature vectors corresponding to a first portion of the audio data and a second plurality of audio feature vectors corresponding to a second portion of the audio data;
determining that the first plurality of audio feature vectors corresponds to a wakeword;
processing the first plurality of audio feature vectors with a recurrent neural network encoder to determine a reference feature vector corresponding to speech from a desired speaker;
processing the second plurality of audio feature vectors, and the reference feature vector, using a neural-network classifier to determine a first score corresponding to a first audio feature vector in the second plurality, the first score corresponding to a likelihood that the first audio feature vector corresponds to audio spoken by the desired speaker;
determining that the first score is above a threshold;
creating an indication that the first audio feature vector corresponds to speech from the desired speaker;
determining a first weight corresponding to the first audio feature vector based on the first audio feature vector corresponding to speech from the desired speaker; and
performing ASR using the first weight and the first audio feature vector.
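The final steps of the claim (score vs. threshold, indication, weight, weighted ASR input) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the scores, the 0.5 threshold, and the down-weighting factor are all hypothetical, and a real system would feed the weighted features into an ASR decoder rather than just computing them.

```python
import numpy as np

def weights_from_scores(scores, threshold=0.5):
    """For each per-frame score (likelihood that the frame was spoken by
    the desired speaker), create a desired-speech indication and a
    per-frame weight. Frames marked as desired speech keep full weight;
    other frames are down-weighted (factor is a hypothetical choice) so
    ASR can focus its processing on the desired speech."""
    indications = scores > threshold           # True = desired speaker
    weights = np.where(indications, 1.0, 0.1)  # hypothetical weighting scheme
    return indications, weights

# hypothetical classifier scores for five audio feature vectors
scores = np.array([0.92, 0.88, 0.31, 0.75, 0.12])
frames = np.ones((5, 40))                      # placeholder 40-dim features

indications, weights = weights_from_scores(scores)
weighted_frames = frames * weights[:, None]    # features passed on to ASR
print(indications)   # [ True  True False  True False]
```

Frames 3 and 5 fall below the threshold, so their features are attenuated before ASR rather than discarded outright.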
Abstract
A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.
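The abstract's key idea is that a recurrent encoder maps a variable-length stretch of reference speech (e.g. the wakeword) to one fixed-size reference feature vector. As a rough sketch only, assuming a plain Elman-style RNN with random untrained weights standing in for the trained RNN encoder, and invented dimensions (40 input features, a 16-dimensional hidden state):

```python
import numpy as np

def rnn_encode(frames, Wx, Wh, b):
    """Encode a variable-length sequence of audio feature vectors into a
    single fixed-size reference feature vector with a simple Elman RNN;
    the final hidden state serves as the reference vector."""
    h = np.zeros(Wh.shape[0])
    for x in frames:                       # one feature vector per audio frame
        h = np.tanh(Wx @ x + Wh @ h + b)   # recurrent update
    return h

rng = np.random.default_rng(0)
feat_dim, hid_dim = 40, 16                 # hypothetical dimensions
Wx = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
Wh = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
b = np.zeros(hid_dim)

wakeword_frames = rng.normal(size=(25, feat_dim))  # e.g. ~25 wakeword frames
reference_vector = rnn_encode(wakeword_frames, Wx, Wh, b)
print(reference_vector.shape)   # (16,) regardless of how many frames came in
```

The fixed-size output is what makes the scheme practical: however long the wakeword or configuration utterance is, the downstream classifier always receives a reference vector of the same shape.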
301 Citations
21 Claims
1. (Independent claim, recited in full above under First Claim.) Dependent claims: 2 and 3.
4. A computer implemented method comprising:
receiving input audio data;
identifying reference audio data;
processing the reference audio data with a recurrent neural network to determine a reference feature vector; and
processing a portion of the input audio data and the reference feature vector using a classifier to determine whether the portion corresponds to speech from a same speaker as the reference audio data.
Dependent claims: 5 through 12.
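Claim 4's classifier consumes a portion of the input audio together with the reference feature vector and decides whether they come from the same speaker. A minimal stand-in, assuming a single logistic unit over the concatenated vectors in place of the trained neural-network classifier (weights here are untrained and only illustrate the shapes involved):

```python
import numpy as np

def same_speaker_score(frame_vec, reference_vec, w, b):
    """Score one input frame against the reference feature vector using a
    logistic classifier over their concatenation (a stand-in for the
    trained neural-network classifier of the claims)."""
    z = w @ np.concatenate([frame_vec, reference_vec]) + b
    return 1.0 / (1.0 + np.exp(-z))            # likelihood-style score in (0, 1)

rng = np.random.default_rng(1)
feat_dim, ref_dim = 40, 16                     # hypothetical dimensions
w = rng.normal(scale=0.1, size=feat_dim + ref_dim)  # untrained illustration
b = 0.0

frame = rng.normal(size=feat_dim)
reference = rng.normal(size=ref_dim)
score = same_speaker_score(frame, reference, w, b)
print(0.0 < score < 1.0)   # True -- a valid likelihood-style score
```

A real system would apply this per frame (per the abstract's frame-by-frame labeling) and compare each score against a threshold, as in claim 1.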
13. A computing system comprising:
at least one processor; and
a memory device including instructions operable to be executed by the at least one processor to configure the system to:
receive input audio data;
identify reference audio data;
process the reference audio data with a recurrent neural network to determine a reference feature vector; and
process a portion of the input audio data and the reference feature vector using a classifier to determine whether the portion corresponds to speech from a same speaker as the reference audio data.
Dependent claims: 14 through 21.
Specification