ANCHORED SPEECH DETECTION AND SPEECH RECOGNITION
First Claim
1. A computer implemented method for identifying speech from a desired speaker for automatic speech recognition (ASR), the method comprising:
receiving audio data corresponding to speech, the audio data comprising a plurality of audio frames;
processing the plurality of audio frames to determine a first plurality of audio feature vectors corresponding to a first portion of the audio data and a second plurality of audio feature vectors corresponding to a second portion of the audio data;
determining that the first plurality of audio feature vectors corresponds to a wakeword;
processing the first plurality of audio feature vectors with a recurrent neural network encoder to determine a reference feature vector corresponding to speech from a desired speaker;
processing the second plurality of audio feature vectors, and the reference feature vector, using a neural-network classifier to determine a first score corresponding to a first audio feature vector in the second plurality, the first score corresponding to a likelihood that the first audio feature vector corresponds to audio spoken by the desired speaker;
determining that the first score is above a threshold;
creating an indication that the first audio feature vector corresponds to speech from the desired speaker;
determining a first weight corresponding to the first audio feature vector based on the first audio feature vector corresponding to speech from the desired speaker; and
performing ASR using the first weight and the first audio feature vector.
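The final steps of the claim (score vs. threshold, indication, weight, weighted ASR input) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the scores, the 0.5 threshold, and the down-weighting factor are all hypothetical, and a real system would feed the weighted features into an ASR decoder rather than just computing them.

```python
import numpy as np

def weights_from_scores(scores, threshold=0.5):
    """For each per-frame score (likelihood that the frame was spoken by
    the desired speaker), create a desired-speech indication and a
    per-frame weight. Frames marked as desired speech keep full weight;
    other frames are down-weighted (factor is a hypothetical choice) so
    ASR can focus its processing on the desired speech."""
    indications = scores > threshold           # True = desired speaker
    weights = np.where(indications, 1.0, 0.1)  # hypothetical weighting scheme
    return indications, weights

# hypothetical classifier scores for five audio feature vectors
scores = np.array([0.92, 0.88, 0.31, 0.75, 0.12])
frames = np.ones((5, 40))                      # placeholder 40-dim features

indications, weights = weights_from_scores(scores)
weighted_frames = frames * weights[:, None]    # features passed on to ASR
print(indications)   # [ True  True False  True False]
```

Frames 3 and 5 fall below the threshold, so their features are attenuated before ASR rather than discarded outright.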
Abstract
A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.
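The abstract's key idea is that a recurrent encoder maps a variable-length stretch of reference speech (e.g. the wakeword) to one fixed-size reference feature vector. As a rough sketch only, assuming a plain Elman-style RNN with random untrained weights standing in for the trained RNN encoder, and invented dimensions (40 input features, a 16-dimensional hidden state):

```python
import numpy as np

def rnn_encode(frames, Wx, Wh, b):
    """Encode a variable-length sequence of audio feature vectors into a
    single fixed-size reference feature vector with a simple Elman RNN;
    the final hidden state serves as the reference vector."""
    h = np.zeros(Wh.shape[0])
    for x in frames:                       # one feature vector per audio frame
        h = np.tanh(Wx @ x + Wh @ h + b)   # recurrent update
    return h

rng = np.random.default_rng(0)
feat_dim, hid_dim = 40, 16                 # hypothetical dimensions
Wx = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
Wh = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
b = np.zeros(hid_dim)

wakeword_frames = rng.normal(size=(25, feat_dim))  # e.g. ~25 wakeword frames
reference_vector = rnn_encode(wakeword_frames, Wx, Wh, b)
print(reference_vector.shape)   # (16,) regardless of how many frames came in
```

The fixed-size output is what makes the scheme practical: however long the wakeword or configuration utterance is, the downstream classifier always receives a reference vector of the same shape.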
301 Citations
21 Claims
1. (Independent claim, recited in full above under First Claim.) Dependent claims: 2 and 3.
4. A computer implemented method comprising:
receiving input audio data;
identifying reference audio data;
processing the reference audio data with a recurrent neural network to determine a reference feature vector; and
processing a portion of the input audio data and the reference feature vector using a classifier to determine whether the portion corresponds to speech from a same speaker as the reference audio data.
Dependent claims: 5 through 12.
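Claim 4's classifier consumes a portion of the input audio together with the reference feature vector and decides whether they come from the same speaker. A minimal stand-in, assuming a single logistic unit over the concatenated vectors in place of the trained neural-network classifier (weights here are untrained and only illustrate the shapes involved):

```python
import numpy as np

def same_speaker_score(frame_vec, reference_vec, w, b):
    """Score one input frame against the reference feature vector using a
    logistic classifier over their concatenation (a stand-in for the
    trained neural-network classifier of the claims)."""
    z = w @ np.concatenate([frame_vec, reference_vec]) + b
    return 1.0 / (1.0 + np.exp(-z))            # likelihood-style score in (0, 1)

rng = np.random.default_rng(1)
feat_dim, ref_dim = 40, 16                     # hypothetical dimensions
w = rng.normal(scale=0.1, size=feat_dim + ref_dim)  # untrained illustration
b = 0.0

frame = rng.normal(size=feat_dim)
reference = rng.normal(size=ref_dim)
score = same_speaker_score(frame, reference, w, b)
print(0.0 < score < 1.0)   # True -- a valid likelihood-style score
```

A real system would apply this per frame (per the abstract's frame-by-frame labeling) and compare each score against a threshold, as in claim 1.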
13. A computing system comprising:
at least one processor; and
a memory device including instructions operable to be executed by the at least one processor to configure the system to:
receive input audio data;
identify reference audio data;
process the reference audio data with a recurrent neural network to determine a reference feature vector; and
process a portion of the input audio data and the reference feature vector using a classifier to determine whether the portion corresponds to speech from a same speaker as the reference audio data.
Dependent claims: 14 through 21.
Specification