Duration ratio modeling for improved speech recognition
Abstract
In speech recognition, the duration of a phoneme is taken into account when determining recognition scores. Specifically, the duration of a phoneme may be evaluated relative to the duration of neighboring phonemes. A phoneme that is interpreted to be significantly longer or shorter than its neighbors may be given a lower duration score. A duration score for a phoneme may be calculated and used to adjust a recognition score. In this manner a duration model may supplement an acoustic model and language model to improve speech recognition results.
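The idea in the abstract can be illustrated with a minimal sketch. It assumes a log-ratio penalty against the mean duration of the other phonemes in the hypothesis and a simple weighted sum of scores; the function names, the `sigma` tolerance, and `duration_weight` are all hypothetical and are not taken from the patent.

```python
import math


def duration_score(durations, index, sigma=0.5):
    """Score one phoneme's duration relative to the other phonemes.

    durations: per-phoneme durations in seconds for one hypothesis.
    index: position of the phoneme being scored.
    sigma: assumed tolerance in log-ratio units (illustrative value).
    Returns 0.0 for a phoneme as long as its neighbors on average and an
    increasingly negative value the longer or shorter it is.
    """
    neighbors = [d for i, d in enumerate(durations) if i != index]
    if not neighbors:
        return 0.0  # nothing to compare against, so no penalty
    mean_neighbor = sum(neighbors) / len(neighbors)
    log_ratio = math.log(durations[index] / mean_neighbor)
    return -0.5 * (log_ratio / sigma) ** 2


def recognition_score(acoustic, language, duration, duration_weight=1.0):
    """Combine log-domain scores; the duration term supplements the others."""
    return acoustic + language + duration_weight * duration
```

Working in the log domain makes the penalty symmetric: a phoneme twice as long as its neighbors is penalized as much as one half as long.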
27 Claims
1. A method of recognizing speech in audio data, the method comprising:
receiving audio data representing speech, wherein the receiving is performed by an automated speech recognition (ASR) device configured to convert the audio data to text data, the ASR device comprising an ASR module;
transforming, by the ASR module, the audio data into one or more feature vectors representing the speech;
identifying, by the ASR module and using a portion of the one or more feature vectors, a sequence of phonemes represented in a portion of the audio data;
determining, by the ASR module, a first duration of the sequence of phonemes;
determining, by the ASR module, a second duration of a single phoneme within the sequence of phonemes;
determining, by the ASR module, a duration score of the single phoneme, wherein the duration score is determined using the second duration in relation to the first duration;
determining, by the ASR module, a recognition score based at least in part on the duration score;
determining, by the ASR module, a speech recognition result based at least in part upon the recognition score, wherein the speech recognition result is the text data corresponding to the speech; and
causing a command to be executed using the text data.
Dependent claims: 2, 3
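The duration steps of claim 1 can be sketched as follows. This is an illustrative reading, not the claimed implementation: it assumes a phoneme alignment with start and end times is already available, and it scores the ratio of the single phoneme's duration (the second duration) to the whole sequence's duration (the first duration) under a Gaussian. The `Segment` type, the `RATIO_MODEL` table, and its values are hypothetical.

```python
import math
from typing import NamedTuple


class Segment(NamedTuple):
    phoneme: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical (mean, standard deviation) of each phoneme's share of the
# sequence duration. Illustrative values, not taken from the patent.
RATIO_MODEL = {"k": (0.25, 0.10), "ae": (0.40, 0.10), "t": (0.35, 0.10)}


def duration_ratio(segments, index):
    """Second duration (one phoneme) in relation to first duration (sequence)."""
    first_duration = segments[-1].end - segments[0].start
    second_duration = segments[index].end - segments[index].start
    return second_duration / first_duration


def duration_score(segments, index):
    """Gaussian log-likelihood of the observed ratio under the assumed model."""
    mean, std = RATIO_MODEL[segments[index].phoneme]
    z = (duration_ratio(segments, index) - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def recognition_score(acoustic_log_prob, segments, weight=1.0):
    """Recognition score based in part on the per-phoneme duration scores."""
    total = sum(duration_score(segments, i) for i in range(len(segments)))
    return acoustic_log_prob + weight * total
```

Because the score depends on a ratio rather than an absolute duration, it is insensitive to overall speaking rate: a slow and a fast rendition of the same word give similar ratios.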
4. A method, comprising:
receiving audio data, wherein the receiving is performed by at least one automated speech recognition (ASR) device configured to convert the audio data into text data, the at least one ASR device comprising at least one ASR module;
determining, by the at least one ASR module, a sequence of speech units represented in a portion of the audio data;
determining, by the at least one ASR module, a first duration of the sequence of speech units;
determining, by the at least one ASR module and using the first duration, an expected duration of a single speech unit in the sequence of speech units;
determining, by the at least one ASR module, a second duration of the single speech unit;
determining, by the at least one ASR module, a duration score of the single speech unit, the duration score corresponding to the second duration in relation to the expected duration;
determining, by the at least one ASR module, a speech recognition result based at least in part on the duration score of the single speech unit, wherein the speech recognition result is the text data corresponding to the received audio data representing speech; and
causing a command to be executed using the text data.
Dependent claims: 5, 6, 7, 8, 9, 10, 11
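Claim 4 differs from claim 1 in that an expected duration of the single speech unit is first derived from the duration of the whole sequence. One way this could look, as a sketch under stated assumptions: the expected duration is the sequence duration multiplied by the unit's share of a table of typical durations, and the result is chosen by rescoring a list of hypotheses. The `TYPICAL` table, the absolute-log-ratio penalty, and the hypothesis tuple layout are all hypothetical.

```python
import math

# Hypothetical typical durations (seconds) per speech unit; illustrative only.
TYPICAL = {"k": 0.07, "ae": 0.12, "t": 0.08, "p": 0.08}


def expected_duration(units, index, first_duration):
    """Scale the unit's typical share of the sequence by the observed total."""
    total_typical = sum(TYPICAL[u] for u in units)
    return first_duration * TYPICAL[units[index]] / total_typical


def duration_score(second_duration, expected):
    """Zero when observed equals expected, more negative as they diverge."""
    return -abs(math.log(second_duration / expected))


def rescore(hypotheses, weight=1.0):
    """Pick the best hypothesis after adding weighted duration scores.

    hypotheses: iterable of (text, base_score, units, durations), where
    base_score is a log-domain score from the acoustic and language models.
    Returns (text, total_score) for the highest-scoring hypothesis.
    """
    best = None
    for text, base_score, units, durations in hypotheses:
        first_duration = sum(durations)  # duration of the whole sequence
        total = base_score
        for i, second_duration in enumerate(durations):
            expected = expected_duration(units, i, first_duration)
            total += weight * duration_score(second_duration, expected)
        if best is None or total > best[1]:
            best = (text, total)
    return best
```

With `weight=0.0` the duration model is ignored and the base score alone decides, which makes it easy to compare results with and without the duration term.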
12. A computing device configured to convert audio data to text data, comprising:
an audio capture device configured to receive audio data, the received audio data representing spoken utterances;
an automatic speech recognition (ASR) module configured to transform the received audio data into a sequence of speech units represented in the audio data;
at least one processor;
a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
to determine a first duration of the sequence of speech units of the received audio data;
to determine, using the first duration, an expected duration of a single speech unit of the received audio data in the sequence of speech units;
to determine, by the ASR module, a second duration of the single speech unit;
to determine a duration score of the single speech unit, the duration score corresponding to the second duration in relation to the expected duration;
to determine a speech recognition result based at least in part on the duration score of the single speech unit, wherein the speech recognition result is the text data corresponding to the received audio data representing speech; and
to cause a command to be executed using the text data.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
20. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
program code to cause an automated speech recognition (ASR) module, configured to convert audio data to text data, of the computing device to transform the audio data received by the computing device into a sequence of speech units represented in the audio data;
program code to determine a first duration of the sequence of speech units of the received audio data;
program code to determine, using the first duration, an expected duration of a single speech unit of the received audio data in the sequence of speech units;
program code to determine, by the ASR module, a second duration of the single speech unit;
program code to determine a duration score of the single speech unit, the duration score corresponding to the second duration in relation to the expected duration; and
program code to determine a speech recognition result based at least in part on the duration score of the single speech unit, wherein the speech recognition result is the text data corresponding to the received audio data representing speech; and
program code to cause a command to be executed using the text data.
Dependent claims: 21, 22, 23, 24, 25, 26, 27
Specification