Systems and methods for speech transcription
First Claim
1. A computer-implemented method for transcribing speech comprising:
receiving an input audio from a user;
normalizing the input audio to make a total power of the input audio consistent with a set of training samples used to train a trained neural network;
generating a jitter set of audio files from the normalized input audio by translating the normalized input audio by one or more time values;
for each audio file from the jitter set of audio files, which includes the normalized input audio:
generating a set of spectrogram frames for each audio file;
inputting the set of spectrogram frames into a trained neural network;
obtaining predicted character probabilities outputs from the trained neural network; and
decoding a transcription of the input audio using the predicted character probabilities outputs from the trained neural network constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words.
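The first two preprocessing steps of the claim can be sketched in a few lines. This is a minimal illustration, not the patented implementation: `target_power` stands in for a statistic of the training set, the jitter offsets are arbitrary example values, and a circular shift stands in for time translation (a real system would pad or crop instead).

```python
import numpy as np

def normalize_power(audio, target_power=0.01):
    # Scale the waveform so its mean squared amplitude matches the
    # power level of the training samples. `target_power` is a
    # hypothetical training-set statistic.
    power = np.mean(audio ** 2)
    return audio * np.sqrt(target_power / power)

def jitter_set(audio, offsets=(-2, 0, 2)):
    # Build the "jitter set": copies of the audio translated by a few
    # time values (here, sample offsets). Offset 0 keeps the original
    # normalized audio, so the set includes it, as the claim requires.
    # np.roll is a circular shift, used here only for illustration.
    return [np.roll(audio, k) for k in offsets]
```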
1 Assignment
0 Petitions
Abstract
Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allow a large amount of varied training data to be obtained efficiently. Embodiments of the system can also handle challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
36 Citations
20 Claims
1. (Reproduced above under First Claim.) Dependent claims: 2, 3, 4, 5, 6, 7, 9.
8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising:
receiving an input audio from a user;
generating a set of spectrogram frames from the input audio;
inputting the set of spectrogram frames into a set of trained neural networks;
obtaining predicted character probabilities outputs from the set of trained neural networks; and
decoding a transcription of the input audio using the predicted character probabilities outputs from the set of trained neural networks constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words.
Dependent claims: 10, 11, 12.
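Claim 8 recites a "set of trained neural networks," i.e. an ensemble whose per-frame character probabilities are combined before decoding. The sketch below shows one plausible combination (simple averaging) followed by greedy best-path CTC collapsing; both are illustrative assumptions, since the claim's actual decoder is constrained by a language model, and the blank index and alphabet layout are invented for the example.

```python
import numpy as np

BLANK = 0  # assumed CTC blank index in the example alphabet

def average_probs(prob_matrices):
    # Combine an ensemble of networks by averaging their per-frame
    # character probability outputs (each matrix is frames x chars).
    return np.mean(np.stack(prob_matrices), axis=0)

def greedy_decode(probs, alphabet):
    # Pick the most likely character per frame, then collapse repeats
    # and drop blanks (best-path CTC decoding). A full system would
    # instead beam-search under a language-model constraint.
    best = probs.argmax(axis=1)
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)
```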
13. A computer-implemented method for transcribing speech comprising:
receiving an input audio from a user;
generating a set of spectrogram frames for the input audio;
inputting the set of spectrogram frames into a trained neural network;
obtaining predicted character probabilities outputs from the trained neural network; and
decoding a transcription of the input audio using the predicted character probabilities outputs from the trained neural network constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words.
Dependent claims: 14, 15, 16, 17, 18, 19, 20.
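The "set of spectrogram frames" recited in the claims can be produced by a short-time Fourier transform over the waveform. The sketch below is a generic STFT, not the patent's specific front end; the frame length and hop (20 ms and 10 ms at an assumed 16 kHz sample rate) are example values.

```python
import numpy as np

def spectrogram_frames(audio, frame_len=320, hop=160):
    # Slice the waveform into overlapping windowed frames and take the
    # log magnitude of each frame's FFT, yielding one spectrogram
    # column per frame. 320/160 samples correspond to 20 ms / 10 ms
    # at an assumed 16 kHz sample rate.
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        chunk = audio[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(chunk))
        frames.append(np.log(spectrum + 1e-10))  # epsilon avoids log(0)
    return np.stack(frames)
```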
Specification