VARIOUS APPARATUS AND METHODS FOR A SPEECH RECOGNITION SYSTEM
First Claim
1. A continuous speech recognition engine, comprising
front-end filters and sound data parsers configured to convert a supplied audio file of a continuous voice communication into a time coded sequence of sound feature frames for speech recognition;
a fine speech recognizer model having an input to receive the time coded sequence of sound feature frames from the front-end filters as an input, where the fine speech recognizer model applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames;
a coarse sound representation generator having an input to receive both
1) start and stop times for a time segment associated with the recognized word from the fine model speech recognizer and
2) a series of identified individual phonemes from a phoneme decoder as inputs, where the coarse sound representation generator outputs the series of identified individual phonemes occurring within the duration of the start and stop times of the recognized word as a coarse sound representation of the recognized word;
a coarse match generator having an input to receive the coarse sound representation of the recognized word from the coarse sound representation generator as an input as well as the recognized word from the fine model speech recognizer, wherein the coarse match generator then determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word, where the coarse match generator assigns the likelihood as a robust confidence level parameter to that recognized word from the fine speech recognizer model and includes the start and stop time codes of the recognized word from the common time line with the supplied audio file, wherein each word in the supplied audio file is stored in a memory with a robust confidence level parameter and the start and stop time codes from the common time line; and
a user interface configured to allow speech data analytics on each word in the supplied audio file of continuous voice communications stored in the memory based on the robust confidence level parameter.
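The coarse sound representation limitation above can be illustrated with a short sketch. This is not the patent's implementation: the `Phoneme` record, with per-phoneme timestamps on the common time line, is an assumed output format for a hypothetical phoneme decoder, and the containment test is one plausible way to select the phonemes falling within a recognized word's start and stop times.

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str   # e.g. an ARPAbet symbol from a hypothetical phoneme decoder
    start: float  # seconds on the common time line of the supplied audio file
    end: float

def coarse_sound_representation(word_start: float, word_end: float,
                                phonemes: list[Phoneme]) -> list[str]:
    """Collect the decoded phonemes whose time spans fall within the
    recognized word's start/stop times, forming its coarse sound
    representation."""
    return [p.symbol for p in phonemes
            if p.start >= word_start and p.end <= word_end]

# Illustrative phoneme stream for an utterance beginning "hello w..."
phonemes = [Phoneme("HH", 0.10, 0.18), Phoneme("EH", 0.18, 0.30),
            Phoneme("L", 0.30, 0.42), Phoneme("OW", 0.42, 0.55),
            Phoneme("W", 0.60, 0.70)]

# Recognized word "hello" spans 0.10–0.55 s on the common time line.
print(coarse_sound_representation(0.10, 0.55, phonemes))
# ['HH', 'EH', 'L', 'OW']
```

Note that the phoneme at 0.60–0.70 s is excluded because it lies outside the recognized word's time segment.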
2 Assignments
0 Petitions
Abstract
A method, apparatus, and system are described for a continuous speech recognition engine that includes a fine speech recognizer model, a coarse sound representation generator, and a coarse match generator. The fine speech recognizer model receives a time coded sequence of sound feature frames, applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames. The coarse sound representation generator generates a coarse sound representation of the recognized word. The coarse match generator determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word and assigns the likelihood as a robust confidence level parameter to that recognized word.
310 Citations
20 Claims
1. A continuous speech recognition engine, comprising
front-end filters and sound data parsers configured to convert a supplied audio file of a continuous voice communication into a time coded sequence of sound feature frames for speech recognition;
a fine speech recognizer model having an input to receive the time coded sequence of sound feature frames from the front-end filters as an input, where the fine speech recognizer model applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames;
a coarse sound representation generator having an input to receive both
1) start and stop times for a time segment associated with the recognized word from the fine model speech recognizer and
2) a series of identified individual phonemes from a phoneme decoder as inputs, where the coarse sound representation generator outputs the series of identified individual phonemes occurring within the duration of the start and stop times of the recognized word as a coarse sound representation of the recognized word;
a coarse match generator having an input to receive the coarse sound representation of the recognized word from the coarse sound representation generator as an input as well as the recognized word from the fine model speech recognizer, wherein the coarse match generator then determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word, where the coarse match generator assigns the likelihood as a robust confidence level parameter to that recognized word from the fine speech recognizer model and includes the start and stop time codes of the recognized word from the common time line with the supplied audio file, wherein each word in the supplied audio file is stored in a memory with a robust confidence level parameter and the start and stop time codes from the common time line; and
a user interface configured to allow speech data analytics on each word in the supplied audio file of continuous voice communications stored in the memory based on the robust confidence level parameter. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
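The claim does not specify how the coarse sound representation is compared against the database containing the known sound of the word. A minimal sketch, under two stated assumptions: the database stores each word's known sound as a phoneme sequence, and similarity is measured by normalized edit distance. Both choices are illustrative, not the patented method.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

def robust_confidence(coarse: list[str], known: list[str]) -> float:
    """Map the distance to a 0..1 robust confidence level parameter,
    where 1.0 means the coarse sound representation exactly matches
    the known sound of the recognized word."""
    denom = max(len(coarse), len(known), 1)
    return 1.0 - edit_distance(coarse, known) / denom

# Hypothetical database entry: known sound of the word "cat" as phonemes.
KNOWN_SOUNDS = {"cat": ["K", "AE", "T"]}

print(robust_confidence(["K", "AE", "T"], KNOWN_SOUNDS["cat"]))  # 1.0
```

A decoder that heard ["K", "AA", "T"] instead would score roughly 0.67 against the same entry, giving that instance of the word a lower robust confidence level parameter.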
11. A system, comprising:
a continuous speech recognition engine that includes front-end filters and sound data parsers configured to convert a supplied audio file of a continuous voice communication, as opposed to a paused voice command communication, into a time coded sequence of sound feature frames for speech recognition;
a fine speech recognizer model having an input to receive the time coded sequence of sound feature frames from the front-end filters as an input, where the fine speech recognizer model applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames;
a coarse sound representation generator having an input to receive both
1) start and stop times for a time segment associated with the recognized word from the fine model speech recognizer and
2) a series of identified individual phonemes from a phoneme decoder as inputs, where the coarse sound representation generator outputs the series of identified individual phonemes occurring within the duration of the start and stop times of the recognized word as a coarse sound representation of the recognized word;
a coarse match generator having an input to receive the coarse sound representation of the recognized word from the coarse sound representation generator as an input as well as the recognized word from the fine model speech recognizer, wherein the coarse match generator then determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word, where the coarse match generator assigns the likelihood as a robust confidence level parameter to that recognized word from the fine speech recognizer model and includes the start and stop time codes of the recognized word from the common time line with the supplied audio file;
wherein each recognized word from the continuous speech recognition engine has a robust confidence level parameter associated with that recognized word, and each time the same recognized word is uttered in the supplied audio file, each instance of the recognized word can have its own robust confidence level parameter for that instance, which can differ in robust confidence level from another instance of the recognized word uttered in the same supplied audio file;
a user interface configured to allow speech data analytics on each word in the supplied audio file stored in the memory based on the robust confidence level parameter, wherein the user interface has an input to receive the supplied audio files from a client machine over a wide area network and supply the supplied audio files to the front-end filters;
a server to host the continuous speech recognition engine;
a database to store each word in the supplied audio file with its assigned robust confidence level parameter and the start and stop time code from the common time line; and
an intelligence engine configured to assign a higher weight to recognized words with a robust confidence level above a threshold than recognized words below the threshold, and use the weight for the recognized words when queries are made with the user interface. - View Dependent Claims (12, 13, 14, 15, 16)
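The intelligence engine's weighting of query results can be sketched as follows. The stored-word record layout, the 0.8 threshold, and the weight values are illustrative assumptions; the claim only requires that words above the threshold receive a higher weight than words below it.

```python
from dataclasses import dataclass

@dataclass
class StoredWord:
    text: str
    confidence: float  # robust confidence level parameter for this instance
    start: float       # start time code on the common time line
    stop: float        # stop time code on the common time line

def query_score(transcript: list[StoredWord], term: str,
                threshold: float = 0.8,
                hi_weight: float = 1.0, lo_weight: float = 0.25) -> float:
    """Sum weights for every instance of the query term, giving a higher
    weight to instances whose robust confidence level is above the
    threshold than to instances below it."""
    return sum(hi_weight if w.confidence >= threshold else lo_weight
               for w in transcript if w.text == term)

# Two instances of the same word with different confidence levels,
# as claim 11 contemplates.
transcript = [StoredWord("refund", 0.95, 1.2, 1.6),
              StoredWord("refund", 0.40, 7.8, 8.1),
              StoredWord("please", 0.90, 1.7, 2.0)]

print(query_score(transcript, "refund"))  # 1.25
```

The confidently recognized "refund" contributes 1.0 and the doubtful instance only 0.25, so audio files where the query term was clearly heard rank above files where it was a marginal guess.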
17. A method for continuous speech recognition that uses robustness as a confidence measure for words output by a speech recognition system as a measure of how confident the system is that each individual word was correctly identified to either or both 1) a database of spoken words and 2) one or more language models, comprising:
converting a supplied audio file of a continuous voice communication, as opposed to a paused voice command communication, into a time coded sequence of sound feature frames for speech recognition;
receiving the time coded sequence of sound feature frames and applying a speech recognition process to the sound feature frames to determine at least a best guess at a recognizable word that corresponds to the sequence of sound feature frames;
generating the recognizable word and pairing that recognized word with its start and end time;
generating a coarse sound representation of the recognized word that contains a series of identified individual phonemes occurring within the duration of the start and stop time of the recognized word;
comparing the recognized word, alongside the coarse sound representation captured during the same segment of time the recognized word occupies, to the known sounds of that recognized word in a database and then assigning a robust confidence level parameter to the recognized word based on the comparison;
pairing the robust confidence level parameter for that recognized word with the recognized word and including the start and stop time codes from the common time line with the supplied audio file, wherein each recognized word from the continuous speech recognition engine has a robust confidence level parameter associated with that recognized word, and each time the same recognized word is uttered in the supplied audio file, each instance of the recognized word can have its own robust confidence level parameter for that instance, which can differ in robust confidence level from another instance of the recognized word uttered in the same supplied audio file; and
performing speech data analytics on each word in the supplied audio file stored in the memory based on the robust confidence level parameter, including categorizing automated speech recognition results on an individual word basis within the supplied audio file of continuous communication based on how likely each word has been correctly recognized. - View Dependent Claims (18, 19, 20)
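Claim 17's final step, categorizing automated speech recognition results on an individual word basis by how likely each word was correctly recognized, might be sketched as a simple banding of the robust confidence level parameter. The 0.9 and 0.6 band boundaries and the category names are hypothetical, not taken from the claim.

```python
def categorize(confidence: float) -> str:
    """Bucket a recognized word instance by how likely it was correctly
    recognized, using illustrative confidence bands."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"

# Each word instance carries its own robust confidence level parameter.
words = [("refund", 0.95), ("account", 0.72), ("umbrella", 0.30)]
print([(w, categorize(c)) for w, c in words])
# [('refund', 'high'), ('account', 'medium'), ('umbrella', 'low')]
```

Downstream analytics can then, for example, restrict a search to "high" words or flag "low" words for human review.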
Specification