VARIOUS APPARATUS AND METHODS FOR A SPEECH RECOGNITION SYSTEM
First Claim
1. A continuous speech recognition engine, comprising
front-end filters and sound data parsers configured to convert a supplied audio file of a continuous voice communication into a time coded sequence of sound feature frames for speech recognition;
a fine speech recognizer model having an input to receive the time coded sequence of sound feature frames from the front-end filters as an input, where the fine speech recognizer model applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames;
a coarse sound representation generator having an input to receive both
1) start and stop times for a time segment associated with the recognized word from the fine model speech recognizer and
2) a series of identified individual phonemes from a phoneme decoder as inputs, where the coarse sound representation generator outputs the series of identified individual phonemes occurring within the duration of the start and stop times of the recognized word as a coarse sound representation of the recognized word;
a coarse match generator having an input to receive the coarse sound representation of the recognized word from the coarse sound representation generator as an input as well as the recognized word from the fine model speech recognizer, wherein the coarse match generator then determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word, where the coarse match generator assigns the likelihood as a robust confidence level parameter to that recognized word from the fine speech recognizer model and includes the start and stop time codes of the recognized word from the common time line with the supplied audio file, wherein each word in the supplied audio file is stored in a memory with a robust confidence level parameter and the start and stop time codes from the common time line; and
a user interface configured to allow speech data analytics on each word in the supplied audio file of continuous voice communications stored in the memory based on the robust confidence level parameter.
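The coarse sound representation limitation above can be illustrated with a short sketch. This is not the patent's implementation: the `Phoneme` record, with per-phoneme timestamps on the common time line, is an assumed output format for a hypothetical phoneme decoder, and the containment test is one plausible way to select the phonemes falling within a recognized word's start and stop times.

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str   # e.g. an ARPAbet symbol from a hypothetical phoneme decoder
    start: float  # seconds on the common time line of the supplied audio file
    end: float

def coarse_sound_representation(word_start: float, word_end: float,
                                phonemes: list[Phoneme]) -> list[str]:
    """Collect the decoded phonemes whose time spans fall within the
    recognized word's start/stop times, forming its coarse sound
    representation."""
    return [p.symbol for p in phonemes
            if p.start >= word_start and p.end <= word_end]

# Illustrative phoneme stream for an utterance beginning "hello w..."
phonemes = [Phoneme("HH", 0.10, 0.18), Phoneme("EH", 0.18, 0.30),
            Phoneme("L", 0.30, 0.42), Phoneme("OW", 0.42, 0.55),
            Phoneme("W", 0.60, 0.70)]

# Recognized word "hello" spans 0.10–0.55 s on the common time line.
print(coarse_sound_representation(0.10, 0.55, phonemes))
# ['HH', 'EH', 'L', 'OW']
```

Note that the phoneme at 0.60–0.70 s is excluded because it lies outside the recognized word's time segment.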
2 Assignments
0 Petitions
Abstract
A method, apparatus, and system are described for a continuous speech recognition engine that includes a fine speech recognizer model, a coarse sound representation generator, and a coarse match generator. The fine speech recognizer model receives a time coded sequence of sound feature frames, applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames. The coarse sound representation generator generates a coarse sound representation of the recognized word. The coarse match generator determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word and assigns the likelihood as a robust confidence level parameter to that recognized word.
310 Citations
20 Claims
1. A continuous speech recognition engine, comprising
front-end filters and sound data parsers configured to convert a supplied audio file of a continuous voice communication into a time coded sequence of sound feature frames for speech recognition;
a fine speech recognizer model having an input to receive the time coded sequence of sound feature frames from the front-end filters as an input, where the fine speech recognizer model applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames;
a coarse sound representation generator having an input to receive both
1) start and stop times for a time segment associated with the recognized word from the fine model speech recognizer and
2) a series of identified individual phonemes from a phoneme decoder as inputs, where the coarse sound representation generator outputs the series of identified individual phonemes occurring within the duration of the start and stop times of the recognized word as a coarse sound representation of the recognized word;
a coarse match generator having an input to receive the coarse sound representation of the recognized word from the coarse sound representation generator as an input as well as the recognized word from the fine model speech recognizer, wherein the coarse match generator then determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word, where the coarse match generator assigns the likelihood as a robust confidence level parameter to that recognized word from the fine speech recognizer model and includes the start and stop time codes of the recognized word from the common time line with the supplied audio file, wherein each word in the supplied audio file is stored in a memory with a robust confidence level parameter and the start and stop time codes from the common time line; and
a user interface configured to allow speech data analytics on each word in the supplied audio file of continuous voice communications stored in the memory based on the robust confidence level parameter. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
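The claim does not specify how the coarse sound representation is compared against the database containing the known sound of the word. A minimal sketch, under two stated assumptions: the database stores each word's known sound as a phoneme sequence, and similarity is measured by normalized edit distance. Both choices are illustrative, not the patented method.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

def robust_confidence(coarse: list[str], known: list[str]) -> float:
    """Map the distance to a 0..1 robust confidence level parameter,
    where 1.0 means the coarse sound representation exactly matches
    the known sound of the recognized word."""
    denom = max(len(coarse), len(known), 1)
    return 1.0 - edit_distance(coarse, known) / denom

# Hypothetical database entry: known sound of the word "cat" as phonemes.
KNOWN_SOUNDS = {"cat": ["K", "AE", "T"]}

print(robust_confidence(["K", "AE", "T"], KNOWN_SOUNDS["cat"]))  # 1.0
```

A decoder that heard ["K", "AA", "T"] instead would score roughly 0.67 against the same entry, giving that instance of the word a lower robust confidence level parameter.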
11. A system, comprising:
a continuous speech recognition engine that includes front-end filters and sound data parsers configured to convert a supplied audio file of a continuous voice communication, as opposed to a paused voice command communication, into a time coded sequence of sound feature frames for speech recognition;
a fine speech recognizer model having an input to receive the time coded sequence of sound feature frames from the front-end filters as an input, where the fine speech recognizer model applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames;
a coarse sound representation generator having an input to receive both
1) start and stop times for a time segment associated with the recognized word from the fine model speech recognizer and
2) a series of identified individual phonemes from a phoneme decoder as inputs, where the coarse sound representation generator outputs the series of identified individual phonemes occurring within the duration of the start and stop times of the recognized word as a coarse sound representation of the recognized word;
a coarse match generator having an input to receive the coarse sound representation of the recognized word from the coarse sound representation generator as an input as well as the recognized word from the fine model speech recognizer, wherein the coarse match generator then determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word, where the coarse match generator assigns the likelihood as a robust confidence level parameter to that recognized word from the fine speech recognizer model and includes the start and stop time codes of the recognized word from the common time line with the supplied audio file;
wherein each recognized word from the continuous speech recognition engine has a robust confidence level parameter associated with that recognized word, and each time the same recognized word is uttered in the supplied audio file, each instance of the recognized word can have its own robust confidence level parameter for that instance, which can differ in robust confidence level from another instance of the recognized word uttered in the same supplied audio file;
a user interface configured to allow speech data analytics on each word in the supplied audio file stored in the memory based on the robust confidence level parameter, wherein the user interface has an input to receive the supplied audio files from a client machine over a wide area network and supply the supplied audio files to the front-end filters;
a server to host the continuous speech recognition engine;
a database to store each word in the supplied audio file with its assigned robust confidence level parameter and the start and stop time code from the common time line; and
an intelligence engine configured to assign a higher weight to recognized words with a robust confidence level above a threshold than recognized words below the threshold, and use the weight for the recognized words when queries are made with the user interface. - View Dependent Claims (12, 13, 14, 15, 16)
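The intelligence engine's weighting of query results can be sketched as follows. The stored-word record layout, the 0.8 threshold, and the weight values are illustrative assumptions; the claim only requires that words above the threshold receive a higher weight than words below it.

```python
from dataclasses import dataclass

@dataclass
class StoredWord:
    text: str
    confidence: float  # robust confidence level parameter for this instance
    start: float       # start time code on the common time line
    stop: float        # stop time code on the common time line

def query_score(transcript: list[StoredWord], term: str,
                threshold: float = 0.8,
                hi_weight: float = 1.0, lo_weight: float = 0.25) -> float:
    """Sum weights for every instance of the query term, giving a higher
    weight to instances whose robust confidence level is above the
    threshold than to instances below it."""
    return sum(hi_weight if w.confidence >= threshold else lo_weight
               for w in transcript if w.text == term)

# Two instances of the same word with different confidence levels,
# as claim 11 contemplates.
transcript = [StoredWord("refund", 0.95, 1.2, 1.6),
              StoredWord("refund", 0.40, 7.8, 8.1),
              StoredWord("please", 0.90, 1.7, 2.0)]

print(query_score(transcript, "refund"))  # 1.25
```

The confidently recognized "refund" contributes 1.0 and the doubtful instance only 0.25, so audio files where the query term was clearly heard rank above files where it was a marginal guess.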
17. A method for continuous speech recognition that uses robustness as a confidence measure for words output by a speech recognition system as a measure of how confident the system is that each individual word was correctly identified to either or both 1) a database of spoken words and 2) one or more language models, comprising:
converting a supplied audio file of a continuous voice communication, as opposed to a paused voice command communication, into a time coded sequence of sound feature frames for speech recognition;
receiving the time coded sequence of sound feature frames and applying a speech recognition process to the sound feature frames to determine at least a best guess at a recognizable word that corresponds to the sequence of sound feature frames;
generating the recognizable word and pairing that recognized word with its start and end time;
generating a coarse sound representation of the recognized word that contains a series of identified individual phonemes occurring within the duration of the start and stop time of the recognized word;
comparing the recognized word, alongside the coarse sound representation captured during the same segment of time the recognized word occupies, to the known sounds of that recognized word in a database and then assigning a robust confidence level parameter to the recognized word based on the comparison;
pairing the robust confidence level parameter for that recognized word with the recognized word and including the start and stop time codes from the common time line with the supplied audio file, wherein each recognized word from the continuous speech recognition engine has a robust confidence level parameter associated with that recognized word, and each time the same recognized word is uttered in the supplied audio file, each instance of the recognized word can have its own robust confidence level parameter for that instance, which can differ in robust confidence level from another instance of the recognized word uttered in the same supplied audio file; and
performing speech data analytics on each word in the supplied audio file stored in the memory based on the robust confidence level parameter, including categorizing automated speech recognition results on an individual word basis within the supplied audio file of continuous communication based on how likely each word has been correctly recognized. - View Dependent Claims (18, 19, 20)
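Claim 17's final step, categorizing automated speech recognition results on an individual word basis by how likely each word was correctly recognized, might be sketched as a simple banding of the robust confidence level parameter. The 0.9 and 0.6 band boundaries and the category names are hypothetical, not taken from the claim.

```python
def categorize(confidence: float) -> str:
    """Bucket a recognized word instance by how likely it was correctly
    recognized, using illustrative confidence bands."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"

# Each word instance carries its own robust confidence level parameter.
words = [("refund", 0.95), ("account", 0.72), ("umbrella", 0.30)]
print([(w, categorize(c)) for w, c in words])
# [('refund', 'high'), ('account', 'medium'), ('umbrella', 'low')]
```

Downstream analytics can then, for example, restrict a search to "high" words or flag "low" words for human review.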
Specification