Various apparatus and methods for a speech recognition system

US 9,646,603 B2
Filed: 02/27/2009
Issued: 05/09/2017
Est. Priority Date: 02/27/2009
Status: Active Grant

Abstract

A method, apparatus, and system are described for a continuous speech recognition engine that includes a fine speech recognizer model, a coarse sound representation generator, and a coarse match generator. The fine speech recognizer model receives a time coded sequence of sound feature frames, applies a speech recognition process to the sound feature frames and determines at least a best guess at each recognizable word that corresponds to the sound feature frames. The coarse sound representation generator generates a coarse sound representation of the recognized word. The coarse match generator determines a likelihood of the coarse sound representation actually being the recognized word based on comparing the coarse sound representation of the recognized word to a database containing the known sound of that recognized word and assigns the likelihood as a robust confidence level parameter to that recognized word.
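
The coarse match step described above can be illustrated with a minimal sketch. It assumes the coarse sound representation is a list of phoneme symbols and that the "known sound" database maps each word to a reference phoneme list. The abstract does not say how the comparison is scored, so the normalized edit distance used here is an assumption, and `KNOWN_SOUNDS`, `edit_distance`, and `confidence` are hypothetical names.

```python
from typing import Dict, List, Sequence

# Hypothetical "known sound" database: word -> reference phoneme sequence.
# The ARPAbet-style symbols are illustrative only.
KNOWN_SOUNDS: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY", "CH"],
    "beach": ["B", "IY", "CH"],
}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # drop a phoneme from a
                            curr[j - 1] + 1,      # add a phoneme from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def confidence(word: str, coarse_phonemes: Sequence[str]) -> float:
    """Score in [0, 1]: 1.0 when the coarse phonemes equal the known sound.

    Assumption: normalized edit distance stands in for the comparison,
    which the source text leaves unspecified.
    """
    known = KNOWN_SOUNDS[word]
    longest = max(len(known), len(coarse_phonemes))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(coarse_phonemes, known) / longest


exact = confidence("speech", ["S", "P", "IY", "CH"])
one_off = confidence("speech", ["S", "B", "IY", "CH"])
```

Here `exact` is the score when the phoneme decoder output agrees with the stored pronunciation, and `one_off` is the score when one phoneme differs. Any other sequence similarity measure could be substituted without changing the overall flow.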

37 Citations

23 Claims

1. A continuous speech recognition engine, comprising:
- an input subsystem configured to convert input audio data into a time coded sequence of sound feature frames for speech recognition;
  
  a fine speech recognizer to apply a speech recognition process to the sound feature frames and determine at least a candidate recognized word that corresponds to the sound feature frames;
  
  a coarse sound representation generator to output a series of individual phonemes occurring within a time duration of the recognized word as a coarse sound representation of the recognized word; and
  
  at least one processor to:
  
  compare the coarse sound representation of the recognized word to a known sound of the recognized word in a database, and
  
  assign a confidence level parameter to the recognized word from the fine speech recognizer according to the comparing.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The continuous speech recognition engine of claim 1, wherein the at least one processor is to generate a transcript, where each transcribed word in the transcript has a corresponding confidence level parameter as a measure of how confident a system is that the transcribed word was correctly identified.
  - 3. The continuous speech recognition engine of claim 1, wherein the coarse sound representation generator is to receive as input sound data vectors from a sound decoder, and the coarse sound representation of the recognized word includes a subset of the sound data vectors that correspond to the time duration of the recognized word.
  - 4. The continuous speech recognition engine of claim 1, further comprising a phoneme decoder to compare a sound pattern of each phoneme to a set of phoneme models to recognize the sound feature frames as a sequence of phonemes and identify each phoneme, and the phoneme decoder is to supply each identified phoneme to the coarse sound representation generator.
  - 5. The continuous speech recognition engine of claim 1, wherein the fine speech recognizer is to recognize the sound feature frames as the recognized word in a particular human language and sub dialect of the particular human language, and associate language parameters with the recognized word.
  - 6. The continuous speech recognition engine of claim 1, wherein the input subsystem is to filter out background noise from the audio data and parse sounds within the audio data to discrete phonemes.
  - 7. The continuous speech recognition engine of claim 1, wherein the coarse match generator is to cooperate with plural human language models to determine the confidence level parameter for the recognized word.
  - 8. The continuous speech recognition engine of claim 1, further comprising a user interface to receive query words from a client machine to find out if the audio data contains any of the query words.
  - 9. The continuous speech recognition engine of claim 1, wherein the fine speech recognizer is to determine a plurality of candidate recognizable words that correspond to the sound feature frames, where the sound feature frames are for a particular spoken word in the audio data, and the at least one processor is to:
    - compare the coarse sound representation to each of plural known sounds in the database, the plural known sounds corresponding to different ones of the recognizable words, and
      
      assign a respective confidence level parameter to each of the recognizable words according to the comparing of the coarse sound representation to a respective one of the plural known sounds in the database.
  - 10. The continuous speech recognition engine of claim 9, wherein each of the assigned confidence level parameters includes a likelihood that the coarse sound representation is the respective one of the recognizable words.
  - 11. The continuous speech recognition engine of claim 1, wherein the fine speech recognizer is to provide start and stop times defining the time duration of the recognized word, and the coarse sound representation generator is to receive the start and stop times.
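
The multi-candidate comparison of claims 9 and 10 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the phoneme symbols and the `KNOWN_SOUNDS` table are made up, `score_candidates` is a hypothetical name, and the similarity measure is the ratio from Python's standard `difflib.SequenceMatcher`, chosen only because the claims do not name one.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence

# Hypothetical database of known sounds, one phoneme list per word.
KNOWN_SOUNDS: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY", "CH"],
    "peach": ["P", "IY", "CH"],
    "beach": ["B", "IY", "CH"],
}


def score_candidates(candidates: Sequence[str],
                     coarse_phonemes: Sequence[str]) -> Dict[str, float]:
    """Assign each candidate word its own confidence level parameter.

    The single coarse sound representation is compared against the known
    sound of every candidate. Words missing from the database get 0.0
    (an assumption; the claims do not cover this case).
    """
    scores: Dict[str, float] = {}
    for word in candidates:
        known = KNOWN_SOUNDS.get(word)
        if known is None:
            scores[word] = 0.0
            continue
        matcher = SequenceMatcher(None, list(coarse_phonemes), known,
                                  autojunk=False)
        scores[word] = matcher.ratio()
    return scores


# Candidates proposed by the fine recognizer for one spoken word.
scores = score_candidates(["speech", "peach", "beach"],
                          ["S", "P", "IY", "CH"])
best = max(scores, key=scores.get)
```

Each candidate keeps its own score rather than only the winner being reported, which matches the claim language of "a respective confidence level parameter" per recognizable word.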

12. A method for speech recognition, comprising:
- converting, by a system having a processor, audio data into a time coded sequence of sound feature frames for speech recognition;
  
  receiving, by the system, the time coded sequence of sound feature frames and applying a speech recognition process of a first speech recognizer to the sound feature frames to determine at least one candidate recognized word that corresponds to the sequence of sound feature frames;
  
  generating, by a coarse sound representation generator in the system, a coarse sound representation that contains a series of individual phonemes occurring within a time duration of the recognized word; and
  
  comparing, by the system, the coarse sound representation to a known sound of the recognized word in a database and then assigning a confidence level parameter to the recognized word produced by the first speech recognizer based on the comparison.
- Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
  - 13. The method of claim 12, further comprising:
    - assigning higher weights to recognized words with respective confidence level parameters above a threshold than recognized words with respective confidence level parameters below the threshold; and
      
      in response to a query, filtering out recognized words with confidence level parameters below the threshold from a response to the query.
  - 14. The method of claim 12, further comprising:
    - monitoring call center audio conversations and identifying when trigger words of interest are spoken, and then triggering an event notification across a network to a client machine so a user on the client machine can activate the event notification to allow the user to listen to a segment of the audio data pertinent to when the trigger words are spoken in the audio data.
  - 15. The method of claim 12, further comprising:
    - generating a transcript, where each transcribed word in the transcript has a respective confidence level parameter as a measure of how confident the system is that the word was correctly identified.
  - 16. The method of claim 12, further comprising:
    - outputting the confidence level parameter with the recognized word, along with start and stop time codes corresponding to the time duration.
  - 17. The method of claim 12, further comprising:
    - performing speech data analytics on a word in the audio data based on the confidence level parameter, including categorizing automated speech recognition results on an individual word basis within the audio data based on how likely each word has been correctly recognized.
  - 18. The method of claim 12, wherein determining the at least one candidate recognized word comprises determining a plurality of candidate recognizable words that correspond to the sound feature frames, where the sound feature frames are for a particular spoken word in the audio data, the method further comprising:
    - comparing the coarse sound representation to each of plural known sounds in the database, the plural known sounds corresponding to different ones of the recognizable words, and
      
      assigning a respective confidence level parameter to each of the recognizable words according to the comparing of the coarse sound representation to a respective one of the plural known sounds in the database.
  - 19. The method of claim 18, wherein each of the assigned confidence level parameters includes a likelihood that the coarse sound representation is the respective one of the recognizable words.
  - 20. The method of claim 12, wherein the speech recognition process is performed by a fine speech recognizer, and the coarse sound representation is performed by a coarse sound representation generator.
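
Claims 13 and 16 together suggest a transcript record that carries a word, its start and stop time codes, and its confidence level parameter, plus a threshold used both for weighting and for filtering query responses. A minimal sketch follows, assuming confidence values in [0, 1]; the `RecognizedWord` type, the weight values, and the choice to keep words that sit exactly on the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecognizedWord:
    text: str
    start: float       # start time code in seconds (assumed unit)
    stop: float        # stop time code in seconds (assumed unit)
    confidence: float  # confidence level parameter, assumed in [0, 1]


def weight(word: RecognizedWord, threshold: float,
           high: float = 1.0, low: float = 0.25) -> float:
    """Give words at or above the threshold a higher weight than the rest.

    The values 1.0 and 0.25 are placeholders; the claim only requires
    that the first be larger than the second.
    """
    return high if word.confidence >= threshold else low


def answer_query(transcript: List[RecognizedWord], query: str,
                 threshold: float) -> List[RecognizedWord]:
    """Return matches for the query, dropping low-confidence words."""
    wanted = query.lower()
    return [w for w in transcript
            if w.text.lower() == wanted and w.confidence >= threshold]


transcript = [
    RecognizedWord("refund", 3.2, 3.7, 0.91),
    RecognizedWord("account", 12.5, 13.1, 0.78),
    RecognizedWord("refund", 41.0, 41.4, 0.35),
]
hits = answer_query(transcript, "refund", threshold=0.6)
```

With this data, only the first "refund" survives the filter; the later low-confidence occurrence is left out of the response, and its time codes are never offered to the user for playback.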

21. A non-transitory computer readable storage medium storing instructions that upon execution cause a system to:
- convert audio data into a time coded sequence of sound feature frames for speech recognition;
  
  receive the time coded sequence of sound feature frames and apply a speech recognition process of a first speech recognizer to the sound feature frames to determine at least one candidate recognized word that corresponds to the sequence of sound feature frames;
  
  generate, using a coarse sound representation generator, a coarse sound representation of the recognized word that contains a series of individual phonemes occurring within a time duration of the recognized word; and
  
  compare the coarse sound representation to a known sound of the recognized word in a database and then assign a confidence level parameter to the recognized word produced by the first speech recognizer based on the comparison.
- Dependent Claims (22, 23)
  - 22. The non-transitory computer readable storage medium of claim 21, wherein determining the at least one candidate recognized word comprises determining a plurality of candidate recognizable words that correspond to the sound feature frames, where the sound feature frames are for a particular spoken word in the audio data, wherein the instructions upon execution cause the system to further:
    - compare the coarse sound representation to each of plural known sounds in the database, the plural known sounds corresponding to different ones of the recognizable words, and
      
      assign a respective confidence level parameter to each of the recognizable words according to the comparing of the coarse sound representation to a respective one of the plural known sounds in the database.
  - 23. The non-transitory computer readable storage medium of claim 22, wherein each of the assigned confidence level parameters includes a likelihood that the coarse sound representation is the respective one of the recognizable words.

Current Assignee: Longsand Limited (Open Text Corporation)
Original Assignee: Longsand Limited (Open Text Corporation)
Inventors: Kadirkamanathan, Mahapathy
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Villena, Mark
Application Number: US12/395,484
Publication Number: US 20100223056A1
Time in Patent Office: 2,993 Days
Field of Search: 704/1, 704/231, 704/235, 704/253, 704/278

CPC Class Codes

G10L 13/08   Text analysis or generation...

G10L 15/02   Feature extraction for spee...

G10L 2015/025   Phonemes, fenemes or fenone...
