Method and system for the automatic amendment of speech recognition vocabularies
Abstract
The present invention provides a method and system to improve speech recognition using an existing audio realization of a spoken text and a true textual representation of the spoken text. The audio realization and the true textual representation can be aligned to reveal time stamps. A speech recognition can be performed on the audio realization to provide a hypothesis textual representation for the audio realization. The aligned true textual representation can be compared with the hypothesis textual representation. Single word pairs from the true and the hypothesis textual representations can be selected where the representations are different. Similarly, single word pairs can be selected from each representation where the representations are identical. A word or pronunciation database can be updated using the selected single word pairs together with the corresponding aligned audio realization.
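The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `AlignedLine` record, the `route_updates` helper, the string-similarity test, and the 0.8 threshold are all assumptions introduced here, since this section does not define how "similar" is measured. Following the claims, similar one-word pairs go to the pronunciation database and dissimilar ones to the word database; the time span stands in for the corresponding portion of audio.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class AlignedLine:
    """One line of aligned output (hypothetical record layout)."""
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    true_text: str    # segment of the allegedly true text
    hyp_text: str     # segment of the recognizer's hypothesis


def is_one_word_pair(line):
    """A line is marked when both segments consist of exactly one word."""
    return len(line.true_text.split()) == 1 and len(line.hyp_text.split()) == 1


def route_updates(lines, threshold=0.8):
    """Split marked lines into pronunciation and word database updates.

    The similarity measure and threshold are assumptions made for
    illustration; the source does not specify them here.
    """
    pronunciation_updates, word_updates = [], []
    for line in lines:
        if not is_one_word_pair(line):
            continue  # only one-word/one-word lines are used
        ratio = SequenceMatcher(
            None, line.true_text.lower(), line.hyp_text.lower()
        ).ratio()
        # The time span identifies the matching portion of the audio.
        entry = (line.true_text, line.hyp_text, (line.start, line.end))
        if ratio >= threshold:
            pronunciation_updates.append(entry)
        else:
            word_updates.append(entry)
    return pronunciation_updates, word_updates
```

Lines whose segments contain more than one word on either side are skipped, which mirrors the restriction to single word pairs.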
18 Claims
1. A method of automatically updating a word database and a pronunciation database used by a speech recognition engine to convert speech utterances to text, the method comprising:
taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
generating a second representation by performing speech recognition on said realization using the word database, said second representation being a time-based transcription of said realization;
expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
processing the first representation to remove all markup language tags;
generating a line-by-line output by aligning said first representation and said second representation based on timed intervals derived from the time-based transcription of said realization, each line matching a segment of said first representation and a corresponding segment of said second representation for a particular one of the timed intervals;
detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating said pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating said word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
Dependent claims: 2, 3, 4
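The two text-preparation steps of claim 1 (expanding acronyms and abbreviations, and removing markup tags from the true text) might look roughly like the following. The `EXPANSIONS` table, the function names, and the regular expression are hypothetical; tags are stripped before expansion here only because this sketch tokenizes on whitespace, not because the claim fixes that order.

```python
import re

# Hypothetical expansion table; a real system would need a much larger one.
EXPANSIONS = {"Dr.": "doctor", "IBM": "I B M", "etc.": "et cetera"}

# Matches anything that looks like a markup tag, e.g. <p> or </b>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Remove markup language tags, keeping the text between them."""
    return TAG_PATTERN.sub(" ", text)


def expand(text):
    """Replace each known acronym or abbreviation with a speech equivalent."""
    return " ".join(EXPANSIONS.get(token, token) for token in text.split())


def normalize_true_text(text):
    """Prepare the allegedly true text for alignment."""
    return expand(strip_markup(text))


print(normalize_true_text("<p>Dr. Smith joined <b>IBM</b> in 1999.</p>"))
# prints: doctor Smith joined I B M in 1999.
```

Only `expand` would be applied to the recognizer output, since the claim removes markup tags from the first representation only.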
5. A method of automatically updating a word database and a pronunciation database used by a speech recognition engine to convert speech utterances to text, the method comprising:
taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
producing a second representation that is a textual representation of said realization by performing a speech recognition on said realization using the word database;
expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
generating a line-by-line output by aligning said first representation and said second representation, each line of said output comprising a segment of said first representation, a segment of said second representation, and a time indicator indicating a start time and end time of said segments;
detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating said pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating said word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
Dependent claims: 6, 7, 8
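As an illustration of the line-by-line output in claim 5, the sketch below pairs segments of the true text with segments of the recognizer output and attaches a start and end time to each line. It swaps in Python's generic `difflib.SequenceMatcher` for the alignment, which is an assumption made for brevity and not the aligner the patent describes; the `align` function and the tuple layout are likewise hypothetical.

```python
from difflib import SequenceMatcher


def align(true_words, hyp_words):
    """Align true words with timed hypothesis words.

    hyp_words is a list of (word, start, end) tuples. Returns a list of
    (start, end, true_segment, hyp_segment) lines.
    """
    hyp_tokens = [word for word, _, _ in hyp_words]
    matcher = SequenceMatcher(
        None,
        [w.lower() for w in true_words],
        [w.lower() for w in hyp_tokens],
        autojunk=False,
    )
    lines = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching runs yield one line per word.
            for k in range(i2 - i1):
                _, start, end = hyp_words[j1 + k]
                lines.append((start, end, true_words[i1 + k], hyp_tokens[j1 + k]))
        elif j2 > j1:
            # A differing region becomes one line spanning its hypothesis words.
            start, end = hyp_words[j1][1], hyp_words[j2 - 1][2]
            lines.append(
                (start, end, " ".join(true_words[i1:i2]), " ".join(hyp_tokens[j1:j2]))
            )
        else:
            # True words with no hypothesis counterpart carry no timing here.
            lines.append((None, None, " ".join(true_words[i1:i2]), ""))
    return lines


true_words = ["the", "meeting", "is", "in", "Zurich", "today"]
hyp_words = [
    ("the", 0.0, 0.2), ("meeting", 0.2, 0.7), ("is", 0.7, 0.9),
    ("in", 0.9, 1.0), ("sewer", 1.0, 1.4), ("itch", 1.4, 1.7),
    ("today", 1.7, 2.2),
]
for line in align(true_words, hyp_words):
    print(line)
```

In this example the line for "Zurich" pairs one true word with two hypothesis words ("sewer itch"), so it would not be marked as a one-word pair and would not trigger a database update.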
9. A system for automatically updating a word database and a pronunciation database, the system comprising:
an audio device for taking a realization of spoken audio;
a text reader for taking a first representation that is an allegedly true textual representation of said realization;
a speech recognizer that performs a speech recognition on said realization to generate a second representation from said realization, said second representation being a time-based transcription of said realization;
a word database used by the speech recognizer to perform speech recognition tasks;
an expander that expands said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
an aligner configured to generate a line-by-line output by aligning said first representation and said second representation based on timed intervals derived from the time-based transcription of said second representation, each line matching a segment of said first representation and a corresponding segment of said second representation for a particular one of the timed intervals;
a classifier configured to detect and mark each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation; and
a selector that for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updates said pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio, and for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updates said word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
Dependent claims: 10
11. A machine-readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
generating a second representation by performing speech recognition on said realization using the word database, said second representation being a time-based transcription of said realization;
expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
processing the first representation to remove all markup language tags;
generating a line-by-line output by aligning said first representation and said second representation based on timed intervals derived from the time-based transcription of said second representation, each line matching a segment of said first representation and a corresponding segment of said second representation for a particular one of the timed intervals;
detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating a pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating a word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
Dependent claims: 12, 13, 14
15. A machine-readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
producing a second representation that is a textual representation of said realization by performing a speech recognition on said realization using the word database;
expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
generating a line-by-line output by aligning said first representation and said second representation, each line of said output comprising a segment of said first representation, a segment of said second representation, and a time indicator indicating a start time and end time of said segments;
detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating a pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating a word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
Dependent claims: 16, 17, 18
Specification