SYSTEM AND METHOD FOR DECODING SPEECH
First Claim
1. A computer software product that includes a computer readable media readable by a processor, the computer readable media having stored thereon a set of instructions for performing decoding of speech, the instructions comprising:
- (a) a first set of instructions which, when loaded into main memory and executed by the processor, causes the processor to establish a pronunciation dictionary for a particular language and store the pronunciation dictionary in computer readable memory, the pronunciation dictionary including a plurality of words, each of the words being divided into phonemes of the language, each of the phonemes being represented by a single character;
(b) a second set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train an acoustic model for the language, the acoustic model including hidden Markov models corresponding to the phonemes of the language;
(c) a third set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained acoustic model in the computer readable memory;
(d) a fourth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train a language model for the language, the language model being an N-gram language model containing probabilities of particular word sequences from a transcription corpus;
(e) a fifth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained language model in the computer readable memory;
(f) a sixth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to receive at least one spoken word in the language and generate a digital speech signal corresponding the at least one spoken word;
(g) a seventh set of instructions which, when loaded into main memory and executed by the processor, causes the processor to perform phoneme recognition on the speech signal to generate a set of spoken phonemes of the at least one word, the set of spoken phonemes being recorded in the computer readable memory, wherein each of the spoken phonemes is represented by a single character;
(h) an eighth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to perform sequence alignment between the spoken phonemes of the at least one word and a set of reference phonemes of the pronunciation dictionary corresponding to the at least one word;
(i) a ninth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to compare the spoken phonemes of the at least one word and the set of reference phonemes of the pronunciation dictionary corresponding to the at least one word to identify a set of unique variants; and
(j) a tenth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to update the pronunciation dictionary and the language model by adding the set of unique variants thereto and recording the updated pronunciation dictionary and the language model in the computer readable memory.
1 Assignment
0 Petitions
Accused Products
Abstract
The system and method for speech decoding in speech recognition systems provides decoding for speech variants common to such languages. These variants include within-word and cross-word variants. For decoding of within-word variants, a data-driven approach is used, in which phonetic variants are identified, and a pronunciation dictionary and language model of a dynamic programming speech recognition system are updated based upon these identifications. Cross-word variants are handled with a knowledge-based approach, applying phonological rules, part-of-speech tagging or tagging of small words to a speech transcription corpus and updating the pronunciation dictionary and language model of the dynamic programming speech recognition system based upon identified cross-word variants.
-
Citations
7 Claims
-
1. A computer software product that includes a computer readable media readable by a processor, the computer readable media having stored thereon a set of instructions for performing decoding of speech, the instructions comprising:
-
(a) a first set of instructions which, when loaded into main memory and executed by the processor, causes the processor to establish a pronunciation dictionary for a particular language and store the pronunciation dictionary in computer readable memory, the pronunciation dictionary including a plurality of words, each of the words being divided into phonemes of the language, each of the phonemes being represented by a single character; (b) a second set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train an acoustic model for the language, the acoustic model including hidden Markov models corresponding to the phonemes of the language; (c) a third set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained acoustic model in the computer readable memory; (d) a fourth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train a language model for the language, the language model being an N-gram language model containing probabilities of particular word sequences from a transcription corpus; (e) a fifth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained language model in the computer readable memory; (f) a sixth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to receive at least one spoken word in the language and generate a digital speech signal corresponding the at least one spoken word; (g) a seventh set of instructions which, when loaded into main memory and executed by the processor, causes the processor to perform phoneme recognition on the speech signal to generate a set of spoken phonemes of the at least one word, the set of spoken phonemes being recorded in the computer readable memory, wherein each of the spoken phonemes is represented by a single character; (h) an eighth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to perform sequence alignment between the spoken phonemes of the at least one word and a set of reference phonemes of the pronunciation dictionary corresponding to the at least one word; (i) a ninth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to compare the spoken phonemes of the at least one word and the set of reference phonemes of the pronunciation dictionary corresponding to the at least one word to identify a set of unique variants; and (j) a tenth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to update the pronunciation dictionary and the language model by adding the set of unique variants thereto and recording the updated pronunciation dictionary and the language model in the computer readable memory. - View Dependent Claims (2, 3)
-
-
4. A computer software product that includes a computer readable media readable by a processor, the computer readable media having stored thereon a set of instructions for performing decoding of speech, the instructions comprising:
-
(a) a first set of instructions which, when loaded into main memory and executed by the processor, causes the processor to establish a pronunciation dictionary for a particular language and store the pronunciation dictionary in computer readable memory, said pronunciation dictionary including a plurality of words each divided into phonemes of the language; (b) a second set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train an acoustic model for the language, the acoustic model including hidden Markov models corresponding to the phonemes of the language; (c) a third set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained acoustic model in the computer readable memory; (d) a fourth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train a language model for the language, the language model being an N-gram language model containing probabilities of particular word sequences from a transcription corpus; (e) a fifth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained language model in the computer readable memory; (f) a sixth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to receive at least one sentence including a plurality of words in the language and generate a digital speech signal corresponding the at least one sentence; (g) a seventh set of instructions which, when loaded into main memory and executed by the processor, causes the processor to perform phoneme recognition on the speech signal to generate a set of spoken phonemes of each of the words of the at least one sentence, said set of spoken phonemes being recorded in the computer readable memory; (h) an eighth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to compare the spoken phonemes of the words of the at least one sentence and the set of reference phonemes of the pronunciation dictionary corresponding to the at least one word to form a transcription of the at least one sentence, the transcription being recorded in the computer readable memory; (i) a ninth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to analyze each pair of adjacent words of the at least one sentence to identify a phonological rule selected from the group consisting of merging and changing; (j) a tenth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to, upon identification of the phonological rule in at least one pair of the adjacent words, form at least one compound word to replace the at least one pair of words in the transcription; and (k) an eleventh set of instructions which, when loaded into main memory and executed by the processor, causes the processor to update the pronunciation dictionary and the language model by adding the at least one compound word thereto and recording the updated pronunciation dictionary and the language model in the computer readable memory. - View Dependent Claims (5)
-
-
6. A computer software product that includes a computer readable media readable by a processor, the computer readable media having stored thereon a set of instructions for performing decoding of speech, the instructions comprising:
-
(a) a first set of instructions which, when loaded into main memory and executed by the processor, causes the processor to establish a pronunciation dictionary for a particular language and store the pronunciation dictionary in computer readable memory, said pronunciation dictionary including a plurality of words each divided into phonemes of the language; (b) a second set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train an acoustic model for the language, the acoustic model including hidden Markov models corresponding to the phonemes of the language; (c) a third set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained acoustic model in the computer readable memory; (d) a fourth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to train a language model for the language, the language model being an N-gram language model containing probabilities of particular word sequences from a transcription corpus; (e) a fifth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to store the trained language model in the computer readable memory; (f) a sixth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to receive at least one sentence including a plurality of words in the language and generate a digital speech signal corresponding the at least one sentence; (g) a seventh set of instructions which, when loaded into main memory and executed by the processor, causes the processor to perform phoneme recognition on the speech signal to generate a set of spoken phonemes of each of the words of the at least one sentence, said set of spoken phonemes being recorded in the computer readable memory; (h) an eighth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to compare the spoken phonemes of the words of the at least one sentence and the set of reference phonemes of the pronunciation dictionary corresponding to the at least one word to form a transcription of the at least one sentence, the transcription being recorded in the computer readable memory; (i) a ninth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to apply a part-of-speech tagger to the transcription and analyze each pair of adjacent tagged words of the at least one sentence to identify tagged words selected from the group consisting of adjective-noun words and word-preposition words; (j) a tenth set of instructions which, when loaded into main memory and executed by the processor, causes the processor to, upon identification of the tagged words in at least one pair of the adjacent tagged words, form at least one compound word to replace the at least one pair of tagged words in the transcription; and (k) an eleventh set of instructions which, when loaded into main memory and executed by the processor, causes the processor to update the pronunciation dictionary and the language model by adding the at least one compound word thereto and recording the updated pronunciation dictionary and the language model in the computer readable memory. - View Dependent Claims (7)
-
Specification