Spelling speech recognition apparatus and method for communications
Abstract
An accurate speech recognition system capable of rapidly processing greater varieties of words and operable in many different devices, but without the computational power and memory requirements, high power consumption, complex operating system, high costs, and weight of traditional systems. The utilization of individual letter utterances to transmit words allows voice information transfer for both person-to-person and person-to-machine communication for mobile phones, PDAs, and other communication devices. This invention is an apparatus and method for a speech recognition system comprising: a microphone; a front-end signal processor for generating parametric representations of speech input signals; a pronunciation database; a letter similarity comparator for comparing the parametric representation of the input signals with the parametric representations of letter pronunciations and generating a sequence of associations between the input speech and the letters in the pronunciation database; a vocabulary database; a word similarity comparator for comparing an aggregated plurality of the letters with the words in the vocabulary database and generating a sequence of associations between them; and a display for displaying the selected letters and words for confirmation.
284 Citations
22 Claims
-
1. A speech recognition system comprising:
-
microphone means for receiving acoustic waves and converting the acoustic waves into electronic signals;
front-end signal processing means, coupled to said microphone means, for processing the electronic signals to generate parametric representations of the electronic signals, including preemphasizer means for spectrally flattening the electronic signals generated by said microphone means;
frame-blocking means, coupled to said preemphasizer means, for blocking the electronic signals into frames of N samples with adjacent frames separated by M samples;
windowing means, coupled to said frame-blocking means, for windowing each frame;
autocorrelation means, coupled to said windowing means, for autocorrelating the frames;
cepstral coefficient generating means, coupled to said autocorrelation means, for converting each frame into cepstral coefficients; and
tapered windowing means, coupled to said cepstral coefficient generating means, for weighting the cepstral coefficients, thereby generating parametric representations of the sound waves;
pronunciation database storage means for storing a plurality of parametric representations of letter pronunciations;
letter similarity comparator means, coupled to said front-end signal processing means and to said pronunciation database storage means, for comparing the parametric representation of the electronic signals with said plurality of parametric representations of letter pronunciations, and generating a first sequence of associations between the parametric representation of the electronic signals and said plurality of parametric representations of letter pronunciations responsive to predetermined criteria;
vocabulary database storage means for storing a plurality of parametric representations of word pronunciations;
word similarity comparator means, coupled to said letter similarity comparator and to said vocabulary database storage means, for comparing an aggregated plurality of parametric representations of letter pronunciations with said plurality of parametric representations of word pronunciations, and generating a second sequence of associations between at least one of said aggregated plurality of parametric representations of the letter pronunciations with at least one of said plurality of parametric representations of word pronunciations responsive to predetermined criteria; and
display means, coupled to said word similarity comparator means, for displaying said first and second sequences of associations.
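The front-end chain recited in claim 1 (preemphasis, frame blocking, windowing, autocorrelation, cepstral coefficients, tapered weighting) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the claim: the preemphasis factor 0.95, the Hamming window, the route from autocorrelation to cepstral coefficients through linear prediction (Levinson-Durbin recursion), the raised-sine taper, and the default values of N, M, the predictor order, and the number of cepstral coefficients. All function names are hypothetical.

```python
import math

def preemphasize(signal, alpha=0.95):
    """Spectrally flatten: y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def block_frames(signal, n_samples, shift):
    """Frames of N samples, adjacent frames starting M samples apart."""
    return [signal[i:i + n_samples]
            for i in range(0, len(signal) - n_samples + 1, shift)]

def hamming_window(frame):
    """Taper the frame edges (the window type is an assumption)."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def autocorrelate(frame, order):
    """Autocorrelation values r[0..order] of one windowed frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]

def lpc_from_autocorrelation(r, order):
    """Levinson-Durbin recursion; returns predictor coefficients a[1..order]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def lpc_to_cepstrum(a, n_cep):
    """Recursion c[m] = a[m] + sum_{k<m} (k/m) * c[k] * a[m-k]."""
    p = len(a)
    c = [0.0] * (n_cep + 1)
    for m in range(1, n_cep + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

def lifter(c):
    """Weight cepstral coefficients with a raised-sine taper (assumed shape)."""
    q = len(c)
    return [cm * (1.0 + (q / 2.0) * math.sin(math.pi * m / q))
            for m, cm in enumerate(c, start=1)]

def front_end(signal, n_samples=240, shift=80, order=10, n_cep=12):
    """Run the whole chain; returns one weighted cepstral vector per frame."""
    features = []
    for frame in block_frames(preemphasize(signal), n_samples, shift):
        r = autocorrelate(hamming_window(frame), order)
        a = lpc_from_autocorrelation(r, order)
        features.append(lifter(lpc_to_cepstrum(a, n_cep)))
    return features
```

Calling front_end on a list of samples yields a sequence of weighted cepstral vectors, which plays the role of the "parametric representation" that the comparators in the claim consume.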
letter calibration means, coupled to said pronunciation database storage means, for calibrating the parametric representations of the electronic signals with said plurality of parametric representations of letter pronunciation stored in said pronunciation database storage means;
dynamic time warper means for performing dynamic time warping on the parametric representations of the electronic signals and said plurality of parametric representations of letter pronunciations stored in said pronunciation database storage means;
distortion calculation means, coupled to said letter calibration means and to said dynamic time warper means, for calculating a distortion between the parametric representations of the electronic signals and said plurality of parametric representations of letter pronunciations stored in said pronunciation database storage means;
scoring means, coupled to said distortion calculation means, for assigning a score to said distortion responsive to predetermined criteria; and
selection means, coupled to said scoring means, for selecting at least one of said plurality of parametric representations of letter pronunciations stored in said pronunciation database storage means having the lowest distortion.
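The dynamic time warping, distortion calculation, scoring, and selection elements recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Euclidean distance between cepstral vectors, the three-way step pattern, the length normalization, and the use of the distortion itself as the score are all choices made here, and the letter calibration element is not modeled. The names dtw_distortion and select_letters are hypothetical.

```python
import math

def cepstral_distance(x, y):
    """Euclidean distance between two cepstral vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw_distortion(seq, template):
    """Minimum accumulated distance over all warping paths, length-normalized."""
    n, m = len(seq), len(template)
    inf = float("inf")
    # acc[i][j]: best accumulated distance aligning seq[:i] with template[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cepstral_distance(seq[i - 1], template[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)

def select_letters(seq, pronunciation_db, top_k=1):
    """Score every stored letter and return the top_k lowest-distortion ones."""
    scored = sorted((dtw_distortion(seq, template), letter)
                    for letter, template in pronunciation_db.items())
    return [(letter, distortion) for distortion, letter in scored[:top_k]]

# Toy pronunciation database with one-dimensional "cepstral" frames.
db = {"A": [[0.0], [1.0], [2.0]], "B": [[5.0], [5.0], [5.0]]}
spoken = [[0.0], [0.0], [1.0], [2.0]]  # "A" spoken more slowly
best = select_letters(spoken, db)
```

In this toy case the slower utterance warps onto the "A" template with zero distortion, so "A" is selected even though the two sequences differ in length.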
-
-
5. The speech recognition system of claim 4 wherein said dynamic time warper means comprises minimization means for determining the minimum cepstral distances between the parametric representation of the electronic signals and said plurality of parametric representations of the letter pronunciations stored in said pronunciation database storage means.
-
6. The speech recognition system of claim 1 wherein said plurality of parametric representations of letter pronunciations stored in said pronunciation database storage means include the pronunciation of individual characters of the Chinese language and said plurality of parametric representations of word pronunciations stored in said vocabulary database storage means include the pronunciation of aggregated word strings of the Chinese language.
-
7. The speech recognition system of claim 1 wherein said plurality of parametric representations of letter pronunciations stored in said pronunciation database storage means include the pronunciation of individual characters of the Korean language and said plurality of parametric representations of word pronunciations stored in said vocabulary database storage means include the pronunciation of aggregated word strings of the Korean language.
-
8. The speech recognition system of claim 1 wherein said plurality of parametric representations of letter pronunciations stored in said pronunciation database storage means include the pronunciation of individual characters of the Japanese language and said plurality of parametric representations of word pronunciations stored in said vocabulary database storage means include the pronunciation of aggregated word strings of the Japanese language.
-
9. The speech recognition system of claim 1 wherein said plurality of parametric representations of letter pronunciations stored in said pronunciation database storage means include the pronunciation of individual characters of the French language and said plurality of parametric representations of word pronunciations stored in said vocabulary database storage means include the pronunciation of aggregated word strings of the French language.
-
10. A letter similarity comparator comprising:
-
means for receiving electronic signals parametric representations;
pronunciation database storage means for storing a plurality of letter pronunciation parametric representations;
letter calibration means, coupled to said receiving means and to said pronunciation database storage means, for calibrating the electronic signals parametric representations with said plurality of letter pronunciation parametric representations stored in said pronunciation database storage means;
dynamic time warper means for performing dynamic time warping on the electronic signals parametric representations and said plurality of letter pronunciation parametric representations stored in said pronunciation database storage means;
distortion calculation means, coupled to said letter calibration means and to said dynamic time warper means, for calculating a distortion between the electronic signals parametric representations and said plurality of letter pronunciation parametric representations stored in said pronunciation database storage means;
scoring means, coupled to said distortion calculation means, for assigning a score to said distortion responsive to predetermined criteria; and
selection means, coupled to said scoring means, for selecting at least one of said plurality of letter pronunciation parametric representations having the lowest distortion.
-
-
11. An electronic communication device comprising:
-
a microphone for receiving sound signals and generating electronic signals therefrom;
a coder-decoder, coupled to said microphone, for coding and decoding the electronic signals;
a signal processor, coupled to said coder-decoder, for processing the electronic signals thereby generating parametric representations of the electronic signals;
a database storage unit, coupled to said signal processor, for storing data and having a first sector therein for storing a plurality of letter pronunciation parametric representations and a second sector therein for storing a plurality of word pronunciation parametric representations;
a first comparator, coupled to said signal processor and to said database storage unit, for comparing parametric representations of the electronic signals with said plurality of letter pronunciation parametric representations in said first sector of said database storage unit;
a first selector, coupled to said first comparator, for selecting at least one of said plurality of letter pronunciation parametric representations responsive to predetermined criteria;
a second comparator, coupled to said signal processor and to said database storage unit, for comparing aggregated parametric representations of letter pronunciations with said plurality of word pronunciation parametric representations in said second sector of said database storage unit;
a second selector, coupled to said second comparator, for selecting at least one of said plurality of word pronunciation parametric representations responsive to predetermined criteria; and
a display, coupled to said first and second selectors, for displaying said at least one of said plurality of selected letter pronunciation parametric representations and for displaying said at least one of said plurality of word pronunciation parametric representations.
-
-
20. A method for recognizing speech sound signals, comprising the steps of:
-
forming a stored database of letter and word sounds including the steps of,
(a) parameterizing a plurality of letter sounds;
(b) storing said plurality of parameterized letter sounds;
(c) parameterizing a plurality of word sounds;
(d) storing said plurality of parameterized word sounds;
performing speech recognition of input speech including the steps of,
(e) receiving sound waves;
(f) converting the sound waves into electronic signals;
(g) parameterizing the electronic signals;
(h) comparing said parameterized electronic signals with said stored plurality of parameterized letter sounds responsive to calibrating said plurality of parameterized electronic signals with said plurality of parameterized letter sounds responsive to a predetermined calibration method;
(i) selecting at least one of said stored plurality of parameterized letter sounds responsive to predetermined parameter similarity criteria;
(j) displaying said selected at least one of said stored plurality of parameterized letter sounds;
(k) aggregating said selected at least one of said stored plurality of parameterized letter sounds to form a parameterized word;
(l) comparing said parameterized word with said stored plurality of parameterized word sounds;
(m) selecting at least one of said stored plurality of parameterized word sounds responsive to predetermined parameter similarity criteria; and
(n) displaying said selected at least one of said stored plurality of parameterized word sounds.
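Steps (k) through (m) of claim 20 (aggregating the selected letters into a parameterized word, comparing it with the stored word sounds, and selecting the closest ones) might look like the sketch below. It assumes, purely for illustration, that a parameterized word is the concatenation of its letters' frame sequences and that the comparison is a length-normalized dynamic time warping distance; the claim itself only requires "predetermined parameter similarity criteria". The names aggregate and select_words are hypothetical.

```python
import math

def frame_distance(x, y):
    """Euclidean distance between two parameter vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def warped_distance(seq_a, seq_b):
    """Length-normalized dynamic time warping distance between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)

def aggregate(letter_frames):
    """Step (k): concatenate the selected letters' frames into one word."""
    word = []
    for frames in letter_frames:
        word.extend(frames)
    return word

def select_words(parameterized_word, vocabulary_db, top_k=1):
    """Steps (l)-(m): rank stored words by distance, return the closest."""
    scored = sorted((warped_distance(parameterized_word, frames), word)
                    for word, frames in vocabulary_db.items())
    return [word for _, word in scored[:top_k]]

# Toy data: one-dimensional frames per letter, vocabulary built from them.
letters = {"C": [[1.0]], "A": [[2.0]], "T": [[3.0]], "B": [[9.0]]}
vocabulary = {w: aggregate([letters[ch] for ch in w]) for w in ("CAT", "BAT")}
heard = aggregate([[[1.1]], [[2.0]], [[2.9]]])  # noisy "C", "A", "T"
result = select_words(heard, vocabulary)
```

With top_k greater than one, the same function returns a ranked candidate list, which fits the claim's "at least one" wording and a display that asks the user to confirm.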
-
-
21. A method for recognizing speech sound signals, comprising the steps of:
-
forming a stored database of letter and word sounds including the steps of,
(a) speaking a plurality of letter sounds;
(b) distinguishing whether the speaker is male or female;
(c) parameterizing said plurality of letter sounds;
(d) storing said plurality of parameterized letter sounds;
(e) parameterizing a plurality of word sounds;
(f) storing said plurality of parameterized word sounds;
performing speech recognition of input speech including the steps of,
(g) receiving sound waves;
(h) converting the sound waves into electronic signals;
(i) parameterizing the electronic signals;
(j) comparing said parameterized electronic signals with said stored plurality of parameterized letter sounds;
(k) selecting at least one of said stored plurality of parameterized letter sounds responsive to predetermined parameter similarity criteria;
(l) displaying said selected at least one of said stored plurality of parameterized letter sounds;
(m) aggregating said selected at least one of said stored plurality of parameterized letter sounds to form a parameterized word;
(n) comparing said parameterized word with said stored plurality of parameterized word sounds;
(o) selecting at least one of said stored plurality of parameterized word sounds responsive to predetermined parameter similarity criteria; and
(p) displaying said selected at least one of said stored plurality of parameterized word sounds.
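Claim 21 does not say how step (b) distinguishes a male from a female speaker. One common heuristic, shown here only as an assumption, is to estimate the fundamental frequency from the autocorrelation peak and compare it with a threshold. The 165 Hz threshold, the 60 to 400 Hz search range, and the function names are illustrative choices and do not come from the source.

```python
import math

def estimate_pitch_hz(signal, sample_rate, f_min=60.0, f_max=400.0):
    """Return the frequency whose lag maximizes the autocorrelation, or None."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(signal) - 1)
    best_lag, best_r = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag if best_lag else None

def classify_speaker(signal, sample_rate, threshold_hz=165.0):
    """Label a voiced segment by pitch; the threshold is an assumed value."""
    pitch = estimate_pitch_hz(signal, sample_rate)
    if pitch is None:
        return "unknown"
    return "male" if pitch < threshold_hz else "female"

# Synthetic voiced segments: pure tones standing in for low and high voices.
rate = 8000
low = [math.sin(2 * math.pi * 120 * n / rate) for n in range(400)]
high = [math.sin(2 * math.pi * 220 * n / rate) for n in range(400)]
labels = (classify_speaker(low, rate), classify_speaker(high, rate))
```

A label obtained this way could be used to choose which set of stored letter templates to compare against, though the claim leaves that use unspecified.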
-
-
22. A method for recognizing speech sound signals, comprising the steps of:
-
forming a stored database of letter and word sounds including the steps of,
(a) speaking a plurality of letter sounds;
(b) distinguishing the endpoints of each letter sound responsive to the spoken letter sounds, thereby distinguishing substantially clear spoken letter sounds;
(c) parameterizing said plurality of letter sounds;
(d) storing said plurality of parameterized letter sounds;
(e) parameterizing a plurality of word sounds;
(f) storing said plurality of parameterized word sounds;
performing speech recognition of input speech including the steps of,
(g) receiving sound waves;
(h) converting the sound waves into electronic signals;
(i) parameterizing the electronic signals;
(j) comparing said parameterized electronic signals with said stored plurality of parameterized letter sounds;
(k) selecting at least one of said stored plurality of parameterized letter sounds responsive to predetermined parameter similarity criteria;
(l) displaying said selected at least one of said stored plurality of parameterized letter sounds;
(m) aggregating said selected at least one of said stored plurality of parameterized letter sounds to form a parameterized word;
(n) comparing said parameterized word with said stored plurality of parameterized word sounds;
(o) selecting at least one of said stored plurality of parameterized word sounds responsive to predetermined parameter similarity criteria; and
(p) displaying said selected at least one of said stored plurality of parameterized word sounds.
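Step (b) of claim 22 requires the endpoints of each spoken letter to be found, but the claim does not name a method. A minimal sketch, assuming a short-time energy threshold (a widely used approach, chosen here for illustration), is given below. The frame length, the threshold ratio, and the function names are hypothetical.

```python
def short_time_energy(signal, frame_len):
    """Energy of consecutive non-overlapping frames."""
    return [sum(x * x for x in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def find_endpoints(signal, frame_len=80, ratio=0.1):
    """Return (start, end) sample indices of runs above ratio * peak energy."""
    energy = short_time_energy(signal, frame_len)
    if not energy:
        return []
    threshold = ratio * max(energy)
    segments, start = [], None
    for idx, e in enumerate(energy):
        if e > threshold and start is None:
            start = idx                      # a letter sound begins
        elif e <= threshold and start is not None:
            segments.append((start * frame_len, idx * frame_len))
            start = None                     # the letter sound ended
    if start is not None:                    # sound runs to the end of input
        segments.append((start * frame_len, len(energy) * frame_len))
    return segments

# Two bursts separated by silence, standing in for two spoken letters.
samples = [0.0] * 160 + [1.0] * 240 + [0.0] * 160 + [0.5] * 160 + [0.0] * 80
segments = find_endpoints(samples)
```

Each returned segment could then be parameterized on its own, so that every stored letter template covers one clearly delimited utterance.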
-
Specification