Speech recognition training for small hardware devices
Abstract
A distributed speech processing system for constructing speech recognition reference models that are to be used by a speech recognizer in a small hardware device, such as a personal digital assistant or cellular telephone. The speech processing system includes a speech recognizer residing on a first computing device and a speech model server residing on a second computing device. The speech recognizer receives speech training data and processes it into an intermediate representation of the speech training data. The intermediate representation is then communicated to the speech model server. The speech model server generates a speech reference model by using the intermediate representation of the speech training data and then communicates the speech reference model back to the first computing device for storage in a lexicon associated with the speech recognizer.
32 Claims
1. A speech processing system for constructing speech recognition reference models, comprising:
a speech recognizer residing on a first computing device;
said speech recognizer receiving speech training data and processing the speech training data into an intermediate representation of the speech training data, said speech recognizer further being operative to communicate the intermediate representation to a second computing device;
a speech model server residing on said second computing device, said second computing device being interconnected via a network to said first computing device;
said speech model server receiving the intermediate representation of the speech training data and generating a speech reference model using the intermediate representation, said speech model server further being operative to communicate the speech reference model to said first computing device; and
a lexicon coupled to said speech recognizer for storing the speech reference model on said first computing device.
[…]
a phoneticizer receptive of the intermediate representation for producing a plurality of phonetic transcriptions; and
a model trainer coupled to said phoneticizer for building said speech reference model based on said plurality of phonetic transcriptions.
8. The speech processing system of claim 4 wherein said speech model server further comprises:
a Hidden Markov Model (HMM) database for storing phone model speech data corresponding to a plurality of phonemes; and
a model trainer coupled to said HMM database for decoding the vector of parameters into a phonetic transcription of the audio data, whereby said phonetic transcription serves as said speech reference model.
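Claim 8 leaves the decoding procedure open. A minimal sketch, assuming each entry in the HMM database is a single-state phone model with a diagonal-Gaussian emission and that decoding is a Viterbi search over a free phone loop; `viterbi_phone_loop`, the `self_loop` probability and the toy phone inventory are hypothetical. Practical recognizers usually use multi-state, mixture-density phone models.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical HMM database entry: (mean, variance) per parameter for one phone.
PhoneModel = Tuple[List[float], List[float]]


def log_gaussian(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_phone_loop(frames: Sequence[Sequence[float]],
                       db: Dict[str, PhoneModel],
                       self_loop: float = 0.9) -> List[str]:
    """Decode parameter frames into the most likely phone sequence.

    Staying in a phone costs log(self_loop); moving to any other phone
    costs log((1 - self_loop) / (n - 1)).
    """
    phones = list(db)
    n = len(phones)
    stay = math.log(self_loop)
    switch = math.log((1 - self_loop) / (n - 1)) if n > 1 else float("-inf")

    # Uniform initial distribution over phones.
    score = [log_gaussian(frames[0], *db[p]) - math.log(n) for p in phones]
    back: List[List[int]] = []
    for x in frames[1:]:
        new_score, ptr = [], []
        for j, p in enumerate(phones):
            # Best predecessor state for phone j at this frame.
            best_i = max(range(n), key=lambda i: score[i] + (stay if i == j else switch))
            trans = stay if best_i == j else switch
            new_score.append(score[best_i] + trans + log_gaussian(x, *db[p]))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)

    # Trace back the best state path from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()

    # Collapse runs of the same phone into a single symbol.
    transcription: List[str] = []
    for s in path:
        if not transcription or transcription[-1] != phones[s]:
            transcription.append(phones[s])
    return transcription
```

With three toy models centred at (0, 0), (5, 5) and (10, 0), a run of frames that drifts through those regions decodes to the corresponding three-phone transcription.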
9. The speech processing system of claim 1 wherein said speech recognizer captures at least two training repetitions of audio data that serves as the speech training data and converts the audio data into a sequence of vectors of parameters that serves as said intermediate representation of the speech training data, where each vector corresponds to a training repetition and the parameters are indicative of the short term speech spectral shape of said audio data.
10. The speech processing system of claim 9 wherein said speech model server being operative to determine a reference vector from the sequence of vectors, align each vector in the sequence of vectors to the reference vector, determine a mean and a variance of each parameter in the reference vector computed over the values in the aligned vectors, thereby constructing said speech reference model from the sequence of vectors.
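The alignment in claims 9 and 10 can be sketched with dynamic time warping (DTW). This sketch assumes each training repetition is a sequence of parameter frames, uses Euclidean frame distance, and picks as reference the repetition with the smallest total DTW cost to the others; the claims do not say how the reference is chosen, and all function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]
Utterance = Sequence[Frame]


def frame_dist(a: Frame, b: Frame) -> float:
    """Euclidean distance between two parameter frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(ref: Utterance, other: Utterance) -> Tuple[float, List[Tuple[int, int]]]:
    """Dynamic time warping: total cost and the (ref_index, other_index) path."""
    n, m = len(ref), len(other)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(ref[i - 1], other[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor cell.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {(i - 1, j - 1): cost[i - 1][j - 1],
                 (i - 1, j): cost[i - 1][j],
                 (i, j - 1): cost[i][j - 1]}
        i, j = min(steps, key=steps.get)
    path.reverse()
    return cost[n][m], path


def build_reference_model(reps: List[Utterance]) -> Tuple[List[List[float]], List[List[float]]]:
    """Align all repetitions to a reference; return per-frame means and variances."""
    # Assumption: the reference is the medoid repetition (smallest summed
    # DTW cost to every other repetition).
    ref = reps[min(range(len(reps)),
                   key=lambda r: sum(dtw_path(reps[r], reps[o])[0]
                                     for o in range(len(reps)) if o != r))]
    # Pool every frame that DTW maps onto each reference frame.
    pooled: List[List[Frame]] = [[] for _ in ref]
    for rep in reps:
        for ri, oi in dtw_path(ref, rep)[1]:
            pooled[ri].append(rep[oi])
    means, variances = [], []
    for frames in pooled:
        count, dim = len(frames), len(frames[0])
        mu = [sum(f[k] for f in frames) / count for k in range(dim)]
        means.append(mu)
        variances.append([sum((f[k] - mu[k]) ** 2 for f in frames) / count for k in range(dim)])
    return means, variances
```

Because every reference frame lies on the DTW path, each one receives at least one aligned frame, so the mean and variance are always defined.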
11. A distributed speech processing system for supporting applications that reside on a personal digital assistant (PDA) device, comprising:
an input means for capturing speech training data at the PDA;
a speech recognizer coupled to said input means and receptive of speech training data from said input means;
said speech recognizer being operative to process the speech training data into an intermediate representation of the speech training data and communicate the intermediate representation to a second computing device;
a speech model server residing on said second computing device, said second computing device being interconnected via a network to the PDA;
said speech model server receiving the intermediate representation of the speech training data and generating a speech reference model using the intermediate representation, said speech model server further being operative to communicate the speech reference model to said first computing device; and
a lexicon coupled to said speech recognizer for storing the speech reference model on the PDA.
[…]
a stylus;
a display pad for capturing handwritten stroke data from the stylus; and
a handwritten recognition module for converting handwritten stroke data into alphanumeric data, whereby the alphanumeric data serves as speech training data.
13. The distributed speech processing system of claim 12 wherein said speech recognizer segments the alphanumeric data into a sequence of symbols which serves as the intermediate representation of the speech training data.
14. The distributed speech processing system of claim 11 wherein said speech model server further comprises a speech model database for storing speaker-independent speech reference models, said speech model server being operative to retrieve a speech reference model from said speech model database that corresponds to the intermediate representation of said speech training data received from said speech recognizer.
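Claims 13 and 14 can be illustrated together. A minimal sketch, assuming the intermediate representation is the upper-cased sequence of letters and digits and that the speaker-independent model database is keyed by that sequence; `segment_symbols`, `retrieve_model` and the model identifiers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def segment_symbols(text: str) -> List[str]:
    """Split recognized alphanumeric text into a sequence of upper-case symbols.

    Assumption: whitespace and punctuation carry no pronunciation and are dropped.
    """
    return [ch.upper() for ch in text if ch.isalnum()]


def retrieve_model(symbols: List[str], db: Dict[Tuple[str, ...], str]) -> Optional[str]:
    """Return the speaker-independent model stored for this symbol sequence, if any."""
    return db.get(tuple(symbols))
```

A `None` result would signal that no stored speaker-independent model matches, at which point a server could fall back to building one, as in claim 15.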
15. The distributed speech processing system of claim 11 wherein said speech model server further comprises:
a phoneticizer receptive of the intermediate representation for producing a plurality of phonetic transcriptions; and
a model trainer coupled to said phoneticizer for building said speech reference model based on said plurality of phonetic transcriptions.
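One way to picture the phoneticizer and model trainer of claim 15: a hypothetical letter-to-sound table with probabilities yields several ranked transcriptions, and each transcription is expanded into a left-to-right HMM state sequence. The rule table, the states-per-phone layout and the function names are assumptions; a real model trainer would also estimate model parameters, which this sketch omits.

```python
import itertools
import math
from typing import Dict, List, Tuple

# Hypothetical letter-to-sound rules: candidate phones with probabilities.
LTS_RULES: Dict[str, List[Tuple[str, float]]] = {
    "c": [("k", 0.7), ("s", 0.3)],
    "a": [("ae", 0.6), ("ey", 0.4)],
    "t": [("t", 1.0)],
}


def phoneticize(word: str, n_best: int = 3) -> List[Tuple[List[str], float]]:
    """Return up to n_best candidate transcriptions, most probable first."""
    options = [LTS_RULES[ch] for ch in word.lower()]
    candidates = []
    for combo in itertools.product(*options):
        phones = [p for p, _ in combo]
        prob = math.prod(pr for _, pr in combo)
        candidates.append((phones, prob))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:n_best]


def assemble_word_model(transcriptions: List[Tuple[List[str], float]],
                        states_per_phone: int = 3) -> List[List[str]]:
    """Expand each transcription into a state sequence (one per pronunciation)."""
    return [[f"{p}_{s}" for p in phones for s in range(states_per_phone)]
            for phones, _ in transcriptions]
```

Keeping several transcriptions per word lets the resulting reference model cover alternative pronunciations, such as the two readings of "c" above.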
16. The distributed speech processing system of claim 11 wherein said input means is further defined as a microphone for capturing audio data that serves as speech training data.
17. The distributed speech processing system of claim 16 wherein said speech recognizer converts the audio data into a digital input signal and translates the digital input signal into a vector of parameters which serves as the intermediate representation of the speech training data, said parameters being indicative of the short term speech spectral shape of said audio data.
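Cepstral coefficients, one of the parameter types named in claim 18, are a common way to capture short-term spectral shape. This sketch frames a digitized signal, applies a Hamming window and computes the real cepstrum with a naive DFT. The 200-sample frame and 80-sample hop (25 ms and 10 ms at an assumed 8 kHz rate) and the 13-coefficient cutoff are illustrative assumptions.

```python
import cmath
import math
from typing import List, Sequence


def frames_of(signal: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a digitized signal into overlapping frames; a trailing partial frame is dropped."""
    return [list(signal[s:s + frame_len]) for s in range(0, len(signal) - frame_len + 1, hop)]


def real_cepstrum(frame: Sequence[float], n_coeffs: int) -> List[float]:
    """First n_coeffs coefficients of the real cepstrum of one Hamming-windowed frame."""
    n = len(frame)
    win = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]
    # Naive O(n^2) DFT magnitude spectrum; an FFT would be used in practice.
    log_mag = []
    for k in range(n):
        spec = sum(win[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        log_mag.append(math.log(abs(spec) + 1e-10))  # small floor avoids log(0)
    # The log magnitude spectrum of a real signal is real and even, so its
    # inverse DFT reduces to a cosine sum.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
            for q in range(n_coeffs)]


def cepstral_features(signal: Sequence[float], frame_len: int = 200, hop: int = 80,
                      n_coeffs: int = 13) -> List[List[float]]:
    """One cepstral parameter vector per frame of the signal."""
    return [real_cepstrum(f, n_coeffs) for f in frames_of(signal, frame_len, hop)]
```

Low-order cepstral coefficients summarize the smooth spectral envelope, which is why a small, fixed number of them is enough to describe spectral shape compactly for transmission.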
18. The distributed speech processing system of claim 17 wherein said vector of parameters is further defined as either pulse code modulation (PCM), μ-law encoded PCM, filter bank energies, line spectral frequencies, or cepstral coefficients.
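μ-law encoded PCM compresses each sample logarithmically before quantization, giving quiet sounds finer resolution than loud ones. This sketch uses the continuous μ-law curve with μ = 255 and a plain 8-bit uniform quantizer. ITU-T G.711 specifies a segmented approximation with a different code layout, so these codes are not G.711-compatible.

```python
import math
from typing import List, Sequence

MU = 255  # value of mu used in North American and Japanese telephony


def mu_law_compress(x: float, mu: int = MU) -> float:
    """Continuous mu-law companding of a sample in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = MU) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def encode_8bit(samples: Sequence[float], mu: int = MU) -> List[int]:
    """Compress samples in [-1, 1] and quantize them to integer codes 0..255."""
    return [int(round((mu_law_compress(s, mu) + 1) / 2 * 255)) for s in samples]


def decode_8bit(codes: Sequence[int], mu: int = MU) -> List[float]:
    """Map integer codes 0..255 back to approximate sample values."""
    return [mu_law_expand(c / 255 * 2 - 1, mu) for c in codes]
```

For example, `encode_8bit([-1.0, 0.0, 1.0])` gives `[0, 128, 255]`.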
19. The distributed speech processing system of claim 11 wherein said speech model server further comprises:
a Hidden Markov Model (HMM) database for storing phone model speech data corresponding to a plurality of phonemes; and
a model trainer coupled to said HMM database for decoding said vector of parameters into a phonetic transcription of the audio data, whereby said phonetic transcription serves as said speech reference model.
20. The speech processing system of claim 11 wherein said speech recognizer captures at least two training repetitions of audio data that serves as the speech training data and converts the audio data into a sequence of vectors of parameters that serves as said intermediate representation of the speech training data, where each vector corresponds to a training repetition and the parameters are indicative of the short term speech spectral shape of said audio data.
21. The speech processing system of claim 20 wherein said speech model server being operative to determine a reference vector from the sequence of vectors, align each vector in the sequence of vectors to the reference vector, determine a mean and a variance of each parameter in the reference vector computed over the values in the aligned vectors, thereby constructing said speech reference model from the sequence of vectors.
22. A distributed speech processing system for supporting applications that reside on a cellular telephone handset device, comprising:
an input means for capturing speech training data at the handset device;
a speech recognizer coupled to said input means and receptive of speech training data from said input means;
said speech recognizer being operative to process the speech training data into an intermediate representation of the speech training data and communicate the intermediate representation to a second computing device;
a speech model server residing on said second computing device, said second computing device being interconnected via a network to the handset device;
said speech model server receiving the intermediate representation of the speech training data and generating a speech reference model using the intermediate representation, said speech model server further being operative to communicate the speech reference model to said first computing device; and
a lexicon coupled to said speech recognizer for storing the speech reference model on the handset device.
[…]
a phoneticizer receptive of the intermediate representation for producing a plurality of phonetic transcriptions; and
a model trainer coupled to said phoneticizer for building said speech reference model based on said plurality of phonetic transcriptions.
26. The distributed speech processing system of claim 22 wherein said input means is further defined as a microphone for capturing audio data that serves as speech training data.
27. The distributed speech processing system of claim 26 wherein said speech recognizer converts the audio data into a digital input signal and translates the digital input signal into a vector of parameters which serves as the intermediate representation of the speech training data, said parameters being indicative of the short term speech spectral shape of said audio data.
28. The distributed speech processing system of claim 27 wherein said vector of parameters is further defined as either pulse code modulation (PCM), μ-law encoded PCM, filter bank energies, line spectral frequencies, or cepstral coefficients.
29. The distributed speech processing system of claim 22 wherein said speech model server further comprises:
a Hidden Markov Model (HMM) database for storing phone model speech data corresponding to a plurality of phonemes; and
a model trainer coupled to said HMM database for decoding said vector of parameters into a phonetic transcription of the audio data, whereby said phonetic transcription serves as said speech reference model.
30. The distributed speech processing system of claim 22 wherein said speech recognizer captures at least two training repetitions of audio data that serves as the speech training data and converts the audio data into a sequence of vectors of parameters that serves as said intermediate representation of the speech training data, where each vector corresponds to a training repetition and the parameters are indicative of the short term speech spectral shape of said audio data.
31. The distributed speech processing system of claim 30 wherein said speech model server being operative to determine a reference vector from the sequence of vectors, align each vector in the sequence of vectors to the reference vector, determine a mean and a variance of each parameter in the reference vector computed over the values in the aligned vectors, thereby constructing said speech reference model from the sequence of vectors.
32. A method of building speech reference models for use in a speech recognition system, comprising the steps of:
collecting speech training data at a first computing device;
processing the speech training data into an intermediate representation of the speech training data on said first computing device;
communicating said intermediate representation of the speech training data to a second computing device, said second computing device interconnected via a network to said first computing device;
creating a speech reference model from said intermediate representation at said second computing device; and
communicating said speech reference model to the first computing device for use in the speech recognition system.
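The five steps of claim 32 can be traced within a single process. In this sketch, function calls stand in for the network link, a JSON message is a hypothetical wire format, per-frame log energy is a stand-in front end, and the "reference model" is only a per-parameter mean and variance. Every name here is illustrative.

```python
import json
import math
from typing import Callable, Dict, List, Sequence


def client_process(word: str, samples: Sequence[float], frame_len: int = 4) -> str:
    """First device: reduce raw samples to per-frame [log-energy] vectors and serialize them."""
    frames: List[List[float]] = []
    for s in range(0, len(samples) - frame_len + 1, frame_len):
        energy = sum(x * x for x in samples[s:s + frame_len]) / frame_len
        frames.append([math.log(energy + 1e-10)])
    return json.dumps({"word": word, "frames": frames})


def server_create_model(message: str) -> str:
    """Second device: build a toy reference model (per-parameter mean and variance)."""
    req = json.loads(message)
    frames = req["frames"]
    dim, n = len(frames[0]), len(frames)
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    var = [sum((f[k] - mean[k]) ** 2 for f in frames) / n for k in range(dim)]
    return json.dumps({"word": req["word"], "mean": mean, "var": var})


class Lexicon:
    """On-device store mapping each trained word to its reference model."""

    def __init__(self) -> None:
        self.entries: Dict[str, dict] = {}

    def store(self, message: str) -> None:
        model = json.loads(message)
        self.entries[model["word"]] = {"mean": model["mean"], "var": model["var"]}


def train_word(word: str, samples: Sequence[float], lexicon: Lexicon,
               send: Callable[[str], str] = server_create_model) -> None:
    """Collect, process, transmit, create the model remotely, return it, store it."""
    lexicon.store(send(client_process(word, samples)))
```

Replacing `send` with a function that posts the message over a real network connection would leave the client-side processing and the lexicon unchanged, which mirrors the split between the two computing devices.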
Specification