Method and apparatus for converting text into audible signals using a neural network
First Claim
1. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
wherein training a neural network utilizes the steps of:
1a) inputting recorded audio messages;
1b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
1c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
1d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
1e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
1f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
1g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
1h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
1i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
1j) converting the one of the plurality of acoustic representations into an audible signal.
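Training steps 1a)–1f) amount to supervised regression: a feed-forward network whose input is the current frame's context description concatenated with the network's own previous output (the "recurrent input structure" of step 1f) is pushed toward each frame's target acoustic representation. A minimal numpy sketch under toy assumptions — the dimensions, learning rate, and data below are illustrative, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

CTX_DIM, ACOUSTIC_DIM, HIDDEN = 12, 4, 16   # illustrative sizes only

# One-hidden-layer feed-forward network; its input is the current
# context description concatenated with its previous acoustic output.
W1 = rng.normal(0, 0.1, (HIDDEN, CTX_DIM + ACOUSTIC_DIM))
W2 = rng.normal(0, 0.1, (ACOUSTIC_DIM, HIDDEN))

def forward(ctx, prev_out):
    x = np.concatenate([ctx, prev_out])      # recurrent input structure
    h = np.tanh(W1 @ x)
    return W2 @ h, h, x

def train_step(ctx, target, prev_out, lr=0.02):
    """One gradient step pushing the produced acoustic representation
    toward the target acoustic representation for this frame."""
    global W1, W2
    out, h, x = forward(ctx, prev_out)
    err = out - target                       # gradient of 0.5*||err||^2
    W2 -= lr * np.outer(err, h)
    dh = (W2.T @ err) * (1 - h ** 2)         # backprop through tanh
    W1 -= lr * np.outer(dh, x)
    return out, float((err ** 2).mean())

# toy training data: context descriptions and target acoustic vectors
ctxs = rng.normal(size=(50, CTX_DIM))
targets = ctxs[:, :ACOUSTIC_DIM] * 0.5       # arbitrary learnable mapping

losses = []
for epoch in range(200):
    prev = np.zeros(ACOUSTIC_DIM)            # reset recurrent input
    total = 0.0
    for ctx, tgt in zip(ctxs, targets):
        prev, loss = train_step(ctx, tgt, prev)
        total += loss
    losses.append(total / len(ctxs))
```

Feeding the previous output back in lets a plain feed-forward network carry frame-to-frame acoustic continuity without full backpropagation through time.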
Abstract
Text may be converted to audible signals, such as speech, by first training a neural network 106 using recorded audio messages 204. To begin the training, the recorded audio messages are divided into a series of audio frames 205 having a fixed duration 213. Then, each audio frame is assigned a phonetic representation 203 and a target acoustic representation 208, where the phonetic representation 203 is a binary word that represents the phone and articulation characteristics of the audio frame, while the target acoustic representation 208 is a vector of audio information such as pitch and energy. After training, the neural network 106 is used to convert text into speech. First, text that is to be converted is translated into a series of phonetic frames 401 of the same form as the phonetic representations 203 and having the fixed duration 213. Then the neural network produces acoustic representations in response to context descriptions 207 that include some of the phonetic frames 401. The acoustic representations are then converted into a speech waveform by a synthesizer 107.
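The framing step in the abstract — dividing a recorded message into audio frames of a fixed duration 213 — can be sketched as follows. The 10 ms frame length, 8 kHz sampling rate, and the helper name `frame_audio` are illustrative assumptions, not values from the patent:

```python
import numpy as np

FRAME_MS = 10          # hypothetical fixed frame duration
SAMPLE_RATE = 8000     # hypothetical sampling rate

def frame_audio(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Divide a recorded message into fixed-duration audio frames,
    zero-padding the tail so every frame has the same length."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = -(-len(samples) // frame_len)      # ceiling division
    padded = np.zeros(n_frames * frame_len)
    padded[:len(samples)] = samples
    return padded.reshape(n_frames, frame_len)

# one second of audio -> 100 frames of 80 samples each
frames = frame_audio(np.random.randn(SAMPLE_RATE))
```

Each resulting row would then be labeled with a phonetic representation (a binary word encoding phone and articulation characteristics) and a target acoustic vector such as pitch and energy.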
79 Citations
32 Claims
1. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
wherein training a neural network utilizes the steps of:
1a) inputting recorded audio messages;
1b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
1c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
1d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
1e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
1f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
1g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
1h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
1i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
1j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (2, 3, 4, 5, 6)
7. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
a) inputting recorded audio messages;
b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations;
d) generating a context description of a plurality of context descriptions for the each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
f) training a neural network to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation, wherein training the neural network includes the steps of:
1a) inputting recorded audio messages;
1b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
1c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
1d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
1e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
1f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
1g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
1h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
1i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
1j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
18. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
a) receiving a text stream;
b) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of a plurality of phonetic representations, and wherein the phonetic frame has a fixed duration;
c) assigning one of a plurality of context descriptions to the phonetic frame based on one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
d) converting, by a neural network, the phonetic frame into one of a plurality of acoustic representations, based on the one of the plurality of context descriptions, wherein training the neural network includes the steps of:
d1) inputting recorded audio messages;
d2) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
d3) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
d4) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
d5) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
d6) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
d7) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
d8) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
d9) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
e) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
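The runtime path of claim 18 — steps a) through e) — is a per-frame lookup-and-predict loop: text becomes fixed-duration phonetic frames, each frame gets a context description built from its neighbours, and the trained network maps each context description to an acoustic representation. A hedged sketch; the phone table, the three-frames-per-phone expansion, and all helper names are toy stand-ins, not the patent's data:

```python
# Hypothetical runtime pipeline for claim 18 steps a)-e).
PHONE_TABLE = {"h": 0, "i": 1, "_": 2}        # toy phone inventory
FRAMES_PER_PHONE = 3                          # toy fixed-duration expansion

def text_to_phonetic_frames(text):
    """Steps a)-b): translate text to phones, then repeat each phone so
    every frame covers the same fixed duration."""
    phones = [PHONE_TABLE.get(ch, PHONE_TABLE["_"]) for ch in text.lower()]
    return [p for p in phones for _ in range(FRAMES_PER_PHONE)]

def context_description(frames, i, width=2):
    """Step c): describe frame i by its own phonetic representation plus
    those of at least some other (neighbouring) frames."""
    lo, hi = max(0, i - width), min(len(frames), i + width + 1)
    return tuple(frames[lo:hi])

def frames_to_acoustics(frames, network):
    """Step d): the trained network maps each context description to an
    acoustic representation (here `network` is just any callable)."""
    return [network(context_description(frames, i)) for i in range(len(frames))]

frames = text_to_phonetic_frames("hi")
acoustics = frames_to_acoustics(frames, network=lambda ctx: sum(ctx) / len(ctx))
```

Step e), converting the acoustic representations into an audible signal, would hand `acoustics` to a synthesizer and is omitted here.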
29. A device for converting text into audible signals comprising:
a text-to-phone processor, wherein the text-to-phone processor translates a text stream into a series of phonetic representations;
a duration processor, operably coupled to the text-to-phone processor, wherein the duration processor generates duration data for the text stream;
a pre-processor, wherein the pre-processor converts the series of phonetic representations and the duration data into a series of phonetic frames, wherein each phonetic frame of the series of phonetic frames is of a fixed duration and has a context description, and wherein the context description is based on each phonetic frame of the series of phonetic frames and at least some other phonetic frame of the series of phonetic frames; and
a neural network, which can be trained, which generates an acoustic representation for each phonetic frame of the series of phonetic frames based on the context description, wherein training the neural network includes the steps of:
a) inputting recorded audio messages;
b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (30)
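Claim 29's apparatus is a composition of four elements: text-to-phone processor, duration processor, pre-processor, and trained neural network. A structural sketch in which only the wiring mirrors the claim; every class name and the trivial per-processor behaviour are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TextToPhoneProcessor:
    def translate(self, text):
        return list(text.lower())              # toy phonetic representations

@dataclass
class DurationProcessor:
    frames_per_phone: int = 2                  # toy duration data
    def durations(self, phones):
        return [self.frames_per_phone] * len(phones)

@dataclass
class PreProcessor:
    def to_frames(self, phones, durations):
        # expand each phone into fixed-duration phonetic frames, each
        # with a context description (here: the frame plus neighbours)
        frames = [p for p, d in zip(phones, durations) for _ in range(d)]
        return [(f, tuple(frames[max(0, i - 1):i + 2]))
                for i, f in enumerate(frames)]

@dataclass
class Device:
    t2p: TextToPhoneProcessor
    dur: DurationProcessor
    pre: PreProcessor
    network: object                            # trained network stand-in

    def synthesize(self, text):
        phones = self.t2p.translate(text)
        frames = self.pre.to_frames(phones, self.dur.durations(phones))
        # one acoustic representation per phonetic frame
        return [self.network(ctx) for _, ctx in frames]

device = Device(TextToPhoneProcessor(), DurationProcessor(), PreProcessor(),
                network=lambda ctx: len(ctx))
out = device.synthesize("go")
```

A real device would pass the acoustic representations on to a synthesizer for waveform generation, which this sketch omits.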
31. A speech synthesizing device within a vehicular navigation system to generate an audible output to a driver of a vehicle comprising:
a directional database consisting of a plurality of text streams;
a text-to-phone processor, operably coupled to the directional database, wherein the text-to-phone processor translates a text stream of the plurality of text streams into a series of phonetic representations;
a duration processor, operably coupled to the text-to-phone processor, wherein the duration processor generates duration data for the text stream;
a pre-processor, wherein the pre-processor converts the series of phonetic representations and the duration data into a series of phonetic frames, wherein each phonetic frame of the series of phonetic frames is of a fixed duration and has a context description, and wherein the context description is based on the each phonetic frame of the series of phonetic frames and at least some other phonetic frame of the series of phonetic frames; and
a neural network, which can be trained, which generates an acoustic representation for a phonetic frame of the series of phonetic frames based on the context description, wherein training the neural network includes the steps of:
a) inputting recorded audio messages;
b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (32)
Specification