Method and apparatus for converting text into audible signals using a neural network
First Claim
1. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
wherein training a neural network utilizes the steps of:
1a) inputting recorded audio messages;
1b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
1c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
1d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
1e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
1f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
1g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
1h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
1i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
1j) converting the one of the plurality of acoustic representations into an audible signal.
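Training steps 1a)–1f) amount to supervised regression: a feed-forward network whose input is the current frame's context description concatenated with the network's own previous output (the "recurrent input structure" of step 1f) is pushed toward each frame's target acoustic representation. A minimal numpy sketch under toy assumptions — the dimensions, learning rate, and data below are illustrative, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

CTX_DIM, ACOUSTIC_DIM, HIDDEN = 12, 4, 16   # illustrative sizes only

# One-hidden-layer feed-forward network; its input is the current
# context description concatenated with its previous acoustic output.
W1 = rng.normal(0, 0.1, (HIDDEN, CTX_DIM + ACOUSTIC_DIM))
W2 = rng.normal(0, 0.1, (ACOUSTIC_DIM, HIDDEN))

def forward(ctx, prev_out):
    x = np.concatenate([ctx, prev_out])      # recurrent input structure
    h = np.tanh(W1 @ x)
    return W2 @ h, h, x

def train_step(ctx, target, prev_out, lr=0.02):
    """One gradient step pushing the produced acoustic representation
    toward the target acoustic representation for this frame."""
    global W1, W2
    out, h, x = forward(ctx, prev_out)
    err = out - target                       # gradient of 0.5*||err||^2
    W2 -= lr * np.outer(err, h)
    dh = (W2.T @ err) * (1 - h ** 2)         # backprop through tanh
    W1 -= lr * np.outer(dh, x)
    return out, float((err ** 2).mean())

# toy training data: context descriptions and target acoustic vectors
ctxs = rng.normal(size=(50, CTX_DIM))
targets = ctxs[:, :ACOUSTIC_DIM] * 0.5       # arbitrary learnable mapping

losses = []
for epoch in range(200):
    prev = np.zeros(ACOUSTIC_DIM)            # reset recurrent input
    total = 0.0
    for ctx, tgt in zip(ctxs, targets):
        prev, loss = train_step(ctx, tgt, prev)
        total += loss
    losses.append(total / len(ctxs))
```

Feeding the previous output back in lets a plain feed-forward network carry frame-to-frame acoustic continuity without full backpropagation through time.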
Abstract
Text may be converted to audible signals, such as speech, by first training a neural network 106 using recorded audio messages 204. To begin the training, the recorded audio messages are divided into a series of audio frames 205 having a fixed duration 213. Then, each audio frame is assigned a phonetic representation 203 and a target acoustic representation 208, where the phonetic representation 203 is a binary word that represents the phone and articulation characteristics of the audio frame, while the target acoustic representation 208 is a vector of audio information such as pitch and energy. After training, the neural network 106 is used to convert text into speech. First, text that is to be converted is translated into a series of phonetic frames 401 of the same form as the phonetic representations 203 and having the fixed duration 213. Then the neural network produces acoustic representations in response to context descriptions 207 that include some of the phonetic frames 401. The acoustic representations are then converted into a speech waveform by a synthesizer 107.
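The framing step in the abstract — dividing a recorded message into audio frames of a fixed duration 213 — can be sketched as follows. The 10 ms frame length, 8 kHz sampling rate, and the helper name `frame_audio` are illustrative assumptions, not values from the patent:

```python
import numpy as np

FRAME_MS = 10          # hypothetical fixed frame duration
SAMPLE_RATE = 8000     # hypothetical sampling rate

def frame_audio(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Divide a recorded message into fixed-duration audio frames,
    zero-padding the tail so every frame has the same length."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = -(-len(samples) // frame_len)      # ceiling division
    padded = np.zeros(n_frames * frame_len)
    padded[:len(samples)] = samples
    return padded.reshape(n_frames, frame_len)

# one second of audio -> 100 frames of 80 samples each
frames = frame_audio(np.random.randn(SAMPLE_RATE))
```

Each resulting row would then be labeled with a phonetic representation (a binary word encoding phone and articulation characteristics) and a target acoustic vector such as pitch and energy.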
79 Citations
32 Claims
1. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
wherein training a neural network utilizes the steps of:
1a) inputting recorded audio messages;
1b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
1c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
1d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
1e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
1f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
1g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
1h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
1i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
1j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (2, 3, 4, 5, 6)
7. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
a) inputting recorded audio messages;
b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations;
d) generating a context description of a plurality of context descriptions for the each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
f) training a neural network to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation, wherein training the neural network includes the steps of:
1a) inputting recorded audio messages;
1b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
1c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
1d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
1e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
1f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
1g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
1h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
1i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
1j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
18. A method for training and utilizing a neural network that is used to convert text streams into audible signals, the method comprising the steps of:
a) receiving a text stream;
b) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of a plurality of phonetic representations, and wherein the phonetic frame has a fixed duration;
c) assigning one of a plurality of context descriptions to the phonetic frame based on one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
d) converting, by a neural network, the phonetic frame into one of a plurality of acoustic representations, based on the one of the plurality of context descriptions, wherein training the neural network includes the steps of:
d1) inputting recorded audio messages;
d2) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
d3) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
d4) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
d5) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
d6) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
d7) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
d8) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
d9) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
e) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
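The runtime path of claim 18 — steps a) through e) — is a per-frame lookup-and-predict loop: text becomes fixed-duration phonetic frames, each frame gets a context description built from its neighbours, and the trained network maps each context description to an acoustic representation. A hedged sketch; the phone table, the three-frames-per-phone expansion, and all helper names are toy stand-ins, not the patent's data:

```python
# Hypothetical runtime pipeline for claim 18 steps a)-e).
PHONE_TABLE = {"h": 0, "i": 1, "_": 2}        # toy phone inventory
FRAMES_PER_PHONE = 3                          # toy fixed-duration expansion

def text_to_phonetic_frames(text):
    """Steps a)-b): translate text to phones, then repeat each phone so
    every frame covers the same fixed duration."""
    phones = [PHONE_TABLE.get(ch, PHONE_TABLE["_"]) for ch in text.lower()]
    return [p for p in phones for _ in range(FRAMES_PER_PHONE)]

def context_description(frames, i, width=2):
    """Step c): describe frame i by its own phonetic representation plus
    those of at least some other (neighbouring) frames."""
    lo, hi = max(0, i - width), min(len(frames), i + width + 1)
    return tuple(frames[lo:hi])

def frames_to_acoustics(frames, network):
    """Step d): the trained network maps each context description to an
    acoustic representation (here `network` is just any callable)."""
    return [network(context_description(frames, i)) for i in range(len(frames))]

frames = text_to_phonetic_frames("hi")
acoustics = frames_to_acoustics(frames, network=lambda ctx: sum(ctx) / len(ctx))
```

Step e), converting the acoustic representations into an audible signal, would hand `acoustics` to a synthesizer and is omitted here.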
29. A device for converting text into audible signals comprising:
a text-to-phone processor, wherein the text-to-phone processor translates a text stream into a series of phonetic representations;
a duration processor, operably coupled to the text-to-phone processor, wherein the duration processor generates duration data for the text stream;
a pre-processor, wherein the pre-processor converts the series of phonetic representations and the duration data into a series of phonetic frames, wherein each phonetic frame of the series of phonetic frames is of a fixed duration and has a context description, and wherein the context description is based on each phonetic frame of the series of phonetic frames and at least some other phonetic frame of the series of phonetic frames; and
a neural network, which can be trained, which generates an acoustic representation for each phonetic frame of the series of phonetic frames based on the context description, wherein training the neural network includes the steps of:
a) inputting recorded audio messages;
b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (30)
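Claim 29's apparatus is a composition of four elements: text-to-phone processor, duration processor, pre-processor, and trained neural network. A structural sketch in which only the wiring mirrors the claim; every class name and the trivial per-processor behaviour are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TextToPhoneProcessor:
    def translate(self, text):
        return list(text.lower())              # toy phonetic representations

@dataclass
class DurationProcessor:
    frames_per_phone: int = 2                  # toy duration data
    def durations(self, phones):
        return [self.frames_per_phone] * len(phones)

@dataclass
class PreProcessor:
    def to_frames(self, phones, durations):
        # expand each phone into fixed-duration phonetic frames, each
        # with a context description (here: the frame plus neighbours)
        frames = [p for p, d in zip(phones, durations) for _ in range(d)]
        return [(f, tuple(frames[max(0, i - 1):i + 2]))
                for i, f in enumerate(frames)]

@dataclass
class Device:
    t2p: TextToPhoneProcessor
    dur: DurationProcessor
    pre: PreProcessor
    network: object                            # trained network stand-in

    def synthesize(self, text):
        phones = self.t2p.translate(text)
        frames = self.pre.to_frames(phones, self.dur.durations(phones))
        # one acoustic representation per phonetic frame
        return [self.network(ctx) for _, ctx in frames]

device = Device(TextToPhoneProcessor(), DurationProcessor(), PreProcessor(),
                network=lambda ctx: len(ctx))
out = device.synthesize("go")
```

A real device would pass the acoustic representations on to a synthesizer for waveform generation, which this sketch omits.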
31. A speech synthesizing device within a vehicular navigation system to generate an audible output to a driver of a vehicle comprising:
a directional database consisting of a plurality of text streams;
a text-to-phone processor, operably coupled to the directional database, wherein the text-to-phone processor translates a text stream of the plurality of text streams into a series of phonetic representations;
a duration processor, operably coupled to the text-to-phone processor, wherein the duration processor generates duration data for the text stream;
a pre-processor, wherein the pre-processor converts the series of phonetic representations and the duration data into a series of phonetic frames, wherein each phonetic frame of the series of phonetic frames is of a fixed duration and has a context description, and wherein the context description is based on the each phonetic frame of the series of phonetic frames and at least some other phonetic frame of the series of phonetic frames; and
a neural network, which can be trained, which generates an acoustic representation for a phonetic frame of the series of phonetic frames based on the context description, wherein training the neural network includes the steps of:
a) inputting recorded audio messages;
b) dividing the recorded audio messages into a series of audio frames, wherein each audio frame has a fixed duration;
c) assigning, for each audio frame of the series of audio frames, a phonetic representation of a plurality of phonetic representations that include articulation characteristics;
d) generating a context description of a plurality of context descriptions for each audio frame based on the phonetic representation of the each audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating syntactic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, generating phonetic boundary information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames, and generating a description of prominence of syntactic information based on the phonetic representation of the audio frame and the phonetic representation of at least some other audio frames of the series of audio frames;
e) assigning, for the each audio frame, a target acoustic representation of a plurality of acoustic representations;
f) training a feed-forward neural network with a recurrent input structure to associate an acoustic representation of the plurality of acoustic representations with the context description of the each audio frame, wherein the acoustic representation substantially matches the target acoustic representation;
wherein upon receiving a text stream, converting the text stream into an audible signal utilizing the steps of:
g) converting the text stream into a series of phonetic frames, wherein a phonetic frame of the series of phonetic frames includes one of the plurality of phonetic representations, and wherein a phonetic frame has the fixed duration;
h) assigning one of the plurality of context descriptions to the phonetic frame based on the one of the plurality of phonetic representations and phonetic representations of at least some other phonetic frames of the series of phonetic frames;
i) converting, by the neural network, the phonetic frame into one of the plurality of acoustic representations, based on the one of the plurality of context descriptions; and
j) converting the one of the plurality of acoustic representations into an audible signal.
- View Dependent Claims (32)
Specification