Method, apparatus and system for generating acoustic parameters in a text-to-speech system using a neural network

US 6,178,402 B1
Filed: 04/29/1999
Issued: 01/23/2001
Est. Priority Date: 04/29/1999
Status: Active Grant

First Claim

1. A method for generating a series of acoustic descriptions in a text-to-speech system based upon a linguistic description of text comprising the steps of:

a) generating an information vector for each segment description in the linguistic description, wherein the information vector includes a description of a sequence of segments surrounding a described segment;

b) using a neural network to generate a representation of a trajectory of acoustic parameters, said trajectory being associated with the described segment; and

c) generating the series of acoustic descriptions by computing points on the trajectory at identified instants, for each of a set of time periods making up the segment, the trajectory consists of each acoustic parameter in the space of acoustic parameters being equal to a polynomial function of time, wherein the polynomial functions are cubic functions, wherein the number of time periods making up the segment is two.
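
Step (c) can be illustrated with a minimal Python sketch. It assumes the trajectory has already been decoded into cubic coefficients per time period and per acoustic parameter, expressed in normalized time within each period. The function name `eval_trajectory`, the array layout, and the millisecond units are illustrative assumptions and do not come from the patent.

```python
import numpy as np

def eval_trajectory(coeffs, boundaries, instants):
    """Evaluate a piecewise-cubic trajectory at the given instants.

    coeffs:     shape (n_periods, n_params, 4); coeffs[p, k] holds (c0, c1, c2, c3)
                for parameter k in period p, as a cubic in normalized time u in [0, 1].
    boundaries: shape (n_periods + 1,), period start/end times (assumed ms).
    instants:   shape (n_frames,), times within the segment (assumed ms).
    Returns an array of shape (n_frames, n_params), one acoustic description per instant.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    boundaries = np.asarray(boundaries, dtype=float)
    instants = np.asarray(instants, dtype=float)
    n_periods = coeffs.shape[0]
    # Locate the period containing each instant; the final boundary maps to the last period.
    period = np.clip(np.searchsorted(boundaries, instants, side="right") - 1,
                     0, n_periods - 1)
    start = boundaries[period]
    length = boundaries[period + 1] - start
    u = (instants - start) / length                    # normalized time in each period
    powers = np.stack([u**0, u, u**2, u**3], axis=1)   # shape (n_frames, 4)
    # For every frame and parameter, sum c_j * u**j over j = 0..3.
    return np.einsum("fj,fkj->fk", powers, coeffs[period])
```

With two time periods, `coeffs` has shape `(2, n_params, 4)` and `boundaries` has three entries: the segment start, the split point, and the segment end.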

4 Assignments

0 Petitions

Abstract

The present invention provides a method, device and system to generate acoustic parameters in a text-to-speech system utilizing a neural network to generate a representation of a trajectory in an acoustic parameter space across a phonetic segment.

36 Citations

27 Claims

1. A method for generating a series of acoustic descriptions in a text-to-speech system based upon a linguistic description of text comprising the steps of:
- a) generating an information vector for each segment description in the linguistic description, wherein the information vector includes a description of a sequence of segments surrounding a described segment;
  
  b) using a neural network to generate a representation of a trajectory of acoustic parameters, said trajectory being associated with the described segment; and
  
  c) generating the series of acoustic descriptions by computing points on the trajectory at identified instants, for each of a set of time periods making up the segment, the trajectory consists of each acoustic parameter in the space of acoustic parameters being equal to a polynomial function of time, wherein the polynomial functions are cubic functions, wherein the number of time periods making up the segment is two.
- - 2. The method of claim 1 wherein the speech is described as a sequence of phone identifications and the segments for which duration is being generated are segments of speech expressing predetermined phones in the sequence of phone identifications, wherein segment descriptions include the phone identifications.
  - 3. The method of claim 2 wherein the information vector also includes descriptive information for a context associated with the described segment, said descriptive information including at least one of:
    - (a) articulatory features associated with each phone in the sequence of phones;
      
      (b) locations of syllable, word and other syntactic and intonational boundaries;
      
      (c) syllable strength information;
      
      (d) descriptive information of a word type; and
      
      (e) rule firing information.
  - 4. The method of claim 1 wherein the representation of the trajectory consists of the value of each acoustic parameter in the space of acoustic parameters at the beginning of each time period, one-third of the way through each time period, two-thirds of the way through each time period, and at the end of each time period.
  - 5. The method of claim 4 wherein the neural network output does not include a value for each parameter at the beginning of each time period, the value of the parameter at the end of the previous time period being used instead.
  - 6. The method of claim 1 wherein the neural network is a pretrained feedforward neural network.
  - 7. The method of claim 6 wherein the pretrained neural network has been trained using back-propagation of errors.
  - 8. The method of claim 7 wherein training data for the pretrained network has been generated by recording natural speech, partitioning the speech data into segments associated with identified phones, marking any other syntactical, intonational and stress information used in the method, using a vocoder to generate a series of acoustic parameters corresponding to the speech, using optimization techniques to determine a representation of a trajectory that approximates the actual series of acoustic parameters and processing into informational vectors and target output for the neural network.
  - 9. The method of claim 1 wherein the steps of the method are stored in a memory unit of a computer.
  - 10. The method of claim 1 wherein the steps of the method are implemented by a Digital Signal Processor.
  - 11. The method of claim 1 wherein the steps of the method are embodied in a tangible medium of/for an Application Specific Integrated Circuit (ASIC).
  - 12. The method of claim 1 wherein the steps of the method are embodied in a tangible medium of a gate array.
  - 13. The method of claim 1 further including the step of providing the series of descriptions to a vocoder to generate speech.
  - 14. The method of claim 1 wherein the representation of the trajectory includes a representation of a duration for the segment.
  - 15. The method of claim 1 wherein the duration of the segment is predetermined and included as part of the information vector provided to the neural network.
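
The four-value representation of claims 4 and 5 can be sketched as follows. A cubic is uniquely determined by its values at four distinct instants, so the values at the beginning, one-third, two-thirds, and end of a time period define the trajectory over that period. The sketch below uses Lagrange interpolation to evaluate that cubic, which is one possible choice and not necessarily the computation used in the patent. The names `cubic_through` and `expand_periods`, and the `prev_end` argument (the value carried over from the end of the preceding period), are illustrative assumptions.

```python
import numpy as np

# Normalized instants within a time period: start, 1/3, 2/3, end.
NODES = np.array([0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])

def cubic_through(values, u):
    """Evaluate the unique cubic passing through `values` at NODES.

    values: shape (4,) or (4, n_params), parameter values at the four nodes.
    u:      scalar or array of normalized times in [0, 1].
    Returns shape (len(u),) or (len(u), n_params).
    """
    values = np.asarray(values, dtype=float)
    u = np.atleast_1d(np.asarray(u, dtype=float))
    result = np.zeros((u.size,) + values.shape[1:])
    for i in range(4):
        # Lagrange basis polynomial: 1 at NODES[i], 0 at the other three nodes.
        basis = np.ones_like(u)
        for j in range(4):
            if j != i:
                basis *= (u - NODES[j]) / (NODES[i] - NODES[j])
        result += basis.reshape((-1,) + (1,) * (values.ndim - 1)) * values[i]
    return result

def expand_periods(prev_end, net_outputs):
    """Rebuild four values per period from three network outputs per period.

    Following the idea of claim 5, the start value of each period is taken
    from the end value of the previous period.
    prev_end:    shape (n_params,), value at the end of the preceding period (assumed given).
    net_outputs: shape (n_periods, 3, n_params), values at 1/3, 2/3 and the end.
    Returns shape (n_periods, 4, n_params).
    """
    periods = []
    start = np.asarray(prev_end, dtype=float)
    for out in np.asarray(net_outputs, dtype=float):
        periods.append(np.vstack([start[None, :], out]))
        start = out[-1]
    return np.stack(periods)
```

Sharing the boundary value between adjacent periods reduces the network output from four to three values per parameter per period and keeps the trajectory continuous across period boundaries.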

16. A device for generating a series of acoustic descriptions in a text-to-speech system based upon a linguistic description of text comprising:
- a) a linguistic information preprocessor, operably coupled to receive the linguistic description, to generate an information vector for each segment description in the linguistic description, wherein the information vector includes a description of a sequence of segments surrounding a described segment;
  
  b) a neural network, operably coupled to the linguistic information processor, to generate a representation of a trajectory in a space of acoustic parameters, said trajectory being associated with the described segment; and
  
  c) a trajectory computation unit, operably coupled to the neural network, to generate the series of acoustic descriptions by computing points on the trajectory at identified instants, for each of a set of time periods making up the segment, the trajectory consists of each acoustic parameter in the space of acoustic parameters being equal to a polynomial function of time, wherein the polynomial functions are cubic functions, wherein the number of time periods making up the segment is two.
- - 17. The device of claim 16 wherein the speech is described as a sequence of phone identifications and the segments for which duration is being generated are segments of speech expressing predetermined phones in the sequence of phone identifications, wherein segment descriptions include the phone identifications.
  - 18. The device of claim 17 wherein the information vector also includes descriptive information for a context associated with the described segment, said descriptive information including at least one of:
    - (a) articulatory features associated with each phone in the sequence of phones;
      
      (b) locations of syllable, word and other syntactic and intonational boundaries;
      
      (c) syllable strength information;
      
      (d) descriptive information of a word type; and
      
      (e) rule firing information.
  - 19. The device of claim 16 wherein the representation of the trajectory consists of the value of each acoustic parameter in the space of acoustic parameters at the beginning of each time period, one-third of the way through each time period, two-thirds of the way through each time period, and at the end of each time period.
  - 20. The device of claim 19 wherein the neural network output does not include a value for each parameter at the beginning of each time period, the value of the parameter at the end of the previous time period being used instead.
  - 21. The device of claim 16 wherein the neural network is a pretrained feedforward neural network.
  - 22. The device of claim 21 wherein the pretrained neural network has been trained using back-propagation of errors.
  - 23. The device of claim 22 wherein training data for the pretrained network has been generated by recording natural speech, partitioning the speech data into segments associated with identified phones, marking any other syntactical, intonational and stress information used in the method, using a vocoder to generate a series of acoustic parameters corresponding to the speech, using optimization techniques to determine a representation of a trajectory that approximates the actual series of acoustic parameters and processing into informational vectors and target output for the neural network.
  - 24. The device of claim 16 further including a vocoder, operably coupled to the trajectory computation unit, for generating speech.
  - 25. The device of claim 16 wherein the representation of the trajectory includes a representation of a duration for the segment.
  - 26. The device of claim 16 wherein the duration of the segment is predetermined and included as part of the information vector provided to the neural network.
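
The information vector produced by the linguistic information preprocessor can be sketched as a one-hot encoding of the phones in a window around the described segment, optionally followed by a predetermined duration as in claim 26. The claims do not specify an encoding, so everything here is an assumption made for illustration: the toy phone inventory `PHONES`, the window size `context`, the silence padding, and the scaling of the duration to seconds.

```python
import numpy as np

# Hypothetical toy phone inventory; a real system would use a full phone set.
PHONES = ["sil", "k", "ae", "t"]
PHONE_INDEX = {phone: i for i, phone in enumerate(PHONES)}

def information_vector(phones, index, context=2, duration_ms=None):
    """Build an information vector for the segment at position `index`.

    phones:      list of phone identifications for the utterance.
    index:       position of the described segment in `phones`.
    context:     number of neighbouring segments to include on each side.
    duration_ms: optional predetermined duration, appended as a final entry.
    """
    slots = []
    for pos in range(index - context, index + context + 1):
        # Positions outside the utterance are padded with silence.
        phone = phones[pos] if 0 <= pos < len(phones) else "sil"
        one_hot = np.zeros(len(PHONES))
        one_hot[PHONE_INDEX[phone]] = 1.0
        slots.append(one_hot)
    vec = np.concatenate(slots)
    if duration_ms is not None:
        vec = np.append(vec, duration_ms / 1000.0)   # duration in seconds
    return vec
```

Further context features listed in claim 18, such as boundary locations or syllable strength, could be appended to the vector in the same way as the duration entry.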

27. A text-to-speech synthesizer to generate a series of acoustic descriptions in a text-to-speech system based upon a linguistic description of text comprising:
- a) a linguistic information preprocessor, operably coupled to receive the linguistic description, to generate an information vector for each segment description in the linguistic description, wherein the information vector includes a description of a sequence of segments surrounding a described segment;
  
  b) a neural network, operably coupled to the linguistic information processor, to generate a representation of a trajectory in a space of acoustic parameters, said trajectory being associated with the described segment; and
  
  c) a trajectory computation unit, operably coupled to the neural network, to generate the series of descriptions by computing points on the trajectory at identified instants, for each of a set of time periods making up the segment, the trajectory consists of each acoustic parameter in the space of acoustic parameters being equal to a polynomial function of time, wherein the polynomial functions are cubic functions, wherein the number of time periods making up the segment is two.
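
The data flow of the synthesizer in claim 27 (information vector, then neural network, then trajectory computation unit) can be sketched end to end. This is a toy illustration under stated assumptions: the network is a single-hidden-layer feedforward net with random placeholder weights, not a trained model, and its output is assumed to be laid out as cubic coefficients for two time periods. The names `FeedforwardNet` and `synthesize_segment`, the layer sizes, and the frame count are hypothetical.

```python
import numpy as np

class FeedforwardNet:
    """One-hidden-layer feedforward network with untrained placeholder weights."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def __call__(self, x):
        hidden = np.tanh(x @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2

def synthesize_segment(info_vector, net, n_params, frames_per_period=3):
    """Map one information vector to a series of acoustic descriptions.

    The network output is interpreted (by assumption) as cubic coefficients
    with layout (period, parameter, c0..c3) for two time periods.
    Returns shape (2 * frames_per_period, n_params).
    """
    coeffs = net(info_vector).reshape(2, n_params, 4)
    # Evenly spaced normalized instants within each period, excluding the end point.
    u = np.linspace(0.0, 1.0, frames_per_period, endpoint=False)
    powers = np.stack([u**0, u, u**2, u**3], axis=1)        # (frames, 4)
    frames = [powers @ coeffs[p].T for p in range(2)]       # each (frames, n_params)
    return np.vstack(frames)
```

In a working system the weights would come from training on recorded speech, and the resulting frames would be passed to a vocoder to produce audio.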

Current Assignee: Google Technology Holdings LLC (Alphabet Inc.)
Original Assignee: Motorola, Inc. (Motorola Solutions, Inc.)
Inventors: Corrigan, Gerald E.
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Azad, Abul K.
Application Number: US09/301,711
Time in Patent Office: 635 Days
Field of Search: 704/232, 704/257, 704/259, 704/260
US Class Current: 704/259
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/063   Training
G10L 15/16   using artificial neural net...
G10L 2015/025   Phonemes, fenemes or fenone...
