Method, device and system for using statistical information to reduce computation and memory requirements of a neural network based speech synthesis system

US 5,913,194 A
Filed: 07/14/1997
Issued: 06/15/1999
Est. Priority Date: 07/14/1997
Status: Expired due to Term

Abstract

A method (400), device and system (300) provide, in response to linguistic information, efficient generation of a parametric representation of speech using a neural network. The method provides, in response to linguistic information efficient generation of a refined parametric representation of speech, comprising the steps of: A) using a data selection module to retrieve representative parameter vectors for each segment description according to the phonetic segment type and the phonetic segment types included in adjacent segment descriptions; B) interpolating between the representative parameter vectors according to the segment descriptions and duration to provide interpolated statistical parameters; C) converting the interpolated statistical parameters and linguistic information to neural network input parameters; D) utilizing a statistically enhanced neural network/neural network with post-processor to provide neural network output parameters that correspond to a parametric representation of speech; and converting the neural network output parameters to a refined parametric representation of speech.
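
The four steps in the abstract can be read as a small data-flow pipeline. The following is a minimal sketch of that flow, not the patented implementation: the lookup table `REPRESENTATIVES`, the toy phone inventory `PHONES`, the choice of durations measured in whole frames, the one-hot phone encoding, and the caller-supplied `network` and `post_process` callables are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Segment = Tuple[str, int]  # (phonetic segment type, duration in frames) -- assumed units

# Hypothetical table: (left, current, right) phone context -> representative vector.
REPRESENTATIVES: Dict[Tuple[str, str, str], Vector] = {
    ("sil", "a", "b"): [1.0, 0.0],
    ("a", "b", "sil"): [0.0, 1.0],
}
PHONES = ["a", "b"]  # toy inventory used for the one-hot encoding


def select_vectors(segments: List[Segment]) -> List[Vector]:
    """Step A: retrieve one representative vector per segment from its phonetic context."""
    phones = ["sil"] + [p for p, _ in segments] + ["sil"]
    return [REPRESENTATIVES[(phones[i], phones[i + 1], phones[i + 2])]
            for i in range(len(segments))]


def interpolate(vectors: List[Vector], segments: List[Segment]) -> List[Vector]:
    """Step B: per-frame linear interpolation between segment-centre targets."""
    centres, start = [], 0.0
    for _, dur in segments:
        centres.append(start + dur / 2.0)
        start += dur
    frames = []
    for t in range(int(start)):
        x = t + 0.5  # centre of frame t
        if x <= centres[0]:
            frames.append(list(vectors[0]))
        elif x >= centres[-1]:
            frames.append(list(vectors[-1]))
        else:
            k = max(i for i, c in enumerate(centres) if c <= x)
            w = (x - centres[k]) / (centres[k + 1] - centres[k])
            frames.append([(1 - w) * a + w * b
                           for a, b in zip(vectors[k], vectors[k + 1])])
    return frames


def preprocess(frames: List[Vector], segments: List[Segment]) -> List[Vector]:
    """Step C: append a one-hot phone identifier to each frame's statistical parameters."""
    inputs, t = [], 0
    for phone, dur in segments:
        one_hot = [1.0 if phone == p else 0.0 for p in PHONES]
        for _ in range(dur):
            inputs.append(frames[t] + one_hot)
            t += 1
    return inputs


def synthesize_parameters(segments: List[Segment],
                          network: Callable[[Vector], Vector],
                          post_process: Callable[[Vector], Vector]) -> List[Vector]:
    """Steps A-D chained; `network` and `post_process` are stand-ins supplied by the caller."""
    frames = interpolate(select_vectors(segments), segments)
    return [post_process(network(x)) for x in preprocess(frames, segments)]
```

Passing a pass-through `network` (for example `lambda x: x[:2]`) returns the interpolated statistical parameters unchanged, which makes it easy to inspect what a trained network would receive as its starting point.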

90 Claims

1. A method for providing, in response to linguistic information that includes a sequence of segment descriptions each of which includes a phonetic segment type and duration, efficient generation of a refined parametric representation of speech for providing synthetic speech, comprising the steps of:
- A) using a data selection module to retrieve representative parameter vectors for each segment description according to at least the phonetic segment type and phonetic segment types included in adjacent segment descriptions;
  
  B) interpolating between the representative parameter vectors according to the segment descriptions to provide interpolated statistical parameters;
  
  C) converting the interpolated statistical parameters and linguistic information to neural network input parameters;
  
  D) utilizing a neural network with a post-processor to convert the neural network input parameters into neural network output parameters that correspond to a parametric representation of speech and converting the neural network output parameters to a refined parametric representation of speech, wherein the refined parametric representation of speech can be used to provide synthetic speech.
- - 2. The method of claim 1 wherein the refined parametric representation of speech is a sequence of coder parameters suitable to be provided to a waveform synthesizer.
  - 3. The method of claim 2 further including a step of providing the refined parametric representation of speech to a waveform synthesizer to synthesize speech.
  - 4. The method of claim 1 wherein the interpolating between the representative parameter vectors is performed using a linear interpolation algorithm.
  - 5. The method of claim 1 wherein the interpolating between the representative parameter vectors is performed using a non-linear interpolation algorithm.
  - 6. The method of claim 5 wherein the non-linear interpolation algorithm is a cubic spline interpolation algorithm.
  - 7. The method of claim 5 wherein the non-linear interpolation algorithm is a Lagrange interpolation algorithm.
  - 8. The method of claim 1 wherein elements of the interpolated statistical parameters correspond to elements of the refined parametric representation of speech.
  - 9. The method of claim 1 wherein elements of the interpolated statistical parameters are derived from elements of the neural network output parameters.
  - 10. The method of claim 1 wherein the representative parameter vectors are retrieved according to linguistic context which is derived from one of:
    - A) a phonetic segment sequence;
      
      B) articulatory features;
      
      C) acoustic features;
      
      D) stress;
      
      E) prosody;
      
      F) syntax; and
      
      G) a combination of at least two of A-F.
  - 11. The method of claim 1 wherein the statistically enhanced neural network is a feedforward neural network.
  - 12. The method of claim 1 wherein the statistically enhanced neural network contains a recurrent feedback mechanism.
  - 13. The method of claim 1 wherein the statistically enhanced neural network is a multi-layer perceptron.
  - 14. The method of claim 1 wherein the statistically enhanced neural network input includes a tapped delay line input.
  - 15. The method of claim 1 wherein the statistically enhanced neural network is trained using a gradient descent technique.
  - 16. The method of claim 1 wherein the statistically enhanced neural network is trained using a Bayesian technique.
  - 17. The method of claim 1 wherein the statistically enhanced neural network is trained using back-propagation of errors.
  - 18. The method of claim 1 wherein the statistically enhanced neural network is composed of a layer of processing elements with a predetermined specified activation function and at least one of:
    - A) another layer of processing elements with a predetermined specified activation function;
      
      B) a multiple layer of processing elements with predetermined specified activation functions;
      
      C) a rule-based module that generates output based on internal rules and input to the rule-based module;
      
      D) a statistical system that generates output based on input and an internal statistical function; and
      
      E) a recurrent feedback mechanism.
  - 19. The method of claim 1 wherein the statistically enhanced neural network input information includes at least one of:
    - A) a phoneme identifier associated with each phoneme in current and adjacent segment descriptions;
      
      B) articulatory features associated with each phoneme in current and adjacent segment descriptions;
      
      C) locations of syllable, word and other predetermined syntactic and intonational boundaries;
      
      D) duration of time between syllable, word and other predetermined syntactic and intonational boundaries;
      
      E) syllable strength information;
      
      F) descriptive information of a word type, and;
      
      G) prosodic information which includes at least one of;
      
      1) locations of word endings and degree of disjuncture between words;
      
      2) locations of pitch accents and a form of the pitch accents;
      
      3) locations of boundaries marked in pitch contours and a form of the boundaries;
      
      4) time separating marked prosodic events, and;
      
      5) a number of prosodic events of a predetermined type in a time period separating a prosodic event of another predetermined type and a frame for which the refined parametric representation of speech is being generated.
  - 20. The method of claim 1 wherein the representative parameter vectors are generated by using a predetermined clustering algorithm.
  - 21. The method of claim 20 wherein the clustering algorithm is a k-means clustering algorithm.
  - 22. The method of claim 1 wherein the representative parameter vectors are generated by using an averaging algorithm.
  - 23. The method of claim 1 wherein the representative parameter vectors are derived by:
    - A) extracting vectors from a parameter database to create a set of similar parameter vectors; and
      
      B) computing a representative parameter vector from the set of similar parameter vectors.
  - 24. The method of claim 23 wherein the parameter database is a same database that is used to generate neural network training vectors.
  - 25. The method of claim 23 wherein the parameter database is derived from neural network training vectors.
  - 26. The method of claim 23 wherein the parameter database contains parametric representations of recorded speech and corresponding linguistic labels.
  - 27. The method of claim 26 wherein the corresponding linguistic labels contain phonetic segment labels and segment durations.
  - 28. The method of claim 23 wherein the representative parameter vectors consist of a sequence of parameter vectors wherein each parameter vector describes a portion of a phonetic segment.
  - 29. The method of claim 23 wherein the representative parameter vectors are derived by:
    - A) segmenting the duration of each phonetic segment in the parameter database into a finite number of regions; and
      
      B) computing a parameter vector for each region.
  - 30. The method of claim 23 wherein all of the set of similar parameter vectors are parametric representations of speech in the parameter database which correspond to speech having at least one of:
    - A) a same phonetic segment sequence;
      
      B) same articulatory features;
      
      C) same acoustic features;
      
      D) a same stress;
      
      E) a same prosody;
      
      F) a same syntax; and
      
      G) a combination of at least two of A-F.
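
Claims 4 through 7 name linear, cubic spline and Lagrange interpolation as options for the interpolation step. As an illustration of the Lagrange option only, the sketch below evaluates the polynomial through a handful of representative vectors at arbitrary frame times. The function names, the idea of placing each representative vector at a single time point, and the per-element treatment of the vectors are assumptions; the claims do not specify these details.

```python
from typing import List, Sequence


def lagrange_interpolate(times: Sequence[float], values: Sequence[float], t: float) -> float:
    """Evaluate the Lagrange polynomial through (times[i], values[i]) at time t."""
    total = 0.0
    for i, (ti, vi) in enumerate(zip(times, values)):
        basis = 1.0
        for j, tj in enumerate(times):
            if j != i:
                basis *= (t - tj) / (ti - tj)  # i-th Lagrange basis polynomial
        total += vi * basis
    return total


def interpolate_track(times: Sequence[float],
                      vectors: Sequence[Sequence[float]],
                      frame_times: Sequence[float]) -> List[List[float]]:
    """Interpolate each element of the parameter vectors independently (an assumption)."""
    dims = len(vectors[0])
    return [[lagrange_interpolate(times, [v[d] for v in vectors], t) for d in range(dims)]
            for t in frame_times]
```

With exactly two anchor points the Lagrange polynomial reduces to a straight line, so the same routine also covers the linear case of claim 4. With many anchor points a single high-degree polynomial can oscillate, which is one reason a piecewise method such as a cubic spline might be preferred in practice.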

31. A device for providing, in response to linguistic information that includes a sequence of segment descriptions each of which includes a phonetic segment type and a duration, efficient generation of a parametric representation of speech for providing synthetic speech, comprising:
- A) a data selection module, coupled to receive the sequence of segment descriptions, that retrieves representative parameter vectors for each segment description according to at least the phonetic segment type and phonetic segment types included in adjacent segment descriptions;
  
  B) an interpolation module, coupled to receive the sequence of segment descriptions and the representative parameter vectors, that interpolates between the representative parameter vectors according to the segment descriptions to provide interpolated statistical parameters;
  
  C) a pre-processor, coupled to receive linguistic information and the interpolated statistical parameters that generates neural network input parameters;
  
  D) a neural network with post-processor, coupled to receive neural network input parameters, that converts the neural network input parameters to neural network output parameters corresponding to a parametric representation of speech and converts the neural network output parameters to a refined parametric representation of speech, wherein the refined parametric representation of speech can be used to provide synthetic speech.
- - 32. The device of claim 31 wherein the refined parametric representation of speech is a sequence of coder parameters suitable to be provided to a waveform synthesizer.
  - 33. The device of claim 32 further including a waveform synthesizer, coupled to receive the sequence of coder parameters, that converts the coder parameters to synthesized speech.
  - 34. The device of claim 31 wherein interpolation module utilizes a linear interpolation algorithm.
  - 35. The device of claim 31 wherein the interpolation module utilizes a non-linear interpolation algorithm.
  - 36. The device of claim 35 wherein the non-linear interpolation algorithm is a cubic spline interpolation algorithm.
  - 37. The device of claim 35 wherein the non-linear interpolation algorithm is a Lagrange interpolation algorithm.
  - 38. The device of claim 31 wherein elements of the interpolated statistical parameters are identical to elements generated by the statistically enhanced neural network.
  - 39. The device of claim 31 wherein elements of the interpolated statistical parameters are derived from elements of the neural network output parameters.
  - 40. The device of claim 31 wherein the representative parameter vectors correspond to linguistic context which is derived from one of:
    - A) a phonetic segment sequence;
      
      B) articulatory features;
      
      C) acoustic features;
      
      D) stress;
      
      E) prosody;
      
      F) syntax; and
      
      G) a combination of at least two of A-F.
  - 41. The device of claim 31 wherein the statistically enhanced neural network is a feedforward neural network.
  - 42. The device of claim 31 wherein the statistically enhanced neural network contains a recurrent feedback mechanism.
  - 43. The device of claim 31 wherein the statistically enhanced neural network is a multi-layer perceptron.
  - 44. The device of claim 31 wherein the statistically enhanced neural network uses a tapped delay line input.
  - 45. The device of claim 31 wherein the statistically enhanced neural network is trained using a gradient descent technique.
  - 46. The device of claim 31 wherein the statistically enhanced neural network is trained using a Bayesian technique.
  - 47. The device of claim 31 wherein the statistically enhanced neural network is trained using back-propagation of errors.
  - 48. The device of claim 31 wherein the statistically enhanced neural network is composed of modules wherein each module is at least one of:
    - A) a single layer of processing elements with a predetermined activation function;
      
      B) a multiple layer of processing elements with predetermined activation functions;
      
      C) a rule-based module that generates output based on internal rules and input to the rule-based module;
      
      D) a statistical system that generates output based on input and a predetermined internal statistical function, and;
      
      E) a recurrent feedback mechanism.
  - 49. The device of claim 31 wherein the neural network input information includes at least one of:
    - A) a phoneme identifier associated with each phoneme in current and adjacent segment descriptions;
      
      B) articulatory features associated with each phoneme in the current and adjacent segment descriptions;
      
      C) locations of syllable, word and other predetermined syntactic and intonational boundaries;
      
      D) duration of time between syllable, word and other predetermined syntactic and intonational boundaries;
      
      E) syllable strength information;
      
      F) descriptive information of a word type, and;
      
      G) prosodic information which includes at least one of;
      
      1) locations of word endings and degree of disjuncture between words;
      
      2) locations of pitch accents and a form of the pitch accents;
      
      3) locations of boundaries marked in pitch contours and a form of the boundaries;
      
      4) time separating marked prosodic events, and;
      
      5) a number of prosodic events of a predetermined type in a time period separating a prosodic event of another predetermined type and a frame for which the refined parametric representation of speech is being generated.
  - 50. The device of claim 31 wherein the representative parameter vectors are generated by using a clustering algorithm.
  - 51. The device of claim 50 wherein the clustering algorithm is a k-means clustering algorithm.
  - 52. The device of claim 31 wherein the representative parameter vectors are generated by using a predetermined averaging algorithm.
  - 53. The device of claim 31 wherein the representative parameter vectors are derived by:
    - A) extracting vectors from a parameter database to create a set of similar parameter vectors; and
      
      B) computing a representative parameter vector from the set of similar parameter vectors.
  - 54. The device of claim 53 wherein the parameter database is a same database that is used to generate neural network training vectors.
  - 55. The device of claim 53 wherein the parameter database are derived from the neural network training vectors.
  - 56. The device of claim 53 wherein the parameter database contains parametric representations of recorded speech and corresponding linguistic labels.
  - 57. The device of claim 56 wherein the corresponding linguistic labels contain phonetic segment labels and segment durations.
  - 58. The device of claim 53 wherein the representative parameter vectors consist of a sequence of parameter vectors wherein each parameter vector describes a predetermined portion of a phonetic segment.
  - 59. The device of claim 53 wherein the representative parameter vectors are derived by:
    - A) segmenting the duration of each phonetic segment in the parameter database into a finite number of regions; and
      
      B) computing a parameter vector for each region.
  - 60. The device of claim 53 wherein all of the set of similar parameter vectors are parametric representations of speech in the parameter database which correspond to speech having at least one of:
    - A) a same phonetic segment sequence;
      
      B) same articulatory features;
      
      C) same acoustic features;
      
      D) a same stress;
      
      E) a same prosody;
      
      F) a same syntax; and
      
      G) a combination of at least two of A-F.
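
Claims 52, 53, 59 and 60 describe deriving the representative parameter vectors by grouping similar vectors from a parameter database, dividing each phonetic segment into a finite number of regions, and computing one vector per region. A minimal sketch of that idea using plain averaging follows. The database layout (a list of `(context, frames)` pairs), the equal-width region split, and all function names are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]
Context = Tuple[str, str, str]  # (left, current, right) phonetic segment types


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def region_means(frames: List[Vector], n_regions: int) -> List[Vector]:
    """Split one segment's frames into n_regions contiguous regions and average each."""
    n = len(frames)
    means = []
    for r in range(n_regions):
        lo = min(r * n // n_regions, n - 1)
        hi = max((r + 1) * n // n_regions, lo + 1)  # every region gets at least one frame
        means.append(mean_vector(frames[lo:hi]))
    return means


def build_representatives(database: List[Tuple[Context, List[Vector]]],
                          n_regions: int) -> Dict[Context, List[Vector]]:
    """Group segments by context, then average region by region across each group."""
    grouped: Dict[Context, List[List[Vector]]] = defaultdict(list)
    for context, frames in database:
        grouped[context].append(region_means(frames, n_regions))
    return {context: [mean_vector([ex[r] for ex in examples]) for r in range(n_regions)]
            for context, examples in grouped.items()}
```

Normalizing every segment to the same number of regions before averaging lets segments of different durations contribute to the same representative sequence. A clustering algorithm such as k-means (claims 50 and 51) could replace the simple mean where one context shows several distinct realizations.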

61. A text-to-speech system/speech synthesis system/dialog system having a device for providing, in response to linguistic information that includes a sequence of segment descriptions each of which includes a phonetic segment type and a duration, efficient generation of a parametric representation of speech for providing synthetic speech, the device comprising:
- A) a data selection module, coupled to receive the sequence of segment descriptions, that retrieves representative parameter vectors for each segment description according to at least the phonetic segment type and phonetic segment types included in adjacent segment descriptions;
  
  B) an interpolation module, coupled to receive the sequence of segment descriptions and the representative parameter vectors, that interpolates between the representative parameter vectors according to the segment descriptions to provide interpolated statistical parameters;
  
  C) a pre-processor, coupled to receive linguistic information and the interpolated statistical parameters that generates neural network input parameters;
  
  D) a neural network with a post-processor, coupled to receive neural network input parameters, that converts the neural network input parameters to neural network output parameters that correspond to a parametric representation of speech; and
  
  where selected, including a post-processor, coupled to receive the neural network output parameters that converts the neural network output parameters to a refined parametric representation of speech, wherein the refined parametric representation of speech can be used to provide synthetic speech.
- - 62. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the refined parametric representation of speech is a sequence of coder parameters suitable to be provided to a waveform synthesizer.
  - 63. The method of claim 62 further including a waveform synthesizer, coupled to receive the sequence of coder parameters, that converts the refined parametric representation of speech to synthesized speech.
  - 64. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein interpolation module utilizes a linear interpolation algorithm.
  - 65. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the interpolation module utilizes a non-linear interpolation algorithm.
  - 66. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the non-linear interpolation algorithm is a cubic spline interpolation algorithm.
  - 67. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the non-linear interpolation algorithm is a Lagrange interpolation algorithm.
  - 68. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein elements of the interpolated statistical parameters are identical to elements generated by the neural network output.
  - 69. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein elements of the interpolated statistical parameters is derived from elements of the neural network output parameters.
  - 70. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the representative parameter vectors correspond to linguistic context which is derived from one of:
    - A) phonetic segment sequence;
      
      B) articulatory features;
      
      C) acoustic features;
      
      D) stress;
      
      E) prosody;
      
      F) syntax; and
      
      G) a combination of at least two of A-F.
  - 71. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network is a feedforward neural network.
  - 72. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network contains a recurrent feedback mechanism.
  - 73. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network is a multi-layer perceptron.
  - 74. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network uses a tapped delay line input.
  - 75. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network is trained using a gradient descent technique.
  - 76. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network is trained using a Bayesian technique.
  - 77. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network is trained using back-propagation of errors.
  - 78. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the statistically enhanced neural network is composed of modules wherein each module is at least one of:
    - A) a single layer of processing elements with a specified activation function;
      
      B) a multiple layer of processing elements with specified activation functions;
      
      C) a rule based module that generates output based on internal rules and input to the rule based module;
      
      D) a statistical system that generates output based on input and an internal statistical function, and;
      
      E) a recurrent feedback mechanism.
  - 79. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the neural network input information includes at least one of:
    - A) phoneme identifier associated with each phoneme in current and adjacent segment descriptions;
      
      B) articulatory features associated with each phoneme in current and adjacent segment descriptions;
      
      C) locations of syllable, word and other syntactic and intonational boundaries;
      
      D) duration of time between syllable, word and other syntactic and intonational boundaries;
      
      E) syllable strength information;
      
      F) descriptive information of a word type, and;
      
      G) prosodic information which includes at least one of;
      
      1) locations of word endings and degree of disjuncture between words;
      
      2) locations of pitch accents and a form of the pitch accents;
      
      3) locations of boundaries marked in pitch contours and a form of the boundaries;
      
      4) time separating marked prosodic events, and;
      
      5) a number of prosodic events of a predetermined type in a time period separating a prosodic event of another predetermined type and a frame for which the refined parametric representation of speech is being generated.
  - 80. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the representative parameter vectors were generated by using a clustering algorithm.
  - 81. The text-to-speech system/speech synthesis system/dialog system of claim 80 wherein the clustering algorithm is a k-means clustering algorithm.
  - 82. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the representative parameter vectors were generated by using an averaging algorithm.
  - 83. The text-to-speech system/speech synthesis system/dialog system of claim 61 wherein the representative parameter vectors are derived by:
    - A) extracting vectors from a parameter database to create a set of similar parameter vectors; and
      
      B) computing a representative parameter vector from the set of similar parameter vectors.
  - 84. The text-to-speech system/speech synthesis system/dialog system of claim 83 wherein the parameter database is a same database that is used to generate neural network training vectors.
  - 85. The text-to-speech system/speech synthesis system/dialog system of claim 84 wherein the parameter database is derived from the neural network training vectors.
  - 86. The text-to-speech system/speech synthesis system/dialog system of claim 85 wherein the parameter database contains parametric representations of recorded speech and corresponding linguistic labels.
  - 87. The text-to-speech system/speech synthesis system/dialog system of claim 86 wherein the corresponding linguistic labels contain phonetic segment labels and segment durations.
  - 88. The text-to-speech system/speech synthesis system/dialog system of claim 83 wherein the representative parameter vectors consist of a sequence of parameter vectors wherein each parameter vector describes a portion of a phonetic segment.
  - 89. The text-to-speech system/speech synthesis system/dialog system of claim 83 wherein the representative parameter vectors are derived by:
    - A) segmenting the duration of each phonetic segment in the parameter database into a finite number of regions; and
      
      B) computing a parameter vector for each region.
  - 90. The text-to-speech system/speech synthesis system/dialog system of claim 89 wherein the set of similar parameter vectors are all parametric representations of speech in the parameter database which correspond to speech having a same:
    - A) phonetic segment sequence;
      
      B) articulatory features;
      
      C) acoustic features;
      
      D) stress;
      
      E) prosody;
      
      F) syntax; and
      
      G) a combination of at least two of A-F.
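
Claims 14, 44 and 74 mention a tapped delay line input to the network. One common reading, assumed here and not confirmed by the claim text, is that each network input is the current frame concatenated with a fixed number of preceding frames. The sketch below implements that reading; the function name, the zero padding before the first frame, and the newest-first ordering are illustrative choices.

```python
from typing import List

Vector = List[float]


def tapped_delay_line(frames: List[Vector], taps: int) -> List[Vector]:
    """For each frame t, concatenate frames t, t-1, ..., t-(taps-1), zero-padded at the start."""
    if taps < 1:
        raise ValueError("taps must be at least 1")
    if not frames:
        return []
    zero = [0.0] * len(frames[0])
    stacked = []
    for t in range(len(frames)):
        row: Vector = []
        for delay in range(taps):
            # Frames before the start of the utterance are replaced by zeros (an assumption).
            row.extend(frames[t - delay] if t - delay >= 0 else zero)
        stacked.append(row)
    return stacked
```

Stacking delayed frames this way gives a feedforward network a short window of temporal context without a recurrent connection, at the cost of an input layer that grows linearly with the number of taps.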

Current Assignee: Google Technology Holdings LLC (Alphabet Inc.)
Original Assignee: Motorola, Inc. (Motorola Solutions, Inc.)
Inventors: Massey, Noel; Corrigan, Gerald; Karaali, Orhan
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Zintel, Harold Albert
Application Number: US08/892,295
Time in Patent Office: 701 Days
Field of Search: 704/259, 704/265, 704/258
US Class Current: 704/259
CPC Class Codes:
G10L 13/02 Methods for producing synth...
G10L 25/30 using neural networks
