Method and system for synthesizing speech

US 5,970,453 A
Filed: 06/09/1995
Issued: 10/19/1999
Est. Priority Date: 01/07/1995
Status: Expired due to Fees


Abstract

A method and system for synthesizing acoustic waveforms in, for example, a text-to-speech system is disclosed which employs the concatenation of a very large number of very small, sub-phoneme acoustic units. Such sub-phoneme-sized audio segments, called wavelets, can be individually spectrally analyzed and labelled as fenones. Fenones are clustered into logically related groups called fenemes. Sequences of fenemes can be matched with individual phonemes, and hence words. In the case of a text-to-speech system, the required phonemes are determined from prior linguistic analysis of the input words in the text. Suitable sequences of fenemes are predicted for each phoneme in its own context using hidden Markov modelling techniques. A complete output waveform is constructed by concatenating wavelets to produce a very long sequence thereof, each wavelet corresponding to its respective feneme. The advantage of using a feneme set extracted from a training script read by a single human speaker is that it is possible to generate natural-sounding speech using a finite-sized codebook.
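
The clustering step described in the abstract can be illustrated with a small sketch. This is a hedged illustration, not the patent's implementation: it assumes each wavelet has already been reduced to a short feature vector, and it uses plain k-means with Euclidean distance (k-means is the clustering option named in the dependent claims). The function names `nearest` and `build_codebook`, the toy 2-D vectors, and the deterministic initialisation are all assumptions made for the example.

```python
import math

def nearest(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def build_codebook(vectors, k, iterations=20):
    """Cluster feature vectors into k groups with plain k-means.

    Returns (centroids, labels); labels[i] is the cluster index
    assigned to vectors[i].
    """
    # Assumption: seed with the first k vectors so the run is deterministic.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: label every vector with its nearest centroid.
        labels = [nearest(v, centroids) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy "spectral" vectors forming two obvious groups (illustrative data only).
vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centroids, labels = build_codebook(vectors, k=2)
```

Each resulting centroid plays the role of one codebook entry, and `labels` records which entry each input vector was assigned to. A realistic alphabet would hold hundreds to thousands of entries rather than two.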

119 Citations

25 Claims

1. A method for synthesizing speech from text, comprising the steps of:
- generating a sequence of sub-phoneme elements from text, each sub-phoneme element representing a corresponding acoustic waveform; and
  
  concatenating said sub-phoneme elements to produce an output waveform, wherein said generating step comprises the steps of:
  
  generating from said text corresponding speech elements; and
  
  mapping each speech element to one or more of a plurality of sub-phoneme elements to produce said sequence.
- - 2. A method as claimed in claim 1, wherein the mapping is performed using a hidden Markov model in which the states represent the sub-phoneme elements and the outputs are the speech elements.
  - 3. A method as claimed in claim 2, wherein the hidden Markov model is an n-gram model, where n is at least three.
  - 4. A method as claimed in claim 1 wherein said speech elements are phonemes.
  - 5. A method as claimed in claim 1, wherein each sub-phoneme element is a frequency domain representation of a corresponding acoustic waveform, and the step of concatenating comprises converting each frequency domain representation into a time domain representation and concatenating said time domain representations to produce said output waveform.
  - 6. A method as claimed in claim 1, wherein the step of concatenating comprises applying a window to each sub-phoneme element and concatenating together the result thereof in order to mitigate the effect of discontinuities between said sub-phoneme elements.
  - 7. A method as claimed in claim 1, wherein said sub-phoneme elements have durations of between 2 milliseconds and 20 milliseconds.
  - 8. A method as claimed in claim 1, wherein the duration of a sub-phoneme element is no less than the inverse of the instantaneous fundamental frequency.
  - 9. A method as claimed in claim 1, wherein said sequence of sub-phoneme elements is generated from an alphabet comprising between 300 and 10,000 sub-phoneme elements.
  - 10. A method as claimed in claim 1, further comprising the step of generating an alphabet of sub-phoneme elements.
  - 11. A method as claimed in claim 10, wherein the step of generating said alphabet comprises the steps of producing a set of first data elements by sampling an input acoustic waveform, clustering said first data elements, and generating a sub-phoneme element for each cluster of first data elements.
  - 12. A method as claimed in claim 11, wherein said step of clustering is effected using a k-means algorithm.
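
Claim 2 describes a hidden Markov model whose states are sub-phoneme elements and whose outputs are speech elements. One standard way to recover a state sequence from such a model is Viterbi decoding, sketched below. This is an illustration under stated assumptions, not the patented method: the model is first-order (the n-gram model of claim 3 with n of at least three would need a longer state history), the phoneme sequence is assumed to be already expanded to one label per sub-phoneme slot, and the state names `F1` to `F3` and all probabilities are invented for the example.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence (log domain)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (log(start_p[s]) + log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, back = max((prev[r][0] + log(trans_p[r][s]), r) for r in states)
            column[s] = (score + log(emit_p[s][obs]), back)
        best.append(column)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: three sub-phoneme states, two phoneme outputs.
states = ["F1", "F2", "F3"]
start_p = {"F1": 0.8, "F2": 0.1, "F3": 0.1}
trans_p = {
    "F1": {"F1": 0.1, "F2": 0.8, "F3": 0.1},
    "F2": {"F1": 0.1, "F2": 0.2, "F3": 0.7},
    "F3": {"F1": 0.3, "F2": 0.1, "F3": 0.6},
}
emit_p = {
    "F1": {"AH": 0.9, "T": 0.1},
    "F2": {"AH": 0.8, "T": 0.2},
    "F3": {"AH": 0.1, "T": 0.9},
}
path = viterbi(["AH", "AH", "T"], states, start_p, trans_p, emit_p)
```

Working in the log domain avoids numerical underflow on long sequences, which matters when thousands of short elements are chained per utterance.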

13. A system for synthesizing speech from text, the system comprising:
- means for generating a sequence of sub-phoneme elements from text, each sub-phoneme element representing a corresponding acoustic waveform; and
  
  means for concatenating said sub-phoneme elements to produce an output waveform, wherein said means for generating comprises:
  
  means for generating from said text corresponding speech elements; and
  
  means for mapping each speech element to one or more of a plurality of sub-phoneme elements to produce said sequence.
- - 14. A system as claimed in claim 13, wherein the mapping is performed using a hidden Markov model in which the states represent the sub-phoneme elements and the outputs are the speech elements.
  - 15. A system as claimed in claim 14, wherein the hidden Markov model is an n-gram model, where n is at least three.
  - 16. A system as claimed in claim 13, wherein said speech elements are phonemes.
  - 17. A system as claimed in claim 13, wherein each sub-phoneme element is a frequency domain representation of a corresponding acoustic waveform, and the means for concatenating comprises means for converting each frequency domain representation into a time domain representation and concatenating said time domain representations to produce said output waveform.
  - 18. A system as claimed in claim 13, wherein the means for concatenating comprises means for applying a window to each sub-phoneme element and for concatenating together the result thereof in order to mitigate the effect of discontinuities between said sub-phoneme elements.
  - 19. A system as claimed in claim 13, wherein said sub-phoneme elements have durations of between 2 milliseconds and 20 milliseconds.
  - 20. A system as claimed in claim 13, wherein the duration of a sub-phoneme element is no less than the inverse of the instantaneous fundamental frequency.
  - 21. A system as claimed in claim 13, wherein said sequence of sub-phoneme elements is generated from an alphabet comprising between 300 and 10,000 sub-phoneme elements.
  - 22. A system as claimed in claim 13, further comprising means for generating an alphabet of sub-phoneme elements.
  - 23. A system as claimed in claim 22, wherein the means for generating said alphabet comprises means for producing a set of first data elements by sampling an input acoustic waveform, means for clustering said first data elements, and means for generating a sub-phoneme element for each cluster of first data elements.
  - 24. A system as claimed in claim 23, wherein said means for clustering is effected using a k-means algorithm.
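
The windowing of claims 6 and 18 and the duration bound of claims 8 and 20 can be sketched as follows. The claims do not say which window or how much overlap to use, so the periodic Hann window and the 50% overlap-add below are assumptions chosen for illustration, as are the function names. The sketch also assumes all wavelets have the same even length.

```python
import math

def hann(n):
    """Periodic Hann window of length n; copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def overlap_add(wavelets):
    """Window equal-length wavelets and overlap-add them at 50% overlap."""
    n = len(wavelets[0])
    hop = n // 2
    window = hann(n)
    out = [0.0] * (hop * (len(wavelets) - 1) + n)
    for k, wavelet in enumerate(wavelets):
        for i, sample in enumerate(wavelet):
            # Tapered edges of neighbouring wavelets cross-fade here.
            out[k * hop + i] += window[i] * sample
    return out

def min_duration_ms(f0_hz):
    """Shortest wavelet duration allowed by the pitch bound: one period, in ms."""
    return 1000.0 / f0_hz

# Three constant wavelets: the overlapped interior should stay at 1.0.
out = overlap_add([[1.0] * 8 for _ in range(3)])
period_ms = min_duration_ms(100.0)  # fundamental frequency of 100 Hz
```

With this window choice the overlapped regions sum back to the original amplitude, so only the very first and last half-windows are attenuated. At a fundamental frequency of 100 Hz the pitch period is 10 ms, which falls inside the 2 to 20 millisecond range recited in claims 7 and 19.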

25. A method for synthesizing speech from text, comprising the steps of:
- converting the text into a sequence of phonemes representative of the text;
  
  generating a sequence of fenemes representative of the sequence of phonemes;
  
  transforming the sequence of fenemes into a sequence of wavelets; and
  
  concatenating the sequence of wavelets to produce an acoustic waveform representative of the text.
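
The four steps of claim 25 can be lined up as a toy pipeline. This is a minimal sketch under loud assumptions: the lexicon, the phoneme-to-feneme table, and the codebook are hypothetical hand-written stand-ins, and the fixed lookup table replaces the context-dependent prediction that the abstract attributes to hidden Markov modelling. None of the names or sample values come from the patent.

```python
# Hypothetical toy tables; a real system would derive these from training data.
LEXICON = {"hi": ["HH", "AY"]}                        # word -> phonemes
PHONEME_TO_FENEMES = {                                # phoneme -> feneme labels
    "HH": ["f1", "f2"],
    "AY": ["f3", "f4", "f4"],
}
CODEBOOK = {                                          # feneme -> wavelet samples
    "f1": [0.0, 0.1],
    "f2": [0.2, 0.1],
    "f3": [0.5, 0.7],
    "f4": [0.6, 0.3],
}

def text_to_phonemes(text):
    """Step 1: convert text into a phoneme sequence (dictionary lookup)."""
    return [p for word in text.lower().split() for p in LEXICON[word]]

def phonemes_to_fenemes(phonemes):
    """Step 2: generate a feneme sequence (fixed table, not an HMM)."""
    return [f for p in phonemes for f in PHONEME_TO_FENEMES[p]]

def fenemes_to_wavelets(fenemes):
    """Step 3: transform each feneme into its codebook wavelet."""
    return [CODEBOOK[f] for f in fenemes]

def concatenate(wavelets):
    """Step 4: join the wavelets end to end into one waveform."""
    return [sample for wavelet in wavelets for sample in wavelet]

def synthesize(text):
    """Run the four steps in order and return the output samples."""
    phonemes = text_to_phonemes(text)
    fenemes = phonemes_to_fenemes(phonemes)
    return concatenate(fenemes_to_wavelets(fenemes))

waveform = synthesize("hi")
```

Keeping each step as its own function mirrors the claim structure, so any stage (for example the feneme prediction) could be swapped for a trained model without touching the others.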


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Sharman, Richard Anthony
Primary Examiner(s): Isen, Forester W.
Assistant Examiner(s): Edouard, Patrick Nestor
Application Number: US08/489,179
Time in Patent Office: 1,593 Days
Field of Search: 395/2.67, 395/2.64, 395/2.65, 395/2.63, 395/2.54, 395/2.69
US Class Current: 704/260
CPC Class Codes:
- G10L 13/04   Details of speech synthesis...
- G10L 13/07   Concatenation rules
- G10L 13/08   Text analysis or generation...
- G10L 15/142   Hidden Markov Models [HMMs]
- G10L 25/27   characterised by the analys...
