Text-to-speech system and a method and apparatus for training the same based upon intonational feature annotations of input text

US 6,003,005 A
Filed: 11/25/1997
Issued: 12/14/1999
Est. Priority Date: 10/15/1993
Status: Expired due to Term

Abstract

A method is described for training a text-to-speech (TTS) or other system to assign intonational features, such as intonational phrase boundaries, to input text in a way that overcomes the shortcomings of the known methods. The method of training involves taking a set of predetermined text (not speech or a signal representative of speech) and having a human annotate it with intonational feature annotations. This results in annotated text. Next, the structure of the set of predetermined text is analyzed to generate structural information. This information is used, along with the intonational feature annotations, to generate a statistical representation. The statistical representation may then be stored and repeatedly used to generate synthesized speech from new sets of input text without training the TTS system further. The resulting trained system and use thereof are also part of the invention.
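
The train-once, reuse-many-times flow in the abstract can be sketched as follows. This is a minimal illustration and not the patented implementation: the `|` boundary marker, the two structural features, the `FUNCTION_WORDS` list, the 0.5 threshold, and the use of JSON as the storage format are all assumptions made for the example.

```python
import json
from collections import defaultdict

# Assumed toy list of function words; not taken from the patent.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "and", "but", "or"}
PUNCT = ",;:.?!"


def parse_annotated(annotated):
    """Split hand-annotated text into words and boundary labels.

    A '|' token marks an intonational phrase boundary after the preceding
    word (a hypothetical annotation convention).
    """
    words, labels = [], []
    for token in annotated.split():
        if token == "|":
            if labels:
                labels[-1] = True
        else:
            words.append(token)
            labels.append(False)
    return words, labels


def site_features(words, k):
    """Structural information for the site between words[k] and words[k + 1]."""
    has_punct = words[k][-1] in PUNCT
    right_is_function = words[k + 1].lower().strip(PUNCT) in FUNCTION_WORDS
    return f"punct={int(has_punct)},rfunc={int(right_is_function)}"


def train(annotated_texts):
    """Build a table: feature key -> observed fraction of annotated boundaries."""
    counts = defaultdict(lambda: [0, 0])  # key -> [no boundary, boundary]
    for annotated in annotated_texts:
        words, labels = parse_annotated(annotated)
        for k in range(len(words) - 1):
            counts[site_features(words, k)][int(labels[k])] += 1
    return {key: yes / (no + yes) for key, (no, yes) in counts.items()}


def predict(model, text, threshold=0.5):
    """Insert '|' at every site whose stored boundary fraction meets the threshold."""
    words = text.split()
    out = []
    for k, word in enumerate(words):
        out.append(word)
        if k < len(words) - 1 and model.get(site_features(words, k), 0.0) >= threshold:
            out.append("|")
    return " ".join(out)


corpus = [
    "When the rain stopped, | we walked to the station.",
    "She opened the door | and looked outside.",
    "After lunch, | the committee met again.",
]
stored = json.dumps(train(corpus))   # stand-in for storing the representation
reloaded = json.loads(stored)        # later use, with no further training
result = predict(reloaded, "Before dinner, they read the report and left.")
print(result)
```

With this toy corpus the reloaded table places a single boundary after "dinner," in the new sentence. A frequency table is the simplest possible statistical representation; the claims also name decision trees, hidden Markov models, and neural networks.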

30 Claims

1. A machine implemented method of training a system for converting between text and speech, said method comprising the steps of:
- (a) annotating a set of predetermined text with intonational feature annotations to generate annotated text, said set of predetermined text and said annotated text having a physically tangible readable form;
  
  (b) generating a set of structural information regarding said set of predetermined text;
  
  (c) generating a statistical representation of intonational feature information, the statistical representation being a function of said set of structural information and said intonational feature annotations; and
  
  (d) storing said statistical representation in said system for use by said system in converting between speech and text.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The method of claim 1 wherein the step of annotating comprises prosodically annotating the set of predetermined text with expected intonational features.
  - 3. The method of claim 1 wherein the system is a text-to-speech system.
  - 4. The method of claim 3 wherein the intonational feature annotations comprise intonational phrase boundaries.
  - 5. The method of claim 1 wherein generating a statistical representation comprises generating a set of decision nodes.
  - 6. The method of claim 5 wherein generating the set of decision nodes comprises generating a hidden Markov model.
  - 7. The method of claim 5 wherein generating the set of decision nodes comprises generating a neural network.
  - 8. The method of claim 5 wherein generating the set of decision nodes comprises performing classification and regression tree techniques.
  - 9. The method of claim 1 wherein the steps (a) to (c) are performed on a computer.
  - 10. The method of claim 1 wherein the step of generating a statistical representation of intonational feature information is performed on a phrasing module.
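
Claims 5 and 8 describe the statistical representation as a set of decision nodes built with classification and regression tree (CART) techniques. A hedged sketch of the core CART step, choosing the yes/no query that best separates annotated boundary sites from non-boundary sites by weighted Gini impurity, might look like this. The query names and the six training rows are invented for illustration and are not from the patent.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 boundary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return the query with the lowest weighted Gini impurity, and that score.

    rows   -- list of dicts mapping a query name to a boolean answer
    labels -- 1 if a boundary was annotated at that site, else 0
    Returns (None, parent impurity) if no query improves on the parent node.
    """
    best_query, best_score = None, gini(labels)
    for query in sorted(rows[0]):
        yes = [y for row, y in zip(rows, labels) if row[query]]
        no = [y for row, y in zip(rows, labels) if not row[query]]
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if score < best_score:
            best_query, best_score = query, score
    return best_query, best_score


# Hypothetical training sites: structural answers plus the human annotation.
rows = [
    {"punctuation_at_site": True, "left_word_is_noun": True},
    {"punctuation_at_site": True, "left_word_is_noun": False},
    {"punctuation_at_site": False, "left_word_is_noun": True},
    {"punctuation_at_site": False, "left_word_is_noun": False},
    {"punctuation_at_site": False, "left_word_is_noun": True},
    {"punctuation_at_site": False, "left_word_is_noun": False},
]
labels = [1, 1, 0, 0, 1, 0]

query, score = best_split(rows, labels)
print(query, score)
```

On this data the punctuation query wins with a weighted impurity of 0.25, against 0.5 for the unsplit node. A full tree would apply the same selection recursively to each resulting subset.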

11. An apparatus for converting text to speech, said apparatus comprising:
- (a) an input for receiving a set of input text having a physically tangible readable form; and
  
  (b) a phrasing module adapted to receive the set of input text from said input, said phrasing module including a stored statistical representation, the stored statistical representation being a function of a set of predetermined text and intonational feature annotations therefor, said phrasing module applying the set of input text to the stored statistical representation to generate an output representative of the set of input text.
- View Dependent Claims (12, 13, 14, 15, 16)
- - 12. The apparatus of claim 11 further comprising:
    - (a) a post processor for processing the output of said phrasing module to generate a synthesized speech signal; and
      
      (b) means for applying the synthesized speech signal to an acoustic output device.
  - 13. The apparatus of claim 11 wherein the stored statistical representation comprises a decision tree.
  - 14. The apparatus of claim 11 wherein the stored statistical representation comprises a hidden Markov model.
  - 15. The apparatus of claim 11 wherein the stored statistical representation comprises a neural network.
  - 16. The apparatus of claim 11 wherein said phrasing module comprises a generator, said generator answering a set of stored queries regarding the set of input text, the set of input text comprising a current sentence, the current sentence comprising a beginning, an end, and a plurality of words, each word in the plurality of words being a part of at least one set of words, w_i and w_j, wherein w_i and w_j each comprise at least one syllable and each have a part of speech associated therewith and each have a potential noun phrase associated therewith, the potential noun phrase having a beginning and an end, and further wherein w_i and w_j represent real words to the left and right, respectively, of a potential intonational phrase boundary site, <w_i and w_j>, and wherein w_i-1 and w_j+1 represent real words to the left and right, respectively, of w_i and w_j, the set of stored queries comprising at least one query selected from the group consisting of:
    - (a) is w_i intonationally prominent and if not, is it further reduced?;
      
      (b) is w_j intonationally prominent and if not, is it further reduced?;
      
      (c) what is the part of speech of w_i ?;
      
      (d) what is the part of speech of w_i-1 ?;
      
      (e) what is the part of speech of w_j ?;
      
      (f) what is the part of speech of w_j+1 ?;
      
      (g) how many words are in the current sentence?;
      
      (h) what is the distance, in real words, from w_j to the beginning of the sentence?;
      
      (i) what is the distance, in real words, from w_j to the end of the sentence?;
      
      (j) what is the location of the potential intonational boundary site with respect to the nearest noun phrase?;
      
      (k) if the potential intonational boundary site is within a noun phrase, how far is it from the beginning of the noun phrase?;
      
      (l) what is the size, in real words, of the current noun phrase?;
      
      (m) how far into the noun phrase is w_i ?;
      
      (n) how many syllables precede the potential intonational boundary site in the current sentence?;
      
      (o) how many lexically stressed syllables precede the potential intonational boundary site in the current sentence?;
      
      (p) what is the total number of strong syllables in the current sentence?;
      
      (q) what is the stress level of the syllable immediately preceding the potential intonational boundary site?;
      
      (r) what is the result when one divides the distance from w_j to the last intonational boundary assigned by the total length of the last intonational phrase?;
      
      (s) is there punctuation at the potential intonational boundary site?; and
      
      (t) how many primary or secondary stressed syllables exist between the potential intonational boundary site and the beginning of the current sentence.
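
Several of the stored queries in claim 16 are simple counts over the current sentence. The sketch below answers queries (g), (h), (i) and (s) for one potential boundary site. It rests on assumptions the claim does not spell out: a "real word" is taken to be any token containing a letter or digit, punctuation arrives as separate tokens, and distances are counted as the number of real words before or after w_j. The function and key names are hypothetical.

```python
def is_real_word(token):
    """Assumption: a 'real word' is any token containing a letter or digit."""
    return any(ch.isalnum() for ch in token)


def site_queries(tokens, i):
    """Answer queries (g), (h), (i) and (s) for the site <w_i, w_j>.

    tokens -- sentence tokens, with punctuation as separate tokens
    i      -- index of w_i among the real words; w_j is real word i + 1
    """
    word_positions = [k for k, tok in enumerate(tokens) if is_real_word(tok)]
    if not 0 <= i < len(word_positions) - 1:
        raise ValueError("site must lie between two real words")
    j = i + 1
    # Any tokens strictly between w_i and w_j are non-words, i.e. punctuation.
    between = tokens[word_positions[i] + 1 : word_positions[j]]
    return {
        "g_words_in_sentence": len(word_positions),
        "h_real_words_before_wj": j,
        "i_real_words_after_wj": len(word_positions) - 1 - j,
        "s_punctuation_at_site": len(between) > 0,
    }


tokens = ["When", "the", "rain", "stopped", ",", "we", "walked", "home", "."]
answers = site_queries(tokens, 3)  # site between "stopped" and "we"
print(answers)
```

Queries about prominence, part of speech, noun phrases, and syllable stress would need a tagger, a parser, and a pronunciation lexicon, so they are left out of this sketch.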

17. A machine implemented method of converting text to speech, said method comprising:
- (a) accessing a stored statistical representation from a phrasing module, the stored statistical representation being a function of a set of predetermined text and intonational feature annotations therefor; and
  
  (b) applying a set of input text having a physically tangible readable form to the stored statistical representation to generate an output representative of the set of input text.
- View Dependent Claims (18, 19, 20, 21, 22, 23)
- - 18. The method of claim 17 further comprising:
    - (a) post-processing the output to generate a synthesized speech signal; and
      
      (b) applying the synthesized speech signal to an acoustic output device.
  - 19. The method of claim 17 wherein the stored statistical representation comprises a decision tree.
  - 20. The method of claim 17 wherein the stored statistical representation comprises a hidden Markov model.
  - 21. The method of claim 17 wherein the stored statistical representation comprises a neural network.
  - 22. The method of claim 17 wherein the step of applying comprises answering a set of stored queries regarding the set of input text, the set of input text comprising a current sentence, the current sentence comprising a beginning, an end, and a plurality of words, each word in the plurality of words being a part of at least one set of words, w_i and w_j, wherein w_i and w_j each comprise at least one syllable and each have a part of speech associated therewith and each have a potential noun phrase associated therewith, the potential noun phrase having a beginning and an end, and further wherein w_i and w_j represent real words to the left and right, respectively, of a potential intonational phrase boundary site, <w_i and w_j>, and wherein w_i-1 and w_j+1 represent real words to the left and right, respectively, of w_i and w_j, the set of stored queries comprising at least one query selected from the group consisting of:
    - (a) is w_i intonationally prominent and if not, is it further reduced?;
      
      (b) is w_j intonationally prominent and if not, is it further reduced?;
      
      (c) what is the part of speech of w_i ?;
      
      (d) what is the part of speech of w_i-1 ?;
      
      (e) what is the part of speech of w_j ?;
      
      (f) what is the part of speech of w_j+1 ?;
      
      (g) how many words are in the current sentence?;
      
      (h) what is the distance, in real words, from w_j to the beginning of the sentence?;
      
      (i) what is the distance, in real words, from w_j to the end of the sentence?;
      
      (j) what is the location of the potential intonational boundary site with respect to the nearest noun phrase?;
      
      (k) if the potential intonational boundary site is within a noun phrase, how far is it from the beginning of the noun phrase?;
      
      (l) what is the size, in real words, of the current noun phrase?;
      
      (m) how far into the noun phrase is w_i ?;
      
      (n) how many syllables precede the potential intonational boundary site in the current sentence?;
      
      (o) how many lexically stressed syllables precede the potential intonational boundary site in the current sentence?;
      
      (p) what is the total number of strong syllables in the current sentence?;
      
      (q) what is the stress level of the syllable immediately preceding the potential intonational boundary site?;
      
      (r) what is the result when one divides the distance from w_j to the last intonational boundary assigned by the total length of the last intonational phrase?;
      
      (s) is there punctuation at the potential intonational boundary site?; and
      
      (t) how many primary or secondary stressed syllables exist between the potential intonational boundary site and the beginning of the current sentence.
  - 23. The method of claim 17 wherein said output is stored on a computer.
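
Claims 17 and 19 describe applying input text to a stored statistical representation such as a decision tree. A minimal sketch of that step is shown here, assuming the tree is stored as nested dictionaries whose leaves say whether to place a boundary. The tree contents, the three query names, the `CONJUNCTIONS` list, and the `|` output marker are invented for illustration; a trained tree would come from annotated text.

```python
def classify(tree, answers):
    """Walk a stored decision tree until a leaf is reached.

    Internal nodes are dicts with 'query', 'yes' and 'no' keys; leaves are
    booleans (True = place an intonational phrase boundary at the site).
    """
    node = tree
    while isinstance(node, dict):
        node = node["yes"] if answers[node["query"]] else node["no"]
    return node


# Hypothetical stored tree, written by hand for this example.
STORED_TREE = {
    "query": "punctuation_at_site",
    "yes": True,
    "no": {
        "query": "wj_is_conjunction",
        "yes": {"query": "long_sentence", "yes": True, "no": False},
        "no": False,
    },
}

CONJUNCTIONS = {"and", "but", "or"}  # assumed toy list


def phrase(text):
    """Return the text with '|' inserted at each predicted boundary site."""
    words = text.split()
    out = []
    for k, word in enumerate(words):
        out.append(word)
        if k == len(words) - 1:
            break  # no site after the final word
        answers = {
            "punctuation_at_site": word[-1] in ",;:",
            "wj_is_conjunction": words[k + 1].lower() in CONJUNCTIONS,
            "long_sentence": len(words) > 8,
        }
        if classify(STORED_TREE, answers):
            out.append("|")
    return " ".join(out)


long_result = phrase("After the long meeting ended, the committee members left and went home")
short_result = phrase("Cats and dogs")
print(long_result)
print(short_result)
```

The marked-up string stands in for the "output representative of the set of input text"; in the claimed method a post-processing stage would turn that output into a synthesized speech signal.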

24. A machine implemented method of training a text-to-speech system, said method comprising the steps of:
- generating a statistical representation, said statistical representation being a function of a set of structural information of a set of text and a set of intonational feature annotations of an annotated version of said set of text; and
  
  storing said statistical representation on a text-to-speech system for use in generating an intonational phrased output for future text input into the system.

25. An apparatus for training a text-to-speech system, said apparatus comprising:
- an input for receiving a set of text and an annotated version of the set of text; and
  
  a phrasing module adapted to receive the set of text and the annotated version of the set of text from said input, said phrasing module generating a statistical representation, said statistical representation being a function of a set of structural information of the set of text and a set of intonational feature annotations of the annotated version of the set of text, said phrasing module storing said statistical representation for use in generating an intonational phrased output for future text input into the system.

26. An apparatus comprising:
- a processor for generating structural information based on a set of text; and
  
  a phrasing module for generating a statistical representation based on said structural information and on a set of intonational feature annotations of an annotated version of said set of text, said phrasing module being operable to apply an input text to said statistical representation to generate a synthesized speech signal.

27. A method comprising the steps of:
- generating structural information based on a set of text;
  
  generating a statistical representation based on said structural information and on a set of intonational feature annotations of an annotated version of said set of text; and
  
  applying said statistical representation to a set of input text to generate a synthesized speech signal.

28. A machine implemented method of converting text to speech, said method comprising:
- (a) accessing a stored statistical representation from a phrasing module, the stored statistical representation being a function of a set of predetermined text and intonational feature annotations therefor;
  
  (b) applying a set of input text having a physically tangible readable form to the stored statistical representation to generate an output representative of the set of input text; and
  
  (c) post-processing the output to generate a synthesized speech signal.

29. An apparatus for performing text-to-speech conversion on a set of input text, said apparatus comprising:
- a first processor, said first processor preprocessing a set of input text having a physically tangible readable form;
  
  a phrasing module connected to said first processor, said phrasing module having said pre-processed input text as an input, said phrasing module including a stored statistical representation which is a function of a set of predetermined text and intonational feature annotations therefor, said phrasing module applying the set of pre-processed input text to the stored statistical representation to generate an output representative of the set of input text; and
  
  a second processor connected to said phrasing module, said second processor post-processing the output to generate a synthesized speech signal.

30. An apparatus for converting text to speech, said apparatus comprising:
- an input for receiving a pre-processed set of input text; and
  
  a phrasing module receiving said set of preprocessed input text from said input, said phrasing module including a stored statistical representation which is a function of a set of predetermined text and intonational feature annotations therefor, said phrasing module applying said set of pre-processed input text to the stored statistical representation to generate an output representative of the set of input text.

Current Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Hirschberg, Julia
Primary Examiner(s): Dorvil, Richemond
Application Number: US08/978,359
Time in Patent Office: 749 Days
Field of Search: 704/200, 704/260, 704/259, 704/256, 704/258, 704/270, 704/272
US Class Current: 704/260
CPC Class Codes: G10L 13/04 Details of speech synthesis...
