Text-to-speech system with automatically trained phrasing rules

US 6,173,262 B1
Filed: 11/02/1995
Issued: 01/09/2001
Est. Priority Date: 10/15/1993
Status: Expired due to Term


4 Assignments

0 Petitions

Abstract

A method of training a TTS or other system to assign intonational features, such as intonational phrase boundaries, is described. The method of training involves taking a set of predetermined text (not speech or a signal representative of speech) and having a human annotate it with intonational feature annotations. This results in annotated text. Next, the structure of the set of predetermined text is analyzed to generate information. This information is used, along with the intonational feature annotations, to generate a statistical representation. The statistical representation may then be stored and repeatedly used to generate synthesized speech from new sets of input text without training the TTS system further.
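The abstract describes a train-once, reuse-many pattern, which can be illustrated with a deliberately small statistical representation. This sketch is not the patented method. It assumes that each juncture between adjacent words has already been reduced to three hypothetical features (`pos_i`, `pos_j`, `punct`). It then estimates the relative frequency of a human-marked boundary for each feature combination, serializes that table as JSON, and reloads it to label new junctures without further training.

```python
import json
from collections import defaultdict

# Hypothetical annotated training junctures: (structural features, human boundary label).
TRAINING = [
    ({"pos_i": "NN", "pos_j": "VB", "punct": False}, False),
    ({"pos_i": "NN", "pos_j": "VB", "punct": False}, True),
    ({"pos_i": "NN", "pos_j": "IN", "punct": True}, True),
    ({"pos_i": "DT", "pos_j": "NN", "punct": False}, False),
]


def feature_key(feats):
    """Collapse a feature dict into a string key usable in JSON."""
    return f"{feats['pos_i']}|{feats['pos_j']}|{int(feats['punct'])}"


def train(examples):
    """Estimate P(boundary | features) by relative frequency."""
    counts = defaultdict(lambda: [0, 0])  # key -> [boundaries seen, junctures seen]
    for feats, is_boundary in examples:
        entry = counts[feature_key(feats)]
        entry[0] += int(is_boundary)
        entry[1] += 1
    return {key: b / n for key, (b, n) in counts.items()}


def store(model):
    """Serialize the statistical representation so it can be reused later."""
    return json.dumps(model, sort_keys=True)


def load(blob):
    """Restore a previously stored representation."""
    return json.loads(blob)


def predict(model, feats, threshold=0.5):
    """Mark a boundary when its estimated probability exceeds the threshold.

    Unseen feature combinations default to probability 0.0 (no boundary).
    """
    return model.get(feature_key(feats), 0.0) > threshold


model = load(store(train(TRAINING)))
print(predict(model, {"pos_i": "NN", "pos_j": "IN", "punct": True}))   # True
print(predict(model, {"pos_i": "NN", "pos_j": "VB", "punct": False}))  # False (0.5 is not > 0.5)
```

Training runs once. After that, only `load` and `predict` are needed for each new input text.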

19 Claims

1. A method for generating a statistical representation of intonational feature information for a text-to-speech system, the method comprising the steps of:

(a) annotating a set of predetermined text with intonational feature annotations to generate annotated text, the set of predetermined text being unrelated to speech, said annotating being performed by a human annotator;

(b) with a computer means, generating a set of structural information regarding the predetermined text;

(c) with the computer means, generating said statistical representations of intonational feature information based on the set of structural information and the intonational feature annotations; and

(d) storing said statistical representation for use in training a text-to-speech system.
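Steps (a) and (b) of claim 1 can be sketched as two small functions. The `||` boundary mark, the whitespace tokenization, and the feature names are assumptions made for illustration. The claim does not prescribe an annotation format or a particular set of structural features.

```python
import re

BOUNDARY = "||"  # hypothetical mark an annotator inserts at a phrase boundary
PUNCT_AT_END = re.compile(r"[,.;:!?]$")


def parse_annotated(sentence):
    """Turn annotated text into words plus one boundary label per juncture <w_k, w_k+1>."""
    words, labels, pending = [], [], False
    for token in sentence.split():
        if token == BOUNDARY:
            pending = True  # applies to the juncture before the next word
            continue
        if words:
            labels.append(pending)
        pending = False
        words.append(token)
    return words, labels


def structural_features(words):
    """Simple structural information for each juncture."""
    n = len(words)
    feats = []
    for k in range(n - 1):
        feats.append({
            "sent_len": n,                                  # words in the sentence
            "j_from_start": k + 1,                          # words before w_j
            "j_from_end": n - k - 2,                        # words after w_j
            "punct": bool(PUNCT_AT_END.search(words[k])),   # punctuation after w_i
        })
    return feats


words, labels = parse_annotated("When the rain stopped, || we went home.")
pairs = list(zip(structural_features(words), labels))
print(pairs[3])  # ({'sent_len': 7, 'j_from_start': 4, 'j_from_end': 2, 'punct': True}, True)
```

Each `(features, label)` pair is one training example for step (c).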
2. The method of claim 1 wherein the step of annotating comprises prosodically annotating the set of predetermined text with expected intonational features.

3. The method of claim 1 wherein the intonational feature annotations comprise intonational phrase boundaries.

4. The method of claim 1 wherein generating a statistical representation comprises generating a set of decision nodes.

5. The method of claim 4 wherein generating the set of decision nodes comprises generating a hidden Markov model.

6. The method of claim 4 wherein generating the set of decision nodes comprises generating a neural network.

7. The method of claim 4 wherein generating the set of decision nodes comprises performing classification and regression tree techniques.
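Claims 4 and 7 refer to decision nodes produced by classification and regression tree (CART) techniques. The following is a minimal sketch of CART-style classification tree growing. It assumes categorical features, binary `feature == value` splits chosen by weighted Gini impurity, and a fixed depth limit. Full CART also prunes the grown tree (for example, by cost-complexity pruning), which is omitted here. The feature names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted impurity, feature, value) of the best equality split, or None."""
    best = None
    for feat in rows[0]:
        for value in sorted({row[feat] for row in rows}, key=repr):
            left = [y for row, y in zip(rows, labels) if row[feat] == value]
            right = [y for row, y in zip(rows, labels) if row[feat] != value]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feat, value)
    return best


def grow(rows, labels, depth=0, max_depth=3):
    """Recursively grow a classification tree stored as nested dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None or split[0] >= gini(labels):
        return {"leaf": majority}  # no split reduces impurity
    _, feat, value = split
    yes = [i for i, row in enumerate(rows) if row[feat] == value]
    no = [i for i, row in enumerate(rows) if row[feat] != value]
    return {
        "feature": feat,
        "value": value,
        "yes": grow([rows[i] for i in yes], [labels[i] for i in yes], depth + 1, max_depth),
        "no": grow([rows[i] for i in no], [labels[i] for i in no], depth + 1, max_depth),
    }


def classify(tree, row):
    """Follow decision nodes from the root to a leaf."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["feature"]] == tree["value"] else tree["no"]
    return tree["leaf"]


ROWS = [
    {"punct": True, "pos_j": "CC"},
    {"punct": True, "pos_j": "NN"},
    {"punct": False, "pos_j": "CC"},
    {"punct": False, "pos_j": "NN"},
    {"punct": False, "pos_j": "VB"},
]
LABELS = [True, True, True, False, False]

tree = grow(ROWS, LABELS)
print(classify(tree, {"punct": False, "pos_j": "CC"}))  # True
```

Each internal node of the resulting tree asks one yes/no question about a juncture, which is one way to read the "decision nodes" of claim 4.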

8. An apparatus for performing text-to-speech conversion on a set of input text, the apparatus comprising:

(a) a stored statistical representation of intonational feature information, the stored statistical representation based on a set of predetermined text and intonational feature annotations therefor, the set of predetermined text being unrelated to speech, the intonational feature annotations having been provided by a human annotator; and

(b) a processor and a phrasing module for applying the set of input text to the stored statistical representation to generate an output representative of the set of input text, the output comprising intonational feature information associated with the set of input text.
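For claim 8, a phrasing module that applies input text to a stored representation could look roughly like the following. The stored tree, its feature names, and the assumption that input words arrive already tagged with parts of speech are all hypothetical. The output marks each predicted intonational phrase boundary with `||`.

```python
import json

# Hypothetical stored representation: a small decision tree serialized as JSON.
STORED_TREE = json.dumps({
    "feature": "punct", "value": True,
    "yes": {"leaf": True},
    "no": {"feature": "pos_j", "value": "CC",
           "yes": {"leaf": True}, "no": {"leaf": False}},
})


def classify(tree, row):
    """Walk decision nodes until a leaf (True = phrase boundary)."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["feature"]] == tree["value"] else tree["no"]
    return tree["leaf"]


def phrase(tagged, tree):
    """Apply tagged input text to the stored tree; mark boundaries with '||'."""
    if not tagged:
        return ""
    out = [tagged[0][0]]
    for (w_i, _), (w_j, pos_j) in zip(tagged, tagged[1:]):
        row = {"punct": w_i[-1] in ",;:", "pos_j": pos_j}
        if classify(tree, row):
            out.append("||")
        out.append(w_j)
    return " ".join(out)


tree = json.loads(STORED_TREE)
sentence = [("I", "PRP"), ("came,", "VBD"), ("I", "PRP"), ("saw", "VBD"),
            ("and", "CC"), ("I", "PRP"), ("left", "VBD")]
print(phrase(sentence, tree))  # I came, || I saw || and I left
```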
9. The apparatus of claim 8 further comprising:

10. The apparatus of claim 8 wherein the stored statistical representation comprises a decision tree.

11. The apparatus of claim 8 wherein the stored statistical representation comprises a hidden Markov model.

12. The apparatus of claim 8 wherein the stored statistical representation comprises a neural network.
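Claims 5 and 11 name a hidden Markov model as one possible representation. A common way to apply an HMM to this task is Viterbi decoding over the sequence of junctures in a sentence, with one hidden state for a boundary (`B`) and one for no boundary (`N`). This sketch assumes that formulation and uses invented probabilities with a single observed symbol per juncture. The claims do not specify the states, observations, or parameters.

```python
import math

STATES = ("N", "B")  # N = no boundary at this juncture, B = phrase boundary
# Invented parameters, for illustration only.
START = {"N": 0.8, "B": 0.2}
TRANS = {"N": {"N": 0.7, "B": 0.3}, "B": {"N": 0.9, "B": 0.1}}
EMIT = {"N": {"punct": 0.05, "none": 0.95}, "B": {"punct": 0.6, "none": 0.4}}


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Most likely hidden label sequence for one sentence's junctures."""
    if not obs:
        return []
    # Work in log space so long sentences do not underflow.
    score = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + math.log(trans[p][s]))
            score[t][s] = (score[t - 1][prev] + math.log(trans[prev][s])
                           + math.log(emit[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


print(viterbi(["none", "punct", "none", "none"]))  # ['N', 'B', 'N', 'N']
```

Unlike the per-juncture tree, the transition table lets one boundary decision influence the next. Here, for example, a boundary makes an immediately following boundary unlikely.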
13. The apparatus of claim 8 wherein the phrasing module comprises means for answering a set of stored queries regarding the set of input text, the set of input text comprising a current sentence, the current sentence comprising a beginning, an end, and a plurality of words, each word in the plurality of words being a part of at least one set of words, w_i and w_j, wherein w_i and w_j each comprise at least one syllable and each have a part of speech associated therewith and each have a potential noun phrase associated therewith, the potential noun phrase having a beginning and an end, and further wherein w_i and w_j represent real words to the left and right, respectively, of a potential intonational phrase boundary site, <w_i and w_j>, the set of stored queries comprising at least one query selected from a group consisting of:

(a) is w_i intonationally prominent and if not, is w_i further reduced?;

(b) is w_j intonationally prominent and if not, is w_j further reduced?;

(c) what is w_i's part of speech?;

(d) what is w_{i-1}'s part of speech?;

(e) what is w_j's part of speech?;

(f) what is w_{j+1}'s part of speech?;

(g) how many words are in the current sentence?;

(h) how far, in real words, is w_j from the beginning of the sentence?;

(i) how far, in real words, is w_j from the end of the sentence?;

(j) where is the potential intonational phrase boundary site with respect to the potential noun phrase?;

(k) if <w_i and w_j> is within the potential noun phrase, how far is <w_i and w_j> from the beginning of the potential noun phrase?;

(l) how many words are in the potential noun phrase?;

(m) how far into the potential noun phrase is w_i?;

(n) how many syllables precede the potential intonational phrase boundary site in the current sentence?;

(o) how many lexically stressed syllables precede the potential intonational phrase boundary site in the current sentence?;

(p) how many strong syllables are there in the current sentence?;

(q) what is a stress level of a syllable in w_i immediately preceding the potential intonational boundary site?;

(r) what is a result of dividing a distance from w_j to a last intonational boundary assigned by a total length of the last intonational phrase?;

(s) is there punctuation at the potential intonational phrase boundary site?; and

(t) how many primary and secondary stressed syllables exist between the potential intonational phrase boundary site and the beginning of the current sentence.
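Several of the claim 13 queries, (c) through (i) and (s), depend only on part-of-speech tags, word positions, and punctuation, so they can be computed directly from a tagged sentence. This sketch makes the following assumptions:

- Input arrives as `(text, pos)` pairs, and punctuation appears as separate tokens that are not counted as real words.
- Query (h) means the number of real words before w_j, and query (i) means the number after it.
- `None` stands in where w_{i-1} or w_{j+1} does not exist.

The function and key names are hypothetical.

```python
PUNCT = {",", ";", ":", ".", "!", "?"}


def juncture_queries(tokens):
    """Answer queries (c)-(i) and (s) for every site <w_i, w_j> in one sentence.

    tokens: list of (text, pos) pairs; punctuation tokens are not real words.
    """
    words, punct_after = [], []
    for text, pos in tokens:
        if text in PUNCT:
            if punct_after:
                punct_after[-1] = True  # punctuation follows the previous real word
        else:
            words.append((text, pos))
            punct_after.append(False)
    n = len(words)
    answers = []
    for k in range(n - 1):  # site between real words k (w_i) and k + 1 (w_j)
        answers.append({
            "pos_i": words[k][1],                                    # (c)
            "pos_i_minus_1": words[k - 1][1] if k > 0 else None,     # (d)
            "pos_j": words[k + 1][1],                                # (e)
            "pos_j_plus_1": words[k + 2][1] if k + 2 < n else None,  # (f)
            "sent_len": n,                                           # (g)
            "j_from_start": k + 1,                                   # (h) words before w_j
            "j_from_end": n - k - 2,                                 # (i) words after w_j
            "punct": punct_after[k],                                 # (s)
        })
    return answers


tokens = [("Yesterday", "NN"), (",", ","), ("the", "DT"), ("cat", "NN"),
          ("slept", "VBD"), (".", ".")]
for site in juncture_queries(tokens):
    print(site["pos_i"], site["pos_j"], site["punct"])
```

Queries about noun phrases, prominence, and stress need extra inputs (a noun-phrase detector, an accent predictor, a lexicon), so they are left out of this sketch.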

14. A method for performing text-to-speech conversion on a set of input text, the method comprising the steps of:

(a) accessing a stored statistical representation of intonational feature information, the stored statistical representation based on a set of predetermined text and intonational feature annotations therefor, the set of predetermined text being unrelated to speech, the intonational feature annotations having been provided by a human annotator; and

(b) with a processor means and a phrasing module means, applying the set of input text to the stored statistical representation to generate an output representative of the set of input text, the output comprising intonational feature information associated with the set of input text.
15. The method of claim 14 further comprising the steps of:

16. The method of claim 14 wherein the stored statistical representation comprises a decision tree.

17. The method of claim 14 wherein the stored statistical representation comprises a hidden Markov model.

18. The method of claim 14 wherein the stored statistical representation comprises a neural network.
19. The method of claim 14 wherein the step of applying comprises answering a set of stored queries regarding the set of input text, the set of input text comprising a current sentence, the current sentence comprising a beginning, an end, and a plurality of words, each word in the plurality of words being a part of at least one set of words, w_i and w_j, wherein w_i and w_j each comprise at least one syllable and each have a part of speech associated therewith and each have a potential noun phrase associated therewith, the potential noun phrase having a beginning and an end, and further wherein w_i and w_j represent real words to the left and right, respectively, of a potential intonational phrase boundary site, <w_i and w_j>, the set of stored queries comprising at least one query selected from a group consisting of:

(a) is w_i intonationally prominent and if not, is w_i further reduced?;

(b) is w_j intonationally prominent and if not, is w_j further reduced?;

(c) what is w_i's part of speech?;

(d) what is w_{i-1}'s part of speech?;

(e) what is w_j's part of speech?;

(f) what is w_{j+1}'s part of speech?;

(g) how many words are in the current sentence?;

(h) how far, in real words, is w_j from the beginning of the sentence?;

(i) how far, in real words, is w_j from the end of the sentence?;

(j) where is the potential intonational phrase boundary site with respect to the potential noun phrase?;

(k) if <w_i and w_j> is within the potential noun phrase, how far is <w_i and w_j> from the beginning of the potential noun phrase?;

(l) how many words are in the potential noun phrase?;

(m) how far into the potential noun phrase is w_i?;

(n) how many syllables precede the potential intonational phrase boundary site in the current sentence?;

(o) how many lexically stressed syllables precede the potential intonational phrase boundary site in the current sentence?;

(p) how many strong syllables are there in the current sentence?;

(q) what is a stress level of a syllable in w_i immediately preceding the potential intonational boundary site?;

(r) what is a result of dividing a distance from w_j to a last intonational boundary assigned by a total length of the last intonational phrase?;

(s) is there punctuation at the potential intonational phrase boundary site?; and

(t) how many primary and secondary stressed syllables exist between the potential intonational phrase boundary site and the beginning of the current sentence.
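The syllable-based queries (n), (o), and (q) in claim 19 can be sketched under an assumed stress encoding. Each word is given as a string with one digit per syllable: `1` for primary stress, `2` for secondary stress, and `0` for unstressed, which is the convention used by the CMU Pronouncing Dictionary. The sketch also treats "lexically stressed" as primary or secondary stress. The encoding, that reading of query (o), and the function name are all assumptions. In practice, the stress strings would come from a lexicon or letter-to-sound rules.

```python
def stress_queries(stress):
    """Answer queries (n), (o) and (q) for each site between adjacent words.

    stress: one string per real word, one digit per syllable
    ("1" primary, "2" secondary, "0" unstressed).
    """
    answers = []
    preceding = ""  # stress digits of every syllable left of the current site
    for k in range(len(stress) - 1):
        preceding += stress[k]
        answers.append({
            "syllables_before": len(preceding),                    # (n)
            "stressed_before": sum(d in "12" for d in preceding),  # (o)
            "last_syllable_stress_of_w_i": int(stress[k][-1]),     # (q)
        })
    return answers


# Hypothetical input for "Dogs abound in gardens": dogs=1, abound=01, in=0, gardens=10
print(stress_queries(["1", "01", "0", "10"]))
```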

Current Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Inventors
Hirschberg, Julia
Primary Examiner(s)
Dorvil, Richemond

Application Number
US08/548,794
Time in Patent Office
1,895 Days
Field of Search
395/2, 395/2.65, 395/2.68, 395/2.69, 704/260, 704/256, 704/258, 704/259, 704/270, 704/272
US Class Current
704/260
CPC Class Codes
G10L 13/04 Details of speech synthesis...
