Text-to-speech system with automatically trained phrasing rules
Abstract
A method of training a TTS or other system to assign intonational features, such as intonational phrase boundaries, is described. The method of training involves taking a set of predetermined text (not speech or a signal representative of speech) and having a human annotate it with intonational feature annotations. This results in annotated text. Next, the structure of the set of predetermined text is analyzed to generate information. This information is used, along with the intonational feature annotations, to generate a statistical representation. The statistical representation may then be stored and repeatedly used to generate synthesized speech from new sets of input text without training the TTS system further.
19 Claims
1. A method for generating a statistical representation of intonational feature information for a text-to-speech system, the method comprising the steps of:
(a) annotating a set of predetermined text with intonational feature annotations to generate annotated text, the set of predetermined text being unrelated to speech, said annotating being performed by a human annotator;
(b) with a computer means, generating a set of structural information regarding the predetermined text;
(c) with the computer means, generating said statistical representations of intonational feature information based on the set of structural information and the intonational feature annotations; and
(d) storing said statistical representation for use in training a text-to-speech system.
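Steps (a) through (d) can be illustrated with a minimal Python sketch. The "|" annotation marker, the sample sentences, the single structural feature (whether the left word ends in punctuation), and the JSON serialization are all assumptions made for illustration; the claim prescribes none of them. The statistical representation here is a plain relative-frequency table standing in for whatever model is actually trained.

```python
import json
from collections import defaultdict

# Assumption: the human annotator marks intonational phrase boundaries
# in plain text with a standalone "|" token.
ANNOTATED = [
    "When she arrived, | the meeting had already started.",
    "We left early | because the roads were icy.",
    "After lunch, | he called the office.",
]

def sites_from_annotation(line):
    """Turn one annotated line into (left_word, right_word, is_boundary) sites."""
    words = []
    boundary_after = set()  # indices of words followed by a "|" mark
    for tok in line.split():
        if tok == "|":
            boundary_after.add(len(words) - 1)
        else:
            words.append(tok)
    return [(words[i], words[i + 1], i in boundary_after)
            for i in range(len(words) - 1)]

def structural_info(left):
    """Step (b): one illustrative structural feature per site."""
    return "punct" if left[-1] in ",;:" else "no_punct"

def train(lines):
    """Step (c): relative frequency of a boundary given the feature."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [boundaries, sites]
    for line in lines:
        for left, _right, is_boundary in sites_from_annotation(line):
            feature = structural_info(left)
            counts[feature][0] += int(is_boundary)
            counts[feature][1] += 1
    return {feature: b / n for feature, (b, n) in counts.items()}

model = train(ANNOTATED)

# Step (d): serialize the representation so it can be stored and reused.
serialized = json.dumps(model)
restored = json.loads(serialized)
```

In this toy corpus every punctuated site carries a boundary, so the table assigns such sites a probability of 1.0, while unpunctuated sites receive 1/17.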
8. An apparatus for performing text-to-speech conversion on a set of input text, the apparatus comprising:
(a) a stored statistical representation of intonational feature information, the stored statistical representation based on a set of predetermined text and intonational feature annotations therefor, the set of predetermined text being unrelated to speech, the intonational feature annotations having been provided by a human annotator; and
(b) a processor and a phrasing module for applying the set of input text to the stored statistical representation to generate an output representative of the set of input text, the output comprising intonational feature information associated with the set of input text.
9. The apparatus of claim 8 further comprising:
(a) means for post-processing the output to generate a synthesized speech signal; and
(b) means for applying the synthesized speech signal to an acoustic output device.
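The way a phrasing module might apply input text to a stored representation, as in claim 8, can be sketched as follows. The stored table, its probabilities, the 0.5 threshold, and the "|" output marker are hypothetical; the sketch covers only the phrasing step, not speech synthesis or acoustic output.

```python
import json

# Assumption: a previously stored representation, here a JSON table mapping
# a structural feature to the probability of an intonational phrase boundary.
STORED = '{"punct": 0.9, "no_punct": 0.05}'

def phrase(text, model, threshold=0.5):
    """Return the words of `text` with "|" inserted at predicted boundaries."""
    words = text.split()
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i == len(words) - 1:
            break  # there is no boundary site after the last word
        feature = "punct" if word[-1] in ",;:" else "no_punct"
        if model[feature] >= threshold:
            out.append("|")
    return " ".join(out)

model = json.loads(STORED)
print(phrase("If it rains, we will stay home.", model))
# prints: If it rains, | we will stay home.
```

No training happens at this stage: the table is only read, which is why the same stored representation can serve any number of new input texts.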
10. The apparatus of claim 8 wherein the stored statistical representation comprises a decision tree.
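A stored decision tree of the kind recited in claim 10 can be pictured as nested query nodes that end in probability leaves. The sketch below is illustrative only: the query names, the tree shape, and every probability are invented for the example and are not taken from the patent.

```python
# Hypothetical tree: each internal node asks one yes/no query about a
# boundary site; each leaf holds an estimated boundary probability.
TREE = {
    "query": "punctuation_at_site",
    "yes": {"leaf": 0.92},
    "no": {
        "query": "left_word_is_noun",
        "yes": {
            "query": "more_than_4_words_from_start",
            "yes": {"leaf": 0.41},
            "no": {"leaf": 0.08},
        },
        "no": {"leaf": 0.03},
    },
}

def classify(node, answers):
    """Walk from the root to a leaf, following the answer to each query."""
    while "leaf" not in node:
        node = node["yes"] if answers[node["query"]] else node["no"]
    return node["leaf"]

site = {
    "punctuation_at_site": False,
    "left_word_is_noun": True,
    "more_than_4_words_from_start": True,
}
probability = classify(TREE, site)  # follows no -> yes -> yes
```

Because the tree is plain data, it can be serialized once after training and reloaded for each new input text.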
11. The apparatus of claim 8 wherein the stored statistical representation comprises a hidden Markov model.
12. The apparatus of claim 8 wherein the stored statistical representation comprises a neural network.
13. The apparatus of claim 8 wherein the phrasing module comprises means for answering a set of stored queries regarding the set of input text, the set of input text comprising a current sentence, the current sentence comprising a beginning, an end, and a plurality of words, each word in the plurality of words being a part of at least one set of words, wi and wj, wherein wi and wj each comprise at least one syllable and each have a part of speech associated therewith and each have a potential noun phrase associated therewith, the potential noun phrase having a beginning and an end, and further wherein wi and wj represent real words to the left and right, respectively, of a potential intonational phrase boundary site, <wi, wj>, the set of stored queries comprising at least one query selected from a group consisting of:
(a) is wi intonationally prominent and if not, is wi further reduced?;
(b) is wj intonationally prominent and if not, is wj further reduced?;
(c) what is wi's part of speech?;
(d) what is wi-1's part of speech?;
(e) what is wj's part of speech?;
(f) what is wj+1's part of speech?;
(g) how many words are in the current sentence?;
(h) how far, in real words, is wj from the beginning of the sentence?;
(i) how far, in real words, is wj from the end of the sentence?;
(j) where is the potential intonational phrase boundary site with respect to the potential noun phrase?;
(k) if <wi, wj> is within the potential noun phrase, how far is <wi, wj> from the beginning of the potential noun phrase?;
(l) how many words are in the potential noun phrase?;
(m) how far into the potential noun phrase is wi?;
(n) how many syllables precede the potential intonational phrase boundary site in the current sentence?;
(o) how many lexically stressed syllables precede the potential intonational phrase boundary site in the current sentence?;
(p) how many strong syllables are there in the current sentence?;
(q) what is a stress level of a syllable in wi immediately preceding the potential intonational boundary site?;
(r) what is a result of dividing a distance from wj to a last intonational boundary assigned by a total length of the last intonational phrase?;
(s) is there punctuation at the potential intonational phrase boundary site?; and
(t) how many primary and secondary stressed syllables exist between the potential intonational phrase boundary site and the beginning of the current sentence.
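Several of the queries above depend only on word positions and punctuation, so they can be answered directly from the text. The following sketch answers queries (g), (h), (i) and (s) for every site between a left word wi and a right word wj. The tokenizing regular expression, the dictionary keys, and the distance convention (the number of real words separating wj from the first or last word of the sentence) are assumptions for illustration.

```python
import re

def boundary_sites(sentence):
    """Yield answers to queries (g), (h), (i) and (s) for each site."""
    # Words (optionally with an internal apostrophe) or single punctuation marks.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
    words = []           # real words only
    punct_after = set()  # indices of words directly followed by punctuation
    for tok in tokens:
        if tok[0].isalnum():
            words.append(tok)
        elif words:
            punct_after.add(len(words) - 1)
    n = len(words)
    for i in range(n - 1):
        j = i + 1
        yield {
            "wi": words[i],
            "wj": words[j],
            "g_words_in_sentence": n,
            "h_wj_from_start": j,        # real words before wj
            "i_wj_from_end": n - 1 - j,  # real words after wj
            "s_punctuation": i in punct_after,
        }

sites = list(boundary_sites("After lunch, he called the office."))
```

Each dictionary is one feature vector; the part-of-speech, noun-phrase and stress queries would need a tagger, a parser and a pronunciation lexicon, which are outside this sketch.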
14. A method for performing text-to-speech conversion on a set of input text, the method comprising the steps of:
(a) accessing a stored statistical representation of intonational feature information, the stored statistical representation based on a set of predetermined text and intonational feature annotations therefor, the set of predetermined text being unrelated to speech, the intonational feature annotations having been provided by a human annotator; and
(b) with a processor means and a phrasing module means, applying the set of input text to the stored statistical representation to generate an output representative of the set of input text, the output comprising intonational feature information associated with the set of input text.
15. The method of claim 14 further comprising:
(a) post-processing the output to generate a synthesized speech signal; and
(b) applying the synthesized speech signal to an acoustic output device.
16. The method of claim 14 wherein the stored statistical representation comprises a decision tree.
17. The method of claim 14 wherein the stored statistical representation comprises a hidden Markov model.
18. The method of claim 14 wherein the stored statistical representation comprises a neural network.
19. The method of claim 14 wherein the step of applying comprises answering a set of stored queries regarding the set of input text, the set of input text comprising a current sentence, the current sentence comprising a beginning, an end, and a plurality of words, each word in the plurality of words being a part of at least one set of words, wi and wj, wherein wi and wj each comprise at least one syllable and each have a part of speech associated therewith and each have a potential noun phrase associated therewith, the potential noun phrase having a beginning and an end, and further wherein wi and wj represent real words to the left and right, respectively, of a potential intonational phrase boundary site, <wi, wj>, the set of stored queries comprising at least one query selected from a group consisting of:
(a) is wi intonationally prominent and if not, is wi further reduced?;
(b) is wj intonationally prominent and if not, is wj further reduced?;
(c) what is wi's part of speech?;
(d) what is wi-1's part of speech?;
(e) what is wj's part of speech?;
(f) what is wj+1's part of speech?;
(g) how many words are in the current sentence?;
(h) how far, in real words, is wj from the beginning of the sentence?;
(i) how far, in real words, is wj from the end of the sentence?;
(j) where is the potential intonational phrase boundary site with respect to the potential noun phrase?;
(k) if <wi, wj> is within the potential noun phrase, how far is <wi, wj> from the beginning of the potential noun phrase?;
(l) how many words are in the potential noun phrase?;
(m) how far into the potential noun phrase is wi?;
(n) how many syllables precede the potential intonational phrase boundary site in the current sentence?;
(o) how many lexically stressed syllables precede the potential intonational phrase boundary site in the current sentence?;
(p) how many strong syllables are there in the current sentence?;
(q) what is a stress level of a syllable in wi immediately preceding the potential intonational boundary site?;
(r) what is a result of dividing a distance from wj to a last intonational boundary assigned by a total length of the last intonational phrase?;
(s) is there punctuation at the potential intonational phrase boundary site?; and
(t) how many primary and secondary stressed syllables exist between the potential intonational phrase boundary site and the beginning of the current sentence.
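Query (r) differs from the purely positional queries because its answer depends on boundaries already assigned earlier in the same sentence, so the applying step has to work through the sites from left to right. The sketch below shows one possible reading, with distances counted in words and "the last intonational phrase" taken to mean the most recently completed phrase. The function name, the index convention, and the choice to return None when no boundary has been assigned yet are assumptions.

```python
def query_r(j, boundaries):
    """Distance from wj to the last assigned boundary, divided by the
    length of the phrase that boundary closed.

    j          -- index of wj among the real words of the sentence
    boundaries -- ascending word indices k, each meaning "a boundary was
                  assigned immediately before word k"
    """
    if not boundaries:
        return None  # assumption: undefined before the first boundary
    last = boundaries[-1]
    # The closed phrase starts at the previous boundary, or at the
    # beginning of the sentence if there is none.
    phrase_start = boundaries[-2] if len(boundaries) > 1 else 0
    phrase_length = last - phrase_start
    if phrase_length == 0:
        return None
    return (j - last) / phrase_length

# Ten-word sentence with boundaries assigned before words 3 and 7:
ratio = query_r(9, [3, 7])  # (9 - 7) / (7 - 3)
```

A ratio near or above 1 indicates that the current phrase has grown about as long as the previous one, which is the kind of length-balance cue a trained model can weigh alongside the other queries.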
Specification