Text-to-speech system and a method and apparatus for training the same based upon intonational feature annotations of input text
Abstract
A method is described for training a text-to-speech (TTS) or other system to assign intonational features, such as intonational phrase boundaries, to input text; the method overcomes the shortcomings of the known methods. Training involves taking a set of predetermined text (not speech or a signal representative of speech) and having a human annotate it with intonational feature annotations. This results in annotated text. Next, the structure of the set of predetermined text is analyzed to generate structural information. This information is used, along with the intonational feature annotations, to generate a statistical representation. The statistical representation may then be stored and repeatedly used to generate synthesized speech from new sets of input text without training the TTS system further. The resulting trained system and use thereof are also part of the invention.
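The training flow summarized above (annotate text, derive structural information, build a statistical representation, store it) can be sketched roughly as follows. This is only an illustrative sketch and not the patent's implementation: the `|` boundary mark, the three structural features, the per-feature boundary frequency table used as the "statistical representation", and all function names are assumptions made for the example.

```python
import json
from collections import Counter, defaultdict

BOUNDARY = "|"  # hypothetical annotation mark for an intonational phrase boundary


def parse_annotated(annotated):
    """Split annotated text into (words, labels).

    labels[i] is True when a boundary mark follows words[i].
    """
    words, labels = [], []
    for token in annotated.split():
        if token == BOUNDARY:
            if labels:  # ignore a mark that precedes any word
                labels[-1] = True
        else:
            words.append(token)
            labels.append(False)
    return words, labels


def feature_key(words, i):
    """Structural information for the juncture after words[i] (illustrative only)."""
    ends_with_punct = words[i][-1] in ",.;:?!"
    is_last = i == len(words) - 1
    next_is_short = (not is_last) and len(words[i + 1]) <= 3
    return f"p{int(ends_with_punct)}l{int(is_last)}s{int(next_is_short)}"


def train(annotated):
    """Build a table mapping each feature key to an observed boundary frequency."""
    words, labels = parse_annotated(annotated)
    counts = defaultdict(Counter)
    for i, label in enumerate(labels):
        counts[feature_key(words, i)][label] += 1
    return {key: c[True] / (c[True] + c[False]) for key, c in counts.items()}


def save_model(model, path):
    """Store the table so it can be reused without further training."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)


def load_model(path):
    """Read back a table written by save_model."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `train("When the rain stopped, | we went outside. |")` yields a table in which junctures after punctuation carry a boundary frequency of 1.0 and the remaining junctures 0.0. A real system would be trained on far more text and would likely use richer structural features and a stronger model than a frequency table.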
30 Claims
1. A machine implemented method of training a system for converting between text and speech, said method comprising the steps of:
(a) annotating a set of predetermined text with intonational feature annotations to generate annotated text, said set of predetermined text and said annotated text having a physically tangible readable form;
(b) generating a set of structural information regarding said set of predetermined text;
(c) generating a statistical representation of intonational feature information, the statistical representation being a function of said set of structural information and said intonational feature annotations; and
(d) storing said statistical representation in said system for use by said system in converting between speech and text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. An apparatus for converting text to speech, said apparatus comprising:
(a) an input for receiving a set of input text having a physically tangible readable form; and
(b) a phrasing module adapted to receive the set of input text from said input, said phrasing module including a stored statistical representation, the stored statistical representation being a function of a set of predetermined text and intonational feature annotations therefor, said phrasing module applying the set of input text to the stored statistical representation to generate an output representative of the set of input text.
Dependent claims: 12, 13, 14, 15, 16
17. A machine implemented method of converting text to speech, said method comprising:
(a) accessing a stored statistical representation from a phrasing module, the stored statistical representation being a function of a set of predetermined text and intonational feature annotations therefor; and
(b) applying a set of input text having a physically tangible readable form to the stored statistical representation to generate an output representative of the set of input text.
Dependent claims: 18, 19, 20, 21, 22, 23
24. A machine implemented method of training a text-to-speech system, said method comprising the steps of:
generating a statistical representation, said statistical representation being a function of a set of structural information of a set of text and a set of intonational feature annotations of an annotated version of said set of text; and
storing said statistical representation on a text-to-speech system for use in generating an intonational phrased output for future text input into the system.
25. An apparatus for training a text-to-speech system, said apparatus comprising:
an input for receiving a set of text and an annotated version of the set of text; and
a phrasing module adapted to receive the set of text and the annotated version of the set of text from said input, said phrasing module generating a statistical representation, said statistical representation being a function of a set of structural information of the set of text and a set of intonational feature annotations of the annotated version of the set of text, said phrasing module storing said statistical representation for use in generating an intonational phrased output for future text input into the system.
26. An apparatus comprising:
a processor for generating structural information based on a set of text; and
a phrasing module for generating a statistical representation based on said structural information and on a set of intonational feature annotations of an annotated version of said set of text, said phrasing module being operable to apply an input text to said statistical representation to generate a synthesized speech signal.
27. A method comprising the steps of:
generating structural information based on a set of text;
generating a statistical representation based on said structural information and on a set of intonational feature annotations of an annotated version of said set of text; and
applying said statistical representation to a set of input text to generate a synthesized speech signal.
28. A machine implemented method of converting text to speech, said method comprising:
(a) accessing a stored statistical representation from a phrasing module, the stored statistical representation being a function of a set of predetermined text and intonational feature annotations therefor;
(b) applying a set of input text having a physically tangible readable form to the stored statistical representation to generate an output representative of the set of input text; and
(c) post-processing the output to generate a synthesized speech signal.
29. An apparatus for performing text-to-speech conversion on a set of input text, said apparatus comprising:
a first processor, said first processor pre-processing a set of input text having a physically tangible readable form;
a phrasing module connected to said first processor, said phrasing module having said pre-processed input text as an input, said phrasing module including a stored statistical representation which is a function of a set of predetermined text and intonational feature annotations therefor, said phrasing module applying the set of pre-processed input text to the stored statistical representation to generate an output representative of the set of input text; and
a second processor connected to said phrasing module, said second processor post-processing the output to generate a synthesized speech signal.
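The three-stage arrangement recited in claim 29 (pre-processing, a phrasing module holding a stored statistical representation, post-processing) might be pictured with a sketch like the one below. Everything here is assumed for illustration: the stored table and its values are invented, the single punctuation feature and the 0.5 threshold are arbitrary, and the post-processing stage only emits a textual plan with pause marks as a stand-in for an actual synthesized speech signal.

```python
# Hypothetical stored statistical representation: boundary probability per
# structural feature key. The values are invented for illustration only.
STORED_MODEL = {"punct=1": 0.9, "punct=0": 0.1}


def preprocess(text):
    """First stage: turn raw input text into a list of word tokens."""
    return text.split()


def feature_key(words, i):
    """Illustrative structural feature for the juncture after words[i]."""
    return "punct=1" if words[i][-1] in ",.;:?!" else "punct=0"


class PhrasingModule:
    """Second stage: applies input text to a stored statistical representation."""

    def __init__(self, model, threshold=0.5):
        self.model = model
        self.threshold = threshold  # assumed decision threshold

    def apply(self, words):
        """Group words into phrases, closing a phrase at each predicted boundary."""
        phrases, current = [], []
        for i, word in enumerate(words):
            current.append(word)
            if self.model.get(feature_key(words, i), 0.0) >= self.threshold:
                phrases.append(current)
                current = []
        if current:
            phrases.append(current)
        return phrases


def postprocess(phrases):
    """Third stage stand-in: a textual plan with pause marks, not real audio."""
    return " <pause> ".join(" ".join(phrase) for phrase in phrases)


def text_to_phrased_plan(text, model=STORED_MODEL):
    """Run the three stages in order on one piece of input text."""
    return postprocess(PhrasingModule(model).apply(preprocess(text)))
```

Under these assumptions, `text_to_phrased_plan("After lunch, we left early.")` returns `"After lunch, <pause> we left early."`. The stored table is only read, never updated, which mirrors the idea that new input text is phrased without retraining.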
30. An apparatus for converting text to speech, said apparatus comprising:
an input for receiving a pre-processed set of input text; and
a phrasing module receiving said set of pre-processed input text from said input, said phrasing module including a stored statistical representation which is a function of a set of predetermined text and intonational feature annotations therefor, said phrasing module applying said set of pre-processed input text to the stored statistical representation to generate an output representative of the set of input text.
Specification