Methods and apparatus for text to speech processing using language independent prosody markup
Abstract
Techniques are described for employing a set of tags to model phenomena which are smooth and subject to constraints. Tags may be used to model, for example, muscular movement producing speech. In one advantageous application, a set of tags defining prosodic characteristics is developed, and selected tags are placed in appropriate locations of a body of text. Each tag defines a constraint on the prosodic characteristics of speech produced by processing the text. Processing of the body of text and the tags produces a set of equations which are solved to produce a curve defining prosodic characteristics over the scope of a phrase, and a further set of equations which are solved to produce a curve defining prosodic characteristics of individual words within a phrase. The data defined by the curves is used with the text to produce speech having the prosodic characteristics defined by the tags. A set of tags may be produced by reading of a training text by a target speaker to produce a training corpus reflecting the prosodic characteristics of the target speaker, and then analyzing the training corpus to generate tags modeling the prosodic characteristics of the training corpus.
28 Citations
30 Claims
1. A method of modeling phenomena comprising the steps of:
analyzing one or more instances of actual phenomena to identify characteristics of the instances of the actual phenomena;
creating a set of tags defining the identified characteristics of the one or more instances of the actual phenomena, each tag controlling one or more aspects of one or more modeled phenomena to be produced in response to the tags, the tags controlling the aspects of the modeled phenomena so as to create characteristics in the modeled phenomena similar to those exhibited by the one or more instances of the actual phenomena;
arranging selected members of the set of tags in a desired sequence to produce phenomena as defined by the sequence of tags; and
processing the tags in order to produce phenomena having the characteristics defined by the tags.
Dependent claims: 2 through 25
26. A method of processing a body of text including tags defining prosodic characteristics of speech to be produced by processing the text, comprising the steps of:
extracting the tags from the text;
creating a set of equations defining a phrase curve;
solving the set of equations to produce the phrase curve;
creating a set of equations defining a pitch curve;
solving the set of equations to produce the pitch curve;
mapping linguistic concepts represented by the phrase curve and the pitch curve to acoustical observables; and
performing a nonlinear transformation to adjust the prosodic characteristics defined by tags to human perceptions and expectations.
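The steps recited in claim 26 can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the patent: the tag syntax (`<phrase VALUE STRENGTH>` and `<accent VALUE STRENGTH>`, each applying to the next word), the quadratic smoothness objective that generates the linear equations, the function names, and the exponential mapping to hertz, which stands in for the mapping to acoustical observables and the nonlinear transformation.

```python
import re

# Hypothetical tag syntax (not from the patent): <phrase VALUE STRENGTH> or
# <accent VALUE STRENGTH>, separated from words by whitespace, applying to the next word.
TOKEN = re.compile(r"<(phrase|accent) (-?[\d.]+) ([\d.]+)>|(\S+)")

def extract_tags(tagged_text):
    """Split tagged text into plain words and per-kind constraint lists."""
    words, tags = [], {"phrase": [], "accent": []}
    for kind, value, strength, word in TOKEN.findall(tagged_text):
        if word:
            words.append(word)
        else:
            # Constraint: (index of the next word, target value, weight).
            tags[kind].append((len(words), float(value), float(strength)))
    for kind in tags:  # drop tags that are not followed by any word
        tags[kind] = [c for c in tags[kind] if c[0] < len(words)]
    return words, tags

def solve_curve(n, constraints, default, smooth=1.0, ridge=0.1):
    """Build and solve the linear equations that minimise
    sum w*(p[i]-v)^2 + smooth*sum (p[i+1]-p[i])^2 + ridge*sum (p[i]-default[i])^2."""
    a = [[0.0] * n for _ in range(n)]
    b = [ridge * d for d in default]
    for i in range(n):
        a[i][i] += ridge
    for i in range(n - 1):  # smoothness couples neighbouring words
        a[i][i] += smooth
        a[i + 1][i + 1] += smooth
        a[i][i + 1] -= smooth
        a[i + 1][i] -= smooth
    for i, value, weight in constraints:  # each tag is a soft constraint
        a[i][i] += weight
        b[i] += weight * value
    # Gaussian elimination without pivoting; the matrix is symmetric
    # positive definite whenever ridge > 0.
    for col in range(n):
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    p = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(a[row][k] * p[k] for k in range(row + 1, n))
        p[row] = (b[row] - s) / a[row][row]
    return p

def text_to_pitch(tagged_text, base_hz=120.0):
    """Extract tags, solve the phrase curve, then the pitch curve, then map to Hz."""
    words, tags = extract_tags(tagged_text)
    n = len(words)
    phrase = solve_curve(n, tags["phrase"], [0.0] * n)
    # The pitch curve falls back to the phrase curve where no accent tag applies.
    pitch = solve_curve(n, tags["accent"], phrase)
    # Assumed nonlinear mapping: curve values are octaves above base_hz.
    hz = [base_hz * 2.0 ** p for p in pitch]
    return words, phrase, pitch, hz
```

For example, `text_to_pitch("<phrase 1.0 1.0> the quick <accent 2.0 100.0> brown fox")` returns one value per word for each curve; the strongly weighted accent tag pulls the pitch curve close to its target at "brown", while the smoothness term spreads the effect to neighbouring words.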
27. A method of defining a set of tags specifying prosodic characteristics of speech of a target speaker, comprising the steps of:
selecting a body of training text;
receiving speech representing reading of the training text by the target speaker to form a training corpus, the training corpus representing actual sounds produced by the reading of the training text by the target speaker and exhibiting prosodic characteristics of actual speech of the target speaker;
analyzing the training corpus to identify prosodic characteristics of the training corpus; and
creating a set of tags defining the identified prosodic characteristics of the training corpus.
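As a rough illustration of the analysis and tag-creation steps of claim 27, the Python sketch below derives accent tags from per-word pitch measurements of a training corpus. The input format (one mean fundamental frequency in hertz per word), the use of the median as the speaker's baseline, the octave threshold, and the function name `tags_from_corpus` are all assumptions made for this example; the claim does not specify any of them.

```python
import math
from statistics import median

def tags_from_corpus(words, f0_hz, threshold_octaves=0.15):
    """Return (baseline_hz, tags), where each tag is (word_index, word, octaves_above_baseline).

    Assumption: f0_hz holds one measured mean pitch per word of the training text.
    """
    if len(words) != len(f0_hz):
        raise ValueError("need exactly one pitch value per word")
    baseline = median(f0_hz)  # assumed stand-in for the speaker's neutral pitch
    tags = []
    for index, (word, f0) in enumerate(zip(words, f0_hz)):
        relative = math.log2(f0 / baseline)  # pitch excursion in octaves
        if relative >= threshold_octaves:
            tags.append((index, word, round(relative, 3)))
    return baseline, tags
```

Words whose pitch rises at least `threshold_octaves` above the baseline receive a tag recording the size of the excursion; all other words are left untagged.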
28. A method of placing tags in text for text to speech processing comprising the steps of:
placing tags in a body of training text to model prosodic characteristics of a training corpus produced by reading of the training text;
analyzing the placement of the tags in the training text to develop a set of rules for placement of tags in text; and
applying the rules to text for which text to speech processing is desired to place tags in the text in order to produce speech having desired prosodic characteristics.
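The rule-learning and rule-application steps of claim 28 can be sketched in Python as follows. This is a deliberately simple stand-in: the `<accent>` marker, the per-word frequency rule, and the function names are hypothetical and chosen only to make the flow concrete; the claim does not prescribe any particular rule form.

```python
from collections import Counter

TAG = "<accent>"          # hypothetical tag marker placed before the word it applies to
PUNCTUATION = ".,;:!?"

def learn_rules(tagged_training_text, min_ratio=0.5):
    """Learn which words tend to be tagged in the training text.

    A word becomes a rule if it is preceded by TAG in more than
    min_ratio of its occurrences.
    """
    tagged, seen = Counter(), Counter()
    pending = False
    for token in tagged_training_text.split():
        if token == TAG:
            pending = True
            continue
        word = token.lower().strip(PUNCTUATION)
        seen[word] += 1
        if pending:
            tagged[word] += 1
        pending = False
    return {word for word in seen if tagged[word] / seen[word] > min_ratio}

def apply_rules(text, rules):
    """Insert TAG before every word of new text that matches a learned rule."""
    output = []
    for token in text.split():
        if token.lower().strip(PUNCTUATION) in rules:
            output.append(TAG)
        output.append(token)
    return " ".join(output)
```

A practical system would presumably condition the rules on richer context such as part of speech or phrase position; a per-word frequency rule is used here only to keep the example short.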
29. A text to speech system for receiving text inputs comprising text to be processed to generate speech and tags defining prosodic characteristics of the speech to be generated, comprising:
a prosody tag generation component to analyze a training corpus to identify characteristics exhibited by one or more readings of text by one or more target speakers and to generate a set of tags defining the identified characteristics;
a text input interface for receiving the text input;
a speech modeler operative to process the text inputs to produce speech having the prosodic characteristics specified by the tags, such that the speech produced by the speech modeler is similar to that of the one or more target speakers; and
a speech output interface for producing the speech output.
Dependent claims: 30
Specification