Methods and apparatus for speaker specific durational adaptation
Abstract
A text to speech system modeling durational characteristics of a target speaker is addressed herein. A body of target speaker training text is selected having maximum possible information about speaker specific characteristics. The body of target speaker training text is read by a target speaker to produce a target speaker training corpus. A previously generated source model reflecting characteristics of a source speaker is retrieved, and the target speaker training corpus is processed to produce modification parameters reflecting differences between durational characteristics of the target speaker and those predicted by the source model. The modification parameters are applied to the source model to produce a target model. Text inputs are processed using the target model to produce speech outputs reflecting durational characteristics of the target speaker.
23 Claims
1. A method of training a target speaker model for use in text to speech processing, the target speaker model reflecting durational characteristics of a target speaker, comprising the steps of:
retrieving a previously developed source model having source model coefficients reflecting durational characteristics of a source speaker;
developing modification parameters reflecting differences in durational characteristics of the source speaker and those of the target speaker, the modification parameters being developed using a training corpus developed using text chosen independently of text used to develop the source model; and
applying the modification parameters to the source model coefficients to produce the target speaker model.
11. A text to speech system for receiving text input and producing speech output corresponding to the text input, the speech outputs reflecting durational characteristics of a target speaker, comprising:
a text input interface for receiving the text input;
a target speaker modeler for applying a target speaker model to the text input to define durational characteristics of the speech output, the target speaker model including source model coefficients reflecting durational characteristics of a source speaker and a set of modification parameters applied to the source model, the set of modification parameters reflecting differences in durational characteristics between the source speaker and the target speaker, the target speaker model and the source speaker model having been developed independently of the text input for which speech output is to be produced; and
a speech output interface for producing the speech output.
14. A communication system comprising:
a communication channel; and
a plurality of transmission stations, each of the plurality of transmission stations being operative to transmit a text message over the communication channel, each of the transmission stations being operative to create a set of modification parameters reflecting voice characteristics of a target speaker;
a plurality of receiving stations, each of the receiving stations being operative to receive the message and the modification parameters, to apply the modification parameters to a target speaker model including source model coefficients, the target speaker model and the source speaker model having been developed using text chosen independently of text comprising the text message, and to use the target speaker model to process the text message to produce speech output reflecting durational characteristics of the target speaker.
18. A voice communication system comprising:
a communication channel;
a first transceiver operative to store a set of modification parameters reflecting speech characteristics of a user and to transmit the set of modification parameters upon initiation of a communication connection, the first transceiver being also operative to receive a voice input from a speaker and to process the voice input using a speech to text system to produce a text transmission, the first transceiver being operative to transmit the text transmission; and
a second transceiver operative to receive the set of modification parameters from the first transceiver and to apply the modification parameters to a target speaker model including source model coefficients, the target speaker model and the source model coefficients having been developed using text chosen independently of the text comprising the text transmission, the second transceiver being also operative to receive the text transmission from the first transceiver and process the text transmission using the target speaker model to produce speech output reflecting the durational characteristics of the target speaker.
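The message flow recited in claim 18 can be illustrated with a minimal sketch. Everything named here (`Channel`, `FirstTransceiver`, `SecondTransceiver`, the dictionary layout of the model) is a hypothetical stand-in invented for illustration and is not taken from the patent; the speech to text and synthesis stages are stubbed out. The point it shows is only the ordering: the modification parameters cross the channel once, when the connection is initiated, and every later transmission is plain text that the receiver renders with its locally held source model coefficients.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical in-memory stand-in for the communication channel."""
    queue: list = field(default_factory=list)

    def send(self, kind, payload):
        self.queue.append((kind, payload))

    def receive(self):
        return self.queue.pop(0)


class FirstTransceiver:
    """Stores the user's modification parameters and sends text."""

    def __init__(self, channel, modification_params):
        self.channel = channel
        self.modification_params = modification_params

    def connect(self):
        # Parameters are transmitted once, upon connection initiation.
        self.channel.send("params", self.modification_params)

    def send_text(self, recognized_text):
        # Assumption: recognized_text is the output of a speech to text
        # system, which is not modeled in this sketch.
        self.channel.send("text", recognized_text)


class SecondTransceiver:
    """Holds source model coefficients and builds the target model."""

    def __init__(self, channel, source_coefficients):
        self.channel = channel
        self.source_coefficients = source_coefficients
        self.target_model = None

    def poll(self):
        kind, payload = self.channel.receive()
        if kind == "params":
            # Apply the received parameters to the local source model.
            self.target_model = {
                "source_coefficients": self.source_coefficients,
                "modification_params": payload,
            }
            return None
        if self.target_model is None:
            raise RuntimeError("text received before modification parameters")
        # A real system would hand these to a synthesizer.
        return payload, self.target_model
```

In this arrangement the receiver never needs recordings of the target speaker, only the small parameter set sent at connection time.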
21. A method of training a target speaker model for use in text to speech processing, the target speaker model reflecting durational characteristics of a target speaker, comprising the steps of:
(1) retrieving a previously developed source model having source model coefficients reflecting durational characteristics of a source speaker;
(2) developing modification parameters reflecting differences in durational characteristics of the source speaker and those of the target speaker, developing the modification parameters including developing a target speaker training corpus reflecting speaker specific characteristics of the target speaker and processing the target speaker training corpus to produce the modification parameters, developing the target speaker training corpus including;
(a) selecting a target speaker training text and receiving voice inputs produced by a reading of the target speaker training text by the target speaker, selecting the target speaker training text including assembling a large body of text and selecting a subset of the large body of text chosen to reflect speaker specific characteristics of the large body of text, selecting the subset of the large body of text comprising finding an optimum full-rank matrix comprising a small set of sentences containing enough data to develop the modification parameters to be used to create the target speaker model;
(b) receiving voice inputs produced by a reading of the target speaker training text by the target speaker; and
(3) applying the modification parameters to the source model coefficients to produce the target speaker model, generating the target speaker model including developing and using parameters in the target speaker model equation $\mathrm{Dur}(p)_s = D_{\mathrm{mean}}(p)^{k} \cdot D_1(f_1)^{k_1} \cdots D_n(f_n)^{k_n}$, where $\mathrm{Dur}(p)_s$ is the duration of each phone as spoken by the target speaker $s$, $D_{\mathrm{mean}}(p)$ is the coefficient of the corrected mean duration of each phone as taken from the source model, $D_1(f_1), \ldots, D_n(f_n)$ are durational effects of each parameter present in the source model, $k$ is the modification parameter of the phone, and $k_1, \ldots, k_n$ are modification parameters of factors, the modification parameters reflecting durational differences between the source model and the target speaker.
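The target speaker model equation of claim 21 can be tried out with a short sketch. It assumes that the modification parameters k and k1, ..., kn act as exponents on the source model terms, which makes the model linear in the parameters once logarithms are taken. The function names and the numeric values (an 80 ms mean duration, two factor effects) are hypothetical and chosen only for illustration.

```python
import math


def target_duration(d_mean, factor_effects, k, factor_ks):
    """Dur(p)_s = Dmean(p)^k * D1(f1)^k1 * ... * Dn(fn)^kn.

    d_mean         -- corrected mean duration of the phone (source model)
    factor_effects -- D1(f1), ..., Dn(fn) from the source model
    k              -- modification parameter of the phone
    factor_ks      -- modification parameters k1, ..., kn of the factors
    """
    if len(factor_effects) != len(factor_ks):
        raise ValueError("one modification parameter is needed per factor")
    duration = d_mean ** k
    for effect, k_i in zip(factor_effects, factor_ks):
        duration *= effect ** k_i
    return duration


def log_design_row(d_mean, factor_effects):
    """Regressors of the log-domain form:
    log Dur(p)_s = k*log Dmean(p) + k1*log D1(f1) + ... + kn*log Dn(fn).
    """
    return [math.log(d_mean)] + [math.log(e) for e in factor_effects]


# Hypothetical source model values for one phone in one context.
d_mean = 80.0
effects = [1.2, 0.9]

# With every parameter equal to 1, the source prediction is reproduced.
source_prediction = target_duration(d_mean, effects, 1.0, [1.0, 1.0])

# Other parameter values stretch or compress the prediction.
adapted = target_duration(d_mean, effects, 1.05, [0.8, 1.1])
```

Because the logarithm of the duration is a weighted sum of the entries of `log_design_row`, rows of this kind collected over a training corpus form a design matrix, and the parameters are determinable only when that matrix has full rank. This is consistent with the claim's requirement of a full-rank matrix built from a small set of sentences.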
23. A method of training a target speaker model for use in text to speech processing, the target speaker model reflecting durational characteristics of a target speaker, comprising the steps of:
(1) retrieving a previously developed source model having source model coefficients reflecting durational characteristics of a source speaker, the source model predicting durational characteristics of Chinese Mandarin speech;
(2) developing modification parameters reflecting differences in durational characteristics of the source speaker and those of the target speaker, developing the modification parameters including developing a target speaker training corpus reflecting speaker specific characteristics of the target speaker and processing the target speaker training corpus to produce the modification parameters, developing the target speaker training corpus including;
(a) selecting a target speaker training text and receiving voice inputs produced by a reading of the target speaker training text by the target speaker, selecting the target speaker training text including assembling a large body of text and selecting a subset of the large body of text chosen to reflect speaker specific characteristics of the large body of text, selecting the subset of the large body of text including operating on the large body of text to produce sets of feature vectors corresponding to each sentence in the large body of text, mapping the sets of feature vectors into a plurality of incidence matrices, converting the incidence matrices to design matrices based on a target speaker model equation chosen to receive parameters to create the target speaker model, and finding a matroid cover for the plurality of design matrices, finding the matroid cover for the plurality of design matrices including applying a greedy algorithm incorporating a modified Gram-Schmidt orthonormalization procedure;
(b) receiving voice inputs produced by a reading of the target speaker training text by the target speaker; and
(3) applying the modification parameters to the source model coefficients to produce the target speaker model, the target speaker model predicting durational characteristics of Chinese Mandarin speech.
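The sentence selection recited in step (2)(a) of claim 23 can be sketched as a greedy search that uses modified Gram-Schmidt orthonormalization to measure how much new information each sentence contributes. This is a simplified illustration, not the patented procedure: it assumes each sentence has already been reduced to rows of a design matrix (feature extraction and the incidence-to-design conversion are skipped), and it uses rank gain as the greedy criterion, which is an assumption. All function names and the toy data are hypothetical.

```python
import math


def mgs_residual(vec, basis):
    """Remove from vec its components along an orthonormal basis,
    one basis vector at a time (modified Gram-Schmidt)."""
    residual = list(vec)
    for q in basis:
        dot = sum(a * b for a, b in zip(residual, q))
        residual = [a - dot * b for a, b in zip(residual, q)]
    return residual


def extend_basis(rows, basis, tol=1e-9):
    """Return basis extended by the new independent directions in rows."""
    new_basis = list(basis)
    for row in rows:
        residual = mgs_residual(row, new_basis)
        norm = math.sqrt(sum(a * a for a in residual))
        if norm > tol:
            new_basis.append([a / norm for a in residual])
    return new_basis


def greedy_select(sentences, n_params, tol=1e-9):
    """Pick sentences until their stacked design rows reach full rank.

    sentences -- dict mapping a sentence id to its list of design rows
    n_params  -- number of modification parameters to be estimated
    Returns (chosen sentence ids, rank reached).
    """
    basis, chosen = [], []
    remaining = dict(sentences)
    while len(basis) < n_params and remaining:
        best_id, best_basis = None, basis
        for sent_id, rows in remaining.items():
            candidate = extend_basis(rows, basis, tol)
            if len(candidate) > len(best_basis):
                best_id, best_basis = sent_id, candidate
        if best_id is None:
            break  # no remaining sentence adds a new direction
        chosen.append(best_id)
        basis = best_basis
        del remaining[best_id]
    return chosen, len(basis)


# Toy corpus: four sentences, three parameters to estimate.
corpus = {
    "s1": [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    "s2": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "s3": [[0.0, 0.0, 1.0]],
    "s4": [[0.0, 1.0, 0.0]],
}
selected, rank = greedy_select(corpus, n_params=3)
```

On the toy corpus the search stops after two sentences, since `s2` contributes two independent directions and `s3` supplies the third. The fixed tolerance `tol` is adequate for a sketch; real design matrices would likely call for a scale-aware threshold.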
Specification