Methods and apparatus for speaker specific durational adaptation

US 6,813,604 B1
Filed: 11/13/2000
Issued: 11/02/2004
Est. Priority Date: 11/18/1999
Status: Expired due to Term

9 Assignments

0 Petitions

Abstract

A text to speech system modeling durational characteristics of a target speaker is addressed herein. A body of target speaker training text is selected having maximum possible information about speaker specific characteristics. The body of target speaker training text is read by a target speaker to produce a target speaker training corpus. A previously generated source model reflecting characteristics of a source speaker is retrieved and the target speaker training corpus is processed to produce modification parameters reflecting differences between durational characteristics of the target speaker and those predicted by the source model. The modification parameters are applied to the source model to produce a target model. Text inputs are processed using the target model to produce speech outputs reflecting durational characteristics of the target speaker.

16 Citations

23 Claims

1. A method of training a target speaker model for use in text to speech processing, the target speaker model reflecting durational characteristics of a target speaker, comprising the steps of:
- retrieving a previously developed source model having source model coefficients reflecting durational characteristics of a source speaker;
  
  developing modification parameters reflecting differences in durational characteristics of the source speaker and those of the target speaker, the modification parameters being developed using a training corpus developed using text chosen independently of text used to develop the source model; and
  
  applying the modification parameters to the source model coefficients to produce the target speaker model.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The method of claim 1 wherein the step of developing the modification parameters includes developing a target speaker training corpus reflecting speaker specific characteristics of the target speaker and processing the target speaker training corpus to produce the modification parameters.
  - 3. The method of claim 2 wherein developing the target speaker training corpus includes selecting a target speaker training text and receiving voice inputs produced by a reading of the target speaker training text by the target speaker.
  - 4. The method of claim 3 wherein selecting the target speaker training text includes assembling a large body of text and selecting a subset of the large body of text chosen to reflect speaker specific characteristics of the large body of text.
  - 5. The method of claim 4 wherein selecting the subset of the large body of text comprises finding an optimum full-rank matrix comprising a small set of sentences containing enough data to develop the modification parameters to be used to create the target speaker model.
  - 6. The method of claim 5 wherein the target speaker model is generated by developing and using parameters in the target speaker model equation $\mathrm{Dur}(p)_s = D_{\mathrm{mean}}(p)^{k} \cdot D_1(f_1)^{k_1} \cdots D_n(f_n)^{k_n}$, where $\mathrm{Dur}(p)_s$ is the duration of each phone as spoken by the target speaker $s$, $D_{\mathrm{mean}}(p)$ is the coefficient of the corrected mean duration of each phone as taken from the source model, $D_1(f_1), \ldots, D_n(f_n)$ are durational effects of each parameter present in the source model, $k$ is the modification parameter of the phone, and $k_1, \ldots, k_n$ are modification parameters of factors, the modification parameters reflecting durational differences between the source model and the target speaker.
  - 7. The method of claim 6 wherein the step of generating the modification parameters includes substituting the values of $D_{\mathrm{mean}}(p)$ and $D_1(f_1), \ldots, D_n(f_n)$ taken from the source model into the target speaker model equation for each of the phones in the target speaker training corpus, substituting the actual values of $\mathrm{Dur}(p)_s$ from the target speaker training corpus into the target speaker model equation, and solving the target speaker model equation for the values of $k, k_1, \ldots, k_n$.
  - 8. The method of claim 4 wherein selecting the subset of the large body of text includes operating on the large body of text to produce sets of feature vectors corresponding to each sentence in the large body of text, mapping the sets of feature vectors into a plurality of incidence matrices, converting the incidence matrices to design matrices based on a target speaker model equation chosen to receive parameters to create the target speaker model, and finding a matroid cover for the plurality of design matrices.
  - 9. The method of claim 8 wherein finding the matroid cover for the plurality of design matrices includes applying a greedy algorithm incorporating a modified Gram-Schmidt orthonormalization procedure.
  - 10. The method of claim 9 wherein the source model and the target model predict durational characteristics of Chinese Mandarin speech.
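
The substitution and solving steps of claims 6 and 7 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a single shared exponent k for the mean-duration term, strictly positive source-model terms, and a log-domain least-squares solve. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_modification_parameters(source_terms, target_durations):
    """Estimate [k, k_1, ..., k_n] from source-model terms and observed durations.

    source_terms: shape (num_phones, n + 1); column 0 holds D_mean(p) and the
        remaining columns hold D_1(f_1) ... D_n(f_n) from the source model.
    target_durations: shape (num_phones,); durations measured for the target speaker.
    Assumption: all values are positive, so the product model becomes linear
    in the exponents after taking logarithms.
    """
    log_terms = np.log(np.asarray(source_terms, dtype=float))
    log_durations = np.log(np.asarray(target_durations, dtype=float))
    exponents, *_ = np.linalg.lstsq(log_terms, log_durations, rcond=None)
    return exponents


def predict_durations(source_terms, exponents):
    """Evaluate Dur = D_mean^k * D_1^k_1 * ... * D_n^k_n for every phone."""
    log_terms = np.log(np.asarray(source_terms, dtype=float))
    return np.exp(log_terms @ exponents)


# Synthetic data (hypothetical): 50 phones, one mean term plus three factor terms.
rng = np.random.default_rng(0)
source_terms = rng.uniform(0.5, 2.0, size=(50, 4))
source_terms[:, 0] = rng.uniform(40.0, 200.0, size=50)  # mean durations, e.g. in ms
true_exponents = np.array([1.1, 0.8, 1.3, 0.9])
target_durations = predict_durations(source_terms, true_exponents)

fitted_exponents = fit_modification_parameters(source_terms, target_durations)
```

On noise-free synthetic data the fitted exponents should match the ones used to generate the durations. With real recordings the fit would only be approximate, and it requires the matrix of log terms to have full column rank.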

11. A text to speech system for receiving text input and producing speech output corresponding to the text input, the speech outputs reflecting durational characteristics of a target speaker, comprising:
- a text input interface for receiving the text input;
  
  a target speaker modeler for applying a target speaker model to the text input to define durational characteristics of the speech output, the target speaker model including source model coefficients reflecting durational characteristics of a source speaker and a set of modification parameters applied to the source model, the set of modification parameters reflecting differences in durational characteristics between the source speaker and the target speaker, the target speaker model and the source speaker model having been developed independently of the text input for which speech output is to be produced; and
  
  a speech output interface for producing the speech output.
- View Dependent Claims (12, 13)
- - 12. The system of claim 11, further comprising a target speaker trainer for processing a target speaker corpus to develop the modification parameters.
  - 13. The system of claim 12 wherein the target speaker corpus is obtained by reading of a target speaker text selection by the target speaker.

14. A communication system comprising:
- a communication channel; and
  
  a plurality of transmission stations, each of the plurality of transmission stations being operative to transmit a text message over the communication channel, each of the transmission stations being operative to create a set of modification parameters reflecting voice characteristics of a target speaker;
  
  a plurality of receiving stations, each of the receiving stations being operative to receive the message and the modification parameters, to apply the modification parameters to a target speaker model including source model coefficients, the target speaker model and the source speaker model having been developed using text chosen independently of text comprising the text message, and to use the target speaker model to process the text message to produce speech output reflecting durational characteristics of the target speaker.
- View Dependent Claims (15, 16, 17)
- - 15. The communication system of claim 14 wherein each of the transmission stations is operative to create a set of modification parameters by processing a training corpus created by reading of a target speaker training text by the target speaker.
  - 16. The communication system of claim 15 wherein each of the transmission stations is also operative to function as a receiving station and wherein each of the receiving stations is also operative to function as a transmission station.
  - 17. The communication system of claim 16 wherein the communication channel is the Internet.
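
The message flow of claims 14 to 17 might look like the sketch below, in which a transmission station sends text plus modification parameters and a receiving station applies them to locally stored source coefficients. Everything concrete here is an assumption made for illustration: the JSON payload layout, the toy `SOURCE_MODEL` table, and the `phones_for_text` front end are hypothetical and do not appear in the patent.

```python
import json

# Hypothetical source model held by the receiving station:
# phone -> [D_mean(p), D_1(f_1)] with made-up values.
SOURCE_MODEL = {"a": [100.0, 1.2], "n": [60.0, 0.9]}


def build_payload(text, modification_parameters):
    """Transmission station: bundle the text message with the speaker's parameters."""
    return json.dumps({"text": text, "modification_parameters": modification_parameters})


def apply_parameters(source_terms, exponents):
    """Raise each source-model term to its exponent and multiply the results."""
    duration = 1.0
    for term, exponent in zip(source_terms, exponents):
        duration *= term ** exponent
    return duration


def receive(payload, phones_for_text):
    """Receiving station: decode the payload and compute a duration for each phone.

    phones_for_text is a stand-in for a text analysis front end that maps
    text to a phone sequence.
    """
    message = json.loads(payload)
    exponents = message["modification_parameters"]
    phones = phones_for_text(message["text"])
    return [(phone, apply_parameters(SOURCE_MODEL[phone], exponents)) for phone in phones]


payload = build_payload("an", [1.0, 1.0])
durations = receive(payload, list)  # list("an") serves as a trivial phone splitter
```

The point of the design is that only the small parameter set travels with the text, while the larger source model stays at the receiver.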

18. A voice communication system comprising:
- a communication channel;
  
  a first transceiver operative to store a set of modification parameters reflecting speech characteristics of a user and to transmit the set of modification parameters upon initiation of a communication connection, the first transceiver being also operative to receive a voice input from a speaker and to process the voice input using a speech to text system to produce a text transmission, the first transceiver being operative to transmit the text transmission; and
  
  a second transceiver operative to receive the set of modification parameters from the first transceiver and to apply the modification parameters to a target speaker model including source model coefficients, the target speaker model and the source model coefficients having been developed using text chosen independently of the text comprising the text transmission, the second transceiver being also operative to receive the text transmission from the first transceiver and process the text transmission using the target speaker model to produce speech output reflecting the durational characteristics of the target speaker.
- View Dependent Claims (19, 20)
- - 19. The system of claim 18 wherein the first transceiver is also operative to receive modification parameters from the second transceiver, to create a target speaker model using the modification parameters, to receive a text transmission from the second transceiver and to process the text transmission using the target speaker model to produce speech output reflecting the durational characteristics of the target speaker, and wherein the second transceiver is also operative to store a set of modification parameters reflecting speech characteristics of a user and to transmit the set of modification parameters upon initiation of a communication connection, to receive a voice input from a speaker and to process the voice input using a speech to text system to produce a text transmission and to transmit the text transmission.
  - 20. The communication system of claim 19 wherein the communication channel is a wireless communication channel.

21. A method of training a target speaker model for use in text to speech processing, the target speaker model reflecting durational characteristics of a target speaker, comprising the steps of:
- (1) retrieving a previously developed source model having source model coefficients reflecting durational characteristics of a source speaker;
  
  (2) developing modification parameters reflecting differences in durational characteristics of the source speaker and those of the target speaker, developing the modification parameters including developing a target speaker training corpus reflecting speaker specific characteristics of the target speaker and processing the target speaker training corpus to produce the modification parameters, developing the target speaker training corpus including:
  
  (a) selecting a target speaker training text and receiving voice inputs produced by a reading of the target speaker training text by the target speaker, selecting the target speaker training text including assembling a large body of text and selecting a subset of the large body of text chosen to reflect speaker specific characteristics of the large body of text, selecting the subset of the large body of text comprising finding an optimum full-rank matrix comprising a small set of sentences containing enough data to develop the modification parameters to be used to create the target speaker model;
  
  (b) receiving voice inputs produced by a reading of the target speaker training text by the target speaker; and
  
  (3) applying the modification parameters to the source model coefficients to produce the target speaker model, generating the target speaker model including developing and using parameters in the target speaker model equation $\mathrm{Dur}(p)_s = D_{\mathrm{mean}}(p)^{k} \cdot D_1(f_1)^{k_1} \cdots D_n(f_n)^{k_n}$, where $\mathrm{Dur}(p)_s$ is the duration of each phone as spoken by the target speaker $s$, $D_{\mathrm{mean}}(p)$ is the coefficient of the corrected mean duration of each phone as taken from the source model, $D_1(f_1), \ldots, D_n(f_n)$ are durational effects of each parameter present in the source model, $k$ is the modification parameter of the phone, and $k_1, \ldots, k_n$ are modification parameters of factors, the modification parameters reflecting durational differences between the source model and the target speaker.
- View Dependent Claims (22)
- - 22. The method of claim 21 wherein the step of generating the modification parameters includes substituting the values of $D_{\mathrm{mean}}(p)$ and $D_1(f_1), \ldots, D_n(f_n)$ taken from the source model into the target speaker model equation for each of the phones in the target speaker training corpus, substituting the actual values of $\mathrm{Dur}(p)_s$ from the target speaker training corpus into the target speaker model equation, and solving the target speaker model equation for the values of $k, k_1, \ldots, k_n$.
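
One way to read the solving step of claim 22 is given below. This is an interpretation, not something the claim spells out. Assuming every source-model term and every measured duration is positive, taking logarithms turns the product model into an equation that is linear in the unknown exponents. The index $j$ over the $m$ phones of the training corpus is notation introduced here for illustration.

```latex
\log \mathrm{Dur}(p_j)_s
  = k \,\log D_{\mathrm{mean}}(p_j)
  + \sum_{i=1}^{n} k_i \,\log D_i(f_{i,j}),
  \qquad j = 1, \ldots, m
```

These are $m$ linear equations in the $n + 1$ unknowns $k, k_1, \ldots, k_n$. They have a unique least-squares solution when the matrix of logarithmic terms has full column rank, which is plausibly why the training text is selected through a full-rank matrix criterion.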

23. A method of training a target speaker model for use in text to speech processing, the target speaker model reflecting durational characteristics of a target speaker, comprising the steps of:
- (1) retrieving a previously developed source model having source model coefficients reflecting durational characteristics of a source speaker, the source model predicting durational characteristics of Chinese Mandarin speech;
  
  (2) developing modification parameters reflecting differences in durational characteristics of the source speaker and those of the target speaker, developing the modification parameters including developing a target speaker training corpus reflecting speaker specific characteristics of the target speaker and processing the target speaker training corpus to produce the modification parameters, developing the target speaker training corpus including:
  
  (a) selecting a target speaker training text and receiving voice inputs produced by a reading of the target speaker training text by the target speaker, selecting the target speaker training text including assembling a large body of text and selecting a subset of the large body of text chosen to reflect speaker specific characteristics of the large body of text, selecting the subset of the large body of text including operating on the large body of text to produce sets of feature vectors corresponding to each sentence in the large body of text, mapping the sets of feature vectors into a plurality of incidence matrices, converting the incidence matrices to design matrices based on a target speaker model equation chosen to receive parameters to create the target speaker model, and finding a matroid cover for the plurality of design matrices, finding the matroid cover for the plurality of design matrices including applying a greedy algorithm incorporating a modified Gram-Schmidt orthonormalization procedure;
  
  (b) receiving voice inputs produced by a reading of the target speaker training text by the target speaker; and
  
  (3) applying the modification parameters to the source model coefficients to produce the target speaker model, the target speaker model predicting durational characteristics of Chinese Mandarin speech.
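
The claims name a greedy algorithm incorporating a modified Gram-Schmidt orthonormalization procedure but do not give its details. The sketch below is one plausible reading, under stated assumptions: each sentence contributes a small design matrix (one row per phone), and at each step the sentence adding the most new linearly independent rows is chosen until the selected rows reach full column rank. The function names, the tolerance, and the toy matrices are hypothetical.

```python
import numpy as np


def mgs_residual(basis, vector):
    """Remove from vector its components along each orthonormal basis vector.

    Projections are subtracted one at a time from the running residual,
    which is the modified (numerically more stable) Gram-Schmidt ordering.
    """
    residual = np.array(vector, dtype=float)
    for q in basis:
        residual -= (q @ residual) * q
    return residual


def rank_gain(basis, rows, tol=1e-9):
    """Return how many new independent directions rows add, plus the extended basis."""
    extended = list(basis)
    gain = 0
    for row in rows:
        residual = mgs_residual(extended, row)
        norm = np.linalg.norm(residual)
        if norm > tol:
            extended.append(residual / norm)
            gain += 1
    return gain, extended


def greedy_select(design_matrices):
    """Greedily pick sentence indices until the chosen rows span every column."""
    num_columns = design_matrices[0].shape[1]
    basis, chosen = [], []
    remaining = set(range(len(design_matrices)))
    while len(basis) < num_columns and remaining:
        # Ties go to the lowest sentence index because candidates are scanned in order.
        best = max(sorted(remaining), key=lambda i: rank_gain(basis, design_matrices[i])[0])
        gain, extended = rank_gain(basis, design_matrices[best])
        if gain == 0:
            break  # no remaining sentence adds information
        basis = extended
        chosen.append(best)
        remaining.remove(best)
    return chosen, len(basis)


# Toy design matrices for four sentences over three model parameters.
sentences = [
    np.array([[1, 0, 0], [1, 0, 0]]),
    np.array([[1, 1, 0], [0, 1, 0]]),
    np.array([[0, 0, 1]]),
    np.array([[1, 1, 0]]),
]
chosen, rank = greedy_select(sentences)
```

In this toy case the second and third sentences together already cover all three parameters, so the other two are never read by the speaker.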

Current Assignee: WSOU Investments, LLC (WSOU Holdings, LLC)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: van Santen, Jan Pieter Hendrik; Shih, Chi-Lin
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Harper, V Paul
Application Number: US09/711,563
Time in Patent Office: 1,450 Days
Field of Search: 705/37, 705/14, 705/1, 704/227, 704/275, 704/268, 704/267, 704/260, 704/258, 358/1, 358/15
US Class Current: 704/260
CPC Class Codes:
G10L 13/033   Voice editing, e.g. manipul...
G10L 15/07   to the speaker
G10L 2021/0135   Voice conversion or morphing
