Training of text-to-speech systems

US 6,535,852 B2
Filed: 03/29/2001
Issued: 03/18/2003
Est. Priority Date: 03/29/2001
Status: Expired due to Term

8 Assignments

0 Petitions

Abstract

Building a data-driven text-to-speech system involves collecting a database of natural speech from which to train models or select segments for concatenation. Typically the speech in that database is produced by a single speaker. In this invention we include in our database speech from a multiplicity of speakers.

179 Citations

18 Claims

1. A method of constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
- providing a first input of speech from a first training speaker, the first input of speech including at least one sentence;
  
  providing a second input of speech from a second training speaker, the second input of speech including at least one sentence;
  
  obtaining a first set of features and a first corresponding observation value from the first input of speech;
  
  said step of obtaining a first set of features and a first corresponding observation value including tracking pitch over each sentence;
  
  obtaining a second set of features and a second corresponding observation value from the second input of speech;
  
  said step of obtaining a second set of features and a second corresponding observation value including tracking pitch over each sentence; and
  
  pooling said first and second corresponding observation values to obtain the model.
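
The claims recite "tracking pitch over each sentence" without naming a tracking technique. As an illustration only, the sketch below uses frame-wise autocorrelation peak picking, a common approach that is assumed here and is not stated in the patent. The function names `estimate_f0` and `track_pitch`, the frame sizes, the search range, and the voicing threshold are all hypothetical choices.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0,
                voicing_threshold=0.3):
    """Estimate the pitch (Hz) of one frame; return 0.0 if it looks unvoiced.

    Assumed technique: pick the lag with the largest autocorrelation
    inside the plausible pitch range.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]          # remove DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:                      # silent frame
        return 0.0
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # Treat weakly periodic frames as unvoiced.
    if best_lag == 0 or best_corr / energy < voicing_threshold:
        return 0.0
    return sample_rate / best_lag

def track_pitch(samples, sample_rate, frame_len=320, hop=160):
    """Return one pitch estimate per frame across a whole sentence."""
    return [estimate_f0(samples[start:start + frame_len], sample_rate)
            for start in range(0, len(samples) - frame_len + 1, hop)]

# Synthetic stand-in for a sentence: 0.1 s of a 200 Hz tone, then 0.1 s of silence.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
sentence = tone + [0.0] * 800
contour = track_pitch(sentence, sample_rate)
```

Here `contour` holds one value per frame, with 0.0 marking frames judged unvoiced or silent. Real speech would need a more robust tracker; the sketch only shows the shape of a per-sentence pitch contour.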

2. A method of constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
- providing a first input of speech from a first training speaker, the first input of speech including at least one sentence;
  
  providing additional inputs of speech from a plurality of additional training speakers, the additional inputs of speech each including at least one sentence;
  
  obtaining a set of features and a corresponding observation value from the first input of speech;
  
  said step of obtaining a first set of features and a first corresponding observation value including tracking pitch over each sentence;
  
  repeating said step of obtaining a set of features and a corresponding observation value, including tracking pitch over each sentence, for each of the plurality of additional inputs of speech;
  
  pooling said corresponding observation values, from said first speaker and said additional speakers, to obtain the model.
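
The pooling step in claims 1 and 2 can be pictured as merging (features, observation) pairs from every training speaker into one collection before a model is estimated. The sketch below is a hedged illustration: the feature tuples, the pitch values, and the choice of a per-context mean as the "model" are assumptions made for the example and are not taken from the patent.

```python
from collections import defaultdict
from statistics import mean

def pool_observations(*speaker_data):
    """Merge (features, observation) pairs from any number of speakers.

    Each argument is one speaker's list of pairs; observations that share
    the same feature tuple end up in the same pooled list.
    """
    pooled = defaultdict(list)
    for data in speaker_data:
        for features, observation in data:
            pooled[features].append(observation)
    return dict(pooled)

def build_model(pooled):
    """Hypothetical model: the mean pooled observation per feature context."""
    return {features: mean(values) for features, values in pooled.items()}

# Made-up data: feature tuples describe a context, observations are pitch in Hz.
speaker_a = [(("stressed", "phrase_final"), 110.0),
             (("unstressed", "phrase_medial"), 95.0)]
speaker_b = [(("stressed", "phrase_final"), 210.0),
             (("unstressed", "phrase_medial"), 190.0)]

model = build_model(pool_observations(speaker_a, speaker_b))
```

Because `pool_observations` accepts any number of speakers, the same call covers the two-speaker case of claim 1 and the plurality of additional speakers in claim 2.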

3. A method for enrolling training data for a text-to-speech synthesis system, said method comprising the steps of:
- collecting speech data from at least two speakers, the speech data from each speaker including at least one sentence;
  
  ascertaining at least one characteristic relating to the speech data of each speaker;
  
  said ascertaining step comprising tracking pitch over each sentence; and
  
  creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.

4. The method according to claim 3, wherein said ascertaining step comprises obtaining a set of features and a corresponding observation value from each of said at least two speakers.

5. The method according to claim 4, wherein said step of creating a target range comprises pooling the observation values obtained from each of said at least two speakers.

6. The method according to claim 4, wherein said step of creating a target range of speech data further comprises normalizing the observation values obtained from each of said at least two speakers.

7. The method according to claim 6, wherein:

8. The method according to claim 7, wherein said transforming step comprises multiplying each normalized pitch value by a target pitch value, the target pitch value being the average pitch of a target speaker.
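
Claims 6 and 8 describe normalizing each speaker's observations and then multiplying each normalized pitch value by the average pitch of a target speaker. The text of claim 7 is cut off in this record, so the normalization itself is not given here. The sketch below assumes it means dividing each pitch value by that speaker's own average voiced pitch; that assumption, the function names, and the sample values are all hypothetical.

```python
from statistics import mean

def normalize_pitch(pitch_values):
    """Divide each voiced pitch value by the speaker's average voiced pitch.

    ASSUMPTION: this normalization is a guess at what the truncated claim 7
    specifies. Values of 0.0 mark unvoiced frames and are left at 0.0.
    """
    voiced = [p for p in pitch_values if p > 0]
    average = mean(voiced)
    return [p / average if p > 0 else 0.0 for p in pitch_values]

def to_target_range(normalized_values, target_average_pitch):
    """Multiply each normalized value by the target speaker's average pitch."""
    return [v * target_average_pitch for v in normalized_values]

# Made-up pitch contours (Hz) for two training speakers; 0.0 = unvoiced.
low_voice = [100.0, 120.0, 0.0, 80.0]
high_voice = [200.0, 240.0, 0.0, 160.0]
target_average_pitch = 150.0  # hypothetical average pitch of the target speaker

mapped_low = to_target_range(normalize_pitch(low_voice), target_average_pitch)
mapped_high = to_target_range(normalize_pitch(high_voice), target_average_pitch)
```

Under this assumption both contours land in the same range around the target average, which is what allows data from differently pitched speakers to be pooled.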

9. An apparatus for constructing a model for use in a text-to-speech synthesis system, said apparatus comprising:
- an input arrangement which provides:
  
  a first input of speech from a first training speaker, the first input of speech including at least one sentence; and
  
  a second input of speech from a second training speaker, the second input of speech including at least one sentence;
  
  an extracting arrangement which obtains a first set of features and a first corresponding observation value from the first input of speech;
  
  said extracting arrangement being adapted to further obtain a second set of features and a second corresponding observation value from the input of speech;
  
  said extracting arrangement being adapted to track pitch over each sentence; and
  
  a pooling arrangement which pools said first and second corresponding observation values to obtain the model.

10. An apparatus for constructing a model for use in a text-to-speech synthesis system, said apparatus comprising:
- an input arrangement which provides:
  
  a first input of speech from a first training speaker, the first input of speech including at least one sentence; and
  
  additional inputs of speech from a plurality of additional training speakers, the additional inputs of speech each including at least one sentence;
  
  an extracting arrangement which obtains a set of features and a corresponding observation value from the first input of speech;
  
  said extracting arrangement being adapted to further obtain a set of features and a corresponding observation value for each of the plurality of additional inputs of speech;
  
  said extracting arrangement being adapted to track pitch over each sentence; and
  
  a pooling arrangement which pools said corresponding observation values, from said first speaker and said additional speakers, to obtain the model.

11. An apparatus for enrolling training data for a text-to-speech synthesis system, said apparatus comprising:
- an input arrangement which collects speech data from at least two speakers, the speech data from each speaker including at least one sentence;
  
  an ascertaining arrangement which ascertains at least one characteristic relating to the speech data of each speaker;
  
  said ascertaining arrangement being adapted to track pitch over each sentence; and
  
  a target range creator which creates a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.

12. The apparatus according to claim 11, wherein said ascertaining arrangement is adapted to obtain a set of features and a corresponding observation value from each of said at least two speakers.

13. The apparatus according to claim 12, wherein target range creator is adapted to pool the observation values obtained from each of said at least two speakers.

14. The apparatus according to claim 12, wherein said target range creator comprises a normalizer which normalizes the observation values obtained from each of said at least two speakers.

15. The apparatus according to claim 14, wherein:

16. The apparatus according to claim 15, wherein said target range creator is adapted to multiply each normalized pitch value by a target pitch value, the target pitch value being the average pitch of a target speaker.

17. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
- providing a first input of speech from a first training speaker, the first input of speech including at least one sentence;
  
  providing a second input of speech from a second training speaker, the second input of speech including at least one sentence;
  
  obtaining a first set of features and a first corresponding observation value from the first input of speech;
  
  said step of obtaining a first set of features and a first corresponding observation value including tracking pitch over each sentence;
  
  obtaining a second set of features and a second corresponding observation value from the second input of speech;
  
  said step of obtaining a second set of features and a second corresponding observation value including tracking pitch over each sentence; and
  
  pooling said first and second corresponding observation values to obtain the model.

18. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for enrolling training data for a text-to-speech synthesis system, said method comprising the steps of:
- collecting speech data from at least two speakers, the speech data from each speaker including at least one sentence;
  
  ascertaining at least one characteristic relating to the speech data of each speaker;
  
  said ascertaining step comprising tracking pitch over each sentence; and
  
  creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Eide, Ellen M.
Primary Examiner(s): MCFADDEN, SUSAN IRIS
Application Number: US09/821,399
Publication Number: US 20020143542A1
Time in Patent Office: 719 Days
Field of Search: 704/258, 704/260, 704/266, 704/267, 704/268, 704/270, 704/275
US Class Current: 704/260
CPC Class Codes:
G10L 13/02 Methods for producing synth...
G10L 13/04 Details of speech synthesis...
