Training of text-to-speech systems

US 20020143542A1
Filed: 03/29/2001
Published: 10/03/2002
Est. Priority Date: 03/29/2001
Status: Active Grant

Abstract

Building a data-driven text-to-speech system involves collecting a database of natural speech from which to train models or select segments for concatenation. Typically the speech in that database is produced by a single speaker. In this invention we include in our database speech from a multiplicity of speakers.

18 Claims

1. A method of constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
- obtaining a set of features and a first corresponding observation value from a first training speaker;
- obtaining said set of features and a second corresponding observation value from a second training speaker; and
- pooling said first and second corresponding observation values to obtain the model.

2. A method of constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
- obtaining a set of features and a corresponding observation value from a first training speaker;
- repeating said step of obtaining a set of features and a corresponding observation value for each of a plurality of additional speakers; and
- pooling said corresponding observation values, from said first speaker and said additional speakers, to obtain the model.
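
The pooling step in claims 1 and 2 can be illustrated with a minimal sketch. The claims do not say what form the model takes, so the lookup table of per-context means below is an assumption, as are the names `pool_observations`, `build_model`, the feature tuples, and the pitch figures.

```python
from collections import defaultdict
from statistics import mean


def pool_observations(speakers):
    """Merge (features, observation) pairs from every speaker into one pool.

    `speakers` maps a speaker id to a list of (features, observation) pairs,
    where `features` is a hashable tuple describing the context.
    """
    pooled = defaultdict(list)
    for samples in speakers.values():
        for features, observation in samples:
            pooled[features].append(observation)
    return dict(pooled)


def build_model(pooled):
    """Assumed model form: the mean pooled observation for each feature set."""
    return {features: mean(values) for features, values in pooled.items()}


# Hypothetical training data: two speakers, observation values in Hz.
training_data = {
    "speaker_a": [
        (("stressed", "phrase_final"), 180.0),
        (("unstressed", "phrase_medial"), 120.0),
    ],
    "speaker_b": [
        (("stressed", "phrase_final"), 220.0),
    ],
}

pooled = pool_observations(training_data)
model = build_model(pooled)
```

In this sketch a feature set seen by both speakers contributes both observation values to the same pool, so the resulting model is not tied to a single voice.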

3. A method for enrolling training data for a text-to-speech synthesis system, said method comprising the steps of:
- collecting speech data from at least two speakers;
- ascertaining at least one characteristic relating to the speech data of each speaker; and
- creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.

4. The method according to claim 3, wherein said ascertaining step comprises obtaining a set of features and a corresponding observation value from each of said at least two speakers.

5. The method according to claim 4, wherein said step of creating a target range comprises pooling the observation values obtained from each of said at least two speakers.

6. The method according to claim 4, wherein said step of creating a target range of speech data further comprises normalizing the observation values obtained from each of said at least two speakers.

7. The method according to claim 6, wherein:
- the observation values comprise pitch values; and
- said normalizing step comprises calculating average pitch over a predetermined quantity of speech data and thence obtaining normalized pitch values via dividing each pitch value within the predetermined quantity of speech data by said average.

8. The method according to claim 7, wherein said transforming step comprises multiplying each normalized pitch value by a target pitch value, the target pitch value being the average pitch of a target speaker.
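
The arithmetic in claims 7 and 8 (divide each pitch value by the speaker's average, then multiply by the target speaker's average) can be sketched as follows. The function names, the sample contours, and the 150 Hz target average are hypothetical, and each list stands in for the "predetermined quantity of speech data", which the claims leave unspecified.

```python
from statistics import mean


def normalize_pitch(pitch_values):
    """Divide each pitch value by the average over the given stretch of speech.

    Assumes a non-empty list of positive pitch values (voiced frames only).
    """
    average = mean(pitch_values)
    return [p / average for p in pitch_values]


def transform_to_target(normalized_values, target_average):
    """Multiply each normalized value by the target speaker's average pitch."""
    return [n * target_average for n in normalized_values]


# Hypothetical pitch contours in Hz from two training speakers.
speaker_a = [100.0, 120.0, 80.0]   # average 100 Hz
speaker_b = [200.0, 240.0, 160.0]  # average 200 Hz

normalized_a = normalize_pitch(speaker_a)
normalized_b = normalize_pitch(speaker_b)

# Hypothetical target speaker with an average pitch of 150 Hz.
retargeted_a = transform_to_target(normalized_a, 150.0)
retargeted_b = transform_to_target(normalized_b, 150.0)
```

Here the two contours differ only by a constant factor, so after normalization they coincide and both map onto the same range around the target average. That is what allows observation values from different speakers to be pooled.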

9. An apparatus for constructing a model for use in a text-to-speech synthesis system, said apparatus comprising:
- an obtaining arrangement which obtains a set of features and a first corresponding observation value from a first training speaker;
- said obtaining arrangement being adapted to obtain said set of features and a second corresponding observation value from a second training speaker; and
- a pooling arrangement which pools said first and second corresponding observation values to obtain the model.

10. An apparatus for constructing a model for use in a text-to-speech synthesis system, said apparatus comprising:
- an obtaining arrangement which obtains a set of features and a corresponding observation value from a first training speaker;
- said obtaining arrangement being adapted to further obtain a set of features and a corresponding observation value for each of a plurality of additional speakers; and
- a pooling arrangement which pools said corresponding observation values, from said first speaker and said additional speakers, to obtain the model.

11. An apparatus for enrolling training data for a text-to-speech synthesis system, said apparatus comprising:
- a collector arrangement which collects speech data from at least two speakers;
- an ascertaining arrangement which ascertains at least one characteristic relating to the speech data of each speaker; and
- a target range creator which creates a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.

12. The apparatus according to claim 11, wherein said ascertaining arrangement is adapted to obtain a set of features and a corresponding observation value from each of said at least two speakers.

13. The apparatus according to claim 12, wherein said target range creator is adapted to pool the observation values obtained from each of said at least two speakers.

14. The apparatus according to claim 12, wherein said target range creator comprises a normalizer which normalizes the observation values obtained from each of said at least two speakers.

15. The apparatus according to claim 14, wherein:
- the observation values comprise pitch values; and
- said normalizer is adapted to calculate average pitch over a predetermined quantity of speech data and thence obtain normalized pitch values via dividing each pitch value within the predetermined quantity of speech data by said average.

16. The apparatus according to claim 15, wherein said target range creator is adapted to multiply each normalized pitch value by a target pitch value, the target pitch value being the average pitch of a target speaker.

17. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for constructing a model for use in a text-to-speech synthesis system, said method comprising the steps of:
- obtaining a set of features and a first corresponding observation value from a first training speaker;
- obtaining said set of features and a second corresponding observation value from a second training speaker; and
- pooling said first and second corresponding observation values to obtain the model.

18. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for enrolling training data for a text-to-speech synthesis system, said method comprising the steps of:
- collecting speech data from at least two speakers;
- ascertaining at least one characteristic relating to the speech data of each speaker; and
- creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Eide, Ellen M.

Granted Patent: US 6,535,852 B2
US Class Current: 704/260
CPC Class Codes:
G10L 13/02 Methods for producing synth...
G10L 13/04 Details of speech synthesis...
