Systems and methods for selecting from multiple phonetic transcriptions for text-to-speech synthesis

US 7,869,999 B2
Filed: 08/10/2005
Issued: 01/11/2011
Est. Priority Date: 08/11/2004
Status: Active Grant

8 Assignments

0 Petitions

Abstract

A system and method for generating synthetic speech that operate in a computer-implemented Text-To-Speech (TTS) system. The system comprises at least a speaker database previously created from user recordings, a Front-End system that receives an input text, and a TTS engine. The Front-End system generates multiple phonetic transcriptions for each word of the input text, and the TTS engine uses a cost function to select which phonetic transcription is the most appropriate for searching the speaker database for the speech segments to be concatenated and synthesized.
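The selection step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Segment` type, the pitch-only `join_cost`, the toy database, and the phone labels are all assumptions made for illustration. Each candidate transcription is scored by the cheapest sequence of stored segments that realizes it, and the transcription with the lowest score is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    phone: str    # phone label this recorded unit realizes
    pitch: float  # hypothetical acoustic feature used by the toy join cost


def join_cost(a: Segment, b: Segment) -> float:
    """Toy concatenation cost: pitch mismatch at the boundary (an assumption)."""
    return abs(a.pitch - b.pitch)


def best_path_cost(phones, database):
    """Cheapest total join cost over all segment sequences realizing `phones`.

    Returns (cost, segments); the cost is infinite if a phone has no stored unit.
    """
    if not phones:
        return 0.0, []
    best = None  # one (cost, path) entry per candidate of the current phone
    for phone in phones:
        candidates = [s for s in database if s.phone == phone]
        if not candidates:
            return float("inf"), []
        if best is None:
            best = [(0.0, [c]) for c in candidates]
        else:
            # Dynamic programming: extend the cheapest predecessor for each candidate.
            best = [
                min(
                    ((cost + join_cost(path[-1], c), path + [c]) for cost, path in best),
                    key=lambda entry: entry[0],
                )
                for c in candidates
            ]
    return min(best, key=lambda entry: entry[0])


def select_transcription(transcriptions, database):
    """Return (cost, transcription) for the lowest-cost candidate transcription."""
    scored = [(best_path_cost(t, database)[0], t) for t in transcriptions]
    return min(scored, key=lambda pair: pair[0])


# Hypothetical speaker database and two whole-word pronunciations of one word.
database = [
    Segment("iy", 120.0), Segment("ay", 180.0), Segment("dh", 125.0),
    Segment("er", 130.0), Segment("er", 200.0),
]
candidates = [("iy", "dh", "er"), ("ay", "dh", "er")]
cost, preferred = select_transcription(candidates, database)
```

With this toy data the first pronunciation joins more smoothly with the stored units, so it is the one selected. A real system would use a much richer cost than a single pitch difference.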

329 Citations

19 Claims

1. At least one computer readable storage device storing instructions that, when executed on at least one processor, perform a method of selecting a preferred phonetic transcription for use in text-to-speech synthesizing an input text, the method comprising:
- generating a plurality of phonetic transcriptions for at least one word of the input text to be synthesized, each of the plurality of phonetic transcriptions corresponding to a respective pronunciation that is of the at least one word as a whole, and is different from at least one other pronunciation corresponding to at least one other of the plurality of phonetic transcriptions;
- computing at least one concatenative cost score for each one of the plurality of phonetic transcriptions to create a plurality of concatenative cost scores, the at least one concatenative cost score for each one of the plurality of phonetic transcriptions indicating at least one cost of concatenating selected speech segments from a plurality of stored speech segments associated with the respective one of the plurality of phonetic transcriptions; and
- selecting the preferred phonetic transcription from the plurality of phonetic transcriptions for use in text-to-speech synthesizing the at least one word based, at least in part, on the at least one concatenative cost score associated with the preferred phonetic transcription.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The at least one computer readable storage device of claim 1, wherein selecting the preferred phonetic transcription includes selecting a phonetic transcription having a lowest concatenative cost score from the plurality of concatenative cost scores.
  - 3. The at least one computer readable storage device of claim 1, wherein the method further comprises:
    - selecting from the plurality of stored speech segments a sequence of speech segments associated with the preferred phonetic transcription; and
    - concatenating the selected sequence of speech segments to text-to-speech synthesize the at least one word.
  - 4. The at least one computer readable storage device of claim 3, wherein the sequence of speech segments is selected based at least in part on the at least one concatenative cost score associated with the preferred phonetic transcription.
  - 5. The at least one computer readable storage device of claim 3, wherein the at least one concatenative cost score associated with the preferred phonetic transcription comprises a first set of one or more concatenative cost scores for the preferred phonetic transcription, and wherein selecting the sequence of speech segments comprises:
    - computing a second set of one or more concatenative cost scores for the preferred phonetic transcription; and
    - selecting the sequence of speech segments based at least in part on the second set of one or more concatenative cost scores.
  - 6. The at least one computer readable storage device of claim 5, wherein the first set of one or more concatenative cost scores is computed using a first concatenative cost function that favors at least one phonetic criterion, and the second set of one or more concatenative cost scores is computed using a second concatenative cost function that does not favor the at least one phonetic criterion.
  - 7. The at least one computer readable storage device of claim 1, wherein the plurality of concatenative cost scores are computed using a concatenative cost function that favors at least one phonetic criterion.
  - 8. The at least one computer readable storage device of claim 7, wherein the concatenative cost function comprises at least one prosody criterion.
  - 9. The at least one computer readable storage device of claim 8, wherein the concatenative cost function comprises at least one pitch criterion, at least one duration criterion and/or at least one energy criterion.
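Claims 5 through 9 describe scoring with two cost functions, one that favors a phonetic criterion and one that does not, built on prosody criteria such as pitch, duration and energy. The Python sketch below is one hedged reading of that arrangement, not the patent's actual cost functions. The `Unit` and `Target` types, the choice of left-context match as the phonetic criterion, the `PHONETIC_PENALTY` value, and the omission of join costs are all assumptions made to keep the example short.

```python
from dataclasses import dataclass

PHONETIC_PENALTY = 50.0  # hypothetical weight of the phonetic criterion


@dataclass(frozen=True)
class Target:
    phone: str
    pitch_hz: float
    duration_ms: float
    energy_db: float


@dataclass(frozen=True)
class Unit:
    phone: str
    left_context: str  # phone that preceded this unit in the recording
    pitch_hz: float
    duration_ms: float
    energy_db: float


def prosody_cost(unit, target, weights=(1.0, 1.0, 1.0)):
    """Weighted distance on pitch, duration and energy."""
    w_pitch, w_dur, w_energy = weights
    return (w_pitch * abs(unit.pitch_hz - target.pitch_hz)
            + w_dur * abs(unit.duration_ms - target.duration_ms)
            + w_energy * abs(unit.energy_db - target.energy_db))


def first_pass_cost(unit, target, prev_phone):
    """Favors the assumed phonetic criterion: matching left context."""
    mismatch = 0.0 if unit.left_context == prev_phone else PHONETIC_PENALTY
    return prosody_cost(unit, target) + mismatch


def second_pass_cost(unit, target, prev_phone):
    """Ignores the phonetic criterion; prosody only."""
    return prosody_cost(unit, target)


def pick_units(targets, database, cost_fn):
    """Greedy per-phone unit choice; returns (total_cost, units)."""
    total, chosen, prev = 0.0, [], "#"  # "#" marks the word boundary
    for target in targets:
        candidates = [u for u in database if u.phone == target.phone]
        if not candidates:
            return float("inf"), []
        best = min(candidates, key=lambda u: cost_fn(u, target, prev))
        total += cost_fn(best, target, prev)
        chosen.append(best)
        prev = target.phone
    return total, chosen


def plan_synthesis(transcriptions, database):
    """Pass 1 picks the transcription; pass 2 re-selects units for it."""
    preferred = min(
        transcriptions,
        key=lambda t: pick_units(t, database, first_pass_cost)[0],
    )
    _, units = pick_units(preferred, database, second_pass_cost)
    return preferred, units


def make_target(phone):
    return Target(phone, 100.0, 80.0, 60.0)  # same toy prosody for every phone


database = [
    Unit("a", "#", 100.0, 80.0, 60.0),
    Unit("b", "x", 100.0, 80.0, 60.0),
    Unit("c", "a", 110.0, 80.0, 60.0),
    Unit("c", "x", 102.0, 80.0, 60.0),
]
variant_ab = [make_target("a"), make_target("b")]
variant_ac = [make_target("a"), make_target("c")]
preferred, units = plan_synthesis([variant_ab, variant_ac], database)
```

In this toy setup the first pass prefers the a-c variant because a stored "c" was recorded after an "a", while the second pass is free to pick the "c" unit whose prosody is closest to the target. Separating the two passes lets the phonetic criterion steer the choice of pronunciation without constraining the final unit selection.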

10. A system for selecting a preferred phonetic transcription for use in synthesizing speech from an input text, the system comprising:
- at least one storage medium storing a plurality of speech segments that may be concatenated to synthesize speech;
- at least one input to receive the input text; and
- at least one computer coupled to the at least one input and capable of accessing the at least one storage medium, the at least one computer programmed to:
  - generate a plurality of phonetic transcriptions for at least one word of the input text to be synthesized, each of the plurality of phonetic transcriptions corresponding to a respective pronunciation that is of the at least one word as a whole, and is different from at least one other pronunciation corresponding to at least one other of the plurality of phonetic transcriptions;
  - compute at least one concatenative cost score for each one of the plurality of phonetic transcriptions to create a plurality of concatenative cost scores, the at least one concatenative cost score for each one of the plurality of phonetic transcriptions indicating at least one cost of concatenating selected speech segments from the stored plurality of speech segments associated with the respective one of the plurality of phonetic transcriptions; and
  - select the preferred phonetic transcription from the plurality of phonetic transcriptions for use in text-to-speech synthesizing the at least one word based, at least in part, on the at least one concatenative cost score associated with the preferred phonetic transcription.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18, 19)
  - 11. The system of claim 10, wherein the at least one computer is programmed to select as the preferred phonetic transcription a phonetic transcription having a lowest concatenative cost score from the plurality of concatenative cost scores.
  - 12. The system of claim 10, wherein the at least one computer is further programmed to:
    - select from the plurality of speech segments a sequence of speech segments associated with the preferred phonetic transcription; and
    - concatenate the selected sequence of speech segments to text-to-speech synthesize the at least one word.
  - 13. The system of claim 12, wherein the at least one computer is programmed to select the sequence of speech segments based at least in part on the at least one concatenative cost score associated with the preferred phonetic transcription.
  - 14. The system of claim 12, wherein the at least one concatenative cost score associated with the preferred phonetic transcription comprises a first set of one or more concatenative cost scores for the preferred phonetic transcription, and wherein the at least one computer is programmed to select the sequence of speech segments by:
    - computing a second set of one or more concatenative cost scores for the preferred phonetic transcription; and
    - selecting the sequence of speech segments based at least in part on the second set of one or more concatenative cost scores.
  - 15. The system of claim 14, wherein the at least one computer is programmed to compute the first set of one or more concatenative cost scores using a first concatenative cost function that favors at least one phonetic criterion, and to compute the second set of one or more concatenative cost scores using a second concatenative cost function that does not favor the at least one phonetic criterion.
  - 16. The system of claim 10, wherein the at least one computer is programmed to compute the plurality of concatenative cost scores using a concatenative cost function that favors at least one phonetic criterion.
  - 17. The system of claim 16, wherein the concatenative cost function comprises at least one prosody criterion.
  - 18. The system of claim 17, wherein the concatenative cost function comprises at least one pitch criterion, at least one duration criterion and/or at least one energy criterion.
  - 19. The system of claim 10, wherein the at least one storage medium includes a speaker database storing speech segments previously recorded from a speaker.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Crepy, Hubert; Amato, Christel; Revelin, Stephane; Waast-Richard, Claire
Primary Examiner(s): Smits, Talivaldis I
Assistant Examiner(s): Borsetti, Greg
Application Number: US11/200,808
Publication Number: US 20060041429A1
Time in Patent Office: 1,980 Days
Field of Search: 704/260, 704/258, 704/E13.002, 704/E13.012
US Class Current: 704/260
CPC Class Codes: G10L 13/08 Text analysis or generation...
