Technique of Generating High Quality Synthetic Speech

US 20080183473A1
Filed: 01/30/2008
Published: 07/31/2008
Est. Priority Date: 01/30/2007
Status: Active Grant

Abstract

A synthetic speech system includes a phoneme segment storage section for storing multiple phoneme segment data pieces; a synthesis section for generating voice data from text by reading phoneme segment data pieces representing the pronunciation of an inputted text from the phoneme segment storage section and connecting the phoneme segment data pieces to each other; a computing section for computing a score indicating the unnaturalness of the voice data representing the synthetic speech of the text; a paraphrase storage section for storing multiple paraphrases of the multiple first phrases; a replacement section for searching the text and replacing with appropriate paraphrases; and a judgment section for outputting generated voice data on condition that the computed score is smaller than a reference value and for inputting the text after the replacement to the synthesis section to cause the synthesis section to further generate voice data for the text.
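The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the `max_rounds` bound, and the toy synthesizer and scorer (word lengths standing in for voice data) are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple


def generate_speech(
    text: str,
    synthesize: Callable[[str], List[float]],
    score_unnaturalness: Callable[[List[float]], float],
    paraphrases: Dict[str, str],
    reference_value: float,
    max_rounds: int = 5,  # hypothetical safety bound, not stated in the source
) -> Tuple[str, List[float]]:
    """Synthesize, score, and re-synthesize after paraphrasing while the score is too high."""
    voice = synthesize(text)
    for _ in range(max_rounds):
        # Judgment: output the voice data once the score is below the reference value.
        if score_unnaturalness(voice) < reference_value:
            break
        # Replacement: swap one first notation for its associated second notation.
        replaced = text
        for first, second in paraphrases.items():
            if first in replaced:
                replaced = replaced.replace(first, second, 1)
                break
        if replaced == text:
            break  # nothing left to paraphrase; keep the current voice data
        text = replaced
        voice = synthesize(text)  # generate further voice data for the replaced text
    return text, voice


# Toy stand-ins (assumptions): "voice data" is a list of word lengths,
# and the longest word determines the unnaturalness score.
def toy_synthesize(text: str) -> List[float]:
    return [float(len(word)) for word in text.split()]


def toy_score(voice: List[float]) -> float:
    return max(voice, default=0.0)


final_text, voice = generate_speech(
    "please commence now",
    toy_synthesize,
    toy_score,
    paraphrases={"commence": "start"},
    reference_value=7.0,
)
print(final_text)
```

In a real system the synthesizer would concatenate stored phoneme segments and the scorer would inspect the resulting voice data; only the loop structure is meant to carry over.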

12 Claims

1. A system for generating synthetic speech, comprising:
- a phoneme segment storage section for storing a plurality of phoneme segment data pieces indicating a plurality of sounds of phonemes which are different from each other; and
  
  a synthesis section for generating voice data representing synthetic speech of text by receiving an inputted text, by reading out phoneme segment data pieces that correspond to respective phonemes indicating the pronunciation of the inputted text, and then by connecting the read-out phoneme segment data pieces to each other;
  
  a computing section for computing a score indicating the unnaturalness of the synthetic speech of the text, on the basis of the voice data;
  
  a paraphrase storage section for storing a plurality of second notations, the second notations being paraphrases of first notations and for associating the second notations with the respective first notations;
  
  a replacement section for searching the text for a notation matching with any of the first notations and for replacing the searched-out notation with the second notation corresponding to the first notation; and
  
  a judgment section for receiving the score and for outputting the generated voice data on condition that the score is smaller than a predetermined reference value, and for inputting the text to the synthesis section in order for the synthesis section to generate further voice data for the text after replacement when the score is equal to or greater than the reference value.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The system according to claim 1, wherein the computing section computes, as the score, a degree of difference in pronunciation between first and second phoneme segment data pieces contained in the voice data and connected to each other, at a boundary between the first and second phoneme segment data pieces.
  - 3. The system according to claim 2, wherein the phoneme segment storage section stores a data piece representing fundamental frequency and tone of the sound of each phoneme as the phoneme segment data piece, and the computing section computes, as the score, a degree of difference in the fundamental frequency and tone between the first and second phoneme segment data pieces at the boundary between the first and second phoneme segment data pieces.
  - 4. The system according to claim 1, wherein the synthesis section includes:
    - a word storage section for storing a reading way of each of a plurality of words in association with a notation of the word;
      
      a word search section for searching the word storage section for a word whose notation matches with the notation of each of the words contained in the inputted text, and for generating a reading way of the text by reading the reading ways corresponding to the respective searched-out words from the word storage section, and then by connecting the reading ways to each other; and
      
      a phoneme segment search section for generating the voice data by retrieving a phoneme segment data piece representing a prosody closest to a prosody of each phoneme determined based on the generated reading way, from the phoneme segment storage section, and then by connecting the plurality of retrieved phoneme segment data pieces to each other, and the computing section computes, as the score, a difference between the prosody of each phoneme determined based on the generated reading way, and a prosody indicated by the phoneme segment data piece retrieved in correspondence to each phoneme.
  - 5. The system according to claim 1, wherein the synthesis section includes:
    - a word storage section for storing a reading way of each of a plurality of words in association with a notation of the word;
      
      a word search section for searching the word storage section for a word whose notation matches with the notation of each of the words contained in the inputted text, and for generating a reading way of the text by reading the reading ways corresponding to the respective searched-out words from the word storage section, and then by connecting the reading ways to each other;
      
      a phoneme segment search section for generating the voice data by retrieving a phoneme segment data piece representing a tone closest to tone of each phoneme determined based on the generated reading way, from the phoneme segment storage section, and then by connecting the plurality of retrieved phoneme segment data pieces to each other, and the computing section computes, as the score, a difference between the tone of each phoneme determined based on the generated reading way, and the tone indicated by the phoneme segment data piece retrieved in correspondence to each phoneme.
  - 6. The system according to claim 1, wherein the phoneme segment storage section previously obtains target voice data that is target speaker's voice data to be targeted for synthetic speech generation, and then previously generates and stores a plurality of phoneme segment data pieces representing sounds of a plurality of phonemes contained in the target voice data, the paraphrase storage section stores, as each of the plurality of second notations, the notation of a word contained in a text representing the content of the target voice data, and the replacement section replaces a notation contained in the inputted text and matching with any of the first notations, with one of the second notations that is a notation of a word contained in the text representing the content of the target voice data.
  - 7. The system according to claim 1, wherein the replacement section computes a score indicating the unnaturalness of synthetic speech corresponding to each of combinations of a predetermined number of words successively written in the inputted text, searches the paraphrase storage section for a notation matching with a notation of the word contained in the combination having a largest score thus computed, and replaces the notation of the word with the second notation.
  - 8. The system according to claim 1, wherein the paraphrase storage section further stores a similarity score in association with each of combinations of a first notation and a second notation that is a paraphrase of the first notation, the similarity score indicating a degree of similarity between meanings of the first and second notations, and when a notation contained in the inputted text matches with each of a plurality of first notations, the replacement section replaces the matching notation with the second notation corresponding to one of the plurality of first notations having a highest similarity score.
  - 9. The system according to claim 1, wherein the replacement section does not replace a notation of a sentence containing at least any one of a proper name and a numeral value, but searches a sentence not containing any one of a proper name and a numeral value to find a notation matching with any of the first notations, and replaces the found notation with the second notation corresponding to the first notation.
  - 10. The system according to claim 1, further comprising a display section for displaying the text, having the notation replaced, to a user on condition that the replacement section replaces the notation, wherein the judgment section outputs voice data based on the text having the notation replaced, also on condition that an input permitting the replacement in the displayed text is received, and outputs voice data based on the text before the replacement no matter how great the score is, on condition that an input permitting the replacement in the displayed text is not received.
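
The boundary-based score of claims 2 and 3 could be computed along the following lines. This is a hedged sketch only: the `Segment` fields, the representation of tone as a small feature vector, the weights, and the choice to report the worst boundary as the overall score are assumptions made for illustration and are not specified in the claims.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    """Hypothetical phoneme segment data piece (field names are assumptions)."""
    phoneme: str
    f0_start: float              # fundamental frequency at the left edge, in Hz
    f0_end: float                # fundamental frequency at the right edge, in Hz
    tone_start: Sequence[float]  # tone (spectral) features at the left edge
    tone_end: Sequence[float]    # tone (spectral) features at the right edge


def boundary_cost(left: Segment, right: Segment,
                  w_f0: float = 1.0, w_tone: float = 1.0) -> float:
    """Degree of difference in fundamental frequency and tone at one boundary."""
    f0_diff = abs(left.f0_end - right.f0_start)
    tone_diff = math.dist(left.tone_end, right.tone_start)  # Euclidean distance
    return w_f0 * f0_diff + w_tone * tone_diff


def unnaturalness(segments: List[Segment]) -> float:
    """Assumed aggregation: the largest cost over all adjacent segment pairs."""
    return max(
        (boundary_cost(a, b) for a, b in zip(segments, segments[1:])),
        default=0.0,
    )


segments = [
    Segment("k", 120.0, 125.0, [1.0, 0.0], [1.0, 0.0]),
    Segment("a", 125.0, 130.0, [1.0, 0.0], [0.0, 0.0]),
    Segment("t", 150.0, 150.0, [3.0, 4.0], [3.0, 4.0]),
]
print(unnaturalness(segments))
```

Here the first boundary matches perfectly, while the second has a 20 Hz jump in fundamental frequency and a tone distance of 5, so the second boundary determines the score. Summing or averaging over boundaries would be an equally plausible aggregation.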

11. A method for generating synthetic speech, comprising the steps of:
- storing a plurality of phoneme segment data pieces indicating a plurality of sounds of phonemes different from each other;
  
  generating voice data representing synthetic speech of text by receiving an inputted text, by reading out the phoneme segment data pieces corresponding to respective phonemes indicating the pronunciation of the inputted text, and then by connecting the read-out phoneme segment data pieces to each other;
  
  computing a score indicating the unnaturalness of the synthetic speech of the text, on the basis of the voice data;
  
  storing a plurality of second notations that are paraphrases of a plurality of first notations and associating the second notations with the respective first notations;
  
  searching the text for a notation matching with any of the first notations, and replacing the searched-out notation with the second notation corresponding to the first notation; and
  
  outputting the generated voice data when the score is smaller than a predetermined reference value, and further generating synthetic speech in order to generate further voice data for the text after replacement on condition that the score is equal to or greater than the reference value.
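
The searching and replacing step, combined with the refinements recited in claims 8 and 9, might be sketched as below. The `Paraphrase` record, the regular-expression sentence splitter, the digit test for numeral values, and the `proper_names` set are all assumptions introduced for illustration; the claims do not say how sentences, proper names, or numerals are detected.

```python
import re
from typing import List, NamedTuple, Optional, Set, Tuple


class Paraphrase(NamedTuple):
    first: str         # notation to search for
    second: str        # paraphrase to substitute
    similarity: float  # similarity of meaning between first and second


def replace_best(text: str, table: List[Paraphrase], proper_names: Set[str]) -> str:
    """Replace the matching notation whose paraphrase has the highest similarity score."""
    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    best: Optional[Tuple[float, int, Paraphrase]] = None
    for i, sentence in enumerate(sentences):
        # Leave sentences containing a numeral or a known proper name untouched.
        if re.search(r"\d", sentence) or any(n in sentence for n in proper_names):
            continue
        for entry in table:
            if entry.first in sentence and (best is None or entry.similarity > best[0]):
                best = (entry.similarity, i, entry)
    if best is None:
        return text  # no eligible match
    _, i, entry = best
    sentences[i] = sentences[i].replace(entry.first, entry.second, 1)
    # Rejoining with single spaces normalizes whitespace between sentences.
    return " ".join(sentences)


table = [
    Paraphrase("commence", "start", 0.90),
    Paraphrase("in order to", "to", 0.95),
    Paraphrase("approximately", "about", 0.99),
]
text = "We commence in order to win. Tokyo is approximately 500 km away."
result = replace_best(text, table, proper_names={"Tokyo"})
print(result)
```

The second sentence is skipped because it contains both a proper name and a numeral, so the highest-scoring entry there is never considered; among the matches in the first sentence, the entry with similarity 0.95 is chosen over the one with 0.90.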

12. A program allowing an information processing apparatus to function as a system for generating synthetic speech, the program causing the information apparatus to function as:
- a phoneme segment storage section for storing a plurality of phoneme segment data pieces indicating a plurality of sounds of phonemes which are different from each other; and
  
  a synthesis section for generating voice data representing synthetic speech of text by receiving an inputted text, by reading out phoneme segment data pieces that correspond to respective phonemes indicating the pronunciation of the inputted text, and then by connecting the read-out phoneme segment data pieces to each other;
  
  a computing section for computing a score indicating the unnaturalness of the synthetic speech of the text, on the basis of the voice data;
  
  a paraphrase storage section for storing a plurality of second notations, the second notations being paraphrases of first notations and for associating the second notations with the respective first notations;
  
  a replacement section for searching the text for a notation matching with any of the first notations and for replacing the searched-out notation with the second notation corresponding to the first notation; and
  
  a judgment section for receiving the score and for outputting the generated voice data on condition that the score is smaller than a predetermined reference value, and for inputting the text to the synthesis section in order for the synthesis section to generate further voice data for the text after replacement when the score is equal to or greater than the reference value.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Nishimura, Masafumi; Nagano, Tohru; Tachibana, Ryuki
Granted Patent: US 8,015,011 B2
US Class Current: 704/258
CPC Class Codes: G10L 13/07 Concatenation rules
