Generating objectively evaluated sufficiently natural synthetic speech from text by using selective paraphrases

US 8,015,011 B2
Filed: 01/30/2008
Issued: 09/06/2011
Est. Priority Date: 01/30/2007
Status: Active Grant

8 Assignments

0 Petitions

Abstract

A synthetic speech system includes a phoneme segment storage section for storing multiple phoneme segment data pieces; a synthesis section for generating voice data from text by reading phoneme segment data pieces representing the pronunciation of an inputted text from the phoneme segment storage section and connecting the phoneme segment data pieces to each other; a computing section for computing a score indicating the unnaturalness of the voice data representing the synthetic speech of the text; a paraphrase storage section for storing multiple paraphrases of the multiple first phrases; a replacement section for searching the text and replacing with appropriate paraphrases; and a judgment section for outputting generated voice data on condition that the computed score is smaller than a reference value and for inputting the text after the replacement to the synthesis section to cause the synthesis section to further generate voice data for the text.
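
The control flow described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `synthesize_until_natural`, `toy_synthesize`, and `toy_unnaturalness` are hypothetical names, the paraphrase table is a plain dictionary, and returning the last attempt when no paraphrase remains is an assumption the abstract does not state.

```python
from typing import Callable, Dict, List, Tuple


def synthesize_until_natural(
    text: str,
    paraphrases: Dict[str, str],
    synthesize: Callable[[str], List[float]],
    unnaturalness: Callable[[List[float]], float],
    reference: float,
) -> Tuple[str, List[float]]:
    """Re-synthesize after each paraphrase until the score is below `reference`.

    Assumption: if every paraphrase has been tried and the score is still
    too high, the last attempt is returned.
    """
    voice = synthesize(text)
    for first, second in paraphrases.items():
        # Judgment step: stop as soon as the speech is natural enough.
        if unnaturalness(voice) < reference:
            break
        # Replacement step: swap one matching first notation for its paraphrase.
        if first in text:
            text = text.replace(first, second, 1)
            voice = synthesize(text)
    return text, voice


# Toy stand-ins (hypothetical): "voice data" is a list of word lengths and
# long words count as hard to synthesize.
def toy_synthesize(text: str) -> List[float]:
    return [float(len(word)) for word in text.split()]


def toy_unnaturalness(voice: List[float]) -> float:
    return max(voice, default=0.0)


if __name__ == "__main__":
    table = {"commence": "start", "procedure": "steps"}
    print(synthesize_until_natural(
        "please commence the procedure", table,
        toy_synthesize, toy_unnaturalness, reference=7.0,
    ))
```

A real system would plug a concatenative synthesizer and an acoustic scoring function into the two callables; the loop itself only needs the score and the paraphrase table.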

306 Citations

12 Claims

1. A system for generating synthetic speech, comprising:
- a phoneme segment storage section operable to store a plurality of phoneme segment data pieces indicating a plurality of sounds of phonemes which are different from each other; and
  
  a synthesis section operable to generate voice data representing synthetic speech of text by receiving an inputted text, reading out phoneme segment data pieces that correspond to respective phonemes indicating the pronunciation of the inputted text, and connecting the read-out phoneme segment data pieces to each other;
  
  a computing section operable to compute a score indicating naturalness of the synthetic speech of the text, on the basis of the voice data;
  
  a paraphrase storage section operable to store a plurality of notations each comprising a word or phrase, the plurality of notations comprising a plurality of first notations and a plurality of second notations, each second notation being a paraphrase of a respective first notation;
  
  a replacement section operable to search the text for a notation matching any of the first notations and to replace a matching notation with the second notation corresponding to the first notation; and
  
  a judgment section operable to receive the score computed by the computing section and determine whether the score indicates the synthetic speech is sufficiently natural, and;
  
  if the score indicates the synthetic speech is sufficiently natural, output the generated voice data; and
  
  if the score indicates the synthetic speech is not sufficiently natural, cause the replacement section to generate revised text by replacing at least one other notation in the inputted text matching a first notation with a corresponding second notation, and cause the synthesis section to generate voice data for the revised text.
- - 2. The system according to claim 1, wherein the computing section is operable to compute, as the score, a degree of difference in pronunciation between first and second phoneme segment data pieces contained in the voice data and connected to each other, at a boundary between the first and second phoneme segment data pieces.
  - 3. The system according to claim 2, wherein:
    - the phoneme segment storage section is operable to store a data piece representing fundamental frequency and tone of the sound of each phoneme as the phoneme segment data piece, and
      
      the computing section is operable to compute, as the score, a degree of difference in the fundamental frequency and tone between the first and second phoneme segment data pieces at the boundary between the first and second phoneme segment data pieces.
  - 4. The system according to claim 1, wherein:
    - the synthesis section includes;
      
      a word storage section for storing a reading way of a plurality of words in association with a notation of the plurality of words;
      
      a word search section for searching the word storage section for a word whose notation matches with the notation of each of the words contained in the inputted text, and for generating a reading way of the text by reading the reading ways corresponding to the respective searched-out words from the word storage section, and then by connecting the reading ways to each other; and
      
      a phoneme segment search section for generating the voice data by retrieving a phoneme segment data piece representing a prosody closest to a prosody of each phoneme determined based on the generated reading way, from the phoneme segment storage section, and then by connecting the plurality of retrieved phoneme segment data pieces to each other, and
      
      the computing section is operable to compute, as the score, a difference between the prosody of each phoneme determined based on the generated reading way, and a prosody indicated by the phoneme segment data piece retrieved in correspondence to each phoneme.
  - 5. The system according to claim 1, wherein the synthesis section includes:
    - a word storage section for storing a reading way of a plurality of words in association with a notation of the plurality of words;
      
      a word search section for searching the word storage section for a word whose notation matches with the notation of each of the words contained in the inputted text, and for generating a reading way of the text by reading the reading ways corresponding to the respective searched-out words from the word storage section, and then by connecting the reading ways to each other;
      
      a phoneme segment search section for generating the voice data by retrieving a phoneme segment data piece representing a tone closest to tone of each phoneme determined based on the generated reading way, from the phoneme segment storage section, and then by connecting the plurality of retrieved phoneme segment data pieces to each other, and
      
      wherein the computing section is operable to compute, as the score, a difference between the tone of each phoneme determined based on the generated reading way, and the tone indicated by the phoneme segment data piece retrieved in correspondence to each phoneme.
  - 6. The system according to claim 1, wherein:
    - the phoneme segment storage section is operable to store obtained target voice data that is target speaker's voice data to be targeted for synthetic speech generation, and to generate and store a plurality of phoneme segment data pieces representing sounds of a plurality of phonemes contained in the target voice data,
      
      the paraphrase storage section is operable to store, as each of the plurality of second notations, the notation of a word contained in a text representing the content of the target voice data, and
      
      the replacement section is operable to replace a notation contained in the inputted text which matches any of the first notations, with a corresponding one of the second notations that is a notation representing content of target voice data.
  - 7. The system according to claim 1, wherein:
    - the replacement section is operable to search the text for combinations of a predetermined number of words successively written in the inputted text, in which any match a first notation, and replaces a word contained in the combination having a greatest degree of difference between included words with a corresponding second notation.
  - 8. The system according to claim 1, wherein:
    - the paraphrase storage section is operable to store a similarity score in association with each of combinations of a first notation and a second notation that is a paraphrase of the first notation, the similarity score indicating a degree of similarity between meanings of the first and second notations, and
      
      when a notation contained in the inputted text matches with each of a plurality of first notations, the replacement section replaces the matching notation with the second notation having a highest similarity to the corresponding first notation.
  - 9. The system according to claim 1, wherein:
    - the replacement section is operable to not replace a notation included in a sentence that contains at least any one of a proper name and a numeral value.
  - 10. The system according to claim 1, further comprising a display section operable to display the text, having the notation replaced, to a user on condition that the replacement section replaces the notation, and wherein the judgment section is operable to output voice data based on the text having the notation replaced, if an input permitting the replacement in the displayed text is received, and outputs voice data based on the text before replacement if an input permitting the replacement in the displayed text is not received.
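
Claims 2 and 3 describe a score based on how much fundamental frequency and tone differ at the boundary between two connected phoneme segments. The Python sketch below shows one way such a score could be computed. It is an illustration under assumptions the claims do not fix: the `Segment` fields, the representation of tone as a small feature vector, the Euclidean distance, and the `tone_weight` parameter are all hypothetical choices. A lower value here means smoother, more natural joins.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    """One phoneme segment, reduced to its edge features (hypothetical layout)."""
    f0_start: float               # fundamental frequency at the left edge, in Hz
    f0_end: float                 # fundamental frequency at the right edge, in Hz
    tone_start: Sequence[float]   # tone (spectral) features at the left edge
    tone_end: Sequence[float]     # tone (spectral) features at the right edge


def boundary_cost(left: Segment, right: Segment, tone_weight: float = 1.0) -> float:
    """Mismatch between the end of `left` and the start of `right`."""
    f0_jump = abs(left.f0_end - right.f0_start)
    tone_jump = math.dist(left.tone_end, right.tone_start)  # Python 3.8+
    return f0_jump + tone_weight * tone_jump


def unnaturalness(segments: Sequence[Segment], tone_weight: float = 1.0) -> float:
    """Sum of boundary costs over every pair of adjacent segments."""
    return sum(
        boundary_cost(a, b, tone_weight)
        for a, b in zip(segments, segments[1:])
    )


if __name__ == "__main__":
    a = Segment(100.0, 110.0, (0.0, 0.0), (1.0, 0.0))
    b = Segment(110.0, 120.0, (1.0, 0.0), (2.0, 0.0))  # joins `a` smoothly
    c = Segment(150.0, 150.0, (2.0, 3.0), (0.0, 0.0))  # jumps in both F0 and tone
    print(unnaturalness([a, b]), unnaturalness([a, b, c]))
```

Summing the costs is only one option; a system could equally take the worst boundary or an average, depending on how the reference value is calibrated.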

11. A method for generating synthetic speech, comprising acts of:
- storing a plurality of phoneme segment data pieces indicating a plurality of sounds of phonemes different from each other;
  
  generating voice data representing synthetic speech of text by receiving an inputted text, reading out the phoneme segment data pieces corresponding to respective phonemes indicating the pronunciation of the inputted text, and connecting the read-out phoneme segment data pieces to each other;
  
  computing a score indicating naturalness of the synthetic speech of the text, on the basis of the voice data;
  
  storing a plurality of notations each comprising a word or phrase, the plurality of notations comprising a plurality of first notations and a plurality of second notations, each second notation being a paraphrase of a respective first notation;
  
  searching the text for a notation matching any of the first notations, and replacing a matching notation with the second notation corresponding to the first notation;
  
  determining whether the score indicates that the synthetic speech is sufficiently natural; and
  
  if the score indicates that the synthetic speech is sufficiently natural, outputting the generated voice data; and
  
  if the score indicates that the synthetic speech is not sufficiently natural, generating revised text by replacing at least one other notation in the inputted text matching a first notation with a corresponding second notation, and generating voice data for the revised text.
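
The searching and replacing acts above, combined with the refinements in claims 8 and 9 (prefer the paraphrase with the highest similarity score; leave sentences with proper names or numerals alone), can be sketched in Python as below. This is a hedged illustration: `Paraphrase`, `has_name_or_number`, `best_paraphrase`, and `replace_once` are hypothetical names, matching is done by plain substring search, and the proper-name test is a crude capitalization heuristic assumed here for brevity.

```python
import re
from typing import List, NamedTuple, Optional


class Paraphrase(NamedTuple):
    first: str         # notation to look for in the text
    second: str        # paraphrase to substitute
    similarity: float  # closeness in meaning; higher is closer


def has_name_or_number(sentence: str) -> bool:
    """Crude heuristic (assumption): any digit, or a capitalized word
    that is not the first word of the sentence."""
    if re.search(r"\d", sentence):
        return True
    words = sentence.split()
    return any(word[0].isupper() for word in words[1:])


def best_paraphrase(sentence: str, table: List[Paraphrase]) -> Optional[Paraphrase]:
    """Pick the matching entry with the highest similarity, if any."""
    if has_name_or_number(sentence):
        return None  # protected sentence: do not paraphrase
    candidates = [p for p in table if p.first in sentence]
    return max(candidates, key=lambda p: p.similarity, default=None)


def replace_once(sentence: str, table: List[Paraphrase]) -> str:
    """Return the sentence with at most one notation paraphrased."""
    choice = best_paraphrase(sentence, table)
    if choice is None:
        return sentence
    return sentence.replace(choice.first, choice.second, 1)


if __name__ == "__main__":
    table = [
        Paraphrase("commence", "begin", 0.9),
        Paraphrase("commence", "kick off", 0.6),
        Paraphrase("procedure", "steps", 0.7),
    ]
    print(replace_once("please commence the procedure", table))
    print(replace_once("tell Alice to commence", table))
```

A production system would more likely match on word boundaries and use a named-entity recognizer rather than capitalization, but the selection logic stays the same.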

12. At least one storage device having instructions encoded thereon which, when executed, perform a method of generating synthetic speech, the method comprising acts of:
- storing a plurality of phoneme segment data pieces indicating a plurality of sounds of phonemes which are different from each other; and
  
  generating voice data representing synthetic speech of text by receiving an inputted text, reading out phoneme segment data pieces that correspond to respective phonemes indicating the pronunciation of the inputted text, and connecting the read-out phoneme segment data pieces to each other;
  
  computing a score indicating naturalness of the synthetic speech of the text, on the basis of the voice data;
  
  storing a plurality of notations each comprising a word or phrase, the plurality of notations comprising a plurality of first notations and a plurality of second notations, each of the second notations being a paraphrase of a respective first notation; and
  
  searching the text for a notation matching any of the first notations and replacing a matching notation with the second notation corresponding to the first notation; and
  
  determining whether the score indicates that the synthetic speech is sufficiently natural; and
  
  if the score indicates that the synthetic speech is sufficiently natural, outputting the generated voice data; and
  
  if the score indicates that the synthetic speech is not sufficiently natural, generating revised text by replacing at least one other notation in the inputted text matching a first notation with a respective second notation, and generating voice data for the revised text.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Nagano, Tohru; Tachibana, Ryuki; Nishimura, Masafumi
Primary Examiner(s): Smits, Talivaldis Ivars
Application Number: US12/022,333
Publication Number: US 20080183473A1
Time in Patent Office: 1,315 Days
Field of Search: None
US Class Current: 704/260
CPC Class Codes: G10L 13/07 Concatenation rules
