Accuracy of text-to-speech synthesis

US 9,311,913 B2
Filed: 02/05/2013
Issued: 04/12/2016
Est. Priority Date: 02/05/2013
Status: Active Grant

Abstract

According to a first example configuration, a pair of text-to-speech synthesizers produces audio representations for each of multiple words. The outputs are compared to identify instances in which a lexicon lookup algorithm and a grapheme-to-phoneme algorithm produce different audio representations for the same words. Results of the analysis are used to train a classifier that subsequently determines a degree to which a grapheme-to-phoneme algorithm is likely to detect a newly detected out-of-vocabulary word to be converted into an audio representation. According to a second example configuration, a text analyzer tags a non-standard word. A group of reviewers generate one or more proposed text-to-speech expansion rules for a detected non-standard word. When there is a high amount of agreement amongst the reviewers on how to expand the non-standard word, the proposed expansion rule is published for use by respective one or more text-to-speech synthesizers.
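
The second configuration in the abstract can be illustrated with a minimal sketch, under stated assumptions: the abstract does not quantify "a high amount of agreement", so the `min_agreement` value of 0.8, the function name `select_expansion`, and the sample reviewer proposals are all hypothetical.

```python
from collections import Counter


def select_expansion(proposals, min_agreement=0.8):
    """Return the most common proposed expansion if enough reviewers agree.

    min_agreement is a hypothetical cutoff; the abstract only says "high".
    Returns None when there are no proposals or agreement is too low.
    """
    if not proposals:
        return None
    expansion, votes = Counter(proposals).most_common(1)[0]
    if votes / len(proposals) >= min_agreement:
        return expansion
    return None


# Hypothetical reviewer proposals for two tagged non-standard words.
reviews = {
    "Dr.": ["Doctor", "Doctor", "Doctor", "Doctor", "Drive"],
    "St.": ["Street", "Saint", "Street", "Saint"],
}

# Only rules with sufficient agreement are published to synthesizers.
published = {}
for token, proposals in reviews.items():
    rule = select_expansion(proposals)
    if rule is not None:
        published[token] = rule
```

Here four of five reviewers agree on "Doctor", so that rule is published, while the evenly split proposals for "St." are held back.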

20 Claims

1. A method comprising:
- detecting, by at least one processor, occurrence of an out-of-vocabulary word in a text sample;
  
  detecting a likelihood that the out-of-vocabulary word will be mispronounced using a primary text-to-speech synthesizer associated with a primary language;
  
  receiving feedback from a source other than the primary text-to-speech synthesizer, the feedback indicating a conversion in accordance with a secondary language of the out-of-vocabulary word into a corresponding audio output;
  
  storing the feedback in a repository;
  
  generating, based on the feedback and by a secondary text-to-speech synthesizer associated with the secondary language, a first audio pronunciation of the out-of-vocabulary word pronounced in accordance with a native secondary language speaking person speaking the secondary language; and
  
  generating, in accordance with a native primary language speaking person speaking the primary language, a second audio pronunciation of the out of vocabulary word.
  - 2. The method as in claim 1, wherein the occurrence is a first occurrence of the out-of-vocabulary word, the method further comprising:
    - detecting a second occurrence of the out-of-vocabulary in a subsequent text sample;
      
      accessing the feedback in the repository; and
      
      determining, based on a setting associated with the second text-to-speech synthesizer, whether to provide the first audio pronunciation of the out-of-vocabulary word or the second audio pronunciation of the out-of-vocabulary word.
  - 3. The method as in claim 1, wherein the primary text-to-speech synthesizer converts the text sample in accordance with the primary language;
    - and wherein the feedback indicates conversion of the out-of-vocabulary word into a corresponding audio output in accordance with a foreign language with respect to the primary language.
  - 4. The method as in claim 1, wherein receiving the feedback includes:
    - receiving the feedback from a human reviewer that provides the conversion of the out-of-vocabulary word into the corresponding audio output.
  - 5. The method as in claim 1, further comprising:
    - initiating distribution of the feedback in the repository over a network to each of multiple remotely located text-to-speech synthesizer systems, each of the remotely located text-to-speech synthesizers configured to convert respective text samples for respective clients that access the remotely located text-to-speech synthesizers.
  - 6. The method as in claim 1, wherein detecting the likelihood that the out-of-vocabulary word will be mispronounced using the primary text-to-speech synthesizer includes:
    - implementing the primary text-to-speech synthesizer in a first language, the out-of-vocabulary word being absent from a lexicon lookup of the first language.
  - 7. The method as in claim 6, wherein receiving the feedback includes:
    - analyzing the out-of-vocabulary word via a secondary text-to-speech synthesizer that attempts to convert the out-of-vocabulary in a foreign language with respect to the first language; and
      
      producing the feedback in response to detecting that the out-of-vocabulary word is present in a lexicon lookup used by the secondary text-to-speech synthesizer to convert text into speech.
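
As a rough illustration of the flow recited in claims 1, 2 and 7, the following sketch detects out-of-vocabulary words, stores feedback in a repository, and later picks a pronunciation based on a setting. Everything named here is an assumption made for illustration: the toy lexicons, the placeholder rendering string (not a real phonetic transcription), the `FeedbackRepository` class, and the `prefer_secondary` flag. The likelihood-of-mispronunciation step is omitted, and strings stand in for audio.

```python
from dataclasses import dataclass, field

PRIMARY_LANG, SECONDARY_LANG = "en", "fr"

# Toy lexicons (hypothetical data, for illustration only).
PRIMARY_LEXICON = {"the", "chef", "served", "a"}
SECONDARY_LEXICON = {"bouillabaisse": "b-u-j-a-b-e-s"}  # placeholder rendering


@dataclass
class FeedbackRepository:
    """Stores, per word, the language and rendering supplied as feedback."""
    entries: dict = field(default_factory=dict)

    def store(self, word, language, rendering):
        self.entries[word] = (language, rendering)

    def lookup(self, word):
        return self.entries.get(word)


def find_oov(text):
    """Return words that are absent from the primary-language lexicon."""
    return [w for w in text.lower().split() if w not in PRIMARY_LEXICON]


def collect_feedback(word, repo):
    """Store feedback when the secondary lexicon contains the word."""
    rendering = SECONDARY_LEXICON.get(word)
    if rendering is not None:
        repo.store(word, SECONDARY_LANG, rendering)
    return rendering is not None


def choose_pronunciation(word, repo, prefer_secondary):
    """Pick the stored secondary rendering or fall back to the primary one."""
    entry = repo.lookup(word)
    if entry is not None and prefer_secondary:
        return entry
    return (PRIMARY_LANG, word)


repo = FeedbackRepository()
for word in find_oov("The chef served a bouillabaisse"):
    collect_feedback(word, repo)
```

On a later occurrence of the same word, `choose_pronunciation` consults the repository instead of repeating the feedback step, which is the behavior claim 2 describes.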

8. A method comprising:
- implementing, by at least one processor, a lexicon lookup algorithm via first text-to-speech hardware to produce a first audio output for each word in a set of multiple words comprising one or more words from a base language and one or more words from a foreign language;
  
  implementing a grapheme-to-phoneme algorithm comprising one or more grapheme-to-phoneme rules via second text-to-speech hardware to produce a second audio output for each word in the set of multiple words;
  
  comparing the first audio output and the second audio output by analyzing instances in which the lexicon lookup algorithm produces a different audio output than the grapheme-to-phoneme algorithm for respective text; and
  
  generating a set of predictors based on the comparing, the set of predictors indicating circumstances in which use of the one or more grapheme-to-phoneme rules results in identifying one or more audio output representations that correspond to one or more words from the foreign language.
  - 9. The method as in claim 8, further comprising:
    - classifying each of the multiple words by:
      
      generating a first class of words to include each respective word of the multiple words in which the lexicon lookup algorithm and the grapheme-to-phoneme algorithm produce a substantially different audio output representation; and
      
      generating a second class of words to include each respective word of the multiple words in which the lexicon lookup algorithm and the grapheme-to-phoneme algorithm produce a substantially same audio output representation; and
      
      generating the set of predictors based on the classifying.
  - 10. The method as in claim 8, further comprising:
    - for each of the multiple words:
      
      selecting a word from the multiple words;
      
      utilizing the first text-to-speech hardware to generate a first audio output representative of the selected word;
      
      utilizing the second text-to-speech hardware to generate a second audio output representative of the selected word;
      
      comparing the first audio output to the second audio output representation; and
      
      classifying the respective first audio output and the second audio output as being either substantially the same or substantially different.
  - 11. The method as in claim 8, wherein the set of predictors indicate circumstances in which use of the one or more grapheme-to-phoneme rules results in generation of substantially different audio output representations by the lexicon lookup algorithm and by the grapheme-to-phoneme algorithm.
  - 12. The method as in claim 11, further comprising:
    - utilizing the set of predictors to train a classification model.
  - 13. The method as in claim 12, further comprising:
    - receiving a text sample on which to perform text-to-speech synthesis; and
      
      utilizing the classification model to detect which out-of-vocabulary words in the text sample are likely to be mispronounced during the text-to-speech synthesis of the text sample.
  - 14. The method as in claim 9, further comprising:
    - identifying which subset of the multiple words the lexicon lookup algorithm produces a different audio output than the grapheme-to-phoneme algorithm;
      
      analyzing the subset of words to identify instances in which the grapheme-to-phoneme algorithm produces an improper audio output for words in the subset;
      
      producing a set of rules based on the instances; and
      
      utilizing the set of rules to train a classification model, the classification model configured to detect which out-of-vocabulary words in a future received text sample are likely to be mispronounced during text-to-speech synthesis of the text sample.
  - 15. The method as in claim 14, further comprising:
    - receiving a text sample on which to perform text-to-speech synthesis; and
      
      utilizing the classification model to detect which out-of-vocabulary words in the text sample are likely to be mispronounced during the text-to-speech synthesis of the text sample.
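
The comparison and predictor steps of claims 8 through 13 might be sketched as follows. This is a simplified stand-in, not the patented method: phoneme strings are compared in place of audio outputs, the lexicon entries and the `toy_g2p` rule are invented, and character bigrams are an assumed choice of predictor that the claims do not specify.

```python
from collections import Counter

# Toy lexicon: word -> reference phoneme string (hypothetical values).
LEXICON = {
    "cat": "k a t",
    "dog": "d o g",
    "ciao": "ch a o",
    "tsunami": "s u n a m i",
}


def toy_g2p(word):
    """Stand-in grapheme-to-phoneme rule: one symbol per letter, 'c' as 'k'."""
    return " ".join("k" if ch == "c" else ch for ch in word)


def bigrams(word):
    """Character bigrams with '#' marking the word boundaries."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def build_predictors(lexicon):
    """Count bigrams separately for words where lookup and G2P differ or agree."""
    differ, agree = Counter(), Counter()
    for word, reference in lexicon.items():
        bucket = differ if toy_g2p(word) != reference else agree
        bucket.update(bigrams(word))
    return differ, agree


def mispronunciation_score(word, differ, agree):
    """Fraction of the word's bigrams seen more often in the 'differ' class."""
    grams = bigrams(word)
    risky = sum(1 for g in grams if differ[g] > agree[g])
    return risky / len(grams)


differ, agree = build_predictors(LEXICON)
```

A new out-of-vocabulary word such as "tsar" shares the "#t" and "ts" bigrams with "tsunami", where the two algorithms disagreed, so it scores higher than a word like "dot" whose bigrams appear only among agreeing words or not at all.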

16. A method comprising:
- detecting, by at least one processor, occurrence of an out-of-vocabulary word in a text sample to be converted into audio output by detecting that the out-of-vocabulary word is not located in a lexicon associated with a default language;
  
  determining a probability that the out-of-vocabulary word will be mispronounced using a text-to-speech synthesizer;
  
  in response to the probability that the out-of-vocabulary word will be mispronounced being below a first threshold probability, producing, via a first text-to-speech synthesizer configured to generate audio in accordance with the default language, a first audio output of the entire out-of-vocabulary word and any words in the text sample that are located in the lexicon associated with the default language; and
  
  in response to the probability that the out-of-vocabulary word will be mispronounced meeting a second threshold probability, producing, via a second text-to-speech synthesizer configured to generate audio in accordance with a foreign language, a second audio output of the out-of-vocabulary word.
  - 17. The method as in claim 16 further comprising:
    - utilizing the first text-to-speech synthesizer to produce an audio output of at least one word other than the out-of-vocabulary word in the text sample;
      
      utilizing the second text-to-speech synthesizer to produce the second audio output of the out-of-vocabulary word; and
      
      combining the audio output of the at least one word and the second audio output of the out-of-vocabulary word to produce an audio output.
  - 18. The method as in claim 16, wherein the second audio output of the out-of-vocabulary word comprises an audio pronunciation of the out-of-vocabulary word pronounced in accordance with a native default language speaking person speaking the default language.
  - 19. The method as in claim 16, wherein detecting occurrence of the out-of-vocabulary word in the text sample includes:
    - performing a morpho-syntactic analysis to one or more words in the text sample to detect the out-of-vocabulary word.
  - 20. The method as in claim 16, wherein the second audio output of the out-of-vocabulary word comprises an audio pronunciation of the entire out-of-vocabulary word pronounced in accordance with a native foreign language speaking person speaking the foreign language.
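
The threshold routing of claims 16 and 17 could look roughly like the sketch below. The threshold values 0.3 and 0.7, the function names, and the sample probabilities are hypothetical. The claims do not say what happens when the probability falls between the two thresholds, so falling back to the default synthesizer in that case is an assumption of this sketch.

```python
def route_words(words, lexicon, probability_of, low=0.3, high=0.7):
    """Assign each word to the 'default' or 'foreign' synthesizer.

    low and high are hypothetical threshold probabilities.
    """
    plan = []
    for word in words:
        if word in lexicon:
            # In-vocabulary words always go to the default-language synthesizer.
            plan.append(("default", word))
            continue
        p = probability_of(word)
        if p < low:
            plan.append(("default", word))
        elif p >= high:
            plan.append(("foreign", word))
        else:
            # Assumption: the claims are silent on this range.
            plan.append(("default", word))
    return plan


def combine(plan):
    """Join per-word outputs into one utterance; strings stand in for audio."""
    return " ".join(f"[{synth}:{word}]" for synth, word in plan)


# Hypothetical lexicon and mispronunciation probabilities.
lexicon = {"i", "ordered"}
probabilities = {"bouillabaisse": 0.9, "blorp": 0.1}

plan = route_words(["i", "ordered", "bouillabaisse"], lexicon, probabilities.get)
utterance = combine(plan)
```

Keeping the routing decision separate from the combining step mirrors claim 17, where the outputs of the two synthesizers are merged into a single audio output.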

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Legat, Milan
Primary Examiner(s): ALBERTALLI, BRIAN LOUIS
Application Number: US13/759,924
Publication Number: US 20140222415A1
Time in Patent Office: 1,162 Days

Field of Search
US Class Current: 1/1
CPC Class Codes:
G10L 13/08 Text analysis or generation...
G10L 13/086 Detection of language
