ACCURACY OF TEXT-TO-SPEECH SYNTHESIS

US 20140222415A1
Filed: 02/05/2013
Published: 08/07/2014
Est. Priority Date: 02/05/2013
Status: Active Grant


Abstract

According to a first example configuration, a pair of text-to-speech synthesizers produces audio representations for each of multiple words. The outputs are compared to identify instances in which a lexicon lookup algorithm and a grapheme-to-phoneme algorithm produce different audio representations for the same words. Results of the analysis are used to train a classifier that subsequently determines a degree to which a grapheme-to-phoneme algorithm is likely to mispronounce a newly detected out-of-vocabulary word that is to be converted into an audio representation. According to a second example configuration, a text analyzer tags a non-standard word. A group of reviewers generates one or more proposed text-to-speech expansion rules for the detected non-standard word. When there is a high degree of agreement amongst the reviewers on how to expand the non-standard word, the proposed expansion rule is published for use by one or more respective text-to-speech synthesizers.
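The second configuration in the abstract (publishing an expansion rule only when reviewers largely agree) can be illustrated with a minimal sketch. The function name `select_expansion_rule`, the `min_agreement` threshold of 0.75, and the sample proposals are assumptions made for illustration; the abstract does not define how agreement is measured.

```python
from collections import Counter


def select_expansion_rule(proposals, min_agreement=0.75):
    """Return the most frequently proposed expansion if the share of
    reviewers proposing it reaches min_agreement, otherwise None.

    The agreement measure (simple vote share) is an assumption.
    """
    if not proposals:
        return None
    expansion, votes = Counter(proposals).most_common(1)[0]
    if votes / len(proposals) >= min_agreement:
        return expansion
    return None


# Hypothetical reviewer proposals for the non-standard word "Dr."
proposals = ["doctor", "doctor", "doctor", "drive", "doctor"]

published_rules = {}
rule = select_expansion_rule(proposals)
if rule is not None:
    # Publishing is modeled here as adding the rule to a shared dictionary.
    published_rules["Dr."] = rule
```

With these sample proposals, four of five reviewers agree, so the rule is published; an evenly split vote would leave `published_rules` unchanged.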

35 Claims

1. A method comprising:
- detecting occurrence of an out-of-vocabulary word in a text sample;
  
  detecting a likelihood that the out-of-vocabulary word will be mispronounced using a primary text-to-speech synthesizer;
  
  receiving feedback from a source other than the primary text-to-speech synthesizer, the feedback indicating a conversion of the out-of-vocabulary word into a corresponding audio representation; and
  
  storing the feedback in a repository.
- - 2. The method as in claim 1, wherein the occurrence is a first occurrence of the out-of-vocabulary word, the method further comprising:
    - detecting a second occurrence of the out-of-vocabulary word in a subsequent text sample;
      
      accessing the feedback in the repository;
      
      via the accessed feedback, converting the second occurrence of the out-of-vocabulary word to the corresponding audio representation.
  - 3. The method as in claim 1, wherein the primary text-to-speech synthesizer converts the text sample in accordance with a primary language;
    - and wherein the feedback indicates conversion of the out-of-vocabulary word into a corresponding audio representation in accordance with a foreign language with respect to the primary language.
  - 4. The method as in claim 1, wherein receiving the feedback includes:
    - receiving the feedback from a human reviewer that provides the conversion of the out-of-vocabulary word into the corresponding audio representation.
  - 5. The method as in claim 1 further comprising:
    - initiating distribution of the feedback in the repository over a network to each of multiple remotely located text-to-speech synthesizer systems, each of the remotely located text-to-speech synthesizers configured to convert respective text samples for respective clients that access the remotely located text-to-speech synthesizers.
  - 6. The method as in claim 1, wherein detecting the likelihood that the out-of-vocabulary word will be mispronounced using the primary text-to-speech synthesizer includes:
    - implementing the primary text-to-speech synthesizer in a first language, the out-of-vocabulary word absent from a lexicon lookup of the first language.
  - 7. The method as in claim 6, wherein receiving the feedback indicating the conversion of the out-of-vocabulary word into the corresponding audio representation includes:
    - analyzing the out-of-vocabulary word via a secondary text-to-speech synthesizer that attempts to convert the out-of-vocabulary word in a second language, the second language being a foreign language with respect to the first language; and
      
      producing the feedback in response to detecting that the out-of-vocabulary word is present in a lexicon lookup used by the secondary text-to-speech synthesizer to convert text into speech.
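
The flow of claims 1 and 2 (detect an out-of-vocabulary word, estimate whether it is likely to be mispronounced, obtain feedback from another source, store it, and reuse it on a later occurrence) might be sketched as follows. Everything here is a hypothetical illustration: the toy lexicon, the non-ASCII heuristic standing in for a likelihood estimate, the `FeedbackRepository` class, and the made-up phoneme string are assumptions, not the patented implementation.

```python
from dataclasses import dataclass, field

# Toy lexicon standing in for the primary synthesizer's vocabulary (assumption).
PRIMARY_LEXICON = {"the", "concert", "features", "music", "by"}


def is_out_of_vocabulary(word):
    return word.lower() not in PRIMARY_LEXICON


def mispronunciation_likelihood(word):
    """Placeholder score in [0, 1] based on the share of non-ASCII letters.

    A real system would use a trained classifier; this is illustrative only.
    """
    lowered = word.lower()
    non_ascii = sum(1 for ch in lowered if not ch.isascii())
    return min(1.0, 3 * non_ascii / max(len(lowered), 1))


@dataclass
class FeedbackRepository:
    entries: dict = field(default_factory=dict)

    def store(self, word, pronunciation):
        self.entries[word.lower()] = pronunciation

    def lookup(self, word):
        return self.entries.get(word.lower())


def resolve_oov_words(text, repository, feedback_source, threshold=0.5):
    """Return {word: pronunciation} for out-of-vocabulary words in text."""
    resolved = {}
    for word in text.split():
        if not is_out_of_vocabulary(word):
            continue
        stored = repository.lookup(word)
        if stored is not None:
            # Later occurrence: reuse the stored feedback (claim 2).
            resolved[word] = stored
        elif mispronunciation_likelihood(word) >= threshold:
            feedback = feedback_source(word)
            if feedback is not None:
                repository.store(word, feedback)
                resolved[word] = feedback
    return resolved


# Stand-in for a human reviewer or secondary synthesizer.
# The phoneme string is made up for illustration.
reviewer_answers = {"dvořák": "D V AO R ZH AA K"}
requests = []


def reviewer(word):
    requests.append(word)
    return reviewer_answers.get(word.lower())


repository = FeedbackRepository()
first = resolve_oov_words("The concert features music by Dvořák", repository, reviewer)
second = resolve_oov_words("Music by Dvořák", repository, reviewer)
```

In this sketch the reviewer is consulted only once: the second text sample is served from the repository without a new feedback request.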

8. A method comprising:
- implementing a lexicon lookup algorithm in first text-to-speech hardware to produce an audio output representation for each word in a set of multiple words;
  
  implementing a grapheme-to-phoneme algorithm in second text-to-speech hardware to produce an audio output representation for each word in the set of multiple words;
  
  for each word in the set: performing a comparison of an audio output representation of the first text-to-speech hardware and an audio output representation of the second text-to-speech hardware; and
  
  classifying each of the multiple words depending on the comparison.
- - 9. The method as in claim 8, wherein classifying each of the multiple words further includes:
    - generating a first class of words to include each respective word of the multiple words in which the lexicon lookup algorithm and the grapheme-to-phoneme algorithm produce a substantially different audio output representation; and
      
      generating a second class of words to include each respective word of the multiple words in which the lexicon lookup algorithm and the grapheme-to-phoneme algorithm produce a substantially same audio output representation.
  - 10. The method as in claim 8 further comprising:
    - for each of the multiple words:
      
      selecting a word from the multiple words;
      
      utilizing the first text-to-speech hardware to generate a first audio output representation representative of the selected word;
      
      utilizing the second text-to-speech hardware to generate a second audio output representation representative of the selected word;
      
      comparing the first audio output representation to the second audio output representation; and
      
      classifying the respective first audio output representation and the second audio output representation as being either substantially the same or substantially different.
  - 11. The method as in claim 8, wherein the grapheme-to-phoneme algorithm implements multiple grapheme-to-phoneme rules to produce audio output representations for the multiple words, the method further comprising:
    - based on analyzing instances in which the lexicon lookup algorithm produces a different audio output representation than the grapheme-to-phoneme algorithm for respective text, generating a set of predictors, the set of predictors indicating circumstances in which use of the grapheme-to-phoneme rules results in generation of substantially different audio output representations.
  - 12. The method as in claim 11 further comprising:
    - utilizing the set of predictors to train a classification model.
  - 13. The method as in claim 12 further comprising:
    - receiving a text sample on which to perform text-to-speech synthesis; and
      
      utilizing the classification model to detect which out-of-vocabulary words in the text sample are likely to be mispronounced during the text-to-speech synthesis of the text sample.
  - 14. The method as in claim 9 further comprising:
    - identifying a subset of the multiple words for which the lexicon lookup algorithm produces a different audio output representation than the grapheme-to-phoneme algorithm;
      
      analyzing the subset of words to identify instances in which the grapheme-to-phoneme algorithm produces an improper audio output representation for words in the subset;
      
      producing a set of rules based on the instances; and
      
      utilizing the set of rules to train a classification model, the classification model configured to detect which out-of-vocabulary words in a future received text sample are likely to be mispronounced during text-to-speech synthesis of the text sample.
  - 15. The method as in claim 14 further comprising:
    - receiving a text sample on which to perform text-to-speech synthesis; and
      
      utilizing the classification model to detect which out-of-vocabulary words in the text sample are likely to be mispronounced during the text-to-speech synthesis of the text sample.
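
The comparison and classification steps of claims 8 to 10 might be sketched as follows. This sketch makes several assumptions: phoneme sequences stand in for the "audio output representations", a naive letter-to-phoneme table stands in for a grapheme-to-phoneme algorithm, and "substantially different" is modeled as a normalized edit distance above a hypothetical threshold of 0.2. The lexicon entries are toy values.

```python
# Toy pronunciation lexicon (assumption): word -> phoneme string.
LEXICON = {
    "cat": "K AE T",
    "colonel": "K ER N AH L",
    "yacht": "Y AA T",
}

# Naive one-letter-one-phoneme rules standing in for a real G2P algorithm.
LETTER_TO_PHONEME = {
    "a": "AE", "c": "K", "e": "EH", "h": "HH", "l": "L",
    "n": "N", "o": "AA", "t": "T", "y": "Y",
}


def lexicon_lookup(word):
    return LEXICON[word].split()


def grapheme_to_phoneme(word):
    return [LETTER_TO_PHONEME[ch] for ch in word if ch in LETTER_TO_PHONEME]


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def classify_words(words, max_distance=0.2):
    """Split words into (substantially same, substantially different)."""
    same, different = [], []
    for word in words:
        from_lexicon = lexicon_lookup(word)
        from_g2p = grapheme_to_phoneme(word)
        longest = max(len(from_lexicon), len(from_g2p), 1)
        distance = edit_distance(from_lexicon, from_g2p) / longest
        (same if distance <= max_distance else different).append(word)
    return same, different


same, different = classify_words(["cat", "colonel", "yacht"])
```

Words that land in the "different" class are the ones a later stage could analyze to derive predictors or rules for training a classification model, as claims 11 to 14 describe.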

16-29. (canceled)

30. A method comprising:
- detecting occurrence of an out-of-vocabulary word in a text sample to be converted into audio output;
  
  estimating a probability that the out-of-vocabulary word will be mispronounced using a text-to-speech synthesizer; and
  
  selecting amongst multiple sources from which to produce an audio rendition of the out-of-vocabulary word based at least in part on a magnitude of the probability.
- - 31. The method as in claim 30, wherein the text-to-speech synthesizer is a first text-to-speech synthesizer configured to convert respective words in the text sample in accordance with a primary language, the method further comprising:
    - detecting that the out-of-vocabulary word can be properly pronounced using a second text-to-speech synthesizer, the second text-to-speech synthesizer configured to convert respective words in the text sample in accordance with a foreign language with respect to the primary language; and
      
      selecting the second text-to-speech synthesizer as the source from which to receive the audio rendition of the out-of-vocabulary word for inclusion in the audio output.
  - 32. The method as in claim 31 further comprising:
    - utilizing the first text-to-speech synthesizer to produce an audio rendition of at least one word other than the out-of-vocabulary word in the text sample;
      
      utilizing the second text-to-speech synthesizer to produce the audio rendition of the out-of-vocabulary word; and
      
      combining the audio rendition of the at least one word and the audio rendition of the out-of-vocabulary word to produce the audio output.
  - 33. The method as in claim 31 further comprising:
    - producing the audio rendition of the out-of-vocabulary word in the foreign language in accordance with a person speaking the base language.
  - 34. The method as in claim 30, wherein detecting occurrence of the out-of-vocabulary word in the text sample includes:
    - performing a morpho-syntactic analysis to words in the text sample to detect the out-of-vocabulary word.
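
Source selection as described in claims 30 to 32 might look like the following sketch, which picks a rendition source per word from an estimated mispronunciation probability. The source names, the 0.5 threshold, the fallback order, and the toy probabilities are all assumptions made for illustration.

```python
def choose_source(word, probability, secondary_lexicon, feedback_repository,
                  threshold=0.5):
    """Pick where the audio rendition of a word should come from."""
    key = word.lower()
    if probability < threshold:
        return "primary"      # low risk: primary synthesizer is good enough
    if key in feedback_repository:
        return "repository"   # previously stored feedback
    if key in secondary_lexicon:
        return "secondary"    # foreign-language synthesizer knows the word
    return "primary"          # nothing better available


def plan_rendition(text, estimate_probability, secondary_lexicon,
                   feedback_repository, threshold=0.5):
    """Return (word, source) pairs; combining the audio is left out."""
    return [
        (word, choose_source(word, estimate_probability(word),
                             secondary_lexicon, feedback_repository, threshold))
        for word in text.split()
    ]


# Hypothetical inputs.
secondary_lexicon = {"bonjour"}
feedback_repository = {"nguyen": "stored rendition"}
toy_probabilities = {"bonjour": 0.9, "nguyen": 0.8}

plan = plan_rendition(
    "say bonjour to Nguyen",
    lambda w: toy_probabilities.get(w.lower(), 0.0),
    secondary_lexicon,
    feedback_repository,
)
```

Each pair in `plan` names the source for one word; a full system would then concatenate the per-word audio renditions into a single output, as claim 32 describes.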

35-37. (canceled)

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Legat, Milan
Granted Patent: US 9,311,913 B2
US Class Current: 704/8
CPC Class Codes:
G10L 13/08 Text analysis or generation...
G10L 13/086 Detection of language
