Predicting pronunciations with word stress

US 10,255,905 B2
Filed: 06/10/2016
Issued: 04/09/2019
Est. Priority Date: 06/10/2016
Status: Active Grant


Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating word pronunciations. One of the methods includes determining, by one or more computers, spelling data that indicates the spelling of a word, providing the spelling data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words, receiving output indicating a stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data as input, using the output of the trained recurrent neural network to generate pronunciation data indicating the stress pattern for a pronunciation of the word, and providing, by the one or more computers, the pronunciation data to a text-to-speech system or an automatic speech recognition system.
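
The flow in the abstract (spelling data in, recurrent network output, pronunciation data out) can be sketched roughly as follows. This is a minimal illustration only: the weights are random and untrained, and the stress symbols, hidden size, one-symbol-per-character output, and all function names are assumptions that do not come from the patent.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
STRESS_SYMBOLS = ["0", "1", "2"]  # assumed labels: unstressed, primary, secondary
HIDDEN = 8                        # assumed hidden-state size

def one_hot(ch):
    """Encode one character of the spelling data as a one-hot vector."""
    return [1.0 if ch == a else 0.0 for a in ALPHABET]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(z):
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_forward(word, seed=0):
    """Run a simple (Elman-style) recurrent pass over the spelling.

    Returns one probability distribution over STRESS_SYMBOLS per character.
    The weights are random stand-ins for a trained network.
    """
    rng = random.Random(seed)
    w_in = random_matrix(rng, HIDDEN, len(ALPHABET))
    w_rec = random_matrix(rng, HIDDEN, HIDDEN)
    w_out = random_matrix(rng, len(STRESS_SYMBOLS), HIDDEN)
    h = [0.0] * HIDDEN
    outputs = []
    for ch in word.lower():
        pre = [a + b for a, b in zip(matvec(w_in, one_hot(ch)), matvec(w_rec, h))]
        h = [math.tanh(p) for p in pre]  # hidden state carries context forward
        outputs.append(softmax(matvec(w_out, h)))
    return outputs

def to_pronunciation_data(outputs):
    """Pick the most probable stress symbol at each step."""
    return [STRESS_SYMBOLS[max(range(len(d)), key=d.__getitem__)] for d in outputs]

outputs = rnn_forward("record")
pattern = to_pronunciation_data(outputs)
```

Because the network here is untrained, the resulting pattern is arbitrary; the sketch only shows the shape of the data at each stage.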

19 Claims

1. A method performed by one or more computers of a text-to-speech synthesis system, the method comprising:
- determining, by the one or more computers of the text-to-speech synthesis system, spelling data that indicates the spelling of a word;
  
  determining, by the one or more computers of the text-to-speech synthesis system, first pronunciation data that indicates at least one stress location for the word;
  
  providing, by the one or more computers of the text-to-speech synthesis system, the spelling data and the first pronunciation data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
  
  receiving, by the one or more computers of the text-to-speech synthesis system, output representing a stress pattern for pronunciation of the word, the output being generated by the trained recurrent neural network in response to providing the spelling data and the first pronunciation data as input;
  
  using, by the one or more computers of the text-to-speech synthesis system, the output of the trained recurrent neural network to generate second pronunciation data indicating a stress pattern for a pronunciation of the word, wherein the second pronunciation data is different from the first pronunciation data that indicates at least one stress location for the word;
  
  generating, using the second pronunciation data, audio data that includes a synthesized utterance of the word and applies stress to the word based on the stress pattern indicated by the second pronunciation data; and
  
  providing, by the one or more computers of the text-to-speech synthesis system, the audio data to a system that includes at least one speaker for audible presentation of the synthesized utterance of the word using the audio data.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, wherein:
    - providing the spelling data and the first pronunciation data as input to the trained recurrent neural network comprises providing the spelling data and the first pronunciation data as input to a trained long short-term memory recurrent neural network; and
      
      receiving the output representing the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving the output representing the stress pattern for pronunciation of the word generated by the trained long short-term memory recurrent neural network in response to providing the spelling data and the first pronunciation data as input.
  - 3. The method of claim 1, wherein using, by the one or more computers, the output of the trained recurrent neural network to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the second pronunciation data that indicates at least one primary stress location.
  - 4. The method of claim 1, wherein using, by the one or more computers, the output of the trained recurrent neural network to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the second pronunciation data that indicates a sequence of phones for the word with stress and syllable divisions and stress values.
  - 5. The method of claim 1, wherein:
    - providing, by the one or more computers, the spelling data and the first pronunciation data as input to the trained recurrent neural network comprises providing a plurality of input vectors for the spelling data as input to the trained recurrent neural network, each of the plurality of input vectors indicating a particular character from the spelling data or filler; and
      
      receiving the output representing the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving a plurality of output vectors that each indicate a probability distribution over a set of symbols, a combination of the plurality of output vectors indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network.
  - 6. The method of claim 5, wherein:
    - providing the plurality of input vectors comprises providing a predetermined number of input vectors to the trained recurrent neural network as the input; and
      
      receiving the plurality of output vectors comprises receiving the predetermined number of output vectors from the trained recurrent neural network as the output.
  - 7. The method of claim 1, wherein using, by the one or more computers, the output of the trained recurrent neural network to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises applying one or more constraints to the output to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word.
  - 8. The method of claim 7, wherein applying the one or more constraints to the output to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises:
    - using beam search on the output of the trained recurrent neural network to determine a path in the output with a highest likelihood of satisfying the one or more constraints; and
      
      using the path with the highest likelihood of satisfying the one or more constraints to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word.
  - 9. The method of claim 7, wherein applying the one or more constraints to the output to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises applying, by a network of finite state transducers, the one or more constraints to the output to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word.
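
Claims 5 through 8 describe a fixed number of input vectors (characters or filler), output vectors that are probability distributions over symbols, and a beam search that applies constraints. One possible reading is sketched below. The filler symbol, the two-symbol output set, the "exactly one primary stress" constraint, and the function names are all assumptions chosen for illustration, and the sketch interprets the claim 8 language as "the most likely path that satisfies the constraints".

```python
import math

FILLER = "_"                                   # assumed filler symbol
ALPHABET = "abcdefghijklmnopqrstuvwxyz" + FILLER
SYMBOLS = ["0", "1"]                           # assumed: unstressed, primary stress

def encode(word, length):
    """Return exactly `length` one-hot vectors, padding the spelling with filler."""
    if len(word) > length:
        raise ValueError("word longer than the fixed input length")
    padded = word.lower() + FILLER * (length - len(word))
    return [[1.0 if ch == a else 0.0 for a in ALPHABET] for ch in padded]

def beam_search(distributions, beam_width=3):
    """Return the most likely symbol path containing exactly one primary stress.

    `distributions` holds one probability distribution over SYMBOLS per step.
    Returns None if no surviving path satisfies the constraint.
    """
    beams = [(0.0, [])]  # (log-probability, symbols chosen so far)
    for dist in distributions:
        candidates = []
        for score, path in beams:
            for symbol, p in zip(SYMBOLS, dist):
                if p <= 0.0:
                    continue
                new_path = path + [symbol]
                if new_path.count("1") > 1:
                    continue  # prune: the constraint can no longer be met
                candidates.append((score + math.log(p), new_path))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    valid = [b for b in beams if b[1].count("1") == 1]
    return max(valid, key=lambda b: b[0])[1] if valid else None

# Per-step argmax would give 1, 1, 0, which has two primary stresses.
dists = [[0.4, 0.6], [0.3, 0.7], [0.9, 0.1]]
best = beam_search(dists)  # ["0", "1", "0"]
```

With these example distributions, the unconstrained argmax violates the constraint, while the constrained search settles on the single-stress path with the highest probability (0.4 × 0.7 × 0.9 = 0.252).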

10. A text-to-speech synthesis system comprising:
- one or more data processing apparatus; and
  
  one or more non-transitory computer readable storage media in data communication with the one or more data processing apparatus and storing instructions executable by the one or more data processing apparatus and upon such execution cause the one or more data processing apparatus to perform operations comprising:
  
  determining spelling data that indicates the spelling of a word;
  
  determining first pronunciation data that indicates at least one stress location for the word;
  
  providing the spelling data and the first pronunciation data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
  
  receiving output representing a stress pattern for pronunciation of the word, the output being generated by the trained recurrent neural network in response to providing the spelling data and the first pronunciation data as input;
  
  using the output of the trained recurrent neural network to generate second pronunciation data indicating a stress pattern for a pronunciation of the word, wherein the second pronunciation data is different from the first pronunciation data that indicates at least one stress location for the word;
  
  generating, using the second pronunciation data, audio data that includes a synthesized utterance of the word and applies stress to the word based on the stress pattern indicated by the second pronunciation data; and
  
  providing, as output of the text-to-speech synthesis system, the audio data to a system that includes at least one speaker for audible presentation of the synthesized utterance of the word using the audio data.
- Dependent claims (11, 12, 13, 14, 15, 16):
  - 11. The system of claim 10, wherein:
    - providing the spelling data and the first pronunciation data as input to the trained recurrent neural network comprises providing the spelling data and the first pronunciation data as input to a trained long short-term memory recurrent neural network; and
      
      receiving the output representing the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving the output representing the stress pattern for pronunciation of the word generated by the trained long short-term memory recurrent neural network in response to providing the spelling data and the first pronunciation data as input.
  - 12. The system of claim 10, wherein using the output of the trained recurrent neural network to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the second pronunciation data that indicates at least one primary stress location.
  - 13. The system of claim 10, wherein using the output of the trained recurrent neural network to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the second pronunciation data that indicates a sequence of phones for the word with stress and syllable divisions and stress values.
  - 14. The system of claim 10, wherein:
    - providing the spelling data and the first pronunciation data as input to the trained recurrent neural network comprises providing a plurality of input vectors for the spelling data as input to the trained recurrent neural network, each of the plurality of input vectors indicating a particular character from the spelling data or filler; and
      
      receiving the output representing the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving a plurality of output vectors that each indicate a probability distribution over a set of symbols, a combination of the plurality of output vectors indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network.
  - 15. The system of claim 14, wherein:
    - providing the plurality of input vectors comprises providing a predetermined number of input vectors to the trained recurrent neural network as the input; and
      
      receiving the plurality of output vectors comprises receiving the predetermined number of output vectors from the trained recurrent neural network as the output.
  - 16. The system of claim 10, wherein using the output of the trained recurrent neural network to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises applying one or more constraints to the output to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word.
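
Claims 12 and 13 refer to pronunciation data that indicates primary stress locations and a sequence of phones with syllable divisions and stress values. A small data structure along those lines might look like the sketch below. The class names, the stress encoding (0 unstressed, 1 primary, 2 secondary), the rendering format, and the phone labels are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Syllable:
    phones: Tuple[str, ...]  # phone labels for this syllable
    stress: int              # assumed encoding: 0 unstressed, 1 primary, 2 secondary

@dataclass(frozen=True)
class Pronunciation:
    word: str
    syllables: Tuple[Syllable, ...]

    def stress_pattern(self) -> List[int]:
        """Stress value of each syllable, in order."""
        return [s.stress for s in self.syllables]

    def primary_stress_locations(self) -> List[int]:
        """Indices of syllables that carry primary stress."""
        return [i for i, s in enumerate(self.syllables) if s.stress == 1]

    def render(self) -> str:
        """Render as 'stress phones . stress phones' with '.' marking syllable divisions."""
        return " . ".join(f"{s.stress} " + " ".join(s.phones) for s in self.syllables)

# Illustrative entry for the verb "record"; the phone labels are assumed.
record_verb = Pronunciation(
    word="record",
    syllables=(
        Syllable(phones=("r", "ih"), stress=0),
        Syllable(phones=("k", "ao", "r", "d"), stress=1),
    ),
)
```

Under this representation, "first pronunciation data" could be as small as a list of known stress locations, while "second pronunciation data" would be the full `Pronunciation` object; that mapping is one interpretation rather than something the claims spell out.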

17. One or more non-transitory computer readable storage media storing instructions executable by one or more data processing apparatus of a text-to-speech synthesis system and upon such execution cause the one or more data processing apparatus to perform operations comprising:
- determining spelling data that indicates the spelling of a word;
  
  determining first pronunciation data that indicates at least one stress location for the word;
  
  providing the spelling data and the first pronunciation data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
  
  receiving output representing a stress pattern for pronunciation of the word, the output being generated by the trained recurrent neural network in response to providing the spelling data and the first pronunciation data as input;
  
  using the output of the trained recurrent neural network to generate second pronunciation data indicating a stress pattern for a pronunciation of the word, wherein the second pronunciation data is different from the first pronunciation data that indicates at least one stress location for the word;
  
  generating, using the second pronunciation data, audio data that includes a synthesized utterance of the word and applies stress to the word based on the stress pattern indicated by the second pronunciation data; and
  
  providing, as output of the text-to-speech synthesis system, the audio data to a system that includes at least one speaker for audible presentation of the synthesized utterance of the word using the audio data.
- Dependent claims (18, 19):
  - 18. The computer readable storage medium of claim 17, wherein:
    - providing the spelling data and the first pronunciation data as input to the trained recurrent neural network comprises providing the spelling data and the first pronunciation data as input to a trained long short-term memory recurrent neural network; and
      
      receiving the output representing the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving the output representing the stress pattern for pronunciation of the word generated by the trained long short-term memory recurrent neural network in response to providing the spelling data and the first pronunciation data as input.
  - 19. The computer readable storage medium of claim 17, wherein using the output of the trained recurrent neural network to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word comprises applying one or more constraints to the output to generate the second pronunciation data indicating the stress pattern for the pronunciation of the word.
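
Claims 2, 11, and 18 narrow the recurrent network to a long short-term memory (LSTM) network. For readers unfamiliar with the term, a single step of a standard LSTM cell (without peephole connections) can be sketched as below. The patent text here does not specify the cell variant, so the gate layout, parameter names, sizes, and random initialization are assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, vec):
    """Compute weights @ vec + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def init_params(input_size, hidden_size, seed=0):
    """Create small random weights for the input, forget, output, and candidate gates."""
    rng = random.Random(seed)
    params = {}
    for gate in "ifog":
        params["W" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
            for _ in range(hidden_size)
        ]
        params["b" + gate] = [0.0] * hidden_size
    return params

def lstm_step(params, x, h_prev, c_prev):
    """Advance the cell by one input vector; returns (hidden state, cell state)."""
    z = x + h_prev  # concatenate the input with the previous hidden state
    i = [sigmoid(v) for v in affine(params["Wi"], params["bi"], z)]    # input gate
    f = [sigmoid(v) for v in affine(params["Wf"], params["bf"], z)]    # forget gate
    o = [sigmoid(v) for v in affine(params["Wo"], params["bo"], z)]    # output gate
    g = [math.tanh(v) for v in affine(params["Wg"], params["bg"], z)]  # candidate values
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

params = init_params(input_size=3, hidden_size=4)
h, c = lstm_step(params, [1.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4)
```

The cell state `c` is what lets the network retain information across many characters, which is the usual motivation for choosing an LSTM over a plain recurrent layer.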

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Chua, Mason Vijay; Rao, Kanury Kanishka; van Esch, Daniel Jacobus Josef
Primary Examiner(s): Yen, Eric
Application Number: US15/178,719
Publication Number: US 20170358293A1
Time in Patent Office: 1,033 Days
CPC Class Codes

G10L 13/0335   Pitch control

G10L 13/047   Architecture of speech synt...

G10L 13/08   Text analysis or generation...

G10L 13/10   Prosody rules derived from ...

G10L 15/02   Feature extraction for spee...

G10L 15/063   Training

G10L 15/16   using artificial neural net...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/187   Phonemic context, e.g. pron...

G10L 17/18   Artificial neural networks;...

G10L 2015/027   Syllables being the recogni...

G10L 25/30   using neural networks
