PREDICTING PRONUNCIATIONS WITH WORD STRESS

US 20170358293A1
Filed: 06/10/2016
Published: 12/14/2017
Est. Priority Date: 06/10/2016
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating word pronunciations. One of the methods includes determining, by one or more computers, spelling data that indicates the spelling of a word, providing the spelling data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words, receiving output indicating a stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data as input, using the output of the trained recurrent neural network to generate pronunciation data indicating the stress pattern for a pronunciation of the word, and providing, by the one or more computers, the pronunciation data to a text-to-speech system or an automatic speech recognition system.

20 Claims

1. A method performed by one or more computers, the method comprising:
- determining, by the one or more computers, spelling data that indicates the spelling of a word;
  
  providing, by the one or more computers, the spelling data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
  
  receiving, by the one or more computers, output indicating a stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data as input;
  
  using, by the one or more computers, the output of the trained recurrent neural network to generate pronunciation data indicating the stress pattern for a pronunciation of the word; and
  
  providing, by the one or more computers, the pronunciation data to a text-to-speech system or an automatic speech recognition system.
  - 2. The method of claim 1, wherein:
    - providing the spelling data as input to the trained recurrent neural network comprises providing the spelling data as input to a trained long short-term memory recurrent neural network; and
      
      receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving the output indicating the stress pattern for pronunciation of the word generated by the trained long short-term memory recurrent neural network in response to providing the spelling data as input.
  - 3. The method of claim 1, wherein using, by the one or more computers, the output of the trained recurrent neural network to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the pronunciation that indicates at least one primary stress location.
  - 4. The method of claim 1, wherein using, by the one or more computers, the output of the trained recurrent neural network to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the pronunciation that indicates a sequence of phones for the word with stress and syllable divisions and stress values.
  - 5. The method of claim 1, comprising determining, by the one or more computers, pronunciation data that indicates at least one stress location for the word, wherein:
    - providing, by the one or more computers, the spelling data as input to the trained recurrent neural network comprises providing the spelling data and the pronunciation data as input to the trained recurrent neural network; and
      
      receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data and the pronunciation data as input.
  - 6. The method of claim 1, wherein:
    - providing, by the one or more computers, the spelling data as input to the trained recurrent neural network comprises providing a plurality of input vectors for the spelling data as input to the trained recurrent neural network, each of the plurality of input vectors indicating a particular character from the spelling data or filler; and
      
      receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving a plurality of output vectors that each indicate a probability distribution over a set of symbols, a combination of the plurality of output vectors indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network.
  - 7. The method of claim 6, wherein:
    - providing the plurality of input vectors comprises providing a predetermined number of input vectors to the trained recurrent neural network as the input; and
      
      receiving the plurality of output vectors comprises receiving the predetermined number of output vectors from the trained recurrent neural network as the output.
  - 8. The method of claim 1, wherein using, by the one or more computers, the output of the trained recurrent neural network to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises applying one or more constraints to the output to generate the pronunciation data indicating the stress pattern for the pronunciation of the word.
  - 9. The method of claim 8, wherein applying the one or more constraints to the output to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises:
    - using beam search on the output of the trained recurrent neural network to determine a path in the output with a highest likelihood of satisfying the one or more constraints; and
      
      using the path with the highest likelihood of satisfying the one or more constraints to generate the pronunciation data indicating the stress pattern for the pronunciation of the word.
  - 10. The method of claim 8, wherein applying the one or more constraints to the output to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises applying, by a network of finite state transducers, the one or more constraints to the output to generate the pronunciation data indicating the stress pattern for the pronunciation of the word.
  - 11. The method of claim 1, comprising:
    - receiving, from the text-to-speech system, audio data generated using the pronunciation data in response to providing, by the one or more computers, the pronunciation data to the text-to-speech system, wherein providing, by the one or more computers, the pronunciation data to the text-to-speech system or an automatic speech recognition system comprises providing, by the one or more computers, the pronunciation data to a text-to-speech system.
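
Claims 6 and 7 describe the network interface as a fixed number of input vectors (one per character, or filler) and the same number of output vectors, each a probability distribution over a set of symbols. The following is a minimal sketch of that encoding and decoding, not the patented implementation. The alphabet, the filler symbol `_`, the length `NUM_STEPS`, and the stress symbol set are all assumptions chosen for illustration; the claims do not specify them.

```python
import string

# Hypothetical choices; the claims do not fix an alphabet, filler, or length.
ALPHABET = string.ascii_lowercase
FILLER = "_"
INPUT_SYMBOLS = ALPHABET + FILLER
NUM_STEPS = 12  # assumed "predetermined number" of vectors

# Hypothetical output symbols: unstressed, primary, secondary, filler.
STRESS_SYMBOLS = ["0", "1", "2", FILLER]


def encode_spelling(word, num_steps=NUM_STEPS):
    """Return num_steps one-hot vectors: one per character, then filler."""
    word = word.lower()
    if len(word) > num_steps:
        raise ValueError("word is longer than the fixed input length")
    if any(ch not in ALPHABET for ch in word):
        raise ValueError("word contains a character outside the alphabet")
    padded = word + FILLER * (num_steps - len(word))
    vectors = []
    for ch in padded:
        vec = [0.0] * len(INPUT_SYMBOLS)
        vec[INPUT_SYMBOLS.index(ch)] = 1.0
        vectors.append(vec)
    return vectors


def decode_stress(output_vectors):
    """Take the most probable symbol from each output vector, drop filler."""
    symbols = []
    for dist in output_vectors:
        best = max(range(len(dist)), key=dist.__getitem__)
        symbols.append(STRESS_SYMBOLS[best])
    return "".join(s for s in symbols if s != FILLER)
```

Padding with filler keeps the input and output lengths equal, which is what lets a single fixed-length pass serve words of different lengths. Taking the most probable symbol per vector is the simplest way to combine the output vectors; claims 8 to 10 describe constrained alternatives.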

12. A system comprising:
- a data processing apparatus; and
  
  a non-transitory computer readable storage medium in data communication with the data processing apparatus and storing instructions executable by the data processing apparatus and upon such execution cause the data processing apparatus to perform operations comprising:
  
  determining spelling data that indicates the spelling of a word;
  
  providing the spelling data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
  
  receiving output indicating a stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data as input;
  
  using the output of the trained recurrent neural network to generate pronunciation data indicating the stress pattern for a pronunciation of the word; and
  
  providing the pronunciation data to a text-to-speech system or an automatic speech recognition system.
  - 13. The system of claim 12, wherein:
    - providing the spelling data as input to the trained recurrent neural network comprises providing the spelling data as input to a trained long short-term memory recurrent neural network; and
      
      receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving the output indicating the stress pattern for pronunciation of the word generated by the trained long short-term memory recurrent neural network in response to providing the spelling data as input.
  - 14. The system of claim 12, wherein using the output of the trained recurrent neural network to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the pronunciation that indicates at least one primary stress location.
  - 15. The system of claim 12, wherein using the output of the trained recurrent neural network to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises using the output to generate the pronunciation that indicates a sequence of phones for the word with stress and syllable divisions and stress values.
  - 16. The system of claim 12, the operations comprising determining pronunciation data that indicates at least one stress location for the word, wherein:
    - providing the spelling data as input to the trained recurrent neural network comprises providing the spelling data and the pronunciation data as input to the trained recurrent neural network; and
      
      receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data and the pronunciation data as input.
  - 17. The system of claim 12, wherein:
    - providing the spelling data as input to the trained recurrent neural network comprises providing a plurality of input vectors for the spelling data as input to the trained recurrent neural network, each of the plurality of input vectors indicating a particular character from the spelling data or filler; and
      
      receiving the output indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network comprises receiving a plurality of output vectors that each indicate a probability distribution over a set of symbols, a combination of the plurality of output vectors indicating the stress pattern for pronunciation of the word generated by the trained recurrent neural network.
  - 18. The system of claim 17, wherein:
    - providing the plurality of input vectors comprises providing a predetermined number of input vectors to the trained recurrent neural network as the input; and
      
      receiving the plurality of output vectors comprises receiving the predetermined number of output vectors from the trained recurrent neural network as the output.
  - 19. The system of claim 12, wherein using the output of the trained recurrent neural network to generate the pronunciation data indicating the stress pattern for the pronunciation of the word comprises applying one or more constraints to the output to generate the pronunciation data indicating the stress pattern for the pronunciation of the word.
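
Claims 8, 9, and 19 recite applying constraints to the network output, with claim 9 naming beam search as one way to find the most likely path that satisfies them. The sketch below illustrates the idea under stated assumptions: the symbol set, the beam width, and the single constraint used here (exactly one primary stress per word) are hypothetical examples, since the claims do not say which constraints are applied.

```python
import math

# Hypothetical symbol set: unstressed, primary stress, secondary stress.
SYMBOLS = ["0", "1", "2"]


def constrained_beam_search(output_vectors, beam_width=4):
    """Return the most likely stress path with exactly one primary stress.

    output_vectors holds one probability distribution over SYMBOLS per
    position. Returns None if no surviving path meets the constraint.
    """
    beams = [("", 0.0)]  # (partial path, log-probability)
    for dist in output_vectors:
        candidates = []
        for path, score in beams:
            for symbol, prob in zip(SYMBOLS, dist):
                if prob <= 0.0:
                    continue  # skip impossible symbols (log undefined)
                new_path = path + symbol
                # Prune early: a second primary stress can never be valid.
                if new_path.count("1") > 1:
                    continue
                candidates.append((new_path, score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    valid = [(p, s) for p, s in beams if p.count("1") == 1]
    if not valid:
        return None
    return max(valid, key=lambda item: item[1])[0]


# Taking the top symbol at each position would give "00" (no primary
# stress); the constrained search picks a valid path instead.
pattern = constrained_beam_search([[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]])
```

Pruning invalid prefixes inside the loop keeps beam slots for paths that can still satisfy the constraint. A finite state transducer network, as in claim 10, is another way to express the same kind of restriction.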

20. A non-transitory computer readable storage medium storing instructions executable by a data processing apparatus and upon such execution cause the data processing apparatus to perform operations comprising:
- determining spelling data that indicates the spelling of a word;
  
  providing the spelling data as input to a trained recurrent neural network, the trained recurrent neural network being trained to indicate characteristics of word pronunciations based at least on data indicating the spelling of words;
  
  receiving output indicating a stress pattern for pronunciation of the word generated by the trained recurrent neural network in response to providing the spelling data as input;
  
  using the output of the trained recurrent neural network to generate pronunciation data indicating the stress pattern for a pronunciation of the word; and
  
  providing the pronunciation data to a text-to-speech system or an automatic speech recognition system.
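
The five operations recited in claims 1, 12, and 20 can be read as a simple pipeline. The sketch below shows only that control flow, under stated assumptions: `PronunciationData`, `predict_pronunciation`, and `toy_model` are hypothetical names, the stress notation is invented for illustration, and `toy_model` is a fixed stand-in rather than a trained recurrent neural network.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

STRESS_SYMBOLS = "012"  # hypothetical: unstressed, primary, secondary


@dataclass
class PronunciationData:
    word: str
    stress_pattern: str  # assumed notation: one symbol per output position


# A model maps spelling data to one probability distribution per position.
StressModel = Callable[[str], Sequence[Sequence[float]]]


def predict_pronunciation(
    text: str,
    model: StressModel,
    consumer: Callable[[PronunciationData], None],
) -> PronunciationData:
    spelling = text.strip().lower()   # 1. determine spelling data
    output = model(spelling)          # 2-3. provide input, receive output
    pattern = "".join(                # 4. generate pronunciation data
        STRESS_SYMBOLS[max(range(len(dist)), key=dist.__getitem__)]
        for dist in output
    )
    data = PronunciationData(spelling, pattern)
    consumer(data)                    # 5. hand off to a TTS or ASR system
    return data


def toy_model(spelling: str):
    """NOT a trained network: fixed distributions to exercise the pipeline."""
    return [[0.1, 0.8, 0.1], [0.9, 0.05, 0.05]]


received = []  # stands in for the downstream TTS or ASR system
result = predict_pronunciation(" Hire ", toy_model, received.append)
```

Passing the model and the consumer as callables keeps the sketch neutral about which network and which downstream system are used, matching the claims' "text-to-speech system or an automatic speech recognition system" wording.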

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Chua, Mason Vijay; Rao, Kanury Kanishka; van Esch, Daniel Jacobus Josef

Granted Patent

US 10,255,905 B2

CPC Class Codes

G10L 13/0335   Pitch control

G10L 13/047   Architecture of speech synt...

G10L 13/08   Text analysis or generation...

G10L 13/10   Prosody rules derived from ...

G10L 15/02   Feature extraction for spee...

G10L 15/063   Training

G10L 15/16   using artificial neural net...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/187   Phonemic context, e.g. pron...

G10L 17/18   Artificial neural networks;...

G10L 2015/027   Syllables being the recogni...

G10L 25/30   using neural networks
