Multilingual prosody generation

US 9,905,220 B2
Filed: 11/16/2015
Issued: 02/27/2018
Est. Priority Date: 12/30/2013
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multilingual prosody generation. In some implementations, data indicating a set of linguistic features corresponding to a text is obtained. Data indicating the linguistic features and data indicating the language of the text are provided as input to a neural network that has been trained to provide output indicating prosody information for multiple languages. The neural network can be a neural network having been trained using speech in multiple languages. Output indicating prosody information for the linguistic features is received from the neural network. Audio data representing the text is generated using the output of the neural network.
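
As a rough illustration of the flow the abstract describes, the sketch below appends a one-hot language identifier to a vector of linguistic features and passes the result through a tiny feedforward network. Everything in it is hypothetical: the language list, the feature values, the layer sizes, and the untrained placeholder weights are assumptions for illustration and are not taken from the patent.

```python
import math
import random

LANGUAGES = ["en", "es", "ko"]  # hypothetical set of training languages


def one_hot_language(lang: str) -> list[float]:
    """Encode the language identifier as a one-hot vector."""
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language: {lang}")
    return [1.0 if lang == known else 0.0 for known in LANGUAGES]


def build_input(features: list[float], lang: str) -> list[float]:
    """Concatenate linguistic features with the language identifier."""
    return list(features) + one_hot_language(lang)


def make_layer(n_in: int, n_out: int, rng: random.Random) -> list[list[float]]:
    """Placeholder weights; a real system would load trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def forward(x: list[float], hidden: list[list[float]], out: list[list[float]]) -> list[float]:
    """One tanh hidden layer followed by a linear output layer."""
    h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in hidden]
    return [sum(w * v for w, v in zip(row, h)) for row in out]


rng = random.Random(0)
features = [0.2, 0.7, 0.1, 0.0]   # hypothetical linguistic features
x = build_input(features, "es")
hidden = make_layer(len(x), 8, rng)
out = make_layer(8, 3, rng)       # assumed outputs: duration, energy, F0 coefficient
prosody = forward(x, hidden, out)
print(len(x), len(prosody))
```

Because the language identifier is part of the input vector, the same weights serve every language, and only the identifier changes between requests.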

20 Claims

1. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  accessing, by the one or more computers, a neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages;
  
  providing, by the one or more computers, input to the neural network that includes (i) a representation of a text in a first language and (ii) a language identifier for the first language;
  
  generating, by the one or more computers, audio data for a synthesized utterance of the text in the first language based on prosody information for the text that is output by the neural network in response to receiving the representation of the text and the language identifier for the first language; and
  
  providing, by the one or more computers, the audio data for the synthesized utterance of the text in the first language.
  - 2. The system of claim 1, wherein providing the input to the neural network comprises:
    - providing, by the one or more computers, input to the neural network that includes a sequence of phonetic units that form a phonetic representation of the text in the first language.
  - 3. The system of claim 2, wherein the prosody information for the text that is output by the neural network indicates, for each phonetic unit in the sequence of phonetic units that form the phonetic representation of the text in the first language, a duration, an energy level, or a fundamental frequency coefficient.
  - 4. The system of claim 2, wherein the operations further comprise identifying one or more groups of phonetic units from among the sequence of phonetic units that form the phonetic representation of the text in the first language;
    - and wherein providing, by the one or more computers, input to the neural network, comprises providing, by the one or more computers, input to the neural network that includes (i) the sequence of phonetic units that form the phonetic representation of the text in the first language, (ii) the language identifier for the first language, and (iii) data indicating the one or more groups of phonetic units.
  - 5. The system of claim 4, wherein generating, by the one or more computers, audio data for the synthesized utterance of the text in the first language comprises:
    - using the prosody information output by the neural network to determine a fundamental frequency contour for each of the one or more groups of phonetic units;
      
      concatenating the fundamental frequency contours for the one or more groups of phonetic units to generate a continuous fundamental frequency contour for the text; and
      
      generating audio data for the synthesized utterance of the text in the first language using the continuous fundamental frequency contour.
  - 6. The system of claim 4, wherein the operations further comprise identifying one or more phonetic units that represent stressed sounds in the sequence of phonetic units that form the phonetic representation of the text in the first language;
    - and wherein identifying the one or more groups of phonetic units comprises identifying the one or more groups of phonetic units based on positions of the one or more phonetic units that represent stressed sounds within the sequence of phonetic units that form the phonetic representation of the text in the first language.
  - 7. The system of claim 1, wherein generating, by the one or more computers, audio data for the synthesized utterance of the text in the first language comprises:
    - selecting multiple recorded speech samples based on the prosody information for the text that is output by the neural network; and
      
      forming the synthesized utterance from the multiple recorded speech samples.
  - 8. The system of claim 1, wherein generating, by the one or more computers, audio data for the synthesized utterance of the text in the first language comprises generating the audio data using the prosody information for the text that is output by the neural network and audio coefficients representing synthesized speech characteristics.
  - 9. The system of claim 1, wherein:
    - accessing, by the one or more computers, the neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages comprises:
      
      accessing, by the one or more computers, a neural network that has been trained, using training data for each of multiple languages that includes, for each of the multiple languages, (i) at least one sample of speech in the language and (ii) a language identifier that indicates the language of the at least one sample of speech; and
      
      providing the language identifier for the first language as input to the neural network comprises:
      
      providing, to the neural network, the same language identifier for the first language that was provided to the neural network to identify the first language during training of the neural network.
  - 10. The system of claim 1, wherein the operations further comprise:
    - providing, by the one or more computers, input to the neural network that includes (i) a representation of a second text in a second language that is different than the first language and (ii) a language identifier for the second language that is different from the language identifier for the first language;
      
      generating, by the one or more computers, audio data for a synthesized utterance of the second text in the second language based on prosody information for the second text that is output by the neural network in response to receiving the representation of the second text and the language identifier for the second language; and
      
      providing, by the one or more computers, the audio data for the synthesized utterance of the second text in the second language.
  - 11. The system of claim 10, wherein:
    - providing, by the one or more computers, input to the neural network that includes the representation of the text in the first language, comprises:
      
      providing, by the one or more computers, input to the neural network that includes a first sequence of phonetic units selected from a phonetic alphabet to form a phonetic representation of the text in the first language; and
      
      providing, by the one or more computers, input to the neural network that includes the representation of the second text in the second language that is different than the first language, comprises:
      
      providing, by the one or more computers, input to the neural network that includes a second sequence of phonetic units selected from the phonetic alphabet to form a phonetic representation of the second text in the second language.
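
The grouping and contour steps of claims 4 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it assumes a new group starts at every stressed unit, that the network supplies start and end F0 values per group, and that each group contour is a straight line. The function names, the target values, and the frame count are all hypothetical.

```python
def group_by_stress(stressed: list[bool]) -> list[tuple[int, int]]:
    """Split unit indices into half-open [start, end) groups.

    Assumption: a new group starts at each stressed unit; units before
    the first stressed unit form their own leading group.
    """
    starts = sorted({0} | {i for i, s in enumerate(stressed) if s})
    ends = starts[1:] + [len(stressed)]
    return [(a, b) for a, b in zip(starts, ends) if b > a]


def group_contour(start_hz: float, end_hz: float, n_frames: int) -> list[float]:
    """Linearly interpolate a contour for one group (an illustrative choice)."""
    if n_frames == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_frames - 1)
    return [start_hz + step * i for i in range(n_frames)]


def concatenate_contours(contours: list[list[float]]) -> list[float]:
    """Join per-group contours end to end into one contour for the text."""
    joined: list[float] = []
    for contour in contours:
        joined.extend(contour)
    return joined


# Hypothetical stress flags for a six-unit phonetic sequence.
stressed = [False, True, False, False, True, False]
groups = group_by_stress(stressed)

# Hypothetical per-group (start, end) F0 targets in Hz.
targets = [(110.0, 120.0), (120.0, 100.0), (100.0, 90.0)]
frames_per_unit = 2  # assumed fixed frame count per phonetic unit

contours = [
    group_contour(lo, hi, (end - start) * frames_per_unit)
    for (start, end), (lo, hi) in zip(groups, targets)
]
f0 = concatenate_contours(contours)
print(groups, len(f0))
```

Here adjacent groups share endpoint values, so the joined contour has no jumps. A real system would likely need some smoothing at group boundaries, which this sketch does not attempt.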

12. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- accessing, by the one or more computers, a neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages;
  
  providing, by the one or more computers, input to the neural network that includes (i) a representation of a text in a first language and (ii) a language identifier for the first language;
  
  generating, by the one or more computers, audio data for a synthesized utterance of the text in the first language based on prosody information for the text that is output by the neural network in response to receiving the representation of the text and the language identifier for the first language; and
  
  providing, by the one or more computers, the audio data for the synthesized utterance of the text in the first language.
  - 13. The computer program product of claim 12, wherein providing the input to the neural network comprises:
    - providing, by the one or more computers, input to the neural network that includes a sequence of phonetic units that form a phonetic representation of the text in the first language.
  - 14. The computer program product of claim 13, wherein the prosody information for the text that is output by the neural network indicates, for each phonetic unit in the sequence of phonetic units that form the phonetic representation of the text in the first language, a duration, an energy level, or a fundamental frequency coefficient.
  - 15. The computer program product of claim 13, wherein the operations further comprise identifying one or more groups of phonetic units from among the sequence of phonetic units that form the phonetic representation of the text in the first language;
    - and wherein providing, by the one or more computers, input to the neural network, comprises providing, by the one or more computers, input to the neural network that includes (i) the sequence of phonetic units that form the phonetic representation of the text in the first language, (ii) the language identifier for the first language, and (iii) data indicating the one or more groups of phonetic units.
  - 16. The computer program product of claim 15, wherein generating, by the one or more computers, audio data for the synthesized utterance of the text in the first language comprises:
    - using the prosody information output by the neural network to determine a fundamental frequency contour for each of the one or more groups of phonetic units;
      
      concatenating the fundamental frequency contours for the one or more groups of phonetic units to generate a continuous fundamental frequency contour for the text; and
      
      generating audio data for the synthesized utterance of the text in the first language using the continuous fundamental frequency contour.
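
Claim 14 describes prosody information given per phonetic unit. One way to picture that is a record per unit, as in the sketch below. The layout is an assumption: the claim lists a duration, an energy level, or a fundamental frequency coefficient, whereas this sketch assumes the network emits all three for every unit in a flat vector. The `UnitProsody` type, the units, and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitProsody:
    unit: str
    duration_ms: float
    energy: float
    f0_coefficient: float


def decode_prosody(units: list[str], flat: list[float]) -> list[UnitProsody]:
    """Split a flat output vector into one record per phonetic unit.

    Assumption: the output holds three values per unit, in the order
    (duration, energy, F0 coefficient).
    """
    if len(flat) != 3 * len(units):
        raise ValueError("expected three output values per phonetic unit")
    return [
        UnitProsody(unit, *flat[3 * i : 3 * i + 3])
        for i, unit in enumerate(units)
    ]


units = ["h", "o", "l", "a"]  # hypothetical phonetic units
flat = [60.0, 0.4, 0.1, 120.0, 0.9, 0.3, 70.0, 0.6, 0.2, 140.0, 0.8, -0.1]
records = decode_prosody(units, flat)
total_ms = sum(r.duration_ms for r in records)
print(records[1], total_ms)
```

Keeping the values attached to their unit makes later steps, such as summing durations to get the utterance length, straightforward.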

17. A computer-implemented method comprising:
- accessing, by one or more computers, a neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages;
  
  providing, by the one or more computers, input to the neural network that includes (i) a representation of a text in a first language and (ii) a language identifier for the first language;
  
  generating, by the one or more computers, audio data for a synthesized utterance of the text in the first language based on prosody information for the text that is output by the neural network in response to receiving the representation of the text and the language identifier for the first language; and
  
  providing, by the one or more computers, the audio data for the synthesized utterance of the text in the first language.
  - 18. The method of claim 17, wherein providing the input to the neural network comprises:
    - providing, by the one or more computers, input to the neural network that includes a sequence of phonetic units that form a phonetic representation of the text in the first language.
  - 19. The method of claim 18, wherein the prosody information for the text that is output by the neural network indicates, for each phonetic unit in the sequence of phonetic units that form the phonetic representation of the text in the first language, a duration, an energy level, or a fundamental frequency coefficient.
  - 20. The method of claim 18, further comprising identifying one or more groups of phonetic units from among the sequence of phonetic units that form the phonetic representation of the text in the first language;
    - and wherein providing, by the one or more computers, input to the neural network, comprises providing, by the one or more computers, input to the neural network that includes (i) the sequence of phonetic units that form the phonetic representation of the text in the first language, (ii) the language identifier for the first language, and (iii) data indicating the one or more groups of phonetic units.
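
Claim 20 names three parts of the network input: the phonetic unit sequence, the language identifier, and data indicating the groups. The sketch below bundles them in one structure and derives a row per unit. It is only one possible representation. The `NetworkInput` type is hypothetical, and the requirement that groups tile the whole sequence without gaps is an assumption of this sketch, not something the claim states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkInput:
    phonetic_units: tuple[str, ...]
    language_id: str
    groups: tuple[tuple[int, int], ...]  # half-open [start, end) index ranges

    def __post_init__(self) -> None:
        # Assumed constraint: groups tile the sequence in order, no gaps or overlaps.
        expected_start = 0
        for start, end in self.groups:
            if start != expected_start or end <= start:
                raise ValueError("groups must be contiguous and non-empty")
            expected_start = end
        if expected_start != len(self.phonetic_units):
            raise ValueError("groups must cover every phonetic unit")

    def rows(self) -> list[tuple[str, str, int, int]]:
        """One row per unit: (unit, language_id, group index, position in group)."""
        return [
            (self.phonetic_units[i], self.language_id, g, i - start)
            for g, (start, end) in enumerate(self.groups)
            for i in range(start, end)
        ]


example = NetworkInput(("o", "l", "a"), "es", ((0, 1), (1, 3)))
for row in example.rows():
    print(row)
```

Each row carries the group index and the position within the group, so group membership reaches the network alongside the unit and the language identifier.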

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Fructuoso, Javier Gonzalvo; Senior, Andrew W.; Chun, Byungha
Primary Examiner(s): JACKSON, JAKIEDA R
Application Number: US14/942,300
Publication Number: US 20160071512A1
Time in Patent Office: 834 Days
Field of Search: 704/260
CPC Class Codes:
- G06F 40/58   Use of machine translation,...
- G10L 13/07   Concatenation rules
- G10L 13/08   Text analysis or generation...
- G10L 13/086   Detection of language
- G10L 13/10   Prosody rules derived from ...
- G10L 25/30   using neural networks
