Multilingual prosody generation

US 9,195,656 B2
Filed: 12/30/2013
Issued: 11/24/2015
Est. Priority Date: 12/30/2013
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multilingual prosody generation. In some implementations, data indicating a set of linguistic features corresponding to a text is obtained. Data indicating the linguistic features and data indicating the language of the text are provided as input to a neural network that has been trained to provide output indicating prosody information for multiple languages. The neural network can be a neural network having been trained using speech in multiple languages. Output indicating prosody information for the linguistic features is received from the neural network. Audio data representing the text is generated using the output of the neural network.
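
The input arrangement described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the language inventory, the one-hot language encoding, the layer sizes, the random (untrained) weights, and the names `build_input` and `TinyProsodyNet` are all assumptions made for the example. It shows only the idea that a single network receives both the linguistic features and an indication of the language.

```python
import math
import random

LANGUAGES = ["en", "es", "fr"]  # hypothetical language inventory


def one_hot(index, size):
    """Return a list of `size` floats with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_input(linguistic_features, language):
    """Append a one-hot language identifier to the linguistic feature vector."""
    return list(linguistic_features) + one_hot(LANGUAGES.index(language), len(LANGUAGES))


class TinyProsodyNet:
    """One hidden layer with random weights; a stand-in for a trained network."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

    def forward(self, x):
        # tanh hidden layer followed by a linear output layer
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


features = [0.0, 1.0, 0.0, 1.0]  # made-up binary linguistic feature flags
x = build_input(features, "es")
net = TinyProsodyNet(n_in=len(x), n_hidden=8, n_out=3)
# Three outputs, read here as duration, energy, and one F0 coefficient (an assumption)
duration, energy, f0_coeff = net.forward(x)
```

Because the language indicator is part of the input vector, the same weights serve every language, and the same linguistic features can map to different prosody depending on the language flag.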

20 Claims

1. A method performed by data processing apparatus, the method comprising:
- obtaining data indicating a set of linguistic features corresponding to a text;
  
  providing (i) data indicating the linguistic features and (ii) data indicating the language of the text as input to a neural network that has been trained to provide output indicating prosody information for multiple languages, the neural network having been trained using speech in multiple languages;
  
  receiving, from the neural network, output indicating prosody information for the linguistic features; and
  
  generating audio data representing the text using the output of the neural network.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the text is a first text in a first language;
    - wherein the method further comprises:
      
      obtaining data indicating a set of second linguistic features corresponding to a second text in a second language that is different from the first language;
      
      providing (i) data indicating the second linguistic features and (ii) data indicating the language of the second text as input to the neural network that has been trained to provide output indicating prosody information for multiple languages;
      
      receiving, from the neural network, second output indicating prosody information for the second linguistic features; and
      
      generating an audio representation of the second text using the second output of the neural network.
  - 3. The method of claim 1, wherein receiving, from the neural network, output indicating prosody information comprises receiving data indicating one or more of a duration, an energy level, and one or more fundamental frequency coefficients.
  - 4. The method of claim 1, further comprising determining a linguistic group that includes a subset of the linguistic features in the set of linguistic features;
    - wherein providing data indicating the linguistic features to the neural network comprises providing data indicating the subset of linguistic features in the linguistic group as input to the neural network; and
      
      wherein receiving, from the neural network, output indicating prosody information for the linguistic features comprises receiving, from the neural network, output indicating prosody information for the linguistic group.
  - 5. The method of claim 4, wherein obtaining data indicating the set of linguistic features corresponding to the text comprises obtaining data indicating a sequence of linguistic features in a phonetic representation of the text;
    - and wherein determining the linguistic group comprises determining the linguistic group based on a position of one or more stressed linguistic features in the sequence of linguistic features.
  - 6. The method of claim 1, further comprising determining multiple linguistic groups within the set of linguistic features, each of the multiple linguistic groups including a different portion of the set of linguistic features;
    - wherein providing (i) data indicating the linguistic features and (ii) data indicating the language of the text as input to the neural network comprises providing, for each of the multiple linguistic groups, data indicating the linguistic features in the linguistic group and data indicating the language of the text;
      
      wherein receiving, from the neural network, the output indicating prosody information for the linguistic features comprises receiving, from the neural network, a set of output indicating prosody information for each of the multiple linguistic groups; and
      
      wherein generating the audio data representing the text using the output of the neural network comprises:
      
      using the output of the neural network to determine a fundamental frequency contour for each of the multiple linguistic groups;
      
      concatenating the fundamental frequency contours for the multiple linguistic groups to generate a continuous fundamental frequency contour for the text; and
      
      generating the audio representation using the continuous fundamental frequency contour.
  - 7. The method of claim 1, wherein generating the audio data representing the text using the output of the neural network comprises selecting one or more recorded speech samples based on the output of the neural network.
  - 8. The method of claim 1, wherein generating the audio data representing the text using the output of the neural network comprises generating the audio representation using the output of the neural network and audio coefficients representing synthesized speech characteristics.
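
The grouping and contour steps recited in claims 5 and 6 can be sketched as below. This is a hedged illustration under stated assumptions: the claims do not specify how stress positions define a group or how fundamental frequency coefficients map to a contour, so the example assumes that each stressed item starts a new group and that the coefficients are polynomial coefficients over normalized time. The function names and the numeric values are hypothetical.

```python
def split_into_groups(stress_flags):
    """Return (start, end) index pairs; each stressed item starts a new group (assumption)."""
    starts = [i for i, s in enumerate(stress_flags) if s]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # items before the first stress form their own group
    ends = starts[1:] + [len(stress_flags)]
    return list(zip(starts, ends))


def contour_from_coefficients(coeffs, n_points):
    """Evaluate a polynomial in normalized time t in [0, 1] at n_points points."""
    contour = []
    for k in range(n_points):
        t = k / (n_points - 1) if n_points > 1 else 0.0
        contour.append(sum(c * t ** p for p, c in enumerate(coeffs)))
    return contour


def concatenate_contours(contours):
    """Join per-group contours into one continuous contour for the whole text."""
    joined = []
    for c in contours:
        joined.extend(c)
    return joined


stress = [0, 1, 0, 0, 1, 0]  # made-up stress flags for six items
groups = split_into_groups(stress)

# Hypothetical per-group coefficients, standing in for the network output (Hz)
per_group_coeffs = [[120.0, 10.0], [130.0, -20.0], [110.0, 0.0]]

# One contour point per item in the group, purely for illustration
f0 = concatenate_contours(
    contour_from_coefficients(coeffs, end - start)
    for coeffs, (start, end) in zip(per_group_coeffs, groups)
)
```

A practical system would likely also smooth the joins between groups; the claims mention only concatenation, so none is shown here.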

9. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  obtaining data indicating a set of linguistic features corresponding to a text;
  
  providing (i) data indicating the linguistic features and (ii) data indicating the language of the text as input to a neural network that has been trained to provide output indicating prosody information for multiple languages, the neural network having been trained using speech in multiple languages;
  
  receiving, from the neural network, output indicating prosody information for the linguistic features; and
  
  generating audio data representing the text using the output of the neural network.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The system of claim 9, wherein the text is a first text in a first language;
    - wherein the operations further comprise:
      
      obtaining data indicating a set of second linguistic features corresponding to a second text in a second language that is different from the first language;
      
      providing (i) data indicating the second linguistic features and (ii) data indicating the language of the second text as input to the neural network that has been trained to provide output indicating prosody information for multiple languages;
      
      receiving, from the neural network, second output indicating prosody information for the second linguistic features; and
      
      generating an audio representation of the second text using the second output of the neural network.
  - 11. The system of claim 9, wherein receiving, from the neural network, output indicating prosody information comprises receiving data indicating one or more of a duration, an energy level, and one or more fundamental frequency coefficients.
  - 12. The system of claim 9, wherein the operations further comprise determining a linguistic group that includes a subset of the linguistic features in the set of linguistic features;
    - wherein providing data indicating the linguistic features to the neural network comprises providing data indicating the subset of linguistic features in the linguistic group as input to the neural network; and
      
      wherein receiving, from the neural network, output indicating prosody information for the linguistic features comprises receiving, from the neural network, output indicating prosody information for the linguistic group.
  - 13. The system of claim 12, wherein obtaining data indicating the set of linguistic features corresponding to the text comprises obtaining data indicating a sequence of linguistic features in a phonetic representation of the text;
    - and wherein determining the linguistic group comprises determining the linguistic group based on a position of one or more stressed linguistic features in the sequence of linguistic features.
  - 14. The system of claim 9, wherein the operations further comprise determining multiple linguistic groups within the set of linguistic features, each of the multiple linguistic groups including a different portion of the set of linguistic features;
    - wherein providing (i) data indicating the linguistic features and (ii) data indicating the language of the text as input to the neural network comprises providing, for each of the multiple linguistic groups, data indicating the linguistic features in the linguistic group and data indicating the language of the text;
      
      wherein receiving, from the neural network, the output indicating prosody information for the linguistic features comprises receiving, from the neural network, a set of output indicating prosody information for each of the multiple linguistic groups; and
      
      wherein generating the audio data representing the text using the output of the neural network comprises:
      
      using the output of the neural network to determine a fundamental frequency contour for each of the multiple linguistic groups;
      
      concatenating the fundamental frequency contours for the multiple linguistic groups to generate a continuous fundamental frequency contour for the text; and
      
      generating the audio representation using the continuous fundamental frequency contour.
  - 15. The system of claim 9, wherein generating the audio data representing the text using the output of the neural network comprises selecting one or more recorded speech samples based on the output of the neural network.
  - 16. The system of claim 9, wherein generating the audio data representing the text using the output of the neural network comprises generating the audio representation using the output of the neural network and audio coefficients representing synthesized speech characteristics.
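
Claims 7 and 15 recite selecting recorded speech samples based on the network output. One way this could work is sketched below, assuming a simple weighted absolute-difference cost between the predicted prosody and each stored sample. The cost function, the field names (`duration`, `energy`, `f0`), the weights, and the candidate data are all hypothetical; the claims do not state how the selection is made.

```python
def select_sample(target, candidates, weights=(1.0, 1.0, 1.0)):
    """Return the candidate whose prosody is closest to the predicted target."""
    keys = ("duration", "energy", "f0")

    def cost(candidate):
        # Weighted sum of absolute differences over the three prosody fields
        return sum(w * abs(candidate[k] - target[k]) for w, k in zip(weights, keys))

    return min(candidates, key=cost)


# Predicted prosody for one unit (made-up values: seconds, relative energy, Hz)
target = {"duration": 0.12, "energy": 0.5, "f0": 120.0}

# Made-up inventory of recorded samples for the same unit
candidates = [
    {"id": "a", "duration": 0.10, "energy": 0.6, "f0": 118.0},
    {"id": "b", "duration": 0.20, "energy": 0.5, "f0": 150.0},
    {"id": "c", "duration": 0.12, "energy": 0.4, "f0": 121.0},
]

best = select_sample(target, candidates)
```

With equal weights the F0 term dominates because it is measured in Hz; a real system would probably normalize each field or tune the weights.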

17. A computer-readable storage device storing a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- obtaining data indicating a set of linguistic features corresponding to a text;
  
  providing (i) data indicating the linguistic features and (ii) data indicating the language of the text as input to a neural network that has been trained to provide output indicating prosody information for multiple languages, the neural network having been trained using speech in multiple languages;
  
  receiving, from the neural network, output indicating prosody information for the linguistic features; and
  
  generating audio data representing the text using the output of the neural network.
- Dependent claims (18, 19, 20):
  - 18. The computer-readable storage device of claim 17, wherein the text is a first text in a first language;
    - wherein the operations further comprise:
      
      obtaining data indicating a set of second linguistic features corresponding to a second text in a second language that is different from the first language;
      
      providing (i) data indicating the second linguistic features and (ii) data indicating the language of the second text as input to the neural network that has been trained to provide output indicating prosody information for multiple languages;
      
      receiving, from the neural network, second output indicating prosody information for the second linguistic features; and
      
      generating an audio representation of the second text using the second output of the neural network.
  - 19. The computer-readable storage device of claim 17, wherein receiving, from the neural network, output indicating prosody information comprises receiving data indicating one or more of a duration, an energy level, and one or more fundamental frequency coefficients.
  - 20. The computer-readable storage device of claim 17, wherein the operations further comprise determining a linguistic group that includes a subset of the linguistic features in the set of linguistic features;
    - wherein providing data indicating the linguistic features to the neural network comprises providing data indicating the subset of linguistic features in the linguistic group as input to the neural network; and
      
      wherein receiving, from the neural network, output indicating prosody information for the linguistic features comprises receiving, from the neural network, output indicating prosody information for the linguistic group.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Fructuoso, Javier Gonzalvo; Senior, Andrew W.; Chun, Byungha
Primary Examiner(s): JACKSON, JAKIEDA R
Application Number: US14/143,627
Publication Number: US 20150186359A1
Time in Patent Office: 694 Days
Field of Search: 704/260
US Class Current: 1/1

CPC Class Codes

G06F 40/58   Use of machine translation,...
G10L 13/07   Concatenation rules
G10L 13/08   Text analysis or generation...
G10L 13/086   Detection of language
G10L 13/10   Prosody rules derived from ...
G10L 25/30   using neural networks
