Method and system for text-to-speech synthesis with personalized voice

US 9,368,102 B2
Filed: 10/10/2014
Issued: 06/14/2016
Est. Priority Date: 03/20/2007
Status: Active Grant

Abstract

A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input (403) of speech in the form of an audio communication from an input speaker (401) and generating a voice dataset (404) for the input speaker (401). The method includes receiving a text input (411) at the same device as the audio input (403) and synthesizing (312) the text from the text input (411) to synthesized speech including using the voice dataset (404) to personalize the synthesized speech to sound like the input speaker (401). In addition, the method includes analyzing (316) the text for expression and adding the expression (315) to the synthesized speech. The audio communication may be part of a video communication (453) and the audio input (403) may have an associated visual input (455) of an image of the input speaker. The synthesis from text may include providing a synthesized image personalized to look like the image of the input speaker with expressions added from the visual input (455).
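The flow described in the abstract (collect incidental speech, build a voice dataset, then synthesize later text in that voice) can be sketched roughly as follows. This is an illustration only, not the patented implementation: the names `VoiceDataset`, `SynthesizedSpeech`, and `FirstDevice` are hypothetical, audio is treated as opaque bytes, and no actual speech synthesis is performed.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceDataset:
    """Hypothetical container for speech collected from one speaker."""
    operator_id: str
    segments: list = field(default_factory=list)  # raw audio chunks (bytes)

    def total_bytes(self) -> int:
        return sum(len(segment) for segment in self.segments)


@dataclass
class SynthesizedSpeech:
    """Placeholder result: records which voice data was used, not real audio."""
    operator_id: str
    text: str
    voice_bytes_used: int


class FirstDevice:
    """Receiving device that accumulates voice data and later handles text."""

    def __init__(self):
        self.datasets = {}  # operator_id -> VoiceDataset

    def receive_audio(self, operator_id: str, audio: bytes) -> None:
        # Incidental audio from a call is added to the speaker's dataset.
        dataset = self.datasets.setdefault(operator_id, VoiceDataset(operator_id))
        dataset.segments.append(audio)

    def receive_text(self, operator_id: str, text: str) -> SynthesizedSpeech:
        # Text must arrive after audio: without a dataset there is no voice to use.
        dataset = self.datasets.get(operator_id)
        if dataset is None:
            raise LookupError(f"no voice dataset for operator {operator_id!r}")
        # A real system would run a synthesizer here; this only links text to voice.
        return SynthesizedSpeech(operator_id, text, dataset.total_bytes())
```

A caller would invoke `receive_audio` one or more times during a call and `receive_text` afterwards; a text message from a speaker with no stored audio raises `LookupError` in this sketch.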

20 Claims

1. A method for text-to-speech synthesis, comprising:
- receiving, at a first device and from a second device, incidental audio speech data over a first network communication link, wherein the incidental audio speech data comprises speech of an operator of the second device recorded during an audio communication in which the operator of the second device participates;
  
  generating, by the first device, a voice dataset for the operator based, at least in part, on the incidental audio speech data;
  
  receiving, at the first device, text data from the second device over a second network communication link subsequent to receiving the incidental audio speech data;
  
  converting, by the first device, the text data to synthesized speech, at least in part, using the voice dataset to personalize the synthesized speech to sound like the operator of the second device.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1, wherein personalizing the synthesized speech comprises training a concatenative text-to-speech synthesizer using the incidental audio speech data.
  - 3. The method of claim 1, further comprising:
    - identifying at least one emotion indicator transmitted with the text data; and
      
      adding expression to the synthesized speech based on the identified at least one emotion indicator.
  - 4. The method of claim 3, further comprising:
    - identifying paralinguistic elements in the incidental audio speech data;
      
      storing at least one of the paralinguistic elements;
      
      selecting a paralinguistic element from the stored paralinguistic elements based upon an identified emotion indicator transmitted with the text data; and
      
      adding the selected paralinguistic element to the synthesized speech.
  - 5. The method of claim 3, wherein an emotion indicator includes punctuation, letter case, an acronym, emotion icon, annotated text, or a key word.
  - 6. The method of claim 3, wherein an emotion indicator is included in metadata provided with the text data.
  - 7. The method of claim 1, further comprising storing an identifier for the operator in association with the voice dataset.
  - 8. The method of claim 1, further comprising transmitting from the first device the voice data set and/or the synthesized speech to a third device, wherein the first device is a server.
  - 9. The method of claim 1, further comprising:
    - storing at least one image of the operator; and
      
      synthesizing a dynamic image, based on the at least one image, to appear like the operator for display during reproduction of the synthesized speech.
  - 10. The method of claim 9, further comprising:
    - identifying at least one visual expression from a video of the operator;
      
      storing the at least one visual expression;
      
      identifying an emotion indicator transmitted with the text data;
      
      selecting a visual expression from the stored at least one visual expression based upon the identified emotion indicator; and
      
      adding the selected visual expression to the synthesized dynamic image.
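Claims 3 and 5 refer to emotion indicators such as punctuation, letter case, acronyms, emotion icons, and key words. A minimal sketch of how such indicators might be detected in text is shown below. The lookup tables, the emotion labels, and the function name `find_emotion_indicators` are all assumptions made for illustration; the claims name the indicator types but do not specify any particular mapping.

```python
import re

# Hypothetical lookup tables; the claims list indicator types, not these entries.
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}
ACRONYMS = {"lol": "amused", "omg": "surprised"}
KEYWORDS = {"sorry": "apologetic", "congratulations": "happy"}


def find_emotion_indicators(text: str) -> list:
    """Return (indicator_type, emotion) pairs found in the text."""
    found = []

    # Emotion icons: simple substring match against the known table.
    for icon, emotion in EMOTICONS.items():
        if icon in text:
            found.append(("emotion icon", emotion))

    # Acronyms and key words: case-insensitive word lookup.
    words = re.findall(r"[A-Za-z']+", text)
    for word in words:
        lowered = word.lower()
        if lowered in ACRONYMS:
            found.append(("acronym", ACRONYMS[lowered]))
        elif lowered in KEYWORDS:
            found.append(("key word", KEYWORDS[lowered]))

    # Letter case: any fully capitalized word that is not a known acronym.
    if any(len(w) > 1 and w.isupper() and w.lower() not in ACRONYMS for w in words):
        found.append(("letter case", "emphatic"))

    # Punctuation: an exclamation mark anywhere in the text.
    if "!" in text:
        found.append(("punctuation", "excited"))

    return found
```

The returned pairs could then drive the "adding expression" step, for example by adjusting prosody or selecting a stored expression for the detected emotion.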

11. A first communication device comprising:
- at least one processor; and
  
  memory elements, wherein the at least one processor is configured to:
  
  receive from a second communication device incidental audio speech data over a first network communication link, wherein the incidental audio speech data comprises speech of an operator of the second device recorded during an audio communication in which the operator of the second communication device participates;
  
  generate a voice dataset for the operator based, at least in part, on the incidental audio speech data;
  
  receive text data from the second communication device over a second network communication link subsequent to receiving the incidental audio speech data;
  
  convert the text data to synthesized speech, at least in part, using the voice dataset to personalize the synthesized speech to sound like the operator of the second device.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The first communication device of claim 11, wherein personalizing the synthesized speech comprises training a concatenative text-to-speech synthesizer using the incidental audio speech data.
  - 13. The first communication device of claim 11, wherein the at least one processor is further configured to:
    - identify at least one emotion indicator transmitted with the text data; and
      
      add expression to the synthesized speech based on the identified at least one emotion indicator.
  - 14. The first communication device of claim 13, wherein the at least one processor is further configured to:
    - identify paralinguistic elements in the incidental audio speech data;
      
      store at least one of the paralinguistic elements;
      
      select a first paralinguistic element from the stored paralinguistic elements based upon an identified emotion indicator transmitted with the text data; and
      
      add the first paralinguistic element to the synthesized speech.
  - 15. The first communication device of claim 13, wherein an emotion indicator includes punctuation, letter case, an acronym, emotion icon, annotated text, or a key word.
  - 16. The first communication device of claim 13, wherein an emotion indicator is included in metadata associated with the text data.
  - 17. The first communication device of claim 11, wherein the at least one processor is further configured to store an identifier for the operator in association with the voice dataset.
  - 18. The first communication device of claim 11, wherein the at least one processor is further configured to transmit the voice data set and/or the synthesized speech to a third communication device.
  - 19. The first communication device of claim 11, wherein the at least one processor is further configured to:
    - store at least one image of the operator; and
      
      synthesize a dynamic image, based on the at least one image, to appear like the operator for displaying on a visual display during reproduction of the synthesized speech.
  - 20. The first communication device of claim 19, wherein the at least one processor is further configured to:
    - identify at least one visual expression from a video of the operator;
      
      store the at least one visual expression;
      
      identify an emotion indicator transmitted with the text data;
      
      select a visual expression from the stored at least one visual expression based upon the identified emotion indicator; and
      
      add the selected visual expression to the synthesized dynamic image.
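Claims 4, 10, 14, and 20 share a store-then-select pattern: expressive elements (paralinguistic sounds or visual expressions) are captured from the operator, stored, and later chosen according to an emotion indicator that accompanies the text. The sketch below illustrates that pattern under simplifying assumptions: elements are represented by string identifiers, the emotion label is assumed to be known when an element is stored, and the most recently stored element is selected. `ExpressionStore` and `add_expression` are hypothetical names, not terms from the patent.

```python
from collections import defaultdict


class ExpressionStore:
    """Keeps captured expressive elements grouped by emotion label."""

    def __init__(self):
        self._by_emotion = defaultdict(list)

    def store(self, emotion: str, element: str) -> None:
        self._by_emotion[emotion].append(element)

    def select(self, emotion: str):
        # Assumption: prefer the most recently stored element for this emotion.
        items = self._by_emotion.get(emotion)
        return items[-1] if items else None


def add_expression(output_units: list, emotion: str, store: ExpressionStore) -> list:
    """Return a copy of the output with a matching element appended, if any."""
    element = store.select(emotion)
    if element is None:
        return list(output_units)
    return list(output_units) + [element]
```

The same structure serves both the audio case (for example, a stored laugh appended to synthesized speech) and the visual case (a stored facial expression applied to the synthesized image); only the element type differs.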


Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Goldberg, Itzhack; Hoory, Ron; Mizrachi, Boaz; Kons, Zvi
Primary Examiner(s)
He, Jialong

Application Number
US14/511,458
Publication Number
US 20150025891A1
Time in Patent Office
613 Days
Field of Search
704/258
US Class Current
1/1
CPC Class Codes
G10L 13/00   Speech synthesis; Text to s...
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/04   Details of speech synthesis...
