METHOD AND SYSTEM FOR TEXT-TO-SPEECH SYNTHESIS WITH PERSONALIZED VOICE

US 20080235024A1
Filed: 03/20/2007
Published: 09/25/2008
Est. Priority Date: 03/20/2007
Status: Active Grant

8 Assignments

0 Petitions

Abstract

A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input (403) of speech in the form of an audio communication from an input speaker (401) and generating a voice dataset (404) for the input speaker (401). The method includes receiving a text input (411) at the same device as the audio input (403) and synthesizing (312) the text from the text input (411) to synthesized speech including using the voice dataset (404) to personalize the synthesized speech to sound like the input speaker (401). In addition, the method includes analyzing (316) the text for expression and adding the expression (315) to the synthesized speech. The audio communication may be part of a video communication (453) and the audio input (403) may have an associated visual input (455) of an image of the input speaker. The synthesis from text may include providing a synthesized image personalized to look like the image of the input speaker with expressions added from the visual input (455).

35 Claims

1. A method for text-to-speech synthesis with personalized voice, comprising:
- receiving an incidental audio input of speech in the form of an audio communication from an input speaker and generating a voice dataset for the input speaker;
- receiving a text input at a same device as the audio input;
- synthesizing the text from the text input to synthesized speech including using the voice dataset to personalize the synthesized speech to sound like the input speaker.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method as claimed in claim 1, wherein personalizing the synthesized speech includes training a concatenative synthetic voice to sound like the input speaker by using a voice morphing transformation.
  - 3. The method as claimed in claim 1, wherein the audio input of speech has an associated visual input of an image of the input speaker and the method includes generating an image dataset, and wherein synthesizing to synthesized speech includes synthesizing an associated synthesized image, including using the image dataset to personalize the synthesized image to look like the input speaker image.
  - 4. The method as claimed in claim 1, including:
    - analyzing the text for expression;
    - adding the expression to the synthesized speech.
  - 5. The method as claimed in claim 4, including:
    - storing paralinguistic expression elements from the audio input of speech;
    - adding the paralinguistic expression elements to the personalized synthesized speech.
  - 6. The method as claimed in claim 4, including:
    - storing visual expressions from the visual input; and
    - adding the visual expressions to the personalized synthesized image.
  - 7. The method as claimed in claim 4, wherein analyzing the text includes identifying one or more of the group of: punctuation, letter case, paralinguistic elements, acronyms, emotion icons, and key words.
  - 8. The method as claimed in claim 4, wherein metadata is provided in association with text elements to indicate the expression.
  - 9. The method as claimed in claim 4, wherein the text is annotated to indicate the expression.
  - 10. The method as claimed in claim 1, wherein the device is one of the group of: an instant messaging client system, a mobile communication device, a broadcasting device, all with both audio and text capabilities.
  - 11. The method as claimed in claim 1, wherein an identifier of the source of the audio input is stored in association with the voice dataset and the voice dataset is used in synthesis of text inputs from the same source.
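
The bookkeeping implied by claims 1 and 11 (collect incidental speech per source, then reuse that speaker's voice dataset when text arrives from the same source) can be pictured with a small sketch. It is an illustration under stated assumptions, not the implementation disclosed in the specification: the names `VoiceDataset`, `VoiceStore`, `on_audio`, and `on_text` are hypothetical, the "voice dataset" is reduced to a list of raw sample chunks, and synthesis is a stub that only reports which voice would be selected.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceDataset:
    """Hypothetical container for speech collected from one input speaker."""
    speaker_id: str
    chunks: list = field(default_factory=list)  # raw audio chunks (stand-in for real voice features)

    def seconds(self, sample_rate: int) -> float:
        """Total duration of collected audio at the given sample rate."""
        return sum(len(c) for c in self.chunks) / sample_rate


class VoiceStore:
    """Keeps one voice dataset per source identifier on the receiving device."""

    def __init__(self):
        self._by_source = {}

    def on_audio(self, source_id, samples):
        """Called for incidental audio, e.g. during an ordinary voice call."""
        ds = self._by_source.setdefault(source_id, VoiceDataset(source_id))
        ds.chunks.append(list(samples))
        return ds

    def on_text(self, source_id, text):
        """Called for a text input from a source; returns a stub synthesis plan."""
        ds = self._by_source.get(source_id)
        voice = f"personalized:{source_id}" if ds and ds.chunks else "default"
        return {"text": text, "voice": voice}


store = VoiceStore()
store.on_audio("alice", [0.0] * 16000)            # one second of audio at 16 kHz
plan_known = store.on_text("alice", "See you at noon.")
plan_unknown = store.on_text("bob", "Hi!")
```

In this sketch a text input from a source with no collected audio falls back to a default voice; that fallback is a design assumption of the example and is not stated in the claims.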

12. A method for text-to-speech synthesis with personalized voice, comprising:
- receiving an audio input of speech from an input speaker and generating a voice dataset for the input speaker;
- receiving a text input at a same device as the audio input;
- analyzing the text for expression;
- synthesizing the text from the text input to synthesized speech including using the voice dataset to personalize the synthesized speech to sound like the input speaker and adding expression in the personalized synthesized speech.
- Dependent claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
  - 13. The method as claimed in claim 12, wherein the audio input of speech is incidental at a device.
  - 14. The method as claimed in claim 12, including training a concatenative synthetic voice to sound like the input speaker including a voice morphing transformation.
  - 15. The method as claimed in claim 12, wherein the audio input of speech has an associated visual input of an image of the input speaker and the method includes generating an image dataset, and wherein synthesizing to synthesized speech includes synthesizing an associated synthesized image, including using the image dataset to personalize the synthesized image to look like the input speaker image.
  - 16. The method as claimed in claim 12, including:
    - storing paralinguistic expression elements from the audio input of speech;
    - adding the paralinguistic expression elements to the personalized synthesized speech.
  - 17. The method as claimed in claim 15, including:
    - storing visual expressions from the visual input; and
    - adding the visual expressions to the personalized synthesized image.
  - 18. The method as claimed in claim 12, wherein analyzing the text includes identifying one or more of the group of: punctuation, letter case, paralinguistic elements, acronyms, emotion icons, and key words.
  - 19. The method as claimed in claim 12, wherein metadata is provided in association with text elements to indicate the expression.
  - 20. The method as claimed in claim 12, wherein the text is annotated to indicate the expression.
  - 21. The method as claimed in claim 12, wherein the device is one of the group of an instant messaging client system, a mobile communication device, or a broadcasting device, all with both audio and text capabilities.
  - 22. The method as claimed in claim 12, wherein an identifier of the source of the audio input is stored in association with the voice dataset and the voice dataset is used in synthesis of text inputs from the same source.
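
The text analysis named in claims 18 to 20 (identifying cues such as punctuation, letter case, acronyms, emotion icons, and key words, then exposing the result as metadata or as annotated text) can be sketched as follows. This is a toy illustration, not the analyzer described in the specification: the cue tables, the expression labels, the `analyze` and `annotate` functions, and the `<expr>` tag syntax are all assumptions made for the example, and paralinguistic elements are left out.

```python
import re

# Hypothetical cue tables; a real system would use far richer lexicons.
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad", ";)": "playful"}
ACRONYMS = {"LOL": "laughing", "OMG": "surprised"}
KEYWORDS = {"sorry": "apologetic", "congratulations": "joyful"}

# Emoticons are tried first, then words, then runs of '!' and '?'.
TOKEN = re.compile(r"[:;]-?[()]|\w+|[!?]+")


def analyze(text):
    """Return metadata records: (start, end, token, cue type, expression)."""
    records = []
    for m in TOKEN.finditer(text):
        tok = m.group()
        if tok in EMOTICONS:
            cue, expr = "emotion icon", EMOTICONS[tok]
        elif tok in ACRONYMS:
            cue, expr = "acronym", ACRONYMS[tok]
        elif tok.lower() in KEYWORDS:
            cue, expr = "key word", KEYWORDS[tok.lower()]
        elif tok.isalpha() and tok.isupper() and len(tok) > 1:
            cue, expr = "letter case", "emphatic"
        elif "!" in tok:
            cue, expr = "punctuation", "excited"
        elif "?" in tok:
            cue, expr = "punctuation", "questioning"
        else:
            continue  # token carries no expression cue in this sketch
        records.append((m.start(), m.end(), tok, cue, expr))
    return records


def annotate(text):
    """Wrap each cue in a made-up inline tag, leaving other text untouched."""
    out, pos = [], 0
    for start, end, tok, _cue, expr in analyze(text):
        out.append(text[pos:start])
        out.append(f"<expr type='{expr}'>{tok}</expr>")
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

Here `analyze` corresponds loosely to the metadata variant (records stored alongside the text elements) and `annotate` to the annotation variant (markers embedded in the text itself); a synthesizer could consume either form.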

23. A computer program product stored on a computer readable storage medium for text-to-speech synthesis, comprising computer readable program code means for performing the steps of:
- receiving an incidental audio input of speech in the form of an audio communication from an input speaker and generating a voice dataset for the input speaker;
- receiving a text input at a same device as the audio input;
- synthesizing the text from the text input to synthesized speech including using the voice dataset to personalize the synthesized speech to sound like the input speaker.

24. A system for text-to-speech synthesis with personalized voice, comprising:
- audio communication means for input of speech from an input speaker and means for generating a voice dataset for an input speaker;
- text input means at the same device as the audio communication means;
- a text-to-speech synthesizer for producing synthesized speech including means for converting the synthesized speech to sound like the input speaker.
- Dependent claims (25, 26, 27, 28, 29, 30, 31, 32, 33, 34)
  - 25. The system as claimed in claim 24, including a text expression analyzer; and wherein the text-to-speech synthesizer includes means for adding expression to the synthesized speech.
  - 26. The system as claimed in claim 24, including a video communication means including the audio communication means with an associated visual communication means for visual input of an image of the input speaker, the system also including means for generating an image dataset for an input speaker, wherein the synthesizer provides a synthesized image which looks like the input speaker image.
  - 27. The system as claimed in claim 26, wherein the synthesizer includes means for adding expression to the synthesized image.
  - 28. The system as claimed in claim 24, including a training module for training a concatenative synthetic voice to sound like the input speaker, wherein the training module includes a voice morphing transformation.
  - 29. The system as claimed in claim 24, including:
    - means for storing expression elements from the speech input or image input; and
    - the means for adding expression adds the expression elements to the synthesized speech or synthesized image.
  - 30. The system as claimed in claim 24, wherein the text expression analyzer provides metadata in association with text elements to indicate the expression.
  - 31. The system as claimed in claim 24, wherein the text expression analyzer provides text annotation to indicate the expression.
  - 32. The system as claimed in claim 24, wherein the system is one of the group of: an instant messaging system, a mobile communication device, or a broadcasting device, all with audio and text capabilities.
  - 33. The system as claimed in claim 24, wherein one or more of the group of: the text expression analyzer; the text-to-speech synthesizer, and the training module are provided remotely on a server.
  - 34. The system as claimed in claim 24, including a server that includes means for obtaining the audio input from a device for training and text-to-speech synthesis, and output means for sending the output audio from the server to a device.

35. A method of providing a service to a customer over a network, the service comprising:
- obtaining a received incidental audio input of speech, in the form of an audio communication, from an input speaker and generating a voice dataset for the input speaker;
- receiving a text input from a client;
- synthesizing the text from the text input to synthesized speech including using the voice dataset to personalize the synthesized speech to sound like the input speaker.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Mizrachi, Boaz; Goldberg, Itzhack; Hoory, Ron; Kons, Zvi
Granted Patent: US 8,886,537 B2
US Class Current: 704/260
CPC Class Codes:
G10L 13/00   Speech synthesis; Text to s...
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/04   Details of speech synthesis...
