Personalized text-to-speech synthesis and personalized speech feature extraction

US 8,655,659 B2
Filed: 08/12/2010
Issued: 02/18/2014
Est. Priority Date: 01/05/2010
Status: Expired due to Fees

Abstract

A personalized text-to-speech synthesizing device includes: a personalized speech feature library creator, configured to recognize personalized speech features of a specific speaker by comparing a random speech fragment of the specific speaker with preset keywords, thereby to create a personalized speech feature library associated with the specific speaker, and store the personalized speech feature library in association with the specific speaker; and a text-to-speech synthesizer, configured to perform a speech synthesis of a text message from the specific speaker, based on the personalized speech feature library associated with the specific speaker and created by the personalized speech feature library creator, thereby to generate and output a speech fragment having pronunciation characteristics of the specific speaker. A personalized speech feature library of a specific speaker is established without a deliberate training process, and a text is synthesized into personalized speech with the speech characteristics of the speaker.

37 Claims

1. A personalized text-to-speech synthesizing device, comprising:
- a processor;
  
  a memory;
  
  a personalized speech feature library creator, configured to recognize personalized speech features of a specific speaker by recognizing whether a keyword from preset keywords associated with the specific speaker occurs in a random speech fragment of the specific speaker that includes multiple words including the keyword and speech in addition to the keyword, the random speech fragment being part of a multiple speaker conversation including the speaker, and, if the keyword is found in the random speech fragment, recognizing the personalized speech features of the specific speaker based on a comparison of a standard speech of the keyword and the speech of the keyword by the specific speaker in the random speech fragment, thereby to create a personalized speech feature library associated with the specific speaker, and store the personalized speech feature library in association with the specific speaker; and
  
  a text-to-speech synthesizer, configured to perform a speech synthesis of a text message from the specific speaker, based on the personalized speech feature library associated with the specific speaker and created by the personalized speech feature library creator, thereby to generate and output a speech fragment having pronunciation characteristics of the specific speaker.
- - 2. The personalized text-to-speech synthesizing device according to claim 1, wherein the personalized speech feature library creator comprises:
    - a keyword setting unit, configured to set one or more keywords suitable for reflecting the pronunciation characteristics of the specific speaker with respect to a specific language, and store the set keywords in association with the specific speaker;
      
      a speech feature recognition unit, configured to recognize the speech features of the specific speaker according to a standard pronunciation of the recognized keyword and the pronunciation of the specific speaker; and
      
      a speech feature filtration unit, configured to filter out abnormal speech features through statistical analysis while retaining speech features reflecting the normal pronunciation characteristics of the specific speaker, when the speech features of the specific speaker recognized by the speech feature recognition unit reach a predetermined number, thereby to create the personalized speech feature library associated with the specific speaker, and store the personalized speech feature library in association with the specific speaker.
  - 3. The personalized text-to-speech synthesizing device according to claim 2, wherein the keyword setting unit is further configured to set keywords suitable for reflecting the pronunciation characteristics of the specific speaker with respect to a plurality of specific languages.
  - 4. The personalized text-to-speech synthesizing device according to claim 2, wherein the speech feature recognition unit is further configured to recognize whether the keyword occurs in the speech fragment of the specific speaker by comparing the speech fragment of the specific speaker with the standard pronunciation of the keyword in terms of their respective speech frequency spectrums, which are derived by performing a time-domain to frequency-domain transform to the respective speech data in time domain.
  - 5. The personalized text-to-speech synthesizing device according to claim 1, wherein the personalized speech feature library creator is further configured to update the personalized speech feature library associated with the specific speaker when a new speech fragment of the specific speaker is received.
  - 6. The personalized text-to-speech synthesizing device according to claim 2, wherein parameters representing the speech features include frequency, volume, rhythm and end sound.
  - 7. The personalized text-to-speech synthesizing device according to claim 6, wherein the speech feature filtration unit is further configured to filter speech features with respect to the parameters representing the respective speech features.
  - 8. The personalized text-to-speech synthesizing device according to claim 1, wherein the keyword is a monosyllable high frequency word.
  - 18. A communication terminal capable of text transmission and speech session, wherein a number of the communication terminals are connected to each other through a wireless communication network or a wired communication network, so that a text transmission or speech session can be carried out therebetween, wherein the communication terminal comprises a text transmission synthesizing device, a speech session device and the personalized text-to-speech synthesizing device according to claim 1.
  - 19. The communication terminal according to claim 18, further comprising:
    - a speech feature recognition trigger device, configured to trigger the personalized text-to-speech synthesizing device to perform a personalized speech feature recognition of speech fragment of any or both speakers in a speech session, when the communication terminal is used for the speech session, thereby to create and store a personalized speech feature library associated with the any or both speakers in the speech session; and
      
      a text-to-speech trigger synthesis device, configured to enquire whether any personalized speech feature library associated with a subscriber transmitting a text message or a subscriber from whom a text message is received is included in the communication terminal when the communication terminal is used for transmitting or receiving text messages, and trigger the personalized text-to-speech synthesizing device to synthesize the text messages to be transmitted or having been received into a speech fragment when the enquiry result is affirmative, and transmit the speech fragment to the counterpart or display to the local subscriber at the communication terminal.
  - 20. The communication terminal according to claim 18, wherein the communication terminal is a mobile phone.
  - 21. The communication terminal according to claim 18, wherein the communication terminal is a computer client.
  - 22. A communication system capable of text transmission and speech session, comprising a controlling device, and a plurality of communication terminals capable of text transmission and speech session via the controlling device, wherein the controlling device is provided with the personalized text-to-speech synthesizing device according to claim 1.
  - 23. The communication system according to claim 22, wherein the controlling device further comprises:
    - a speech feature recognition trigger device, configured to trigger the personalized text-to-speech synthesizing device to perform a personalized speech feature recognition of speech fragments of speakers in a speech session, when two or more of the plurality of communication terminals are used for the speech session via the controlling device, thereby to create and store personalized speech feature libraries associated with respective speakers in the speech session respectively; and
      
      a text-to-speech trigger synthesis device configured to enquire whether any personalized speech feature library associated with a subscriber transmitting a text message occurs in the controlling device when the controlling device receives the text messages transmitted by any of the plurality of communication terminals to another communication terminal, trigger the personalized text-to-speech synthesizing device to synthesize the text messages having been received into a speech fragment when the enquiry result is affirmative, and transfer the speech fragment to the another communication terminal.
  - 24. The communication system according to claim 22, wherein the controlling device is a wireless network controller, the communication terminal is a mobile phone, and the wireless network controller and the mobile phone are connected to each other through a wireless communication network.
  - 25. The communication system according to claim 22, wherein the controlling device is a server, the communication terminal is a computer client, and the server and the computer client are connected to each other through Internet.
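
The trigger behaviour recited in claims 22 and 23 can be pictured with a minimal Python sketch. It is illustrative only: the class `ControllingDevice`, its method names, the dictionary-shaped library, and the `fake_synthesize` stand-in are all hypothetical, and returning `None` when no library exists is an assumption, since the claims only state what happens when the enquiry result is affirmative.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ControllingDevice:
    """Hypothetical relay sitting between communication terminals."""
    # Hypothetical synthesizer callback: (text, feature_library) -> audio bytes.
    synthesize: Callable[[str, dict], bytes]
    # Feature libraries keyed by subscriber id, filled during speech sessions.
    libraries: Dict[str, dict] = field(default_factory=dict)

    def on_speech_session(self, speaker_id: str, features: dict) -> None:
        """Recognition trigger: store features gathered during a session."""
        self.libraries.setdefault(speaker_id, {}).update(features)

    def on_text_message(self, sender_id: str, text: str) -> Optional[bytes]:
        """Synthesis trigger: synthesize only if the sender has a library."""
        library = self.libraries.get(sender_id)
        if library is None:
            return None  # assumption: caller forwards the plain text instead
        return self.synthesize(text, library)


def fake_synthesize(text: str, library: dict) -> bytes:
    """Stand-in for a real synthesizer; it just tags the text."""
    return f"{text}|frequency={library['frequency']}".encode()


device = ControllingDevice(synthesize=fake_synthesize)
before = device.on_text_message("alice", "hi")       # no library yet
device.on_speech_session("alice", {"frequency": 1.1})
after = device.on_text_message("alice", "hi")        # library now exists
```

Keeping the synthesizer behind a callback mirrors the claim structure, in which the trigger devices are separate from the synthesizing device they invoke.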

9. A personalized text-to-speech synthesizing method, comprising:
- presetting one or more keywords with respect to a specific language;
  
  receiving a random speech fragment of a specific speaker that includes multiple words including a keyword from the preset one or more keywords and speech in addition to the keyword, wherein the random speech fragment is part of a multiple speaker conversation including the speaker;
  
  recognizing personalized speech features of the specific speaker by recognizing whether the keyword is found in the random speech fragment of the specific speaker, and, if the keyword is found in the random speech fragment, recognizing the personalized speech features of the specific speaker based on a comparison of a standard speech of the keyword and the speech of the keyword by the specific speaker in the random speech fragment, thereby creating a personalized speech feature library associated with the specific speaker, and storing in a memory the personalized speech feature library in association with the specific speaker; and
  
  performing a speech synthesis of a text message from the specific speaker, based on the personalized speech feature library associated with the specific speaker, thereby generating and outputting a speech fragment having pronunciation characteristics of the specific speaker.
- - 10. The personalized text-to-speech synthesizing method according to claim 9, wherein the keywords are suitable for reflecting the pronunciation characteristics of the specific speaker and stored in association with the specific speaker.
  - 11. The personalized text-to-speech synthesizing method according to claim 10, wherein creating the personalized speech feature library associated with the specific speaker comprises:
    - recognizing the speech features of the speaker according to a standard pronunciation of the recognized keyword and the pronunciation of the specific speaker; and
      
      filtering out abnormal speech features through statistical analysis while retaining speech features reflecting the normal pronunciation characteristics of the specific speaker, when the recognized speech features of the specific speaker reach a predetermined number, thereby creating the personalized speech feature library associated with the specific speaker, and storing the personalized speech feature library in association with the specific speaker.
  - 12. The personalized text-to-speech synthesizing method according to claim 11, wherein keywords suitable for reflecting the pronunciation characteristics of the specific speaker are set with respect to a plurality of specific languages.
  - 13. The personalized text-to-speech synthesizing method according to claim 11, wherein recognizing whether the keyword occurs in the speech fragment of the specific speaker is performed by comparing the speech fragment of the specific speaker with the standard pronunciation of the keyword in terms of their respective speech spectrums, which are derived by performing a time-domain to frequency-domain transform to the respective speech data in time domain.
  - 14. The personalized text-to-speech synthesizing method according to claim 9, wherein creating the personalized speech feature library comprises updating the personalized speech feature library associated with the specific speaker when a new speech fragment of the specific speaker is received.
  - 15. The personalized text-to-speech synthesizing method according to claim 11, wherein parameters representing the speech features include frequency, volume, rhythm and end sound.
  - 16. The personalized text-to-speech synthesizing method according to claim 15, wherein the speech features are filtered with respect to the parameters representing the respective speech features.
  - 17. The personalized text-to-speech synthesizing method according to claim 9, wherein the keyword is a monosyllable high frequency word.
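
Claim 13 describes spotting a keyword by comparing frequency spectra obtained from a time-domain to frequency-domain transform. The sketch below shows one way this could look, assuming a naive discrete Fourier transform, a sliding window the size of the keyword template, and cosine similarity as the comparison measure. None of these choices is specified by the claims, and the function names, the `threshold` value, and the synthetic `tone` helper are hypothetical.

```python
import cmath
import math
from typing import List, Optional, Sequence


def magnitude_spectrum(samples: Sequence[float]) -> List[float]:
    """Naive DFT magnitudes for bins 0..N/2 (time domain to frequency domain)."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two spectra; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_keyword(fragment: Sequence[float], template: Sequence[float],
                 threshold: float = 0.9) -> Optional[int]:
    """Return the start index of the best window at or above threshold, else None."""
    target = magnitude_spectrum(template)
    width = len(template)
    step = max(1, width // 2)  # half-overlapping windows
    best_index, best_score = None, threshold
    for start in range(0, len(fragment) - width + 1, step):
        window = fragment[start:start + width]
        score = cosine_similarity(magnitude_spectrum(window), target)
        if score >= best_score:
            best_index, best_score = start, score
    return best_index


def tone(cycles: int, n: int = 32) -> List[float]:
    """Synthetic stand-in for speech: a sine with `cycles` periods in n samples."""
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]


# The "keyword" (tone(7)) sits between two stretches of other "speech".
position = find_keyword(tone(3) + tone(7) + tone(3), tone(7))
```

A production system would use an FFT and more robust features; the quadratic DFT is used here only to keep the example dependency-free.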

26. A personalized speech feature extraction device, comprising:
- a processor;
  
  a memory;
  
  a keyword setting unit, configured to set one or more keywords suitable for reflecting the pronunciation characteristics of a specific speaker with respect to a specific language, and store the keywords in association with the specific speaker;
  
  a speech feature recognition unit, configured to recognize whether any keyword associated with the specific speaker occurs in a random speech fragment of the specific speaker that includes multiple words including the keyword and speech in addition to the keyword, the random speech fragment obtained from a multiple speaker conversation including the speaker, and when a keyword associated with the specific speaker is found in the speech fragment of the specific speaker, recognize speech features of the specific speaker according to a standard pronunciation of the recognized keyword and the pronunciation of the speaker;
  
  a speech feature filtration unit, configured to filter out abnormal speech features from the keyword as found in the speech fragment through statistical analysis while retaining speech features reflecting the normal pronunciation characteristics of the specific speaker, when the speech features of the specific speaker recognized by the speech feature recognition unit reach a predetermined number, thereby to create a personalized speech feature library associated with the specific speaker, and store the personalized speech feature library in association with the specific speaker; and
  
  a text-to-speech synthesizer, configured to perform a speech synthesis of a text message from the specific speaker, based on the stored personalized speech feature library associated with the specific speaker.
- - 27. The personalized speech feature extraction device according to claim 26, wherein the keyword setting unit is further configured to set keywords suitable for reflecting the pronunciation characteristics of the specific speaker with respect to a plurality of specific languages.
  - 28. The personalized speech feature extraction device according to claim 26, wherein the speech feature recognition unit is further configured to recognize whether the keyword occurs in the speech fragment of the specific speaker by comparing the speech fragment of the specific speaker with the standard pronunciation of the keyword in terms of their respective speech spectrums, which are derived by performing a time-domain to frequency-domain transform to the respective speech data in time domain.
  - 29. The personalized speech feature extraction device according to claim 26, wherein parameters representing the speech features include frequency, volume, rhythm and end sound.
  - 30. The personalized speech feature extraction device according to claim 29, wherein the speech feature filtration unit is further configured to filter out speech features with respect to the parameters representing the respective speech features.
  - 31. The personalized speech feature extraction device according to claim 26, wherein the keyword is a monosyllable high frequency word.
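
The speech feature filtration unit of claim 26 waits until a predetermined number of features has been collected and then discards abnormal ones through statistical analysis. The claims do not name a statistic, so the following Python sketch assumes a median absolute deviation (MAD) test applied per parameter. `PREDETERMINED_NUMBER`, `MAX_DEVIATIONS`, and both function names are hypothetical.

```python
import statistics
from typing import Dict, List, Optional

PREDETERMINED_NUMBER = 5  # hypothetical: observations required per parameter
MAX_DEVIATIONS = 3.0      # hypothetical cut-off, in median absolute deviations


def filter_parameter(values: List[float]) -> List[float]:
    """Drop values far from the median, keeping the speaker's typical values."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # At least half the values equal the median; treat the rest as abnormal.
        return [v for v in values if v == median]
    return [v for v in values if abs(v - median) / mad <= MAX_DEVIATIONS]


def build_library(observations: Dict[str, List[float]]) -> Optional[Dict[str, float]]:
    """Return per-parameter averages once enough observations exist, else None."""
    if any(len(vals) < PREDETERMINED_NUMBER for vals in observations.values()):
        return None  # keep collecting speech fragments
    return {name: statistics.mean(filter_parameter(vals))
            for name, vals in observations.items()}


# One frequency observation (5.0) is clearly abnormal and gets filtered out.
library = build_library({
    "frequency": [1.0, 1.1, 0.9, 1.0, 5.0],
    "rhythm": [1.0, 1.0, 1.0, 1.0, 1.0],
})
```

A median-based test is chosen here because, with only a handful of observations, a single extreme value inflates the standard deviation enough to hide itself from a mean-based test.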

32. A personalized speech feature extraction method, comprising:
- setting one or more keywords suitable for reflecting the pronunciation characteristics of a specific speaker with respect to a specific language, and storing in a memory the keywords in association with the specific speaker;
  
  recognizing whether any keyword associated with the specific speaker occurs in a random speech fragment of the specific speaker obtained from a multiple speaker conversation including the speaker and that includes multiple words including the keyword and speech in addition to the keyword, and when a keyword associated with the specific speaker is found in the speech fragment of the specific speaker, recognizing speech features of the specific speaker according to a standard pronunciation of the recognized keyword and the pronunciation of the speaker; and
  
  filtering out abnormal speech features from the keyword as found in the speech fragment through statistical analysis while retaining speech features reflecting the normal pronunciation characteristics of the specific speaker, when the speech features of the specific speaker recognized by the speech feature recognition unit reach a predetermined number, thereby creating a personalized speech feature library associated with the specific speaker, and storing the personalized speech feature library in association with the specific speaker; and
  
  performing a speech synthesis of a text message from the specific speaker based on the stored personalized speech feature library associated with the specific speaker.
- - 33. The personalized speech feature extraction method according to claim 32, wherein the setting comprises:
    - setting keywords suitable for reflecting the pronunciation characteristics of the specific speaker with respect to a plurality of specific languages.
  - 34. The personalized speech feature extraction method according to claim 32, wherein the recognizing comprises:
    - recognizing whether the keyword occurs in the speech fragment of the specific speaker by comparing the speech fragment of the specific speaker with the standard pronunciation of the keyword in terms of their respective speech spectrum.
  - 35. The personalized speech feature extraction method according to claim 32, wherein parameters representing the speech features include frequency, volume, rhythm and end sound.
  - 36. The personalized speech feature extraction method according to claim 35, wherein the filtering comprises:
    - filtering out speech features with respect to the parameters representing the respective speech features.
  - 37. The personalized speech feature extraction method according to claim 32, wherein the keyword is a monosyllable high frequency word.
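
Claims 32 and 35 recognize speech features by comparing the speaker's pronunciation of a keyword with a standard pronunciation, using frequency, volume, rhythm and end sound as parameters. How each parameter is measured is not stated in the claims. The sketch below therefore assumes simple scalar measurements (pitch in Hz, loudness in dB, duration in seconds, and final pitch as a stand-in for end sound) and stores each feature relative to the standard. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class KeywordMeasurement:
    """Hypothetical scalar measurements of one utterance of a keyword."""
    pitch_hz: float        # stands in for "frequency"
    loudness_db: float     # stands in for "volume"
    duration_s: float      # stands in for "rhythm"
    final_pitch_hz: float  # stands in for "end sound"


def relative_features(standard: KeywordMeasurement,
                      speaker: KeywordMeasurement) -> Dict[str, float]:
    """Express the speaker's pronunciation relative to the standard one."""
    return {
        "frequency": speaker.pitch_hz / standard.pitch_hz,
        "volume": speaker.loudness_db - standard.loudness_db,  # dB offset
        "rhythm": speaker.duration_s / standard.duration_s,
        "end_sound": speaker.final_pitch_hz / standard.final_pitch_hz,
    }


def personalize(baseline: KeywordMeasurement,
                library: Dict[str, float]) -> KeywordMeasurement:
    """Apply stored relative features to a baseline synthesis setting."""
    return KeywordMeasurement(
        pitch_hz=baseline.pitch_hz * library["frequency"],
        loudness_db=baseline.loudness_db + library["volume"],
        duration_s=baseline.duration_s * library["rhythm"],
        final_pitch_hz=baseline.final_pitch_hz * library["end_sound"],
    )


standard = KeywordMeasurement(200.0, 60.0, 0.4, 180.0)
speaker = KeywordMeasurement(220.0, 63.0, 0.5, 171.0)
features = relative_features(standard, speaker)
restored = personalize(standard, features)  # approximately equals `speaker`
```

Storing ratios and offsets rather than absolute values lets the same library be applied to any word the synthesizer renders, not only to the keywords that were observed.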

Current Assignee: Sony Corporation (Sony Group Corp.), Sony Mobile Communications AB (Sony Group Corp.)
Original Assignee: Sony Corporation (Sony Group Corp.), Sony Mobile Communications AB (Sony Group Corp.)
Inventors: WANG, Qingfang; HE, Shouchun
Primary Examiner(s): Shah, Paras D
Application Number: US12/855,119
Publication Number: US 20110165912A1
Time in Patent Office: 1,286 Days
Field of Search: 704/231-235, 704/251, 704/258, 704/260, 704/266, 704/268, 704/270, 704/275
US Class Current: 704/258
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 2015/088 Word spotting
