System and method for generating customized text-to-speech voices

US 8,666,746 B2
Filed: 05/13/2004
Issued: 03/04/2014
Est. Priority Date: 05/13/2004
Status: Active Grant

Abstract

A system and method are disclosed for generating customized text-to-speech voices for a particular application. The method comprises generating a custom text-to-speech voice by selecting a voice for generating a custom text-to-speech voice associated with a domain, collecting text data associated with the domain from a pre-existing text data source and using the collected text data, generating an in-domain inventory of synthesis speech units by selecting speech units appropriate to the domain via a search of a pre-existing inventory of synthesis speech units, or by recording the minimal inventory for a selected level of synthesis quality. The text-to-speech custom voice for the domain is generated utilizing the in-domain inventory of synthesis speech units. Active learning techniques may also be employed to identify problem phrases wherein only a few minutes of recorded data is necessary to deliver a high quality TTS custom voice.
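The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: `toy_front_end` is a hypothetical stand-in that treats adjacent letter pairs as "speech units", and the inventory is assumed to be a plain dictionary from unit label to recorded unit IDs. A real system would derive units (for example, diphones) from a proper linguistic front-end.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def toy_front_end(text: str) -> List[str]:
    """Hypothetical stand-in for a TTS front-end.

    Returns adjacent letter pairs within each word as unit labels,
    e.g. "go" -> ["g-o"]. A real front-end would emit phonetic units.
    """
    units: List[str] = []
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        units.extend(a + "-" + b for a, b in zip(letters, letters[1:]))
    return units


def select_in_domain_units(
    domain_texts: Iterable[str],
    inventory: Dict[str, List[str]],
    front_end: Callable[[str], List[str]] = toy_front_end,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Search a pre-existing inventory for units the domain text needs.

    Returns the in-domain sub-inventory and the sorted list of unit
    labels that the domain text needs but the inventory lacks
    (candidates for additional recording).
    """
    needed: Counter = Counter()
    for text in domain_texts:
        needed.update(front_end(text))
    in_domain = {label: inventory[label] for label in needed if label in inventory}
    missing = sorted(label for label in needed if label not in inventory)
    return in_domain, missing


# Example: a tiny assumed inventory and one line of collected domain text.
inventory = {"g-o": ["u1"], "o-n": ["u2"], "x-y": ["u3"]}
in_domain, missing = select_in_domain_units(["Go onto"], inventory)
```

Units that appear in `missing` correspond to the alternative path in the abstract, where a minimal set of additional material is recorded rather than found in the existing inventory.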

24 Citations

20 Claims

1. A method comprising:
- when a user is determined to be new, changing a front-end which converts text into linguistic tokens and tags the linguistic tokens with a prosody, a pronunciation, a speech act, and an emotion of the text, to yield a user-specific front-end;
- in response to a user request from the user to generate a custom text-to-speech voice for a vehicle, the custom text-to-speech voice being associated with a domain, automatically;
- collecting, via a processor, text data associated with the domain from a pre-existing text data source, to yield collected text data;
- selecting, based on the user specific front end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
- caching, via the processor, the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
- generating, via the processor, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units.
- Dependent claims (2 to 18):
  - 2. The method of claim 1, further comprising determining whether the custom text-to-speech voice conforms to a selected level of synthesis quality.
  - 3. The method of claim 2, further comprising:
    - when the custom text-to-speech voice does not conform to the selected level of synthesis quality, collecting additional text data associated with the domain.
  - 4. The method of claim 3, further comprising iteratively collecting additional text data until the custom text-to-speech voice conforms to the selected level of synthesis quality.
  - 5. The method of claim 1, wherein the pre-existing text data source is one of a domain-related website, e-mail and transcriptions of conversations.
  - 6. The method of claim 1, wherein the pre-existing text data source is a sector-related website.
  - 7. The method of claim 6, further comprising categorizing websites by sector to identify websites as pre-existing text data sources prior to collecting the text data.
  - 8. The method of claim 1, wherein collecting text data associated with the domain from the pre-existing text data source further comprises mining specific words and phrases from the pre-existing text data source.
  - 9. The method of claim 8, wherein mining specific words and phrases from the pre-existing text data source further comprises mining specific words and phrases using an n-gram selection.
  - 10. The method of claim 8, wherein mining specific words and phrases from the pre-existing text data source further comprises mining specific words and phrases using a maximal mutual information approach.
  - 11. The method of claim 1, further comprising manually adding one of relevant words and relevant phrases to the collected text data for use in generating the in-domain inventory of synthesis speech units.
  - 12. The method of claim 1, further comprising applying active learning to identify one of problematic speech units and problematic phrases within the in-domain inventory of synthesis speech units.
  - 13. The method of claim 12, further comprising:
    - recording one of words and phrases according to the one of problematic speech units and problematic phrases; and
    - integrating the one of words and phrases into the in-domain inventory of synthesis speech units.
  - 14. The method of claim 13, further comprising:
    - determining whether the custom text-to-speech voice conforms to a selected synthesis quality.
  - 15. The method of claim 14, further comprising:
    - when the custom text-to-speech voice does not conform to the selected level of synthesis quality, collecting additional text data associated with the domain.
  - 16. The method of claim 14, wherein when the custom text-to-speech voice does not conform to the selected synthesis quality, recording one of additional words and additional phrases to increase the in-domain inventory of synthesis speech units.
  - 17. The method of claim 12, further comprising:
    - determining a minimal in-domain inventory for recording to meet a selected custom voice synthesis quality;
    - based on the minimal in-domain inventory, recording one of words and phrases according to the one of problematic speech units and problematic phrases; and
    - integrating the one of words and phrases into the in-domain task-independent inventory of synthesis speech units.
  - 18. The method of claim 13, further comprising:
    - determining whether the custom text-to-speech voice conforms to a custom voice synthesis quality; and
    - when the custom text-to-speech voice does not conform to the custom voice synthesis quality, recording one of additional words and additional phrases to increase the in-domain inventory of speech units.
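Two of the dependent claims lend themselves to a short illustration: mining phrases with an n-gram selection (claim 9) and iteratively collecting additional text until a selected quality level is met (claims 2 to 4). The sketch below is an assumption-laden example, not the claimed method: whitespace tokenization, the `min_count` threshold, and the caller-supplied `quality_fn` are all hypothetical choices, since the claims do not specify how quality is measured.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def mine_ngrams(
    texts: Iterable[str], n: int = 2, min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return word n-grams seen at least `min_count` times, most frequent first."""
    counts: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive tokenization (assumption)
        counts.update(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return [(" ".join(gram), c) for gram, c in counts.most_common() if c >= min_count]


def collect_until_quality(
    batches: Iterable[List[str]],
    quality_fn: Callable[[List[str]], float],
    target: float,
) -> Tuple[List[str], bool]:
    """Add text batch by batch until `quality_fn` reaches `target`.

    `quality_fn` is a hypothetical scoring function supplied by the caller.
    Returns the collected text and whether the target was reached.
    """
    collected: List[str] = []
    for batch in batches:
        collected.extend(batch)
        if quality_fn(collected) >= target:
            return collected, True
    return collected, False


# Example: frequent bigrams in a few navigation-style sentences.
phrases = mine_ngrams(
    ["turn left now", "turn left at the light", "turn right"], n=2, min_count=2
)

# Example: stop collecting once a placeholder quality score reaches 0.6.
collected, reached = collect_until_quality(
    [["a"], ["b"], ["c"]], quality_fn=lambda c: len(c) / 3, target=0.6
)
```

In practice the quality score might come from listening tests or an objective synthesis metric. The loop only shows the control flow of "collect more until the voice conforms".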

19. A system comprising:
- a processor; and
- a computer readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising;
  - when a user is determined to be new, changing a front-end which converts text into linguistic tokens and tags the linguistic tokens with a prosody, a pronunciation, a speech act, and an emotion of the text, to yield a user-specific front-end;
  - in response to a user request from the user to generate a custom text-to-speech voice for a vehicle, the custom text-to-speech voice being associated with a domain, automatically;
  - collecting, via a processor, text data associated with the domain from a pre-existing text data source, to yield collected text data;
  - selecting, based on the user specific front end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
  - caching, via the processor, the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
  - generating, via the processor, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units.

20. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
- when a user is determined to be new, changing a front-end which converts text into linguistic tokens and tags the linguistic tokens with a prosody, a pronunciation, a speech act, and an emotion of the text, to yield a user-specific front-end;
- in response to a user request from the user to generate a custom text-to-speech voice for a vehicle, the custom text-to-speech voice being associated with a domain, automatically;
- collecting, via the computing device, text data associated with the domain from a pre-existing text data source, to yield collected text data;
- selecting, based on the user specific front end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
- caching, via the computing device, the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
- generating, via the computing device, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units.
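The independent claims all recite a front-end that converts text into linguistic tokens tagged with prosody, pronunciation, speech act, and emotion, and that is changed for a new user. One way to picture that data flow is sketched below. Every name here (`LinguisticToken`, `FrontEnd`, `front_end_for_user`) and every tagging rule is a hypothetical placeholder chosen for illustration. The claims do not say how tags are computed or how the front-end is changed, so the sketch assumes the change is a merge of user-specific pronunciations into a lexicon.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Set


@dataclass(frozen=True)
class LinguisticToken:
    text: str
    prosody: str
    pronunciation: str
    speech_act: str
    emotion: str


@dataclass(frozen=True)
class FrontEnd:
    # Word -> pronunciation overrides; unknown words fall back to their spelling.
    lexicon: Dict[str, str] = field(default_factory=dict)
    default_emotion: str = "neutral"

    def tag(self, text: str) -> List[LinguisticToken]:
        """Tokenize on whitespace and attach placeholder tags (toy rules)."""
        speech_act = "question" if text.strip().endswith("?") else "statement"
        words = [w.strip(".,?!").lower() for w in text.split()]
        words = [w for w in words if w]
        tokens: List[LinguisticToken] = []
        for i, word in enumerate(words):
            prosody = "phrase-final" if i == len(words) - 1 else "phrase-internal"
            tokens.append(
                LinguisticToken(
                    text=word,
                    prosody=prosody,
                    pronunciation=self.lexicon.get(word, word),
                    speech_act=speech_act,
                    emotion=self.default_emotion,
                )
            )
        return tokens


def front_end_for_user(
    base: FrontEnd, known_users: Set[str], user_id: str, user_lexicon: Dict[str, str]
) -> FrontEnd:
    """Return the base front-end for known users, a customized copy for new ones."""
    if user_id in known_users:
        return base
    return replace(base, lexicon={**base.lexicon, **user_lexicon})


# Example: a new user gets a front-end with an extra pronunciation entry.
base = FrontEnd(lexicon={"nav": "n ae v"})
custom = front_end_for_user(base, {"alice"}, "bob", {"i-95": "ay n ay n t iy f ay v"})
tokens = custom.tag("Take I-95?")
```

The customized front-end leaves the base instance untouched, so a known user continues to receive the shared default while each new user gets an independent copy.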

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: AT&T Intellectual Property II LP (AT&T, Inc.)
Inventors: Syrdal, Ann K.; Bangalore, Srinivas; Feng, Junlan; Rahim, Mazin G.; Schroeter, Juergen; Schulz, David Eugene
Primary Examiner(s): Neway, Samuel G
Application Number: US10/845,364
Publication Number: US 20050256716A1
Time in Patent Office: 3,582 Days
Field of Search: 704/258-269
US Class Current: 704/258
CPC Class Codes:
G10L 13/00   Speech synthesis; Text to s...
G10L 13/02   Methods for producing synth...
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/06   Elementary speech units use...
G10L 13/08   Text analysis or generation...
G10L 15/197   Probabilistic grammars, e.g...
