System and method for generating customized text-to-speech voices

US 9,240,177 B2
Filed: 03/04/2014
Issued: 01/19/2016
Est. Priority Date: 05/13/2004
Status: Active Grant

Abstract

A system and method are disclosed for generating customized text-to-speech voices for a particular application. The method comprises generating a custom text-to-speech voice by selecting a voice from which to generate the custom text-to-speech voice associated with a domain, collecting text data associated with the domain from a pre-existing text data source, and, using the collected text data, generating an in-domain inventory of synthesis speech units, either by selecting speech units appropriate to the domain via a search of a pre-existing inventory of synthesis speech units or by recording the minimal inventory for a selected level of synthesis quality. The custom text-to-speech voice for the domain is generated utilizing the in-domain inventory of synthesis speech units. Active learning techniques may also be employed to identify problem phrases, so that only a few minutes of recorded data are necessary to deliver a high-quality custom TTS voice.
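The selection step described in the abstract can be pictured as a lookup of domain text against an existing unit inventory. The following is a minimal sketch under stated assumptions, not the patented implementation: the toy front-end treats adjacent letter pairs as stand-ins for synthesis units (a real system would use phonetic units), and the function names, the inventory layout, and the clip identifiers are all hypothetical.

```python
import re

def text_to_units(text):
    """Toy front-end (assumption): lowercase words -> adjacent letter pairs."""
    units = []
    for word in re.findall(r"[a-z]+", text.lower()):
        padded = f"#{word}#"  # '#' marks a word boundary
        units.extend(padded[i:i + 2] for i in range(len(padded) - 1))
    return units

def select_in_domain_inventory(domain_texts, full_inventory):
    """Keep only the inventory entries that the domain text needs.

    full_inventory maps a unit label to a list of recorded instances
    (hypothetical clip identifiers). Returns the in-domain subset and
    a sorted list of needed units that the inventory does not contain.
    """
    needed = set()
    for text in domain_texts:
        needed.update(text_to_units(text))
    in_domain = {u: full_inventory[u] for u in needed if u in full_inventory}
    missing = sorted(needed - set(in_domain))
    return in_domain, missing

# Hypothetical pre-existing inventory; "zz" is never needed by the domain text.
full_inventory = {
    "#p": ["clip_001"], "pa": ["clip_002", "clip_003"], "ay": ["clip_004"],
    "y#": ["clip_005"], "#b": ["clip_006"], "bi": ["clip_007"],
    "il": ["clip_008"], "ll": ["clip_009"], "l#": ["clip_010"],
    "zz": ["clip_011"],
}
in_domain, missing = select_in_domain_inventory(["Pay bills"], full_inventory)
print(sorted(in_domain))
print(missing)
```

In this sketch the units reported as missing are the candidates for new recordings, which loosely mirrors the abstract's alternative of recording a minimal inventory when a search of the existing one is not enough.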

20 Claims

1. A method comprising:
- modifying, based on identification of a user as new, a front-end which converts text into linguistic tokens, to yield a user-specific front-end;
- receiving a user selection of an animated character to guide the user; and
- generating a custom text-to-speech voice by:
  - collecting text data associated with a domain from a pre-existing text data source, to yield collected text data;
  - selecting, based on the user-specific front-end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
  - caching the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
  - generating, via a processor, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units, wherein the animated character will use the custom text-to-speech voice.
- Dependent claims:
  - 2. The method of claim 1, further comprising determining whether the custom text-to-speech voice conforms to a selected level of synthesis quality.
  - 3. The method of claim 2, further comprising:
    - when the custom text-to-speech voice does not conform to the selected level of synthesis quality, collecting additional text data associated with the domain.
  - 4. The method of claim 3, further comprising iteratively collecting additional text data until the custom text-to-speech voice conforms to the selected level of synthesis quality.
  - 5. The method of claim 1, wherein the pre-existing text data source is one of a domain-related website, e-mail and transcriptions of conversations.
  - 6. The method of claim 1, wherein the pre-existing text data source is a sector-related website.
  - 7. The method of claim 6, further comprising categorizing websites by sector to identify websites as pre-existing text data sources prior to collecting the text data.
  - 8. The method of claim 1, wherein collecting text data associated with the domain from the pre-existing text data source further comprises mining specific words and phrases from the pre-existing text data source.
  - 9. The method of claim 8, wherein mining specific words and phrases from the pre-existing text data source further comprises mining specific words and phrases using an n-gram selection.
  - 10. The method of claim 8, wherein mining specific words and phrases from the pre-existing text data source further comprises mining specific words and phrases using a maximal mutual information approach.
  - 11. The method of claim 1, wherein the user-specific front-end is a portion of a speech recognition system specifically configured for a speaking style and a language of the user.
  - 12. The method of claim 1, further comprising applying active learning to identify one of problematic speech units and problematic phrases within the in-domain inventory of synthesis speech units.
  - 13. The method of claim 12, further comprising:
    - recording one of words and phrases according to the one of problematic speech units and problematic phrases; and
      
      integrating the one of words and phrases into the in-domain inventory of synthesis speech units.
  - 14. The method of claim 13, further comprising:
    - determining whether the custom text-to-speech voice conforms to a selected synthesis quality.
  - 15. The method of claim 14, further comprising:
    - when the custom text-to-speech voice does not conform to the selected level of synthesis quality, collecting additional text data associated with the domain.
  - 16. The method of claim 14, wherein when the custom text-to-speech voice does not conform to the selected synthesis quality, recording one of additional words and additional phrases to increase the in-domain inventory of synthesis speech units.
  - 17. The method of claim 12, further comprising:
    - determining a minimal in-domain inventory for recording to meet a selected custom voice synthesis quality;
      
      based on the minimal in-domain inventory, recording one of words and phrases according to the one of problematic speech units and problematic phrases; and
      
      integrating the one of words and phrases into the in-domain inventory of synthesis speech units.
  - 18. The method of claim 13, further comprising:
    - determining whether the custom text-to-speech voice conforms to a custom voice synthesis quality; and
      
      when the custom text-to-speech voice does not conform to the custom voice synthesis quality, recording one of additional words and additional phrases to increase the in-domain inventory of speech units.
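Claims 8 and 9 refer to mining words and phrases with an n-gram selection but do not spell out a procedure. One plausible reading, shown here as a rough sketch rather than the claimed method, is to count word n-grams in the collected text and keep the frequent ones. The function name `mine_ngrams`, the `min_count` threshold, and the sample sentences are assumptions made for illustration.

```python
from collections import Counter
import re

def mine_ngrams(documents, n=2, min_count=2):
    """Return word n-grams seen at least min_count times, most frequent first."""
    counts = Counter()
    for doc in documents:
        # Split on sentence punctuation so n-grams do not span sentences.
        for sentence in re.split(r"[.!?]+", doc.lower()):
            words = re.findall(r"[a-z']+", sentence)
            counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    frequent = [(gram, c) for gram, c in counts.items() if c >= min_count]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))

# Hypothetical domain text, e.g. scraped from a banking website.
docs = [
    "Check your account balance. Your account balance is updated daily.",
    "To check your account, sign in first.",
]
for gram, count in mine_ngrams(docs):
    print(" ".join(gram), count)
```

The phrases that survive the threshold would then drive the search for matching units in the pre-existing inventory. Raising `n` or `min_count` trades coverage for a smaller, more domain-specific phrase list.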

19. A system comprising:
- a processor; and
- a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
  - modifying, based on identification of a user as new, a front-end which converts text into linguistic tokens, to yield a user-specific front-end;
  - receiving a user selection of an animated character to guide the user; and
  - generating a custom text-to-speech voice by:
    - collecting text data associated with a domain from a pre-existing text data source, to yield collected text data;
    - selecting, based on the user-specific front-end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
    - caching the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
    - generating, via a processor, a custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units, wherein the animated character will use the custom text-to-speech voice.

20. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
- modifying, based on identification of a user as new, a front-end which converts text into linguistic tokens, to yield a user-specific front-end;
- receiving a user selection of an animated character to guide the user; and
- generating a custom text-to-speech voice by:
  - collecting text data associated with a domain from a pre-existing text data source, to yield collected text data;
  - selecting, based on the user-specific front-end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
  - caching the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
  - generating, via a processor, a custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units, wherein the animated character will use the custom text-to-speech voice.
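Dependent claims 2 through 4 describe a feedback loop: check whether the voice meets a selected quality level and, if it does not, keep collecting domain text. The claims do not define how quality is measured, so the sketch below substitutes a simple stand-in score (the share of task-prompt words already seen in the collected text). That score, the function names, and the sample data are all assumptions for illustration only.

```python
def coverage(collected_words, task_prompts):
    """Stand-in quality score (assumption): share of prompt words already seen."""
    prompt_words = [w for p in task_prompts for w in p.lower().split()]
    if not prompt_words:
        return 1.0
    return sum(w in collected_words for w in prompt_words) / len(prompt_words)

def collect_until_quality(text_batches, task_prompts, target=0.9):
    """Add one batch of domain text at a time until the score reaches target.

    Returns (score, batches_used). The loop also stops when batches run out,
    so the returned score may be below target.
    """
    collected = set()
    score = coverage(collected, task_prompts)
    used = 0
    for batch in text_batches:
        if score >= target:
            break
        collected.update(batch.lower().split())
        used += 1
        score = coverage(collected, task_prompts)
    return score, used

# Hypothetical travel-domain text batches and a single task prompt.
batches = ["your flight departs", "from gate twelve",
           "boarding starts soon", "have a nice trip"]
prompts = ["your flight departs from gate twelve"]
score, used = collect_until_quality(batches, prompts, target=1.0)
print(score, used)
```

A production system would replace `coverage` with a measure of synthesis quality, such as listening tests or an objective join-cost metric; the loop structure is the only part meant to reflect the claims.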

Current Assignee: Cerence Inc., Cerence Operating Company (Cerence Inc.)
Original Assignee: AT&T Intellectual Property II LP (AT&T, Inc.)
Inventors: Bangalore, Srinivas; Feng, Junlan; Rahim, Mazin G.; Schroeter, Juergen; Syrdal, Ann K.; Schulz, David
Primary Examiner(s): Neway, Samuel G
Application Number: US14/196,578
Publication Number: US 20140188480A1
Time in Patent Office: 686 Days
Field of Search: 704/258-269
US Class Current: 1/1
CPC Class Codes:
G10L 13/00   Speech synthesis; Text to s...
G10L 13/02   Methods for producing synth...
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/06   Elementary speech units use...
G10L 13/08   Text analysis or generation...
G10L 15/197   Probabilistic grammars, e.g...
