SYSTEM AND METHOD FOR GENERATING CUSTOMIZED TEXT-TO-SPEECH VOICES

US 20170330554A1
Filed: 07/31/2017
Published: 11/16/2017
Est. Priority Date: 05/13/2004
Status: Active Grant

Abstract

A system and method are disclosed for generating customized text-to-speech voices for a particular application. The method comprises generating a custom text-to-speech voice by selecting a voice for generating a custom text-to-speech voice associated with a domain, collecting text data associated with the domain from a pre-existing text data source and using the collected text data, generating an in-domain inventory of synthesis speech units by selecting speech units appropriate to the domain via a search of a pre-existing inventory of synthesis speech units, or by recording the minimal inventory for a selected level of synthesis quality. The text-to-speech custom voice for the domain is generated utilizing the in-domain inventory of synthesis speech units. Active learning techniques may also be employed to identify problem phrases wherein only a few minutes of recorded data is necessary to deliver a high quality TTS custom voice.
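
As a rough illustration of the selection step described in the abstract, the sketch below picks from a pre-existing inventory the diphones needed to speak collected domain text, and reports the ones that are missing (candidates for new recording). Everything in it is an assumption made for illustration: the toy pronunciation lexicon, the inventory layout (diphone label mapped to recording IDs), the word-level silence padding, and the function names are hypothetical and are not taken from the patent.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical toy lexicon: word -> phoneme sequence (illustrative only).
LEXICON: Dict[str, List[str]] = {
    "check": ["ch", "eh", "k"],
    "your": ["y", "ao", "r"],
    "balance": ["b", "ae", "l", "ah", "n", "s"],
}


def to_diphones(phonemes: List[str]) -> List[str]:
    """Label each adjacent phoneme pair, padding the word with silence."""
    padded = ["sil"] + phonemes + ["sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def needed_diphones(texts: Iterable[str]) -> Set[str]:
    """Collect the diphones required to speak the domain text."""
    needed: Set[str] = set()
    for text in texts:
        for word in text.lower().split():
            phonemes = LEXICON.get(word)
            if phonemes is None:
                continue  # words outside the toy lexicon are skipped here
            needed.update(to_diphones(phonemes))
    return needed


def build_in_domain_inventory(
    texts: Iterable[str], inventory: Dict[str, List[str]]
) -> Tuple[Dict[str, List[str]], Set[str]]:
    """Return (units selected from the existing inventory, units not found)."""
    needed = needed_diphones(texts)
    selected = {d: inventory[d] for d in needed if d in inventory}
    missing = needed - set(selected)
    return selected, missing
```

Called with a small inventory that only covers the word "check", the function would return those four units as the in-domain inventory and list the diphones of "your" and "balance" as missing.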

20 Claims

1. A method comprising:
- collecting, at a first time, text data from a pre-existing text data source, to yield collected text data, wherein the collected text data is associated with a website, wherein the pre-existing text data source exists at the first time, and wherein no website-related inventory of speech units exists at the first time;
- selecting synthesis speech units specific to the website from a pre-existing inventory of synthesis speech units existing at the first time, wherein the selecting occurs using the collected text data, to yield selected synthesis speech units, wherein the synthesis speech units comprise one or more of phonemes, diphones, triphones and syllables;
- generating an in-domain inventory of synthesis speech units based on the selected synthesis speech units; and
- generating, via a processor and at a second time which is later than the first time, a custom text-to-speech voice for use with the website utilizing the in-domain inventory of synthesis speech units.
- Dependent claims (2 to 11):
  - 2. The method of claim 1, further comprising:
    - caching the selected synthesis speech units to generate the in-domain inventory of synthesis speech units.
  - 3. The method of claim 1, further comprising determining whether the custom text-to-speech voice conforms to a selected level of synthesis quality.
  - 4. The method of claim 3, further comprising:
    - when the custom text-to-speech voice does not conform to the selected level of synthesis quality, collecting additional text data associated with a domain.
  - 5. The method of claim 4, further comprising iteratively collecting the additional text data until the custom text-to-speech voice conforms to the selected level of synthesis quality.
  - 6. The method of claim 1, wherein the pre-existing text data source is one of a domain-related website, e-mail, and transcriptions of conversations.
  - 7. The method of claim 1, wherein the pre-existing text data source is a sector-related website distinct from the website.
  - 8. The method of claim 7, further comprising categorizing websites by sector to identify websites as pre-existing text data sources prior to collecting the text data.
  - 9. The method of claim 1, wherein collecting the text data from the pre-existing text data source further comprises mining specific words and phrases from the pre-existing text data source.
  - 10. The method of claim 9, wherein mining the specific words and phrases from the pre-existing text data source further comprises mining the specific words and phrases using an n-gram selection.
  - 11. The method of claim 9, wherein mining the specific words and phrases from the pre-existing text data source further comprises mining the specific words and phrases using a maximal mutual information approach.
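
Claims 3 to 5 and 9 to 10 describe mining words and phrases from the text source and collecting more text until the voice reaches a selected level of synthesis quality. The loop can be sketched as follows, under loudly stated assumptions: plain n-gram frequency counting stands in for "n-gram selection", and the share of held-out domain n-grams already covered by the mined set stands in for "synthesis quality". The claims do not define either term this way, and `mine_ngrams`, `coverage`, and `collect_until_covered` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> List[NGram]:
    """Split text on whitespace and return its word n-grams."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mine_ngrams(texts: Iterable[str], n: int = 2, min_count: int = 1) -> Set[NGram]:
    """Keep the n-grams seen at least min_count times in the collected text."""
    counts = Counter(g for t in texts for g in ngrams(t, n))
    return {g for g, c in counts.items() if c >= min_count}


def coverage(mined: Set[NGram], held_out: Iterable[str], n: int = 2) -> float:
    """Proxy quality score: fraction of held-out n-grams present in mined."""
    targets = [g for t in held_out for g in ngrams(t, n)]
    if not targets:
        return 1.0
    return sum(g in mined for g in targets) / len(targets)


def collect_until_covered(
    batches: Iterable[List[str]],
    held_out: List[str],
    threshold: float,
    n: int = 2,
) -> Tuple[Set[NGram], float, int]:
    """Add text batch by batch until the proxy score reaches the threshold."""
    collected: List[str] = []
    mined: Set[NGram] = set()
    score = coverage(mined, held_out, n)
    for batch in batches:
        if score >= threshold:
            break  # selected quality level reached; stop collecting
        collected.extend(batch)
        mined = mine_ngrams(collected, n)
        score = coverage(mined, held_out, n)
    return mined, score, len(collected)
```

In a real system the stopping test would come from evaluating the synthesized voice itself; the coverage score is used here only because it can be computed from text alone.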

12. A system comprising:
- a processor; and
- a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
  - collecting, at a first time, text data from a pre-existing text data source, to yield collected text data, wherein the collected text data is associated with a website, wherein the pre-existing text data source exists at the first time, and wherein no website-related inventory of speech units exists at the first time;
  - selecting synthesis speech units specific to the website from a pre-existing inventory of synthesis speech units existing at the first time, wherein the selecting occurs using the collected text data, to yield selected synthesis speech units, wherein the synthesis speech units comprise one or more of phonemes, diphones, triphones and syllables;
  - generating an in-domain inventory of synthesis speech units based on the selected synthesis speech units; and
  - generating, at a second time which is later than the first time, a custom text-to-speech voice for use with the website utilizing the in-domain inventory of synthesis speech units.
- Dependent claims (13 to 19):
  - 13. The system of claim 12, wherein the computer-readable storage medium stores additional instructions stored which, when executed by the processor, cause the processor to perform operations further comprising:
    - caching the selected synthesis speech units to generate the in-domain inventory of synthesis speech units.
  - 14. The system of claim 12, wherein the computer-readable storage medium stores additional instructions stored which, when executed by the processor, cause the processor to perform operations further comprising:
    - determining whether the custom text-to-speech voice conforms to a selected level of synthesis quality.
  - 15. The system of claim 14, wherein the computer-readable storage medium stores additional instructions stored which, when executed by the processor, cause the processor to perform operations further comprising:
    - when the custom text-to-speech voice does not conform to the selected level of synthesis quality, collecting additional text data associated with a domain.
  - 16. The system of claim 15, wherein the computer-readable storage medium stores additional instructions stored which, when executed by the processor, cause the processor to perform operations further comprising:
    - iteratively collecting the additional text data until the custom text-to-speech voice conforms to the selected level of synthesis quality.
  - 17. The system of claim 12, wherein the pre-existing text data source is one of a domain-related website, e-mail, and transcriptions of conversations.
  - 18. The system of claim 12, wherein the pre-existing text data source is a sector-related website distinct from the website.
  - 19. The system of claim 18, wherein the computer-readable storage medium stores additional instructions stored which, when executed by the processor, cause the processor to perform operations further comprising:
    - categorizing websites by sector to identify websites as pre-existing text data sources prior to collecting the text data.

20. A computer-readable storage device having instructions stored which, when executed by a processor, cause the processor to perform operations comprising:
- collecting, at a first time, text data from a pre-existing text data source, to yield collected text data, wherein the collected text data is associated with a website, wherein the pre-existing text data source exists at the first time, and wherein no website-related inventory of speech units exists at the first time;
- selecting synthesis speech units specific to the website from a pre-existing inventory of synthesis speech units existing at the first time, wherein the selecting occurs using the collected text data, to yield selected synthesis speech units, wherein the synthesis speech units comprise one or more of phonemes, diphones, triphones and syllables;
- generating an in-domain inventory of synthesis speech units based on the selected synthesis speech units; and
- generating, at a second time which is later than the first time, a custom text-to-speech voice for use with the website utilizing the in-domain inventory of synthesis speech units.

Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
BANGALORE, Srinivas; FENG, Junlan; GILBERT, Mazin; SCHROETER, Juergen; SYRDAL, Ann K.; SCHULZ, David

Granted Patent

US 10,991,360 B2
CPC Class Codes

G10L 13/00   Speech synthesis; Text to s...

G10L 13/02   Methods for producing synth...

G10L 13/033   Voice editing, e.g. manipul...

G10L 13/06   Elementary speech units use...

G10L 13/08   Text analysis or generation...

G10L 15/197   Probabilistic grammars, e.g...
