Back-end database reorganization for application-specific concatenative text-to-speech systems

US 8,412,528 B2
Filed: 05/02/2006
Issued: 04/02/2013
Est. Priority Date: 06/21/2005
Status: Active Grant

First Claim

1. A method for use in a Concatenative Text-To-Speech (CTTS) system that comprises a speech segment database comprising a plurality of speech segments organized in accordance with a context hierarchy, the method comprising:

evaluating the context hierarchy by using new text and at least one processor, wherein the context hierarchy comprises a plurality of contexts determined using a base text, wherein each of the plurality of speech segments is associated with at least one of the plurality of contexts in the context hierarchy, and wherein the new text is different from the base text, wherein at least a portion of the new text has no corresponding acoustic data used during the evaluation;

updating the context hierarchy, based on results of evaluating the context hierarchy by using the new text, by merging two contexts in the context hierarchy to form a new coarser context and/or by splitting a context in the context hierarchy to form at least two new refined contexts; and

reorganizing the plurality of speech segments in accordance with the updated context hierarchy.

Abstract

The present invention relates to computer-generated text-to-speech conversion. It relates in particular to a method and system for updating a Concatenative Text-To-Speech (CTTS) system with a speech database from a base version to a new version. The present invention performs an application-specific re-organization of a synthesizer's speech database by means of certain decision tree modifications. By that reorganization, certain synthesis units are made available for the new application, which are not available in prior art without a new speech session. This allows the creation of application-specific synthesizers with improved output speech quality for arbitrary domains and applications at very low cost.

25 Claims

1. A method for use in a Concatenative Text-To-Speech (CTTS) system that comprises a speech segment database comprising a plurality of speech segments organized in accordance with a context hierarchy, the method comprising:
- evaluating the context hierarchy by using new text and at least one processor, wherein the context hierarchy comprises a plurality of contexts determined using a base text, wherein each of the plurality of speech segments is associated with at least one of the plurality of contexts in the context hierarchy, and wherein the new text is different from the base text, wherein at least a portion of the new text has no corresponding acoustic data used during the evaluation;
  
  updating the context hierarchy, based on results of evaluating the context hierarchy by using the new text, by merging two contexts in the context hierarchy to form a new coarser context and/or by splitting a context in the context hierarchy to form at least two new refined contexts; and
  
  reorganizing the plurality of speech segments in accordance with the updated context hierarchy.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method of claim 1, wherein the plurality of contexts comprises a first context;
    - wherein the evaluating the context hierarchy comprises determining a value indicative of a number of times the first context is present in the new text; and
      
      wherein the updating the context hierarchy comprises, in response to determining that the value is below a threshold, merging the first context with a second context in the context hierarchy.
  - 3. The method of claim 2, wherein merging the first context with the second context comprises creating a new merged context associated with speech segments associated with the first context and speech segments associated with the second context.
  - 4. The method of claim 1, wherein the plurality of contexts comprises a second context;
    - wherein the evaluating the context hierarchy comprises determining a value indicative of a number of times the second context is present in the new text; and
      
      wherein the updating the context hierarchy comprises, in response to determining that the value is above a threshold, splitting the second context to form the at least two new refined contexts.
  - 5. The method of claim 1, wherein the context hierarchy is a decision tree comprising a plurality of leaf nodes, wherein each leaf node in the plurality of leaf nodes is associated with a context in the plurality of contexts, and wherein updating the context hierarchy comprises changing the structure of the decision tree.
  - 6. The method of claim 1, wherein updating the context hierarchy is performed without using any new speech segments.
  - 7. The method of claim 1, further comprising:
    - selecting, by using the updated context hierarchy, speech segments in the plurality of speech segments for synthesizing speech corresponding to at least one text utterance; and
      
      synthesizing speech corresponding to the at least one text utterance by using the selected speech segments.
  - 8. The method of claim 1, further comprising:
    - analyzing data indicative of quality of output speech synthesized in accordance with the context hierarchy to determine that the evaluating, updating, and reorganizing should be performed, wherein the evaluating, updating, and reorganizing are performed in response to determining that the evaluating, updating, and reorganizing should be performed.
  - 9. The method of claim 8, wherein the data indicative of quality of the output speech comprises data selected from the group consisting of a number of non-contiguous speech segments used to synthesize the output speech, a cost associated with speech segments used to synthesize the output speech, and a number of acoustic and/or prosodic contexts used to synthesize the output speech.
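The count-based merging and splitting recited in claims 2 through 6 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Context` class, the `merge_below` and `split_above` thresholds, and the `refine` callback that stands in for a decision-tree question. A flat list of leaves stands in for the leaf nodes of the decision tree, and pairing adjacent rare leaves is a simplification of merging related contexts. No new speech segments are introduced; existing segments are only regrouped.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Context:
    """A leaf context: a label plus the ids of speech segments filed under it."""
    name: str
    segments: list = field(default_factory=list)


def count_contexts(new_text_contexts):
    """Count how often each context label occurs in the new, text-only data."""
    return Counter(new_text_contexts)


def update_hierarchy(leaves, counts, merge_below, split_above, refine):
    """Return a new leaf list with rare leaves merged and frequent leaves split.

    refine(segment_id) -> sub-label is a hypothetical refinement question.
    """
    rare = [c for c in leaves if counts[c.name] < merge_below]
    kept = [c for c in leaves if counts[c.name] >= merge_below]
    updated = []
    # Merge rare contexts two at a time into one coarser context.
    for i in range(0, len(rare) - 1, 2):
        a, b = rare[i], rare[i + 1]
        updated.append(Context(f"{a.name}|{b.name}", a.segments + b.segments))
    if len(rare) % 2:
        updated.append(rare[-1])  # an unpaired rare context is left unchanged
    for c in kept:
        if counts[c.name] > split_above:
            # Regroup the existing segments of a frequent context by sub-label.
            groups = {}
            for seg in c.segments:
                groups.setdefault(refine(seg), []).append(seg)
            if len(groups) >= 2:
                updated.extend(Context(f"{c.name}/{k}", v) for k, v in groups.items())
                continue
        updated.append(c)
    return updated


leaves = [Context("a", [1, 2]), Context("b", [3]), Context("c", [4, 5, 6, 7])]
counts = count_contexts(["c"] * 10 + ["a"])
new_leaves = update_hierarchy(leaves, counts, merge_below=2, split_above=5,
                              refine=lambda seg: seg % 2)
```

In this toy run, contexts `a` and `b` are rare in the new text and are merged into one coarser context, while the frequent context `c` is split into two refined contexts; the total set of segments is unchanged.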

10. A recording medium storing processor-executable instructions that, when executed by at least one processor, cause the at least one processor to perform a method for use in a Concatenative Text-To-Speech (CTTS) system that comprises a speech segment database comprising a plurality of speech segments organized in accordance with a context hierarchy, the method comprising:
- evaluating the context hierarchy by using new text, wherein the context hierarchy comprises a plurality of contexts determined using a base text, wherein each of the plurality of speech segments is associated with at least one of the plurality of contexts in the context hierarchy, and wherein the new text is different from the base text, wherein at least a portion of the new text has no corresponding acoustic data used during the evaluation;
  
  updating the context hierarchy, based on results of evaluating the context hierarchy by using the new text, by merging two contexts in the context hierarchy to form a new coarser context and/or by splitting a context in the context hierarchy to form at least two new refined contexts; and
  
  reorganizing the plurality of speech segments in accordance with the updated context hierarchy.
- Dependent claims (11, 12, 13, 14, 15, 16)
  - 11. The recording medium of claim 10, wherein the plurality of contexts comprises a first context;
    - wherein the evaluating the context hierarchy comprises determining a value indicative of a number of times the first context is present in the new text; and
      
      wherein the updating the context hierarchy comprises, in response to determining that the value is below a threshold, merging the first context with a second context in the context hierarchy.
  - 12. The recording medium of claim 10, wherein the plurality of contexts comprises a second context;
    - wherein the evaluating the context hierarchy comprises determining a value indicative of a number of times the second context is present in the new text; and
      
      wherein the updating the context hierarchy comprises, in response to determining that the value is above a threshold, splitting the second context to form the at least two new refined contexts.
  - 13. The recording medium of claim 10, wherein the context hierarchy is a decision tree comprising a plurality of leaf nodes, wherein each leaf node in the plurality of leaf nodes is associated with a context in the plurality of contexts, and wherein updating the context hierarchy comprises changing the structure of the decision tree.
  - 14. The recording medium of claim 10, wherein updating the context hierarchy is performed without using any new speech segments.
  - 15. The recording medium of claim 10, wherein the method further comprises:
    - analyzing data indicative of quality of output speech synthesized in accordance with the context hierarchy to determine that the evaluating, updating, and reorganizing should be performed, wherein the evaluating, updating, and reorganizing are performed in response to determining that the evaluating, updating, and reorganizing should be performed.
  - 16. The recording medium of claim 15, wherein the data indicative of quality of the output speech comprises data selected from the group consisting of a number of non-contiguous speech segments used to synthesize the output speech, a cost associated with speech segments used to synthesize the output speech, and a number of acoustic and/or prosodic contexts used to synthesize the output speech.
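The quality-triggered reorganization of claims 15 and 16 can be sketched as a simple check over synthesis statistics. The `SynthesisStats` fields, the ratio-based formulation, and both default thresholds are assumptions chosen for illustration; the claims name only the kinds of data (non-contiguous segment counts, costs, context counts), not how they are compared.

```python
from dataclasses import dataclass


@dataclass
class SynthesisStats:
    """Hypothetical per-application statistics gathered from synthesized output."""
    segments_used: int            # total speech segments concatenated
    non_contiguous_segments: int  # segments not adjacent to their predecessor in the recordings
    total_cost: float             # summed selection cost, in arbitrary units


def needs_reorganization(stats, max_non_contiguous_ratio=0.5, max_avg_cost=1.0):
    """Return True when the quality indicators suggest re-evaluating the hierarchy."""
    if stats.segments_used == 0:
        return False  # nothing synthesized yet, so there is no evidence either way
    ratio = stats.non_contiguous_segments / stats.segments_used
    avg_cost = stats.total_cost / stats.segments_used
    return ratio > max_non_contiguous_ratio or avg_cost > max_avg_cost


fragmented = needs_reorganization(SynthesisStats(10, 8, 2.0))
smooth = needs_reorganization(SynthesisStats(10, 2, 5.0))
```

A caller would run the evaluating, updating, and reorganizing steps only when this check returns true.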

17. A Concatenative Text-To-Speech (CTTS) system comprising:
- at least one memory that stores a speech segment database comprising a plurality of speech segments organized in accordance with a context hierarchy; and
  
  at least one processor, coupled to the at least one memory, that:
  
  evaluates the context hierarchy by using new text, wherein the context hierarchy comprises a plurality of contexts determined using a base text, wherein each of the plurality of speech segments is associated with at least one of the plurality of contexts in the context hierarchy, and wherein the new text is different from the base text, wherein at least a portion of the new text has no corresponding acoustic data used during the evaluation;
  
  updates the context hierarchy, based on results of evaluating the context hierarchy by using the new text, by merging two contexts in the context hierarchy to form a new coarser context and/or by splitting a context in the context hierarchy to form at least two new refined contexts; and
  
  reorganizes the plurality of speech segments in accordance with the updated context hierarchy.
- Dependent claims (18, 19, 20, 21, 22, 23, 24, 25)
  - 18. The CTTS system of claim 17, wherein the plurality of contexts comprises a first context;
    - wherein the at least one processor evaluates the context hierarchy by determining a value indicative of a number of times the first context is present in the new text; and
      
      wherein the at least one processor updates the context hierarchy by, in response to determining that the value is below a threshold, merging the first context with a second context in the context hierarchy.
  - 19. The CTTS system of claim 17, wherein the plurality of contexts comprises a second context;
    - wherein the at least one processor evaluates the context hierarchy by determining a value indicative of a number of times the second context is present in the new text; and
      
      wherein the at least one processor updates the context hierarchy by, in response to determining that the value is above a threshold, splitting the second context to form the at least two new refined contexts.
  - 20. The CTTS system of claim 17, wherein the context hierarchy is a decision tree comprising a plurality of leaf nodes, wherein each leaf node in the plurality of leaf nodes is associated with a context in the plurality of contexts, and wherein the at least one processor updates the context hierarchy by changing the structure of the decision tree.
  - 21. The CTTS system of claim 17, wherein the at least one processor updates the context hierarchy without using any new speech segments.
  - 22. The CTTS system of claim 17, wherein the at least one processor further:
    - selects, by using the updated context hierarchy, speech segments in the plurality of speech segments for synthesizing speech corresponding to at least one text utterance; and
      
      synthesizes speech corresponding to the at least one text utterance by using the selected speech segments.
  - 23. The CTTS system of claim 17, wherein the at least one processor further:
    - analyzes data indicative of quality of output speech synthesized in accordance with the context hierarchy to determine that the evaluating, updating, and reorganizing should be performed, wherein the at least one processor performs the evaluating, updating, and reorganizing in response to determining that the evaluating, updating, and reorganizing should be performed.
  - 24. The CTTS system of claim 17, wherein the at least one processor analyzes data indicative of quality of output speech synthesized in accordance with the context hierarchy to determine that the evaluating, updating, and reorganizing should be performed, wherein the evaluating, updating, and reorganizing are performed in response to determining that the evaluating, updating, and reorganizing should be performed.
  - 25. The CTTS system of claim 24, wherein the data indicative of quality of the output speech comprises data selected from the group consisting of a number of non-contiguous speech segments used to synthesize the output speech, a cost associated with speech segments used to synthesize the output speech, and a number of acoustic and/or prosodic contexts used to synthesize the output speech.
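The selection step of claim 22 can be sketched as a lookup from each target context to the segments filed under the matching leaf of the updated hierarchy, followed by a choice among candidates. The greedy search, the `leaf_segments` mapping, and the `join_cost` function are all assumptions for illustration; the claims do not specify the search strategy or the cost function.

```python
def select_segments(target_contexts, leaf_segments, cost):
    """Greedily pick, for each target context, the cheapest candidate segment.

    leaf_segments maps a context label to the segment ids filed under that leaf.
    cost(previous_segment, candidate) is a hypothetical join-cost function.
    """
    chosen = []
    prev = None
    for ctx in target_contexts:
        candidates = leaf_segments.get(ctx, [])
        if not candidates:
            raise KeyError(f"no segments for context {ctx!r}")
        best = min(candidates, key=lambda seg: cost(prev, seg))
        chosen.append(best)
        prev = best
    return chosen


def join_cost(prev, seg):
    """Zero cost when the candidate directly follows the previous segment id."""
    return 0 if prev is not None and seg == prev + 1 else 1


picked = select_segments(["x", "y"], {"x": [10, 3], "y": [11, 4]}, join_cost)
```

Here segment ids that are consecutive are treated as contiguous in the recordings, so the second choice favors the segment that directly follows the first.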

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Fischer, Volker; Kunzmann, Siegfried
Primary Examiner(s): He, Jialong
Application Number: US11/416,217
Publication Number: US 20060287861A1
Time in Patent Office: 2,527 Days
Field of Search: 704/258, 704/260, 704/266
US Class Current: 704/258
CPC Class Codes: G10L 13/06 Elementary speech units use...
