Method and system for statistic-based distance definition in text-to-speech conversion

US 20060074674A1
Filed: 09/29/2005
Published: 04/06/2006
Est. Priority Date: 09/30/2004
Status: Active Grant

Abstract

A method for distance definition in a text-to-speech conversion system by applying Gaussian Mixture Model (GMM) to a distance definition. According to an embodiment, the text that is to be subjected to text-to-speech conversion is analyzed to obtain a text with descriptive prosody annotation; clustering is performed for samples in the obtained text; and a GMM model is generated for each cluster, to determine the distance between the sample and the corresponding GMM model.
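
The abstract's idea of one GMM per cluster, with a sample-to-model distance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it treats each sample as a single scalar prosody value (for example a pitch target), fits the mixture with plain EM, and assumes the distance is the negative log-likelihood of the sample under the cluster's GMM. The names `fit_gmm_1d`, `gmm_distance`, and the cluster labels are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(samples, n_components=2, n_iter=50, var_floor=1e-3):
    """Fit a 1-D GMM with EM; returns a list of (weight, mean, variance)."""
    xs = sorted(samples)
    n = len(xs)
    # Deterministic initialisation: means at evenly spaced quantiles.
    means = [xs[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    centre = sum(xs) / n
    spread = max(sum((x - centre) ** 2 for x in xs) / n, var_floor)
    variances = [spread] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                var_floor,
            )
    return list(zip(weights, means, variances))

def gmm_distance(x, gmm):
    """Assumed distance: negative log-likelihood of x under the mixture."""
    density = sum(w * gaussian_pdf(x, m, v) for w, m, v in gmm)
    return -math.log(max(density, 1e-300))

# Hypothetical clusters of scalar prosody values, one GMM per cluster.
clusters = {
    "phrase_final": [180, 185, 190, 240, 245, 250],
    "phrase_medial": [100, 104, 108, 112, 150, 155],
}
models = {name: fit_gmm_1d(values) for name, values in clusters.items()}
print(gmm_distance(186, models["phrase_final"]))
```

A sample that lies close to the data the cluster was trained on receives a small distance, and an outlier receives a large one. A real system would use multivariate prosody features and a tested library rather than this hand-rolled EM.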

22 Claims

1. A method comprising the steps of:
- analyzing text that is to be subjected to text-to-speech conversion to obtain text with descriptive prosody annotation;
- performing clustering for samples in the obtained text; and
- generating a Gaussian Mixture Model for each cluster to determine the distance between the sample and the corresponding Gaussian Mixture Model.

2. The method according to claim 1, wherein the step of performing clustering comprises clustering using a decision tree.

3. The method according to claim 2, further comprising the step of combining two branches in the decision tree if the two branches are similar.
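
Claims 2 and 3 describe decision-tree clustering in which similar branches are combined. The sketch below is one possible reading, offered only as an illustration: the claims do not say how similarity is measured, so the test used here (difference of the branch means below a threshold) is an assumption, as are the names `cluster` and `similar`, the `pitch` field, and the context questions.

```python
from statistics import mean

def similar(a, b, threshold):
    """Assumed similarity test: branch means differ by less than threshold."""
    return abs(mean(s["pitch"] for s in a) - mean(s["pitch"] for s in b)) < threshold

def cluster(samples, questions, threshold=10.0, min_size=2):
    """Split samples with yes/no context questions; return the leaf clusters."""
    if not questions or len(samples) < 2 * min_size:
        return [samples]
    question, rest = questions[0], questions[1:]
    yes = [s for s in samples if question(s)]
    no = [s for s in samples if not question(s)]
    if len(yes) < min_size or len(no) < min_size or similar(yes, no, threshold):
        # Branches are too small or too similar: combine them again
        # and try the remaining questions on the merged set.
        return cluster(samples, rest, threshold, min_size)
    return (cluster(yes, rest, threshold, min_size)
            + cluster(no, rest, threshold, min_size))

# Hypothetical annotated samples: pitch depends on position, not on stress.
samples = [
    {"final": True, "stressed": True, "pitch": 120},
    {"final": True, "stressed": True, "pitch": 122},
    {"final": True, "stressed": False, "pitch": 121},
    {"final": True, "stressed": False, "pitch": 123},
    {"final": False, "stressed": True, "pitch": 200},
    {"final": False, "stressed": True, "pitch": 202},
    {"final": False, "stressed": False, "pitch": 201},
    {"final": False, "stressed": False, "pitch": 203},
]
questions = [lambda s: s["final"], lambda s: s["stressed"]]
leaves = cluster(samples, questions)
print([len(leaf) for leaf in leaves])
```

In this toy data the split on `final` is kept because the two branches differ strongly, while the split on `stressed` is undone because its branches are nearly identical. Each resulting leaf would then receive its own Gaussian Mixture Model.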

4. A system comprising:
- a text analysis unit for analyzing text that is to be subjected to text-to-speech conversion to obtain text with descriptive prosody annotation;
- a prosody prediction unit for performing clustering for samples in the text obtained by the text analysis unit; and
- a Gaussian Mixture Model base, coupled to the prosody prediction unit, for storing a generated Gaussian Mixture Model.

5. The system according to claim 4, wherein the prosody prediction unit is configured to cluster the text obtained from the text analysis unit by using a decision tree.

6. The system according to claim 5, further comprising a combining unit for combining similar branches in the decision tree used by the prosody prediction unit.

7. A method comprising the steps of:
- determining a cluster for a unit to be subjected to text-to-speech conversion;
- determining the Gaussian Mixture Model of the cluster;
- calculating the distance between candidate samples in the cluster and the determined Gaussian Mixture Model; and
- identifying the sample with the smallest distance for subsequent speech synthesizing.

8. The method according to claim 7, wherein the step of identifying the sample with the smallest distance comprises identifying the sample with the smallest target cost plus transition cost.

9. The method according to claim 7, wherein the step of identifying the sample with the smallest distance comprises identifying the sample with the smallest target cost.

10. The method according to claim 7, wherein the calculating step comprises calculating the target cost and the transition cost.

11. The method according to claim 10, wherein the step of identifying the sample with the smallest distance comprises identifying the sample with the smallest target cost.

12. The method according to claim 10, wherein the step of identifying the sample with the smallest distance comprises identifying the sample with the smallest target cost plus transition cost.

13. The method according to claim 7, wherein the step of determining the cluster for the unit to be subjected to text-to-speech conversion comprises:
- obtaining descriptive prosody annotation information of each unit to be subjected to text-to-speech conversion;
- finding the context equivalent cluster of each unit to be subjected to text-to-speech conversion, the cluster corresponding to a Gaussian Mixture Model; and
- in the space of the Gaussian Mixture Model mixture model sequence, searching for the best values based on the distance definition and criteria of overall optimization.

14. The method according to claim 7, wherein the steps of calculating the distance between the candidate samples in the cluster and the determined Gaussian Mixture Model and identifying the sample with the smallest distance for subsequent speech synthesizing comprise:
- obtaining descriptive prosody annotation information of each unit to be subjected to text-to-speech conversion;
- finding the context equivalent cluster of each unit to be subjected to text-to-speech conversion, the cluster corresponding to a Gaussian Mixture Model;
- evaluating all the candidates of the unit to be text-to-speech conversion synthesized through the Gaussian Mixture Model-based distance definition; and
- finding the overall optimal candidate series, for subsequent speech synthesizing, based on the distance given in the evaluating step and criteria of overall optimization.
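
Claims 8 to 14 speak of a target cost, a transition cost, and an overall optimal candidate series. A common way to search such a series is dynamic programming (a Viterbi-style search); the claims do not name an algorithm, so this choice is an assumption. In the sketch below, `target_cost` stands in for the GMM-based distance of a candidate to its unit's cluster model, and `transition_cost` is a hypothetical join cost between adjacent candidates. All names and numbers are illustrative.

```python
def select_candidates(candidates, target_cost, transition_cost):
    """Return (best_series, total_cost) minimising target plus transition cost.

    candidates[i] is the list of candidate samples for unit i.
    """
    # best[i][j] = (cost of the cheapest series ending in candidates[i][j],
    #               index of the previous candidate on that series)
    best = [[(target_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for c in candidates[i]:
            costs = [best[i - 1][k][0] + transition_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(i, c), k_best))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    series = []
    for i in range(len(candidates) - 1, -1, -1):
        series.append(candidates[i][j])
        j = best[i][j][1]
    series.reverse()
    return series, total

# Hypothetical example: two units, two scalar candidates each.
unit_means = [100.0, 110.0]  # stand-in for each unit's cluster model

def target_cost(i, c):
    # Stand-in for the GMM-based distance (single Gaussian, std. dev. 10).
    return ((c - unit_means[i]) / 10.0) ** 2

def transition_cost(prev, cur):
    # Assumed join cost: penalise jumps between adjacent candidates.
    return abs(cur - prev) / 10.0

series, total = select_candidates([[100, 130], [108, 131]],
                                  target_cost, transition_cost)
print(series, total)
```

Dropping the transition term (returning 0 from `transition_cost`) reduces the search to picking, per unit, the candidate with the smallest target cost, which corresponds to the variant in claims 9 and 11.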

15. A system comprising:
- a cluster determining unit for determining the cluster for the unit to be subjected to text-to-speech conversion to determine the Gaussian Mixture Model of the cluster;
- a distance calculating unit, for calculating the distance between the candidate samples in the cluster and the determined Gaussian Mixture Model; and
- an optimizing unit, for identifying the sample with the smallest distance for subsequent speech synthesizing.

16. The system according to claim 15, wherein the optimizing unit is configured to identify the sample with the smallest target cost plus transition cost.

17. The system according to claim 15, wherein the optimizing unit is configured to identify the sample with the smallest target cost.

18. The system according to claim 15, wherein the distance calculating unit further comprises a unit for calculating a target cost and a unit for calculating a transition cost.

19. The system according to claim 18, wherein the optimizing unit is configured to identify the sample with the smallest target cost plus transition cost.

20. The system according to claim 18, wherein the optimizing unit is configured to identify the sample with the smallest target cost.

21. The system according to claim 15, wherein the cluster determining unit further comprises:
- means for getting descriptive prosody annotation information of each unit to be subjected to text-to-speech conversion;
- means for finding the context equivalent cluster of each unit to be subjected to text-to-speech conversion, the cluster corresponding to a Gaussian Mixture Model; and
- means for, in the space of the mixture model sequence, searching for the best values, to be used as the explicit prediction, based on the distance definition and criteria of overall optimization.

22. The system according to claim 15, wherein the calculating unit further comprises:
- means for obtaining descriptive prosody annotation information of each unit to be subjected to text-to-speech conversion;
- means for finding the context equivalent cluster of each unit to be subjected to text-to-speech conversion, which corresponds to a mixture model;
- means for evaluating all the candidates of the unit to be text-to-speech conversion synthesized through the Gaussian Mixture Model-based distance definition; and
- wherein the optimizing unit further comprises means for finding the overall optimal candidate series, for subsequent speech synthesizing, based on the distance from the means for evaluating and criteria of overall optimization.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Ma, Xi Jun; Jin, Ling; Chai, Hai Xin; Zhang, Wei ZW
Granted Patent: US 7,590,540 B2
US Class Current: 704/260
CPC Class Codes:
- G10L 13/04 Details of speech synthesis...
- G10L 13/10 Prosody rules derived from ...
