Voice quality compensation system for speech synthesis based on unit-selection speech database

US 6,266,638 B1
Filed: 03/30/1999
Issued: 07/24/2001
Est. Priority Date: 03/30/1999
Status: Expired due to Term

Abstract

A database of recorded speech units that consists of a number of recording sessions is processed, and appropriate segments are modified by passing the signal of those segments through an autoregressive (AR) filter. The processing develops a Gaussian Mixture Model (GMM) for each recording session and, based on the variability of the speech quality within each session as measured by its model, one session is selected as the preferred session. Thereafter, all segments of all recording sessions are evaluated based on the model of the preferred session. The average power spectral density of each evaluated segment is compared to the power spectral density of the preferred session, and from this comparison, AR filter coefficients are derived for each segment so that, when the speech segment is passed through the AR filter, its power spectral density approaches that of the preferred session.
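The abstract does not say how the AR coefficients are obtained from the spectral comparison. As a minimal sketch only, one common route is assumed here: take the ratio of the preferred-session PSD to the segment PSD as the desired filter response, convert it to an autocorrelation sequence, and solve the Yule-Walker equations with the Levinson-Durbin recursion. The function names (`autocorr_from_psd`, `levinson_durbin`, `correction_ar`, `ar_filter`) and the uniform frequency sampling are illustrative assumptions, not taken from the patent.

```python
import math

def autocorr_from_psd(psd, max_lag):
    """Inverse DFT of a real, symmetric PSD sampled at N uniform
    frequencies on [0, 2*pi); returns lags 0..max_lag."""
    n = len(psd)
    return [
        sum(p * math.cos(2 * math.pi * k * m / n) for m, p in enumerate(psd)) / n
        for k in range(max_lag + 1)
    ]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations.
    Returns ([1, a1, ..., ap], prediction_error) for A(z) = 1 + sum a_j z^-j."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def correction_ar(segment_psd, preferred_psd, order):
    """AR model of the spectral ratio preferred/segment (assumed approach)."""
    ratio = [p / s for p, s in zip(preferred_psd, segment_psd)]
    return levinson_durbin(autocorr_from_psd(ratio, order), order)

def ar_filter(x, a, gain=1.0):
    """All-pole filter: y[n] = gain * x[n] - sum_j a[j] * y[n - j]."""
    y = []
    for n, xn in enumerate(x):
        acc = gain * xn
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y.append(acc)
    return y
```

In this sketch a segment would be corrected with `ar_filter(samples, a, gain=math.sqrt(err))`, where `a, err = correction_ar(segment_psd, preferred_psd, order)`; the filter order is a free parameter that the source does not specify.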

20 Claims

1. A method for improving quality of stored speech units comprising the steps of:
- separating said stored speech units into sessions;
  
  separating each session into segments;
  
  analyzing each session to develop a speech model for the session;
  
  selecting a preferred session based on the speech model for the session developed in said step of analyzing and said stored speech for the session;
  
  identifying, by employing the speech model of said preferred session, said speech model being a preferred speech model, those of said segments that need to be altered; and
  
  altering those of said segments that are identified by said step of identifying.
2. The method of claim 1 where the segments are approximately the same duration.
3. The method of claim 1 where said step of altering comprises the steps of:
4. The method of claim 3 where said filter is an AR filter.
5. The method of claim 1 where said step of analyzing a session to develop a speech model for the session comprises the steps of:
- selecting a sufficient number of segments from said session to form a speech portion of approximately ten minutes; and
  
  developing a speech model for said session based on the segments selected in said step of selecting.
6. The method of claim 5 where said model is a Gaussian Mixture Model.
7. The method of claim 1 where said step of analyzing a session to develop a speech model for the session comprises the steps of:
- selecting a number of segments, K, from said session, where K is greater than a preselected number, where each segment includes a plurality of observations;
  
  developing speech parameters for each of said plurality of observations; and
  
  developing a speech model for said session based on said speech parameters developed for observations in said selected segments of said session.
8. The method of claim 7 where said speech parameters are cepstrum coefficients.
9. The method of claim 1 where said step of selecting a preferred speech model comprises the steps of:
- developing a measure of speech quality variability within each session based on the speech model developed for the session by said step of analyzing; and
  
  selecting as the preferred model the speech model of the session with the least speech quality variability.
10. The method of claim 1 where said step of identifying segments that need to be altered comprises the steps of:
- testing each of said segments against the hypothesis that the speech units in said segment conform to said preferred speech model.
11. The method of claim 10 where the hypothesis is accepted for a segment tested in said step of testing when the likelihood that a speech model that generated the speech units in the segment is said preferred speech model is higher than a preselected threshold level.
12. The method of claim 10 where the hypothesis is accepted for a segment tested in said step of testing when a z score for the segment tested in said step of testing, $z_{r_i}^{l}$, is greater than a preselected level, where

$$z_{r_i}^{l} = \frac{\mathcal{L}(O_{r_i}^{(l)} \mid \Lambda_{r_p}) - \mu_{\mathcal{L}}}{\sigma_{\mathcal{L}}},$$

$l$ is the number of the tested segment in the tested session $r_i$, $\mathcal{L}(O_{r_i}^{(l)} \mid \Lambda_{r_p})$ is a log likelihood function of segment $l$ of session $r_i$ relative to said preferred model $\Lambda_{r_p}$, $\mu_{\mathcal{L}}$ is a mean of the log likelihood function of all segments in said session $r_p$ from which said preferred model is selected, and $\sigma_{\mathcal{L}}^{2}$ is the variance of the log likelihood function of all segments in said session $r_p$.
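
The hypothesis test of claims 10 to 12 can be illustrated with a short sketch. It assumes the per-segment log likelihoods under the preferred model have already been computed, and it uses the population standard deviation for the spread of the preferred session; the claim only says "variance", so that choice, along with the names `z_scores` and `segments_to_alter`, is an assumption made for illustration.

```python
from statistics import fmean, pstdev

def z_scores(tested_loglik, preferred_loglik):
    """z score of each tested segment's log likelihood, normalised by the
    mean and standard deviation of the preferred session's own segments."""
    mu = fmean(preferred_loglik)
    sigma = pstdev(preferred_loglik)  # population std dev (assumed)
    return [(ll - mu) / sigma for ll in tested_loglik]

def segments_to_alter(tested_loglik, preferred_loglik, level):
    """Indices of segments whose z score is NOT greater than `level`,
    i.e. segments for which the conformity hypothesis is rejected."""
    z = z_scores(tested_loglik, preferred_loglik)
    return [i for i, value in enumerate(z) if value <= level]

# Hypothetical log likelihoods, for illustration only.
preferred = [-10.0, -12.0, -8.0, -10.0]
tested = [-10.0, -20.0, -9.0]
flagged = segments_to_alter(tested, preferred, level=-2.0)
```

A segment whose z score exceeds the preselected level is treated as conforming to the preferred model and is left unaltered; the rest are candidates for filtering.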

13. A database of stored speech units developed by a process that comprises the steps of:
- separating said stored speech units into sessions;
  
  separating each session into segments;
  
  analyzing each session to develop a speech model for the session;
  
  selecting a preferred speech model from speech models developed in said step of analyzing;
  
  identifying, by employing said preferred speech model, those of said segments that need to be altered; and
  
  altering those of said segments that are identified by said step of identifying.
14. The database of claim 13 where, in said process that creates said database, said step of altering comprised the steps of:
15. The database of claim 13 where, in said process that creates said database, said step of analyzing a session to develop a speech model for the session comprises the steps of:
- selecting a sufficient number of segments from said session to form a speech portion of approximately ten minutes; and
  
  developing a speech model for said session based on the segments selected in said step of selecting.
16. The database of claim 13 where, in said process that creates said database, said step of analyzing a session to develop a speech model for the session comprises the steps of:
- selecting a number of segments, K, from said session, where K is greater than a preselected number, where each segment includes a plurality of observations;
  
  developing speech parameters for each of said plurality of observations; and
  
  developing a speech model for said session based on said speech parameters developed for observations in said selected segments of said session.
17. The database of claim 13 where, in said process that creates said database, said step of selecting a preferred speech model comprises the steps of:
- developing a measure of speech quality variability within each session based on the speech model developed for the session by said step of analyzing; and
  
  selecting as the preferred model the speech model of the session with the least speech quality variability.
18. The database of claim 13 where, in said process that creates said database, said step of identifying segments that need to be altered comprises the steps of:
- testing each of said segments against the hypothesis that the speech units in said segment conform to said preferred speech model.
19. The database of claim 18 where the hypothesis is accepted for a segment tested in said step of testing when the likelihood that a speech model that generated the speech units in the segment is said preferred speech model is higher than a preselected threshold level.
20. The database of claim 13 where the hypothesis is accepted for a segment tested in said step of testing when a z score for the segment tested in said step of testing, $z_{r_i}^{l}$, is greater than a preselected level, where

$$z_{r_i}^{l} = \frac{\mathcal{L}(O_{r_i}^{(l)} \mid \Lambda_{r_p}) - \mu_{\mathcal{L}}}{\sigma_{\mathcal{L}}},$$

$l$ is the number of the tested segment in the tested session $r_i$, $\mathcal{L}(O_{r_i}^{(l)} \mid \Lambda_{r_p})$ is a log likelihood function of segment $l$ of session $r_i$ relative to said preferred model $\Lambda_{r_p}$, $\mu_{\mathcal{L}}$ is a mean of the log likelihood function of all segments in said session $r_p$ from which said preferred model is selected, and $\sigma_{\mathcal{L}}^{2}$ is the variance of the log likelihood function of all segments in said session $r_p$.
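
Claims 16 and 17 (like claims 7 and 9 of the method) describe modelling each session from per-observation speech parameters and then picking the session with the least speech quality variability. The sketch below is one possible reading, not the patented procedure: it assumes a diagonal-covariance GMM whose parameters are already trained, scores each segment by its average log likelihood, and uses the variance of those scores within a session as the variability measure. The claims do not define that measure, so it is an assumption, as are all function names.

```python
import math
from statistics import pvariance

def gmm_log_likelihood(obs, weights, means, variances):
    """Log density of one observation vector under a diagonal-covariance GMM."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(obs, mu, var)
        )
        terms.append(math.log(w) + log_gauss)
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def segment_log_likelihood(segment, gmm):
    """Average log likelihood over the observations of one segment."""
    return sum(gmm_log_likelihood(o, *gmm) for o in segment) / len(segment)

def pick_preferred_session(sessions, models):
    """sessions: {name: [segment, ...]}, each segment a list of observation vectors.
    models: {name: (weights, means, variances)} trained per session.
    Returns the session whose segment scores vary least under its own model."""
    def variability(name):
        scores = [segment_log_likelihood(s, models[name]) for s in sessions[name]]
        return pvariance(scores)
    return min(sessions, key=variability)
```

In practice the observation vectors would be the cepstrum coefficients mentioned in claim 8, and the model parameters would come from a separate training step that this sketch leaves out.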

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Stylianou, Ioannis G.
Primary Examiner(s): Dorvil, Richemond
Application Number: US09/281,022
Time in Patent Office: 847 Days
Field of Search: 704/260, 704/258, 704/256, 704/255, 704/269, 704/266, 704/200, 704/201, 704/233, 704/234, 704/240, 704/267, 704/268
US Class Current: 704/266
CPC Class Codes: G10L 13/06 Elementary speech units use...
