Voice quality compensation system for speech synthesis based on unit-selection speech database
Abstract
A database of recorded speech units that consists of a number of recording sessions is processed, and appropriate segments are modified by passing the signal of those segments through an autoregressive (AR) filter. The processing develops a Gaussian Mixture Model (GMM) for each recording session and, based on the variability of the speech quality within each session as measured against its model, one session is selected as the preferred session. Thereafter, all segments of all recording sessions are evaluated based on the model of the preferred session. The average power spectral density of each evaluated segment is compared to the power spectral density of the preferred session, and from this comparison, AR filter coefficients are derived for each segment so that, when the speech segment is passed through the AR filter, its power spectral density approaches that of the preferred session.
20 Claims
-
1. A method for improving quality of stored speech units comprising the steps of:
-
separating said stored speech units into sessions;
separating each session into segments;
analyzing each session to develop a speech model for the session;
selecting a preferred session based on the speech model for the session developed in said step of analyzing and said stored speech for the session;
identifying, by employing the speech model of said preferred session, said speech model being a preferred speech model, those of said segments that need to be altered; and
altering those of said segments that are identified by said step of identifying.
developing filter parameters for a segment that needs to be altered; and
passing the speech units signal of said segment that needs to be altered through a filter that employs said filter parameters.
-
-
4. The method of claim 3 where said filter is an AR filter.
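As a rough illustration of developing filter parameters and passing a segment through an AR filter, the sketch below derives all-pole coefficients from an autocorrelation sequence with the Levinson-Durbin recursion and then applies the resulting filter. The claims do not say how the coefficients are computed, so the use of Levinson-Durbin, the assumption that an autocorrelation sequence for the desired spectral correction is already available, and the names `levinson_durbin` and `all_pole_filter` are all assumptions made for this example.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients.

    r is an autocorrelation sequence r[0..order] (assumed given).
    Returns (a, err) with a[0] == 1 and err the prediction error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current predictor and lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err


def all_pole_filter(x, a, gain=1.0):
    """Apply y[n] = gain * x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = gain * xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


# Hypothetical autocorrelation of a first-order process with pole at 0.5.
coeffs, error_power = levinson_durbin([1.0, 0.5, 0.25], order=2)
impulse_response = all_pole_filter([1.0, 0.0, 0.0], coeffs)
```

In a real system the input `x` would be the samples of the segment that needs to be altered, and the autocorrelation would come from the comparison of power spectral densities described in the abstract.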
-
5. The method of claim 1 where said step of analyzing a session to develop a speech model for the session comprises the steps of:
-
selecting a sufficient number of segments from said session to form a speech portion of approximately ten minutes; and
developing a speech model for said session based on the segments selected in said step of selecting.
-
-
6. The method of claim 5 where said model is a Gaussian Mixture Model.
-
7. The method of claim 1 where said step of analyzing a session to develop a speech model for the session comprises the steps of:
-
selecting a number of segments, K, from said session, where K is greater than a preselected number, where each segment includes a plurality of observations;
developing speech parameters for each of said plurality of observations; and
developing a speech model for said session based on said speech parameters developed for observations in said selected segments of said session.
-
-
8. The method of claim 7 where said speech parameters are cepstrum coefficients.
-
9. The method of claim 1 where said step of selecting a preferred speech model comprises the steps of:
-
developing a measure of speech quality variability within each session based on the speech model developed for the session by said step of analyzing; and
selecting as the preferred model the speech model of the session with the least speech quality variability.
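One way to picture the selection in claim 9 is the short sketch below. The claim does not define the variability measure, so this example assumes it is the population variance of per-segment log likelihoods, each segment being scored against the model of its own session. The function names and the sample scores are hypothetical.

```python
def variability(log_likelihoods):
    """Population variance of per-segment log likelihoods (assumed measure)."""
    n = len(log_likelihoods)
    mean = sum(log_likelihoods) / n
    return sum((v - mean) ** 2 for v in log_likelihoods) / n


def select_preferred_session(session_scores):
    """Return the session whose segment scores vary the least.

    session_scores maps a session name to the log likelihoods of its
    segments under that session's own model.
    """
    return min(session_scores, key=lambda name: variability(session_scores[name]))


# Hypothetical scores for two recording sessions.
scores = {
    "r1": [-3.0, -3.1, -2.9],
    "r2": [-1.0, -5.0, -3.0],
}
preferred = select_preferred_session(scores)
```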
-
-
10. The method of claim 1 where said step of identifying segments that need to be altered comprises the steps of:
testing each of said segments against the hypothesis that the speech units in said segment conform to said preferred speech model.
-
11. The method of claim 10 where the hypothesis is accepted for a segment tested in said step of testing when the likelihood that a speech model that generated the speech units in the segment is said preferred speech model is higher than a preselected threshold level.
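The likelihood test of claim 11 could be sketched as follows, assuming the preferred model is a diagonal-covariance Gaussian Mixture Model and that a segment is scored by its average per-observation log likelihood. The averaging, the diagonal covariance, and the names `gmm_log_density`, `segment_log_likelihood` and `conforms` are assumptions for illustration. The claim itself only requires that the likelihood exceed a preselected threshold.

```python
import math


def gmm_log_density(obs, weights, means, variances):
    """Log density of one observation under a diagonal-covariance GMM."""
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(obs, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        comp.append(ll)
    # Log-sum-exp over mixture components for numerical stability.
    top = max(comp)
    return top + math.log(sum(math.exp(c - top) for c in comp))


def segment_log_likelihood(segment, weights, means, variances):
    """Average log density over the observations of a segment."""
    total = sum(gmm_log_density(o, weights, means, variances) for o in segment)
    return total / len(segment)


def conforms(segment, model, threshold):
    """Accept the hypothesis when the score exceeds the threshold."""
    weights, means, variances = model
    return segment_log_likelihood(segment, weights, means, variances) > threshold


# Hypothetical one-dimensional preferred model with two identical components.
model = ([0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
close_segment = [[0.0], [0.1]]
far_segment = [[5.0], [5.2]]
```

Here the observations would in practice be feature vectors such as the cepstrum coefficients of claim 8; scalar observations are used only to keep the example small.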
-
12. The method of claim 10 where the hypothesis is accepted for a segment tested in said step of testing when a z score for the segment tested in said step of testing, $z_{r_i}^{l}$, is greater than a preselected level, where
$$z_{r_i}^{l} = \frac{\mathcal{L}\left(O_{r_i}^{(l)} \mid \Lambda_{r_p}\right) - \mu_{\mathcal{L}}}{\sigma_{\mathcal{L}}},$$
$l$ is the number of the tested segment in the tested session, $r_i$; $\mathcal{L}(O_{r_i}^{(l)} \mid \Lambda_{r_p})$ is a log likelihood function of segment $l$ of session $r_i$, relative to said preferred model, $\Lambda_{r_p}$; $\mu_{\mathcal{L}}$ is a mean of the log likelihood function of all segments in said session from which said preferred model is selected, $r_p$; and $\sigma_{\mathcal{L}}^{2}$ is the variance of the log likelihood function of all segments in said session $r_p$.
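The z score test of claim 12 can be illustrated with a minimal sketch. It assumes the log likelihoods of all segments of the preferred session under the preferred model have already been computed, and it uses the population variance for the spread. The function names, the sample values, and the level of -1.5 are hypothetical.

```python
import math


def z_score(segment_ll, preferred_lls):
    """Standardize a segment's log likelihood against the preferred session.

    segment_ll: log likelihood of the tested segment under the preferred model.
    preferred_lls: log likelihoods of all segments of the preferred session
    under the same model.
    """
    n = len(preferred_lls)
    mu = sum(preferred_lls) / n
    var = sum((v - mu) ** 2 for v in preferred_lls) / n
    return (segment_ll - mu) / math.sqrt(var)


def needs_altering(segment_ll, preferred_lls, level):
    """A segment is flagged when its z score is not greater than the level."""
    return not (z_score(segment_ll, preferred_lls) > level)


# Hypothetical log likelihoods for the preferred session's segments.
preferred_lls = [-2.0, -4.0]
typical = needs_altering(-3.0, preferred_lls, level=-1.5)
outlier = needs_altering(-5.0, preferred_lls, level=-1.5)
```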
-
-
13. A database of stored speech units developed by a process that comprises the steps of:
-
separating said stored speech units into sessions;
separating each session into segments;
analyzing each session to develop a speech model for the session;
selecting a preferred speech model from speech models developed in said step of analyzing;
identifying, by employing said preferred speech model, those of said segments that need to be altered; and
altering those of said segments that are identified by said step of identifying.
developing filter parameters for a segment that needs to be altered; and
passing the speech units signal of said segment that needs to be altered through a filter that employs said filter parameters.
-
-
15. The database of claim 13 where, in said process that creates said database, said step of analyzing a session to develop a speech model for the session comprises the steps of:
-
selecting a sufficient number of segments from said session to form a speech portion of approximately ten minutes; and
developing a speech model for said session based on the segments selected in said step of selecting.
-
-
16. The database of claim 13 where, in said process that creates said database, said step of analyzing a session to develop a speech model for the session comprises the steps of:
-
selecting a number of segments, K, from said session, where K is greater than a preselected number, where each segment includes a plurality of observations;
developing speech parameters for each of said plurality of observations; and
developing a speech model for said session based on said speech parameters developed for observations in said selected segments of said session.
-
-
17. The database of claim 13 where, in said process that creates said database, said step of selecting a preferred speech model comprises the steps of:
-
developing a measure of speech quality variability within each session based on the speech model developed for the session by said step of analyzing; and
selecting as the preferred model the speech model of the session with the least speech quality variability.
-
-
18. The database of claim 13 where, in said process that creates said database, said step of identifying segments that need to be altered comprises the steps of:
testing each of said segments against the hypothesis that the speech units in said segment conform to said preferred speech model.
-
19. The database of claim 18 where the hypothesis is accepted for a segment tested in said step of testing when the likelihood that a speech model that generated the speech units in the segment is said preferred speech model is higher than a preselected threshold level.
-
20. The database of claim 13 where the hypothesis is accepted for a segment tested in said step of testing when a z score for the segment tested in said step of testing, $z_{r_i}^{l}$, is greater than a preselected level, where
$$z_{r_i}^{l} = \frac{\mathcal{L}\left(O_{r_i}^{(l)} \mid \Lambda_{r_p}\right) - \mu_{\mathcal{L}}}{\sigma_{\mathcal{L}}},$$
$l$ is the number of the tested segment in the tested session, $r_i$; $\mathcal{L}(O_{r_i}^{(l)} \mid \Lambda_{r_p})$ is a log likelihood function of segment $l$ of session $r_i$, relative to said preferred model, $\Lambda_{r_p}$; $\mu_{\mathcal{L}}$ is a mean of the log likelihood function of all segments in said session from which said preferred model is selected, $r_p$; and $\sigma_{\mathcal{L}}^{2}$ is the variance of the log likelihood function of all segments in said session $r_p$.
-
Specification