Model adaptation of neural tree networks and other fused models for speaker verification
Abstract
The model adaptation system of the present invention is a speaker verification system that embodies the capability to adapt models learned during the enrollment component to track aging of a user's voice. The system has the advantage of only requiring a single enrollment for the user. The model adaptation system and methods can be applied to several types of speaker recognition models including neural tree networks (NTN), Gaussian Mixture Models (GMMs), and dynamic time warping (DTW) or to multiple models (i.e., combinations of NTNs, GMMs and DTW). Moreover, the present invention can be applied to text-dependent or text-independent systems.
133 Citations
11 Claims
-
1. An adaptable speaker verification system with model adaptation, the system comprising:
-
a receiver, the receiver obtaining a voice utterance;
a means, connected to the receiver, for extracting predetermined features of the voice utterance;
a means, operably connected to the extracting means, for segmenting the predetermined features of the voice utterance, wherein the features are segmented into a plurality of subwords; and
at least one adaptable model, connected to the segmenting means, wherein the model models the plurality of subwords and outputs one or more scores, and the models are updated dynamically based on the received voice utterance to incorporate the changing characteristics of a user's voice and the adaptable models comprise at least one adaptable neural tree network model, the adaptable neural tree network model resulting in an NTN score.
2. (Dependent on claim 1)
a means, connected to the extracting means, for warping the voice utterance onto a dynamic warping template, the warping means providing a DTW score;
wherein the warping means is adapted based on the voice utterance.
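The warping and template adaptation recited in claim 2 can be sketched as follows. This is a minimal illustration, not the patented method: it assumes one scalar feature per frame, an absolute-difference local cost, the textbook dynamic-programming recursion for DTW, and a simple blend of aligned frames into the template. The function names `dtw_align` and `adapt_template` and the `weight` parameter are hypothetical.

```python
def dtw_align(template, utterance):
    """Return (score, path) for aligning utterance frames to template frames.

    Assumption: frames are scalars and the local cost is |a - b|.
    """
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j]: minimum accumulated cost of aligning template[:i] with utterance[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover which utterance frame maps to which template frame.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def adapt_template(template, utterance, path, weight=0.2):
    """Blend the utterance frames aligned to each template frame into the template.

    Assumption: a fixed blending weight; the claims do not say how adaptation is done.
    """
    sums = [0.0] * len(template)
    counts = [0] * len(template)
    for ti, uj in path:
        sums[ti] += utterance[uj]
        counts[ti] += 1
    return [
        (1 - weight) * t + weight * (s / c) if c else t
        for t, s, c in zip(template, sums, counts)
    ]
```

In this sketch a lower `dtw_align` score means a closer match, so it would need to be converted to a similarity before being combined with classifier scores that grow with confidence.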
-
-
3. An adaptable speaker verification system with model adaptation, the system comprising:
-
a receiver, the receiver obtaining a voice utterance;
a means, connected to the receiver, for extracting predetermined features of the voice utterance;
a means, operably connected to the extracting means, for segmenting the predetermined features of the voice utterance, wherein the features are segmented into a plurality of subwords;
at least one adaptable model, connected to the segmenting means, wherein the model models the plurality of subwords and outputs one or more scores, and the models are updated dynamically based on the received voice utterance to incorporate the changing characteristics of a user's voice and wherein the adaptable models comprise:
at least one adaptable Gaussian mixture model, the adaptable Gaussian mixture model resulting in a GMM score; and
at least one adaptable neural tree network model, the adaptable neural tree network model resulting in an NTN score.
4. (Dependent on claim 3)
a means, connected to the extracting means, for warping the voice utterance onto a dynamic warping template, the warping means providing a DTW score;
wherein the warping means is adapted based on the voice utterance.
-
-
5. An adaptable speaker verification method, including the steps of:
-
obtaining enrollment speech from a known individual;
receiving test speech from a user;
extracting predetermined features of the test speech;
warping the predetermined features using a dynamic time warping template, wherein the dynamic warping template is adapted based on the predetermined features of the test speech, resulting in the creation of warped feature data and a dynamic time warping score from the adapted dynamic warping template;
generating subwords from the warped feature data;
scoring the subwords using a plurality of adaptable models, wherein the adaptable models are adapted based on the subwords derived from the test speech and wherein the scoring comprises scoring at least one adaptable neural tree network model;
combining the results of each classifier score and the dynamic time warping score to generate a final score; and
comparing the final score to a threshold value to determine whether the test speech and enrollment speech are from the known individual.
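The last two steps of claim 5 (combining the scores and comparing against a threshold) can be sketched as below. The claim does not say how the scores are combined, so this assumes a weighted linear combination and assumes every score, including the DTW score, has already been normalized so that higher values favor the claimed speaker. The names `fuse_scores` and `verify` and the default equal weights are hypothetical.

```python
def fuse_scores(classifier_scores, dtw_score, weights=None):
    """Combine per-classifier scores and the DTW score into one final score.

    Assumption: a weighted sum; equal weights are used when none are given.
    """
    scores = list(classifier_scores) + [dtw_score]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("need exactly one weight per score")
    return sum(w * s for w, s in zip(weights, scores))


def verify(final_score, threshold):
    """Accept the identity claim when the final score reaches the threshold."""
    return final_score >= threshold
```

For example, `verify(fuse_scores([ntn_score, gmm_score], dtw_score), threshold)` would accept or reject a test utterance, where the three score variables and the threshold are placeholders supplied by the caller.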
-
-
6. An adaptable speaker verification method, wherein at least one neural tree network model is adapted based on an adaptation utterance, comprising the following steps:
-
storing number of speaker observations, number of imposter observations and a total number of observations from previous enrollments or verifications;
obtaining an adaptation utterance from a speaker;
extracting predetermined features from the speaker adaptation utterance;
segmenting the predetermined features into a plurality of subwords;
applying the plurality of subwords to at least one neural tree network model;
counting the number of updated speaker observations within each leaf of the neural tree network;
storing the number of updated speaker observations in memory; and
updating probabilities by dividing the number of updated speaker observations by a total number of observations at each leaf, thereby resulting in an adapted neural tree network model.
7. (Dependent on claim 6)
digitizing the obtained adaptation speaker utterance; and
preprocessing the digitized speaker utterance.
-
-
8. The adaptable speaker verification method of claim 6, wherein the step of segmenting comprises generating subwords using automatic blind speech segmentation.
-
9. The adaptable speaker verification method of claim 6, further comprising the step of:
-
warping the predetermined features from the speaker adaptation utterance using a dynamic time warping template, wherein the dynamic warping template is adapted based on the predetermined features of the test speech, resulting in the creation of warped feature data; and
wherein the step of segmenting segments the warped feature data into a plurality of subwords.
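The counting-based update of claim 6 can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: each leaf is assumed to store a speaker count and an imposter count, the tree structure is left unchanged, and a caller-supplied `route` function (hypothetical) maps a feature frame to the identifier of the leaf it reaches. The names `Leaf` and `adapt_leaves` are also hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    # Observation counts kept from previous enrollments or verifications.
    speaker_count: int = 0
    imposter_count: int = 0

    @property
    def speaker_probability(self):
        # Speaker observations divided by all observations seen at this leaf.
        total = self.speaker_count + self.imposter_count
        return self.speaker_count / total if total else 0.0


def adapt_leaves(leaves, route, frames, from_speaker=True):
    """Route each adaptation frame to a leaf, update its count, return new probabilities.

    leaves: dict mapping leaf id -> Leaf
    route:  hypothetical function mapping a frame to a leaf id
    """
    for frame in frames:
        leaf = leaves[route(frame)]
        if from_speaker:
            leaf.speaker_count += 1
        else:
            leaf.imposter_count += 1
    return {leaf_id: leaf.speaker_probability for leaf_id, leaf in leaves.items()}
```

Because only the stored counts change, an update of this kind needs no retraining of the tree: new speaker frames raise the probability at the leaves they reach, and new imposter frames lower it.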
-
-
10. An adaptable speaker verification method, wherein at least one neural tree network model is adapted based on an adaptation utterance, comprising the following steps:
-
storing number of speaker observations, number of imposter observations and a total number of observations from previous enrollments or verifications;
obtaining an adaptation utterance from an imposter;
extracting predetermined features from the imposter adaptation utterance;
segmenting the predetermined features into a plurality of subwords;
applying the plurality of subwords to at least one neural tree network model;
counting the number of updated imposter observations within each leaf of the neural tree network;
storing the number of updated imposter observations in memory; and
updating probabilities by dividing the number of updated speaker observations by a total number of observations at each leaf, thereby resulting in an adapted neural tree model.
-
-
11. An adaptable speaker verification method, including the steps of:
-
obtaining enrollment speech from a known individual;
receiving test speech from a user;
extracting predetermined features of the test speech;
warping the predetermined features using a dynamic time warping template, wherein the dynamic warping template is adapted based on the predetermined features of the test speech, resulting in the creation of warped feature data and a dynamic time warping score from the adapted dynamic warping template;
generating subwords from the warped feature data;
scoring the subwords using a plurality of adaptable models, wherein the adaptable models are adapted based on the subwords derived from the test speech and wherein the scoring comprises scoring at least one adaptable Gaussian mixture model, the adaptable Gaussian mixture model resulting in a GMM score; and
scoring at least one adaptable neural tree network model, the adaptable neural tree network model resulting in an NTN score;
combining the results of each classifier score and the dynamic time warping score to generate a final score; and
comparing the final score to a threshold value to determine whether the test speech and enrollment speech are from the known individual.
-
Specification