Speaker recognition using a hierarchical speaker model tree
Abstract
In an illustrative embodiment, a speaker model is generated for each of a number of speakers from which speech samples have been obtained. Each speaker model contains a collection of distributions of audio feature data derived from the speech sample of the associated speaker. A hierarchical speaker model tree is created by merging similar speaker models on a layer by layer basis. Each time two or more speaker models are merged, a corresponding parent speaker model is created in the next higher layer of the tree. The tree is useful in applications such as speaker verification and speaker identification.
21 Claims
-
1. A method for speaker identification, comprising the steps of:
-
generating, for each of a plurality of speakers, a speaker model containing a collection of distributions of audio feature data associated with that speaker;
merging similar speaker models on a layer by layer basis so as to generate a hierarchical speaker model tree, wherein a lowest layer of the hierarchical speaker model tree comprises each generated speaker model; and
performing speaker identification of an unknown speaker using the hierarchical speaker model tree to determine if the unknown speaker is one of the plurality of speakers, wherein the step of performing speaker identification comprises:
i) receiving a new speech sample from the unknown speaker and generating a test speaker model therefrom;
ii) comparing said test model with merged speaker models within a higher layer, i+j, of said tree to determine which of the merged speaker models is closest to said test model;
iii) comparing said test model with children of said closest merged speaker model in a next lower layer, i+j−1, to determine which child is closest to said test model;
repeating step (iii) on a layer by layer basis, including the lowest layer of the tree, whereby said unknown speaker is identified as the speaker corresponding to the closest speaker model in said lowest layer.
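The top-down search of steps (ii) and (iii) can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the `identify` function, the scalar stand-in models, and the absolute-difference distance are all hypothetical and are not taken from the patent. The sketch returns the closest leaf and does not decide whether the unknown speaker is actually enrolled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    model: float                      # assumed scalar stand-in for a speaker model
    name: str = ""                    # speaker label, set on leaves only
    children: List["Node"] = field(default_factory=list)


def identify(top_layer: List[Node], test_model: float,
             distance: Callable[[float, float], float]) -> str:
    # Step (ii): pick the closest merged model in the starting layer.
    node = min(top_layer, key=lambda n: distance(n.model, test_model))
    # Step (iii), repeated: descend to the closest child until a leaf is reached.
    while node.children:
        node = min(node.children, key=lambda n: distance(n.model, test_model))
    return node.name


# Hypothetical four-speaker tree with scalar "models".
ab = Node(11.0, children=[Node(10.0, "a"), Node(12.0, "b")])
cd = Node(51.5, children=[Node(50.0, "c"), Node(53.0, "d")])
closest_speaker = identify([ab, cd], 49.0, lambda x, y: abs(x - y))
```

Only the children of the winning node are compared at each layer, so the number of model comparisons grows with the tree depth rather than with the number of enrolled speakers.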
determining, for each distribution of said first model, which distribution of said second model has the closest distance thereto, whereby a plurality of closest distances are obtained; and
computing a final distance between said first and second models based at least upon said closest distances.
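A minimal sketch of the model-to-model distance described in the two steps above. The claim text does not say how the distance between two individual distributions is measured or how the closest distances are combined, so both choices here are assumptions: `dist_gauss` uses the squared Euclidean distance between means as a placeholder, and `model_distance` averages the closest distances. The `(mean, variance)` tuple layout is also hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed representation: a diagonal Gaussian as (mean vector, variance vector).
Gaussian = Tuple[Sequence[float], Sequence[float]]
Model = List[Gaussian]


def dist_gauss(a: Gaussian, b: Gaussian) -> float:
    # Placeholder distance (an assumption): squared Euclidean distance of the means.
    return sum((x - y) ** 2 for x, y in zip(a[0], b[0]))


def model_distance(first: Model, second: Model) -> float:
    # For each distribution of the first model, find its closest distance
    # to any distribution of the second model.
    closest = [min(dist_gauss(d, e) for e in second) for d in first]
    # Combine the closest distances into a final distance (here: their mean).
    return sum(closest) / len(closest)


m1 = [([0.0], [1.0]), ([10.0], [1.0])]
m2 = [([1.0], [1.0]), ([12.0], [1.0])]
d12 = model_distance(m1, m2)
```

In general this measure is not symmetric, since the closest-match search runs from the first model to the second.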
-
-
4. The method of claim 1 wherein each said distribution is a multi-dimensional Gaussian distribution.
-
5. The method of claim 1 wherein said step of merging similar models comprises:
merging a first speaker model with a second speaker model which is close in distance to said first speaker model to form a parent speaker model, by establishing distribution pairs between the first and second speaker models and forming a merged distribution from each distribution pair, whereby the parent speaker model contains a plurality of merged distributions.
-
6. The method of claim 5 wherein each merged distribution is formed by averaging statistical parameters of the distributions of the respective distribution pair.
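The merging of claims 5 and 6 might look like the sketch below. It assumes that both models hold the same number of diagonal Gaussians, that pairs are formed greedily by closest means, and that the "statistical parameters" being averaged are the means and variances; none of these details is fixed by the claim text, and `merge_models` is a hypothetical name.

```python
from typing import List, Tuple

Gaussian = Tuple[List[float], List[float]]  # assumed (mean, variance) layout
Model = List[Gaussian]


def merge_models(first: Model, second: Model) -> Model:
    unused = list(range(len(second)))   # indices of not-yet-paired distributions
    parent: Model = []
    for mean_a, var_a in first:
        # Establish a distribution pair: closest unused distribution of `second`
        # (greedy pairing on squared mean distance is an assumption).
        j = min(unused,
                key=lambda k: sum((x - y) ** 2
                                  for x, y in zip(mean_a, second[k][0])))
        unused.remove(j)
        mean_b, var_b = second[j]
        # Form the merged distribution by averaging the paired parameters.
        parent.append(([(x + y) / 2 for x, y in zip(mean_a, mean_b)],
                       [(x + y) / 2 for x, y in zip(var_a, var_b)]))
    return parent


first = [([0.0], [1.0]), ([10.0], [2.0])]
second = [([12.0], [4.0]), ([2.0], [3.0])]
parent = merge_models(first, second)
```

The parent model contains one merged distribution per pair, so it has the same size as each of its children.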
-
7. The method of claim 1, wherein said step of merging similar speaker models comprises determining sets of n speaker models in each layer having the closest distances to one another, and merging each set of n speaker models to form a corresponding parent speaker model on a layer by layer basis.
-
8. The method of claim 7, wherein n equals two, such that said tree has a binary structure.
-
9. The method of claim 7, further comprising the step of adding a leftover speaker model of a lower layer to a next higher layer.
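The layer-by-layer construction of claims 7 to 9 can be sketched for the binary case (n equals two). The `distance` and `merge` callables are passed in, so any model type works; the scalar models in the usage lines are only a stand-in. Repeatedly taking the closest remaining pair is one possible reading of "closest distances to one another" and is an assumption, as are the names `Node`, `build_layer`, and `build_tree`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Node:
    model: Any
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def build_layer(nodes: List[Node], distance: Callable, merge: Callable) -> List[Node]:
    remaining = list(nodes)
    parents: List[Node] = []
    while len(remaining) >= 2:
        # Find the closest pair among the models not yet merged in this layer.
        pairs = ((distance(a.model, b.model), i, j)
                 for i, a in enumerate(remaining)
                 for j, b in enumerate(remaining) if i < j)
        _, i, j = min(pairs)
        b = remaining.pop(j)            # pop the larger index first
        a = remaining.pop(i)
        parents.append(Node(merge(a.model, b.model), children=[a, b]))
    parents.extend(remaining)           # a leftover model moves up unmerged
    return parents


def build_tree(leaves: List[Node], distance: Callable, merge: Callable) -> List[List[Node]]:
    layers = [list(leaves)]
    while len(layers[-1]) > 1:
        layers.append(build_layer(layers[-1], distance, merge))
    return layers                       # layers[0] is the lowest layer


leaves = [Node(10, "a"), Node(12, "b"), Node(50, "c"), Node(53, "d"), Node(90, "e")]
layers = build_tree(leaves, lambda x, y: abs(x - y), lambda x, y: (x + y) / 2)
```

With five leaves, the odd model out is carried into the next layer as-is, which corresponds to the leftover handling of claim 9.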
-
10. A method for speaker verification, comprising the steps of:
-
generating, for each of a plurality of registered speakers in a system, a speaker model containing a collection of distributions of audio feature data associated with that speaker;
merging similar speaker models on a layer by layer basis so as to generate a hierarchical speaker model tree, wherein a lowest layer of the hierarchical speaker model tree are leaf members comprising the speaker models of said registered speakers; and
performing speaker verification using the hierarchical speaker model tree to verify that a person is a registered speaker in the system, wherein the step of performing speaker verification comprises:
receiving a claimed identification from the person corresponding to a particular one of said speaker models of the lowest layer of the hierarchical speaker model tree;
determining a cohort set of similar speaker models associated with said particular speaker model using the hierarchical speaker model tree, wherein the step of determining a cohort set comprises the steps of matching the claimed identification with a leaf member of the hierarchical speaker model tree, traversing up the hierarchical speaker model tree from said leaf member to a parent node in a desired layer, and then traversing down the hierarchical speaker model tree from the parent node to all leaf members connected to the parent node, wherein all leaf members connected to the parent node comprise the cohort set;
receiving a new speech sample from the person and generating a test speaker model therefrom;
verifying that the person is a registered speaker if said particular speaker model is the closest model of said cohort set to said test model.
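The cohort-set lookup and the verification test of claim 10 can be sketched as below. The parent pointers, the `levels_up` parameter (standing in for the "desired layer"), the scalar models, and all function names are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence


class Node:
    def __init__(self, model: float, children: Sequence["Node"] = (), name: str = ""):
        self.model = model
        self.name = name
        self.children: List["Node"] = list(children)
        self.parent: Optional["Node"] = None
        for child in self.children:
            child.parent = self         # allows traversing up from a leaf


def leaves_under(node: Node) -> List[Node]:
    # Traverse down from a node to all leaf members connected to it.
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves_under(child)]


def cohort_set(leaf: Node, levels_up: int) -> List[Node]:
    node = leaf
    for _ in range(levels_up):          # traverse up to the desired layer
        if node.parent is None:
            break
        node = node.parent
    return leaves_under(node)


def verify(claimed: Node, test_model: float, distance: Callable, levels_up: int = 1) -> bool:
    cohort = cohort_set(claimed, levels_up)
    closest = min(cohort, key=lambda n: distance(n.model, test_model))
    return closest is claimed           # accept only if the claimed model wins


a, b, c, d = Node(10.0, name="a"), Node(12.0, name="b"), Node(50.0, name="c"), Node(53.0, name="d")
root = Node(31.25, [Node(11.0, [a, b]), Node(51.5, [c, d])])
dist = lambda x, y: abs(x - y)
```

For example, `cohort_set(a, 1)` yields the leaves `a` and `b`, while `cohort_set(a, 2)` yields all four leaves; a higher desired layer gives a larger cohort.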
generating a single cumulative complementary model (CCM) by merging complementary speaker models, said complementary speaker models being outside said cohort set; and
rejecting said claimant speaker if said test model is closer in distance to said CCM than to said particular model.
-
-
12. The method of claim 11, wherein said complementary speaker models include a background model derived from speech data of speakers outside said tree.
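The CCM rejection test of claims 11 and 12 can be sketched as follows. Folding the complementary models together pairwise with `functools.reduce` is an assumption (it weights later models more heavily than a joint merge would), as are the scalar models, the averaging merge, and the name `reject_by_ccm`. A background model, where one is used, is simply appended to the list of complementary models.

```python
from functools import reduce
from typing import Callable, Sequence


def reject_by_ccm(claimed_model: float,
                  complementary_models: Sequence[float],
                  test_model: float,
                  distance: Callable[[float, float], float],
                  merge: Callable[[float, float], float]) -> bool:
    # Merge every model outside the cohort set into a single CCM
    # (pairwise folding is an assumption, not the patent's stated procedure).
    ccm = reduce(merge, complementary_models)
    # Reject the claimant if the test model is closer to the CCM
    # than to the claimed speaker's own model.
    return distance(test_model, ccm) < distance(test_model, claimed_model)


dist = lambda x, y: abs(x - y)
avg = lambda x, y: (x + y) / 2
outside_cohort = [50.0, 53.0]           # hypothetical models outside the cohort
background = 90.0                       # hypothetical background model (claim 12)
rejected = reject_by_ccm(10.0, outside_cohort + [background], 60.0, dist, avg)
```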
-
13. The method of claim 10, further comprising:
-
generating a plurality of complementary speaker models, each being a sibling speaker model of an ancestor of said particular speaker model; and
rejecting said claimant speaker if said test model is closer in distance to any one of said complementary speaker models than to said particular speaker model.
-
-
14. The method of claim 13, further comprising providing a background speaker model derived from speakers outside said tree, and rejecting said claimant speaker if said test model is closer in distance to said background speaker model than to said particular speaker model.
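The complementary models of claims 13 and 14 can be collected by walking up from the claimed leaf and gathering the siblings of each ancestor. The sketch below assumes parent pointers, scalar models, and an optional `background` model; the names `ancestor_siblings` and `reject` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


class Node:
    def __init__(self, model: float, children: Sequence["Node"] = (), name: str = ""):
        self.model = model
        self.name = name
        self.children: List["Node"] = list(children)
        self.parent: Optional["Node"] = None
        for child in self.children:
            child.parent = self


def ancestor_siblings(leaf: Node) -> List[Node]:
    # Collect the siblings of every ancestor (parent, grandparent, ...) of the leaf.
    found: List[Node] = []
    node = leaf.parent
    while node is not None and node.parent is not None:
        found.extend(s for s in node.parent.children if s is not node)
        node = node.parent
    return found


def reject(claimed: Node, test_model: float, distance: Callable,
           background: Optional[float] = None) -> bool:
    models = [n.model for n in ancestor_siblings(claimed)]
    if background is not None:
        models.append(background)       # claim 14: extra background model
    own = distance(test_model, claimed.model)
    # Reject if any complementary model is closer than the claimed model.
    return any(distance(test_model, m) < own for m in models)


a, b, c, d = Node(10.0, name="a"), Node(12.0, name="b"), Node(50.0, name="c"), Node(53.0, name="d")
p1, p2 = Node(11.0, [a, b], "ab"), Node(51.5, [c, d], "cd")
root = Node(31.25, [p1, p2], "root")
dist = lambda x, y: abs(x - y)
```

Unlike a single merged model, this variant keeps one complementary model per branch, so a test sample is rejected as soon as it resembles any neighbouring branch more than the claimed speaker.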
-
15. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to provide method steps for performing speaker verification, said method steps comprising:
-
generating, for each of a plurality of registered speakers in a system, a speaker model containing a collection of distributions of audio feature data associated with that speaker;
merging similar speaker models on a layer by layer basis so as to generate a hierarchical speaker model tree, wherein a lowest layer of the hierarchical speaker model tree are leaf members comprising the speaker models of said registered speakers; and
performing speaker verification using the hierarchical speaker model tree to verify that a person is a registered speaker in the system, wherein the step of performing speaker verification comprises:
receiving a claimed identification from the person corresponding to a particular one of said speaker models of the lowest layer of the hierarchical speaker model tree;
determining a cohort set of similar speaker models associated with said particular speaker model using the hierarchical speaker model tree, wherein the step of determining a cohort set comprises the steps of matching the claimed identification with a leaf member of the hierarchical speaker model tree, traversing up the hierarchical speaker model tree from said leaf member to a parent node in a desired layer, and traversing down the hierarchical speaker model tree from the parent node to all leaf members connected to the parent node, wherein all leaf members connected to the parent node comprise the cohort set;
receiving a new speech sample from the person and generating a test speaker model therefrom;
verifying that the person is a registered speaker if said particular speaker model is the closest model of said cohort set to said test model.
generating a single cumulative complementary model (CCM) by merging complementary speaker models, said complementary speaker models being outside said cohort set; and
rejecting said claimant speaker if said test model is closer in distance to said CCM than to said particular model.
-
-
17. The program storage device of claim 15, wherein said method steps further comprise:
-
generating a plurality of complementary speaker models, each being a sibling speaker model of an ancestor of said particular speaker model; and
rejecting said claimant speaker if said test model is closer in distance to any one of said complementary speaker models than to said particular speaker model.
-
-
18. The program storage device of claim 15, wherein said step of merging similar models comprises:
merging a first speaker model with a second speaker model which is close in distance to said first speaker model to form a parent speaker model, by establishing distribution pairs between the first and second speaker models and forming a merged distribution from each distribution pair, whereby the parent speaker model contains a plurality of merged distributions.
-
19. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine, to perform method steps for performing speaker identification, the method steps comprising:
-
generating a speaker model for each of a plurality of speakers, wherein each speaker model comprises a collection of distributions of audio feature data associated with a speaker;
merging similar speaker models on a layer by layer basis so as to generate a hierarchical speaker model tree, wherein a lowest layer of the hierarchical speaker model tree comprises each generated speaker model; and
performing speaker identification of an unknown speaker using the hierarchical speaker model tree to determine if the unknown speaker is one of the plurality of speakers, wherein the step of performing speaker identification comprises:
i) receiving a new speaker sample from the unknown speaker and generating a test speaker model therefrom;
ii) comparing said test model with merged speaker models within a higher layer, i+j, of said tree to determine which of the merged speaker models is closest to said test model;
iii) comparing said test model with children of said closest merged speaker model in a next lower layer, i+j−1, to determine which child is closest to said test model;
repeating step (iii) on a layer by layer basis, including the lowest layer of the tree, whereby said unknown speaker is identified as the speaker corresponding to the closest speaker model in said lowest layer.
-
Specification