Acoustic signature building for a speaker from multiple sessions
First Claim
1. A method of blind diarization comprising:
receiving audio data at a communication interface of a computing system on a frame by frame basis, representing segments of the audio data according to respective feature vectors;
clustering respective segments of the audio data according to the respective feature vectors, such that agglomerative clusters of similar feature vectors are gathered as super segments of the audio data;
building respective voiceprint models for speakers from the super segments according to a size of respective agglomerative clusters;
creating a background model from a first diagonal Gaussian distribution that includes all segments associated with those feature vectors not representing a speaker;
wherein building respective voiceprint models comprises:
training a respective diagonal Gaussian distribution for each of the agglomerative clusters of super segments;
assigning a weighting value to each respective diagonal Gaussian distribution, wherein the weighting value is proportional to a total number of super-segments in the agglomerative cluster composing the respective diagonal Gaussian distribution;
merging the respective diagonal Gaussian distributions, wherein the respective diagonal Gaussian distributions are included in a merged Gaussian distribution according to the respective weighting values;
utilizing the respectively merged Gaussian distributions as respective voiceprint models and using the respective voiceprint models and the background model to label the segments of audio data with an identification of one of the speakers or a different identification as background data;
iteratively refining each of the respective voiceprint models on an audio segment by audio segment basis by calculating a log likelihood of a presence of the respective segments as fitting within either the background model or within one of the respective voiceprint models;
within each iteration, reassigning the segments of the audio data as fitting either one of the respective voiceprint models or the background model and repeating the step of utilizing the respective voiceprint models and the background model to label the segments;
verifying each of the respective voiceprint models when a comparison to sample agent models stored in a memory indicates a match at a threshold quality; and
decoding the segments identified as a speaker segment in accordance with one of the respective voiceprint models.
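The voiceprint-building steps recited above (training one diagonal Gaussian per agglomerative cluster, weighting each by cluster size, and merging them into a single model) can be sketched as a weighted diagonal-Gaussian mixture. This is an illustrative reconstruction, not the patented implementation; the array shapes, variance floor, and function names are assumptions.

```python
import numpy as np

def train_diagonal_gaussian(frames):
    """Fit a diagonal-covariance Gaussian to a cluster's feature vectors."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # floor keeps the covariance invertible
    return mean, var

def build_voiceprint(clusters):
    """Merge per-cluster diagonal Gaussians into one weighted mixture.

    Each cluster is an (n_i, d) array of feature vectors; the mixture
    weight of each component is proportional to n_i, i.e. the number of
    super-segments composing that agglomerative cluster.
    """
    total = sum(len(c) for c in clusters)
    components = []
    for frames in clusters:
        mean, var = train_diagonal_gaussian(frames)
        components.append((len(frames) / total, mean, var))
    return components

def log_likelihood(x, model):
    """Log-likelihood of a single feature vector under the merged model."""
    terms = []
    for w, mean, var in model:
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        terms.append(np.log(w) + ll)
    return np.logaddexp.reduce(terms)
```

The same `log_likelihood` scoring can be applied against a background model built from the non-speaker segments, so each segment is labeled with whichever model scores highest.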
Abstract
Disclosed herein are methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models, wherein the first-pass blind diarization is on a per-frame basis and the second-pass blind diarization is on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.
8 Claims
Independent claim 1 is recited above; dependent claims 2, 3, 4, 5, 6, 7, and 8 depend from claim 1.
Specification