Blind diarization of recorded calls with arbitrary number of speakers
First Claim
1. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each of the plurality of utterances comprises one or more of the plurality of frames;
extracting at least one acoustic feature from each of the plurality of frames, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC);
representing each utterance as an utterance model representative of the MFCC;
approximating a distribution of the MFCC in each utterance by calculating at least one Gaussian mixture model (GMM) for each utterance;
calculating a distance between each GMM;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the utterance models;
constructing a plurality of speaker models from the clustered utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
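The distance and affinity steps of the claim can be sketched as follows. This is an illustrative reduction, not the patented implementation: each utterance's GMM is collapsed to a single diagonal-covariance Gaussian, models are compared with a symmetric Kullback-Leibler divergence, and distances are turned into affinities with a Gaussian kernel whose bandwidth `sigma` is an assumed free parameter.

```python
import numpy as np

def fit_diag_gaussian(frames):
    """Fit one diagonal-covariance Gaussian to an utterance's feature
    frames (a single-component stand-in for the per-utterance GMM)."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # variance floor keeps the KL terms finite
    return mu, var

def sym_kl(g1, g2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    mu1, v1 = g1
    mu2, v2 = g2
    kl12 = 0.5 * np.sum(v1 / v2 + (mu2 - mu1) ** 2 / v2 - 1 + np.log(v2 / v1))
    kl21 = 0.5 * np.sum(v2 / v1 + (mu1 - mu2) ** 2 / v1 - 1 + np.log(v1 / v2))
    return kl12 + kl21

def affinity_matrix(utterances, sigma=1.0):
    """Convert pairwise model distances into affinities with a Gaussian
    kernel; sigma is an illustrative bandwidth, not fixed by the claim."""
    models = [fit_diag_gaussian(u) for u in utterances]
    n = len(models)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = np.exp(-sym_kl(models[i], models[j]) / (2 * sigma ** 2))
    return A

# two synthetic "speakers": utterances 0-1 and 2-3 share a feature mean
rng = np.random.default_rng(0)
utts = [rng.normal(0.0, 1.0, (50, 13)), rng.normal(0.0, 1.0, (50, 13)),
        rng.normal(5.0, 1.0, (50, 13)), rng.normal(5.0, 1.0, (50, 13))]
A = affinity_matrix(utts)
```

Same-speaker pairs receive affinities close to 1 while cross-speaker pairs fall toward 0, which is the block structure the later eigen-decomposition exploits.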
Abstract
In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
13 Claims
1. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each of the plurality of utterances comprises one or more of the plurality of frames;
extracting at least one acoustic feature from each of the plurality of frames, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC);
representing each utterance as an utterance model representative of the MFCC;
approximating a distribution of the MFCC in each utterance by calculating at least one Gaussian mixture model (GMM) for each utterance;
calculating a distance between each GMM;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the utterance models;
constructing a plurality of speaker models from the clustered utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)
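The clustering step can be sketched with a minimal k-means over the embedded utterance vectors. The claim does not prescribe a clustering algorithm; k-means with deterministic farthest-point initialisation is one simple, assumed choice.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means: one illustrative way to cluster the embedded
    utterance vectors (the claim does not fix the algorithm)."""
    # farthest-point initialisation keeps the sketch deterministic
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each vector to its nearest center, then recompute centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# synthetic embeddings: two well-separated groups of utterance vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 3)), rng.normal(1.0, 0.1, (10, 3))])
labels = kmeans(X, 2)
```

Each cluster of utterance models then yields one speaker model, from which the hidden Markov model of the final steps is assembled.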
9. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each utterance of the plurality comprises more than one frame of the plurality of frames;
representing each utterance as an utterance model representative of a plurality of feature vectors of each utterance;
projecting the utterance models onto a lower dimensional space to create a plurality of projected utterance models, wherein in the projected utterance models, a distance between utterances is a defined metric;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the projected utterance models;
constructing a plurality of speaker models from the clustered projected utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
(Dependent claims: 10, 11, 12)
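The projection steps of this claim (stochastic matrix, eigenvectors, embedding) can be sketched as a standard spectral embedding. This is one assumed recipe: the affinity matrix is row-normalised into a stochastic matrix and each utterance is embedded using the leading eigenvectors.

```python
import numpy as np

def spectral_embed(A, dim):
    """Embed utterances via the leading eigenvectors of the stochastic
    matrix derived from the affinity matrix (one illustrative choice of
    normalisation; the claim does not fix the exact construction)."""
    P = A / A.sum(axis=1, keepdims=True)   # stochastic: each row sums to 1
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)        # largest eigenvalues first
    return evecs[:, order[:dim]].real      # rows = embedded utterance vectors

# block-structured affinity: utterances 0-1 vs 2-3 behave as two "speakers"
A = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
Y = spectral_embed(A, 2)
```

The top eigenvector of a stochastic matrix is constant; the second splits the two affinity blocks, so the sign of the second embedding coordinate already separates the speakers in this toy case.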
13. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each utterance of the plurality comprises more than one frame of the plurality of frames;
extracting at least one acoustic feature from each of the plurality of frames;
representing each utterance as an utterance model representative of the extracted acoustic features of the plurality of frames of each utterance;
approximating a distribution of the extracted acoustic features of each utterance by calculating at least one Gaussian mixture model for each utterance;
calculating a distance between each of the Gaussian mixture models;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the utterance models based upon the calculated distances;
constructing a plurality of speaker models from the clustered utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
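The final decoding step, shared by all three independent claims, is Viterbi decoding over the hidden Markov model of speaker models. The sketch below assumes per-utterance log-likelihoods under each speaker model and illustrative "sticky" transition probabilities that favour staying with the current speaker; all numbers are synthetic.

```python
import numpy as np

def viterbi(log_lik, log_trans, log_init):
    """Decode the speaker-model sequence that best corresponds to the
    utterances, given log-likelihoods of each utterance under each model."""
    n_utt, n_spk = log_lik.shape
    delta = log_init + log_lik[0]          # best score ending in each state
    back = np.zeros((n_utt, n_spk), dtype=int)
    for t in range(1, n_utt):
        scores = delta[:, None] + log_trans  # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    # trace the best path backwards through the stored predecessors
    path = [int(delta.argmax())]
    for t in range(n_utt - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy setup: 2 speaker models, 6 utterances (all values illustrative)
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))
log_lik = np.log(np.array([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
                           [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]]))
path = viterbi(log_lik, log_trans, log_init)  # [0, 0, 0, 1, 1, 1]
```

The decoded sequence assigns the first three utterances to speaker model 0 and the last three to speaker model 1, which is the labelling used to create the diarized audio data.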
Specification