Blind diarization of recorded calls with arbitrary number of speakers
First Claim
1. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each of the plurality of utterances comprises one or more of the plurality of frames;
extracting at least one acoustic feature from each of the plurality of frames, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC);
representing each utterance as an utterance model representative of the MFCC;
approximating a distribution of the MFCC in each utterance by calculating at least one Gaussian mixture model (GMM) for each utterance;
calculating a distance between each GMM;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the utterance models;
constructing a plurality of speaker models from the clustered utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
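The distance and affinity steps of the claim can be sketched as follows. This is an illustrative reduction, not the patented implementation: each utterance's GMM is collapsed to a single diagonal-covariance Gaussian, models are compared with a symmetric Kullback-Leibler divergence, and distances are turned into affinities with a Gaussian kernel whose bandwidth `sigma` is an assumed free parameter.

```python
import numpy as np

def fit_diag_gaussian(frames):
    """Fit one diagonal-covariance Gaussian to an utterance's feature
    frames (a single-component stand-in for the per-utterance GMM)."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # variance floor keeps the KL terms finite
    return mu, var

def sym_kl(g1, g2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    mu1, v1 = g1
    mu2, v2 = g2
    kl12 = 0.5 * np.sum(v1 / v2 + (mu2 - mu1) ** 2 / v2 - 1 + np.log(v2 / v1))
    kl21 = 0.5 * np.sum(v2 / v1 + (mu1 - mu2) ** 2 / v1 - 1 + np.log(v1 / v2))
    return kl12 + kl21

def affinity_matrix(utterances, sigma=1.0):
    """Convert pairwise model distances into affinities with a Gaussian
    kernel; sigma is an illustrative bandwidth, not fixed by the claim."""
    models = [fit_diag_gaussian(u) for u in utterances]
    n = len(models)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = np.exp(-sym_kl(models[i], models[j]) / (2 * sigma ** 2))
    return A

# two synthetic "speakers": utterances 0-1 and 2-3 share a feature mean
rng = np.random.default_rng(0)
utts = [rng.normal(0.0, 1.0, (50, 13)), rng.normal(0.0, 1.0, (50, 13)),
        rng.normal(5.0, 1.0, (50, 13)), rng.normal(5.0, 1.0, (50, 13))]
A = affinity_matrix(utts)
```

Same-speaker pairs receive affinities close to 1 while cross-speaker pairs fall toward 0, which is the block structure the later eigen-decomposition exploits.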
Abstract
In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
13 Claims
1. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each of the plurality of utterances comprises one or more of the plurality of frames;
extracting at least one acoustic feature from each of the plurality of frames, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC);
representing each utterance as an utterance model representative of the MFCC;
approximating a distribution of the MFCC in each utterance by calculating at least one Gaussian mixture model (GMM) for each utterance;
calculating a distance between each GMM;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the utterance models;
constructing a plurality of speaker models from the clustered utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)
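The clustering step can be sketched with a minimal k-means over the embedded utterance vectors. The claim does not prescribe a clustering algorithm; k-means with deterministic farthest-point initialisation is one simple, assumed choice.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means: one illustrative way to cluster the embedded
    utterance vectors (the claim does not fix the algorithm)."""
    # farthest-point initialisation keeps the sketch deterministic
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each vector to its nearest center, then recompute centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# synthetic embeddings: two well-separated groups of utterance vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 3)), rng.normal(1.0, 0.1, (10, 3))])
labels = kmeans(X, 2)
```

Each cluster of utterance models then yields one speaker model, from which the hidden Markov model of the final steps is assembled.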
9. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each utterance of the plurality comprises more than one frame of the plurality of frames;
representing each utterance as an utterance model representative of a plurality of feature vectors of each utterance;
projecting the utterance models onto a lower dimensional space to create a plurality of projected utterance models, wherein in the projected utterance models, a distance between utterances is a defined metric;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the projected utterance models;
constructing a plurality of speaker models from the clustered projected utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
(Dependent claims: 10, 11, 12)
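The projection steps of this claim (stochastic matrix, eigenvectors, embedding) can be sketched as a standard spectral embedding. This is one assumed recipe: the affinity matrix is row-normalised into a stochastic matrix and each utterance is embedded using the leading eigenvectors.

```python
import numpy as np

def spectral_embed(A, dim):
    """Embed utterances via the leading eigenvectors of the stochastic
    matrix derived from the affinity matrix (one illustrative choice of
    normalisation; the claim does not fix the exact construction)."""
    P = A / A.sum(axis=1, keepdims=True)   # stochastic: each row sums to 1
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)        # largest eigenvalues first
    return evecs[:, order[:dim]].real      # rows = embedded utterance vectors

# block-structured affinity: utterances 0-1 vs 2-3 behave as two "speakers"
A = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
Y = spectral_embed(A, 2)
```

The top eigenvector of a stochastic matrix is constant; the second splits the two affinity blocks, so the sign of the second embedding coordinate already separates the speakers in this toy case.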
13. A method of diarization of audio data, the method comprising:
receiving audio data;
segmenting the audio data into a plurality of frames;
segmenting audio data into a plurality of utterances, wherein each utterance of the plurality comprises more than one frame of the plurality of frames;
extracting at least one acoustic feature from each of the plurality of frames;
representing each utterance as an utterance model representative of the extracted acoustic features of the plurality of frames of each utterance;
approximating a distribution of the extracted acoustic features of each utterance by calculating at least one Gaussian mixture model for each utterance;
calculating a distance between each of the Gaussian mixture models;
constructing an affinity matrix based upon the distances between utterances;
computing a stochastic matrix from the affinity matrix;
computing eigenvalues and corresponding eigenvectors for the stochastic matrix;
embedding the utterances into multi-dimensional vectors, wherein the utterance models comprise the multi-dimensional vectors;
clustering the utterance models based upon the calculated distances;
constructing a plurality of speaker models from the clustered utterance models;
constructing a hidden Markov model of the plurality of speaker models;
decoding a sequence of identified speaker models that best corresponds to the utterances of the audio data; and
creating diarized audio data using the sequence of identified speaker models that best correspond to the utterances of the audio data.
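The final decoding step, shared by all three independent claims, is Viterbi decoding over the hidden Markov model of speaker models. The sketch below assumes per-utterance log-likelihoods under each speaker model and illustrative "sticky" transition probabilities that favour staying with the current speaker; all numbers are synthetic.

```python
import numpy as np

def viterbi(log_lik, log_trans, log_init):
    """Decode the speaker-model sequence that best corresponds to the
    utterances, given log-likelihoods of each utterance under each model."""
    n_utt, n_spk = log_lik.shape
    delta = log_init + log_lik[0]          # best score ending in each state
    back = np.zeros((n_utt, n_spk), dtype=int)
    for t in range(1, n_utt):
        scores = delta[:, None] + log_trans  # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    # trace the best path backwards through the stored predecessors
    path = [int(delta.argmax())]
    for t in range(n_utt - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy setup: 2 speaker models, 6 utterances (all values illustrative)
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))
log_lik = np.log(np.array([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
                           [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]]))
path = viterbi(log_lik, log_trans, log_init)  # [0, 0, 0, 1, 1, 1]
```

The decoded sequence assigns the first three utterances to speaker model 0 and the last three to speaker model 1, which is the labelling used to create the diarized audio data.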
Specification