Acoustic signature building for a speaker from multiple sessions
First Claim
1. A method of blind diarization comprising:
receiving audio data at a communication interface of a computing system on a frame by frame basis, representing segments of the audio data according to respective feature vectors;
clustering respective segments of the audio data according to the respective feature vectors, such that agglomerative clusters of similar feature vectors are gathered as super segments of the audio data;
building respective voiceprint models for speakers from the super segments according to a size of respective agglomerative clusters;
creating a background model from a first diagonal Gaussian distribution that includes all segments associated with those feature vectors not representing a speaker;
wherein building respective voiceprint models comprises:
training a respective diagonal Gaussian distribution for each of the agglomerative clusters of super segments;
assigning a weighting value to each respective diagonal Gaussian distribution, wherein the weighting value is proportional to a total number of super-segments in the agglomerative cluster composing the respective diagonal Gaussian distribution;
merging the respective diagonal Gaussian distributions, wherein the respective diagonal Gaussian distributions are included in a merged Gaussian distribution according to the respective weighting values;
utilizing the respectively merged Gaussian distributions as respective voiceprint models and using the respective voiceprint models and the background model to label the segments of audio data with an identification of one of the speakers or a different identification as background data;
iteratively refining each of the respective voiceprint models on an audio segment by audio segment basis by calculating a log likelihood of a presence of the respective segments as fitting within either the background model or within one of the respective voiceprint models;
within each iteration, reassigning the segments of the audio data as fitting either one of the respective voiceprint models or the background model and repeating the step of utilizing the respective voiceprint models and the background model to label the segments;
verifying each of the respective voiceprint models when a comparison to sample agent models stored in a memory indicates a match at a threshold quality; and
decoding the segments identified as a speaker segment in accordance with one of the respective voiceprint models.
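The voiceprint-building steps recited above (training one diagonal Gaussian per agglomerative cluster, weighting each by cluster size, and merging them into a single model) can be sketched as a weighted diagonal-Gaussian mixture. This is an illustrative reconstruction, not the patented implementation; the array shapes, variance floor, and function names are assumptions.

```python
import numpy as np

def train_diagonal_gaussian(frames):
    """Fit a diagonal-covariance Gaussian to a cluster's feature vectors."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # floor keeps the covariance invertible
    return mean, var

def build_voiceprint(clusters):
    """Merge per-cluster diagonal Gaussians into one weighted mixture.

    Each cluster is an (n_i, d) array of feature vectors; the mixture
    weight of each component is proportional to n_i, i.e. the number of
    super-segments composing that agglomerative cluster.
    """
    total = sum(len(c) for c in clusters)
    components = []
    for frames in clusters:
        mean, var = train_diagonal_gaussian(frames)
        components.append((len(frames) / total, mean, var))
    return components

def log_likelihood(x, model):
    """Log-likelihood of a single feature vector under the merged model."""
    terms = []
    for w, mean, var in model:
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        terms.append(np.log(w) + ll)
    return np.logaddexp.reduce(terms)
```

The same `log_likelihood` scoring can be applied against a background model built from the non-speaker segments, so each segment is labeled with whichever model scores highest.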
Abstract
Disclosed herein are methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models, wherein the first-pass blind diarization is on a per-frame basis and the second-pass blind diarization is on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.
8 Claims
Independent claim 1 is recited above; dependent claims 2, 3, 4, 5, 6, 7, and 8 depend from claim 1.
Specification