
Acoustic signature building for a speaker from multiple sessions

  • US 9,875,743 B2
  • Filed: 01/26/2016
  • Issued: 01/23/2018
  • Est. Priority Date: 01/26/2015
  • Status: Active Grant
First Claim

1. A method of blind diarization comprising:

    receiving audio data at a communication interface of a computing system on a frame by frame basis;

    representing segments of the audio data according to respective feature vectors;

    clustering respective segments of the audio data according to the respective feature vectors, such that agglomerative clusters of similar feature vectors are gathered as super segments of the audio data;

    building respective voiceprint models for speakers from the super segments according to a size of respective agglomerative clusters;

    creating a background model from a first diagonal Gaussian distribution that includes all segments associated with those feature vectors not representing a speaker;

    wherein building respective voiceprint models comprises:

    training a respective diagonal Gaussian distribution for each of the agglomerative clusters of super segments;

    assigning a weighting value to each respective diagonal Gaussian distribution, wherein the weighting value is proportional to a total number of super segments in the agglomerative cluster composing the respective diagonal Gaussian distribution;

    merging the respective diagonal Gaussian distributions, wherein the respective diagonal Gaussian distributions are included in a merged Gaussian distribution according to the respective weighting values;

    utilizing the respectively merged Gaussian distributions as respective voiceprint models and using the respective voiceprint models and the background model to label the segments of audio data with an identification of one of the speakers or a different identification as background data;

    iteratively refining each of the respective voiceprint models on an audio segment by audio segment basis by calculating a log likelihood of a presence of the respective segments as fitting within either the background model or within one of the respective voiceprint models;

    within each iteration, reassigning the segments of the audio data as fitting either one of the respective voiceprint models or the background model and repeating the step of utilizing the respective voiceprint models and the background model to label the segments;

    verifying each of the respective voiceprint models when a comparison to sample agent models stored in a memory indicates a match at a threshold quality; and

    decoding the segments identified as a speaker segment in accordance with one of the respective voiceprint models.
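To make the claimed flow concrete, here is a minimal sketch in Python of the model-building steps: segment-level feature vectors are gathered into agglomerative clusters ("super segments"), a diagonal Gaussian is trained per cluster with a weight proportional to cluster size, and a background model is fit to the segments whose features do not represent a speaker. This is an illustration of the general technique, not the patented implementation; the per-segment feature arrays, the Ward-linkage distance threshold, and names such as `build_models` and `diag_gaussian` are all assumptions.

```python
# Sketch of the claim's model-building steps; not the patented implementation.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import multivariate_normal


def diag_gaussian(segments, floor=1e-6):
    """Fit a diagonal Gaussian (per-dimension mean and variance) to stacked frames."""
    frames = np.vstack(segments)
    return frames.mean(axis=0), frames.var(axis=0) + floor  # variance floor


def log_likelihood(segment, mean, var):
    """Mean per-frame log-likelihood of a segment under a diagonal Gaussian."""
    return multivariate_normal.logpdf(segment, mean=mean, cov=np.diag(var)).mean()


def build_models(features, speech_mask, distance_threshold=25.0):
    """features: list of (n_frames, n_dims) arrays, one per audio segment.
    speech_mask: True for segments whose feature vectors represent a speaker."""
    speech_idx = [i for i, is_speech in enumerate(speech_mask) if is_speech]

    # Agglomerative clustering of segment-level vectors gathers similar
    # segments into clusters; the threshold value here is an assumption.
    seg_vecs = np.vstack([features[i].mean(axis=0) for i in speech_idx])
    labels = fcluster(linkage(seg_vecs, method="ward"),
                      t=distance_threshold, criterion="distance")

    # One diagonal Gaussian per cluster, with a weight proportional to the
    # cluster's size. In the claim, same-speaker Gaussians are merged into a
    # single voiceprint according to these weights; in this simplified sketch
    # each cluster simply becomes one weighted voiceprint.
    voiceprints = []
    for c in np.unique(labels):
        members = [features[speech_idx[j]] for j in np.where(labels == c)[0]]
        mean, var = diag_gaussian(members)
        voiceprints.append({"weight": len(members), "mean": mean, "var": var})

    # Background model: a single diagonal Gaussian over every segment whose
    # feature vectors do not represent a speaker.
    background = diag_gaussian(
        [features[i] for i, is_speech in enumerate(speech_mask) if not is_speech])
    return voiceprints, background
```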
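Continuing the sketch (and reusing `diag_gaussian` and `log_likelihood` from above), the iterative-refinement step scores every segment against each voiceprint and the background model by log likelihood, reassigns it to whichever fits best, and refits the models from the new assignments. The fixed iteration count is an assumption; the claim's final steps, verifying the refined voiceprints against stored agent models and decoding the matched speaker segments, are omitted here.

```python
# Continues the sketch above: assumes diag_gaussian and log_likelihood are in scope.
import numpy as np

BACKGROUND = -1  # label for segments assigned to the background model


def refine(features, voiceprints, background, n_iters=5):
    """Iteratively relabel each segment by log-likelihood, then refit the models."""
    assign = [BACKGROUND] * len(features)
    for _ in range(n_iters):
        # Label every segment with the speaker whose voiceprint fits best,
        # unless the background model fits better.
        for i, seg in enumerate(features):
            scores = [log_likelihood(seg, vp["mean"], vp["var"]) for vp in voiceprints]
            best = int(np.argmax(scores))
            if scores[best] > log_likelihood(seg, *background):
                assign[i] = best
            else:
                assign[i] = BACKGROUND

        # Refit each voiceprint from the segments now assigned to it.
        for k in range(len(voiceprints)):
            members = [features[i] for i, a in enumerate(assign) if a == k]
            if members:
                mean, var = diag_gaussian(members)
                voiceprints[k] = {"weight": len(members), "mean": mean, "var": var}
    return assign, voiceprints
```

Scoring by the mean per-frame log-likelihood keeps scores comparable across segments of different lengths, which is one simple way to realize the claim's segment-by-segment comparison against the voiceprint and background models.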
