Word-level blind diarization of recorded calls with arbitrary number of speakers

  • US 9,875,742 B2
  • Filed: 01/26/2016
  • Issued: 01/23/2018
  • Est. Priority Date: 01/26/2015
  • Status: Active Grant
First Claim

1. A method of creating an acoustic signature for a speaker from multiple audio sessions and for performing diarization, the method comprising:

  • receiving, from an audio data source, audio data at an audio communications interface of a computing system, the audio data defining a training set containing a number of recorded audio sessions, wherein the computing system is configured to construct, from each audio session, a plurality of respective speaker models, wherein each speaker model is characterized by aggregating acoustic features into respective feature vectors that define a respective occupancy which is proportional to a total number of feature vectors used to construct the speaker model, and wherein the speaker models are Gaussian mixture models (GMMs) defined over a common set of Gaussian distributions that differ only by respective mixture probabilities for the acoustic features present in the feature vectors;

    classifying the plurality of speaker models to identify a set of common speaker GMMs and a set of generic speaker GMMs, wherein the classifying includes constructing an undirected similarity graph having vertices corresponding to the plurality of respective speaker models of all the recorded audio sessions in the training set and classifying the plurality of speaker models according to a degree of similarity between the corresponding vertices in the undirected similarity graph in relation to at least one threshold degree of similarity;

    generating an acoustic signature by at least:

    constructing a super-GMM for the set of common speaker GMMs, and constructing a second super-GMM for the set of generic speaker GMMs by generating a set of random vectors and training a second GMM over these random vectors, wherein a respective acoustic signature for a common speaker is given as a super-model pair of the two constructed super-GMMs;

    storing the two constructed super-GMMs in a computing system memory;

    receiving additional audio data at the audio communications interface;

    identifying the common speaker using the super-model pair; and

    labeling the additional audio data with an identified common speaker label.
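
The classifying step of the claim can be illustrated with a minimal sketch. It is not the patented implementation and rests on several assumptions that the claim does not state. Each speaker model is reduced to its mixture-weight vector over the common Gaussians. Similarity is measured with the Bhattacharyya coefficient. The largest connected component of the thresholded graph is taken as the common speaker set. The common super-model weights are pooled as an occupancy-weighted average. All function names, the toy data, and the threshold value are hypothetical, and the random-vector training of the second (generic) super-GMM is not shown.

```python
import math
from collections import deque

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two mixture-weight vectors (1.0 = identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def build_similarity_graph(weights, threshold):
    """Undirected graph: one vertex per speaker model, edge when similarity >= threshold."""
    adj = {i: set() for i in range(len(weights))}
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            if bhattacharyya(weights[i], weights[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def largest_component(adj):
    """Vertices of the largest connected component (assumed here to be the common speaker)."""
    seen, best = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        if len(comp) > len(best):
            best = comp
    return sorted(best)

def pool_weights(weights, occupancies, members):
    """Occupancy-weighted average of the members' mixture weights (an assumed pooling rule)."""
    total = sum(occupancies[i] for i in members)
    return [
        sum(occupancies[i] * weights[i][c] for i in members) / total
        for c in range(len(weights[0]))
    ]

# Toy training set: two speaker models per session, three sessions,
# each model a weight vector over three shared Gaussians (hypothetical values).
weights = [
    [0.70, 0.20, 0.10],  # session 1, model 0
    [0.10, 0.30, 0.60],  # session 1, model 1
    [0.65, 0.25, 0.10],  # session 2, model 2
    [0.30, 0.60, 0.10],  # session 2, model 3
    [0.72, 0.18, 0.10],  # session 3, model 4
    [0.05, 0.05, 0.90],  # session 3, model 5
]
occupancies = [100, 200, 100, 150, 200, 120]  # feature-vector counts per model

adj = build_similarity_graph(weights, threshold=0.99)
common = largest_component(adj)                # models attributed to the common speaker
generic = [i for i in adj if i not in common]  # remaining models
common_super_weights = pool_weights(weights, occupancies, common)
print(common, generic, common_super_weights)
```

In this toy data, models 0, 2, and 4 recur across all three sessions with similar weights, so they end up linked in the graph and are grouped as the common speaker; the other models have no edges at the chosen threshold and fall into the generic set.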
