Unsupervised speaker clustering for automatic speaker indexing of recorded audio data
Abstract
A system and method for unsupervised clustering of audio data segments in an audio data recording containing speech from multiple speakers. The method includes the steps of: 1) providing a portion of the audio data containing speech from all of the speakers; 2) forming initial clusters by dividing the portion of the audio data into segments, each of which includes an ordered data set; 3) computing the pairwise distance between each pair of clusters using a likelihood ratio independent of the order of data within the segments; and 4) combining into a new cluster the two clusters with a minimum pairwise distance. These steps are repeated until a number of clusters equal to the number of speakers is obtained.
152 Citations
14 Claims
1. A method for unsupervised clustering of audio data segments in an audio data recording containing speech from multiple speakers to segment the audio data recording by speaker, the method comprising the steps of:
a) providing a portion of said audio data containing speech from at least all the speakers in said audio data;
b) forming initial clusters by dividing said portion of said audio data into segments, each segment including a set of data having an order;
c) computing a pairwise distance between each pair of clusters using a likelihood ratio independent of data order within the segments;
d) combining two clusters with a minimum pairwise distance into a new cluster; and
e) repeating said steps b), c), and d) until a number of clusters equal to a number of the multiple speakers is obtained.
Dependent claims: 2, 3, 4, 5
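The clustering loop of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each cluster is modeled by a single diagonal-covariance Gaussian (claim 8 refers to tied mixtures of Gaussians instead), and the names `gaussian_loglik`, `glr_distance`, `cluster_segments`, and `seg_len` are hypothetical. The distance uses only each cluster's mean and variance, so reordering the frames inside a segment does not change it.

```python
import numpy as np


def gaussian_loglik(frames):
    """Log-likelihood of frames (n, d) under a diagonal Gaussian fit to those frames."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # small floor keeps log() finite
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)


def glr_distance(a, b):
    """Log likelihood ratio of 'two models' vs 'one pooled model'.

    Small values suggest a and b came from the same speaker. Only means
    and variances are used, so frame order within a or b is irrelevant.
    """
    return gaussian_loglik(a) + gaussian_loglik(b) - gaussian_loglik(np.vstack([a, b]))


def cluster_segments(features, n_speakers, seg_len):
    """Agglomerative clustering of equal-length segments (assumed layout)."""
    # Step b): initial clusters are consecutive fixed-length segments.
    clusters = [features[i:i + seg_len] for i in range(0, len(features), seg_len)]
    members = [[i] for i in range(len(clusters))]  # segment indices per cluster
    while len(clusters) > n_speakers:
        # Step c): pairwise distances between all current clusters.
        _, i, j = min(
            (glr_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Step d): merge the closest pair into one cluster.
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        members[i] += members[j]
        del clusters[j], members[j]
    return members
```

Calling `cluster_segments(features, n_speakers=2, seg_len=50)` on a `(frames, dims)` feature array returns, for each final cluster, the indices of the initial segments assigned to it. The exhaustive pairwise search is quadratic in the number of clusters, which is acceptable for a sketch but would likely need caching of distances for long recordings.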
6. A method for segmenting audio data by speaker in an audio recording representing multiple speakers, the method comprising the steps of:
a) providing an audio portion of said audio recording containing speech from at least all the speakers in the audio recording;
b) forming initial clusters by dividing said audio portion into segments, each segment including a set of data having an order;
c) until a number of clusters is obtained that is equal to a number of the multiple speakers:
   1) computing a pairwise distance between each pair of clusters using a likelihood ratio independent of the order of the data within the segments;
   2) combining two clusters with a minimum pairwise distance;
d) for each of said desired number of clusters, training an individual Hidden Markov Model (HMM);
e) combining said individual HMMs in parallel to form a speaker network HMM;
f) determining an optimal path through said speaker network HMM for said audio data, identifying segments of said audio data associated with each individual HMM; and
g) marking each segment according to the individual HMM from which it was created.
Dependent claims: 7
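Steps e) through g) of claim 6 can be pictured with a heavily simplified decoder. The sketch below assumes each speaker's HMM collapses to a single state whose per-frame log-likelihoods are already computed, and that moving from one speaker to another costs a fixed `switch_penalty`. Neither assumption comes from the claim, and the function names are hypothetical. The Viterbi recursion finds the best-scoring path through the parallel network, and consecutive frames on the same branch are then grouped into labeled segments.

```python
import numpy as np


def viterbi_speaker_path(frame_loglik, switch_penalty=-5.0):
    """Best speaker label per frame.

    frame_loglik: array (T, K), log-likelihood of each frame under each
    of K speaker models (single-state models are an assumption here).
    switch_penalty: log-domain cost added when the path changes speaker.
    """
    T, K = frame_loglik.shape
    trans = np.full((K, K), switch_penalty)
    np.fill_diagonal(trans, 0.0)  # staying with the same speaker is free

    score = frame_loglik[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans          # cand[i, j]: from speaker i to j
        back[t] = cand.argmax(axis=0)          # best predecessor for each j
        score = cand.max(axis=0) + frame_loglik[t]

    # Trace the optimal path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def path_to_segments(path):
    """Group consecutive equal labels into (start_frame, end_frame, speaker)."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((start, t, int(path[start])))
            start = t
    return segments
```

The switch penalty plays the role of discouraging spurious speaker changes on isolated noisy frames; in a full system the individual HMMs trained in step d) would supply both the frame scores and the transition structure.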
8. A processor controlled system for estimating speaker segmentation in recorded audio data, the system comprising:
a) an audio source for providing recorded audio data comprising speech from a plurality of individual speakers, wherein a total number of individual speakers in said audio data is not known;
b) an audio processor for receiving said audio data and converting said audio data into spectral feature data;
c) memory for storing data, the data stored in the memory including instruction data indicating instructions;
d) a system processor coupled to the memory for accessing the data stored in the memory executing the instructions, and receiving said spectral feature data from said audio processor and producing estimated speaker models by:
   1) dividing said spectral feature data into segments to form initial clusters of equal arbitrary length, each segment having a set of spectral feature data having an order;
   2) combining said initial clusters into speaker clusters based on a likelihood ratio that segments were generated by the same speaker, said likelihood ratio being independent of the order of the spectral feature data within the segments and being based on tied mixtures of Gaussians;
   3) producing estimated speaker models based on said speaker clusters, each speaker model having an associated identifier;
said system processor further combining said estimated speaker models into a speaker network;
said system processor, using said speaker network, determining segments of said audio data which correspond to different individual speaker models, a number of the segments equaling the total number of individual speakers;
said system processor further determining at the start of each segment a timestamp, said timestamp corresponding to the received time for that segment on said storage medium, said system processor storing said timestamp in said memory; and
said system processor further storing said speaker identifier of said individual speaker model for each segment in said memory in conjunction with said storage medium location address for that segment.
Dependent claims: 9, 10, 11, 12, 13, 14
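The final elements of claim 8, which store a timestamp and a speaker identifier for each segment, amount to building a small index. The sketch below is one possible shape for that index, under stated assumptions: segments arrive as `(start_frame, end_frame, speaker)` tuples, each feature frame covers a fixed period (10 ms here, a value not given in the claim), and identifiers are generated as `speaker_<n>`. The `IndexEntry` type and `build_speaker_index` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    start_seconds: float  # timestamp at the start of the segment
    speaker_id: str       # identifier of the speaker model for the segment


def build_speaker_index(segments, frame_period_s=0.01):
    """Convert (start_frame, end_frame, speaker) tuples into index entries.

    frame_period_s is the assumed time step between feature frames.
    """
    return [
        IndexEntry(start_seconds=start * frame_period_s, speaker_id=f"speaker_{spk}")
        for start, _end, spk in segments
    ]
```

For example, `build_speaker_index([(0, 500, 0), (500, 1200, 1)])` yields one entry at the start of the recording for `speaker_0` and one about five seconds in for `speaker_1`, which is enough to jump playback to the point where a given speaker begins.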
Specification