Blind diarization of recorded calls with arbitrary number of speakers
Abstract
In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
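The abstract describes clustering utterance models by acoustic similarity before building speaker models. As an illustrative sketch only (the patent does not specify a clustering algorithm; the greedy centroid-merging strategy, the function name, and the distance threshold here are assumptions), each utterance can be summarized as a feature vector and agglomeratively merged until no two clusters are closer than a threshold:

```python
import numpy as np

def cluster_utterances(features, threshold=1.0):
    """Greedy agglomerative clustering of utterance feature vectors.

    features: (n_utterances, dim) array; each row summarizes one utterance.
    Returns one cluster label per utterance; each final cluster stands in
    for one (unknown) speaker.
    """
    clusters = [[i] for i in range(len(features))]

    def centroid(c):
        return features[c].mean(axis=0)

    while len(clusters) > 1:
        # Find the closest pair of cluster centroids.
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best:
                    best, pair = d, (a, b)
        if best > threshold:
            break  # remaining clusters are acoustically distinct speakers
        a, b = pair
        clusters[a] += clusters.pop(b)

    labels = [0] * len(features)
    for k, c in enumerate(clusters):
        for i in c:
            labels[i] = k
    return labels
```

With two well-separated groups of feature vectors, e.g. rows near `[0, 0]` and rows near `[5, 5]`, the function returns matching labels within each group and distinct labels across groups.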
Claims (19)
1. A method for obtaining a speaker-identified transcription from audio data of multiple speakers, the method comprising:

obtaining the audio data and an unlabeled transcription of the audio data;
separating the audio data into a sequence of utterances, wherein each utterance has acoustic features;
clustering utterances having similar acoustic features;
generating a hidden Markov model (HMM) from the clustered utterances;
decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers;
determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and
labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

(Dependent claims 2-9.)
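Claim 1's decoding step associates each utterance in sequence with a speaker state of the HMM. A minimal sketch of that step, under assumptions the claim does not fix (Gaussian-like emissions scored as negative squared distance to each speaker's mean feature vector, and "sticky" transitions that favor the current speaker continuing), is standard Viterbi decoding:

```python
import numpy as np

def viterbi_speakers(utt_features, speaker_means, stay_prob=0.9):
    """Assign each utterance in sequence to a speaker via Viterbi decoding.

    States = speakers. Emission log-likelihood is the negative squared
    distance from the utterance's features to the speaker's mean vector;
    transitions favor staying with the current speaker (stay_prob).
    """
    n, k = len(utt_features), len(speaker_means)
    switch = (1.0 - stay_prob) / max(k - 1, 1)
    log_trans = np.full((k, k), np.log(switch))
    np.fill_diagonal(log_trans, np.log(stay_prob))

    # Emission scores: emit[t, j] = -||utterance_t - speaker_j||^2
    emit = -np.array([[np.sum((f - m) ** 2) for m in speaker_means]
                      for f in utt_features])

    score = emit[0].copy()            # uniform prior over speakers
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans          # (from_state, to_state)
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(k)] + emit[t]

    # Backtrack the best state sequence.
    path = [int(np.argmax(score))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

For a sequence of five utterances alternating near two speaker means at 0.0 and 5.0, the decoded path follows the nearer speaker at each step while the sticky transitions discourage spurious single-utterance switches.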
10. A non-transitory computer readable medium containing computer readable instructions that when executed by a processor of a computing device cause the computing device to perform a method comprising:

obtaining the audio data and an unlabeled transcription of the audio data;
separating the audio data into a sequence of utterances, wherein each utterance has acoustic features;
clustering utterances having similar acoustic features;
generating a hidden Markov model (HMM) from the clustered utterances;
decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers;
determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and
labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.
11. A system for obtaining a speaker-identified transcription from audio data of multiple speakers, the system comprising:

a database storing acoustic voiceprint models of known speakers and the audio data of multiple speakers;
a speech-to-text server that generates an unlabeled transcription of the audio data; and
a computing device communicatively coupled to the database and the speech-to-text server, the computing device comprising a processor, wherein the processor is configured by software to:
obtain the audio data and the unlabeled transcription of the audio data;
separate the audio data into a sequence of utterances, wherein each utterance has acoustic features;
cluster utterances having similar acoustic features;
generate a hidden Markov model (HMM) from the clustered utterances;
decode the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers;
determine the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to the acoustic voiceprint models of known speakers; and
label portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

(Dependent claims 12-19.)
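Claim 11's final two limitations compare each diarized speaker against stored acoustic voiceprint models and label the transcription accordingly. The claims do not specify a comparison metric; as an illustrative sketch (cosine similarity over speaker embeddings, the similarity threshold, and all function and variable names are assumptions, not the patent's method), identification and labeling might look like:

```python
import numpy as np

def identify_and_label(segments, speaker_embeddings, voiceprints, threshold=0.7):
    """Label transcription segments with known-speaker identities.

    segments: list of (cluster_id, text) pairs from diarization.
    speaker_embeddings: {cluster_id: embedding vector} per diarized speaker.
    voiceprints: {name: embedding vector} for known speakers (the database).
    A cluster is identified as the known speaker whose voiceprint is most
    similar, if the cosine similarity exceeds threshold; else "unknown".
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    identity = {}
    for cid, emb in speaker_embeddings.items():
        best_name, best_sim = "unknown", threshold
        for name, vp in voiceprints.items():
            sim = cosine(emb, vp)
            if sim > best_sim:
                best_name, best_sim = name, sim
        identity[cid] = best_name

    # Attach the resolved identity to each transcription segment.
    return [(identity[cid], text) for cid, text in segments]
```

Segments whose cluster embedding matches no stored voiceprint above the threshold stay labeled "unknown" rather than being forced onto a known speaker.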
Specification