Blind diarization of recorded calls with arbitrary number of speakers
First Claim
1. A method for automatically transcribing a customer service telephone conversation between an arbitrary number of speakers, the method comprising:
- receiving data corresponding to the telephone conversation, wherein the received data comprises audio data and metadata that identifies one or more of the speakers in the audio data;
- separating the audio data into frames;
- analyzing the frames to identify utterances, wherein each utterance comprises a plurality of frames;
- performing blind diarization of the audio data to differentiate speakers, wherein the blind diarization comprises:
  - representing each utterance as an utterance model based on acoustic features of each utterance,
  - clustering the utterance models,
  - creating speaker models from each of the clusters,
  - constructing a hidden Markov model from the speaker models, and
  - decoding the hidden Markov model to differentiate speakers of each utterance;
- tagging homogeneous speaker segments in the telephone conversation with a tag unique for each speaker;
- performing speaker diarization to replace one or more of the tags with a speaker's identity, wherein the speaker diarization comprises:
  - comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database, wherein the one or more models retrieved correspond to the one or more speakers identified in the metadata, and
  - based on the comparison, identifying one or more of the speakers; and
- transcribing the conversation to obtain a text representation of the conversation, wherein each spoken part of the conversation is labeled with either the speaker's identity or the tag associated with the speaker.
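The first two processing steps of the claim (separating audio into frames, then grouping frames into utterances) can be illustrated with a minimal sketch. The claim does not say how utterances are detected; the energy threshold used here is an assumption made purely for illustration, and the names `split_into_frames`, `frame_energy`, `find_utterances`, `threshold`, and `min_frames` are hypothetical.

```python
from typing import List, Tuple


def split_into_frames(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut the signal into fixed-length frames, advancing by `hop` samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def find_utterances(frames: List[List[float]], threshold: float,
                    min_frames: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of consecutive
    frames whose energy exceeds `threshold` (an assumed detector).
    `min_frames` keeps only runs spanning a plurality of frames."""
    utterances = []
    start = None
    for i, frame in enumerate(frames):
        active = frame_energy(frame) > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            if i - start >= min_frames:
                utterances.append((start, i))
            start = None
    if start is not None and len(frames) - start >= min_frames:
        utterances.append((start, len(frames)))
    return utterances


# Toy signal: silence, loud speech, silence, quieter speech, silence.
signal = [0.0] * 40 + [1.0, -1.0] * 30 + [0.0] * 40 + [0.5, -0.5] * 20 + [0.0] * 20
frames = split_into_frames(signal, frame_len=10, hop=10)
utterances = find_utterances(frames, threshold=0.01)
print(utterances)
```

On this toy signal the sketch finds two utterances, covering frames 4 to 9 and frames 14 to 17. A production system would more likely use a trained voice activity detector than a fixed energy threshold.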
Abstract
In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
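The sequence described in the abstract (utterance models, clustering, speaker models, hidden Markov model, decoding) can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the patented implementation: each utterance model is taken to be the mean of its feature vectors, clustering is greedy centroid merging with an assumed distance cutoff `max_sq_dist`, each speaker model is a cluster centroid, and the HMM uses unit-variance Gaussian emissions with an assumed self-transition probability `stay_prob`. All function and parameter names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster(models: List[Vector], max_sq_dist: float) -> List[List[int]]:
    """Greedy agglomerative clustering: repeatedly merge the two clusters
    with the closest centroids until no pair is within max_sq_dist.
    The number of clusters (speakers) is not fixed in advance."""
    clusters = [[i] for i in range(len(models))]
    while len(clusters) > 1:
        cents = [mean_vector([models[i] for i in c]) for c in clusters]
        d, a, b = min((sq_dist(cents[a], cents[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > max_sq_dist:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def viterbi(observations: List[Vector], speaker_models: List[Vector],
            stay_prob: float = 0.8) -> List[int]:
    """Decode the most likely speaker index for each utterance.
    States are speakers; emissions are unit-variance Gaussians (assumed)."""
    n = len(speaker_models)
    if n == 1:
        return [0] * len(observations)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def emit(obs: Vector, s: int) -> float:
        return -0.5 * sq_dist(obs, speaker_models[s])

    score = [math.log(1.0 / n) + emit(observations[0], s) for s in range(n)]
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            cands = [prev[p] + (log_stay if p == s else log_switch)
                     for p in range(n)]
            best_p = max(range(n), key=lambda p: cands[p])
            pointers.append(best_p)
            score.append(cands[best_p] + emit(obs, s))
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy data: five utterances, each a short list of 2-D feature vectors.
utterance_features = [
    [[0.0, 0.1], [0.2, 0.0]],
    [[5.0, 5.1], [4.9, 5.0]],
    [[0.1, 0.0], [0.0, 0.2]],
    [[5.2, 4.9], [5.0, 5.0]],
    [[0.1, 0.1], [0.1, 0.1]],
]
utterance_models = [mean_vector(u) for u in utterance_features]
clusters = cluster(utterance_models, max_sq_dist=4.0)
speaker_models = [mean_vector([utterance_models[i] for i in c]) for c in clusters]
labels = viterbi(utterance_models, speaker_models)
print(clusters, labels)
```

With these inputs the clustering settles on two speakers without being told how many to expect, and the decoded sequence alternates between them. Real systems typically use richer utterance and speaker models (for example Gaussian mixtures over spectral features) in place of the mean vectors assumed here.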
168 Citations
16 Claims
12. A non-transitory computer readable medium containing computer readable instructions that when executed by a processor of a computing device cause the computing device to perform a method comprising:
- receiving data corresponding to the telephone conversation, wherein the received data comprises audio data and metadata that identifies one or more of the speakers in the audio data;
- separating the audio data into frames;
- analyzing the frames to identify utterances, wherein each utterance comprises a plurality of frames;
- performing blind diarization of the audio data to differentiate speakers, wherein the blind diarization comprises:
  - representing each utterance as an utterance model based on acoustic features of each utterance,
  - clustering the utterance models,
  - creating speaker models from each of the clusters,
  - constructing a hidden Markov model from the speaker models, and
  - decoding the hidden Markov model to differentiate speakers of each utterance;
- tagging homogeneous speaker segments in the telephone conversation with a tag unique for each speaker;
- performing speaker diarization to replace one or more of the tags with a speaker's identity, wherein the speaker diarization comprises:
  - comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database, wherein the one or more models retrieved correspond to the one or more speakers identified in the metadata, and
  - based on the comparison, identifying one or more of the speakers; and
- transcribing the conversation to obtain a text representation of the conversation, wherein each spoken part of the conversation is labeled with either the speaker's identity or the tag associated with the speaker.
13. A system for automatically transcribing customer service telephone conversations between customer service agents and customers, the system comprising:
- a plurality of call center telephones configured to facilitate telephone conversations between customer service agents and customers;
- a database for storing voice print models of customer service agents and customers;
- a computing device communicatively coupled to the plurality of call center telephones and the database, the computing device comprising a processor, wherein the processor is configured by software to:
  - receive data corresponding to the telephone conversation, wherein the received data comprises audio data and metadata that identifies one or more speakers in the audio data;
  - separate the audio data into frames;
  - analyze the frames to identify utterances, wherein each utterance comprises a plurality of frames;
  - perform blind diarization of the audio data to differentiate speakers, wherein the blind diarization comprises:
    - representing each utterance as an utterance model based on acoustic features of each utterance,
    - clustering the utterance models,
    - creating speaker models from each of the clusters,
    - constructing a hidden Markov model from the speaker models,
    - decoding the hidden Markov model to differentiate speakers of each utterance, and
    - tagging homogeneous speaker segments in the telephone conversation with a tag unique for each speaker;
  - perform speaker diarization to replace one or more of the tags with a speaker's identity, wherein the speaker diarization comprises:
    - comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database, wherein the one or more models retrieved correspond to the one or more speakers identified in the metadata, and
    - identify, based on the comparison, one or more of the speakers; and
  - transcribe the conversation to obtain a text representation of the conversation, wherein each spoken part of the conversation is labeled with either the speaker's identity or the tag associated with the speaker.
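The tag-replacement and labeling steps can be illustrated with a small sketch: each blind tag is compared against stored voice print models for the speakers named in the call metadata, and a tag is replaced by an identity only when a stored model is close enough. The squared-distance comparison, the `max_sq_dist` cutoff, the dictionary-based "database", and all identifiers (`identify_speakers`, `agent_042`, and so on) are assumptions for illustration and do not come from the claims.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def identify_speakers(tag_models: Dict[str, Vector],
                      voice_prints: Dict[str, Vector],
                      expected_ids: List[str],
                      max_sq_dist: float) -> Dict[str, str]:
    """Map each blind-diarization tag to a speaker identity when a stored
    model for a speaker named in the metadata is within max_sq_dist;
    otherwise the tag is kept as the label."""
    labels = {}
    for tag, model in tag_models.items():
        best_id, best_d = None, max_sq_dist
        for speaker_id in expected_ids:
            stored = voice_prints.get(speaker_id)
            if stored is None:
                continue  # no voice print on file for this speaker
            d = sq_dist(model, stored)
            if d <= best_d:
                best_id, best_d = speaker_id, d
        labels[tag] = best_id if best_id is not None else tag
    return labels


# Hypothetical inputs: models for two tagged speakers, a voice print
# store, and metadata naming the agent on the call.
tag_models = {"speaker_1": [0.1, 0.1], "speaker_2": [5.0, 5.0]}
voice_prints = {"agent_042": [0.0, 0.2], "agent_007": [9.0, 9.0]}
metadata_speakers = ["agent_042"]

labels = identify_speakers(tag_models, voice_prints, metadata_speakers,
                           max_sq_dist=1.0)

# Tagged, transcribed segments (text is invented for the example).
segments: List[Tuple[str, str]] = [
    ("speaker_1", "Thank you for calling."),
    ("speaker_2", "Hi, I have a billing question."),
]
transcript = [f"{labels[tag]}: {text}" for tag, text in segments]
print("\n".join(transcript))
```

Here the agent's tag is replaced by the identity from the metadata, while the unidentified caller keeps the blind tag, matching the "either the speaker's identity or the tag" wording. The sketch does not guard against two tags matching the same identity, which a fuller version would need to resolve.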
Specification