Diarization using textual and audio speaker labeling
First Claim
1. A method of diarization, the method comprising:
receiving a set of textual transcripts from a transcription server and a set of audio files associated with the set of textual transcripts from an audio database server;
performing a blind diarization on the set of textual transcripts and the set of audio files to segment and cluster the textual transcripts into a plurality of textual speaker clusters, wherein the number of textual speaker clusters is at least equal to a number of speakers in the textual transcript;
automatedly applying at least one heuristic to the textual speaker clusters with a processor to select textual speaker clusters likely to be associated with an identified group of speakers;
analyzing the selected textual speaker clusters with the processor to create at least one linguistic model;
applying the linguistic model to transcribed audio data with the processor to label a portion of the transcribed audio data as having been spoken by the identified group of speakers;
saving the at least one linguistic model to a linguistic database server and associating it with the labeled speaker;
with the processor, receiving a new textual transcript from the transcription server and a new audio file associated with the new textual transcript from the audio database server;
receiving the at least one linguistic model from the linguistic database server;
receiving at least one acoustic voiceprint associated with a specific speaker from a voiceprint database server;
applying the received at least one linguistic model from the linguistic database server to the new audio file transcript from an audio source to perform diarization of the new audio file by blind diarizing the new audio file and new textual transcript, comparing each new textual speaker cluster to the at least one linguistic model, and labeling each textual speaker cluster as belonging to a customer service agent or belonging to a customer, comparing each audio speaker segment to the at least one acoustic voiceprint, and labeling each audio speaker segment as belonging to a known speaker or belonging to an unknown speaker;
when one of the audio speaker segments is labeled as belonging to a known speaker, selecting and transcribing the labeled audio speaker segments with the transcription server;
comparing the selected transcribed labeled audio speaker segments to the textual speaker clusters labeled as belonging to a customer service agent; and
when the compared transcribed segments and clusters are each labeled as belonging to a known speaker and a customer service agent, keeping the current labels, otherwise relabeling the textual speaker cluster as belonging to an unknown speaker.
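The final three steps of the claim (selecting known-speaker segments, comparing them to agent-labeled clusters, and keeping or relabeling) can be illustrated with a minimal sketch. The claim does not say how the comparison is performed, so the word-overlap (Jaccard) measure, the `threshold` value, and the names `TextCluster`, `AudioSegment`, and `reconcile` below are all assumptions made for illustration. The sketch also assumes that clusters are left untouched when no segment is labeled as a known speaker, since the claim only triggers the comparison in that case.

```python
from dataclasses import dataclass


@dataclass
class TextCluster:
    text: str
    label: str  # "agent" or "customer", assigned by the linguistic model


@dataclass
class AudioSegment:
    transcript: str
    label: str  # "known" or "unknown", assigned by voiceprint comparison


def jaccard(a: str, b: str) -> float:
    """Word-set overlap; an assumed stand-in for the unspecified comparison."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def reconcile(clusters, segments, threshold=0.5):
    """Keep an 'agent' label only if a known-speaker segment matches the cluster."""
    known = [s for s in segments if s.label == "known"]
    if not known:
        # Assumption: with no known-speaker segment, nothing is compared.
        return list(clusters)
    result = []
    for c in clusters:
        if c.label != "agent":
            result.append(c)
            continue
        matched = any(jaccard(c.text, s.transcript) >= threshold for s in known)
        # Both labels agree: keep. Otherwise relabel the cluster as unknown.
        result.append(c if matched else TextCluster(c.text, "unknown"))
    return result
```

In this sketch, an agent-labeled cluster whose text does not overlap with any known-speaker segment is relabeled `"unknown"`, while customer-labeled clusters pass through unchanged.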
Abstract
Systems and methods of diarization using linguistic labeling include receiving a set of diarized textual transcripts. At least one heuristic is automatedly applied to the diarized textual transcripts to select transcripts likely to be associated with an identified group of speakers. The selected transcripts are analyzed to create at least one linguistic model. The linguistic model is applied to transcribed audio data to label a portion of the transcribed audio data as having been spoken by the identified group of speakers. Still further embodiments of diarization using linguistic labeling may serve to label agent speech and customer speech in a recorded and transcribed customer service interaction.
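The pipeline summarized in the abstract (heuristic selection, model creation, labeling) can be sketched as follows. The abstract names neither the heuristic nor the form of the linguistic model, so everything here is an assumption for illustration: the heuristic is a check for scripted greeting phrases (`GREETING_CUES` is hypothetical), and the model is a unigram word-count model with add-one smoothing, compared against a model built from the remaining transcripts.

```python
import math
from collections import Counter

# Hypothetical scripted phrases; the source does not list any cues.
GREETING_CUES = ("thank you for calling", "how may i help")


def looks_like_agent(text: str) -> bool:
    """Assumed heuristic: agent speech contains a scripted greeting."""
    t = text.lower()
    return any(cue in t for cue in GREETING_CUES)


def build_model(texts) -> Counter:
    """Assumed linguistic model: plain word counts over the selected texts."""
    return Counter(w for t in texts for w in t.lower().split())


def log_likelihood(text: str, counts: Counter, vocab_size: int) -> float:
    """Add-one smoothed unigram log-likelihood of text under counts."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.lower().split()
    )


def label(text: str, agent_counts: Counter, other_counts: Counter) -> str:
    """Label text by whichever model assigns it the higher likelihood."""
    vocab_size = len(set(agent_counts) | set(other_counts)) + 1  # +1 for unseen words
    a = log_likelihood(text, agent_counts, vocab_size)
    o = log_likelihood(text, other_counts, vocab_size)
    return "agent" if a > o else "customer"
```

A typical use would split a set of diarized transcripts with `looks_like_agent`, build one model from each side with `build_model`, and then call `label` on each cluster of a new transcript.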
18 Claims
1. A method of diarization, the method comprising:
receiving a set of textual transcripts from a transcription server and a set of audio files associated with the set of textual transcripts from an audio database server;
performing a blind diarization on the set of textual transcripts and the set of audio files to segment and cluster the textual transcripts into a plurality of textual speaker clusters, wherein the number of textual speaker clusters is at least equal to a number of speakers in the textual transcript;
automatedly applying at least one heuristic to the textual speaker clusters with a processor to select textual speaker clusters likely to be associated with an identified group of speakers;
analyzing the selected textual speaker clusters with the processor to create at least one linguistic model;
applying the linguistic model to transcribed audio data with the processor to label a portion of the transcribed audio data as having been spoken by the identified group of speakers;
saving the at least one linguistic model to a linguistic database server and associating it with the labeled speaker;
with the processor, receiving a new textual transcript from the transcription server and a new audio file associated with the new textual transcript from the audio database server;
receiving the at least one linguistic model from the linguistic database server;
receiving at least one acoustic voiceprint associated with a specific speaker from a voiceprint database server;
applying the received at least one linguistic model from the linguistic database server to the new audio file transcript from an audio source to perform diarization of the new audio file by blind diarizing the new audio file and new textual transcript, comparing each new textual speaker cluster to the at least one linguistic model, and labeling each textual speaker cluster as belonging to a customer service agent or belonging to a customer, comparing each audio speaker segment to the at least one acoustic voiceprint, and labeling each audio speaker segment as belonging to a known speaker or belonging to an unknown speaker;
when one of the audio speaker segments is labeled as belonging to a known speaker, selecting and transcribing the labeled audio speaker segments with the transcription server;
comparing the selected transcribed labeled audio speaker segments to the textual speaker clusters labeled as belonging to a customer service agent; and
when the compared transcribed segments and clusters are each labeled as belonging to a known speaker and a customer service agent, keeping the current labels, otherwise relabeling the textual speaker cluster as belonging to an unknown speaker.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for diarization and labeling of audio data, the system comprising:
an audio database server comprising a plurality of audio files;
a transcription server that transcribes the audio files of the plurality of audio files into textual transcripts;
a processor that receives a set of textual transcripts from the transcription server and a set of audio files associated with the set of textual transcripts from the audio database server, performs a blind diarization of the set of textual transcripts and the set of audio files to segment and cluster the textual transcripts into a plurality of textual speaker clusters, segment and cluster the audio files into a plurality of audio speaker segments, wherein the number of textual speaker clusters and the audio speaker segments are each at least equal to a number of speakers in the textual transcript, automatedly applies at least one heuristic to the textual speaker clusters to select at least one of the textual speaker cluster as being associated to an identified group of speakers, and analyzes the selected transcripts to create at least one linguistic model indicative of the identified group of speakers;
a linguistic database server that stores the at least one linguistic model;
an acoustic voiceprint database server that stores the at least one acoustic voiceprint from a known speaker; and
an audio source that provides new transcribed audio data to the processor;
wherein the processor receives a new textual transcript from the transcription server and a new audio file associated with the new textual transcript from the audio database server, receives the at least one linguistic model from the linguistic database server, receives at least one acoustic voiceprint associated with a specific speaker from a voiceprint database server, applies the at least one linguistic model from the linguistic database server to the new audio file transcript from an audio source to perform diarization of the new audio file by blind diarizing the new audio file and new textual transcript, compares each new textual speaker cluster to the at least one linguistic model, and labels each textual speaker cluster as belonging to a customer service agent or belonging to a customer, compares each audio speaker segment to the at least one acoustic voiceprint, and labels each audio speaker segment as belonging to a known speaker or belonging to an unknown speaker, when one of the audio speaker segments is labeled as belonging to a known speaker, selects and transcribes the labeled audio speaker segments with the transcription server, compares the selected transcribed labeled audio speaker segments to the textual speaker clusters labeled as belonging to a customer service agent; and
based on the comparison, when the compared transcribed segments and clusters are each labeled as belonging to a known speaker and a customer service agent, keep the current labels, otherwise relabel the textual speaker cluster as belonging to an unknown speaker.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
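The step in which the processor compares each audio speaker segment to the stored acoustic voiceprints can be pictured with a short sketch. The claim does not define what a voiceprint is or how the comparison works; the version below assumes that both voiceprints and segments are represented as fixed-length numeric vectors and that cosine similarity against a hypothetical `threshold` decides the label. The function names are illustrative only.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def label_segments(segment_vectors, voiceprints, threshold=0.8):
    """Label each segment 'known' if it is close enough to any stored voiceprint."""
    labels = []
    for vec in segment_vectors:
        # Best match over all stored voiceprints; 0.0 if none are stored.
        best = max((cosine(vec, vp) for vp in voiceprints), default=0.0)
        labels.append("known" if best >= threshold else "unknown")
    return labels
```

With an empty voiceprint store, every segment is labeled `"unknown"` in this sketch, which matches the claim's two-way known/unknown labeling.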
Specification