Diarization using textual and audio speaker labeling

US 10,446,156 B2
Filed: 10/25/2018
Issued: 10/15/2019
Est. Priority Date: 11/21/2012
Status: Active Grant

Abstract

Systems and methods of diarization using linguistic labeling include receiving a set of diarized textual transcripts. At least one heuristic is automatedly applied to the diarized textual transcripts to select transcripts likely to be associated with an identified group of speakers. The selected transcripts are analyzed to create at least one linguistic model. The linguistic model is applied to transcripted audio data to label a portion of the transcripted audio data as having been spoken by the identified group of speakers. Still further embodiments of diarization using linguistic labeling may serve to label agent speech and customer speech in a recorded and transcripted customer service interaction.
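
The heuristic selection step described in the abstract can be illustrated with a minimal sketch. It assumes the heuristic is script detection (as in claims 4 and 14) and that a cluster is selected when its text contains a known script phrase. The function name `select_clusters_by_script`, the `min_hits` parameter, and the sample phrases are hypothetical and do not come from the patent.

```python
from typing import Dict, List


def select_clusters_by_script(
    clusters: Dict[str, str],
    script_phrases: List[str],
    min_hits: int = 1,
) -> List[str]:
    """Return IDs of clusters whose text contains at least `min_hits` script phrases.

    Assumption: simple case-insensitive substring matching stands in for
    whatever script-detection heuristic a real system would use.
    """
    selected = []
    for cluster_id, text in clusters.items():
        lowered = text.lower()
        hits = sum(1 for phrase in script_phrases if phrase.lower() in lowered)
        if hits >= min_hits:
            selected.append(cluster_id)
    return selected


# Hypothetical diarized transcript: one text blob per blind-diarization cluster.
clusters = {
    "spk0": "Thank you for calling, how may I help you today?",
    "spk1": "Hi, my internet has been down since this morning.",
}
script_phrases = ["thank you for calling", "how may i help you"]

selected = select_clusters_by_script(clusters, script_phrases)
print(selected)
```

Clusters selected this way would then feed the linguistic model analysis, while the remaining clusters are treated as originating from other speakers.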

18 Claims

1. A method of diarization, the method comprising:
- receiving a set of textual transcripts from a transcription server and a set of audio files associated with the set of textual transcripts from an audio database server;
  
  performing a blind diarization on the set of textual transcripts and the set of audio files to segment and cluster the textual transcripts into a plurality of textual speaker clusters, wherein the number of textual speaker clusters is at least equal to a number of speakers in the textual transcript;
  
  automatedly applying at least one heuristic to the textual speaker clusters with a processor to select textual speaker clusters likely to be associated with an identified group of speakers;
  
  analyzing the selected textual speaker clusters with the processor to create at least one linguistic model;
  
  applying the linguistic model to transcribed audio data with the processor to label a portion of the transcribed audio data as having been spoken by the identified group of speakers;
  
  saving the at least one linguistic model to a linguistic database server and associating it with the labeled speaker;
  
  with the processor, receiving a new textual transcript from the transcription server and a new audio file associated with the new textual transcript from the audio database server;
  
  receiving the at least one linguistic model from the linguistic database server;
  
  receiving at least one acoustic voiceprint associated with a specific speaker from a voiceprint database server;
  
  applying the received at least one linguistic model from the linguistic database server to the new audio file transcript from an audio source to perform diarization of the new audio file by blind diarizing the new audio file and new textual transcript, comparing each new textual speaker cluster to the at least one linguistic model, and labeling each textual speaker cluster as belonging to a customer service agent or belonging to a customer, comparing each audio speaker segment to the at least one acoustic voiceprint, and labeling each audio speaker segment as belonging to a known speaker or belonging to an unknown speaker;
  
  when one of the audio speaker segments is labeled as belonging to a known speaker, selecting and transcribing the labeled audio speaker segments with the transcription server;
  
  comparing the selected transcribed labeled audio speaker segments to the textual speaker clusters labeled as belonging to a customer service agent; and
  
  when the compared transcribed segments and clusters are each labeled as belonging to a known speaker and a customer service agent, keeping the current labels, otherwise relabeling the textual speaker cluster as belonging to an unknown speaker.
- - 2. The method of claim 1, wherein the identified group of speakers are customer service agents and the audio data is audio data of a customer service interaction between at least one customer service agent and at least one customer.
  - 3. The method of claim 1, wherein the specific speaker is a specific customer service agent.
  - 4. The method of claim 1, wherein the at least one heuristic is detection of a script associated with the identified group of speakers.
  - 5. The method of claim 1, wherein the analysis of the selected textual speaker clusters includes determining word use frequencies for words in the selected textual speaker clusters with the processor, determining word use frequencies for words in the non-selected textual speaker clusters with the processor, and comparing the word use frequencies for words in the selected textual speaker clusters to the word use frequencies for words in the non-selected textual speaker clusters with the processor to identify a plurality of discriminating words for use in the at least one linguistic model.
  - 6. The method of claim 1, wherein the analysis of the selected textual speaker clusters includes receiving a plurality of scripts associated with the identified group of speakers, comparing the plurality of scripts to the selected textual speaker clusters, comparing the plurality of scripts of non-selected textual speaker clusters, determining a correlation score between each of the textual speaker clusters and the plurality of scripts, identifying the group with the greatest correlation score for use in the at least one linguistic model.
  - 7. The method of claim 6, further comprising:
    - calculating a difference between the word use frequencies for each word in the selected textual speaker clusters and the non-selected textual speaker clusters; and
      
      comparing the difference to a predetermined selection threshold, wherein if the difference is greater than the predetermined selection threshold, the word is identified as a discriminating word.
  - 8. The method of claim 1, wherein the textual speaker clusters are associated in groups of at least two, wherein the group of at least two includes a textual speaker cluster originating from the identified group of speakers and at least one textual speaker cluster originating from an other speaker, and wherein the non-selected textual speaker clusters are assumed to have originated from an other speaker.
  - 9. The method of claim 1, wherein the at least one acoustic voiceprint is a set of acoustic voiceprints for each specific customer service agent saved in the acoustic voiceprint database server.
  - 10. The method of claim 9, the method further comprising:
    - receiving the set of acoustic voiceprints from the acoustic voiceprint database server;
      
      applying the received at least one linguistic model from the linguistic database server to the new audio file transcript from an audio source to perform diarization of the new audio file by blind diarizing the new audio file and new textual transcript;
      
      comparing each new textual speaker cluster to the at least one linguistic model, and labeling each textual speaker cluster as belonging to a customer service agent or belonging to a customer;
      
      comparing each audio speaker segment to the set of acoustic voiceprints;
      
      determining which audio speaker segments match one of the acoustic voiceprints; and
      
      labeling those audio speaker segments as belonging to the known speaker.
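
The word-frequency comparison recited in claims 5 and 7 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: frequencies are taken as relative frequencies over whitespace-separated tokens, the difference is computed as selected minus non-selected, and the names `relative_frequencies`, `discriminating_words`, and the threshold value are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def relative_frequencies(texts: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each lowercase whitespace token across `texts`."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def discriminating_words(
    selected: Iterable[str],
    non_selected: Iterable[str],
    threshold: float,
) -> List[str]:
    """Words whose frequency in the selected clusters exceeds their frequency
    in the non-selected clusters by more than `threshold`.

    Assumption: a signed difference (selected minus non-selected) is used, so
    only words characteristic of the selected group are returned.
    """
    sel = relative_frequencies(selected)
    non = relative_frequencies(non_selected)
    return sorted(w for w in sel if sel[w] - non.get(w, 0.0) > threshold)


# Hypothetical cluster texts: selected (agent-like) vs. non-selected.
agent_texts = ["thank you for calling", "thank you for holding"]
customer_texts = ["my bill is wrong", "my phone is broken"]

words = discriminating_words(agent_texts, customer_texts, threshold=0.2)
print(words)
```

With a threshold of 0.2, only words that make up at least a fifth more of the selected text than of the non-selected text are kept; a lower threshold would admit rarer words such as "calling".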

11. A system for diarization and labeling of audio data, the system comprising:
- an audio database server comprising a plurality of audio files;
  
  a transcription server that transcribes the audio files of the plurality of audio files into textual transcripts;
  
  a processor that receives a set of textual transcripts from the transcription server and a set of audio files associated with the set of textual transcripts from the audio database server, performs a blind diarization of the set of textual transcripts and the set of audio files to segment and cluster the textual transcripts into a plurality of textual speaker clusters, segment and cluster the audio files into a plurality of audio speaker segments, wherein the number of textual speaker clusters and the audio speaker segments are each at least equal to a number of speakers in the textual transcript, automatedly applies at least one heuristic to the textual speaker clusters to select at least one of the textual speaker clusters as being associated to an identified group of speakers, and analyzes the selected transcripts to create at least one linguistic model indicative of the identified group of speakers;
  
  a linguistic database server that stores the at least one linguistic model;
  
  an acoustic voiceprint database server that stores the at least one acoustic voiceprint from a known speaker; and
  
  an audio source that provides new transcribed audio data to the processor;
  
  wherein the processor, receives a new textual transcript from the transcription server and a new audio file associated with the new textual transcript from the audio database server, receives the at least one linguistic model from the linguistic database server, receives at least one acoustic voiceprint associated with a specific speaker from a voiceprint database server, applies the at least one linguistic model from the linguistic database server to the new audio file transcript from an audio source to perform diarization of the new audio file by blind diarizing the new audio file and new textual transcript, compares each new textual speaker cluster to the at least one linguistic model, and labels each textual speaker cluster as belonging to a customer service agent or belonging to a customer, compares each audio speaker segment to the at least one acoustic voiceprint, and labels each audio speaker segment as belonging to a known speaker or belonging to an unknown speaker, when one of the audio speaker segments is labeled as belonging to a known speaker, selects and transcribes the labeled audio speaker segments with the transcription server, compares the selected transcribed labeled audio speaker segments to the textual speaker clusters labeled as belonging to a customer service agent; and
  
  based on the comparison, when the compared transcribed segments and clusters are each labeled as belonging to a known speaker and a customer service agent, keep the current labels, otherwise relabel the textual speaker cluster as belonging to an unknown speaker.
- - 12. The system of claim 11, wherein the identified group of speakers are customer service agents and each of the audio files is of a customer service interaction between at least one customer service agent and at least one customer.
  - 13. The system of claim 11, wherein the specific speaker is a specific customer service agent.
  - 14. The system of claim 11, wherein the at least one heuristic is detection of a script associated with the identified group of speakers.
  - 15. The system of claim 11, wherein the analysis of the selected textual speaker clusters includes determining word use frequencies for words in the selected textual speaker clusters with the processor, determining word use frequencies for words in the non-selected textual speaker clusters with the processor, and comparing the word use frequencies for words in the selected textual speaker clusters to the word use frequencies for words in the non-selected textual speaker clusters with the processor to identify a plurality of discriminating words for use in the at least one linguistic model.
  - 16. The system of claim 11, further comprising:
    - a processor that calculates a difference between the word use frequencies for each word in the selected textual speaker clusters and the non-selected textual speaker clusters, and compares the difference to a predetermined selection threshold, wherein if the difference is greater than the predetermined selection threshold, the word is identified as a discriminating word, wherein the analysis of the selected textual speaker clusters includes the processor receiving a plurality of scripts associated with the identified group of speakers, compares the plurality of scripts to the selected textual speaker clusters, compares the plurality of scripts of non-selected textual speaker clusters, determines a correlation score between each of the textual speaker clusters and the plurality of scripts, identifies the group with the greatest correlation score for use in the at least one linguistic model.
  - 17. The method of claim 11, wherein the at least one acoustic voiceprint is a set of acoustic voiceprints for each specific customer service agent saved in the acoustic voiceprint database server.
  - 18. The method of claim 17, the method further comprising:
    - the processor further receives the set of acoustic voiceprints from the acoustic voiceprint database server, applies the received at least one linguistic model from the linguistic database server to the new audio file transcript from an audio source to perform diarization of the new audio file by blind diarizing the new audio file and new textual transcript;
      
      compares each new textual speaker cluster to the at least one linguistic model, and labels each textual speaker cluster as belonging to a customer service agent or belonging to a customer;
      
      compares each audio speaker segment to the set of acoustic voiceprints;
      
      determines which audio speaker segments match one of the acoustic voiceprints; and
      
      labels those audio speaker segments as belonging to the known speaker.
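
The final reconciliation step of claims 1 and 11 (keep the labels when a known-speaker audio segment and an agent-labeled textual cluster agree, otherwise relabel the cluster as unknown) can be sketched as below. The claims do not say how transcribed segments are compared to clusters, so this sketch assumes a word-set (Jaccard) overlap with a cutoff; `Cluster`, `Segment`, `word_overlap`, `reconcile`, and `min_overlap` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    cluster_id: str
    text: str
    label: str  # "agent", "customer", or "unknown"


@dataclass
class Segment:
    segment_id: str
    transcript: str
    label: str  # "known" or "unknown" (from voiceprint comparison)


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed comparison measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def reconcile(
    clusters: List[Cluster],
    segments: List[Segment],
    min_overlap: float = 0.5,
) -> List[Cluster]:
    """Keep an "agent" label only if some known-speaker segment matches the
    cluster text; otherwise relabel the cluster as "unknown"."""
    known = [s for s in segments if s.label == "known"]
    for cluster in clusters:
        if cluster.label != "agent":
            continue  # only agent-labeled clusters are compared
        confirmed = any(
            word_overlap(cluster.text, s.transcript) >= min_overlap for s in known
        )
        if not confirmed:
            cluster.label = "unknown"
    return clusters


clusters = [
    Cluster("c0", "thank you for calling how may i help", "agent"),
    Cluster("c1", "my internet is down", "customer"),
    Cluster("c2", "please hold the line", "agent"),
]
segments = [
    Segment("s0", "thank you for calling how may i help you", "known"),
    Segment("s1", "please hold the line", "unknown"),
]

labels = [c.label for c in reconcile(clusters, segments)]
print(labels)
```

In this example cluster `c2` was labeled as agent speech by the linguistic model, but its only matching audio segment has no voiceprint match, so it is relabeled as unknown.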

Current Assignee
Verint Systems Incorporated
Original Assignee
Verint Systems Limited (Verint Systems Incorporated)
Inventors
Ziv, Omer; Achituv, Ran; Shapira, Ido; Dreyfuss, Jeremie
Primary Examiner(s)
Opsasnick, Michael N

Application Number
US16/170,297
Publication Number
US 20190066692A1
Time in Patent Office
355 Days
CPC Class Codes

G10L 17/00 Speaker identification or v...

G10L 17/02 Preprocessing operations, e...
