Diarization using speech segment labeling
First Claim
1. A method of diarization of audio files, the method comprising:
receiving a plurality of audio files from a database server and speaker metadata associated with each of the plurality of audio files, wherein each audio file is a recording of a customer service interaction including a known speaker and at least one other speaker, wherein the known speaker is a specific customer service agent and the at least one other speaker is a customer;
selecting a subset of the audio files, wherein each audio file of the subset is selected to maximize an acoustical difference in voice frequencies between the known speaker and the at least one other speaker in the same audio file;
performing a blind diarization on the subset of audio files to segment the audio files into a plurality of segments of speech separated by non-speech, such that each segment has a high likelihood of containing speech sections from a single speaker;
automatedly applying at least one metric to the segments of speech with a processor to label segments of speech likely to be associated with the known speaker and clustering the selected segments into an audio speaker segment;
analyzing the selected audio speaker segment to create an acoustic voiceprint, wherein the acoustic voiceprint is built from all the selected speaker segments;
applying the acoustic voiceprint to the audio files with the processor to label a portion of the audio file as having been spoken by the known speaker;
adding the labeled portion of the audio file to the acoustic voiceprint;
saving the acoustic voiceprint to a voiceprint database server and associating it with the metadata of the known speaker; and
with the processor, applying the saved acoustic voiceprint from the voiceprint database server to a new audio file from an audio source to perform diarization of the new audio file by blind diarizing the new audio file, comparing each of the new speech segments to the acoustic voiceprint, and labeling each speech segment as belonging to the known speaker associated with the acoustic voiceprint or belonging to another speaker.
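The blind-diarization step recited above first splits each recording into segments of speech separated by non-speech. A minimal sketch of that segmentation, using per-frame energy as a crude stand-in for a real voice-activity detector (the frame size, energy threshold, and `min_gap` pause length are illustrative assumptions, not values from the patent):

```python
def split_speech_segments(samples, frame=160, energy_thresh=0.1, min_gap=3):
    """Split audio into speech segments separated by non-speech, using
    per-frame mean absolute energy as a crude voice-activity detector.
    Returns (start_frame, end_frame) pairs; a real blind diarization would
    further split these segments at speaker-change points."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    active = [sum(abs(s) for s in f) / len(f) > energy_thresh for f in frames]
    segments, start, silence = [], None, 0
    for i, is_speech in enumerate(active):
        if is_speech:
            if start is None:
                start = i          # open a new speech segment
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap:  # pause long enough: close the segment
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:           # speech ran to the end of the audio
        segments.append((start, len(active)))
    return segments
```

Each returned pair then has a high likelihood of containing speech from a single speaker only if the speakers do not overlap, which is why the claim applies further metrics before clustering.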
Abstract
Systems and methods of diarization of audio files use an acoustic voiceprint model. A plurality of audio files are analyzed to arrive at an acoustic voiceprint model associated with an identified speaker. Metadata associated with an audio file is used to select an acoustic voiceprint model. The selected acoustic voiceprint model is applied in a diarization to identify audio data of the identified speaker.
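The voiceprint model described in the abstract can be sketched as an average over the features of segments already labeled as the known speaker, which is then compared against new segments. The sketch below uses mean amplitude and zero-crossing rate as stand-ins for the spectral features (e.g. MFCCs) a real system would extract; the function names and the distance threshold are hypothetical:

```python
import math

def segment_features(samples):
    """Crude two-dimensional feature vector per speech segment: mean absolute
    amplitude and zero-crossing rate. A stand-in for real spectral features."""
    energy = sum(abs(s) for s in samples) / len(samples)
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / (len(samples) - 1)
    return (energy, zcr)

def build_voiceprint(known_speaker_segments):
    """Average the features of every segment labeled as the known speaker;
    the mean vector serves as the acoustic voiceprint model."""
    feats = [segment_features(s) for s in known_speaker_segments]
    return tuple(sum(f[d] for f in feats) / len(feats) for d in range(len(feats[0])))

def label_segments(voiceprint, segments, threshold):
    """Compare each blind-diarized segment to the voiceprint and label it as
    the known speaker ('agent') or an other speaker ('other') by distance."""
    return ["agent" if math.dist(voiceprint, segment_features(s)) < threshold
            else "other" for s in segments]
```

With synthetic low-pitch "agent" and high-pitch "customer" signals, the zero-crossing rate alone separates the two speakers, mirroring the claims' reliance on an acoustical difference in voice frequencies.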
18 Claims
12. A system for diarization of audio files, the system comprising:
an audio database server comprising a plurality of audio files and speaker metadata associated with each of the plurality of audio files, wherein each audio file is a recording of a customer service interaction including a known speaker and at least one other speaker, wherein the known speaker is a specific customer service agent and the at least one other speaker is a customer;
a processor that receives a set of audio files associated with a set of speaker metadata, wherein the audio files are all from a specific known speaker,
selects a subset of the audio files, wherein each audio file of the subset is selected to maximize an acoustical difference in voice frequencies between the known speaker and the at least one other speaker in the same audio file,
performs a blind diarization on the subset of audio files to segment the audio files into a plurality of segments of speech separated by non-speech, such that each segment has a high likelihood of containing speech sections from a single speaker,
automatedly applies at least one metric to the segments of speech to label segments of speech likely to be associated with the known speaker and clusters the selected segments into an audio speaker segment,
analyzes the selected audio speaker segment to create an acoustic voiceprint, wherein the acoustic voiceprint is built from all the selected speaker segments,
applies the acoustic voiceprint to the audio files to label a portion of the audio file as having been spoken by the known speaker, and
adds the labeled portion of the audio file to the acoustic voiceprint;
a voiceprint database server that stores the at least one acoustic voiceprint; and
an audio source that provides new audio files to the processor;
wherein the processor applies the saved acoustic voiceprint from the voiceprint database server to a new audio file from the audio source to perform diarization of the new audio file by blind diarizing the new audio file, comparing each of the new speech segments to the acoustic voiceprint, and labeling each speech segment as belonging to the known speaker associated with the acoustic voiceprint or belonging to another speaker.
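Both independent claims select the subset of audio files that maximizes the acoustical difference in voice frequencies between the known speaker and the other speaker. Assuming a mean voice frequency has already been estimated per speaker per recording (the tuple layout, identifiers, and pitch values below are hypothetical), the selection can be sketched as a ranking by frequency difference:

```python
def select_subset(files, k):
    """files: (file_id, agent_pitch_hz, customer_pitch_hz) tuples with
    precomputed mean voice frequencies for the two speakers in each recording.
    Keep the k recordings where the speakers' frequencies differ most, so the
    later blind diarization can separate the speakers most reliably."""
    ranked = sorted(files, key=lambda f: abs(f[1] - f[2]), reverse=True)
    return [file_id for file_id, _, _ in ranked[:k]]
```

The choice of k trades training-set size against separability; the patent itself does not fix either the metric or the subset size in these claims.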
Specification