Speaker separation in diarization

US 9,875,739 B2
Filed: 05/19/2016
Issued: 01/23/2018
Est. Priority Date: 09/07/2012
Status: Active Grant

Abstract

A system and method of separating speakers in an audio file includes obtaining an audio file. The audio file is transcribed into at least one text file by a transcription server. Homogeneous speech segments are identified within the at least one text file. The audio file is segmented into homogeneous audio segments that correspond to the identified homogeneous speech segments. The homogeneous audio segments of the audio file are separated into a first speaker audio file and a second speaker audio file. The first speaker audio file and the second speaker audio file are transcribed to produce a diarized transcript.

18 Claims

1. A method of producing a diarized transcript from a digital audio file, the method comprising:
- obtaining a digital audio file;
  
  splitting the digital audio file into a plurality of frames;
  
  segmenting the digital audio file into entropy segments based upon an entropy of each frame;
  
  performing a blind diarization to identify a first speaker audio file and a second speaker audio file by clustering the entropy segments into the first speaker audio file and the second speaker audio file, wherein the first speaker audio file only contains audio attributed to the first speaker and the second speaker audio file only contains audio attributed to the second speaker; and
  
  identifying one of the first speaker audio file and second speaker audio file as an agent audio file and another of the first speaker audio file and the second speaker audio file as a customer audio file; and
  
  transcribing the agent audio file and the customer audio file to produce a diarized transcript.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, further comprising:
    - before performing the blind diarization, transcribing the digital audio file with an automated transcription to produce a text file;
      
      identifying homogeneous speech segments in the text file; and
      
      segmenting the digital audio file into homogeneous audio segments that correspond to the identified homogeneous speech segments;
      
      wherein clustering the entropy segments into the first speaker audio file and the second speaker audio file comprises clustering the entropy segments and the homogeneous audio segments.
  - 3. The method of claim 1, further comprising after the splitting of the digital audio file into a plurality of frames:
    - calculating an overall energy speech probability for each frame;
      
      calculating a band energy speech probability for each frame;
      
      calculating a spectral peakiness speech probability for each frame;
      
      calculating a residual energy speech probability for each frame;
      
      computing an activity probability for each frame from the overall energy speech probability, band energy speech probability, spectral peakiness speech probability, and residual energy speech probability;
      
      comparing a moving average of activity probabilities to at least one threshold; and
      
      identifying speech and non-speech segments in the digital audio file based upon the comparison.
  - 4. The method of claim 1 further comprising:
    - filtering the digital audio file to remove non-speech segments;
      
      identifying long speech segments in the homogenous speech segments; and
      
      splitting the long speech segments based upon contextual information from the identified homogeneous speech segments.
  - 5. The method of claim 4, wherein the filtering of the audio file comprises energy envelope filtering to remove segments with energy determined to be below a lower energy threshold or above an upper energy threshold.
  - 6. The method of claim 1:
    - wherein transcribing the digital audio file comprises applying an agent model to the digital audio file; and
      
      wherein identifying one of the first speaker audio file and the second speaker audio file as the agent audio file, comprises comparing the first speaker audio file and the second speaker audio file to the agent model.
  - 7. The method of claim 1, wherein separating the audio file into a first speaker audio file and a second audio file further comprises:
    - clustering identified segments;
      
      creating a first speaker model and a second speaker model from the clustered identified segments; and
      
      identifying unclustered segments by comparing an unclustered segment to the first speaker model and the second speaker model.
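
The framing and entropy-segmentation steps of claim 1 can be illustrated with a minimal sketch. The claims do not define how the entropy of a frame is computed or how segment boundaries are chosen, so everything specific here is an assumption made for illustration: the amplitude-histogram entropy, the single `threshold`, and the hypothetical helper names `split_frames`, `frame_entropy`, and `entropy_segments`.

```python
import math


def split_frames(samples, frame_len):
    """Split samples into consecutive non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    last_start = len(samples) - frame_len + 1
    return [samples[i:i + frame_len] for i in range(0, last_start, frame_len)]


def frame_entropy(frame, bins=8, lo=-1.0, hi=1.0):
    """Shannon entropy, in bits, of the amplitude histogram of one frame.

    Assumed definition: the claims only say segmentation is based on
    "an entropy of each frame".
    """
    counts = [0] * bins
    for x in frame:
        idx = int((x - lo) / (hi - lo) * bins)
        idx = min(max(idx, 0), bins - 1)  # clamp out-of-range samples
        counts[idx] += 1
    n = len(frame)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def entropy_segments(entropies, threshold):
    """Group consecutive frames that fall on the same side of threshold.

    Returns (start_frame, end_frame_exclusive, is_high_entropy) tuples.
    """
    segments = []
    start = 0
    for i in range(1, len(entropies) + 1):
        at_end = i == len(entropies)
        if at_end or (entropies[i] >= threshold) != (entropies[start] >= threshold):
            segments.append((start, i, entropies[start] >= threshold))
            start = i
    return segments
```

In this sketch a run of frames with similar entropy becomes one segment; a later clustering stage, not shown, would then assign the segments to the first or second speaker.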

8. A non-transitory computer-readable medium having stored thereon a sequence of instructions that, when executed by a computing system, causes the computing system to perform the steps comprising:
- obtaining a digital audio file;
  
  splitting the digital audio file into a plurality of frames;
  
  segmenting the digital audio file into entropy segments based upon an entropy of each frame;
  
  performing a blind diarization to identify a first speaker audio file and a second speaker audio file by clustering the entropy segments into the first speaker audio file and the second speaker audio file, wherein the first speaker audio file only contains audio attributed to the first speaker and the second speaker audio file only contains audio attributed to the second speaker; and
  
  identifying one of the first speaker audio file and second speaker audio file as an agent audio file and another of the first speaker audio file and the second speaker audio file as a customer audio file; and
  
  transcribing the agent audio file and the customer audio file to produce a diarized transcript.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The non-transitory computer-readable medium of claim 8 having further instructions stored thereon that, when executed by the computing system, cause the computing system to perform the additional steps comprising:
    - before performing the blind diarization, transcribing the digital audio file with an automated transcription to produce a text file;
      
      identifying homogeneous speech segments in the text file; and
      
      segmenting the digital audio file into homogeneous audio segments that correspond to the identified homogeneous speech segments;
      
      wherein clustering the entropy segments into the first speaker audio file and the second speaker audio file comprises clustering the entropy segments and the homogeneous audio segments.
  - 10. The non-transitory computer-readable medium of claim 8 having further instructions stored thereon that, when executed by the computing system, cause the computing system to perform the additional steps comprising, after the splitting of the digital audio file into a plurality of frames:
    - calculating an overall energy speech probability for each frame;
      
      calculating a band energy speech probability for each frame;
      
      calculating a spectral peakiness speech probability for each frame;
      
      calculating a residual energy speech probability for each frame;
      
      computing an activity probability for each frame from the overall energy speech probability, band energy speech probability, spectral peakiness speech probability, and residual energy speech probability;
      
      comparing a moving average of activity probabilities to at least one threshold; and
      
      identifying speech and non-speech segments in the digital audio file based upon the comparison.
  - 11. The non-transitory computer-readable medium of claim 8 having further instructions stored thereon that, when executed by the computing system, cause the computing system to perform the additional steps comprising:
    - filtering the digital audio file to remove non-speech segments;
      
      identifying long speech segments in the homogenous speech segments; and
      
      splitting the long speech segments based upon contextual information from the identified homogeneous speech segments.
  - 12. The non-transitory computer-readable medium of claim 11, wherein the filtering of the audio file comprises energy envelope filtering to remove segments with energy determined to be below a lower energy threshold or above an upper energy threshold.
  - 13. The non-transitory computer-readable medium of claim 8:
    - wherein transcribing the digital audio file comprises applying an agent model to the digital audio file; and
      
      wherein identifying one of the first speaker audio file and the second speaker audio file as the agent audio file, comprises comparing the first speaker audio file and the second speaker audio file to the agent model.
  - 14. The non-transitory computer-readable medium of claim 8, wherein separating the audio file into a first speaker audio file and a second audio file further comprises:
    - clustering identified segments;
      
      creating a first speaker model and a second speaker model from the clustered identified segments; and
      
      identifying unclustered segments by comparing an unclustered segment to the first speaker model and the second speaker model.
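
The speech and non-speech detection recited in claims 3 and 10 combines four per-frame probabilities into an activity probability, then compares a moving average of that probability to a threshold. A minimal sketch follows. The claims do not say how the four probabilities are combined, how long the averaging window is, or what the threshold is, so the equal-weight mean, `window=3`, and `threshold=0.5` are assumptions, and the function names are hypothetical.

```python
def activity_probability(p_energy, p_band, p_peakiness, p_residual):
    """Combine the four per-frame speech probabilities.

    An equal-weight mean is assumed; the claims leave the combination open.
    """
    return (p_energy + p_band + p_peakiness + p_residual) / 4.0


def moving_average(values, window):
    """Trailing moving average; early frames average over the frames seen so far."""
    averages = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            total -= values[i - window]  # drop the frame that left the window
        averages.append(total / min(i + 1, window))
    return averages


def label_speech(activity, window=3, threshold=0.5):
    """Return True for frames whose smoothed activity meets the threshold."""
    return [avg >= threshold for avg in moving_average(activity, window)]
```

Smoothing before thresholding keeps a single noisy frame from flipping the speech label, at the cost of a short lag at segment boundaries.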

15. A system for audio diarization, the system comprising:
- a blind diarization module operating on a computer processor, wherein the blind diarization module is configured to receive audio data, split the audio data into a plurality of frames, segment the audio data into entropy segments based upon an entropy of each frame, and cluster the entropy segments into a first plurality of segments of the audio data as a first speaker audio file and a second plurality of segments of the audio data as a second speaker audio file;
  
  an agent diarization module operating on the computer processor, the agent diarization module receives an agent model, the agent diarization module compares the agent model to the first speaker audio file and the second speaker audio file and identifies one of the first and second speaker audio files as an agent audio file and another of the first and second speaker audio files as a customer audio file; and
  
  a transcription server that receives the agent audio file and the customer audio file, and transcribes the audio files to produce a diarized transcript.
- Dependent claims (16, 17, 18):
  - 16. The system of claim 15:
    - wherein prior to the blind diarization module receiving the audio data, the transcription server transcribes the audio data and creates an information file that identifies homogeneous speech segments from the transcribed audio data; and
      
      wherein the blind diarization module identifies homogeneous speech segments in the text file and segments the digital audio file into homogeneous audio segments that correspond to the identified homogeneous speech segments.
  - 17. The system of claim 16, wherein the blind diarization module filters the audio data to remove non-speech segments, identifies long speech segments in the homogeneous speech segments, and splits the long speech segments based upon contextual information from the identified homogenous speech segments.
  - 18. The system of claim 17, wherein the blind diarization module identifies the one of the first and second speaker audio files as the agent audio file and the other of the first and second speaker audio files as the customer audio file by at least clustering the entropy segments and the homogeneous audio segments.
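
The comparison performed by the agent diarization module of claim 15 can be sketched as a nearest-model decision. The claims do not specify the form of the agent model or the comparison metric. This sketch assumes each speaker audio file and the agent model have already been reduced to a fixed-length feature vector, and it uses Euclidean distance; both choices, and the names `distance` and `label_agent_and_customer`, are illustrative only.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_agent_and_customer(first_speaker, second_speaker, agent_model):
    """Label the speaker closer to the agent model as the agent.

    Inputs are assumed feature vectors, not raw audio. Ties go to the
    first speaker.
    """
    d_first = distance(first_speaker, agent_model)
    d_second = distance(second_speaker, agent_model)
    if d_first <= d_second:
        return {"agent": "first", "customer": "second"}
    return {"agent": "second", "customer": "first"}
```

Because the blind diarization stage yields exactly two speaker files, identifying the agent fixes the customer label as well, so only one comparison against the agent model is needed.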


Current Assignee: Verint Systems Incorporated
Original Assignee: Verint Systems Limited (Verint Systems Incorporated)
Inventors: Ziv, Omer; Wein, Ron; Shapira, Ido; Achituv, Ran
Primary Examiner(s): Chawan, Vijay B
Application Number: US15/158,959
Publication Number: US 20160343373A1
Time in Patent Office: 614 Days
Field of Search: 704246, 704250, 704209, 704231, 704243, 704249, 704235, 704270
US Class Current
CPC Class Codes

G10L 15/26   Speech to text systems G10L...

G10L 17/06   Decision making techniques;...

G10L 2025/783   based on threshold decision

G10L 25/51   for comparison or discrimin...

G10L 25/78   Detection of presence or ab...
