Blind diarization of recorded calls with arbitrary number of speakers
Abstract
In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
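The abstract describes clustering utterance models by acoustic similarity before building speaker models. As an illustrative sketch only (the patent does not specify a clustering algorithm; the greedy centroid-merging strategy, the function name, and the distance threshold here are assumptions), each utterance can be summarized as a feature vector and agglomeratively merged until no two clusters are closer than a threshold:

```python
import numpy as np

def cluster_utterances(features, threshold=1.0):
    """Greedy agglomerative clustering of utterance feature vectors.

    features: (n_utterances, dim) array; each row summarizes one utterance.
    Returns one cluster label per utterance; each final cluster stands in
    for one (unknown) speaker.
    """
    clusters = [[i] for i in range(len(features))]

    def centroid(c):
        return features[c].mean(axis=0)

    while len(clusters) > 1:
        # Find the closest pair of cluster centroids.
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best:
                    best, pair = d, (a, b)
        if best > threshold:
            break  # remaining clusters are acoustically distinct speakers
        a, b = pair
        clusters[a] += clusters.pop(b)

    labels = [0] * len(features)
    for k, c in enumerate(clusters):
        for i in c:
            labels[i] = k
    return labels
```

With two well-separated groups of feature vectors, e.g. rows near `[0, 0]` and rows near `[5, 5]`, the function returns matching labels within each group and distinct labels across groups.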
Claims (19)
1. A method for obtaining a speaker-identified transcription from audio data of multiple speakers, the method comprising:

obtaining the audio data and an unlabeled transcription of the audio data;
separating the audio data into a sequence of utterances, wherein each utterance has acoustic features;
clustering utterances having similar acoustic features;
generating a hidden Markov model (HMM) from the clustered utterances;
decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers;
determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and
labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

(Dependent claims 2-9.)
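Claim 1's decoding step associates each utterance in sequence with a speaker state of the HMM. A minimal sketch of that step, under assumptions the claim does not fix (Gaussian-like emissions scored as negative squared distance to each speaker's mean feature vector, and "sticky" transitions that favor the current speaker continuing), is standard Viterbi decoding:

```python
import numpy as np

def viterbi_speakers(utt_features, speaker_means, stay_prob=0.9):
    """Assign each utterance in sequence to a speaker via Viterbi decoding.

    States = speakers. Emission log-likelihood is the negative squared
    distance from the utterance's features to the speaker's mean vector;
    transitions favor staying with the current speaker (stay_prob).
    """
    n, k = len(utt_features), len(speaker_means)
    switch = (1.0 - stay_prob) / max(k - 1, 1)
    log_trans = np.full((k, k), np.log(switch))
    np.fill_diagonal(log_trans, np.log(stay_prob))

    # Emission scores: emit[t, j] = -||utterance_t - speaker_j||^2
    emit = -np.array([[np.sum((f - m) ** 2) for m in speaker_means]
                      for f in utt_features])

    score = emit[0].copy()            # uniform prior over speakers
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans          # (from_state, to_state)
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(k)] + emit[t]

    # Backtrack the best state sequence.
    path = [int(np.argmax(score))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

For a sequence of five utterances alternating near two speaker means at 0.0 and 5.0, the decoded path follows the nearer speaker at each step while the sticky transitions discourage spurious single-utterance switches.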
10. A non-transitory computer readable medium containing computer readable instructions that when executed by a processor of a computing device cause the computing device to perform a method comprising:

obtaining the audio data and an unlabeled transcription of the audio data;
separating the audio data into a sequence of utterances, wherein each utterance has acoustic features;
clustering utterances having similar acoustic features;
generating a hidden Markov model (HMM) from the clustered utterances;
decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers;
determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and
labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.
11. A system for obtaining a speaker-identified transcription from audio data of multiple speakers, the system comprising:

a database storing acoustic voiceprint models of known speakers and the audio data of multiple speakers;
a speech-to-text server that generates an unlabeled transcription of the audio data; and
a computing device communicatively coupled to the database and the speech-to-text server, the computing device comprising a processor, wherein the processor is configured by software to:
obtain the audio data and the unlabeled transcription of the audio data;
separate the audio data into a sequence of utterances, wherein each utterance has acoustic features;
cluster utterances having similar acoustic features;
generate a hidden Markov model (HMM) from the clustered utterances;
decode the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers;
determine the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to the acoustic voiceprint models of known speakers; and
label portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

(Dependent claims 12-19.)
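Claim 11's final two limitations compare each diarized speaker against stored acoustic voiceprint models and label the transcription accordingly. The claims do not specify a comparison metric; as an illustrative sketch (cosine similarity over speaker embeddings, the similarity threshold, and all function and variable names are assumptions, not the patent's method), identification and labeling might look like:

```python
import numpy as np

def identify_and_label(segments, speaker_embeddings, voiceprints, threshold=0.7):
    """Label transcription segments with known-speaker identities.

    segments: list of (cluster_id, text) pairs from diarization.
    speaker_embeddings: {cluster_id: embedding vector} per diarized speaker.
    voiceprints: {name: embedding vector} for known speakers (the database).
    A cluster is identified as the known speaker whose voiceprint is most
    similar, if the cosine similarity exceeds threshold; else "unknown".
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    identity = {}
    for cid, emb in speaker_embeddings.items():
        best_name, best_sim = "unknown", threshold
        for name, vp in voiceprints.items():
            sim = cosine(emb, vp)
            if sim > best_sim:
                best_name, best_sim = name, sim
        identity[cid] = best_name

    # Attach the resolved identity to each transcription segment.
    return [(identity[cid], text) for cid, text in segments]
```

Segments whose cluster embedding matches no stored voiceprint above the threshold stay labeled "unknown" rather than being forced onto a known speaker.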
Specification