Blind diarization of recorded calls with arbitrary number of speakers

US 9,881,617 B2
Filed: 09/01/2016
Issued: 01/30/2018
Est. Priority Date: 07/17/2013
Status: Active Grant

Abstract

In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
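
The final decoding step in the abstract can be illustrated with a small Viterbi decoder. This is only a minimal sketch, not the patented implementation: it assumes each utterance has already been scored against every speaker model, and the function name `viterbi`, the `stay_prob` parameter, and the uniform initial and switch probabilities are hypothetical choices made for illustration.

```python
import math


def viterbi(log_likes, stay_prob=0.9):
    """Return the most likely speaker index for each utterance.

    log_likes[t][s] is the log-likelihood of utterance t under speaker
    model s. stay_prob is an assumed probability that the same speaker
    keeps talking from one utterance to the next (hypothetical value).
    """
    n_speakers = len(log_likes[0])
    if n_speakers == 1:
        return [0] * len(log_likes)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_speakers - 1))

    # Assumed uniform initial distribution over the speaker states.
    scores = [ll - math.log(n_speakers) for ll in log_likes[0]]
    backptrs = []
    for row in log_likes[1:]:
        new_scores, ptrs = [], []
        for s in range(n_speakers):
            # Best predecessor state for speaker s at this utterance.
            best_prev = max(
                range(n_speakers),
                key=lambda p: scores[p] + (log_stay if p == s else log_switch),
            )
            trans = log_stay if best_prev == s else log_switch
            new_scores.append(scores[best_prev] + trans + row[s])
            ptrs.append(best_prev)
        scores = new_scores
        backptrs.append(ptrs)

    # Trace the best path backwards from the best final state.
    state = max(range(n_speakers), key=lambda s: scores[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path
```

With a high `stay_prob`, a single utterance that weakly favors another speaker does not cause a switch, which is the smoothing effect a hidden Markov model adds over scoring each utterance independently.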

16 Claims

1. A method for automatically transcribing a customer service telephone conversation between an arbitrary number of speakers, the method comprising:
- receiving data corresponding to the telephone conversation, wherein the received data comprises audio data and metadata that identifies one or more of the speakers in the audio data;
  
  separating the audio data into frames;
  
  analyzing the frames to identify utterances, wherein each utterance comprises a plurality of frames;
  
  performing blind diarization of the audio data to differentiate speakers, wherein the blind diarization comprises:
  
    - representing each utterance as a utterance model based on acoustic features of each utterance,
    - clustering the utterance models,
    - creating speaker models from each of the clusters,
    - constructing a hidden Markov model from the speaker models, and
    - decoding the hidden Markov model to differentiate speakers of each utterance;
  
  tagging homogeneous speaker segments in the telephone conversation with a tag unique for each speaker;
  
  performing speaker diarization to replace one or more of the tags with a speaker's identity, wherein the speaker diarization comprises:
  
    - comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database wherein the one or more models retrieved correspond to the one or more speakers identified in the metadata, and
    - based on the comparison, identifying one or more of the speakers; and
  
  transcribing the conversation to obtain a text representation of the conversation, wherein each spoken part of the conversation is labeled with either the speaker's identity or the tag associated with the speaker.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method according to claim 1, wherein the analyzing the frames to identify utterances comprises using voice activity detection to identify segments of speech separated by segments of non-speech on a frame-by-frame basis.
  - 3. The method according to claim 2, wherein a frame is identified as speech or non-speech based on one or more of a frame's mean energy, band energy, peakiness, or residual energy.
  - 4. The method according to claim 1, wherein the received data comprises an initial transcription of the telephone conversation without any separation or identification of speakers.
  - 5. The method according to claim 1, wherein the received data comprises metadata that identifies a customer service agent in the telephone conversation.
  - 6. The method according to claim 5, wherein the comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database:
    - retrieving an acoustic voice print model for the customer service agent from the database; and
      
      comparing each homogeneous speaker segment in the telephone conversation to the retrieved acoustic voice print model to determine the likelihood that the homogeneous speaker segment was spoken by the customer service agent.
  - 7. The method according to claim 5, wherein the comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database comprises:
    - retrieving a linguistic model for the customer service agent from the database;
      
      comparing each homogeneous speaker segment in the telephone conversation to the retrieved linguistic model to determine the likelihood that the homogeneous speaker segment was spoken by the customer service agent.
  - 8. The method according to claim 1, wherein the acoustic features are vectors comprised of Mel-frequency cepstral coefficients.
  - 9. The method according to claim 1, wherein the utterance models are Gaussian mixture models.
  - 10. The method according to claim 1, wherein the operation of receiving data corresponding to the telephone conversation comprises:
    - streaming audio data from a telephone conversation in real time.
  - 11. The method according to claim 1, wherein the operation of receiving data corresponding to the telephone conversation comprises:
    - receiving audio data from a telephone conversation from a stored file.
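
The framing and voice activity detection steps of claims 1 to 3 can be sketched as follows. This is a hedged illustration only: it uses mean energy alone (one of the several features claim 3 lists), non-overlapping frames, and a fixed threshold, and the names `split_into_frames`, `mean_energy`, `find_utterances`, `energy_threshold`, and `min_frames` are hypothetical.

```python
def split_into_frames(samples, frame_len):
    """Split a sample sequence into consecutive non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def mean_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def find_utterances(frames, energy_threshold, min_frames=2):
    """Return (start, end) frame-index pairs for runs of speech frames.

    end is exclusive. A frame counts as speech when its mean energy
    exceeds energy_threshold (an assumed, fixed value). Runs shorter
    than min_frames are discarded, so every utterance spans a
    plurality of frames.
    """
    utterances = []
    start = None
    for i, frame in enumerate(frames):
        is_speech = mean_energy(frame) > energy_threshold
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_frames:
                utterances.append((start, i))
            start = None
    # Close an utterance that runs to the end of the audio.
    if start is not None and len(frames) - start >= min_frames:
        utterances.append((start, len(frames)))
    return utterances
```

A practical detector would likely combine several of the features named in claim 3 and adapt the threshold to the noise level of the call; the fixed threshold here is only for readability.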

12. A non-transitory computer readable medium containing computer readable instructions that when executed by a processor of a computing device cause the computing device to a method comprising:
- receiving data corresponding to the telephone conversation, wherein the received data comprises audio data and metadata that identifies one or more of the speakers in the audio data;
  
  separating the audio data into frames;
  
  analyzing the frames to identify utterances, wherein each utterance comprises a plurality of frames;
  
  performing blind diarization of the audio data to differentiate speakers, wherein the blind diarization comprises:
  
    - representing each utterance as a utterance model based on acoustic features of each utterance,
    - clustering the utterance models,
    - creating speaker models from each of the clusters,
    - constructing a hidden Markov model from the speaker models, and
    - decoding the hidden Markov model to differentiate speakers of each utterance;
  
  tagging homogeneous speaker segments in the telephone conversation with a tag unique for each speaker;
  
  performing speaker diarization to replace one or more of the tags with a speaker's identity, wherein the speaker diarization comprises:
  
    - comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database wherein the one or more models retrieved correspond to the one or more speakers identified in the metadata, and
    - based on the comparison, identifying one or more of the speakers; and
  
  transcribing the conversation to obtain a text representation of the conversation, wherein each spoken part of the conversation is labeled with either the speaker's identity or the tag associated with the speaker.

13. A system for automatically transcribing customer service telephone conversations between customer service agents and customers, the system comprising:
- a plurality of call center telephones configured to facilitate telephone conversations between customer service agents and customers;
  
  a database for storing voice print models of customer service agents and customers;
  
  a computing device communicatively coupled to the plurality of call center telephones and the database, the computing device comprising a processor, wherein the processor is configured by software to:
  
  receive data corresponding to the telephone conversation, wherein the received data comprises audio data and metadata that identifies one or more speakers in the audio data;
  
  separate the audio data into frames;
  
  analyze the frames to identify utterances, wherein each utterance comprises a plurality of frames;
  
  perform blind diarization of the audio data to differentiate speakers, wherein the blind diarization comprises:
  
    - representing each utterance as a utterance model based on acoustic features of each utterance,
    - clustering the utterance models,
    - creating speaker models from each of the clusters,
    - constructing a hidden Markov model from the speaker models,
    - decoding the hidden Markov model to differentiate speakers of each utterance, and
    - tagging homogeneous speaker segments in the telephone conversation with a tag unique for each speaker;
  
  perform speaker diarization to replace one or more of the tags with a speaker's identity, wherein the speaker diarization comprises:
  
    - comparing the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database wherein the one or more models retrieved correspond to the one or more speakers identified in the metadata, and
    - identify, based on the comparison one or more of the speakers; and
  
  transcribe the conversation to obtain a text representation of the conversation, wherein each spoken part of the conversation is labeled with either the speaker's identity or the tag associated with the speaker.
- Dependent Claims (14, 15, 16)
  - 14. The method according to claim 13, wherein the received data comprises metadata that identifies one of the speakers in the telephone conversation as a particular customer service agent.
  - 15. The method according to claim 14, wherein the processor is further configured to compare the homogeneous speaker segments in the telephone conversation to one or more models retrieved from a database by:
    - retrieving an acoustic voice print model for the particular customer service agent from the database; and
      
      comparing each homogeneous speaker segment in the telephone conversation to the retrieved acoustic voice print model to determine the likelihood that the homogeneous speaker segment was spoken by the particular customer service agent.
  - 16. The method according to claim 13, wherein the processor is further configured to:
    - transmit the transcribed conversation to another computer communicatively coupled to the computer and/or to a user interface couple to the computer.
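
The comparison described in claims 6 and 15, where tagged segments are scored against a stored acoustic voice print to decide which tag belongs to the known agent, might look roughly like the sketch below. Everything here is an assumption for illustration: a single diagonal-covariance Gaussian stands in for the voice print model, the dictionary layout with `"mean"` and `"var"` keys is hypothetical, and the functions `gaussian_log_likelihood`, `identify_tag`, and `relabel` are not taken from the patent.

```python
import math


def gaussian_log_likelihood(vector, mean, var):
    """Log-density of a feature vector under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(vector, mean, var)
    )


def identify_tag(segments, voice_print):
    """Return the tag whose segments best match the voice print.

    segments: list of (tag, feature_vectors) pairs, one per
        homogeneous speaker segment.
    voice_print: dict with "mean" and "var" lists (hypothetical layout).
    The score for a tag is the average per-vector log-likelihood over
    all segments carrying that tag.
    """
    totals, counts = {}, {}
    for tag, vectors in segments:
        for vec in vectors:
            ll = gaussian_log_likelihood(
                vec, voice_print["mean"], voice_print["var"]
            )
            totals[tag] = totals.get(tag, 0.0) + ll
            counts[tag] = counts.get(tag, 0) + 1
    return max(totals, key=lambda t: totals[t] / counts[t])


def relabel(segments, voice_print, identity):
    """Replace the best-matching tag with the speaker's identity."""
    best = identify_tag(segments, voice_print)
    return [
        (identity if tag == best else tag, vectors)
        for tag, vectors in segments
    ]
```

Segments whose tag is not matched keep their blind-diarization tag, which mirrors the claim language that each spoken part is labeled with either the speaker's identity or the tag.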

Current Assignee
Verint Systems Incorporated
Original Assignee
Verint Systems Limited (Verint Systems Incorporated)
Inventors
Sidi, Oana; Wein, Ron
Primary Examiner(s)
Pham, Thierry L

Application Number

US15/254,326
Publication Number

US 20170053653A1
Time in Patent Office

516 Days
Field of Search

704245, 704246, 704250
US Class Current
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 15/26   Speech to text systems G10L...

G10L 17/02   Preprocessing operations, e...

G10L 17/04   Training, enrolment or mode...

G10L 17/06   Decision making techniques;...

G10L 17/16   Hidden Markov models [HMM]

G10L 2015/025   Phonemes, fenemes or fenone...

G10L 25/78   Detection of presence or ab...

H04M 2201/41   using speaker recognition s...

H04M 2203/303   Marking

H04M 3/5175   Call or contact centers sup...
