SYSTEMS AND METHODS FOR A TWO PASS DIARIZATION, AUTOMATIC SPEECH RECOGNITION, AND TRANSCRIPT GENERATION

US 20200135204A1
Filed: 10/31/2018
Published: 04/30/2020
Est. Priority Date: 10/31/2018
Status: Active Grant

Abstract

In one embodiment, a method for transcript generation includes receiving an audio file and dividing it into a plurality of chunks. The method further includes sending each instance of the plurality of chunks to a speech service module. The method further includes converting speech to text for each instance of the plurality of chunks and returning the text for each instance of the plurality of chunks. The method further includes merging the text for each instance of the plurality of chunks to yield an audio file transcript and sending the audio file and chunks to a diarization module. The method further includes performing first pass diarization on the chunks to yield a plurality of diarized chunks and performing second pass diarization on the plurality of diarized chunks and the audio file to yield a diarized audio file. The method further includes merging the files to yield a final transcript.
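The chunk, transcribe, and merge steps of the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the fixed-length chunking policy, the `Chunk` type, and the `speech_service` callable are all hypothetical names and assumptions introduced here, since the source does not say how chunk boundaries are chosen or how the speech service is invoked.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    index: int    # position of the chunk within the original audio
    start: float  # start offset in seconds
    end: float    # end offset in seconds


def split_into_chunks(duration: float, chunk_len: float) -> List[Chunk]:
    """Divide [0, duration) into consecutive chunks (assumed fixed-length policy)."""
    if duration <= 0 or chunk_len <= 0:
        raise ValueError("duration and chunk_len must be positive")
    chunks: List[Chunk] = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        chunks.append(Chunk(len(chunks), start, end))
        start = end
    return chunks


def transcribe_and_merge(chunks: List[Chunk],
                         speech_service: Callable[[Chunk], str]) -> str:
    """Send every chunk to the (hypothetical) speech service and merge the text."""
    results = [(c.index, speech_service(c)) for c in chunks]
    # Results may arrive in any order, so restore the original chunk order.
    results.sort(key=lambda r: r[0])
    return " ".join(text for _, text in results if text)
```

A caller would pass any function that maps a chunk to text as `speech_service`; sorting by `index` keeps the merged transcript in audio order even if chunks are processed out of order.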

21 Claims

1. A method for transcript generation including ASR and diarization, the method comprising:
- receiving an audio file at a platform module;
  
  dividing the audio file into a plurality of chunks;
  
  sending each instance of the plurality of chunks to a speech service module;
  
  converting speech to text for each instance of the plurality of chunks;
  
  returning the text for each instance of the plurality of chunks to the platform module;
  
  merging the text for each instance of the plurality of chunks at the platform module to yield an audio file transcript;
  
  sending the audio file and the plurality of chunks to a diarization module;
  
  performing first pass diarization on the plurality of chunks to yield a plurality of diarized chunks;
  
  performing second pass diarization on the plurality of diarized chunks and the audio file to yield a diarized audio file;
  
  merging the audio file transcript and the diarized audio file to yield a final transcript.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 18, 19)
- - 2. The method of claim 1, further comprising:
    - transcoding the audio file to a known codec.
  - 3. The method of claim 1, further comprising:
    - sending the audio file transcript to a post process module;
      
      applying punctuation and casing to the audio file transcript.
  - 4. The method of claim 1, wherein the plurality of diarized chunks includes a plurality of segments, each with speaker identification information.
  - 5. The method of claim 4, wherein the speaker identification information is an I-vector.
  - 6. The method of claim 4, wherein in each of the plurality of diarized chunks, segments of the plurality of segments which include statistically similar speaker identification information are clustered as belonging to a corresponding speaker of a plurality of speakers.
  - 7. The method of claim 6, wherein the second pass diarization includes giving each of the plurality of speakers for each of the plurality of diarized chunks a unique identifier.
  - 8. The method of claim 7, wherein the second pass diarization includes, for associated segments of the plurality of segments for each unique identifier, averaging the speaker identification information of the associated segments to yield averaged speaker identification information.
  - 9. The method of claim 8, wherein the second pass diarization includes, assigning identified segments of the plurality of segments from all of the plurality of chunks a final speaker based on correlation between the averaged speaker identification information for the associated segments of the plurality of segments for each unique identifier.
  - 10. The method of claim 1, further comprising:
    - outputting the final transcript in a fixed and tangible format.
  - 14. The system of claim 1, wherein the plurality of diarized chunks includes a plurality of segments, each with speaker identification information.
  - 15. The system of claim 14, wherein the speaker identification information is an I-vector.
  - 16. The system of claim 14, wherein in each of the plurality of diarized chunks, segments of the plurality of segments which include statistically similar speaker identification information are clustered as belonging to a corresponding speaker of a plurality of speakers.
  - 17. The system of claim 16, wherein the second pass diarization includes giving each of the plurality of speakers for each of the plurality of diarized chunks a unique identifier.
  - 18. The system of claim 17, wherein the second pass diarization includes, for associated segments of the plurality of segments for each unique identifier, averaging the speaker identification information of the associated segments to yield averaged speaker identification information.
  - 19. The system of claim 18, wherein the second pass diarization includes, assigning identified segments of the plurality of segments from all of the plurality of chunks a final speaker based on correlation between the averaged speaker identification information for the associated segments of the plurality of segments for each unique identifier.
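The second pass described in claims 7 to 9 (a unique identifier per chunk-local speaker, averaged speaker identification information, and assignment of a final speaker by correlation) might look roughly like the sketch below. Several things here are assumptions, not taken from the source: plain float lists stand in for I-vectors, cosine similarity stands in for the "correlation", the `threshold` value is arbitrary, and the greedy assignment strategy and the function name `second_pass` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average the speaker vectors of all segments attributed to one speaker."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as an assumed stand-in for correlation."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def second_pass(diarized_chunks: List[Dict[str, List[Vector]]],
                threshold: float = 0.8) -> Dict[Tuple[int, str], int]:
    """Map each (chunk index, chunk-local speaker label) to a final speaker id."""
    # Unique identifier = (chunk index, local label); average its segment vectors.
    averaged: Dict[Tuple[int, str], Vector] = {}
    for chunk_idx, speakers in enumerate(diarized_chunks):
        for local_label, vectors in speakers.items():
            averaged[(chunk_idx, local_label)] = mean_vector(vectors)

    # Greedy assignment: join the most similar known speaker, else open a new one.
    representatives: List[Vector] = []
    assignment: Dict[Tuple[int, str], int] = {}
    for uid, vec in averaged.items():
        scores = [cosine(vec, rep) for rep in representatives]
        if scores and max(scores) >= threshold:
            assignment[uid] = scores.index(max(scores))
        else:
            representatives.append(vec)
            assignment[uid] = len(representatives) - 1
    return assignment
```

The point of the sketch is that chunk-local labels are not comparable across chunks: speaker "A" in one chunk may be speaker "B" in another, and only the averaged vectors allow them to be reconciled into one final speaker.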

11. A system for transcript generation including ASR and diarization, the system comprising:
- a platform module;
  
  a speech service module in communication with the platform module;
  
  a diarization module in communication with the platform module, wherein the platform module, the speech service module, and speech service module configured to receive an audio file at the platform module;
  
  divide the audio file into a plurality of chunks;
  
  send each instance of the plurality of chunks to the speech service module;
  
  convert speech to text for each instance of the plurality of chunks;
  
  return the text for each instance of the plurality of chunks to the platform module;
  
  merge the text for each instance of the plurality of chunks at the platform module to yield an audio file transcript;
  
  send the audio file and the plurality of chunks to the diarization module;
  
  perform first pass diarization on the plurality of chunks to yield a plurality of diarized chunks;
  
  perform second pass diarization on the plurality of diarized chunks and the audio file to yield a diarized audio file;
  
  merge the audio file transcript and the diarized audio file to yield a final transcript.
- Dependent Claims (12, 13)
- - 12. The system of claim 11, wherein the speech service module, and speech service module configured to transcode the audio file to a known codec.
  - 13. The method of claim 11, further comprising a post process module, wherein the post process module, the speech service module, and speech service module configured to send the audio file transcript to a post process module;
    - and apply punctuation and casing to the audio file transcript.
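The final step of claim 11, merging the audio file transcript with the diarized audio file, could be sketched as below. The claims do not say how the two are aligned, so this example assumes that the transcript carries word-level timestamps and that the diarization output is a list of timed speaker turns; each word is given the speaker whose turn overlaps it most. The `Word` and `Turn` tuple layouts and both function names are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start seconds, end seconds)
Turn = Tuple[str, float, float]  # (speaker label, start seconds, end seconds)


def speaker_for(word: Word, turns: List[Turn]) -> str:
    """Pick the speaker whose turn has the largest time overlap with the word."""
    _, w_start, w_end = word
    best, best_overlap = "unknown", 0.0
    for speaker, t_start, t_end in turns:
        overlap = min(w_end, t_end) - max(w_start, t_start)
        if overlap > best_overlap:
            best, best_overlap = speaker, overlap
    return best


def merge_transcript(words: List[Word], turns: List[Turn]) -> List[str]:
    """Group consecutive words by speaker into 'speaker: text' lines."""
    lines: List[Tuple[str, List[str]]] = []
    for word in words:
        speaker = speaker_for(word, turns)
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(word[0])
        else:
            lines.append((speaker, [word[0]]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in lines]
```

Words that overlap no turn at all fall back to the label "unknown" in this sketch; a real system would need its own policy for such gaps.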

20. A method of performing diarization on a sound recording, the method comprising:
- receiving a sound recording;
  
  breaking the sound recording into a plurality of chunks;
  
  performing a first diarization on the plurality of chunks, wherein the performing includes breaking each of the plurality of chunks into a plurality of segments, for each of the plurality of segments generating statistical speaker information descriptive of the sound characteristics in that segment, and clustering, within each chunk of the plurality of chunks, segments having similar statistical speaker information to generate within each chunk of the plurality of chunks groups of segments grouped according to the similar statistical speaker information;
  
  performing a second diarization by clustering between the plurality of chunks, the groups of segments according to grouped similar statistical speaker information, the grouped similar statistical speaker information being characteristics of speech of each group for the groups of segments.
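The first diarization of claim 20, clustering segments within a single chunk by similar statistical speaker information, can be illustrated with a small agglomerative clustering routine. This is only a sketch under stated assumptions: the claim does not name a clustering algorithm, so centroid-based agglomerative merging with a cosine-similarity threshold is a choice made here for illustration, and `cluster_segments`, `threshold`, and the float-list vectors are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two speaker vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a group of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_segments(segment_vectors: List[Vector],
                     threshold: float = 0.8) -> List[List[int]]:
    """Group segment indices of one chunk; each group is one chunk-local speaker."""
    clusters = [[i] for i in range(len(segment_vectors))]
    while len(clusters) > 1:
        best: Optional[Tuple[float, int, int]] = None
        for i in range(len(clusters)):
            ci = centroid([segment_vectors[k] for k in clusters[i]])
            for j in range(i + 1, len(clusters)):
                cj = centroid([segment_vectors[k] for k in clusters[j]])
                score = cosine(ci, cj)
                if best is None or score > best[0]:
                    best = (score, i, j)
        if best is None or best[0] < threshold:
            break  # no remaining pair is similar enough to merge
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters
```

Running this once per chunk yields the per-chunk groups of segments; the second diarization then clusters those groups between chunks.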

21. A fixed tangible medium, which when executed by a computing system, executes steps comprising:
- receiving an audio file at a platform module;
  
  dividing the audio file into a plurality of chunks;
  
  sending each instance of the plurality of chunks to a speech service module;
  
  converting speech to text for each instance of the plurality of chunks;
  
  returning the text for each instance of the plurality of chunks to the platform module;
  
  merging the text for each instance of the plurality of chunks at the platform module to yield an audio file transcript;
  
  sending the audio file and the plurality of chunks to a diarization module;
  
  performing first pass diarization on the plurality of chunks to yield a plurality of diarized chunks;
  
  performing second pass diarization on the plurality of diarized chunks and the audio file to yield a diarized audio file;
  
  merging the audio file transcript and the diarized audio file to yield a final transcript.

Current Assignee: Rev.Com, Inc.
Original Assignee: Rev.Com, Inc.
Inventors: Robichaud, Jean-Philippe; Skurikhin, Alexei; Jette, Miguel; Stanislavovich, Petrov Evgeny

Granted Patent: US 10,825,458 B2
CPC Class Codes

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 15/30   Distributed recognition, e....

G10L 17/00   Speaker identification or v...

G10L 17/06   Decision making techniques;...

G10L 19/038   Vector quantisation, e.g. T...

G10L 19/173   Transcoding, i.e. convertin...

G10L 25/51   for comparison or discrimin...

G10L 25/78   Detection of presence or ab...
