Human-curated glossary for rapid hybrid-based transcription of audio

US 10,607,599 B1
Filed: 10/07/2019
Issued: 03/31/2020
Est. Priority Date: 09/06/2019
Status: Active Grant

Abstract

Described herein are curation of a glossary and its utilization for automatic speech recognition (ASR). In one embodiment, a server receives an audio recording of speech, taken over a period spanning at least two hours. During the first hour, the server generates, utilizing an ASR system, a transcription of a segment of the audio, recorded during the first twenty minutes. The server receives, from a transcriber, a phrase that does not appear in the transcription, but was spoken in the segment, and adds the phrase to a glossary. After the first hour of the period, the server generates, utilizing the ASR system, a second transcription of a second segment of the audio, provides the second transcription and the glossary to a second transcriber, and receives a corrected transcription, in which the second transcriber substituted a second phrase in the second transcription, which was not in the glossary, with the phrase.
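
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Glossary` class and the `report_missed_phrase` and `apply_correction` functions are hypothetical names, transcriptions are plain strings, and the ASR output and the transcribers' input are replaced by made-up literal values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Glossary:
    """Hypothetical container for phrases that transcribers report as missed by ASR."""
    phrases: List[str] = field(default_factory=list)

    def report_missed_phrase(self, phrase: str, asr_transcription: str) -> bool:
        """Add `phrase` only if it does not appear in the ASR transcription."""
        if phrase.lower() in asr_transcription.lower():
            return False  # the ASR system already produced it; nothing to curate
        if phrase not in self.phrases:
            self.phrases.append(phrase)
        return True


def apply_correction(transcription: str, wrong_phrase: str,
                     glossary_phrase: str, glossary: Glossary) -> str:
    """Stand-in for a second transcriber substituting a glossary phrase."""
    if glossary_phrase not in glossary.phrases:
        raise ValueError("replacement must come from the glossary")
    return transcription.replace(wrong_phrase, glossary_phrase)


# Made-up example data (assumption): ASR output for an early and a later segment.
glossary = Glossary()
first_asr = "the patient was given a seat a minute fen"
glossary.report_missed_phrase("acetaminophen", first_asr)

second_asr = "increase the dose of a seat a minute fen tomorrow"
corrected = apply_correction(second_asr, "a seat a minute fen",
                             "acetaminophen", glossary)
print(corrected)  # increase the dose of acetaminophen tomorrow
```

The point of the sketch is the ordering: a phrase enters the glossary from an early segment, and only afterwards is it offered to the transcriber of a later ASR transcription.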

64 Citations

20 Claims

1. A system configured to curate a glossary and utilize the glossary for rapid transcription of audio, comprising:
- a frontend server configured to transmit, to a backend server, an audio recording comprising speech of multiple people in a room over a period spanning at least two hours; and
  
  the backend server is configured to perform the following:
  
  during the first hour of the period:
  
  segment at least a portion of the audio recording, which was recorded during the first twenty minutes of the period, to segments;
  
  generate, utilizing an automatic speech recognition (ASR) system, a first transcription of a first segment from among the segments;
  
  receive, from a first transcriber, a first phrase that does not appear in the first transcription, but was spoken in the first segment; and
  
  add the first phrase to a glossary;
  
  after the first hour of the period:
  
  generate, utilizing the ASR system, a second transcription of a second segment of the audio recording;
  
  provide the second transcription and the glossary to a second transcriber; and
  
  receive a corrected transcription, in which the second transcriber substituted a second phrase in the second transcription, which was not in the glossary, with the first phrase.
  - 2. The system of claim 1, wherein the backend server is further configured to:
    - (i) generate transcriptions of the segments utilizing the ASR system, (ii) calculate, utilizing a certain model, values indicative of an expected contribution to formation of a glossary of transcription by a transcriber of each of the segments, and (iii) utilize the values to select the first segment; and
      
      wherein the value indicative of the expected contribution to formation of a glossary of transcription of the first segment is greater than the values of the expected contribution of most of the segments.
  - 3. The system of claim 2, wherein the backend server is further configured to identify, based on a transcription of the first segment, a topic of the first segment, and to select the certain model, from among a plurality of models, based on the topic;
    - and wherein each model, from among the plurality of models, corresponds to a topic from among a plurality of topics, and each model is generated based on data comprising: (i) transcriptions by the ASR system of previous segments of audio comprising speech related to the topic, and (ii) corrections to said transcriptions by transcribers.
  - 4. The system of claim 1, wherein the backend server is further configured to generate feature values based on the first phrase and the first transcription, to utilize a model to calculate, based on the feature values, an importance score for the first phrase, and to add the first phrase to the glossary responsive to the importance score reaching a threshold;
    - and wherein at least one of the feature values is indicative of one or more of the following: a prevalence of the first phrase in the first transcription, and a ratio between (i) the prevalence of the first phrase in the first transcription, and (ii) a general prevalence of use of the first phrase.
  - 5. The system of claim 4, wherein the model is generated based on data comprising:
    - previous transcriptions of other segments of audio and glossaries formed for use of transcribers who transcribed the other segments of audio.
  - 6. The system of claim 1, wherein the backend server is further configured to:
    - (i) generate feature values based on at least one of: the first transcription and the first segment, (ii) utilize a specific model to calculate, based on the feature values, suitability-values indicative of suitability of various transcribers to transcribe the first segment, and (iii) utilize the suitability-values to select the first transcriber from among the various transcribers; and
      
      wherein a suitability-value of the first transcriber is greater than suitability-values of most of the various transcribers.
  - 7. The system of claim 1, wherein the backend server is further configured to transcribe the segments utilizing the ASR system and to identify a certain utterance that is uttered in more than one of the segments, whose transcription has low confidence in corresponding transcriptions of the more than one of the segments;
    - the backend server is further configured to select the first segment based on the first segment comprising the certain utterance.
  - 8. The system of claim 1, wherein the backend server is further configured to utilize certain segments of the audio recording, in which the first phrase was uttered and certain transcriptions of the certain segments, which were reviewed by one or more transcribers, to update a phonetic model utilized by the ASR system to reflect a pronunciation of the first phrase.
  - 9. The system of claim 1, wherein the audio recording comprises two or more channels of audio, and further comprising two or more microphones, at least 40 cm away from each other, which are configured to record the two or more channels, respectively.
  - 10. The system of claim 1, wherein the second segment was recorded before the first segment.
  - 11. The system of claim 1, wherein the backend server is further configured to perform the following prior to a target completion time that is less than eight hours after the end of the period:
    - generate, utilizing the ASR system, transcriptions of additional segments of the audio recording;
      
      provide the additional segments and the glossary to multiple transcribers;
      
      receive corrected transcriptions, generated by the multiple transcribers after they listened to the additional segments and consequently made changes to the additional transcriptions, wherein at least some of the changes involve substituting a phrase in a transcription with the first phrase; and
      
      generate a transcription of the speech of the multiple people during the period based on data comprising the corrected transcriptions.
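
The two feature values named in claim 4 (the prevalence of a phrase in a transcription, and its ratio to a general prevalence of use) can be illustrated with a short sketch. The claim calls for a model that maps feature values to an importance score; here a fixed rule that uses the ratio directly as the score is swapped in for that model. The function names, the `THRESHOLD` value, the sample text, and the general-prevalence figures are all assumptions made for illustration.

```python
def prevalence(phrase: str, text: str) -> float:
    """Occurrences of `phrase` (matched on whole words) per word of `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    if not words or not target:
        return 0.0
    n = len(target)
    hits = sum(words[i:i + n] == target for i in range(len(words) - n + 1))
    return hits / len(words)


def importance_score(phrase: str, text: str, general_prevalence: float) -> float:
    """Ratio of in-transcription prevalence to general prevalence.

    Assumption: a trained model would consume these feature values;
    this sketch simply returns the ratio as the score.
    """
    local = prevalence(phrase, text)
    if general_prevalence <= 0:
        return float("inf") if local > 0 else 0.0
    return local / general_prevalence


THRESHOLD = 100.0  # assumed cutoff; the claims do not give a value

# Made-up transcription text and general-prevalence figures.
text = "kubernetes cluster restart fixed the kubernetes scheduler"
should_add = importance_score("kubernetes", text, general_prevalence=1e-6) >= THRESHOLD
common_word_score = importance_score("the", text, general_prevalence=0.05)
```

Under these assumed numbers, a rare domain term that recurs in the transcription clears the threshold, while a common word such as "the" does not.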

12. A method for curating and utilizing a glossary for rapid transcription of audio, comprising:
- receiving an audio recording comprising speech of multiple people in a room over a period spanning at least two hours;
  
  segmenting at least a portion of the audio recording, which was recorded during the first twenty minutes of the period, to segments;
  
  generating, utilizing an automatic speech recognition (ASR) system, a first transcription of a first segment from among the segments;
  
  receiving, from a first transcriber, a first phrase that does not appear in the first transcription, but was spoken in the first segment;
  
  adding the first phrase to a glossary;
  
  generating, utilizing the ASR system, a second transcription of a second segment of the audio recording;
  
  providing the second transcription and the glossary to a second transcriber; and
  
  receiving a corrected transcription, in which the second transcriber substituted a second phrase in the second transcription, which was not in the glossary, with the first phrase.
  - 13. The method of claim 12, further comprising:
    - generating transcriptions of the segments utilizing the ASR system;
      
      calculating, utilizing a certain model, values indicative of an expected contribution to formation of a glossary by transcription by a transcriber of each of the segments, and utilizing the values to select the first segment; and
      
      wherein the value indicative of the expected contribution to formation of a glossary of transcription of the first segment is greater than the values of the expected contribution of most of the segments.
  - 14. The method of claim 13, further comprising identifying, based on a transcription of the first segment, a topic of the first segment, and selecting the certain model, from among a plurality of models, based on the topic;
    - wherein each model, from among the plurality of models, corresponds to a topic from among a plurality of topics, and each model is generated based on data comprising: (i) transcriptions by the ASR system of previous segments of audio comprising speech related to the topic, and (ii) corrections to said transcriptions by transcribers.
  - 15. The method of claim 12, further comprising generating feature values based on the first phrase and the first transcription, utilizing a model to calculate, based on the feature values, an importance score for the first phrase, and adding the first phrase to the glossary responsive to the importance score reaching a threshold;
    - wherein at least one of the feature values is indicative of one or more of the following: a prevalence of the first phrase in the transcription of the first segment, and a ratio between (i) the prevalence of the first phrase in the first transcription, and (ii) a general prevalence of use of the first phrase; and
      
      wherein the model is generated based on data comprising: previous transcriptions of other segments of audio and glossaries formed for use of transcribers who transcribed the other segments of audio.
  - 16. The method of claim 12, further comprising:
    - (i) generating feature values based on at least one of: the first transcription and the first segment, (ii) utilizing a specific model to calculate, based on the feature values, suitability-values indicative of suitability of various transcribers to transcribe the first segment, and (iii) utilizing the suitability-values to select the first transcriber from among the various transcribers, wherein a suitability-value of the first transcriber is greater than suitability-values of most of the various transcribers.
  - 17. The method of claim 12, further comprising:
    - transcribing the segments utilizing the ASR system;
      
      identifying a certain utterance that is uttered in more than one of the segments, whose transcription has low confidence in corresponding transcriptions of the more than one of the segments; and
      
      selecting the first segment based on the first segment comprising the certain utterance.
  - 18. The method of claim 12, further comprising:
    - utilizing certain segments of the audio recording, in which the first phrase was uttered and certain transcriptions of the certain segments, which were reviewed by one or more transcribers, to update a phonetic model utilized by the ASR system to reflect a pronunciation of the first phrase.
  - 19. The method of claim 12, further comprising performing the following prior to a target completion time that is less than eight hours after the end of the period:
    - generating, utilizing the ASR system, transcriptions of additional segments of the audio recording;
      
      providing the additional segments and the glossary to multiple transcribers;
      
      receiving corrected transcriptions, generated by the multiple transcribers after they listened to the additional segments and consequently made changes to the additional transcriptions, wherein at least some of the changes involve substituting a phrase in a transcription with the first phrase; and
      
      generating a transcription of the speech of the multiple people during the period based on data comprising the corrected transcriptions.
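
Claim 17's selection rule (pick a segment that contains an utterance transcribed with low confidence in more than one segment) might look like the sketch below. Several things here are assumptions: ASR output is modeled as `(word, confidence)` pairs per segment, the same utterance is assumed to yield the same hypothesized word in each segment, the `LOW_CONFIDENCE` cutoff is arbitrary, and the tie-breaking rules are added only to keep the result deterministic. The sample data is made up.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

LOW_CONFIDENCE = 0.5  # assumed cutoff; the claims do not specify one

Tokens = List[Tuple[str, float]]  # (hypothesized word, ASR confidence)


def find_repeated_low_confidence(segments: Dict[str, Tokens],
                                 cutoff: float = LOW_CONFIDENCE) -> Dict[str, Set[str]]:
    """Map each low-confidence word to the segments it occurs in, keeping
    only words that are low-confidence in more than one segment."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for seg_id, tokens in segments.items():
        for word, confidence in tokens:
            if confidence < cutoff:
                seen[word.lower()].add(seg_id)
    return {word: ids for word, ids in seen.items() if len(ids) > 1}


def select_first_segment(segments: Dict[str, Tokens],
                         cutoff: float = LOW_CONFIDENCE) -> Optional[str]:
    """Pick a segment containing the most widely repeated uncertain word."""
    repeated = find_repeated_low_confidence(segments, cutoff)
    if not repeated:
        return None
    # Ties between words are broken alphabetically, then by smallest segment id.
    word = max(sorted(repeated), key=lambda w: len(repeated[w]))
    return min(repeated[word])


segments = {
    "seg-01": [("welcome", 0.97), ("kubernetes", 0.31)],
    "seg-02": [("the", 0.99), ("roadmap", 0.88)],
    "seg-03": [("kubernetes", 0.42), ("pricing", 0.45)],
}
chosen = select_first_segment(segments)
print(chosen)  # seg-01
```

Sending such a segment to a transcriber first is attractive because one human correction can then resolve the same uncertain utterance wherever else it recurs.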

20. A non-transitory computer-readable medium having instructions stored thereon that, in response to execution by a system including a processor and memory, causes the system to perform operations comprising:
- receiving an audio recording comprising speech of multiple people in a room over a period spanning at least two hours;
  
  segmenting at least a portion of the audio recording, which was recorded during the first twenty minutes of the period, to segments;
  
  generating, utilizing an automatic speech recognition (ASR) system, a first transcription of a first segment from among the segments;
  
  receiving, from a first transcriber, a first phrase that does not appear in the first transcription, but was spoken in the first segment;
  
  adding the first phrase to a glossary;
  
  generating, utilizing the ASR system, a second transcription of a second segment of the audio recording;
  
  providing the second transcription and the glossary to a second transcriber; and
  
  receiving a corrected transcription, in which the second transcriber substituted a second phrase in the second transcription, which was not in the glossary, with the first phrase.
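
The segmenting step recited in claims 1, 12, and 20 operates on audio recorded during the first twenty minutes of the period. A minimal sketch of computing segment boundaries for that window follows. The fixed 60-second segment length and the function name `segment_bounds` are assumptions; the claims do not state how segment boundaries are chosen, and a real system could cut on pauses or speaker changes instead.

```python
from typing import List, Tuple


def segment_bounds(recording_seconds: float,
                   window_seconds: float = 20 * 60,
                   segment_seconds: float = 60.0) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, of fixed-length segments that
    cover the first `window_seconds` of the recording."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    end_of_window = min(recording_seconds, window_seconds)
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < end_of_window:
        stop = min(start + segment_seconds, end_of_window)
        bounds.append((start, stop))
        start = stop
    return bounds


# A two-hour recording: only the first twenty minutes are segmented here.
bounds = segment_bounds(recording_seconds=2 * 60 * 60)
print(len(bounds))   # 20
print(bounds[-1])    # (1140.0, 1200.0)
```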

Current Assignee: Verbit Software Ltd.
Original Assignee: Verbit Software Ltd.
Inventors: Shellef, Eric Ariel; Ben Tsvi, Yaakov Kobi; Getz, Iris; Livne, Tom; Himmelreich, Roman; Shtilerman, Elad
Primary Examiner(s): McFadden, Susan I
Application Number: US16/594,809
Time in Patent Office: 176 Days
Field of Search: 704257
US Class Current
CPC Class Codes

G06F 3/0484: for the control of specific...
G06F 40/20: Natural language analysis s...
G06F 40/30: Semantic analysis
G10L 15/01: Assessment or evaluation of...
G10L 15/02: Feature extraction for spee...
G10L 15/04: Segmentation; Word boundary...
G10L 15/063: Training
G10L 15/08: Speech classification or se...
G10L 15/1815: Semantic context, e.g. disa...
G10L 15/183: using context dependencies,...
G10L 15/187: Phonemic context, e.g. pron...
G10L 15/19: Grammatical context, e.g. d...
G10L 15/20: Speech recognition techniqu...
G10L 15/22: Procedures used during a sp...
G10L 15/26: Speech to text systems G10L...
G10L 15/30: Distributed recognition, e....
G10L 2015/0631: Creating reference template...
G10L 2015/0635: updating or merging of old ...
G10L 2015/0638: Interactive procedures
G10L 2015/223: Execution procedure of a sp...
G10L 25/60: for measuring the quality o...
H04R 1/406: microphones
H04R 3/005: for combining the signals o...
H04R 5/027: Spatial or constructional a...
