Real time machine learning-based indication of whether audio quality is suitable for transcription
Abstract
Maintaining adequate audio quality is very important for creating fast and accurate transcriptions, especially in a hybrid transcription setting, in which human transcribers review transcriptions generated by automatic speech recognition (ASR) systems. Some embodiments described herein involve detecting low-quality audio intended for transcription. In one embodiment, a server receives an audio recording that includes speech. The server generates feature values based on a segment of the audio recording and utilizes a model to calculate, based on the feature values, a certain value indicative of expected hybrid transcription quality of the segment. The model is generated based on training data that includes feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers. Optionally, an alert is provided responsive to the certain value being below a threshold.
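The pipeline the abstract describes — generating feature values from a segment, using a model to calculate a value indicative of expected hybrid transcription quality, and alerting when that value falls below a threshold — can be sketched as follows. This is an illustrative sketch only: the specific features (RMS energy, zero-crossing rate, clipping fraction) and the linear-sigmoid model are assumptions for demonstration, not features or a model named by the patent.

```python
import math

def segment_features(samples):
    """Per-segment feature values: RMS energy, zero-crossing rate,
    and fraction of (near-)clipped samples; samples assumed in [-1, 1]."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / (n - 1)
    clipped = sum(1 for s in samples if abs(s) >= 0.99) / n
    return [rms, zcr, clipped]

def expected_quality(features, weights, bias):
    """Toy linear model: maps feature values to a value in (0, 1)
    indicative of expected hybrid transcription quality."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def check_segment(samples, weights, bias, threshold=0.5):
    """Return (quality_value, alert); alert is True when the value
    falls below the threshold, as in the optional alert step."""
    value = expected_quality(segment_features(samples), weights, bias)
    return value, value < threshold
```

With hypothetical weights that penalize clipping heavily, a badly clipped segment scores low and triggers the alert, while a moderate-level tone does not.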
20 Claims
1. A system configured to detect low-quality audio use in hybrid transcription, comprising:

a frontend server configured to transmit an audio recording comprising speech of one or more people in a room; and

a backend server configured to generate feature values based on a segment of the audio recording, and to utilize a model to calculate, based on the feature values, a value indicative of expected hybrid transcription quality of the segment;

wherein the model is generated based on training data comprising feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers.

(Dependent claims: 2-10)
11. A method for detecting low-quality audio used for hybrid transcription, comprising:

receiving an audio recording comprising speech of one or more people in a room;

generating feature values based on a segment of the audio recording; and

utilizing a model to calculate, based on the feature values, a value indicative of expected hybrid transcription quality of the segment;

wherein the model is generated based on training data comprising feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers.

(Dependent claims: 12-19)
20. A non-transitory computer-readable medium having instructions stored thereon that, in response to execution by a system including a processor and memory, cause the system to perform operations comprising:

receiving an audio recording comprising speech of one or more people in a room;

generating feature values based on a segment of the audio recording; and

utilizing a model to calculate, based on the feature values, a value indicative of expected hybrid transcription quality of the segment;

wherein the model is generated based on training data comprising feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers.
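The "wherein" clause common to the claims describes the training data: feature values from previously recorded segments paired with transcription-quality metrics computed from transcriptions generated at least in part by human transcribers. A common such metric is word error rate (WER) between an ASR draft and the transcriber-corrected reference. The sketch below is one assumed way to build such (features, label) pairs — the WER-based label, the clipping to [0, 1], and the hypothetical feature vectors are illustration choices, not details from the patent.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance between a human-corrected
    reference and an ASR hypothesis, normalized by reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / len(r)

def build_training_data(records):
    """records: iterable of (feature_values, asr_text, corrected_text).
    Returns (features, quality_label) pairs, where the label is
    1 - WER clipped to [0, 1], so higher means better quality."""
    pairs = []
    for features, asr_text, corrected_text in records:
        wer = word_error_rate(corrected_text, asr_text)
        pairs.append((features, max(0.0, 1.0 - wer)))
    return pairs
```

The resulting pairs could then be fed to essentially any regressor to obtain a model of the kind the claims reference.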
Specification