Real time machine learning-based indication of whether audio quality is suitable for transcription
Abstract
Maintaining adequate audio quality is very important for creating fast and accurate transcriptions, especially in a hybrid transcription setting, in which human transcribers review transcriptions generated by automatic speech recognition (ASR) systems. Some embodiments described herein involve detecting low-quality audio intended for transcription. In one embodiment, a server receives an audio recording that includes speech. The server generates feature values based on a segment of the audio recording and utilizes a model to calculate, based on the feature values, a certain value indicative of expected hybrid transcription quality of the segment. The model is generated based on training data that includes feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers. Optionally, an alert is provided responsive to the certain value being below a threshold.
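The pipeline the abstract describes — generating feature values from a segment, using a model to calculate a value indicative of expected hybrid transcription quality, and alerting when that value falls below a threshold — can be sketched as follows. This is an illustrative sketch only: the specific features (RMS energy, zero-crossing rate, clipping fraction) and the linear-sigmoid model are assumptions for demonstration, not features or a model named by the patent.

```python
import math

def segment_features(samples):
    """Per-segment feature values: RMS energy, zero-crossing rate,
    and fraction of (near-)clipped samples; samples assumed in [-1, 1]."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / (n - 1)
    clipped = sum(1 for s in samples if abs(s) >= 0.99) / n
    return [rms, zcr, clipped]

def expected_quality(features, weights, bias):
    """Toy linear model: maps feature values to a value in (0, 1)
    indicative of expected hybrid transcription quality."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def check_segment(samples, weights, bias, threshold=0.5):
    """Return (quality_value, alert); alert is True when the value
    falls below the threshold, as in the optional alert step."""
    value = expected_quality(segment_features(samples), weights, bias)
    return value, value < threshold
```

With hypothetical weights that penalize clipping heavily, a badly clipped segment scores low and triggers the alert, while a moderate-level tone does not.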
20 Claims
1. A system configured to detect low-quality audio use in hybrid transcription, comprising:

a frontend server configured to transmit an audio recording comprising speech of one or more people in a room; and

a backend server configured to generate feature values based on a segment of the audio recording, and to utilize a model to calculate, based on the feature values, a value indicative of expected hybrid transcription quality of the segment;

wherein the model is generated based on training data comprising feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers.

(Dependent claims: 2-10)
11. A method for detecting low-quality audio used for hybrid transcription, comprising:

receiving an audio recording comprising speech of one or more people in a room;

generating feature values based on a segment of the audio recording; and

utilizing a model to calculate, based on the feature values, a value indicative of expected hybrid transcription quality of the segment;

wherein the model is generated based on training data comprising feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers.

(Dependent claims: 12-19)
20. A non-transitory computer-readable medium having instructions stored thereon that, in response to execution by a system including a processor and memory, cause the system to perform operations comprising:

receiving an audio recording comprising speech of one or more people in a room;

generating feature values based on a segment of the audio recording; and

utilizing a model to calculate, based on the feature values, a value indicative of expected hybrid transcription quality of the segment;

wherein the model is generated based on training data comprising feature values generated based on previously recorded segments of audio, and values of transcription-quality metrics generated based on transcriptions of the previously recorded segments, which were generated at least in part by human transcribers.
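The "wherein" clause common to the claims describes the training data: feature values from previously recorded segments paired with transcription-quality metrics computed from transcriptions generated at least in part by human transcribers. A common such metric is word error rate (WER) between an ASR draft and the transcriber-corrected reference. The sketch below is one assumed way to build such (features, label) pairs — the WER-based label, the clipping to [0, 1], and the hypothetical feature vectors are illustration choices, not details from the patent.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance between a human-corrected
    reference and an ASR hypothesis, normalized by reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / len(r)

def build_training_data(records):
    """records: iterable of (feature_values, asr_text, corrected_text).
    Returns (features, quality_label) pairs, where the label is
    1 - WER clipped to [0, 1], so higher means better quality."""
    pairs = []
    for features, asr_text, corrected_text in records:
        wer = word_error_rate(corrected_text, asr_text)
        pairs.append((features, max(0.0, 1.0 - wer)))
    return pairs
```

The resulting pairs could then be fed to essentially any regressor to obtain a model of the kind the claims reference.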
Specification