Human-based accent detection to assist rapid transcription with automatic speech recognition
First Claim
1. A system configured to utilize human assistance to apprise an automatic speech recognition (ASR) system about a spoken accent, comprising:
- a frontend server configured to transmit, to a backend server, an audio recording comprising speech of one or more people in a room over a period spanning at least two hours; and
the backend server is configured to perform the following:
I) during the first hour of the period:
calculate, for a certain segment of the audio recording, a plurality of values corresponding to a plurality of accents, respectively, wherein each value corresponding to a certain accent is indicative of a probability that a person who spoke in the certain segment had the certain accent;
select, based on the plurality of values, one or more candidate accents for the accent of the person who spoke in the certain segment;
provide a transcriber with an indication of the one or more candidate accents; and
receive, from the transcriber, after the transcriber listened to the certain segment, an indication indicative of an accent of the person who spoke in the certain segment; and
II) after receiving the indication:
provide the indication to the ASR system to be utilized to generate a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
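The calculate-and-select steps of the claim can be illustrated with a small sketch. It assumes the per-accent values are already available as probabilities in a dictionary. The function name `select_candidate_accents`, the `max_candidates` and `min_probability` parameters, and the accent labels are hypothetical; the claim does not prescribe any particular selection rule, so the top-k-with-threshold rule here is only one possible choice.

```python
from typing import Dict, List


def select_candidate_accents(
    values: Dict[str, float],
    max_candidates: int = 3,
    min_probability: float = 0.10,
) -> List[str]:
    """Return up to max_candidates accents, most probable first.

    values maps an accent label to the probability that the speaker in the
    segment had that accent. Both parameters are illustrative assumptions.
    """
    if not values:
        return []
    # Rank accents from most to least probable.
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    selected = [
        accent for accent, p in ranked[:max_candidates] if p >= min_probability
    ]
    # Fall back to the single most probable accent if none clears the threshold,
    # so the transcriber always receives at least one candidate.
    return selected or [ranked[0][0]]


# Hypothetical values for one segment.
scores = {
    "Indian English": 0.55,
    "Scottish English": 0.30,
    "Australian English": 0.10,
    "Irish English": 0.05,
}
candidates = select_candidate_accents(scores, max_candidates=2)
print(candidates)  # ['Indian English', 'Scottish English']
```

The returned list is what would be shown to the transcriber as the indication of candidate accents; the transcriber's answer, not this list, is what is later passed to the ASR system.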
Abstract
Knowing what accent is spoken can assist automatic speech recognition (ASR) systems to more accurately transcribe audio. In one embodiment, a system includes a frontend server configured to transmit, to a backend server, an audio recording that includes speech of one or more people in a room over a period spanning at least two hours. At some time during the first hour of the period, the backend server provides a transcriber with a certain segment of the audio recording, and receives, from the transcriber, after the transcriber listened to the certain segment, an indication indicative of an accent of a person who spoke in the certain segment. The backend server then provides the indication to an ASR system to be utilized to generate a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
71 Citations
20 Claims
1. A system configured to utilize human assistance to apprise an automatic speech recognition (ASR) system about a spoken accent, comprising:
a frontend server configured to transmit, to a backend server, an audio recording comprising speech of one or more people in a room over a period spanning at least two hours; and
the backend server is configured to perform the following:
I) during the first hour of the period:
calculate, for a certain segment of the audio recording, a plurality of values corresponding to a plurality of accents, respectively, wherein each value corresponding to a certain accent is indicative of a probability that a person who spoke in the certain segment had the certain accent;
select, based on the plurality of values, one or more candidate accents for the accent of the person who spoke in the certain segment;
provide a transcriber with an indication of the one or more candidate accents; and
receive, from the transcriber, after the transcriber listened to the certain segment, an indication indicative of an accent of the person who spoke in the certain segment; and
II) after receiving the indication:
provide the indication to the ASR system to be utilized to generate a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method for utilizing human assistance to apprise an automatic speech recognition (ASR) system about a spoken accent, comprising:
receiving an audio recording comprising speech of one or more people in a room over a period spanning at least two hours;
segmenting at least a portion of the audio recording, which was recorded during the first twenty minutes of the period, to segments;
calculating, for a certain segment from among the segments, a plurality of values corresponding to a plurality of accents, respectively, wherein each value corresponding to a certain accent is indicative of a probability that a person who spoke in the certain segment had the certain accent;
selecting, based on the plurality of values, one or more candidate accents for the accent of the person who spoke in the certain segment;
providing a transcriber with an indication of the one or more candidate accents;
receiving, from the transcriber, after the transcriber listened to a certain segment from among the segments, an indication indicative of an accent of a person who spoke in the certain segment; and
generating, by an ASR system and utilizing the indication, a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
Dependent claims: 15, 16, 17, 19, 20
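The timing constraints in the method claim (segments drawn from the first twenty minutes, and an additional portion recorded after them) can be sketched as follows. Fixed-length segmentation is an assumption made only for illustration, since the claim does not say how segments are formed; the names `segment_initial_portion`, `is_additional_portion`, and the 30-second default are hypothetical.

```python
from typing import List, Tuple

INITIAL_WINDOW_S = 20 * 60  # the "first twenty minutes" named in the claim


def segment_initial_portion(
    segment_s: int = 30, window_s: int = INITIAL_WINDOW_S
) -> List[Tuple[int, int]]:
    """Split [0, window_s) into consecutive (start, end) segments, in seconds.

    The last segment is shortened if window_s is not a multiple of segment_s.
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    return [
        (start, min(start + segment_s, window_s))
        for start in range(0, window_s, segment_s)
    ]


def is_additional_portion(start_s: int, window_s: int = INITIAL_WINDOW_S) -> bool:
    """True if audio starting at start_s was recorded after the initial window."""
    return start_s >= window_s


segments = segment_initial_portion(segment_s=30)
print(len(segments), segments[0], segments[-1])  # 40 (0, 30) (1170, 1200)
```

Under these assumptions, any one of the returned segments could serve as the "certain segment" played to the transcriber, while audio for which `is_additional_portion` returns `True` is the part the ASR system would transcribe using the transcriber's accent indication.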
18. A non-transitory computer-readable medium having instructions stored thereon that, in response to execution by a system including a processor and memory, causes the system to perform operations comprising:
receiving an audio recording comprising speech of one or more people in a room over a period spanning at least two hours;
segmenting at least a portion of the audio recording, which was recorded during the first twenty minutes of the period, to segments;
calculating, for a certain segment from among the segments, a plurality of values corresponding to a plurality of accents, respectively, wherein each value corresponding to a certain accent is indicative of a probability that a person who spoke in the certain segment had the certain accent;
selecting, based on the plurality of values, one or more candidate accents for the accent of the person who spoke in the certain segment;
providing a transcriber with an indication of the one or more candidate accents;
receiving, from the transcriber, after the transcriber listened to a certain segment from among the segments, an indication indicative of an accent of a person who spoke in the certain segment; and
generating, by an ASR system and utilizing the indication, a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
Specification