Human-based accent detection to assist rapid transcription with automatic speech recognition

US 10,726,834 B1
Filed: 10/07/2019
Issued: 07/28/2020
Est. Priority Date: 09/06/2019
Status: Active Grant

Abstract

Knowing what accent is spoken can help automatic speech recognition (ASR) systems transcribe audio more accurately. In one embodiment, a system includes a frontend server configured to transmit, to a backend server, an audio recording that includes speech of one or more people in a room over a period spanning at least two hours. At some time during the first hour of the period, the backend server provides a transcriber with a certain segment of the audio recording, and receives, from the transcriber, after the transcriber listened to the certain segment, an indication indicative of an accent of a person who spoke in the certain segment. The backend server then provides the indication to an ASR system to be utilized to generate a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.

20 Claims

1. A system configured to utilize human assistance to apprise an automatic speech recognition (ASR) system about a spoken accent, comprising:
- a frontend server configured to transmit, to a backend server, an audio recording comprising speech of one or more people in a room over a period spanning at least two hours; and
  
  the backend server is configured to perform the following:
  
  I) during the first hour of the period:
  
  calculate, for a certain segment of the audio recording, a plurality of values corresponding to a plurality of accents, respectively, wherein each value corresponding to a certain accent is indicative of a probability that a person who spoke in the certain segment had the certain accent;
  
  select, based on the plurality of values, one or more candidate accents for the accent of the person who spoke in the certain segment;
  
  provide a transcriber with an indication of the one or more candidate accents; and
  
  receive, from the transcriber, after the transcriber listened to the certain segment, an indication indicative of an accent of the person who spoke in the certain segment; and
  
  II) after receiving the indication:
  
  provide the indication to the ASR system to be utilized to generate a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
  - 2. The system of claim 1, wherein the audio recording comprises two or more channels of audio, and further comprising two or more microphones, at least 40 cm away from each other, which are configured to record the two or more channels, respectively.
  - 3. The system of claim 2, wherein the certain segment comprises a recording from a certain channel of audio, from among the two or more channels, recorded by a certain microphone from among the two or more microphones, which is closer to the person than the other microphones from among the two or more microphones;
    - and wherein the ASR system is configured to utilize the indication to generate transcriptions of one or more segments comprising audio from the certain channel.
  - 4. The system of claim 1, wherein the backend server is further configured to utilize the indication to select a certain phonetic model, from among a plurality of phonetic models corresponding to the plurality of accents, to be utilized by the ASR system to generate the transcription of the additional portion of the audio.
  - 5. The system of claim 1, wherein the backend server is further configured to provide an input based on the indication to a seq2seq network utilized by the ASR system to generate the transcription.
  - 6. The system of claim 1, wherein the backend server is further configured to identify a specific accent spoken in the certain segment, and to provide the certain segment to the transcriber responsive to confidence in identification of the specific accent being below a threshold.
  - 7. The system of claim 1, wherein the backend server is further configured to provide a transcription of the certain segment generated by the ASR system to the transcriber, and receive, from the transcriber, one or more corrections to the transcription of the certain segment;
    - and wherein the one or more corrections comprise a phrase that did not appear in the transcription of the certain segment and the phrase is utilized to expand a language model utilized by the ASR system to generate the transcription of the additional portion of the audio recording.
  - 8. The system of claim 1, wherein the backend server is further configured to perform the following prior to a target completion time that is less than eight hours after the end of the period:
    - receive additional transcriptions, generated by the ASR system utilizing the indication, of additional segments of the audio recording, which were recorded after the first twenty minutes of the period;
      
      provide the additional transcriptions and the additional segments to one or more transcribers;
      
      update the additional transcriptions based on corrections made by the one or more transcribers;
      
      and generate a transcription of the speech of the one or more people during the period based on data comprising the additional transcriptions of the additional segments of the audio.
  - 9. The system of claim 8, wherein the backend server is further configured to select the one or more transcribers from a pool of a plurality of transcribers based on prior performance of at least some of the plurality of transcribers when reviewing transcriptions involving speech with the accent.
  - 10. The system of claim 1, wherein the backend server is further configured to transmit a live transcription, generated by the ASR system utilizing the indication, of at least some of the speech of the one or more people while they speak.
  - 11. The system of claim 1, wherein the backend server is further configured to determine confidence in transcriptions of segments of the audio recording, generated by the ASR system, and select the certain segment based on a confidence in a transcription of the certain segment being below a threshold.
  - 12. The system of claim 1, wherein the backend server is further configured to:
    - (i) calculate, utilizing a certain model and based on a transcription of the certain segment generated by the ASR system, values indicative of suitability of various transcribers to transcribe the certain segment, and (ii) utilize the values to select the transcriber from among the various transcribers; and
      
      wherein a value indicative of a suitability of the transcriber is greater than values indicative of suitability of most of the various transcribers.
  - 13. The system of claim 12, wherein the certain model is generated based on training data comprising:
    - (i) feature values generated from transcriptions by the transcriber of one or more segments of audio that included speech in the accent, and (ii) labels indicative of quality of the transcriptions.
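
The calculate-and-select steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the accent classifier emits one raw score per accent, and the softmax normalization, the `top_k` and `min_prob` parameters, and all function names are hypothetical choices made for illustration.

```python
import math


def accent_probabilities(scores):
    """Turn raw per-accent scores into probabilities (softmax is an assumption)."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {accent: math.exp(s - peak) for accent, s in scores.items()}
    total = sum(exps.values())
    return {accent: e / total for accent, e in exps.items()}


def select_candidate_accents(probs, top_k=3, min_prob=0.10):
    """Pick up to top_k accents whose probability is at least min_prob.

    Falls back to the single most probable accent so the transcriber
    always receives at least one candidate.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    candidates = [accent for accent, p in ranked[:top_k] if p >= min_prob]
    return candidates or [ranked[0][0]]


# Hypothetical scores for one segment of the recording.
segment_scores = {"indian": 2.0, "scottish": 1.5, "australian": -1.0, "irish": -2.0}
probs = accent_probabilities(segment_scores)
candidates = select_candidate_accents(probs)
```

In this sketch the candidate list is what would be shown to the transcriber, whose confirmed choice would then be passed on to the ASR system.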

14. A method for utilizing human assistance to apprise an automatic speech recognition (ASR) system about a spoken accent, comprising:
- receiving an audio recording comprising speech of one or more people in a room over a period spanning at least two hours;
  
  segmenting at least a portion of the audio recording, which was recorded during the first twenty minutes of the period, to segments;
  
  calculating, for a certain segment from among the segments, a plurality of values corresponding to a plurality of accents, respectively, wherein each value corresponding to a certain accent is indicative of a probability that a person who spoke in the certain segment had the certain accent;
  
  selecting, based on the plurality of values, one or more candidate accents for the accent of the person who spoke in the certain segment;
  
  providing a transcriber with an indication of the one or more candidate accents;
  
  receiving, from the transcriber, after the transcriber listened to a certain segment from among the segments, an indication indicative of an accent of a person who spoke in the certain segment; and
  
  generating, by an ASR system and utilizing the indication, a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
  - 15. The method of claim 14, further comprising selecting, based on the indication, a certain phonetic model, from among a plurality of phonetic models corresponding to the plurality of accents, and utilizing the certain phonetic model to generate the transcription of the additional portion of the audio.
  - 16. The method of claim 14, further comprising:
    - generating a certain transcription of the certain segment;
      
      providing the certain transcription to the transcriber;
      
      receiving from the transcriber one or more corrections to the certain transcription, which comprise a phrase that did not appear in the transcription of the certain segment;
      
      expanding a language model with the phrase; and
      
      utilizing the language model to generate the transcription of the additional portion of the audio recording.
  - 17. The method of claim 14, further comprising performing the following prior to a target completion time that is less than eight hours after the end of the period:
    - receiving additional transcriptions, generated by the ASR system utilizing the indication, of additional segments of the audio recording, which were recorded after the first twenty minutes of the period;
      
      providing the additional transcriptions and the additional segments to one or more transcribers;
      
      updating the additional transcriptions based on corrections made by the one or more transcribers; and
      
      generating a transcription of the speech of the one or more people during the period based on data comprising the additional transcriptions of the additional segments of the audio.
  - 19. The method of claim 17, further comprising selecting the one or more transcribers from a pool of a plurality of transcribers based on prior performance of at least some of the plurality of transcribers when reviewing transcriptions involving speech with the accent.
  - 20. The method of claim 14, further comprising:
    - calculating, utilizing a certain model and based on a transcription of the certain segment generated by the ASR system, values indicative of suitability of various transcribers to transcribe the certain segment, and utilizing the values to select the transcriber from among the various transcribers;
      
      wherein a value indicative of a suitability of the transcriber is greater than values indicative of suitability of most of the various transcribers.
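
The transcriber selection recited in claims 12 and 20 can be sketched as follows. This is a hedged illustration only: the suitability values are assumed to have already been produced by some model, the highest-value rule is one possible policy, and the names `select_transcriber` and `exceeds_most` are hypothetical.

```python
def select_transcriber(suitability):
    """Return the transcriber with the highest suitability value."""
    if not suitability:
        raise ValueError("no transcribers to choose from")
    return max(suitability, key=suitability.get)


def exceeds_most(suitability, chosen):
    """Check that the chosen value is greater than more than half of the others."""
    others = [v for name, v in suitability.items() if name != chosen]
    beaten = sum(1 for v in others if suitability[chosen] > v)
    return beaten > len(others) / 2


# Hypothetical suitability values for one segment.
values = {"t1": 0.62, "t2": 0.91, "t3": 0.40, "t4": 0.75}
chosen = select_transcriber(values)
```

Picking the maximum trivially satisfies the "greater than most" condition; a system could equally pick any available transcriber for whom `exceeds_most` holds.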

18. A non-transitory computer-readable medium having instructions stored thereon that, in response to execution by a system including a processor and memory, causes the system to perform operations comprising:
- receiving an audio recording comprising speech of one or more people in a room over a period spanning at least two hours;
  
  segmenting at least a portion of the audio recording, which was recorded during the first twenty minutes of the period, to segments;
  
  calculating, for a certain segment from among the segments, a plurality of values corresponding to a plurality of accents, respectively, wherein each value corresponding to a certain accent is indicative of a probability that a person who spoke in the certain segment had the certain accent;
  
  selecting, based on the plurality of values, one or more candidate accents for the accent of the person who spoke in the certain segment;
  
  providing a transcriber with an indication of the one or more candidate accents;
  
  receiving, from the transcriber, after the transcriber listened to a certain segment from among the segments, an indication indicative of an accent of a person who spoke in the certain segment; and
  
  generating, by an ASR system and utilizing the indication, a transcription of an additional portion of the audio recording, which was recorded after the first twenty minutes of the period.
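
The timing constraints in claims 14 and 18 (segments drawn from the first twenty minutes, transcription of audio recorded afterwards) can be sketched as below. The claims do not say how segments are formed; fixed-length segments, the 30-second default, and the function names are assumptions made purely for illustration.

```python
FIRST_WINDOW_S = 20 * 60  # the first twenty minutes of the period, in seconds


def segment_first_window(recording_s, segment_s=30.0):
    """Return (start, end) times in seconds covering the first twenty minutes."""
    end_of_window = min(recording_s, FIRST_WINDOW_S)
    segments = []
    start = 0.0
    while start < end_of_window:
        end = min(start + segment_s, end_of_window)
        segments.append((start, end))
        start = end
    return segments


def is_additional_portion(start_s):
    """True if audio starting at start_s was recorded after the first twenty minutes."""
    return start_s >= FIRST_WINDOW_S


# A two-hour recording, the minimum period length recited in the claims.
segments = segment_first_window(2 * 3600)
```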

Current Assignee: Verbit Software Ltd.
Original Assignee: Verbit Software Ltd.
Inventors: Shellef, Eric Ariel; Ben Tsvi, Yaakov Kobi; Getz, Iris; Livne, Tom; Himmelreich, Roman; Shtilerman, Elad; Asor, Eli
Primary Examiner(s): Hang, Vu B
Application Number: US16/594,471
Time in Patent Office: 295 Days
CPC Class Codes

G06F 3/0484   for the control of specific...

G06F 40/20   Natural language analysis s...

G06F 40/30   Semantic analysis

G10L 15/01   Assessment or evaluation of...

G10L 15/02   Feature extraction for spee...

G10L 15/04   Segmentation; Word boundary...

G10L 15/063   Training

G10L 15/08   Speech classification or se...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/183   using context dependencies,...

G10L 15/187   Phonemic context, e.g. pron...

G10L 15/19   Grammatical context, e.g. d...

G10L 15/20   Speech recognition techniqu...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 15/30   Distributed recognition, e....

G10L 2015/0631   Creating reference template...

G10L 2015/0635   updating or merging of old ...

G10L 2015/0638   Interactive procedures

G10L 2015/223   Execution procedure of a sp...

G10L 25/60   for measuring the quality o...

H04R 1/406   microphones

H04R 3/005   for combining the signals o...

H04R 5/027   Spatial or constructional a...
