Transcription generation from multiple speech recognition systems

US 10,573,312 B1
Filed: 12/04/2018
Issued: 02/25/2020
Est. Priority Date: 12/04/2018
Status: Active Grant

Abstract

A method may include obtaining first audio data originating at a first device during a communication session between the first device and a second device. The method may also include obtaining a first text string that is a transcription of the first audio data, where the first text string may be generated using automatic speech recognition technology using the first audio data. The method may also include obtaining a second text string that is a transcription of second audio data, where the second audio data may include a revoicing of the first audio data by a captioning assistant and the second text string may be generated by the automatic speech recognition technology using the second audio data. The method may further include generating an output text string from the first text string and the second text string and using the output text string as a transcription of the speech.

20 Claims

1. A method comprising:
- obtaining first audio data originating at a first device during a communication session between the first device and a second device, the communication session configured for verbal communication such that the first audio data includes speech;
  
  obtaining a first text string that is a transcription of the first audio data, the first text string generated by a first automatic speech recognition system using the first audio data and using a first model trained for a plurality of individuals;
  
  obtaining a second text string that is a transcription of second audio data, the second audio data including a revoicing of the first audio data by a captioning assistant and the second text string generated by a second automatic speech recognition system using the second audio data and using a second model trained for the captioning assistant;
  
  obtaining a third text string that is a transcription of the first audio data or the second audio data, the third text string generated by a third automatic speech recognition system using a third model;
  
  generating an output text string from the first text string, the second text string, and the third text string; and
  
  providing the output text string as a transcription of the speech to the second device for presentation during the communication session concurrently with the presentation of the first audio data by the second device.
- Dependent claims (2 to 10):
  - 2. The method of claim 1, wherein the first model includes one or more of the following:
    - a feature model, a transform model, an acoustic model, a language model, and a pronunciation model.
  - 3. The method of claim 1, wherein generating the output text string further includes:
    - de-normalizing the first text string and the second text string;
      
      aligning the first text string and the second text string; and
      
      comparing the aligned and de-normalized first and second text strings.
  - 4. The method of claim 1, wherein generating the output text string further includes:
    - selecting one or more second words from the second text string for the output text string based on the first text string and the second text string both including the one or more second words; and
      
      selecting one or more first words from the first text string for the output text string based on the second text string not including the one or more first words.
  - 5. The method of claim 1, further comprising correcting at least one word in one or more of:
    - the output text string, the first text string, and the second text string based on input obtained from a device associated with the captioning assistant.
  - 6. The method of claim 5, wherein the input obtained from the device is based on a fourth text string generated by the first automatic speech recognition system using the first audio data.
  - 7. The method of claim 6, wherein the first text string and the fourth text string are both hypotheses generated by the first automatic speech recognition system for the substantially same portion of the first audio data.
  - 8. The method of claim 1, wherein the third text string is a transcription of the first audio data, the method further comprising obtaining a fourth text string that is a transcription of the second audio data, the fourth text string generated by a fourth automatic speech recognition system using the second audio data and using a fourth model, wherein the output text string is generated from the first text string, the second text string, the third text string, and the fourth text string.
  - 9. The method of claim 1, further comprising:
    - obtaining fourth audio data that includes speech and that originates at the first device during the communication session;
      
      obtaining a fourth text string that is a transcription of the fourth audio data, the fourth text string generated by the first automatic speech recognition system using the fourth audio data and using the first model; and
      
      in response to either no revoicing of the fourth audio data or a fifth transcription generated using the second automatic speech recognition system having a quality measure satisfying a quality threshold, generating a second output text string using only the fourth text string.
  - 10. At least one non-transitory computer-readable media configured to store one or more instructions that in response to being executed by at least one computing system cause performance of the method of claim 1.
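
The align, compare, and select steps of claims 3 and 4 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names `canonicalize` and `fuse_two` are hypothetical, lowercasing and punctuation stripping is only a stand-in for the de-normalizing step (which the claims do not detail), alignment uses the standard-library `difflib.SequenceMatcher`, and keeping words that appear only in the second text string is an assumption, since claim 4 does not address that case.

```python
import re
from difflib import SequenceMatcher


def canonicalize(text: str) -> list[str]:
    """Lowercase and strip punctuation so the strings are comparable.

    Assumption: a simple stand-in for the de-normalizing step of claim 3.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def fuse_two(first_text: str, second_text: str) -> str:
    """Fuse a speaker-independent ASR string with a revoiced ASR string."""
    first = canonicalize(first_text)
    second = canonicalize(second_text)
    # Word-level alignment of the two hypotheses.
    matcher = SequenceMatcher(a=first, b=second, autojunk=False)
    output: list[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Both strings include these words: take them from the second.
            output.extend(second[j1:j2])
        elif tag in ("replace", "delete"):
            # The second string lacks these first-string words: take the first.
            output.extend(first[i1:i2])
        else:
            # "insert": words only in the second string (assumption: keep them).
            output.extend(second[j1:j2])
    return " ".join(output)


print(fuse_two("Please call me at nine thirty.",
               "please call at nine thirty tomorrow"))
```

In this sketch the first text string wins every disagreement, which follows the wording of claim 4 literally; a production system would more plausibly weigh per-word confidence from each recognizer.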

11. A method comprising:
- obtaining first audio data originating at a first device during a communication session between the first device and a second device, the communication session configured for verbal communication such that the first audio data includes speech;
  
  obtaining a first text string that is a transcription of the first audio data, the first text string generated using automatic speech recognition technology using the first audio data;
  
  obtaining a second text string that is a transcription of second audio data, the second audio data including a revoicing of the first audio data by a captioning assistant and the second text string generated by the automatic speech recognition technology using the second audio data;
  
  obtaining a third text string that is a transcription of the first audio data or the second audio data, the third text string generated by the automatic speech recognition technology;
  
  generating an output text string from the first text string, the second text string, and the third text string; and
  
  using the output text string as a transcription of the speech.
- Dependent claims (12 to 18):
  - 12. The method of claim 11, wherein the automatic speech recognition technology used to generate the first text string is a first automatic speech recognition system that includes a first model trained for a plurality of individuals and the automatic speech recognition technology used to generate the second text string is a second automatic speech recognition system that includes a second model adapted to the captioning assistant.
  - 13. The method of claim 11, wherein the output text string includes one or more first words from the first text string and one or more second words from the second text string.
  - 14. The method of claim 11, further comprising correcting at least one word in one or more of:
    - the output text string, the first text string, and the second text string based on input obtained from a device associated with the captioning assistant.
  - 15. The method of claim 14, wherein the input obtained from the device is based on a fourth text string generated by the automatic speech recognition technology using the first audio data.
  - 16. The method of claim 15, wherein the first text string and the fourth text string are both hypotheses generated by the automatic speech recognition technology for the substantially same portion of the first audio data.
  - 17. The method of claim 11, further comprising:
    - obtaining third audio data that includes speech and that originates at the first device during the communication session;
      
      obtaining a fourth text string that is a transcription of the third audio data, the fourth text string generated by the automatic speech recognition technology using the third audio data; and
      
      in response to either no revoicing of the third audio data or a fourth transcription, generated using the automatic speech recognition technology and revoicing of the third audio data, having a quality measure satisfying a quality threshold, generating a second output text string using only the fourth text string.
  - 18. At least one non-transitory computer-readable media configured to store one or more instructions that in response to being executed by at least one computing system cause performance of the method of claim 11.
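
The fallback condition of claim 17 (and the parallel claim 9) can be sketched as a small decision function. This is an illustration under stated assumptions, not the claimed method: the `Hypothesis` type, the `quality` score in the range 0 to 1, the `min_quality` default, and the reading of "satisfying a quality threshold" as "falling below a minimum quality" are all hypothetical choices that the claims do not fix.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    text: str
    quality: float  # hypothetical score in [0, 1]; higher is better


def choose_output(
    asr: Hypothesis,
    revoiced: Optional[Hypothesis],
    fuse: Callable[[str, str], str],
    min_quality: float = 0.6,  # assumed threshold, not taken from the patent
) -> str:
    """Return the ASR-only text when revoicing is absent or judged too poor."""
    # Case 1: no revoicing of this audio segment exists.
    # Case 2: the revoiced transcription falls below the assumed threshold.
    if revoiced is None or revoiced.quality < min_quality:
        return asr.text
    # Otherwise combine both hypotheses with the supplied fusion function.
    return fuse(asr.text, revoiced.text)


def prefer_revoiced(a: str, b: str) -> str:
    """Trivial fusion stand-in for the example: always take the revoiced text."""
    return b


print(choose_output(Hypothesis("see you soon", 0.9), None, prefer_revoiced))
```

Passing the fusion step in as a callable keeps the fallback decision independent of how the text strings are actually combined.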

19. A method comprising:
- obtaining first audio data originating at a first device during a communication session between the first device and a second device, the communication session configured for verbal communication such that the first audio data includes speech;
  
  obtaining a first text string that is a transcription of the first audio data, the first text string generated using automatic speech recognition technology using the first audio data;
  
  obtaining a second text string that is a transcription of second audio data, the second audio data including a revoicing of the first audio data and the second text string generated by the automatic speech recognition technology using the second audio data;
  
  obtaining a third text string that is a transcription of the first audio data or the second audio data, the third text string generated by the automatic speech recognition technology;
  
  generating an output text string from the first text string, the second text string, and the third text string, the output text string including one or more words based on at least two of the first text string, the second text string, and the third text string including the one or more words; and
  
  providing the output text string as a transcription of the speech to the second device for presentation during the communication session by the second device.
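
The agreement rule in claim 19, where a word enters the output because at least two of the three text strings include it, resembles word-level majority voting. The sketch below is one possible reading, not the patented algorithm. The names `align_to` and `majority_vote` are hypothetical, the first text string serves as the alignment backbone, words the other strings insert between backbone positions are ignored, and falling back to the first string when no two strings agree is an assumption.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Optional


def align_to(backbone: List[str], other: List[str]) -> List[Optional[str]]:
    """For each backbone position, return the aligned word of `other`, or None."""
    aligned: List[Optional[str]] = [None] * len(backbone)
    matcher = SequenceMatcher(a=backbone, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length spans give an unambiguous one-to-one pairing.
        if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
            aligned[i1:i2] = other[j1:j2]
    return aligned


def majority_vote(first: str, second: str, third: str) -> str:
    """Pick, per backbone position, the word that at least two strings share."""
    backbone = first.lower().split()
    columns = zip(
        backbone,
        align_to(backbone, second.lower().split()),
        align_to(backbone, third.lower().split()),
    )
    output = []
    for candidates in columns:
        counts = Counter(word for word in candidates if word is not None)
        word, votes = counts.most_common(1)[0]
        # Assumption: with no two-way agreement, keep the first string's word.
        output.append(word if votes >= 2 else candidates[0])
    return " ".join(output)


print(majority_vote("i will meat you at noon",
                    "i will meet you at noon",
                    "i will meet you at new"))
```

Here the misrecognized "meat" is outvoted by the two strings containing "meet", and "new" is outvoted by the two strings containing "noon".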

20. A system comprising:
- one or more processors; and
  
  at least one non-transitory computer-readable media coupled to the one or more processors, the at least one non-transitory computer-readable media configured to store one or more instructions that in response to being executed by the one or more processors cause the system to perform operations, the operations comprising:
  
  obtain first audio data originating at a first device during a communication session between the first device and a second device, the communication session configured for verbal communication such that the first audio data includes speech;
  
  obtain a first text string that is a transcription of the first audio data, the first text string generated using automatic speech recognition technology using the first audio data;
  
  obtain a second text string that is a transcription of second audio data, the second audio data including a revoicing of the first audio data and the second text string generated by the automatic speech recognition technology using the second audio data;
  
  obtain a third text string that is a transcription of the first audio data or the second audio data, the third text string generated by the automatic speech recognition technology;
  
  generate an output text string from the first text string, the second text string, and the third text string; and
  
  provide the output text string as a transcription of the speech.

Current Assignee
Sorenson IP Holdings, LLC
Original Assignee
Sorenson IP Holdings, LLC
Inventors
Thomson, David; Adams, Jadie; Skaggs, Jonathan; McClellan, Joshua; Roylance, Shane
Primary Examiner(s)
Leland, III, Edwin S

Application Number

US16/209,623
Time in Patent Office

448 Days
Field of Search

704235
US Class Current
CPC Class Codes

G10L 15/187   Phonemic context, e.g. pron...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 15/30   Distributed recognition, e....

G10L 15/32   Multiple recognisers used i...

H04M 1/2475   for a hearing impaired user

H04M 2201/40   using speech recognition

H04M 3/42391   where the subscribers are h...
