Recognizing speech in multiple languages
First Claim
1. A computer-implemented method comprising:
receiving audio that encodes an utterance;
selecting, from among multiple automated speech recognizers that are each associated with a different natural language, a particular subset of the automated speech recognizers;
providing the audio to (i) a first automated speech recognizer of the particular subset that is associated with a first natural language, (ii) a second automated speech recognizer of the particular subset that is associated with a different, second, natural language, and (iii) an automated language identifier;
receiving (i) a first transcription of the utterance, in the first natural language, and a first speech recognition confidence score for the first transcription, from the first automated speech recognizer, (ii) a second transcription of the utterance, in the second natural language, and a second speech recognition confidence score for the second transcription, from the second automated speech recognizer, and (iii) data indicating a language associated with the utterance, from the automated language identifier;
after receiving (i) the first transcription, (ii) the second transcription, and (iii) the data indicating the language associated with the utterance, selecting, from among the first transcription of the utterance and the second transcription of the utterance, a particular transcription to output as a representative transcription of the utterance based at least on (a) the first speech recognition confidence score, (b) the second speech recognition confidence score, and (c) the data indicating the language; and
providing the representative transcription for output.
Abstract
Speech recognition systems may perform the following operations: receiving audio; recognizing the audio using language models for different languages to produce recognition candidates for the audio, where the recognition candidates are associated with corresponding recognition scores; identifying a candidate language for the audio; selecting a recognition candidate based on the recognition scores and the candidate language; and outputting data corresponding to the selected recognition candidate as a recognized version of the audio.
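The operations in the abstract can be sketched as a short Python program. The recognizer and language-identifier interfaces below are illustrative assumptions for the sketch, not part of the patent text: each per-language recognizer is modeled as a callable returning a transcription and a confidence score, and the language identifier as a callable returning a language code.

```python
# Hypothetical sketch of the recognition flow described in the abstract.
# All interfaces here are assumptions chosen for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    language: str        # natural language of the recognizer, e.g. "en"
    transcription: str   # that recognizer's transcription of the utterance
    score: float         # recognition confidence score, assumed in [0, 1]

def select_transcription(
    audio: bytes,
    recognizers: dict[str, Callable[[bytes], tuple[str, float]]],
    identify_language: Callable[[bytes], str],
) -> str:
    """Run each per-language recognizer plus a language identifier, then
    select one transcription using both the scores and the identified
    language."""
    candidates = []
    for language, recognize in recognizers.items():
        transcription, score = recognize(audio)
        candidates.append(Candidate(language, transcription, score))

    identified = identify_language(audio)

    # Prefer a candidate whose language matches the identified language;
    # break ties (and handle no match) by the higher confidence score.
    best = max(candidates, key=lambda c: (c.language == identified, c.score))
    return best.transcription
```

Note that the match-then-score ordering is one possible policy; the claims only require that the selection be based at least on the scores and the language data.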
18 Claims
1. A computer-implemented method comprising:
receiving audio that encodes an utterance;
selecting, from among multiple automated speech recognizers that are each associated with a different natural language, a particular subset of the automated speech recognizers;
providing the audio to (i) a first automated speech recognizer of the particular subset that is associated with a first natural language, (ii) a second automated speech recognizer of the particular subset that is associated with a different, second, natural language, and (iii) an automated language identifier;
receiving (i) a first transcription of the utterance, in the first natural language, and a first speech recognition confidence score for the first transcription, from the first automated speech recognizer, (ii) a second transcription of the utterance, in the second natural language, and a second speech recognition confidence score for the second transcription, from the second automated speech recognizer, and (iii) data indicating a language associated with the utterance, from the automated language identifier;
after receiving (i) the first transcription, (ii) the second transcription, and (iii) the data indicating the language associated with the utterance, selecting, from among the first transcription of the utterance and the second transcription of the utterance, a particular transcription to output as a representative transcription of the utterance based at least on (a) the first speech recognition confidence score, (b) the second speech recognition confidence score, and (c) the data indicating the language; and
providing the representative transcription for output.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer-readable storage device having instructions stored thereon that, when executed by a computing device, cause the computing device to perform operations comprising:
receiving audio that encodes an utterance;
selecting, from among multiple automated speech recognizers that are each associated with a different natural language, a particular subset of the automated speech recognizers;
providing the audio to (i) a first automated speech recognizer of the particular subset that is associated with a first natural language, (ii) a second automated speech recognizer of the particular subset that is associated with a different, second, natural language, and (iii) an automated language identifier;
receiving (i) a first transcription of the utterance, in the first natural language, and a first speech recognition confidence score for the first transcription, from the first automated speech recognizer, (ii) a second transcription of the utterance, in the second natural language, and a second speech recognition confidence score for the second transcription, from the second automated speech recognizer, and (iii) data indicating a language associated with the utterance, from the automated language identifier;
after receiving (i) the first transcription, (ii) the second transcription, and (iii) the data indicating the language associated with the utterance, selecting, from among the first transcription of the utterance and the second transcription of the utterance, a particular transcription to output as a representative transcription of the utterance based at least on (a) the first speech recognition confidence score, (b) the second speech recognition confidence score, and (c) the data indicating the language; and
providing the representative transcription for output.
Dependent claims: 10, 11, 12
13. A system comprising:
one or more data processing apparatus; and
a computer-readable storage device having stored thereon instructions that, when executed by the one or more data processing apparatus, cause the one or more data processing apparatus to perform operations comprising:
receiving audio data that encodes an utterance;
selecting, from among multiple automated speech recognizers that are each associated with a different natural language, a particular subset of the automated speech recognizers;
providing the audio data to (i) a first automated speech recognizer of the particular subset that is associated with a first natural language, (ii) a second automated speech recognizer of the particular subset that is associated with a different, second, natural language, and (iii) an automated language identifier;
receiving (i) a first transcription of the utterance, in the first natural language, from the first automated speech recognizer, (ii) a second transcription of the utterance, in the second natural language, from the second automated speech recognizer, and (iii) data indicating a first portion of the audio data that is associated with the first natural language and a second portion of the audio data that is associated with the second natural language, from the automated language identifier;
after receiving (i) the first transcription, (ii) the second transcription, and (iii) the data indicating the first portion of the audio data that is associated with the first natural language and the second portion of the audio data that is associated with the second natural language, generating a representative transcription of the utterance comprising (a) a portion of the first transcription that corresponds to the first portion of the audio data that is associated with the first natural language, and (b) a portion of the second transcription that corresponds to the second portion of the audio data that is associated with the second natural language; and
providing the representative transcription for output.
Dependent claims: 14, 15, 16, 17, 18
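Claim 13 differs from claim 1 in that it splices per-language portions of two transcriptions into a single representative transcription, guided by the language identifier's per-portion language data. A minimal sketch of that splice, assuming hypothetical word-level timestamps from each recognizer (an illustrative interface, not one the claim specifies):

```python
# Hypothetical sketch of the claim-13 splice. Each recognizer is assumed
# to return word-level timestamps; the language identifier is assumed to
# return time segments tagged with a language code.
from dataclasses import dataclass

@dataclass
class TimedWord:
    word: str
    start: float  # seconds from the beginning of the audio
    end: float

def splice_transcriptions(
    transcripts: dict[str, list[TimedWord]],   # language -> timed words
    segments: list[tuple[float, float, str]],  # (start, end, language)
) -> str:
    """Build one representative transcription by taking, for each
    language-tagged segment of the audio, the words from the matching
    recognizer's transcription that fall inside that segment."""
    out: list[str] = []
    for seg_start, seg_end, language in segments:
        for w in transcripts[language]:
            # Assign a word to a segment if its midpoint lies inside it.
            mid = (w.start + w.end) / 2
            if seg_start <= mid < seg_end:
                out.append(w.word)
    return " ".join(out)
```

For mixed-language audio such as an English command followed by a French phrase, this yields the English recognizer's words for the English portion and the French recognizer's words for the French portion.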
Specification