Speech recognition using variable-length context

US 8,494,850 B2
Filed: 06/29/2012
Issued: 07/23/2013
Est. Priority Date: 06/30/2011
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for recognizing speech using a variable length of context. Speech data and data identifying a candidate transcription for the speech data are received. A phonetic representation for the candidate transcription is accessed. Multiple test sequences are extracted for a particular phone in the phonetic representation. Each of the multiple test sequences includes a different set of contextual phones surrounding the particular phone. Data indicating that an acoustic model includes data corresponding to one or more of the multiple test sequences is received. From among the one or more test sequences, the test sequence that includes the highest number of contextual phones is selected. A score for the candidate transcription is generated based on the data from the acoustic model that corresponds to the selected test sequence.
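
The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the tuple layout of a test sequence, the dictionary standing in for the acoustic model, and the choice to truncate context at utterance boundaries are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical layout: (left context, central phone, right context).
TestSeq = Tuple[Tuple[str, ...], str, Tuple[str, ...]]


def extract_test_sequences(
    phones: Sequence[str], index: int, max_order: int
) -> List[TestSeq]:
    """Build one symmetric test sequence per context width 1..max_order.

    Context that would fall outside the utterance is dropped (an assumption),
    and duplicates are skipped so each sequence has a different context set.
    """
    sequences: List[TestSeq] = []
    for width in range(1, max_order + 1):
        left = tuple(phones[max(0, index - width):index])
        right = tuple(phones[index + 1:index + 1 + width])
        seq = (left, phones[index], right)
        if seq not in sequences:
            sequences.append(seq)
    return sequences


def select_longest_known(
    sequences: Sequence[TestSeq], model: Dict[TestSeq, object]
) -> Optional[TestSeq]:
    """Pick the sequence with the most contextual phones that the model knows."""
    known = [seq for seq in sequences if seq in model]
    if not known:
        return None
    return max(known, key=lambda seq: len(seq[0]) + len(seq[2]))


# Example: the phone "sh" in a phonetic representation of "action".
phones = ["sil", "ae", "k", "sh", "ih", "n", "sil"]
candidates = extract_test_sequences(phones, index=3, max_order=3)
# Toy model that only has data for the two shortest contexts.
toy_model = {candidates[0]: "gmm-a", candidates[1]: "gmm-b"}
selected = select_longest_known(candidates, toy_model)
```

Here the widest context (three phones on each side) is absent from the toy model, so the selection backs off to the sequence with two phones on each side.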

46 Citations

18 Claims

1. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving speech data and data indicating a candidate transcription for the speech data;
  
  accessing a phonetic representation for the candidate transcription;
  
  extracting, from the phonetic representation, multiple test sequences for a particular phone in the phonetic representation, each of the multiple test sequences including a different set of contextual phones surrounding the particular phone;
  
  receiving data indicating that an acoustic model includes data corresponding to one or more of the multiple test sequences;
  
  selecting, from among the one or more test sequences for which the acoustic model includes data, the test sequence that includes the highest number of contextual phones, the selected test sequence including fewer than a predetermined maximum number of contextual phones;
  
  accessing data from the acoustic model corresponding to the selected test sequence; and
  
  generating a score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence, wherein generating the score comprises:
  
  determining a penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones; and
  
  adjusting a first score for the candidate transcription based on the penalty to generate an adjusted score, the adjusted score indicating a lower likelihood than the first score that the candidate transcription is an accurate transcription for the speech data.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
  - 2. The system of claim 1, wherein determining the penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones comprises determining a magnitude of the penalty based on a difference between a number of contextual phones in the selected test sequence and the predetermined maximum number of contextual phones.
  - 3. The system of claim 1, wherein extracting multiple test sequences for the particular phone comprises extracting one or more asymmetric test sequences that include asymmetric numbers of contextual phones before and after the particular phone.
  - 4. The system of claim 3, wherein extracting multiple test sequences for the particular phone comprises extracting one or more symmetric test sequences that include symmetric numbers of contextual phones before and after the particular phone, each of the symmetric test sequences each including fewer contextual phones than each of the one or more asymmetric test sequences.
  - 5. The system of claim 1, wherein extracting multiple test sequences for the particular phone comprises extracting at least:
    - a first sequence that includes one contextual phone before the particular phone or one contextual phone after the particular phone,
      
      a second sequence that includes two contextual phones before the particular phone or two contextual phones after the particular phone, and
      
      a third sequence that includes three contextual phones before the particular phone or three contextual phones after the particular phone.
  - 6. The system of claim 1, wherein extracting multiple test sequences for the particular phone comprises extracting at least five test sequences, where the at least five test sequences respectively include any contextual phones occurring within one, two, three, four, or five contextual positions before and after the particular phone.
  - 7. The system of claim 1, wherein receiving data indicating that the acoustic model includes data for the one or more of the multiple test sequences comprises:
    - requesting, for each of the test sequences, data from the acoustic model that corresponds to the test sequence;
      
      receiving data from the acoustic model corresponding to each of the one or more test sequences for which data is present in the acoustic model; and
      
      determining that the one or more test sequences are recognized by the model based on receiving the data corresponding to the one or more test sequences.
  - 8. The system of claim 1, wherein accessing the data from the acoustic model corresponding to the selected test sequence comprises:
    - identifying a partitioning key based on a sequence of phones that occurs in each of the multiple test sequences;
      
      identifying a partition of a distributed associative array that corresponds to the partitioning key; and
      
      obtaining, from the identified partition, data corresponding to each of the multiple test sequences for which the acoustic model includes data.
  - 9. The system of claim 1, wherein accessing the data from the acoustic model corresponding to the selected test sequence comprises accessing data that describe a Gaussian mixture model corresponding to a central phone of the selected test sequence.
  - 10. The system of claim 1, wherein accessing the phonetic representation for the transcription comprises accessing a phonetic representation comprising context-independent phones.
  - 11. The system of claim 1, wherein receiving the speech data comprises receiving feature vectors that indicate speech characteristics.
  - 12. The system of claim 1, wherein generating the score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence comprises adjusting a score assigned to the candidate transcription using a different acoustic model.
  - 13. The system of claim 1, wherein the operations further comprise:
    - extracting, from the phonetic representation, multiple second test sequences for a second phone in the phonetic representation that is different from the particular phone, each of the multiple second test sequences including a different set of contextual phones surrounding the second phone;
      
      receiving data indicating that the acoustic model includes data for one or more of the multiple second test sequences; and
      
      selecting, from among the one or more second test sequences for which the acoustic model includes data, the second test sequence that includes the highest number of contextual phones; and
      
      wherein generating the score for the candidate transcription comprises generating the score for the candidate transcription based on the data from the acoustic model that corresponds to the selected test sequence and the data from the acoustic model that corresponds to selected second test sequence.
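
The partitioned lookup recited in claim 8 can be illustrated with a short sketch. Everything concrete here is an assumption for illustration: the key is taken to be the central phone with its immediate neighbors (a subsequence shared by all symmetric test sequences for that phone), the "distributed associative array" is simulated with a list of in-memory dictionaries, and the helper names `partition_key`, `partition_index`, `store`, and `lookup` are hypothetical.

```python
import zlib
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: (left context, central phone, right context).
TestSeq = Tuple[Tuple[str, ...], str, Tuple[str, ...]]
Partition = Dict[TestSeq, str]


def partition_key(seq: TestSeq) -> str:
    """Key on the phones common to every test sequence for the same phone."""
    left, center, right = seq
    return "/".join(left[-1:] + (center,) + right[:1])


def partition_index(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable (non-randomized) hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def store(partitions: List[Partition], seq: TestSeq, data: str) -> None:
    """Place model data in the partition selected by the sequence's key."""
    index = partition_index(partition_key(seq), len(partitions))
    partitions[index][seq] = data


def lookup(partitions: List[Partition], sequences: Sequence[TestSeq]) -> Partition:
    """Fetch data for all known test sequences from a single partition."""
    keys = {partition_key(seq) for seq in sequences}
    if len(keys) != 1:
        raise ValueError("test sequences must share one partitioning key")
    index = partition_index(keys.pop(), len(partitions))
    partition = partitions[index]
    return {seq: partition[seq] for seq in sequences if seq in partition}
```

Because every test sequence for a given phone maps to the same key, one request to one partition returns the data for all of the sequences that the model contains, which is the property the claim relies on.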

14. A computer-implemented method, comprising:
- receiving speech data and data identifying a candidate transcription for the speech data;
  
  accessing a phonetic representation for the candidate transcription;
  
  extracting, from the phonetic representation, multiple test sequences for a particular phone in the phonetic representation, each of the multiple test sequences including a different set of contextual phones surrounding the particular phone;
  
  determining that an acoustic model includes data corresponding to one or more of the multiple test sequences;
  
  selecting, from among the one or more test sequences for which the acoustic model includes data, the test sequence that includes the highest number of contextual phones, the selected test sequence including fewer than a predetermined maximum number of contextual phones;
  
  accessing data from the acoustic model corresponding to the selected test sequence; and
  
  generating a score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence, wherein generating the score comprises:
  
  determining a penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones; and
  
  adjusting a first score for the candidate transcription based on the penalty to generate an adjusted score, the adjusted score indicating a lower likelihood than the first score that the candidate transcription is an accurate transcription for the speech data.
- Dependent claims (15, 16, 17)
  - 15. The computer-implemented method of claim 14, wherein determining the penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones comprises determining a magnitude of the penalty based on a difference between a number of contextual phones in the selected test sequence and the predetermined maximum number of contextual phones.
  - 16. The computer-implemented method of claim 14, wherein extracting multiple test sequences for the particular phone comprises extracting one or more asymmetric test sequences that include asymmetric numbers of contextual phones before and after the particular phone.
  - 17. The computer-implemented method of claim 16, wherein extracting multiple test sequences for the particular phone comprises extracting one or more symmetric test sequences that include symmetric numbers of contextual phones before and after the particular phone, each of the symmetric test sequences each including fewer contextual phones than each of the one or more asymmetric test sequences.
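
The penalty described in claims 14 and 15 can be sketched as below. The claims only require that the penalty's magnitude depend on the difference between the context actually used and the predetermined maximum; the linear form, the log-domain score, the `weight` parameter, and the function names are assumptions chosen for this example.

```python
def backoff_penalty(num_context: int, max_context: int, weight: float = 0.5) -> float:
    """Penalty that grows with the number of contextual phones given up.

    The linear form and the default weight are illustrative assumptions.
    """
    if not 0 <= num_context <= max_context:
        raise ValueError("num_context must be between 0 and max_context")
    return weight * (max_context - num_context)


def adjust_score(
    first_score: float, num_context: int, max_context: int, weight: float = 0.5
) -> float:
    """Lower a log-domain score when the selected sequence used less context."""
    return first_score - backoff_penalty(num_context, max_context, weight)


# Example: the selected sequence has 4 contextual phones out of a maximum of 6.
adjusted = adjust_score(first_score=-42.0, num_context=4, max_context=6)
```

With these values the penalty is 1.0, so the adjusted score (-43.0) indicates a lower likelihood than the first score (-42.0), as the claim requires.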

18. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving speech data and data identifying a candidate transcription for the speech data;
  
  accessing a phonetic representation for the candidate transcription;
  
  extracting, from the phonetic representation, multiple test sequences for a particular phone in the phonetic representation, each of the multiple test sequences including a different set of contextual phones surrounding the particular phone;
  
  determining that an acoustic model includes data corresponding to one or more of the multiple test sequences;
  
  selecting, from among the one or more test sequences for which the acoustic model includes data, the test sequence that includes the highest number of contextual phones, the selected test sequence including fewer than a predetermined maximum number of contextual phones;
  
  accessing data from the acoustic model corresponding to the selected test sequence; and
  
  generating a score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence, wherein generating the score comprises:
  
  determining a penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones; and
  
  adjusting a first score for the candidate transcription based on the penalty to generate an adjusted score, the adjusted score indicating a lower likelihood than the first score that the candidate transcription is an accurate transcription for the speech data.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Chelba, Ciprian I.; Xu, Peng; Pereira, Fernando
Primary Examiner(s): Opsasnick, Michael N
Application Number: US13/539,284
Publication Number: US 20130006623A1
Time in Patent Office: 389 Days
Field of Search: 704/233, 704/251, 704/254
US Class Current: 704/233
CPC Class Codes:
- G10L 15/063   Training
- G10L 15/14   using statistical models, e...
- G10L 15/187   Phonemic context, e.g. pron...
- G10L 15/34   Adaptation of a single reco...
- G10L 2015/0631   Creating reference template...
