Speech recognition using variable-length context
First Claim
1. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving speech data and data indicating a candidate transcription for the speech data;
accessing a phonetic representation for the candidate transcription;
extracting, from the phonetic representation, multiple test sequences for a particular phone in the phonetic representation, each of the multiple test sequences including a different set of contextual phones surrounding the particular phone;
receiving data indicating that an acoustic model includes data corresponding to one or more of the multiple test sequences;
selecting, from among the one or more test sequences for which the acoustic model includes data, the test sequence that includes the highest number of contextual phones, the selected test sequence including fewer than a predetermined maximum number of contextual phones;
accessing data from the acoustic model corresponding to the selected test sequence; and
generating a score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence, wherein generating the score comprises:
determining a penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones; and
adjusting a first score for the candidate transcription based on the penalty to generate an adjusted score, the adjusted score indicating a lower likelihood than the first score that the candidate transcription is an accurate transcription for the speech data.
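The penalty step recited at the end of the claim can be illustrated with a minimal sketch. The claim does not say how the penalty is computed, so the sketch assumes log-likelihood scores and a penalty proportional to the number of missing contextual phones. The names `MAX_CONTEXT`, `PENALTY_PER_MISSING_PHONE`, and `adjust_score` are hypothetical and do not come from the patent.

```python
# Hypothetical maximum number of contextual phones (for example, 4 on each side).
MAX_CONTEXT = 8
# Hypothetical log-domain penalty per missing contextual phone (an assumption).
PENALTY_PER_MISSING_PHONE = 0.1


def adjust_score(first_score: float,
                 num_context_phones: int,
                 max_context: int = MAX_CONTEXT,
                 penalty_per_phone: float = PENALTY_PER_MISSING_PHONE) -> float:
    """Lower a log-likelihood score when the selected test sequence
    has fewer contextual phones than the predetermined maximum."""
    if not 0 <= num_context_phones <= max_context:
        raise ValueError("num_context_phones must be between 0 and max_context")
    # Number of contextual phones the selected sequence is missing.
    missing = max_context - num_context_phones
    # Assumed linear penalty; the claim only requires that a penalty be determined.
    penalty = penalty_per_phone * missing
    # Subtracting in the log domain indicates a lower likelihood.
    return first_score - penalty


if __name__ == "__main__":
    full = adjust_score(-5.0, 8)       # full context, no penalty
    backed_off = adjust_score(-5.0, 6)  # two contextual phones missing
    print(full, backed_off)
```

Under these assumptions, a sequence with the full context keeps its first score, while any shorter sequence receives an adjusted score that indicates a lower likelihood, as the claim requires.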
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for recognizing speech using a variable length of context. Speech data and data identifying a candidate transcription for the speech data are received. A phonetic representation for the candidate transcription is accessed. Multiple test sequences are extracted for a particular phone in the phonetic representation. Each of the multiple test sequences includes a different set of contextual phones surrounding the particular phone. Data indicating that an acoustic model includes data corresponding to one or more of the multiple test sequences is received. From among the one or more test sequences, the test sequence that includes the highest number of contextual phones is selected. A score for the candidate transcription is generated based on the data from the acoustic model that corresponds to the selected test sequence.
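The extraction and selection steps summarized in the abstract can be sketched as follows. This is an illustrative sketch, not the patented implementation: it assumes symmetric context windows, represents the acoustic model as a plain dictionary keyed by test sequence, and uses a made-up phone list. The names `extract_test_sequences`, `context_size`, and `select_longest_available` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A test sequence is (left context, central phone, right context).
TestSequence = Tuple[Tuple[str, ...], str, Tuple[str, ...]]


def extract_test_sequences(phones: List[str], index: int,
                           max_side: int) -> List[TestSequence]:
    """Build test sequences for phones[index] using 1..max_side
    contextual phones on each side (assumed symmetric windows)."""
    sequences: List[TestSequence] = []
    for n in range(1, max_side + 1):
        left = tuple(phones[max(0, index - n):index])
        right = tuple(phones[index + 1:index + 1 + n])
        seq = (left, phones[index], right)
        # Near the utterance edges, larger windows can repeat; keep each once.
        if seq not in sequences:
            sequences.append(seq)
    return sequences


def context_size(seq: TestSequence) -> int:
    """Number of contextual phones surrounding the central phone."""
    return len(seq[0]) + len(seq[2])


def select_longest_available(sequences: List[TestSequence],
                             model: Dict[TestSequence, str]
                             ) -> Optional[TestSequence]:
    """Among sequences the model has data for, pick the one with the
    highest number of contextual phones; None if the model has none."""
    available = [s for s in sequences if s in model]
    if not available:
        return None
    return max(available, key=context_size)


if __name__ == "__main__":
    # Hypothetical phonetic representation; "sil" marks silence.
    phones = ["sil", "ae", "k", "sh", "ih", "n", "sil"]
    seqs = extract_test_sequences(phones, index=4, max_side=3)
    # Hypothetical model that only has data for the two shortest contexts.
    model = {seqs[0]: "model_data_a", seqs[1]: "model_data_b"}
    print(select_longest_available(seqs, model))
```

In this sketch the model lacks data for the widest context, so selection falls back to the longest sequence that the model does cover.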
18 Claims
1. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving speech data and data indicating a candidate transcription for the speech data;
accessing a phonetic representation for the candidate transcription;
extracting, from the phonetic representation, multiple test sequences for a particular phone in the phonetic representation, each of the multiple test sequences including a different set of contextual phones surrounding the particular phone;
receiving data indicating that an acoustic model includes data corresponding to one or more of the multiple test sequences;
selecting, from among the one or more test sequences for which the acoustic model includes data, the test sequence that includes the highest number of contextual phones, the selected test sequence including fewer than a predetermined maximum number of contextual phones;
accessing data from the acoustic model corresponding to the selected test sequence; and
generating a score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence, wherein generating the score comprises:
determining a penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones; and
adjusting a first score for the candidate transcription based on the penalty to generate an adjusted score, the adjusted score indicating a lower likelihood than the first score that the candidate transcription is an accurate transcription for the speech data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A computer-implemented method, comprising:
receiving speech data and data identifying a candidate transcription for the speech data;
accessing a phonetic representation for the candidate transcription;
extracting, from the phonetic representation, multiple test sequences for a particular phone in the phonetic representation, each of the multiple test sequences including a different set of contextual phones surrounding the particular phone;
determining that an acoustic model includes data corresponding to one or more of the multiple test sequences;
selecting, from among the one or more test sequences for which the acoustic model includes data, the test sequence that includes the highest number of contextual phones, the selected test sequence including fewer than a predetermined maximum number of contextual phones;
accessing data from the acoustic model corresponding to the selected test sequence; and
generating a score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence, wherein generating the score comprises:
determining a penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones; and
adjusting a first score for the candidate transcription based on the penalty to generate an adjusted score, the adjusted score indicating a lower likelihood than the first score that the candidate transcription is an accurate transcription for the speech data.
Dependent claims: 15, 16, 17
18. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving speech data and data identifying a candidate transcription for the speech data;
accessing a phonetic representation for the candidate transcription;
extracting, from the phonetic representation, multiple test sequences for a particular phone in the phonetic representation, each of the multiple test sequences including a different set of contextual phones surrounding the particular phone;
determining that an acoustic model includes data corresponding to one or more of the multiple test sequences;
selecting, from among the one or more test sequences for which the acoustic model includes data, the test sequence that includes the highest number of contextual phones, the selected test sequence including fewer than a predetermined maximum number of contextual phones;
accessing data from the acoustic model corresponding to the selected test sequence; and
generating a score for the candidate transcription based on the accessed data from the acoustic model that corresponds to the selected test sequence, wherein generating the score comprises:
determining a penalty based on the selected test sequence including fewer than the predetermined maximum number of contextual phones; and
adjusting a first score for the candidate transcription based on the penalty to generate an adjusted score, the adjusted score indicating a lower likelihood than the first score that the candidate transcription is an accurate transcription for the speech data.
Specification