Predicting pronunciation in speech recognition
First Claim
1. A computer-implemented method for processing a spoken utterance, the method comprising:
receiving a single word comprising a first portion and a second portion;
determining a first language of origin of the first portion;
determining a first score associated with the first language of origin;
determining a second language of origin of the second portion;
determining a second score associated with the second language of origin;
determining a plurality of potential pronunciations of the single word based at least in part on the first score and the second score, wherein each of the plurality of potential pronunciations is associated with a respective pronunciation score and wherein the plurality includes at least one hybrid pronunciation of the single word based on the first language of origin and the second language of origin;
receiving a spoken utterance comprising a request to output audio content;
matching a portion of the spoken utterance with one of the plurality of potential pronunciations based at least in part on a pronunciation score of one of the plurality of potential pronunciations;
identifying the audio content based at least in part on the one of the plurality of potential pronunciations; and
causing the audio content to be played by a computing device.
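The pronunciation-generation steps of this claim can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the example word, the phoneme strings, the `PORTION_PRONUNCIATIONS` table, and the function name `candidate_pronunciations` are all hypothetical, and multiplying the per-portion language scores is only one possible reading of "based at least in part on the first score and the second score".

```python
from itertools import product

# Hypothetical per-language pronunciations for each portion of one word.
# The phoneme strings are illustrative only; a real system would obtain them
# from a grapheme-to-phoneme model for each language.
PORTION_PRONUNCIATIONS = {
    "jala": {"es": "h aa l aa", "en": "jh ae l ah"},
    "peno": {"es": "p eh n y o", "en": "p iy n ow"},
}


def candidate_pronunciations(portions, language_scores):
    """Return (pronunciation, languages, score) tuples, best score first.

    portions: the portions of a single word, e.g. ["jala", "peno"].
    language_scores: one dict per portion mapping language -> score.
    """
    candidates = []
    # Try every combination of one language per portion. Combinations that
    # mix languages are the hybrid pronunciations.
    for langs in product(*(scores.keys() for scores in language_scores)):
        phones = []
        score = 1.0
        for portion, lang, scores in zip(portions, langs, language_scores):
            phones.append(PORTION_PRONUNCIATIONS[portion][lang])
            score *= scores[lang]  # assumption: scores combine by product
        candidates.append((" ".join(phones), langs, score))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


if __name__ == "__main__":
    scores = [{"es": 0.75, "en": 0.25}, {"es": 0.5, "en": 0.5}]
    for pron, langs, score in candidate_pronunciations(["jala", "peno"], scores):
        kind = "hybrid" if len(set(langs)) > 1 else "single-language"
        print(f"{score:.3f}  {kind:15s}  {pron}")
```

Each returned tuple carries its own pronunciation score, so a later matching step can prefer the more likely candidates.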
Abstract
An automatic speech recognition (ASR) device may be configured to predict pronunciations of textual identifiers (for example, song names) by predicting one or more languages of origin of each textual identifier. The languages of origin may be determined from the textual identifier itself. The predicted pronunciations may include a pronunciation in one language, a pronunciation in a second language, and a hybrid pronunciation that combines multiple languages. The pronunciations may be added to a lexicon and matched to the content item (e.g., a song) and/or the textual identifier. The ASR device may receive a spoken utterance from a user requesting access to the content item. The ASR device determines whether the spoken utterance matches one of the pronunciations of the content item in the lexicon, and accesses the content when it does.
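The lexicon described in the abstract can be pictured as a mapping from pronunciations to content items. The sketch below is only an assumption about how such a structure might look: the class name `PronunciationLexicon`, its methods, the content identifiers, and the exact-string matching are hypothetical simplifications (a real recognizer would match against acoustic and language models rather than compare phoneme strings).

```python
from collections import defaultdict


class PronunciationLexicon:
    """Hypothetical lexicon mapping predicted pronunciations to content items."""

    def __init__(self):
        # pronunciation -> list of (score, content_id) pairs
        self._entries = defaultdict(list)

    def add(self, pronunciation, content_id, score):
        """Register one predicted pronunciation for a content item."""
        self._entries[pronunciation].append((score, content_id))

    def lookup(self, recognized):
        """Return the best-scoring content item for a recognized
        pronunciation, or None when nothing in the lexicon matches."""
        matches = self._entries.get(recognized)
        if not matches:
            return None
        return max(matches, key=lambda m: m[0])[1]


if __name__ == "__main__":
    lexicon = PronunciationLexicon()
    # Several pronunciations, including a hybrid one, point at the same song.
    lexicon.add("h aa l aa p eh n y o", "song-001", 0.375)
    lexicon.add("h aa l aa p iy n ow", "song-001", 0.375)
    lexicon.add("jh ae l ah p iy n ow", "song-001", 0.125)
    print(lexicon.lookup("h aa l aa p iy n ow"))
    print(lexicon.lookup("unknown phones"))
```

Storing every predicted pronunciation against the same content item is what lets an utterance in either language, or a mix of both, resolve to the same song.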
21 Claims
4. A computing system, comprising:
at least one processor; and
at least one memory including instructions that, when executed by the at least one processor, cause the computing system to:
receive audio data corresponding to an utterance;
determine a first score corresponding to a likelihood that a first portion of the audio data represents a first potential pronunciation of a first portion of a single word, the first potential pronunciation of the first portion of the single word corresponding to a first language of origin;
determine a second score corresponding to a likelihood that a second portion of the audio data represents a second potential pronunciation of a second portion of the single word, the second potential pronunciation of the second portion of the single word corresponding to a second language of origin;
determine that the audio data includes a representation of a combined potential pronunciation of the single word based at least in part on the first score and the second score, wherein the combined potential pronunciation of the single word comprises a hybrid pronunciation based on the first language of origin and the second language of origin;
identify content based at least in part on the single word; and
cause the content to be output.
Dependent claims: 5, 6, 7, 8, 9, 10, 11, 12.
13. A method, comprising:
receiving audio data corresponding to an utterance;
determining a first score corresponding to a likelihood that a first portion of the audio data represents a first potential pronunciation of a first portion of a single word, the first potential pronunciation of the first portion of the single word corresponding to a first language of origin;
determining a second score corresponding to a likelihood that a second portion of the audio data represents a second potential pronunciation of a second portion of the single word, the second potential pronunciation of the second portion of the single word corresponding to a second language of origin;
determining that the audio data includes a representation of a combined potential pronunciation of the single word based at least in part on the first score and the second score, wherein the combined potential pronunciation of the single word comprises a hybrid pronunciation based on the first language of origin and the second language of origin;
identifying content based at least in part on the single word; and
causing the content to be output.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21.
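Independent claims 4 and 13 work from the audio side: two per-portion likelihood scores feed a decision about whether the audio contains the hybrid pronunciation. The claims do not say how the scores are combined, so the sketch below simply assumes that each score is a probability in [0, 1], that the two are merged with a geometric mean, and that a fixed threshold decides the outcome. The function name `detect_hybrid` and the default threshold of 0.5 are hypothetical.

```python
import math


def detect_hybrid(first_score, second_score, threshold=0.5):
    """Combine two per-portion likelihoods and test them against a threshold.

    first_score: likelihood that the first audio portion matches the
        first-language pronunciation of the first word portion.
    second_score: likelihood that the second audio portion matches the
        second-language pronunciation of the second word portion.
    Returns (combined_score, is_hybrid_detected).
    """
    for score in (first_score, second_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are assumed to be probabilities in [0, 1]")
    # Assumption: a geometric mean, so one weak portion pulls the result down.
    combined = math.sqrt(first_score * second_score)
    return combined, combined >= threshold


if __name__ == "__main__":
    print(detect_hybrid(0.9, 0.8))  # both portions match well
    print(detect_hybrid(0.9, 0.1))  # second portion matches poorly
```

A geometric mean is used here only because it penalizes a poor match on either portion; a weighted sum of log-likelihoods would be an equally plausible choice.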
Specification