Speech recognition using topic-specific language models
Abstract
Speech recognition techniques may include: receiving audio; identifying one or more topics associated with audio; identifying language models in a topic space that correspond to the one or more topics, where the language models are identified based on proximity of a representation of the audio to representations of other audio in the topic space; using the language models to generate recognition candidates for the audio, where the recognition candidates have scores associated therewith that are indicative of a likelihood of a recognition candidate matching the audio; and selecting a recognition candidate for the audio based on the scores.
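The proximity step in the abstract can be illustrated with a minimal sketch. The text does not specify a distance metric or data layout, so cosine similarity, the `items` dictionary, and the names `cosine_similarity` and `nearest_item` are all assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (an assumed proximity metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as not proximate to anything
    return dot / (norm_a * norm_b)


def nearest_item(audio_vec, items):
    """Return (item_id, proximity) for the stored item closest to the audio.

    `items` is a hypothetical mapping: item_id -> {"vector": [...],
    "language_models": {topic_name: relevance_of_topic_to_item}}.
    """
    scored = (
        (item_id, cosine_similarity(audio_vec, item["vector"]))
        for item_id, item in items.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical topic space with two stored items of content.
items = {
    "sports_clip": {
        "vector": [1.0, 0.0, 0.0],
        "language_models": {"sports": 0.9, "news": 0.3},
    },
    "cooking_video": {
        "vector": [0.0, 1.0, 0.0],
        "language_models": {"cooking": 0.8, "news": 0.1},
    },
}

item_id, proximity = nearest_item([0.9, 0.1, 0.0], items)
language_models = items[item_id]["language_models"]
```

Here the audio representation lies closest to `sports_clip`, so the language models attached to that item would be the ones used to generate recognition candidates.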
19 Claims
1. A method comprising:
receiving audio;
determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic;
determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content;
identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content;
obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score;
generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and
selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
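The scoring and selection steps of claim 1 can be sketched as follows. The claim only requires that the aggregated score be based at least on the three inputs; it does not say how they are combined. The product used here, the `Candidate` type, and the example values are assumptions for illustration, not the patented formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One transcription obtained with one topic-specific language model (hypothetical type)."""
    topic: str
    transcription: str
    confidence: float  # speech recognizer confidence score
    relevance: float   # relevance of the topic to the proximate item of content


def aggregated_score(candidate, proximity):
    """Combine the three inputs named in the claim; a plain product is an assumption."""
    return candidate.confidence * candidate.relevance * proximity


def select_transcription(candidates, proximity):
    """Pick the candidate with the highest aggregated score."""
    return max(candidates, key=lambda c: aggregated_score(c, proximity))


# Hypothetical candidates produced by two language models of the proximate item.
candidates = [
    Candidate("sports", "the pitcher threw a strike", confidence=0.80, relevance=0.9),
    Candidate("cooking", "the pitcher through a strike", confidence=0.85, relevance=0.2),
]
best = select_transcription(candidates, proximity=0.95)
```

In this example the sports candidate wins despite its lower recognizer confidence, because its topic is far more relevant to the proximate item. With a single proximate item the proximity factor scales every score equally; it would only change the ranking if candidates from several proximate items were compared.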
15. One or more non-transitory machine-readable media storing instructions that are executable by one or more processing devices to perform operations comprising:
receiving audio;
determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic;
determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content;
identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content;
obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score;
generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and
selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.
Dependent claims: 16, 17
18. A system comprising:
memory storing instructions that are executable; and
one or more processing devices to execute the instructions to perform operations comprising:
receiving audio;
determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic;
determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content;
identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content;
obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score;
generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and
selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.
Dependent claims: 19
Specification