Clustering classes in language modeling
First Claim
1. A computer-implemented method, comprising:
obtaining, by a computing system, a plurality of text samples that each includes a respective class term that belongs to a same pre-defined class of topically related terms;
for each respective text sample among the plurality of text samples:
(i) identifying that the respective class term of the respective text sample belongs to a particular sub-class, among a plurality of sub-classes, of the pre-defined class of topically related terms; and
(ii) assigning the respective text sample to a particular group of text samples that corresponds to the particular sub-class to which the respective class term of the respective text sample belongs, such that the plurality of text samples are assigned among a plurality of groups of text samples and each respective group of text samples among the plurality of groups of text samples corresponds to a different one of a plurality of sub-classes of the pre-defined class of topically related terms;
generating, by the computing system and for each respective sub-class among the plurality of sub-classes, a respective sub-class context model that represents probabilities of language sequences determined based on the text samples assigned to the corresponding group of text samples for the respective sub-class;
merging, by the computing system, particular ones of the sub-class context models that are determined to be similar to generate a hierarchical set of context models;
selecting, by the computing system, particular ones of the context models from among the hierarchical set of context models;
generating, by the computing system, a class-based language model that includes, for each of the selected context models, a respective class that corresponds to the respective context model; and
providing the class-based language model in a speech recognition system and transcribing speech characterized in an audio signal to text using the class-based language model in the speech recognition system.
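The grouping and model-generation steps recited above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the pre-defined class is a set of date terms, that a hypothetical lookup table `SUBCLASS_OF` supplies the sub-class of each class term, and that a context model is a plain bigram model over each sample with the class term replaced by a `$CLASS` placeholder. None of these choices is specified by the claim.

```python
from collections import Counter, defaultdict

# Hypothetical sub-class lookup for a "date" class (illustrative only).
SUBCLASS_OF = {
    "monday": "weekday",
    "tuesday": "weekday",
    "january": "month",
    "march": "month",
}


def group_by_subclass(samples):
    """Assign each text sample to the group for its class term's sub-class."""
    groups = defaultdict(list)
    for text in samples:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok in SUBCLASS_OF:
                # Replace the class term with a placeholder so the model
                # describes the surrounding context, not the term itself.
                context = tokens[:i] + ["$CLASS"] + tokens[i + 1:]
                groups[SUBCLASS_OF[tok]].append(context)
                break  # assumption: one class term per sample
    return dict(groups)


def context_model(contexts):
    """Estimate bigram probabilities P(next | prev) from one group."""
    pair_counts = Counter()
    prev_counts = Counter()
    for tokens in contexts:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


samples = [
    "meet me on monday",
    "call her on tuesday",
    "fly out in march",
    "see you in january",
]
groups = group_by_subclass(samples)
models = {name: context_model(ctxs) for name, ctxs in groups.items()}
```

In this toy data the "weekday" model puts all of its mass for the word after "on" on `$CLASS`, while the "month" model does the same for "in". Differences of this kind between sub-class context models are what the later merging step would compare.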
Abstract
This document describes, among other things, a computer-implemented method. The method can include obtaining a plurality of text samples that each include one or more terms belonging to a first class of terms. The plurality of text samples can be classified into a plurality of groups of text samples. Each group of text samples can correspond to a different sub-class of terms. For each of the groups of text samples, a sub-class context model can be generated based on the text samples in the respective group of text samples. Particular ones of the sub-class context models that are determined to be similar can be merged to generate a hierarchical set of context models. Further, the method can include selecting particular ones of the context models and generating a class-based language model based on the selected context models.
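The merging and selection steps summarized in the abstract can be sketched as a greedy agglomerative clustering. The abstract does not name a similarity measure, a merge rule, or a selection criterion, so the sketch below makes three assumptions: similarity is measured by Jensen-Shannon divergence, two models are merged by unweighted averaging, and models are selected by cutting the hierarchy at a divergence threshold. The toy distributions and sub-class names are hypothetical.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    total = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def average(p, q):
    """Merge two distributions by unweighted averaging (a simplification)."""
    return {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}


def build_hierarchy(models):
    """Repeatedly merge the two most similar models until one root remains."""
    nodes = dict(models)   # every model created so far, leaves and merged
    active = set(models)   # models not yet absorbed into a parent
    merges = []            # (child_a, child_b, parent, divergence)
    while len(active) > 1:
        a, b = min(
            combinations(sorted(active), 2),
            key=lambda pair: js_divergence(nodes[pair[0]], nodes[pair[1]]),
        )
        parent = f"({a}+{b})"
        d = js_divergence(nodes[a], nodes[b])
        nodes[parent] = average(nodes[a], nodes[b])
        active -= {a, b}
        active.add(parent)
        merges.append((a, b, parent, d))
    return nodes, merges


def select_models(models, merges, threshold):
    """Accept a merge only if both children are still selected and similar."""
    selected = set(models)
    for a, b, parent, d in merges:
        if d < threshold and a in selected and b in selected:
            selected -= {a, b}
            selected.add(parent)
    return selected


# Toy distributions over the word that precedes the class term.
models = {
    "weekday": {"on": 0.9, "by": 0.1},
    "holiday": {"on": 0.8, "by": 0.2},
    "month": {"in": 0.9, "by": 0.1},
}
nodes, merges = build_hierarchy(models)
selected = select_models(models, merges, threshold=0.1)
```

With these inputs the "weekday" and "holiday" models are merged because their contexts are nearly identical, while "month" stays separate. Each selected model would then correspond to one class in the resulting class-based language model.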
19 Claims
1. A computer-implemented method, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. One or more non-transitory computer-readable media including instructions that, when executed by one or more processors, cause performance of operations, comprising:
obtaining, by a computing system, a plurality of text samples that each includes a respective class term that belongs to a same pre-defined class of topically related terms;
for each respective text sample among the plurality of text samples:
(i) identifying that the respective class term of the respective text sample belongs to a particular sub-class, among a plurality of sub-classes, of the pre-defined class of topically related terms; and
(ii) assigning the respective text sample to a particular group of text samples that corresponds to the particular sub-class to which the respective class term of the respective text sample belongs, such that the plurality of text samples are assigned among a plurality of groups of text samples and each respective group of text samples among the plurality of groups of text samples corresponds to a different one of a plurality of sub-classes of the pre-defined class of topically related terms;
generating, by the computing system and for each respective sub-class among the plurality of sub-classes, a respective sub-class context model that represents probabilities of language sequences determined based on the text samples assigned to the corresponding group of text samples for the respective sub-class;
merging, by the computing system, particular ones of the sub-class context models that are determined to be similar to generate a hierarchical set of context models;
selecting, by the computing system, particular ones of the context models from among the hierarchical set of context models;
generating, by the computing system, a class-based language model that includes, for each of the selected context models, a respective class that corresponds to the respective context model; and
providing the class-based language model in a speech recognition system and transcribing speech characterized in an audio signal to text using the class-based language model in the speech recognition system.
Dependent claims: 18
19. A system comprising:
one or more processors; and
one or more computer-readable media including instructions that, when executed by the one or more processors, cause performance of operations comprising:
obtaining, by the system, a plurality of text samples that each includes a respective class term that belongs to a same pre-defined class of topically related terms;
for each respective text sample among the plurality of text samples:
(i) identifying that the respective class term of the respective text sample belongs to a particular sub-class, among a plurality of sub-classes, of the pre-defined class of topically related terms; and
(ii) assigning the respective text sample to a particular group of text samples that corresponds to the particular sub-class to which the respective class term of the respective text sample belongs, such that the plurality of text samples are assigned among a plurality of groups of text samples and each respective group of text samples among the plurality of groups of text samples corresponds to a different one of a plurality of sub-classes of the pre-defined class of topically related terms;
generating, by the system and for each respective sub-class among the plurality of sub-classes, a respective sub-class context model that represents probabilities of language sequences determined based on the text samples assigned to the corresponding group of text samples for the respective sub-class;
merging, by the system, particular ones of the sub-class context models that are determined to be similar to generate a hierarchical set of context models;
selecting, by the system, particular ones of the context models from among the hierarchical set of context models;
generating, by the system, a class-based language model that includes, for each of the selected context models, a respective class that corresponds to the respective context model; and
providing the class-based language model in a speech recognition system and transcribing speech characterized in an audio signal to text using the class-based language model in the speech recognition system.
Specification