Exemplar-based latent perceptual modeling for automatic speech recognition
Abstract
Disclosed are methods, systems, and computer-readable media for selecting observation-specific training data (also referred to as “observation-specific exemplars”) from a general training corpus and then creating, from that data, a focused, observation-specific acoustic model for recognizing the observation in an output domain. In one aspect, a global speech recognition model is established based on an initial set of training data, and a plurality of input speech segments to be recognized in an output domain are received. For each of the plurality of input speech segments: a respective set of focused training data relevant to the input speech segment is identified in the global speech recognition model; a respective focused speech recognition model is generated based on the respective set of focused training data; and the respective focused speech recognition model is provided to a recognition device for recognizing the input speech segment in the output domain.
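The per-segment flow in the abstract (select focused data, build a focused model, hand it to a recognizer) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Exemplar` type, the Euclidean nearest-neighbour selection, the per-label centroid model, and all function names are assumptions introduced here, since the abstract does not specify how relevance is measured or what form the focused model takes.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class Exemplar:
    """Hypothetical training item: a feature vector plus its output-domain label."""
    features: Vector
    label: str


def select_focused_data(global_data: Sequence[Exemplar], segment: Vector, k: int) -> List[Exemplar]:
    """Pick the k exemplars closest to the input segment (assumed: Euclidean distance)."""
    return sorted(global_data, key=lambda ex: math.dist(ex.features, segment))[:k]


def build_focused_model(focused: Sequence[Exemplar]) -> Dict[str, Vector]:
    """Assumed model form: one centroid per label, computed from the focused data only."""
    grouped = defaultdict(list)
    for ex in focused:
        grouped[ex.label].append(ex.features)
    return {
        label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for label, vecs in grouped.items()
    }


def recognize(model: Dict[str, Vector], segment: Vector) -> str:
    """Stand-in for the recognition device: return the label of the nearest centroid."""
    return min(model, key=lambda label: math.dist(model[label], segment))


def recognize_all(global_data: Sequence[Exemplar], segments: Sequence[Vector], k: int = 3) -> List[str]:
    """Build a fresh focused model for every input segment, then recognize with it."""
    results = []
    for segment in segments:
        focused = select_focused_data(global_data, segment, k)
        model = build_focused_model(focused)
        results.append(recognize(model, segment))
    return results


# Toy corpus with two well-separated labels (made-up numbers).
corpus = [
    Exemplar((0.0, 0.1), "a"), Exemplar((0.2, 0.0), "a"), Exemplar((0.1, 0.2), "a"),
    Exemplar((1.0, 1.1), "b"), Exemplar((0.9, 1.0), "b"), Exemplar((1.1, 0.9), "b"),
]
labels = recognize_all(corpus, [(0.1, 0.1), (1.0, 1.0)], k=3)
print(labels)
```

The point of the sketch is the loop structure: the model used for each segment is trained only on the exemplars selected for that segment, rather than on the whole corpus.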
27 Claims
-
1. A method for recognizing speech in an output domain, the method comprising:
- at a device comprising one or more processors and memory:
establishing a global speech recognition model based on an initial set of training data;
receiving a plurality of input speech segments to be recognized in the output domain; and
for each of the plurality of input speech segments:
identifying in the global speech recognition model a respective set of focused training data relevant to the input speech segment;
generating a respective focused speech recognition model based on the respective set of focused training data; and
providing the respective focused speech recognition model to a recognition device for recognizing the input speech segment in the output domain;
wherein establishing the global speech recognition model based on the initial set of training data further comprises:
generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
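The claim does not say how the global latent space is derived from the speech segments and speech templates. One possible reading, offered here only as an assumption, is a truncated singular value decomposition of a segment-template co-occurrence matrix, in the style of latent semantic analysis. The symbols $W$, $M$, $N$, $R$, $U$, $S$, and $V$ below are introduced for illustration and do not appear in the claim text.

```latex
% Assumed setup: M speech segments, N speech templates.
% w_{ij} is some weighted count of how strongly template j occurs in segment i.
W = \bigl[\, w_{ij} \,\bigr] \in \mathbb{R}^{M \times N}

% Truncated SVD of rank R \le \min(M, N):
W \approx U \, S \, V^{\mathsf{T}},
\qquad U \in \mathbb{R}^{M \times R},\;
S = \operatorname{diag}(s_1, \dots, s_R),\;
V \in \mathbb{R}^{N \times R},
\qquad U^{\mathsf{T}} U = V^{\mathsf{T}} V = I_R

% Latent-space coordinates:
%   segment i  ->  u_i S   (u_i is the i-th row of U)
%   template j ->  v_j S   (v_j is the j-th row of V)

% A new input segment with template-count row vector d \in \mathbb{R}^{1 \times N}
% is mapped into the same space using V^T V = I_R:
\tilde{u} \, S = d \, V
```

Under this reading, segments and templates share one $R$-dimensional space, so closeness between an input segment and the training segments can be measured there when selecting focused training data.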
-
11. A method for recognizing speech in an output domain, the method comprising:
- at a client device comprising one or more processors and memory:
receiving a speech input from a user;
for each of a plurality of input speech segments in the speech input:
receiving a respective focused speech recognition model, wherein the respective focused speech recognition model is generated based on a respective set of focused training data relevant to the input speech segment, wherein the respective set of focused training data is selected for the input speech segment in a global speech recognition model, and wherein the global speech recognition model is generated based on a set of global training data; and
recognizing the input speech segment using the respective focused speech recognition model;
wherein establishing the global speech recognition model based on the initial set of training data further comprises:
generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
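The client-side steps of claim 11 (receive speech input, obtain one focused model per segment, recognize with that model) can be sketched as below. Everything concrete here is an assumption for illustration: the fixed-width segmentation, the dictionary model form, the `fake_server` stub standing in for whatever supplies the focused models, and all names. The claim itself does not specify any of them.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[float, ...]
FocusedModel = Dict[str, Segment]               # assumed form: label -> reference vector
ModelSource = Callable[[Segment], FocusedModel]  # hypothetical supplier of focused models


def split_into_segments(speech_input: Sequence[float], width: int) -> List[Segment]:
    """Cut a feature stream into fixed-width segments (an assumed segmentation)."""
    return [
        tuple(speech_input[i:i + width])
        for i in range(0, len(speech_input) - width + 1, width)
    ]


def recognize_segment(model: FocusedModel, segment: Segment) -> str:
    """Return the label whose reference vector is closest to the segment."""
    def squared_distance(ref: Segment) -> float:
        return sum((r - s) ** 2 for r, s in zip(ref, segment))

    return min(model, key=lambda label: squared_distance(model[label]))


def recognize_speech(speech_input: Sequence[float],
                     fetch_focused_model: ModelSource,
                     width: int = 2) -> List[str]:
    """For each segment: receive its focused model, then recognize with that model."""
    recognized = []
    for segment in split_into_segments(speech_input, width):
        model = fetch_focused_model(segment)  # one focused model per segment
        recognized.append(recognize_segment(model, segment))
    return recognized


def fake_server(segment: Segment) -> FocusedModel:
    """Hypothetical stand-in for the remote side; returns a different model per segment."""
    if sum(segment) < 1.0:
        return {"a": (0.1, 0.1), "silence": (0.0, 0.0)}
    return {"b": (1.0, 1.0), "c": (2.0, 2.0)}


transcript = recognize_speech([0.1, 0.2, 0.9, 1.1, 2.1, 1.8], fake_server)
print(transcript)
```

Passing the model source in as a callable keeps the client loop independent of how the focused model is delivered, which in a real deployment might be a network call.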
-
12. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
- establishing a global speech recognition model based on an initial set of training data;
receiving a plurality of input speech segments to be recognized in an output domain; and
for each of the plurality of input speech segments:
identifying in the global speech recognition model a respective set of focused training data relevant to the input speech segment;
generating a respective focused speech recognition model based on the respective set of focused training data; and
providing the respective focused speech recognition model to a recognition device for recognizing the input speech segment in the output domain;
wherein establishing the global speech recognition model based on the initial set of training data further comprises:
generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
Dependent claims: 13, 14, 15, 16, 17, 18
-
19. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
- at a client device:
receiving a speech input from a user;
for each of a plurality of input speech segments in the speech input:
receiving a respective focused speech recognition model, wherein the respective focused speech recognition model is generated based on a respective set of focused training data relevant to the input speech segment, wherein the respective set of focused training data is selected for the input speech segment in a global speech recognition model, and wherein the global speech recognition model is generated based on a set of global training data; and
recognizing the input speech segment using the respective focused speech recognition model;
wherein establishing the global speech recognition model based on the initial set of training data further comprises:
generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
-
20. A system, comprising:
- one or more processors; and
memory having instructions stored thereon, the instructions, when executed by the one or more processors, cause the processors to perform operations comprising:
establishing a global speech recognition model based on an initial set of training data;
receiving a plurality of input speech segments to be recognized in an output domain; and
for each of the plurality of input speech segments:
identifying in the global speech recognition model a respective set of focused training data relevant to the input speech segment;
generating a respective focused speech recognition model based on the respective set of focused training data; and
providing the respective focused speech recognition model to a recognition device for recognizing the input speech segment in the output domain;
wherein establishing the global speech recognition model based on the initial set of training data further comprises:
generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
Dependent claims: 21, 22, 23, 24, 25, 26
-
27. A system, comprising:
- one or more processors; and
memory having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
at a client device:
receiving a speech input from a user;
for each of a plurality of input speech segments in the speech input:
receiving a respective focused speech recognition model from a server, wherein the respective focused speech recognition model is generated based on a respective set of focused training data relevant to the input speech segment, wherein the respective set of focused training data is selected for the input speech segment in a global speech recognition model, and wherein the global speech recognition model is generated based on a set of global training data; and
recognizing the input speech segment using the respective focused speech recognition model;
wherein establishing the global speech recognition model based on the initial set of training data further comprises:
generating the initial set of training data from a plurality of training speech samples, the initial set of training data including an initial set of speech segments and an initial set of speech templates; and
deriving a global latent space from the initial set of speech segments and the initial set of speech templates.
-
Specification