Generation and use of multiple speech processing transforms
Abstract
Features are disclosed for selecting and using multiple transforms associated with a particular remote device for use in automatic speech recognition (“ASR”). Each transform may be based on statistics that have been generated from processing utterances that share some characteristic (e.g., acoustic characteristics, time frame within which the utterances were processed, etc.). When an utterance is received from the remote device, a particular transform or set of transforms may be selected for use in speech processing based on data obtained from the remote device, speech processing of a portion of the utterance, speech processing of prior utterances, etc. The transform or transforms used in processing the utterances may then be updated based on the results of the speech processing.
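The selection step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes each transform is an affine map (a matrix `A` and offset `b`), and it uses the average log-likelihood under a single diagonal Gaussian as a stand-in for the match to an acoustic model. The names `apply_transform`, `score`, and `select_transform`, and all numeric values, are hypothetical.

```python
import math

def apply_transform(transform, vectors):
    """Apply an affine transform (A, b) to each feature vector: x' = A x + b."""
    A, b = transform
    return [
        [sum(a * x for a, x in zip(row, vec)) + b_i for row, b_i in zip(A, b)]
        for vec in vectors
    ]

def score(vectors, mean, var):
    """Average log-likelihood under a diagonal Gaussian (assumed stand-in model)."""
    total = 0.0
    for vec in vectors:
        for x, m, v in zip(vec, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(vectors)

def select_transform(candidates, vectors, mean, var):
    """Score every candidate transform and return the best name plus all scores."""
    scores = {
        name: score(apply_transform(t, vectors), mean, var)
        for name, t in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical transforms: an identity "default" and two device-specific ones.
candidates = {
    "default": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    "first": ([[1.0, 0.0], [0.0, 1.0]], [2.0, 2.0]),
    "second": ([[3.0, 0.0], [0.0, 3.0]], [0.0, 0.0]),
}
features = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
best, scores = select_transform(candidates, features, mean=[0.0, 0.0], var=[1.0, 1.0])

# If the default wins, drop the worst device-specific transform. Estimating the
# replacement from recognition statistics is outside the scope of this sketch.
if best == "default":
    worst = min((n for n in scores if n != "default"), key=scores.get)
    del candidates[worst]
```

With these toy values the untransformed features sit closest to the model mean, so the default transform scores highest and the lowest-scoring device-specific transform is discarded.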
25 Claims
1. A system comprising:
a computer-readable memory storing executable instructions; and
one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
receive, from a client device, audio data regarding a user utterance;
obtain, based at least partly on the audio data, at least a first transform and a second transform of a plurality of feature vector transforms associated with the client device;
generate a plurality of feature vectors based at least partly on the audio data;
apply the first transform, the second transform, and a default transform to at least a portion of the plurality of feature vectors to generate first transformed feature vectors, second transformed feature vectors, and default feature vectors respectively;
determine, based at least partly on a performance score, that the default feature vectors provide a better match to an acoustic model than the first transformed feature vectors and the second transformed feature vectors; and
discard at least one of the first transform or the second transform and create a new transform, the new transform based at least partly on the default transform and speech recognition statistics regarding a speech recognition pass using the default transform, wherein the new transform is associated with the client device.
Dependent claims: 2, 3, 4
5. A computer-implemented method comprising:
receiving, by a spoken language processing system comprising one or more computing devices, audio data regarding a user utterance, wherein the audio data is received from a client device separate from the one or more computing devices;
determining, based at least partly on the audio data, to replace a session feature vector transform associated with the client device with a new feature vector transform, wherein the session feature vector transform is based on data generated during processing of prior audio data regarding one or more utterances, made prior to the user utterance, received from the client device within a threshold period of time;
creating the new feature vector transform based at least partly on the audio data, wherein the new feature vector transform is associated with the client device; and
performing speech recognition using the new feature vector transform.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14, 16
15. One or more non-transitory computer readable media comprising executable code that, when executed, cause one or more computing devices to perform a process comprising:
receiving, by a spoken language processing system comprising one or more computing devices, audio data regarding a user utterance, wherein the audio data is received from a client device separate from the one or more computing devices;
determining, based at least partly on the audio data, to replace a session feature vector transform associated with the client device with a new feature vector transform, wherein the session feature vector transform is based on data generated during processing of prior audio data regarding one or more utterances, made prior to the user utterance, received from the client device within a threshold period of time;
creating the new feature vector transform based at least partly on the audio data, wherein the new feature vector transform is associated with the client device; and
performing speech recognition using the new feature vector transform.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24, 25
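The session-transform replacement recited in claims 5 and 15 can be sketched in Python under strong simplifying assumptions. Here the session transform is reduced to a per-device mean offset that is subtracted from each feature vector, and the decision to replace it uses two assumed criteria: the session's age and how far the new utterance's mean drifts from the stored one. The constants `SESSION_WINDOW_SECONDS` and `MAX_MEAN_DRIFT`, the `SessionTransform` class, and all helper names are hypothetical and do not come from the claims.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

SESSION_WINDOW_SECONDS = 600.0  # assumed "threshold period of time"
MAX_MEAN_DRIFT = 1.0            # assumed mismatch limit for the new audio

@dataclass
class SessionTransform:
    mean: List[float]   # offset subtracted from each feature vector
    created_at: float   # time the transform was created, in seconds

def utterance_mean(vectors: List[List[float]]) -> List[float]:
    """Per-dimension mean of an utterance's feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def should_replace(session: Optional[SessionTransform],
                   vectors: List[List[float]], now: float) -> bool:
    """Replace when there is no session, it is too old, or the audio has drifted."""
    if session is None or now - session.created_at > SESSION_WINDOW_SECONDS:
        return True
    drift = max(abs(a - b) for a, b in zip(utterance_mean(vectors), session.mean))
    return drift > MAX_MEAN_DRIFT

def transform_for_utterance(store: Dict[str, SessionTransform], device_id: str,
                            vectors: List[List[float]], now: float) -> SessionTransform:
    """Return the device's session transform, creating a new one when needed."""
    session = store.get(device_id)
    if should_replace(session, vectors, now):
        session = SessionTransform(mean=utterance_mean(vectors), created_at=now)
        store[device_id] = session
    return session

def normalize(session: SessionTransform, vectors: List[List[float]]) -> List[List[float]]:
    """Apply the transform: subtract the stored mean from every vector."""
    return [[x - m for x, m in zip(vec, session.mean)] for vec in vectors]

store: Dict[str, SessionTransform] = {}
quiet = [[0.0, 0.2], [0.2, 0.0]]
noisy = [[5.0, 5.2], [5.2, 5.0]]
first = transform_for_utterance(store, "device-1", quiet, now=0.0)
same = transform_for_utterance(store, "device-1", quiet, now=10.0)
replaced = transform_for_utterance(store, "device-1", noisy, now=20.0)
expired = transform_for_utterance(store, "device-1", noisy, now=1000.0)
```

In this usage, the second utterance reuses the existing session transform, the third triggers a replacement because its features differ sharply from the session's, and the fourth triggers one because the session window has elapsed.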
Specification