Fast algorithm for deriving acoustic prototypes for automatic speech recognition
First Claim
1. An apparatus for generating a set of acoustic prototype signals for encoding speech, said apparatus comprising:
- means for storing a model of a training script, said training script model comprising a series of word-segment models, each word-segment model being selected from a finite set of word-segment models, each word-segment model comprising a series of elementary models, each elementary model having a location in each word-segment model, each elementary model being selected from a finite set of elementary models;
means for measuring the value of at least one feature of an utterance of the training script during each of a series of time intervals spanned by the utterance of the training script to produce a series of feature vector signals, each feature vector signal having a feature value representing the value of the at least one feature of the utterance during a corresponding time interval;
means for estimating at least one path through the training script model which would produce the entire series of measured feature vector signals so as to estimate, for each feature vector signal, the corresponding elementary model in the training script model which would produce that feature vector signal;
means for clustering the feature vector signals into a plurality of clusters to form a plurality of cluster signals, each feature vector signal in a cluster corresponding to a single elementary model in a single location in a single word-segment model, each cluster signal having a cluster value equal to an average of the feature values of all of the feature vector signals in the cluster;
means for storing a plurality of prototype vector signals, each prototype vector signal corresponding to an elementary model, each prototype vector signal having an identifier and comprising at least two partition values, at least one partition value being equal to a combination of the cluster values of one or more cluster signals corresponding to the elementary model, at least one other partition value being equal to a combination of the cluster values of one or more other cluster signals corresponding to the elementary model.
Abstract
An apparatus for generating a set of acoustic prototype signals for encoding speech includes a memory for storing a training script model comprising a series of word-segment models. Each word-segment model comprises a series of elementary models. An acoustic measure is provided for measuring the value of at least one feature of an utterance of the training script during each of a series of time intervals to produce a series of feature vector signals representing the feature values of the utterance. An acoustic matcher is provided for estimating at least one path through the training script model which would produce the entire series of measured feature vector signals. From the estimated path, the elementary model in the training script model which would produce each feature vector signal is estimated. The apparatus further comprises a cluster processor for clustering the feature vector signals into a plurality of clusters. Each feature vector signal in a cluster corresponds to a single elementary model in a single location in a single word-segment model. Each cluster signal has a cluster value equal to an average of the feature values of all of the feature vector signals in the cluster. Finally, the apparatus includes a memory for storing a plurality of prototype vector signals. Each prototype vector signal corresponds to an elementary model, has an identifier, and comprises at least two partition values. The partition values are equal to combinations of the cluster values of one or more cluster signals corresponding to the elementary model.
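The clustering and prototype steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the alignment has already been computed, so each feature vector arrives tagged with a hypothetical `(word_segment, location, elementary)` triple. The names `cluster_by_context` and `build_prototypes`, and the rule used to combine clusters into partitions (sort the cluster values, split them into contiguous groups, take a count-weighted mean), are assumptions chosen for brevity; the abstract only says that a partition value is "a combination" of cluster values.

```python
from collections import defaultdict


def mean(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cluster_by_context(aligned):
    """Group feature vectors that share one elementary model, at one
    location, in one word-segment model, and average each group.

    aligned: iterable of (word_segment, location, elementary, vector).
    Returns {(word_segment, location, elementary): (cluster_value, count)}.
    """
    groups = defaultdict(list)
    for word_segment, location, elementary, vec in aligned:
        groups[(word_segment, location, elementary)].append(vec)
    return {key: (mean(vecs), len(vecs)) for key, vecs in groups.items()}


def build_prototypes(clusters, num_partitions=2):
    """Build one prototype per elementary model, keyed by its identifier.

    ASSUMPTION: clusters of an elementary model are sorted by value, split
    into contiguous groups, and each group is reduced to a count-weighted
    mean. With a single cluster only one partition can be formed.
    """
    by_elementary = defaultdict(list)
    for (_, _, elementary), (value, count) in sorted(clusters.items()):
        by_elementary[elementary].append((value, count))

    prototypes = {}
    for elementary, items in by_elementary.items():
        items.sort(key=lambda vc: vc[0])
        k = min(num_partitions, len(items))
        partitions = []
        for p in range(k):
            chunk = items[p * len(items) // k:(p + 1) * len(items) // k]
            total = sum(count for _, count in chunk)
            dim = len(chunk[0][0])
            partitions.append(
                [sum(v[d] * c for v, c in chunk) / total for d in range(dim)]
            )
        prototypes[elementary] = partitions
    return prototypes


# Toy data: elementary model "A" occurs in two different contexts.
aligned = [
    ("W1", 0, "A", [1.0, 2.0]),
    ("W1", 0, "A", [3.0, 4.0]),
    ("W2", 1, "A", [10.0, 10.0]),
    ("W1", 1, "B", [5.0, 5.0]),
]
clusters = cluster_by_context(aligned)
prototypes = build_prototypes(clusters)
```

In this toy run, "A" ends up with two partition values because it was observed in two contexts, while "B" has only one cluster and therefore only one partition.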
22 Citations
14 Claims
-
1. An apparatus for generating a set of acoustic prototype signals for encoding speech, said apparatus comprising:
-
means for storing a model of a training script, said training script model comprising a series of word-segment models, each word-segment model being selected from a finite set of word-segment models, each word-segment model comprising a series of elementary models, each elementary model having a location in each word-segment model, each elementary model being selected from a finite set of elementary models;
means for measuring the value of at least one feature of an utterance of the training script during each of a series of time intervals spanned by the utterance of the training script to produce a series of feature vector signals, each feature vector signal having a feature value representing the value of the at least one feature of the utterance during a corresponding time interval;
means for estimating at least one path through the training script model which would produce the entire series of measured feature vector signals so as to estimate, for each feature vector signal, the corresponding elementary model in the training script model which would produce that feature vector signal;
means for clustering the feature vector signals into a plurality of clusters to form a plurality of cluster signals, each feature vector signal in a cluster corresponding to a single elementary model in a single location in a single word-segment model, each cluster signal having a cluster value equal to an average of the feature values of all of the feature vector signals in the cluster;
means for storing a plurality of prototype vector signals, each prototype vector signal corresponding to an elementary model, each prototype vector signal having an identifier and comprising at least two partition values, at least one partition value being equal to a combination of the cluster values of one or more cluster signals corresponding to the elementary model, at least one other partition value being equal to a combination of the cluster values of one or more other cluster signals corresponding to the elementary model.
- View Dependent Claims (2, 3, 4, 5)
-
-
6. A method of generating a set of acoustic prototype signals for encoding speech, said method comprising the steps of:
-
storing a model of a training script, said training script model comprising a series of word-segment models, each word-segment model being selected from a finite set of word-segment models, each word-segment model comprising a series of elementary models, each elementary model having a location in each word-segment model, each elementary model being selected from a finite set of elementary models;
measuring the value of at least one feature of an utterance of the training script during each of a series of time intervals spanned by the utterance of the training script to produce a series of feature vector signals, each feature vector signal having a feature value representing the value of the at least one feature of the utterance during a corresponding time interval;
estimating at least one path through the training script model which would produce the entire series of measured feature vector signals so as to estimate, for each feature vector signal, the corresponding elementary model in the training script model which would produce that feature vector signal;
clustering the feature vector signals into a plurality of clusters to form a plurality of cluster signals, each feature vector signal in a cluster corresponding to a single elementary model in a single location in a single word-segment model, each cluster signal having a cluster value equal to an average of the feature values of all of the feature vector signals in the cluster;
storing a plurality of prototype vector signals, each prototype vector signal corresponding to an elementary model, each prototype vector signal having an identifier and comprising at least two partition values, at least one partition value being equal to a combination of the cluster values of one or more cluster signals corresponding to the elementary model, at least one other partition value being equal to a combination of the cluster values of one or more other cluster signals corresponding to the elementary model.
- View Dependent Claims (7, 8, 9, 10)
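The path-estimation step of the method can be pictured as a forced alignment. The sketch below is one possible reading, not the method the claims prescribe. It makes several simplifying assumptions: the training script model is flattened into a left-to-right chain of elementary models, each elementary model is represented by a single hypothetical mean vector (`state_means`), the only moves allowed are "stay" or "advance by one", and the cost is squared Euclidean distance with no transition penalties. The function name `forced_align` is illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def forced_align(features, state_means):
    """Return, for each feature vector, the index of the elementary model
    on the lowest-cost left-to-right path through the whole chain.

    ASSUMPTION: one mean vector per elementary model and distance-only
    costs; a real acoustic matcher would use probabilistic models.
    """
    T, S = len(features), len(state_means)
    if T < S:
        raise ValueError("need at least one feature vector per elementary model")

    INF = float("inf")
    cost = [[INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    cost[0][0] = sq_dist(features[0], state_means[0])  # path starts in state 0

    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1][s]
            advance = cost[t - 1][s - 1] if s > 0 else INF
            if advance < stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best < INF:
                cost[t][s] = best + sq_dist(features[t], state_means[s])

    # Trace back from the final state so the whole chain is traversed.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy utterance: five one-dimensional frames, three elementary models.
features = [[0.0], [0.1], [1.0], [0.9], [2.0]]
state_means = [[0.0], [1.0], [2.0]]
path = forced_align(features, state_means)
```

Each entry of `path` names the elementary model estimated to have produced the corresponding feature vector, which is the information the clustering step needs.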
-
-
11. A speech recognition apparatus comprising:
-
means for measuring the value of at least one feature of an utterance of a word to be recognized during each of a series of time intervals spanned by the utterance of the word to be recognized to produce a series of feature vector signals, each feature vector signal having a feature value representing the value of at least one feature of the utterance during a corresponding time interval;
means for storing a set of a plurality of prototype vector signals, each prototype vector signal having an identifier and a prototype value;
means for comparing the value of each feature vector signal to the prototype value of each prototype vector signal to identify the best matched prototype vector signal associated with each feature vector signal to produce a series of associated prototype vector identifier signals;
means for storing a plurality of acoustic word models;
means for comparing the series of associated prototype vector identifier signals with each of the acoustic word models to estimate the one or more words which most likely correspond to the series of associated prototype vector identifier signals; and
a display for displaying at least one of the one or more words which most likely correspond to the series of associated prototype vector identifier signals;
characterized in that the apparatus further comprises means for generating the set of prototype vector signals, said means for generating comprising:
means for storing a model of a training script, said training script model comprising a series of word-segment models, each word-segment model being selected from a finite set of word-segment models, each word-segment model comprising a series of elementary models, each elementary model having a location in each word-segment model, each elementary model being selected from a finite set of elementary models;
means for measuring the value of at least one feature of an utterance of the training script during each of a series of time intervals spanned by the utterance of the training script to produce a series of feature vector signals, each feature vector signal having a feature value representing the value of the at least one feature of the utterance during a corresponding time interval;
means for estimating at least one path through the training script model which would produce the entire series of measured feature vector signals so as to estimate, for each feature vector signal, the corresponding elementary model in the training script model which would produce that feature vector signal;
means for clustering the feature vector signals into a plurality of clusters to form a plurality of cluster signals, each feature vector signal in a cluster corresponding to a single elementary model in a single location in a single word-segment model, each cluster signal having a cluster value equal to an average of the feature values of all of the feature vector signals in the cluster;
means for storing a plurality of prototype vector signals, each prototype vector signal corresponding to an elementary model, each prototype vector signal having an identifier and comprising at least two partition values, at least one partition value being equal to a combination of the cluster values of one or more cluster signals corresponding to the elementary model, at least one other partition value being equal to a combination of the cluster values of one or more other cluster signals corresponding to the elementary model.
- View Dependent Claims (12, 13, 14)
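The comparison of feature vectors against stored prototypes in the recognition apparatus can be sketched as a nearest-prototype labeling pass. This is an illustration under stated assumptions rather than the claimed matcher: a prototype is taken to be a list of partition values keyed by its identifier, the match score of a prototype is assumed to be the squared Euclidean distance to its closest partition, and the name `label_features` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label_features(features, prototypes):
    """Map each feature vector to the identifier of its best-matched prototype.

    prototypes: {identifier: [partition_value, ...]}
    ASSUMPTION: a prototype's score is the distance to its nearest
    partition value; the source does not specify the match score.
    """
    labels = []
    for vec in features:
        best_id = min(
            prototypes,
            key=lambda pid: min(sq_dist(vec, part) for part in prototypes[pid]),
        )
        labels.append(best_id)
    return labels


# Toy prototypes: "A" has two partition values, "B" has one.
prototypes = {
    "A": [[0.0, 0.0], [1.0, 1.0]],
    "B": [[5.0, 5.0]],
}
features = [[0.9, 1.1], [4.0, 6.0], [0.1, -0.2]]
labels = label_features(features, prototypes)
```

The resulting identifier series is what would then be compared against the acoustic word models to estimate the most likely words.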
-
Specification