Multiple template speech recognition system
First Claim
1. A circuit for recognizing an unknown utterance as one of a set of reference words comprising means responsive to each of a plurality of utterances of a reference word for generating a set of signals representative of the features of said utterance;
- means responsive to the feature signal sets of each reference word for generating at least one template signal, each template signal being representative of a group of said reference word feature signal sets;
means responsive to the unknown utterance for generating a set of signals representative of the features of said unknown utterance;
means jointly responsive to said unknown utterance feature signal set and each reference word template signal for forming a set of signals each representative of the similarity between said unknown utterance feature signal set and said reference word template signal;
characterized in that selection means (130) are responsive to the similarity signals for each reference word to select a plurality of said reference word similarity signals;
averaging means (135) are adapted to form a signal corresponding to the average of said selected similarity signals for each reference word; and
identifying apparatus (140, 145) is responsive to the average similarity signals for said reference words to identify said unknown utterance as the most similar reference word.
Abstract
A speech analyzer for recognizing an unknown utterance as one of a set of reference words is adapted to generate a feature signal set for each utterance of every reference word. At least one template signal is produced for each reference word which template signal is representative of a group of feature signal sets. Responsive to a feature signal set formed from the unknown utterance and each reference word template signal, a signal representative of the similarity between the unknown utterance and the template signal is generated. A plurality of similarity signals for each reference word is selected and a signal corresponding to the average of said selected similarity signals is formed. The average similarity signals are compared to identify the unknown utterance as the most similar reference word. Features of the invention include: template formation by successive clustering involving partitioning feature signal sets into groups of predetermined similarity by centerpoint clustering, and recognition by comparing the average of selected similarity measures of a time-warped unknown feature signal set with the cluster-derived reference templates for each vocabulary word.
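The selection, averaging, and comparison steps summarized above can be illustrated with a minimal Python sketch. It assumes each similarity signal is expressed as a distance (smaller means more similar) and that the "plurality" selected for each word is the k smallest distances; the abstract does not fix either choice, and the names `recognize`, `distances_by_word`, and `k` are hypothetical.

```python
from statistics import mean


def recognize(distances_by_word, k=2):
    """Pick the reference word whose k best template distances average lowest.

    distances_by_word maps each reference word to the distances between the
    unknown utterance and every template of that word (assumed layout).
    """
    averages = {}
    for word, distances in distances_by_word.items():
        # Assumption: "selecting a plurality" means keeping the k smallest distances.
        selected = sorted(distances)[:k]
        averages[word] = mean(selected)
    # The most similar word is the one with the lowest average distance.
    best_word = min(averages, key=averages.get)
    return best_word, averages


if __name__ == "__main__":
    word, scores = recognize({"yes": [0.2, 0.9, 0.4], "no": [0.5, 0.6, 0.3]})
    print(word, scores)
```

Averaging several of the closest templates, rather than trusting the single closest one, makes the decision less sensitive to one template that happens to match an unrelated utterance.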
215 Citations
26 Claims
-
1. A circuit for recognizing an unknown utterance as one of a set of reference words comprising means responsive to each of a plurality of utterances of a reference word for generating a set of signals representative of the features of said utterance;
- means responsive to the feature signal sets of each reference word for generating at least one template signal, each template signal being representative of a group of said reference word feature signal sets;
means responsive to the unknown utterance for generating a set of signals representative of the features of said unknown utterance;
means jointly responsive to said unknown utterance feature signal set and each reference word template signal for forming a set of signals each representative of the similarity between said unknown utterance feature signal set and said reference word template signal;
characterized in that selection means (130) are responsive to the similarity signals for each reference word to select a plurality of said reference word similarity signals;
averaging means (135) are adapted to form a signal corresponding to the average of said selected similarity signals for each reference word; and
identifying apparatus (140, 145) is responsive to the average similarity signals for said reference words to identify said unknown utterance as the most similar reference word. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
10. A method of recognizing an unknown utterance as one of a set of reference words comprising the steps of generating a set of feature signals representative of each of a plurality of utterances of each reference word;
- generating at least one template signal for each reference word responsive to the feature signal sets of said reference word, each template signal being representative of a group of said reference word feature signal sets;
responsive to the unknown utterance, generating a set of signals representative of the features of said unknown utterance;
jointly responsive to said unknown utterance feature signal set and each reference word template signal, forming a set of signals each representative of the similarity between said unknown utterance feature signal set and said template signal;
characterized in that a plurality of similarity signals for each reference word is selected;
a signal corresponding to the average of each reference word selected similarity signals is formed; and
responsive to the average signals for all reference words, the unknown utterance is identified as the most similar reference word. - View Dependent Claims (11, 12, 13, 14, 15, 16)
-
17. A speech recognition circuit for identifying an unknown utterance as one of a set of reference words comprising means responsive to each of a plurality of utterances of a reference word for generating a first signal representative of the prediction parameters of said utterance;
- means responsive to the first signals of each reference word for generating at least one template signal for said reference word, each template signal being representative of a group of the reference word first signals;
means responsive to the unknown utterance for generating a second signal representative of the prediction parameters of said unknown utterance;
means jointly responsive to the template signals of each reference word and the second signal for forming a set of signals each representative of the distance between said second signal and said reference word template signal; and
means responsive to said reference word distance signals for identifying said unknown utterance as the reference word having the minimum distance signals characterized in that said template signal generating means (112) further comprises means (222, 224) responsive to the first signals of each reference word for generating and storing a set of signals each representative of the distance between a pair of said reference word first signals;
means (222, 225, 226, 228, 230) responsive to said stored distance signals for successively partitioning the first signals of each reference word into clusters, the first signals of each cluster having a predetermined degree of similarity; and
means (216, 230, 600) responsive to said distance signals for determining the centermost first signal of each cluster and for identifying said centermost first signal as the cluster template signal. - View Dependent Claims (18, 19)
-
20. A method for identifying an unknown utterance as one of a set of reference words comprising the steps of generating a first signal representative of the prediction parameters of each of a plurality of utterances of a reference word;
- generating at least one template signal for each reference word responsive to the reference word first signals, each template signal being representative of a group of the reference word first signals;
generating a second signal representative of the prediction parameters of said unknown utterance;
jointly responsive to the template signals of each reference word and the second signal, forming a set of signals each representative of the distance between said second signal and said reference word template signal;
responsive to the distance signals of all reference words, identifying the unknown utterance as the reference word having the minimum distance signals characterized in that said template signal generation for each reference word includes generating and storing a set of signals each representative of the distance between a pair of reference word first signals responsive to the first signals of said reference word; and
successively partitioning the first signals of said reference word into clusters responsive to the stored reference word distance signals, the first signals of each cluster having a predetermined degree of similarity;
determining the centermost first signal of each cluster responsive to said stored reference word distance signals; and
identifying said centermost first signal as the cluster template signal. - View Dependent Claims (21, 22)
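The template-generation steps recited in claim 20 (store pairwise distances, partition into clusters, take the centermost member as the template) can be sketched as follows. This is an illustrative reading only: it assumes "centermost" means the member whose largest distance to any other member is smallest, that a cluster is every remaining item within a fixed `threshold` of a seed, and that the function names are hypothetical. The claim itself does not specify these details.

```python
def pairwise_distances(features, distance):
    """Compute and store the distance between every pair of feature sets."""
    return [[distance(a, b) for b in features] for a in features]


def centermost(members, dist):
    """Return the member whose worst-case distance to the others is smallest."""
    return min(members, key=lambda i: max(dist[i][j] for j in members))


def cluster_templates(dist, threshold):
    """Successively split items into clusters and pick one template per cluster.

    Returns a list of (template_index, member_indices) pairs.
    """
    remaining = list(range(len(dist)))
    clusters = []
    while remaining:
        # Seed the next cluster at the centermost of what is still unassigned.
        seed = centermost(remaining, dist)
        # Assumption: "predetermined degree of similarity" is a distance threshold.
        members = [j for j in remaining if dist[seed][j] <= threshold]
        # The centermost member of the cluster serves as its template.
        template = centermost(members, dist)
        clusters.append((template, members))
        remaining = [j for j in remaining if j not in members]
    return clusters


if __name__ == "__main__":
    # Toy one-dimensional "feature sets"; a real system would compare frame sequences.
    points = [0, 1, 2, 20, 21, 22, 23, 24]
    dist = pairwise_distances(points, lambda a, b: abs(a - b))
    print(cluster_templates(dist, threshold=4))
```

Because the template is always one of the recorded utterances rather than an average of them, it remains a feature set that a real speaker actually produced.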
-
23. A speech recognition circuit for identifying an unknown utterance as one of a set of reference words comprising:
- means responsive to each of a plurality of utterances of a reference word for generating a first signal representative of the prediction parameters of said utterance;
means responsive to the first signals of each reference word for generating at least one template signal for each reference word, each template signal being representative of a group of reference word first signals;
means responsive to the unknown utterance for generating a second signal representative of the prediction parameters of said unknown utterance;
means jointly responsive to the template signals of each reference word and the second signal for forming a set of signals each representative of the distance between the second signal and said reference word template signal; and
means responsive to said reference word distance signals for identifying said unknown utterance as the reference word having the minimum distance signals;
characterized in that said distance representative signal forming means (1803, 1806, 1810, 1815, 1817, 1820) further comprises means (205) responsive to said unknown utterance for determining the number of frames to the endpoint frame of the unknown utterance;
means (880, 1806) for generating a third signal corresponding to the average frame distance between said second signal and said template prediction parameter signals until said endpoint frame of said unknown utterance, for determining the intermediate frame of the unknown utterance at which the speech signal energy of the unknown utterance from said intermediate frame to said endpoint frame is a predetermined portion of the total speech signal energy of the unknown utterance, and for generating a fourth signal corresponding to the average frame distance between said second signal and said template prediction parameter signals until said intermediate frame; and
means (1817, 1820) for selecting the minimum of said third and fourth signals as said distance representative signal. - View Dependent Claims (24)
-
25. A method for identifying an unknown utterance as one of a set of reference words comprising the steps of generating a first signal representative of the prediction parameters of each of a plurality of utterances of each reference word;
- generating at least one template signal for each reference word responsive to the reference word first signals, each template signal being representative of a group of the reference word first signals;
generating a second signal representative of the prediction parameters of said unknown utterance;
jointly responsive to the template signals of each reference word and the second signal, forming a set of signals each representative of the distance between said second signal and said reference word template signal;
responsive to the distance signals of all reference words, identifying the unknown utterance as the reference word having the minimum distance signals characterized in that the step of forming a set of signals each representative of the distance between said second signal and said reference template signal comprises the steps of determining the endpoint frame of the unknown utterance;
generating a third signal corresponding to the average frame distance between said second signal and said template prediction parameter signals until said endpoint frame;
determining the intermediate frame of said unknown utterance at which the unknown utterance speech signal energy from said intermediate frame to said endpoint frame is a predetermined portion of the total speech signal energy of said unknown utterance;
generating a fourth signal corresponding to the average frame distance between the second signal and the template prediction parameter signals until said intermediate frame; and
selecting the minimum of said third and fourth signals as said distance representative signal. - View Dependent Claims (26)
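The distance computation recited in claim 25 can be sketched in Python as below. The sketch assumes the per-frame distances between the unknown utterance and a template are already available as a list, that the intermediate frame is the latest frame whose trailing energy reaches the given `portion` of the total, and that "until said intermediate frame" includes that frame. These are interpretive assumptions, and the function names and the default `portion` value are hypothetical.

```python
from statistics import mean


def intermediate_frame(energies, portion):
    """Find the latest frame m where energy from m to the endpoint reaches
    `portion` of the utterance's total energy."""
    total = sum(energies)
    tail = 0.0
    for m in range(len(energies) - 1, -1, -1):
        tail += energies[m]
        if tail >= portion * total:
            return m
    return 0


def utterance_distance(frame_distances, energies, portion=0.1):
    """Return the smaller of two average frame distances.

    third:  average over all frames up to the endpoint frame.
    fourth: average over frames up to the intermediate frame (assumed inclusive).
    """
    third = mean(frame_distances)
    m = intermediate_frame(energies, portion)
    fourth = mean(frame_distances[: m + 1])
    return min(third, fourth)


if __name__ == "__main__":
    frame_distances = [1.0, 1.0, 1.0, 4.0, 6.0]
    energies = [10, 10, 10, 2, 1]
    print(utterance_distance(frame_distances, energies, portion=0.05))
```

Taking the minimum of the two averages means a low-energy tail that matches the template poorly (for example, a trailing breath) does not by itself inflate the distance assigned to the utterance.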
Specification