Word model candidate preselection for speech recognition using precomputed matrix of thresholded distance values
Abstract
In the large vocabulary speech recognition system disclosed herein, a preliminary screening of vocabulary models is provided by applying high speed distance measuring functions. The distance measuring functions utilize subsampled or otherwise reduced representations of the unknown speech segment and the vocabulary models. The initial screening functions achieve very high speed by precalculating, for each utterance, a comparison table of distance values which can be used for all vocabulary models. The building of each comparison table is facilitated by a method which utilizes default values as initial entries and only adjusts entries which are meaningfully different from the default value.
10 Claims
1. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
providing a precalculated matrix of distance metrics relating said prototype frames with said prototype states;
thresholding said matrix by assigning a default value to metrics which do not meet a preselected criterion for being meaningful;
for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
for each input utterance, generating a fine sequence of prototype frames and a coarse set of input representative frames selected from said fine sequence, the number of representatives being a minor fraction of the number of frames in the corresponding fine sequence of frames and being distributed in position along said fine sequence;
for each input utterance, generating a temporary matrix of distance metrics relating each of said sequence of input representatives to said states by performing the following steps;
(a) setting all entries in said temporary matrix to the default value;
(b) sequentially scanning said input representatives to locate the corresponding lists for included prototype states;
(c) adjusting those entries in said temporary matrix which are included in said corresponding lists; and
subsampling at least a selected portion of said vocabulary models and scoring the subsampled prototype states from said selected models using distance metrics obtained from said temporary matrix, the scoring providing a basis for preselection of candidate models for further processing.
Dependent claims: 2, 3, 4, 5
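The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not part of the claim language and is not the patented implementation. The function names, the default value of 255, the "smaller distance is more meaningful" cutoff test, the evenly spaced choice of representatives, and the position-by-position summation used for scoring are all assumptions made for illustration.

```python
DEFAULT = 255  # assumed default distance for pairs that are not "meaningful"


def threshold_matrix(dist, cutoff, default=DEFAULT):
    """Threshold a precalculated frame-by-state distance matrix.

    Returns the thresholded matrix and, for each prototype frame, a list of
    (state, distance) pairs whose distance is below `cutoff` (assumed criterion).
    """
    thresholded, lists = [], []
    for row in dist:  # one row per prototype frame
        thresholded.append([d if d < cutoff else default for d in row])
        lists.append([(s, d) for s, d in enumerate(row) if d < cutoff])
    return thresholded, lists


def subsample(sequence, count):
    """Pick `count` items spread evenly along `sequence` (assumed scheme)."""
    step = len(sequence) / count
    return [sequence[int(i * step + step / 2)] for i in range(count)]


def build_temp_matrix(representatives, lists, num_states, default=DEFAULT):
    """Build the per-utterance matrix: representatives by prototype states."""
    # (a) start with every entry at the default value
    temp = [[default] * num_states for _ in representatives]
    # (b) scan the representatives and look up each one's list
    for i, frame in enumerate(representatives):
        # (c) overwrite only the entries that appear in that list
        for state, d in lists[frame]:
            temp[i][state] = d
    return temp


def score_model(model_states, temp):
    """Subsample a model to the matrix length and sum distances in order."""
    coarse = subsample(model_states, len(temp))
    return sum(temp[i][s] for i, s in enumerate(coarse))
```

Because only listed entries are overwritten, the cost of building the temporary matrix in this sketch grows with the number of meaningful entries rather than with the full number of prototype states. A lower score indicates a better candidate under the assumed distance convention.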
6. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
providing a precalculated matrix of distance metrics relating said prototype frames with said prototype states;
thresholding said matrix by assigning a default value to metrics which do not meet a preselected criteria for being meaningful;
for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
for each input utterance, generating a fine sequence of prototype frames;
dividing said fine sequence into a series of equal segments thereby to obtain a coarse set of input sample positions along said fine sequence, the number of sample positions being a minor fraction of the number of frames in the corresponding fine sequence of frames;
for each input utterance, generating a temporary matrix of distance metrics relating each of said sequence of input sample positions to said states by performing the following steps;
(a) setting all entries in said temporary matrix to the default value;
(b) sequentially scanning a predetermined number of input frames adjacent to and including each input sample position to locate the corresponding lists for included prototype states;
(c) determining the one of said predetermined number of frames which best matches each included prototype state; and
(d) adjusting those entries in said temporary matrix which correspond to said best matches; and
subsampling at least a selected portion of said vocabulary models and scoring the subsampled prototype states from said selected models using distance metrics obtained from said temporary matrix, the scoring providing a basis for preselection of candidate models for further processing.
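Steps (a) through (d) of claim 6 differ from claim 1 in that several neighboring frames are examined around each sample position and only the best match per state is kept. The following Python sketch illustrates one way this could look. The function name, the symmetric `window` parameter, the default of 255, and the convention that the smallest distance is the best match are assumptions for illustration, not details taken from the claim.

```python
def build_windowed_matrix(frames, lists, num_states, num_positions,
                          window, default=255):
    """Build a sample-position-by-state matrix keeping the best nearby match.

    frames        -- fine sequence of prototype frame indices for the utterance
    lists         -- lists[frame] is a list of (state, distance) pairs that
                     passed thresholding for that prototype frame
    num_positions -- number of equal segments (one sample position each)
    window        -- assumed number of frames examined on each side of a
                     sample position
    """
    segment = len(frames) / num_positions
    # (a) start with every entry at the default value
    temp = [[default] * num_states for _ in range(num_positions)]
    for p in range(num_positions):
        center = int(p * segment + segment / 2)  # assumed: middle of segment
        lo = max(0, center - window)
        hi = min(len(frames), center + window + 1)
        # (b) scan the frames adjacent to and including the sample position
        for frame in frames[lo:hi]:
            for state, d in lists[frame]:
                # (c), (d) keep the smallest distance seen for this state
                if d < temp[p][state]:
                    temp[p][state] = d
    return temp
```

Taking the minimum over a small neighborhood makes each entry tolerant of small timing shifts between the utterance and a model, at the cost of a less discriminating score.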
7. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
precalculating a matrix of distance metrics relating said prototype frames with said prototype states;
thresholding said matrix by assigning a default value to metrics which do not meet a preselected criterion for being meaningful;
for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
for each input utterance, generating a fine sequence of prototype frames and a coarse set of a predetermined number of input representative frames selected from said fine sequence, the predetermined number of representatives being a minor fraction of the number of frames in the corresponding fine sequence of frames;
for each input utterance, generating a temporary matrix of distance metrics relating each of said sequence of input representatives to said states by performing the following steps;
(a) setting all entries in said temporary matrix to the default value;
(b) sequentially scanning said input representatives to locate the corresponding lists for included prototype states;
(c) adjusting those entries in said temporary matrix which are included in said corresponding lists; and
for each model to be considered, subsampling the corresponding fine sequence of states to obtain a respective coarse sequence comprising a predetermined number of states;
said predetermined numbers together defining a comparison matrix, there being a preselected region within said matrix which is examined by said method;
for each state in said limited collection, determining for each state position in said comparison matrix the input representative which provides the best match with that state, considering and examining only frames which lie within said preselected region, a measure of the match being stored in a table;
calculating, using said table, for each model to be considered a value representing the overall match of said coarse sequence of frames with the respective coarse sequence of states;
preselecting for accurate comparison those models with the better overall match values as so calculated.
Dependent claims: 8
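The table-based scoring at the end of claim 7 can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: the preselected region is modeled as a diagonal band of half-width `band`, smaller distances are treated as better matches, the overall match value is a plain sum of table entries, and the names `build_best_match_table` and `preselect` are hypothetical.

```python
def build_best_match_table(temp, num_positions, band, default=255):
    """table[j][state] = best distance for `state` at model position j.

    Only input representatives i with |i - j| <= band are examined; this
    diagonal band stands in for the claim's preselected region (assumption).
    """
    num_inputs = len(temp)
    num_states = len(temp[0])
    table = []
    for j in range(num_positions):
        lo = max(0, j - band)
        hi = min(num_inputs, j + band + 1)
        table.append([
            min((temp[i][s] for i in range(lo, hi)), default=default)
            for s in range(num_states)
        ])
    return table


def preselect(models, table, keep):
    """Return the `keep` model names with the lowest summed table values.

    models maps a name to its coarse state sequence, one state per position.
    """
    scored = sorted(
        (sum(table[j][s] for j, s in enumerate(coarse)), name)
        for name, coarse in models.items()
    )
    return [name for _, name in scored[:keep]]
```

Because the table is computed once per utterance and is indexed only by position and state, scoring each model in this sketch reduces to one lookup and one addition per coarse state.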
9. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
providing a precalculated matrix of distance metrics relating said prototype frames with said prototype states;
thresholding said matrix by assigning a default value to metrics which do not meet a preselected criteria for being meaningful;
for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
for each input utterance, generating a fine sequence of prototype frames;
dividing said fine sequence into a series of equal segments thereby to obtain a coarse set of input sample positions along said fine sequence, the number of sample positions being a minor fraction of the number of frames in the corresponding fine sequence of frames;
for each input utterance, generating a first temporary matrix of distance metrics relating each of said sequence of input sample positions to said states by performing the following steps;
(a) setting all entries in said first temporary matrix to a default value;
(b) sequentially scanning said input representatives to locate the corresponding lists for included prototype states;
(c) adjusting those entries in said temporary matrix which are included in said corresponding lists;
for each input utterance, also generating a second temporary matrix of distance metrics relating each of said sequence of input sample positions to said states by performing the following steps;
(d) setting all entries in said second temporary matrix to a default value;
(e) sequentially scanning a predetermined number of input frames adjacent to and including each input sample position to locate the corresponding lists for included prototype states;
(f) determining the one of said predetermined number of frames which best matches each included prototype state; and
(g) adjusting those entries in said temporary matrix which correspond to said best matches; and
subsampling at least a selected portion of said vocabulary models;
scoring the subsampled prototype states from said selected models first using distance metrics obtained from said second temporary matrix; and
selecting a group of the models scoring higher using said second matrix for scoring using distance metrics obtained from said first matrix.
Dependent claims: 10
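The two-pass selection that closes claim 9 can be sketched in Python as follows. This is a hedged illustration rather than the claimed method itself: the names `linear_score` and `two_stage_preselect`, the `keep_first_pass` and `keep_final` cutoffs, and the reading of "scoring higher" as "smaller summed distance" are assumptions, and both matrices are taken as already built.

```python
def linear_score(coarse_states, matrix):
    """Sum the distance for each coarse state at its matching position."""
    return sum(matrix[i][s] for i, s in enumerate(coarse_states))


def two_stage_preselect(models, first_matrix, second_matrix,
                        keep_first_pass, keep_final):
    """Screen with the second (windowed) matrix, then rescore with the first.

    models maps a name to its subsampled state sequence. A smaller summed
    distance is treated as a better score (assumption).
    """
    # Pass 1: rank every model using the second temporary matrix.
    by_second = sorted(
        models, key=lambda name: linear_score(models[name], second_matrix))
    survivors = by_second[:keep_first_pass]
    # Pass 2: rank only the survivors using the first temporary matrix.
    by_first = sorted(
        survivors, key=lambda name: linear_score(models[name], first_matrix))
    return by_first[:keep_final]
```

In this arrangement the more forgiving windowed matrix prunes the vocabulary first, so the stricter per-position matrix only has to score the models that survive.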
Specification