System for processing a succession of utterances spoken in continuous or discrete form
Abstract
The system of the invention relates to continuous speech pre-filtering systems for use in discrete and continuous speech recognition computer systems. The speech to be recognized is converted from utterances to frame data sets, which frame data sets are smoothed to generate a smooth frame model over a predetermined number of frames. A resident vocabulary is stored within the computer as clusters of word models which are acoustically similar over a succession of frame periods. A cluster score is generated by the system, which score includes the likelihood of the smooth frames evaluated using a probability model for the cluster against which the smooth frame model is being compared. Cluster sets having cluster scores below a predetermined acoustic threshold are removed from further consideration. The remaining cluster sets are unpacked for determination of a word score for each unpacked word. These word scores are used to identify those words which are above a second predetermined threshold to define a word list which is sent to a recognizer for a more lengthy word match. Control means enable the system to initialize times corresponding to the frame start time for each frame data set, defining a sliding window.
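The two-stage filtering described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: smoothing is assumed to be a plain average over non-overlapping groups of frames, the probability model is assumed to be a diagonal Gaussian per smooth frame, and the names `smooth`, `sequence_score`, `prefilter`, and the dictionary layout of a cluster are all hypothetical.

```python
import math

def smooth(frames, width):
    """Average non-overlapping groups of `width` frames (assumed smoothing)."""
    out = []
    for i in range(0, len(frames) - width + 1, width):
        group = frames[i:i + width]
        out.append([sum(col) / width for col in zip(*group)])
    return out

def log_likelihood(frame, mean, var):
    """Log density of a diagonal Gaussian (assumed probability model)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

def sequence_score(smooth_frames, model):
    """Sum of per-frame log-likelihoods; a higher score is a better match."""
    return sum(
        log_likelihood(f, mean, var)
        for f, (mean, var) in zip(smooth_frames, model)
    )

def prefilter(frames, clusters, width, cluster_threshold, word_threshold):
    """Return the word list that would be passed on to the recognizer."""
    smooth_frames = smooth(frames, width)
    word_list = []
    for cluster in clusters:
        # Stage 1: drop the whole cluster if its score is below the threshold.
        if sequence_score(smooth_frames, cluster["model"]) < cluster_threshold:
            continue
        # Stage 2: unpack the surviving cluster and score each word.
        for word, model in cluster["words"].items():
            if sequence_score(smooth_frames, model) > word_threshold:
                word_list.append(word)
    return word_list

# Toy data: four one-dimensional frames and two hypothetical clusters.
frames = [[0.0], [0.2], [1.0], [1.2]]
clusters = [
    {
        "model": [([0.0], [1.0]), ([1.0], [1.0])],
        "words": {
            "cat": [([0.1], [1.0]), ([1.1], [1.0])],
            "cot": [([2.0], [1.0]), ([3.0], [1.0])],
        },
    },
    {
        "model": [([5.0], [1.0]), ([6.0], [1.0])],
        "words": {"dog": [([5.0], [1.0]), ([6.0], [1.0])]},
    },
]
result = prefilter(frames, clusters, width=2,
                   cluster_threshold=-5.0, word_threshold=-3.0)
print(result)  # ['cat']
```

The second cluster is rejected at the cluster stage, so "dog" is never scored individually. That saving is the point of grouping acoustically similar words before the more expensive word match.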
27 Claims
1. A system for processing speech, said speech including a succession of utterances spoken in any of continuous and discrete form, comprising:
A. means for storing a plurality of word models;
B. means for identifying a succession of temporal segments of said utterances spoken in any of continuous and discrete form;
C. means selectively operable on ones of said segments for identifying a subset of said plurality of word models meeting predetermined criteria, said subset defining a list of candidate words;
D. control means for determining arbitrarily selected frame start times t during said utterances spoken in any of continuous and discrete form, said frame start times being independent of identification of an initial anchor; and
E. means for generating a signal representative of said list of candidate words for selected ones of said frame start times t determined in step D.
2. A prefiltering system for processing speech, said speech including a succession of utterances spoken in any of continuous and discrete form, comprising:
A. cluster data storage means for storing a plurality of M cluster data sets, C1, . . . , CM, where M is an integer greater than 1, each of said cluster data sets including data representative of a plurality of word models;
B. frame data means for generating a succession of w frame data sets vt, vt+1, . . . vt+w-1, beginning at a frame start time t during said succession of utterances spoken in any of continuous and discrete form, where w is an integer greater than 1, said succession of frame data sets being representative of a corresponding succession of temporal segments of said utterances spoken in any of continuous and discrete form, each of said frame data sets including k values representative of different frame parameters, where k≧1;
C. data reduction means selectively operable on said w frame data sets for generating s reduced frame data sets Y1, Y2, . . . , Ys, where s<w, each of said reduced frame data sets being related to an associated plurality of said frame data sets and including j values representative of different reduced frame data set parameters;
D. scoring means for evaluating each of said reduced frame data sets against succession of said cluster data sets to generate a cluster score SY for each of said cluster data sets;
E. selectively operable identifying means for identifying each of said word models of said cluster data sets having a cluster score bearing a predetermined relation to at least one threshold score T, said identified word models defining a candidate word list;
F. control means for determining said frame start times t, where successive start times t are spaced apart arbitrarily, said frame start times being independent of identification of an initial anchor; and
G. means for generating a signal representative of said candidate word list for preselected ones of said frame start times t determined by said control means.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
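Element C of claim 2 only requires that w frame data sets be reduced to s < w sets, each related to several of the originals. One simple way to satisfy that, assumed here purely for illustration, is to average contiguous groups of frames, which gives j = k values per reduced set. The function name `reduce_frames` is hypothetical.

```python
def reduce_frames(frames, s):
    """Reduce w frame data sets to s reduced sets by averaging contiguous groups.

    Assumption: averaging is only one possible reduction; the claim does not
    prescribe it. Each reduced set keeps all k parameters, so j == k.
    When w >= 2 * s, every group spans at least two frames.
    """
    w = len(frames)
    if not 1 <= s < w:
        raise ValueError("need 1 <= s < w")
    reduced = []
    for i in range(s):
        lo = i * w // s          # first frame of group i
        hi = (i + 1) * w // s    # one past the last frame of group i
        group = frames[lo:hi]
        reduced.append([sum(col) / len(group) for col in zip(*group)])
    return reduced

# w = 12 frames with k = 2 parameters each, reduced to s = 3 sets.
frames_v = [[float(t), 2.0 * t] for t in range(12)]
reduced_y = reduce_frames(frames_v, 3)
print(reduced_y)  # [[1.5, 3.0], [5.5, 11.0], [9.5, 19.0]]
```

Scoring three reduced sets instead of twelve frames against every cluster cuts the per-cluster work to a quarter in this toy case, at the cost of temporal resolution.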
22. A speech processing method for processing speech including a succession of utterances spoken in any of continuous and discrete form comprising the steps of:
A. storing a plurality of M cluster data sets, C1, . . . , CM, where M is an integer greater than 1, each of said cluster data sets including data representative of a plurality of word models;
B. generating a succession of w frame data sets vt, vt+1, . . . vt+w-1, beginning at a frame start time t during said succession of utterances spoken in any of continuous and discrete form, where w is an integer greater than 1, each of said frame data sets being representative of successive acoustic segments of utterances spoken in any of continuous and discrete form for a frame period, each of said frame data sets including k values representative of different frame parameters where k≧1;
C. reducing w of said frame data sets to generate s reduced frame data sets, Y1, Y2, . . . , Ys where s<w, each of said reduced frame data sets being related to an associated plurality of said frame data sets and including j values related to the k values of said associated frame data sets, where j≦k;
D. evaluating said reduced frame data sets with a succession of said cluster data sets to generate a cluster score SY for each of said cluster data sets;
E. identifying each of said word models having a cluster score bearing a predetermined relation to at least one threshold score T, said identified word models defining a word list;
F. determining said frame start times t, where successive start times t are identified at arbitrarily selected intervals; said frame start times being independent of identification of an initial anchor; and
G. generating a signal representative of said candidate word list for selected ones of said frame start times determined in step F.
Dependent claims: 23, 24, 25, 26
27. A prefiltering method for processing speech, said speech including a succession of utterances spoken in any of continuous and discrete form, comprising the steps of:
A. storing a plurality of word models;
B. identifying a succession of temporal segments of said utterances spoken in any of continuous and discrete form;
C. operating on ones of said segments and selectively identifying a subset of said plurality of word models meeting predetermined criteria, said subset defining a list of candidate words;
D. determining arbitrarily selected frame start times t during said utterances spoken in any of continuous and discrete form, said frame start times being independent of identification of an initial anchor; and
E. generating a signal representative of said list of candidate words for selected ones of said frame start times t determined in step D.
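Steps D and E of this method can be pictured as a window that is dropped onto the frame stream at caller-chosen start times, with no prior search for a speech onset or other anchor. The sketch below is a hedged illustration only: `candidate_lists`, `toy_prefilter`, the scalar frames, and the "level" word models are all hypothetical stand-ins for real acoustic data and a real prefilter.

```python
def candidate_lists(frames, w, start_times, prefilter):
    """Yield (t, candidate words) for each frame start time t.

    The start times are supplied by the caller at arbitrary spacing; nothing
    here looks for an anchor such as the beginning of a word.
    """
    for t in start_times:
        window = frames[t:t + w]
        if len(window) < w:
            continue  # too few frames remain for a full window
        yield t, prefilter(window)

# Hypothetical word models: each word is summarised by one expected level.
TOY_MODELS = {"low": 0.0, "mid": 1.0, "high": 5.0}

def toy_prefilter(window, tolerance=0.5):
    """Keep words whose level is within `tolerance` of the window mean."""
    mean = sum(window) / len(window)
    return sorted(wd for wd, level in TOY_MODELS.items()
                  if abs(mean - level) <= tolerance)

stream = [0.0] * 4 + [5.0] * 4 + [1.0] * 4   # 12 scalar frames
starts = [0, 3, 4, 8]                        # irregular, anchor-free spacing
signals = list(candidate_lists(stream, 4, starts, toy_prefilter))
print(signals)  # [(0, ['low']), (3, []), (4, ['high']), (8, ['mid'])]
```

The window starting at t = 3 straddles two regions and yields an empty list, which is an acceptable outcome here: a neighbouring start time covers the same speech with better alignment.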
Specification