Method and apparatus for a time-synchronous tree-based search strategy
Abstract
A method and apparatus for using a tree structure to constrain a time-synchronous, fast search for candidate words in an acoustic stream is described. A minimum stay of three frames in each graph node visited is imposed by allowing transitions only every third frame. This constraint enables the simplest possible Markov model for each phoneme while enforcing the desired minimum duration. The fast, time-synchronous search for likely words is done for an entire sentence/utterance. The list of hypotheses beginning at each time frame is stored for providing, on-demand, lists of contender/candidate words to the asynchronous, detailed match phase of decoding.
79 Citations
27 Claims
- 1. A speech recognition method for recognizing an entire utterance, for a system including an asynchronous detailed match procedure, said method comprising the step of performing a synchronous fast match process for said entire utterance prior to executing said detailed match procedure.
- 6. A speech recognition system for recognizing an entire utterance and having means for receiving and executing a detailed match procedure, said system comprising:
- means for performing a synchronous fast match on said entire utterance prior to asynchronously executing said detailed match procedure.
- View Dependent Claims (7, 8, 9)
- 10. A speech recognition method for recognizing an entire utterance segmented into a plurality of frames and based upon a speech language vocabulary, said method comprising:
- receiving an utterance; forming an acoustic signal of a plurality of phoneme constituents making up said utterance; combining three of said frames to form a frame triplet; initiating a fast match for said utterance by forming a phoneme probability matrix table giving probabilities of each phoneme versus an acoustic observation time, wherein said phoneme matrix table has each column corresponding to a single frame; multiplying together a group of three individual probabilities of the three frames that make up each said triplet to produce a joint probability of the triplet for each particular said phoneme and triplet; forming a triplet probability matrix representing a complete observation time of said utterance and having a row for each phoneme of said utterance and a column for each said triplet; and invoking a synchronous iterative process to perform the fast match for the entire utterance in steps of frame triplets.
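The triplet probability matrix recited in claim 10 can be illustrated with a minimal Python sketch. The function name `triplet_matrix` and the list-of-rows layout are hypothetical choices made for this illustration, and dropping trailing frames that do not fill a complete triplet is an assumption; the claim does not say how such frames are handled.

```python
from math import prod

def triplet_matrix(frame_probs):
    """Collapse a per-frame phoneme probability matrix into a per-triplet one.

    frame_probs: one row per phoneme, one column per frame.
    Returns one row per phoneme, one column per frame triplet, where each
    entry is the product of the three frame probabilities in that triplet.
    """
    if not frame_probs:
        return []
    # Assumption: frames left over after the last full triplet are ignored.
    n_triplets = len(frame_probs[0]) // 3
    return [
        [prod(row[3 * t:3 * t + 3]) for t in range(n_triplets)]
        for row in frame_probs
    ]

# Two phonemes observed over seven frames (illustrative values only).
frame_probs = [
    [0.5, 0.5, 0.5, 0.25, 0.5, 1.0, 0.5],
    [1.0, 0.5, 0.25, 0.5, 0.5, 0.5, 0.5],
]
triplets = triplet_matrix(frame_probs)
print(triplets)  # [[0.125, 0.125], [0.125, 0.125]]
```

Each row keeps its phoneme and each column now spans three frames, so a search that advances one column at a time spends at least three frames in every node it visits.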
- 11. A speech recognition method for recognizing an entire utterance segmented into a plurality of frames and based upon a speech language vocabulary, said method comprising:
- receiving an utterance; forming an acoustic signal of a plurality of phoneme constituents making up said utterance; combining three of said frames to form a frame triplet; initiating a fast match for said utterance by forming a phoneme probability matrix table giving probabilities of each phoneme versus an acoustic observation time, wherein said phoneme matrix table has each column corresponding to a single frame; multiplying together a group of three individual probabilities of the three frames that make up each said triplet to produce a joint probability of the triplet for each particular said phoneme and triplet; forming a triplet probability matrix representing a complete observation time of said utterance and having a row for each phoneme of said utterance and a column for each said triplet; invoking a synchronous iterative process to perform the fast match for the entire utterance in steps of frame triplets; initializing to the root node and to the end of the utterance; determining for each potentially active node `n` at a next time τ, a maximum of a node at time τ+3 which maximizes the product of a score of said node with the transition probability from said node into a potentially active node; computing the score s(τ,n) of the potentially active node given by a product of said maximum and an observation probability at a current time of the phoneme identified with state `n`; determining a maximum score of the node scores at the current time; comparing the score for each potentially active node to said maximum score; including in a next active list, only active nodes for which the difference between the log of said active node score and the log of the maximum score is less than a user-specified range constant; and adding to a matrix of contender words at an appropriate time, a new node placed in said next active list which corresponds to a beginning of a whole word, and a new node score of said new node.
- View Dependent Claims (12, 13, 14)
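The scoring and pruning steps of claim 11 can be written compactly as follows. The symbols a(m→n) for the transition probability, b_n(τ) for the observation probability of the phoneme identified with node n, and R for the user-specified range constant are notation introduced here for illustration and do not appear in the claim. The log difference is taken as maximum minus node score so that it is non-negative, which is an assumed reading of the claim wording.

```latex
\begin{aligned}
s(\tau, n) &= \Big[\max_{m}\; s(\tau+3, m)\, a(m \to n)\Big]\; b_n(\tau) \\
s_{\max}(\tau) &= \max_{n}\; s(\tau, n) \\
n \in \text{next active list} &\iff \log s_{\max}(\tau) - \log s(\tau, n) < R
\end{aligned}
```

Here m ranges over the nodes active at time τ+3, since the search steps backward through the utterance three frames at a time.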
- 15. A speech recognition method for recognizing an entire utterance segmented into a plurality of frames and based upon a speech language vocabulary, said method comprising:
- receiving an utterance; forming an acoustic signal of a plurality of phoneme constituents making up said utterance; combining three of said frames to form a frame triplet; initiating a fast match for said utterance by forming a phoneme probability matrix table giving probabilities of each phoneme versus an acoustic observation time, wherein said phoneme matrix table has each column corresponding to a single frame; multiplying together a group of three individual probabilities of the three frames that make up each said triplet to produce a joint probability of the triplet for each particular said phoneme and triplet; forming a triplet probability matrix representing a complete observation time of said utterance and having a row for each phoneme of said utterance and a column for each said triplet; invoking a synchronous iterative process to perform the fast match for the entire utterance in steps of frame triplets; forming a `next potentials list` from the `current active list` if an utterance beginning has not been reached; computing and storing a score for each node in the `potentials list`; finding and storing a current highest node score; choosing and using an inclusion range parameter to form the `next active list`; entering and storing active list entries for each triplet in a `matrix of contender words`; decrementing to a next backward frame triplet; modifying the `current active list` to correspond with the next active list; and stopping the fast match process if the utterance beginning has been reached.
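The iterative loop recited in claim 15 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `fast_match`, the dictionary-based data structures, the node names, and the toy one-word vocabulary are all hypothetical. The sketch also assumes that the tree is traversed from the ends of words toward their beginnings, that every transition target maps to a phoneme, and that the search starts from a single root node with score 1.0.

```python
import math

def fast_match(triplet_probs, transitions, phoneme_of, word_of, log_range, root="root"):
    """Backward, triplet-synchronous search; returns {triplet index: [(word, score)]}.

    triplet_probs: {phoneme: [joint probability per triplet]}
    transitions:   {(from_node, to_node): transition probability}
    phoneme_of:    {node: phoneme identified with that node}
    word_of:       {node: word}, for nodes marking the beginning of a whole word
    log_range:     inclusion range for pruning, in natural-log units
    """
    n_triplets = len(next(iter(triplet_probs.values())))
    active = {root: 1.0}      # current active list, placed at the utterance end
    contenders = {}           # stand-in for the matrix of contender words
    t = n_triplets - 1
    while t >= 0 and active:  # stop once the utterance beginning has been reached
        # Next potentials list: best predecessor score times transition probability.
        potentials = {}
        for (m, n), a in transitions.items():
            if m in active:
                potentials[n] = max(potentials.get(n, 0.0), active[m] * a)
        # Score each potential node with its phoneme's triplet probability.
        scores = {n: v * triplet_probs[phoneme_of[n]][t] for n, v in potentials.items()}
        scores = {n: s for n, s in scores.items() if s > 0.0}
        if not scores:
            break
        top = max(scores.values())  # current highest node score
        # Next active list: nodes whose log score lies within log_range of the top.
        next_active = {n: s for n, s in scores.items()
                       if math.log(top) - math.log(s) < log_range}
        # Store word-beginning nodes for this triplet.
        found = [(word_of[n], s) for n, s in next_active.items() if n in word_of]
        if found:
            contenders[t] = found
        active = next_active  # the next active list becomes the current one
        t -= 1                # decrement to the next backward triplet
    return contenders

# Toy vocabulary with the single word "at" (phonemes A then T), two triplets long.
triplet_probs = {"A": [0.5, 0.25], "T": [0.25, 0.5]}
transitions = {("root", "n_T"): 1.0, ("n_T", "n_T"): 0.5,
               ("n_T", "n_A"): 0.5, ("n_A", "n_A"): 0.5}
phoneme_of = {"n_T": "T", "n_A": "A"}
word_of = {"n_A": "at"}
result = fast_match(triplet_probs, transitions, phoneme_of, word_of, log_range=5.0)
print(result)  # {0: [('at', 0.125)]}
```

In this toy run the word "at" is recorded as a contender beginning at triplet 0. A detailed match stage could then look up `result` by start time to obtain candidate words on demand.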
- 16. A speech recognition method for recognizing an entire utterance, for a system including a fast match process and a detailed match procedure, wherein said fast match process proceeds backward from an end of said entire utterance towards a beginning of said entire utterance.
- 17. A speech recognition method comprising:
- recognizing an utterance by performing an asynchronous detailed match and a synchronous fast match, wherein said fast match is performed in an iterative manner with an iteration performed for each of a plurality of frames.
- View Dependent Claims (18, 19)
- 20. A speech recognition system for recognizing an utterance, said system comprising a fast match process which proceeds backward from an end of said utterance towards a beginning of said utterance.
- 24. A speech recognition apparatus comprising:
- means for synchronously performing a fast match on an entire utterance; and means for executing a detailed match procedure asynchronously on said entire utterance so as to recognize said entire utterance.
- View Dependent Claims (25)
- 26. A speech recognition method comprising:
- multiplying phoneme probabilities together in groups of three frames, each group forming a triplet, and employing each triplet in a fast match process using a non-replicated one state model.
- View Dependent Claims (27)
Specification