Speech recognition by acoustic/phonetic system and technique
First Claim
1. A method for the recognition of speech, of the type including the steps of storing signals representing a model of the language to be recognized, said model being of the state-transitional type, each state being uniquely identified with a phonetic unit, each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability, each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state, each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance, said method being particularly characterized in that the accessing step includes assigning a phonetic unit signal and a phonetic duration signal from the stored model to one or more of said time frame portions of speech in response to one or more of said respective sets of acoustic feature signals, and maximizing independently of the stored lexical candidates the likelihoods of each phonetic unit and each phonetic duration jointly with the likelihood of observing said one or more of said respective sets of acoustic feature signals, said assigning and maximizing being performed recursively for all assignments and transitions over all time frames up to and including the present time frame; and
then retracing the actual maximization results by stepping through the phonetic determinations in a strict order to produce a proposed phonetic sequence for accessing the lexical candidates, and subsequently accessing the stored lexical candidates with the proposed phonetic sequence to obtain signals representing a set of proposed lexical candidates, from which signals a final selection signal can be obtained.
Abstract
A speech recognition system and technique of the acoustic/phonetic type is made speaker-independent and capable of continuous speech recognition during fluent discourse by a combination of techniques which include, inter alia, using a so-called continuously-variable-duration hidden Markov model in identifying word segments, i.e., phonetic units, and developing proposed phonetic sequences by a durationally-responsive recursion before any lexical access is attempted. Lexical access is facilitated by the phonetic transcriptions provided by the durationally-responsive recursion; and the resulting array of word candidates facilitates the subsequent alignment of the word candidates with the acoustic feature signals. A separate step is used for aligning the members of the candidate word arrays with the acoustic feature signals representative of the corresponding portion of the utterance. Any residual word selection ambiguities are then more readily resolved, regardless of the ultimate sentence selection technique employed.
14 Claims
1. A method for the recognition of speech, of the type including the steps of
storing signals representing a model of the language to be recognized, said model being of the state-transitional type, each state being uniquely identified with a phonetic unit, each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability, each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state, each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance, said method being particularly characterized in that the accessing step includes assigning a phonetic unit signal and a phonetic duration signal from the stored model to one or more of said time frame portions of speech in response to one or more of said respective sets of acoustic feature signals, and maximizing independently of the stored lexical candidates the likelihoods of each phonetic unit and each phonetic duration jointly with the likelihood of observing said one or more of said respective sets of acoustic feature signals, said assigning and maximizing being performed recursively for all assignments and transitions over all time frames up to and including the present time frame; and
then retracing the actual maximization results by stepping through the phonetic determinations in a strict order to produce a proposed phonetic sequence for accessing the lexical candidates, and subsequently accessing the stored lexical candidates with the proposed phonetic sequence to obtain signals representing a set of proposed lexical candidates, from which signals a final selection signal can be obtained.
Dependent claims: 2, 3
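The recursion and retrace recited in claim 1 resemble an explicit-duration Viterbi search. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the function name `duration_viterbi`, the log-domain tables, the exclusion of self-transitions (durations are modeled explicitly instead), and the toy two-state data are all hypothetical choices made for illustration.

```python
import math

NEG_INF = float("-inf")


def duration_viterbi(obs_loglik, log_init, log_trans, log_dur):
    """Jointly pick phonetic units and durations, then retrace the best path.

    obs_loglik[t][s]: log-likelihood of frame t's features under state s
    log_init[s]:      log-probability that the utterance starts in state s
    log_trans[r][s]:  log-probability of moving from state r to state s
    log_dur[s][d-1]:  log-probability that state s lasts exactly d frames
    Returns a list of (state, duration) segments in time order.
    """
    T, S = len(obs_loglik), len(log_init)
    # best[t][s]: best score for frames 0..t-1 with a segment of state s ending at t
    best = [[NEG_INF] * S for _ in range(T + 1)]
    back = [[None] * S for _ in range(T + 1)]
    for t in range(1, T + 1):
        for s in range(S):
            for d in range(1, min(len(log_dur[s]), t) + 1):
                start = t - d
                # duration likelihood plus observation likelihood of the segment
                seg = log_dur[s][d - 1] + sum(
                    obs_loglik[u][s] for u in range(start, t))
                if start == 0:
                    prev, entry = None, log_init[s]
                else:
                    prev, entry = None, NEG_INF
                    for r in range(S):  # assumption: no self-transitions
                        score = best[start][r] + log_trans[r][s]
                        if r != s and score > entry:
                            prev, entry = r, score
                if entry + seg > best[t][s]:
                    best[t][s] = entry + seg
                    back[t][s] = (d, prev)
    state = max(range(S), key=lambda k: best[T][k])
    if best[T][state] == NEG_INF:
        raise ValueError("no admissible segmentation")
    # retrace the stored maximization results from the last frame backwards
    segments, t = [], T
    while t > 0:
        d, prev = back[t][state]
        segments.append((state, d))
        t, state = t - d, prev
    segments.reverse()
    return segments


# Toy data (hypothetical): three frames favor state 0, then two favor state 1.
hi, lo = math.log(0.9), math.log(0.1)
obs_loglik = [[hi, lo]] * 3 + [[lo, hi]] * 2
log_init = [math.log(0.5)] * 2
log_trans = [[NEG_INF, 0.0], [0.0, NEG_INF]]
log_dur = [[math.log(1 / 3)] * 3] * 2  # uniform over durations 1..3
segments = duration_viterbi(obs_loglik, log_init, log_trans, log_dur)
print(segments)
```

The search uses no lexicon at all, which mirrors the claim language "maximizing independently of the stored lexical candidates"; the resulting segment list would only afterwards be used to look up word candidates.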
4. A method for the recognition of speech, of the type including the steps of
storing signals representing a model of the language to be recognized, said model being of the state-transitional type, each state being uniquely identified with a phonetic unit, each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability, each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state, each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance, said method being particularly characterized in that the accessing step includes first accessing the stored model to obtain signals which represent proposed sequences of phonetic units independently of the stored lexical candidates, and second accessing the stored lexical candidates in response to portions of the proposed sequences of phonetic units to obtain proposed lexical candidates each containing each said portion, including whenever multiple proposed lexical candidates contain the same one said portion, aligning said multiple proposed lexical candidates with the one or more respective sets of acoustic feature signals from which said proposed sequences of phonetic units were obtained to evaluate said multiple proposed lexical candidates.
Dependent claims: 5, 6, 7
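The two-stage access recited in claim 4 (look up words containing a portion of the proposed phonetic sequence, then align competing words with the acoustic features) can be sketched as follows. This is an illustrative Python sketch only: the function names, the toy lexicon, the per-frame log-likelihood dictionaries, and the simple left-to-right alignment without a duration model are assumptions, not details taken from the patent.

```python
import math

NEG_INF = float("-inf")


def candidates_containing(lexicon, portion):
    """Return the words whose phone list contains `portion` contiguously."""
    n, target = len(portion), tuple(portion)
    return [word for word, phones in lexicon.items()
            if any(tuple(phones[i:i + n]) == target
                   for i in range(len(phones) - n + 1))]


def alignment_score(phones, frame_loglik):
    """Best log score of aligning `phones` left to right over all frames.

    frame_loglik[t] maps a phone to its log-likelihood at frame t.
    Each phone must cover at least one frame (simplifying assumption).
    """
    T, P = len(frame_loglik), len(phones)
    if P == 0 or P > T:
        return NEG_INF
    prev = [NEG_INF] * P
    prev[0] = frame_loglik[0].get(phones[0], NEG_INF)
    for t in range(1, T):
        cur = [NEG_INF] * P
        for p in range(P):
            stay = prev[p]                           # phone p continues
            advance = prev[p - 1] if p > 0 else NEG_INF  # phone p begins
            cur[p] = max(stay, advance) + frame_loglik[t].get(phones[p], NEG_INF)
        prev = cur
    return prev[P - 1]


def rank_candidates(lexicon, portion, frame_loglik):
    """Score every word containing `portion` and sort best first."""
    scored = [(alignment_score(lexicon[w], frame_loglik), w)
              for w in candidates_containing(lexicon, portion)]
    return sorted(scored, reverse=True)


def frame(favored, phones=("k", "ae", "t", "p")):
    """Hypothetical per-frame scores: one favored phone, the rest unlikely."""
    return {p: math.log(0.7 if p == favored else 0.1) for p in phones}


# Toy data (hypothetical): frames that sound like "k ae ae t".
frames = [frame("k"), frame("ae"), frame("ae"), frame("t")]
lexicon = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"],
           "dog": ["d", "ao", "g"]}
ranked = rank_candidates(lexicon, ["k", "ae"], frames)
print(ranked)
```

Both "cat" and "cap" contain the portion `k ae`, so the lookup alone cannot separate them; the alignment against the frames is what evaluates the competing candidates.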
8. Apparatus for the recognition of speech, of the type comprising
means for storing signals representing a model of the language to be recognized, said model being of the state-transitional type, each state being uniquely identified with a phonetic unit, each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability, each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state, each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
means for storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
means for sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
means for accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including means for selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance, said apparatus being particularly characterized in that the accessing means includes means for assigning a phonetic unit signal and a phonetic duration signal from the stored model to one or more of said time frame portions of speech in response to one or more of said respective sets of acoustic feature signals, and means for maximizing independently of the stored lexical candidates the likelihoods of each phonetic unit and each phonetic duration jointly with the likelihood of observing said one or more of said respective sets of acoustic feature signals, said assigning means and maximizing means being adapted to operate recursively for all assignments and transitions over all time frames up to and including the present time frame; and
said accessing means further includes means for retracing the actual maximization results by stepping through the phonetic determinations in a strict order to produce a proposed phonetic sequence for accessing the lexical candidates, and means for subsequently accessing the stored lexical candidates with the proposed phonetic sequence to obtain signals representing a set of proposed lexical candidates, from which signals a final selection signal can be obtained.
Dependent claims: 9, 10
11. Apparatus for the recognition of speech, of the type comprising
means for storing signals representing a model of the language to be recognized, said model being of the state-transitional type, each state being uniquely identified with a phonetic unit, each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability, each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state, each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
means for storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
means for sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
means for accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including means for selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance, said apparatus being particularly characterized in that the accessing means includes means for first accessing the stored model to obtain signals which represent proposed sequences of phonetic units independently of the stored lexical candidates, and means for next accessing the stored lexical candidates in response to portions of the proposed sequences of phonetic units to obtain proposed lexical candidates each containing each said portion, and
the accessing means further includes means for aligning said proposed lexical candidates, whenever more than one exists, with the one or more respective sets of acoustic feature signals from which said proposed sequences of phonetic units were obtained to reduce the number of said proposed lexical candidates.
Dependent claims: 12, 13, 14
Specification