Speech recognition by acoustic/phonetic system and technique

US 4,852,180 A
Filed: 04/03/1987
Issued: 07/25/1989
Est. Priority Date: 04/03/1987
Status: Expired due to Term

Abstract

A speech recognition system and technique of the acoustic/phonetic type is made speaker-independent and capable of continuous speech recognition during fluent discourse by a combination of techniques which include, inter alia, using a so-called continuously-variable-duration hidden Markov model in identifying word segments, i.e., phonetic units, and developing proposed phonetic sequences by a durationally-responsive recursion before any lexical access is attempted. Lexical access is facilitated by the phonetic transcriptions provided by the durationally-responsive recursion; and the resulting array of word candidates facilitates the subsequent alignment of the word candidates with the acoustic feature signals. A separate step is used for aligning the members of the candidate word arrays with the acoustic feature signals representative of the corresponding portion of the utterance. Any residual word selection ambiguities are then more readily resolved, regardless of the ultimate sentence selection technique employed.

14 Claims

1. A method for the recognition of speech, of the type including the steps of
- storing signals representing a model of the language to be recognized, said model being of the state-transitional type,
  - each state being uniquely identified with a phonetic unit,
  - each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability,
  - each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state,
  - each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
- storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
- sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
- accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance,

  said method being particularly characterized in that the accessing step includes
  - assigning a phonetic unit signal and a phonetic duration signal from the stored model to one or more of said time frame portions of speech in response to one or more of said respective sets of acoustic feature signals, and
  - maximizing independently of the stored lexical candidates the likelihoods of each phonetic unit and each phonetic duration jointly with the likelihood of observing said one or more of said respective sets of acoustic feature signals,
  - said assigning and maximizing being performed recursively for all assignments and transitions over all time frames up to and including the present time frame; and
  - then retracing the actual maximization results by stepping through the phonetic determinations in a strict order to produce a proposed phonetic sequence for accessing the lexical candidates, and
  - subsequently accessing the stored lexical candidates with the proposed phonetic sequence to obtain signals representing a set of proposed lexical candidates, from which signals a final selection signal can be obtained.
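
The recursion and retracing recited in claim 1 can be illustrated with a minimal sketch of a Viterbi-style decoder for a hidden Markov model with explicit state durations. This is an illustration under stated assumptions, not the patent's implementation: the function name `decode_phonetic_sequence`, the table layouts, and the toy numbers are all hypothetical, and the observational and durational densities are supplied as precomputed log-likelihood tables. For each frame and each state it maximizes the joint score of state, duration, and observations without consulting any lexicon, then retraces the stored decisions to obtain a proposed sequence of (phonetic unit, duration) segments.

```python
import math

NEG_INF = float("-inf")

def decode_phonetic_sequence(obs_loglik, log_trans, log_dur, log_init):
    """Explicit-duration Viterbi decode (illustrative sketch).

    obs_loglik[t][s]: log-likelihood of frame t under state s
    log_trans[r][s]:  log-probability that state s follows state r
    log_dur[s][d-1]:  log-probability that state s lasts d frames
    log_init[s]:      log-probability of starting in state s
    Returns a list of (state, duration) segments covering all frames.
    """
    T, S, D = len(obs_loglik), len(log_init), len(log_dur[0])
    # best[t][s]: best score of any segmentation of frames 0..t-1
    # whose final segment is state s and ends at frame t-1.
    best = [[NEG_INF] * S for _ in range(T + 1)]
    back = [[None] * S for _ in range(T + 1)]
    for t in range(1, T + 1):
        for s in range(S):
            seg = 0.0
            for d in range(1, min(D, t) + 1):
                seg += obs_loglik[t - d][s]  # frames t-d .. t-1
                start = t - d
                if start == 0:
                    entry, prev = log_init[s], None
                else:
                    prev = max(range(S),
                               key=lambda r: best[start][r] + log_trans[r][s])
                    entry = best[start][prev] + log_trans[prev][s]
                score = entry + log_dur[s][d - 1] + seg
                if score > best[t][s]:
                    best[t][s] = score
                    back[t][s] = (d, prev)
    # Retrace the stored decisions from the last frame to the first.
    state = max(range(S), key=lambda s: best[T][s])
    segments, t = [], T
    while t > 0:
        d, prev = back[t][state]
        segments.append((state, d))
        t, state = t - d, prev
    segments.reverse()
    return segments

# Toy example (hypothetical numbers): two states, six frames.
hi, lo = math.log(0.9), math.log(0.1)
obs_loglik = [[hi, lo]] * 3 + [[lo, hi]] * 3
log_trans = [[NEG_INF, 0.0], [0.0, NEG_INF]]  # no self-transitions
log_dur = [[math.log(0.25)] * 4 for _ in range(2)]  # durations 1..4
log_init = [math.log(0.5)] * 2
segments = decode_phonetic_sequence(obs_loglik, log_trans, log_dur, log_init)
print(segments)
```

In this sketch a state's persistence is governed entirely by its duration table, so self-transitions are given zero probability in `log_trans`; that is a modeling choice made for the example.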

2. A method for the recognition of speech, of the type claimed in claim 1, said method being further characterized in that the model storing step includes storing an ergodic model in which any state can occur after any other state, the model including examples of all such sequences and the corresponding transition probability signals.

3. A method for the recognition of speech, of the type claimed in claim 2; said method being further characterized in that the lexical candidate storing step comprises
- storing words represented by a phonetic orthography which is characterized by partial phonetic information, such that words may be retrieved on the basis of the phonetic units which they contain, and
- storing information linking a plurality of words containing like sequences of phonetic units, whereby for each sequence of phonetic units as many words as contain them are directly accessible.
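
The lexical storage described in claim 3, where words are retrievable by the phonetic units they contain and words sharing a sequence of units are linked, resembles an inverted index keyed on short phonetic sequences. The sketch below is one possible reading, not the patent's data structure: the function names, the choice of two-unit sequences, and the phone symbols in the toy lexicon are all hypothetical.

```python
from collections import defaultdict

def build_index(lexicon, n=2):
    """Map every run of n consecutive phonetic units to the words containing it."""
    index = defaultdict(set)
    for word, phones in lexicon.items():
        for i in range(len(phones) - n + 1):
            index[tuple(phones[i:i + n])].add(word)
    return index

def lookup(index, proposed, n=2):
    """Return all words sharing at least one n-unit run with the proposed sequence."""
    candidates = set()
    for i in range(len(proposed) - n + 1):
        candidates |= index.get(tuple(proposed[i:i + n]), set())
    return candidates

# Hypothetical toy lexicon with made-up phone symbols.
lexicon = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "bat": ["b", "ae", "t"],
}
index = build_index(lexicon)
print(sorted(lookup(index, ["k", "ae", "t"])))
```

Because the lookup returns every word that shares any indexed run with the proposed sequence, it tends to produce several candidates, which is the situation the alignment step of the later claims addresses.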

4. A method for the recognition of speech, of the type including the steps of
- storing signals representing a model of the language to be recognized, said model being of the state-transitional type,
  - each state being uniquely identified with a phonetic unit,
  - each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability,
  - each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state,
  - each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
- storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
- sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
- accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance,

  said method being particularly characterized in that the accessing step includes
  - first accessing the stored model to obtain signals which represent proposed sequences of phonetic units independently of the stored lexical candidates, and
  - second accessing the stored lexical candidates in response to portions of the proposed sequences of phonetic units to obtain proposed lexical candidates each containing each said portion, including
    - whenever multiple proposed lexical candidates contain the same one said portion, aligning said multiple proposed lexical candidates with the one or more respective sets of acoustic feature signals from which said proposed sequences of phonetic units were obtained to evaluate said multiple proposed lexical candidates.

5. A method for the recognition of speech, of the type claimed in claim 4, said method being further characterized in that the model storing step includes storing an ergodic model in which any state can occur after any other state.

6. A method for the recognition of speech, of the type claimed in claim 5, said method being further characterized in that the aligning step evaluates said multiple proposed lexical candidates to include only those suitable for subsequent unambiguous ranking by processing by techniques that relate to sentence structure and meaning.

7. A method for the recognition of speech, of the type claimed in claim 5, said method being further characterized in that the aligning step evaluates said multiple proposed lexical candidates to select only the best one, whereby a selection signal representing the utterance as a word is produced.
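
The aligning step of claims 4 through 7 can be pictured as scoring each competing word against the same stretch of acoustic frames and ranking the results. The sketch below is a simplified stand-in rather than the patent's procedure: it uses a plain left-to-right dynamic-programming alignment over hypothetical per-frame phone log-likelihoods, omits the durational densities, and all names and numbers are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def alignment_score(phones, frame_loglik):
    """Best log score of aligning phones, in order, to all frames.

    frame_loglik[t][p] is the log-likelihood of frame t under phone p.
    Every phone must cover at least one frame.
    """
    T, N = len(frame_loglik), len(phones)
    if N == 0 or T < N:
        return NEG_INF
    prev = [NEG_INF] * N
    prev[0] = frame_loglik[0][phones[0]]
    for t in range(1, T):
        cur = [NEG_INF] * N
        for j in range(N):
            stay = prev[j]                            # remain in phone j
            advance = prev[j - 1] if j > 0 else NEG_INF  # enter phone j
            best = max(stay, advance)
            if best > NEG_INF:
                cur[j] = best + frame_loglik[t][phones[j]]
        prev = cur
    return prev[N - 1]

def rank_candidates(candidates, frame_loglik):
    """Sort (word, score) pairs from best to worst alignment score."""
    scored = [(w, alignment_score(p, frame_loglik)) for w, p in candidates.items()]
    return sorted(scored, key=lambda ws: ws[1], reverse=True)

def frame(best_phone, phones=("k", "ae", "t", "b")):
    """Hypothetical frame scores: one phone fits well, the rest poorly."""
    return {p: math.log(0.7 if p == best_phone else 0.1) for p in phones}

frames = [frame("k"), frame("ae"), frame("ae"), frame("t")]
candidates = {"cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"]}
ranked = rank_candidates(candidates, frames)
print(ranked[0][0])
```

Taking only `ranked[0]` corresponds loosely to selecting the single best candidate as in claim 7, while keeping a thresholded prefix of `ranked` would correspond to passing a reduced set on to later sentence-level processing as in claim 6.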

8. Apparatus for the recognition of speech, of the type comprising
- means for storing signals representing a model of the language to be recognized, said model being of the state-transitional type,
  - each state being uniquely identified with a phonetic unit,
  - each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability,
  - each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state,
  - each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
- means for storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
- means for sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
- means for accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including means for selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance,

  said apparatus being particularly characterized in that the accessing means includes
  - means for assigning a phonetic unit signal and a phonetic duration signal from the stored model to one or more of said time frame portions of speech in response to one or more of said respective sets of acoustic feature signals, and
  - means for maximizing independently of the stored lexical candidates the likelihoods of each phonetic unit and each phonetic duration jointly with the likelihood of observing said one or more of said respective sets of acoustic feature signals,
  - said assigning means and maximizing means being adapted to operate recursively for all assignments and transitions over all time frames up to and including the present time frame; and

  said accessing means further includes
  - means for retracing the actual maximization results by stepping through the phonetic determinations in a strict order to produce a proposed phonetic sequence for accessing the lexical candidates, and
  - means for subsequently accessing the stored lexical candidates with the proposed phonetic sequence to obtain signals representing a set of proposed lexical candidates, from which signals a final selection signal can be obtained.

9. Apparatus for the recognition of speech, of the type claimed in claim 8, said apparatus being further characterized in that the means for storing a model includes means for storing an ergodic model in which any state can occur after any other state, the model including examples of all such sequences and the corresponding transition probability signals.

10. Apparatus for the recognition of speech, of the type claimed in claim 9; said apparatus being further characterized in that the lexical candidate storing means comprises
- means for storing words represented by a phonetic orthography which is characterized by partial phonetic information, such that words may be retrieved on the basis of the phonetic units which they contain, and
- means for storing information linking a plurality of words containing like sequences of phonetic units, whereby for each sequence of phonetic units as many words as contain them are directly accessible.

11. Apparatus for the recognition of speech, of the type comprising
- means for storing signals representing a model of the language to be recognized, said model being of the state-transitional type,
  - each state being uniquely identified with a phonetic unit,
  - each state having associated with it a portion of a transition matrix which describes which states can follow it and with what probability,
  - each state having associated with it an observational density function assigning to each set of speech feature signals that may be observed in fluent speech a likelihood of being observed in association with that state,
  - each state having associated with it a durational density function assigning to each duration it may have a likelihood of occurrence in fluent speech;
- means for storing signals representing lexical candidates, said lexical candidates being assemblages of phonetic units of the language in association with partial phonetic information of the type found in dictionaries;
- means for sequentially converting successive time frame portions of an utterance into signals representing respective sets of acoustic feature signals representative of the portions; and
- means for accessing the stored model and stored lexical candidates to obtain signals which represent sequences of the phonetic units, including means for selecting the optimum ones of such sequences to produce a selection signal representing recognition of the utterance,

  said apparatus being particularly characterized in that the accessing means includes
  - means for first accessing the stored model to obtain signals which represent proposed sequences of phonetic units independently of the stored lexical candidates, and
  - means for next accessing the stored lexical candidates in response to portions of the proposed sequences of phonetic units to obtain proposed lexical candidates each containing each said portion, and

  the accessing means further includes
  - means for aligning said proposed lexical candidates, whenever more than one exists, with the one or more respective sets of acoustic feature signals from which said proposed sequences of phonetic units were obtained to reduce the number of said proposed lexical candidates.

12. Apparatus for the recognition of speech, of the type claimed in claim 11, said apparatus being further characterized in that the model storing means includes means for storing an ergodic model in which any state can occur after any other state.

13. Apparatus for the recognition of speech, of the type claimed in claim 12, said apparatus being further characterized in that the aligning means evaluates said multiple proposed lexical candidates to include only those suitable for subsequent unambiguous ranking by techniques that relate to sentence structure and meaning.

14. Apparatus for the recognition of speech, of the type claimed in claim 12, said apparatus being further characterized in that the aligning means evaluates said multiple proposed lexical candidates to select only the best one, whereby the aligning means is capable of producing a selection signal representing the utterance as a word.

Current Assignee: American Telephone & Telegraph Company (AT&T, Inc.), Bell Telephone Laboratories, Inc. (Nokia Corporation)
Original Assignee: American Telephone & Telegraph Company (AT&T, Inc.)
Inventors: Levinson, Stephen E.
Primary Examiner(s): Not defined
Assistant Examiner(s): Not defined
Application Number: US07/034,467
Time in Patent Office: 844 Days
Field of Search: 381/41-43, 381/44-45, 364/513, 364/513.5
US Class Current: 704/256.4
CPC Class Codes: G10L 15/14 using statistical models, e...
