Apparatus and method of grouping utterances of a phoneme into context-dependent categories based on sound-similarity for automatic speech recognition

US 5,195,167 A
Filed: 04/17/1992
Issued: 03/16/1993
Est. Priority Date: 01/23/1990
Status: Expired due to Fees

Abstract

Symbol feature values and contextual feature values of each event in a training set of events are measured. At least two pairs of complementary subsets of observed events are selected. In each pair of complementary subsets of observed events, one subset has contextual features with values in a set C_n, and the other subset has contextual features with values in a set C̄_n, where the sets C_n and C̄_n are complementary sets of contextual feature values. For each subset of observed events, the similarity of the symbol feature values of the observed events in the subset is calculated. For each pair of complementary subsets of observed events, a "goodness of fit" is the sum of the symbol feature value similarities of the two subsets. The sets of contextual feature values associated with the pair of subsets having the best "goodness of fit" are identified and form context-dependent bases for grouping the observed events into two output sets.
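The selection procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the similarity measure here (negative sum of squared deviations from the subset mean), the scalar symbol values, and the names `similarity`, `best_split`, `events`, and `candidates` are all assumptions chosen for illustration only.

```python
from statistics import fmean


def similarity(values):
    """Assumed similarity measure: negative within-subset scatter.

    0.0 means all values are identical; more negative means less similar.
    """
    if not values:
        return 0.0
    mean = fmean(values)
    return -sum((v - mean) ** 2 for v in values)


def best_split(events, candidate_sets):
    """Pick the context set whose complementary split fits best.

    events: list of (context, symbol_value) pairs (hypothetical data layout).
    candidate_sets: iterable of context sets; each set and its complement
    define one pair of complementary subsets.
    Returns (goodness_of_fit, chosen_set), or None if no split is usable.
    """
    best = None
    for c_n in candidate_sets:
        inside = [v for ctx, v in events if ctx in c_n]
        outside = [v for ctx, v in events if ctx not in c_n]
        if not inside or not outside:
            continue  # a split that leaves one side empty explains nothing
        # "Goodness of fit" is the sum of the two subset similarities.
        goodness = similarity(inside) + similarity(outside)
        if best is None or goodness > best[0]:
            best = (goodness, frozenset(c_n))
    return best


# Toy data: the symbol value is low after "p"/"b" and high after "s"/"z".
events = [("p", 1.0), ("p", 1.2), ("b", 1.1), ("s", 3.0), ("z", 3.2), ("s", 2.9)]
candidates = [{"p"}, {"p", "b"}, {"p", "s"}]
goodness, chosen = best_split(events, candidates)
```

In this toy data the split on `{"p", "b"}` separates the low values from the high ones, so it yields the least within-subset scatter and is selected; the two output sets are the events whose context is in `chosen` and those whose context is not.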

200 Citations

10 Claims

1. A method of automatically grouping utterances of a phoneme into similar categories and correlating the groups of utterances with different contexts, said method comprising the steps of:
- providing a training script comprising a series of phonemes, said training script comprising a plurality of occurrences of a selected phoneme, each occurrence of the selected phoneme having a context of one or more other phonemes preceding or following the selected phoneme in the training script;
  
  measuring the value of an acoustic feature of an utterance of the phonemes in the training script during each of a series of time intervals to produce a series of acoustic feature vector signals representing the acoustic feature values of the utterance, each acoustic feature vector signal corresponding to an occurrence of a phoneme in the training script;
  
  selecting a pair of first and second subsets of the set of occurrences of the selected phoneme in the training script, each occurrence of the selected phoneme in the first subset having a first context, each occurrence of the selected phoneme in the second subset having a second context different from the first context;
  
  selecting a pair of third and fourth subsets of the set of occurrences of the selected phoneme in the training script, each occurrence of the selected phoneme in the third subset having a third context different from the first and second contexts, each occurrence of the selected phoneme in the fourth subset having a fourth context different from the first, second, and third contexts;
  
  for each pair of subsets, determining the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the selected phoneme in one subset of the pair, and determining the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the selected phoneme in the other subset of the pair, the combined similarities for both subsets in the pair being a "goodness of fit" which estimates how well the contexts of the selected phoneme explain variations in the acoustic feature values of the utterances of the selected phoneme;
  
  identifying first and second best contexts associated with the pair of subsets having the best "goodness of fit"; and
  
  grouping the utterances of the selected phoneme into a first output set of utterances of the selected phoneme having the first best context, and grouping the utterances of the selected phoneme into a second output set of utterances of the selected phoneme having the second best context.
- - 2. A method as claimed in claim 1, characterized in that the acoustic feature has at least first and second independent components, each component having a value.
  - 3. A method as claimed in claim 2, further comprising the steps of:
    - storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value and having a unique label identifier;
      
      comparing the acoustic feature value of each acoustic feature vector signal to the parameter values of the prototype vector signals to determine the associated prototype vector signal which is best matched to each acoustic feature vector signal; and
      
      outputting a series of label signals representing the label identifiers of the prototype vector signals associated with the acoustic feature vector signals as a coded representation of the acoustic feature values of the utterance of the training script.
  - 4. A method as claimed in claim 3, further comprising the steps of:
    - combining the acoustic feature values of all utterances in the first output set of utterances of the selected phoneme to produce a model of the phoneme when an utterance of the phoneme has the first best context, and
      
      combining the acoustic feature values of all utterances in the second output set of utterances of the selected phoneme to produce a model of the phoneme when an utterance of the phoneme has the second best context.
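As a rough illustration of the labeling step in claim 3 and the model-building step in claim 4, the sketch below assigns each feature vector the label of its nearest prototype and then pools the labels of one output set into a relative-frequency table. The squared Euclidean distance, the frequency-table "model", and the names `label_vectors`, `label_model`, and `prototypes` are assumptions made for this example; the claims do not specify them.

```python
from collections import Counter


def label_vectors(feature_vectors, prototypes):
    """Return, for each feature vector, the label of the closest prototype.

    prototypes: dict mapping a unique label identifier to a parameter vector.
    "Best matched" is taken here to mean smallest squared Euclidean
    distance, which is an assumption of this sketch.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(prototypes, key=lambda label: sqdist(vec, prototypes[label]))
        for vec in feature_vectors
    ]


def label_model(label_sequences):
    """Pool the label sequences of one output set into label frequencies.

    This stands in for "combining the acoustic feature values ... to
    produce a model"; a real system would likely fit a richer model.
    """
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Two hypothetical prototypes and three two-component feature vectors.
prototypes = {"A": (0.0, 0.0), "B": (1.0, 1.0)}
vectors = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3)]
labels = label_vectors(vectors, prototypes)
model = label_model([labels])
```

Here `labels` is the coded representation of the utterance, and `model` would be built once per output set, giving one model for each of the two best contexts.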

5. An apparatus for automatically grouping utterances of a phoneme into similar categories and correlating the groups of utterances with different contexts, said apparatus comprising:
- means for storing a training script comprising a series of phonemes, said training script comprising a plurality of occurrences of a selected phoneme, each occurrence of the selected phoneme having a context of one or more other phonemes preceding or following the selected phoneme in the training script;
  
  an acoustic processor for measuring the value of an acoustic feature of an utterance of the phonemes in the training script during each of a series of time intervals to produce a series of acoustic feature vector signals representing the acoustic feature values of the utterance, each acoustic feature vector signal corresponding to an occurrence of a phoneme in the training script;
  
  means for selecting a pair of first and second subsets of the set of occurrences of the selected phoneme in the training script, each occurrence of the selected phoneme in the first subset having a first context, each occurrence of the selected phoneme in the second subset having a second context different from the first context;
  
  means for selecting a pair of third and fourth subsets of the set of occurrences of the selected phoneme in the training script, each occurrence of the selected phoneme in the third subset having a third context different from the first and second contexts, each occurrence of the selected phoneme in the fourth subset having a fourth context different from the first, second, and third contexts;
  
  means for determining, for each pair of subsets, the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the selected phoneme in one subset of the pair, and determining the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the selected phoneme in the other subset of the pair, the combined similarities for both subsets in the pair being a "goodness of fit" which estimates how well the contexts of the selected phoneme explain variations in the acoustic feature values of the utterances of the selected phoneme;
  
  means for identifying first and second best contexts associated with the pair of subsets having the best "goodness of fit"; and
  
  means for grouping the utterances of the selected phoneme into a first output set of utterances of the selected phoneme having the first best context, and grouping the utterances of the selected phoneme into a second output set of utterances of the selected phoneme having the second best context.
- - 6. An apparatus as claimed in claim 5, characterized in that the acoustic feature has at least first and second independent components, each component having a value.
  - 7. An apparatus as claimed in claim 6, further comprising:
    - means for storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value and having a unique label identifier;
      
      means for comparing the acoustic feature value of each acoustic feature vector signal to the parameter values of the prototype vector signals to determine the associated prototype vector signal which is best matched to each acoustic feature vector signal; and
      
      means for outputting a series of label signals representing the label identifiers of the prototype vector signals associated with the acoustic feature vector signals as a coded representation of the acoustic feature values of the utterance of the training script.
  - 8. An apparatus as claimed in claim 7, further comprising:
    - means for combining the acoustic feature values of all utterances in the first output set of utterances of the selected phoneme to produce a model of the phoneme when an utterance of the phoneme has the first best context, and
      
      means for combining the acoustic feature values of all utterances in the second output set of utterances of the selected phoneme to produce a model of the phoneme when an utterance of the phoneme has the second best context.

9. A method of automatic speech recognition, said method comprising the steps of:
- measuring the value of at least one acoustic feature of an utterance to be recognized during each of a series of time intervals to produce a series of acoustic feature vector signals representing the acoustic feature values of the utterance;
  
  selecting a hypothesis model of a hypothesis sequence of one or more phonemes, said hypothesis sequence of phonemes comprising a candidate phoneme, said candidate phoneme having a context of one or more other phonemes preceding or following the candidate phoneme in the hypothesis sequence of phonemes, said hypothesis model comprising a first candidate model of the candidate phoneme if the context of the candidate phoneme is a first best context, said hypothesis model comprising a second candidate model of the candidate phoneme if the context of the candidate phoneme is a second best context; and
  
  estimating, from the hypothesis model, the probability that an utterance of the hypothesis sequence of phonemes would have a series of acoustic feature values equal to the series of acoustic feature values of the utterance to be recognized;
  
  characterized in that the first and second best contexts are determined by the steps of:
  
  providing a training script comprising a series of phonemes, said training script comprising a plurality of occurrences of the candidate phoneme, each occurrence of the candidate phoneme having a context of one or more other phonemes preceding or following the candidate phoneme in the training script;
  
  measuring the value of an acoustic feature of an utterance of the phonemes in the training script during each of a series of time intervals to produce a series of acoustic feature vector signals representing the acoustic feature values of the utterance, each acoustic feature vector signal corresponding to an occurrence of a phoneme in the training script;
  
  selecting a pair of first and second subsets of the set of occurrences of the candidate phoneme in the training script, each occurrence of the candidate phoneme in the first subset having a first context, each occurrence of the candidate phoneme in the second subset having a second context different from the first context;
  
  selecting a pair of third and fourth subsets of the set of occurrences of the candidate phoneme in the training script, each occurrence of the candidate phoneme in the third subset having a third context different from the first and second contexts, each occurrence of the candidate phoneme in the fourth subset having a fourth context different from the first, second, and third contexts;
  
  for each pair of subsets, determining the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the candidate phoneme in one subset of the pair, and determining the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the candidate phoneme in the other subset of the pair, the combined similarities for both subsets in the pair being a "goodness of fit" which estimates how well the contexts of the candidate phoneme explain variations in the acoustic feature values of the utterances of the candidate phoneme;
  
  identifying first and second best contexts associated with the pair of subsets having the best "goodness of fit"; and
  
  grouping the utterances of the candidate phoneme into a first output set of utterances of the candidate phoneme having the first best context, and grouping the utterances of the candidate phoneme into a second output set of utterances of the candidate phoneme having the second best context.
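The recognition steps of claim 9 can be sketched as follows, under simplifying assumptions. Each candidate model is represented as a table of label probabilities, the probability of an utterance is the product of independent per-label probabilities, and unseen labels receive a small floor value. These choices, and the names `select_model`, `log_prob`, `first_best`, `first_model`, and `second_model`, are hypothetical and are not taken from the patent.

```python
import math


def select_model(context, first_best, first_model, second_model):
    """Use the first candidate model if the candidate phoneme's context is
    in the first best context set, otherwise the second candidate model."""
    return first_model if context in first_best else second_model


def log_prob(labels, model, floor=1e-6):
    """Log probability of a label sequence under a label-frequency model.

    Labels are treated as independent (an assumption of this sketch);
    labels missing from the model get the floor probability.
    """
    return sum(math.log(model.get(label, floor)) for label in labels)


# Hypothetical results of training: one best context set and two models.
first_best = {"p", "b"}
first_model = {"A": 0.8, "B": 0.2}
second_model = {"A": 0.3, "B": 0.7}

# Coded utterance to be recognized, scored under two hypotheses that
# differ only in the context of the candidate phoneme.
observed = ["A", "A", "B"]
score_after_p = log_prob(
    observed, select_model("p", first_best, first_model, second_model)
)
score_after_s = log_prob(
    observed, select_model("s", first_best, first_model, second_model)
)
```

With these toy numbers the hypothesis in which the candidate phoneme follows "p" scores higher, because the first candidate model assigns more probability to label "A".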

10. An automatic speech recognition apparatus comprising:
- an acoustic processor for measuring the value of at least one acoustic feature of an utterance to be recognized during each of a series of time intervals to produce a series of acoustic feature vector signals representing the acoustic feature values of the utterance;
  
  means for selecting a hypothesis model of a hypothesis sequence of one or more phonemes, said hypothesis sequence of phonemes comprising a candidate phoneme, said candidate phoneme having a context of one or more other phonemes preceding or following the candidate phoneme in the hypothesis sequence of phonemes, said hypothesis model comprising a first candidate model of the candidate phoneme if the context of the candidate phoneme is a first best context, said hypothesis model comprising a second candidate model of the candidate phoneme if the context of the candidate phoneme is a second best context; and
  
  means for estimating, from the hypothesis model, the probability that an utterance of the hypothesis sequence of phonemes would have a series of acoustic feature values equal to the series of acoustic feature values of the utterance to be recognized;
  
  characterized in that the apparatus further comprises:
  
  means for storing a training script comprising a series of phonemes, said training script comprising a plurality of occurrences of the candidate phoneme, each occurrence of the candidate phoneme having a context of one or more other phonemes preceding or following the candidate phoneme in the training script;
  
  an acoustic processor for measuring the value of an acoustic feature of an utterance of the phonemes in the training script during each of a series of time intervals to produce a series of acoustic feature vector signals representing the acoustic feature values of the utterance, each acoustic feature vector signal corresponding to an occurrence of a phoneme in the training script;
  
  means for selecting a pair of first and second subsets of the set of occurrences of the candidate phoneme in the training script, each occurrence of the candidate phoneme in the first subset having a first context, each occurrence of the candidate phoneme in the second subset having a second context different from the first context;
  
  means for selecting a pair of third and fourth subsets of the set of occurrences of the candidate phoneme in the training script, each occurrence of the candidate phoneme in the third subset having a third context different from the first and second contexts, each occurrence of the candidate phoneme in the fourth subset having a fourth context different from the first, second, and third contexts;
  
  means for determining, for each pair of subsets, the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the candidate phoneme in one subset of the pair, and determining the similarity of the acoustic feature values of the acoustic feature vector signals corresponding to the occurrences of the candidate phoneme in the other subset of the pair, the combined similarities for both subsets in the pair being a "goodness of fit" which estimates how well the contexts of the candidate phoneme explain variations in the acoustic feature values of the utterances of the candidate phoneme;
  
  means for identifying first and second best contexts associated with the pair of subsets having the best "goodness of fit"; and
  
  means for grouping the utterances of the candidate phoneme into a first output set of utterances of the candidate phoneme having the first best context, and grouping the utterances of the candidate phoneme into a second output set of utterances of the candidate phoneme having the second best context.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: De Souza, Peter V.; Nahamoo, David; Picheny, Michael A.; Gopalakrishnan, Ponani S.; Bahl, Lalit R.
Primary Examiner(s): Knepper, David D.
Application Number: US07/871,600
Time in Patent Office: 333 Days
Field of Search: 395/2, 381/41-43
US Class Current: 704/200
CPC Class Codes:
G10L 15/063 Training
G10L 2015/025 Phonemes, fenemes or fenone...