Method and apparatus for the automatic determination of phonological rules as for a continuous speech recognition system

US 5,033,087 A
Filed: 03/14/1989
Issued: 07/16/1991
Est. Priority Date: 03/14/1989
Status: Expired due to Fees

Abstract

A continuous speech recognition system includes an automatic phonological rules generator which determines variations in the pronunciation of phonemes based on the context in which they occur. This phonological rules generator associates sequences of labels derived from vocalizations of a training text with respective phonemes inferred from the training text. These sequences are then annotated with their phoneme context from the training text and clustered into groups representing similar pronunciations of each phoneme. A decision tree is generated using the context information of the sequences to predict the clusters to which the sequences belong. The training data is processed by the decision tree to divide the sequences into leaf-groups representing similar pronunciations of each phoneme. The sequences in each leaf-group are clustered into sub-groups representing respectively different pronunciations of their corresponding phoneme in a given context. A Markov model is generated for each sub-group. The various Markov models of a leaf-group are combined into a single compound model by assigning common initial and final states to each model. The compound Markov models are used by a speech recognition system to analyze an unknown sequence of labels given its context.
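
The context-annotation step described in the abstract can be illustrated with a minimal Python sketch. It assumes the training data has already been aligned into (phoneme, label sequence) pairs; the names `AnnotatedSample`, `annotate_with_context`, the `width` parameter, and the `#` boundary symbol are hypothetical and do not come from the patent.

```python
from typing import List, NamedTuple, Sequence, Tuple


class AnnotatedSample(NamedTuple):
    phoneme: str                 # phoneme inferred from the training text
    labels: Tuple[str, ...]      # label sequence aligned to this phoneme
    left: Tuple[str, ...]        # preceding phonemes, nearest one last
    right: Tuple[str, ...]       # following phonemes, nearest one first


def annotate_with_context(
    aligned: Sequence[Tuple[str, Sequence[str]]],
    width: int = 1,
    pad: str = "#",
) -> List[AnnotatedSample]:
    """Attach `width` phonemes of left and right context to each sample."""
    phones = [p for p, _ in aligned]
    # Pad both ends so that boundary phonemes still get a full context window.
    padded = [pad] * width + phones + [pad] * width
    out = []
    for i, (phoneme, labels) in enumerate(aligned):
        j = i + width  # position of this phoneme in the padded list
        out.append(
            AnnotatedSample(
                phoneme,
                tuple(labels),
                tuple(padded[j - width:j]),
                tuple(padded[j + 1:j + 1 + width]),
            )
        )
    return out


# Toy alignment; the label names f1, f3, ... are made up for illustration.
aligned = [("K", ["f3", "f7"]), ("AE", ["f1", "f1", "f4"]), ("T", ["f9"])]
samples = annotate_with_context(aligned)
```

Here `samples[1]` carries the phoneme `AE` together with its left context `("K",)` and right context `("T",)`, which is the kind of annotated sample the later clustering and decision-tree steps operate on.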

298 Citations

24 Claims

1. A method for automatically separating vocalizations of language components into a plurality of groups representing pronunciations of the language components in respectively different contexts, said method comprising the steps of:
- A) processing a training text and vocalizations representing the training text to obtain a plurality of samples representing the language components of said vocalizations;
  
  B) selecting, from among the plurality of samples, a set of samples representing respective instances of a selected language component in the vocalizations;
  
  C) annotating each of said selected samples with a context indicator, representing at least one language component in a contextual relationship with the selected sample, to produce annotated samples;
  
  D) separating the selected samples into respectively different leaf groups based on the respective context indicators of said annotated samples, each of said leaf groups representing a pronunciation of said selected language component in a respectively different context.
- - 2. The method of automatically separating vocalizations of language components set forth in claim 1 wherein:
    - Step C) further includes the steps of:
      
      C1) grouping said annotated selected samples into a plurality of clusters, each cluster representing a respectively different pronunciation of said selected language component; and
      
      C2) further annotating each selected sample with an indicator of the cluster to which it belongs.
  - 3. The method for automatically separating vocalizations of language components set forth in claim 2 wherein:
    - said language components are phonemes;
      
      said samples are sequences of fenemes having distinct types, each of said sequences of fenemes corresponding to a respective one of said phonemes; and
      
      step C1) includes the steps of:
      
      C1a) assigning each sequence of fenemes to a respectively different prototype cluster;
      
      C1b) calculating respective expected frequency values for each type of feneme in each of the prototype clusters;
      
      C1c) statistically comparing the expected frequency values for the respective types of fenemes in each prototype cluster to the expected frequency values for the respective types of fenemes in all other prototype clusters to generate a plurality of statistical difference values, one for each pair of prototype clusters;
      
      C1d) combining pairs of prototype clusters that exhibit a statistical difference value which is less than a threshold value to generate new prototype clusters;
      
      C1e) repeating steps C1b through C1d until no pair of prototype clusters exhibits a statistical difference value which is less than the threshold value;
      
      wherein each cluster corresponds to a respectively different one of the clusters of step C1.
  - 4. The method for automatically separating vocalizations of language components set forth in claim 3 wherein the step C1c includes the steps of:
    - generating a plurality of probabilistic models, each model representing a Markov model of the pronunciation of the phoneme represented by a respectively different one of said prototype clusters;
      
      generating a plurality of histograms, each histogram representing the relative frequency of occurrence of each feneme in a respectively different one of said prototype clusters;
      
      calculating, for each pair of prototype clusters, a log-likelihood ratio that the histograms for each prototype cluster in said pair match expected frequencies of the respective types of fenemes in a single probabilistic model representing a combination of the respective probabilistic models for said pair of prototype clusters, wherein a sign inverted version of said log-likelihood ratio is the statistical difference value for said pair of clusters.
  - 5. The method for automatically separating vocalizations of language components set forth in claim 1, further comprising the step of generating, for each leaf group of said decision tree, a probabilistic model representing the pronunciation of the language component represented by the samples of said leaf group.
  - 6. The method for automatically separating vocalizations of language components set forth in claim 5, wherein each of the samples is classified as to type and said probabilistic model has the form of a Markov model representing respective relative frequencies of occurrence of the samples of each type in the leaf group.
  - 7. The method for automatically separating vocalizations of language components set forth in claim 5 wherein the step of generating a probabilistic model for each leaf group of the decision tree includes the steps of:
    - grouping the samples of the leaf group into a plurality of clusters, each cluster representing a respectively different pronunciation of the language component represented by the samples of the leaf group;
      
      generating, from said plurality of clusters, a respective plurality of statistical models each of said statistical models having the form of a Markov model; and
      
      augmenting said plurality of statistical models by adding a common initial state and a common final state to each model to generate said probabilistic model.
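
Step D) of claim 1, separating annotated samples into leaf groups by their context indicators, can be sketched as routing each sample through a binary tree of context questions. This is only an illustration under assumed data structures: `Node`, `Leaf`, `find_leaf`, `separate_into_leaf_groups`, the example questions, and the leaf names are all hypothetical, and the tree here is written by hand rather than generated automatically.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Leaf:
    name: str


@dataclass
class Node:
    question: Callable[[dict], bool]   # yes/no test on the context indicator
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def find_leaf(tree: Union[Node, Leaf], context: dict) -> str:
    """Follow the answers to the context questions down to a leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(context) else node.no
    return node.name


def separate_into_leaf_groups(tree, samples: List[dict]) -> Dict[str, List[dict]]:
    """Group samples by the leaf their context indicator reaches."""
    groups = defaultdict(list)
    for sample in samples:
        groups[find_leaf(tree, sample["context"])].append(sample)
    return dict(groups)


NASALS = {"M", "N", "NG"}  # assumed phoneme class, for illustration only
tree = Node(
    lambda c: c["next"] in NASALS,
    yes=Leaf("before_nasal"),
    no=Node(lambda c: c["prev"] == "#", yes=Leaf("word_initial"), no=Leaf("other")),
)

samples = [
    {"labels": ["f1", "f2"], "context": {"prev": "K", "next": "N"}},
    {"labels": ["f1", "f4"], "context": {"prev": "#", "next": "T"}},
    {"labels": ["f5"], "context": {"prev": "S", "next": "T"}},
    {"labels": ["f1", "f1"], "context": {"prev": "#", "next": "M"}},
]
groups = separate_into_leaf_groups(tree, samples)
```

Each key of `groups` stands for one context in which the selected language component occurs, and the samples stored under it are the training instances for that context.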

8. In an automatic speech recognition system, a method for associating vocalizations of a continuously spoken sequence of words with respective language components, comprising the steps of:
- generating a sampled data signal representing the vocalizations of said continuously spoken sequence of words;
  
  A) associating a first language component with a first set of samples of said sampled data signal;
  
  B) associating a second set of samples of said sampled data signal with a second language component;
  
  C) accessing a decision means with the first language component as a context indicator to define a probabilistic model to be used to relate the second set of samples to the second language component, the probabilistic model defined by said decision means representing a distinct pronunciation of a selected language component in terms of a context provided for the selected language component;
  
  D) calculating, from said defined probabilistic model, a likelihood that the second language component corresponds to the second set of samples;
  
  E) repeating steps A) through D) for a plurality of second language components; and
  
  F) associating the one of the second language component and the plurality of second language components having the greatest likelihood with said second set of samples.
- - 9. The method set forth in claim 8 wherein:
    - step B) includes the step of calculating the likelihood that the second language component corresponds to the second set of samples, as defined by a context-independent probabilistic model representing the pronunciation of said second language component; and
      
      step C) includes the step of accessing a binary decision tree with said context indicator to define said probabilistic model.
  - 10. The method set forth in claim 8 wherein:
    - said probabilistic model has the form of a Markov model composed of a plurality of Markov sub-models having common initial and final states.
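
The recognition loop of claim 8 (choose a context-dependent model, score each candidate, keep the most likely one) might look roughly like the following sketch. It makes strong simplifying assumptions that are not in the patent: the "decision means" is reduced to a table lookup keyed on the preceding phoneme with a `"*"` fallback, and each model is a plain label-frequency table rather than a Markov model. All function names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Model = Dict[str, float]  # label -> probability (stand-in for a Markov model)


def log_likelihood(model: Model, labels: List[str], floor: float = 1e-6) -> float:
    """Log-probability of the labels, treating them as independent draws."""
    return sum(math.log(model.get(label, floor)) for label in labels)


def pick_model(models: Dict[str, Dict[str, Model]], candidate: str, context: str) -> Model:
    """Stand-in for the decision means: context-specific model if one exists."""
    table = models[candidate]
    return table.get(context, table["*"])


def best_candidate(
    models: Dict[str, Dict[str, Model]], context: str, labels: List[str]
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate under its context-dependent model; keep the best."""
    scores = {
        cand: log_likelihood(pick_model(models, cand, context), labels)
        for cand in models
    }
    return max(scores, key=scores.get), scores


# Toy models: "T" is pronounced differently after "S" than elsewhere.
models = {
    "T": {"*": {"f1": 0.7, "f2": 0.3}, "S": {"f1": 0.2, "f2": 0.8}},
    "D": {"*": {"f1": 0.5, "f2": 0.5}},
}
observed = ["f2", "f2", "f1"]
after_s, _ = best_candidate(models, "S", observed)
after_k, _ = best_candidate(models, "K", observed)
```

With these toy numbers the same observed labels are assigned to `T` when the preceding phoneme is `S` and to `D` when it is `K`, which illustrates why the context indicator is consulted before the likelihood is computed.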

11. In a speech recognition system, a method for dividing a set of sample values representing respective vocalizations into first and second groups of sample values using an automatically generated test question, the method comprising the steps of:
- A) annotating each of the sample values in said set with an indicator of a first attribute of said sample values;
  
  B) further annotating each of the sample values in said set with an indicator of a second attribute of said sample values, said set of further annotated samples having a predetermined entropy value measured with respect to said second attribute indicator; and
  
  C) generating the test question, in terms of the first attribute indicator of said further annotated samples, wherein, the test question is applied to divide the further annotated sample values into said first and second groups, wherein said first and second groups of samples have a combined entropy value, measured with respect to said second attribute indicator, that is less than said predetermined entropy value.
- - 12. The method set forth in claim 11 wherein step C) includes the steps of:
    - grouping the annotated sample values in said set into a plurality of clusters according to the second attribute of said sample values; and
      
      further annotating each of said annotated samples with an indicator of the cluster to which it belongs as said second attribute indicator.
  - 13. The method set forth in claim 12, further including the steps of:
    - generating a first further test question in terms of the first attribute indicator of the sample values in said second group which transfers sample values from said second group to said first group to form revised first and second groups of sample values having a reduced combined entropy value measured with respect to said second attribute indicator; and
      
      generating a second further test question in terms of the first attribute indicator of said sample values in said first group which transfers sample values from said first group to said second group to form further revised first and second groups of sample values having a further reduced combined entropy value measured with respect to said second attribute indicator.
  - 14. The method set forth in claim 13, wherein:
    - said sample values are respective sequences of fenemes representing sequential vocalizations of phonemes;
      
      said first attribute of said sample values relates to the phonemes occurring proximate in time with the respective phonemes corresponding to said sample values; and
      
      said second attribute of said sample values relates to the vocalizations represented by said sequences of fenemes.
  - 15. The method set forth in claim 12, wherein each sample value in said set of sample values includes a sequence of subsample values having distinct types and the step of grouping the annotated sample values in said set into clusters includes the steps of:
    - assigning each sample value to a respectively different prototype cluster;
      
      calculating respective expected frequency values for each type of subsample in each of the prototype clusters;
      
      statistically comparing the expected frequency values of each subsample type in each prototype cluster to the expected frequencies of each subsample type in all other prototype clusters to generate a plurality of statistical difference values, one for each pair of prototype clusters;
      
      combining pairs of prototype clusters that exhibit a statistical difference value which is less than a threshold value to generate new prototype clusters.
  - 16. The method set forth in claim 15 wherein the step of statistically comparing the expected frequency of each subsample type in each prototype cluster to the expected frequencies of all other subsample types in all other prototype clusters includes the steps of:
    - generating a plurality of probabilistic models, each model representing a Markov model of the sequences of subsamples represented by a respectively different one of said prototype clusters;
      
      generating a plurality of histograms, each histogram representing the relative frequency of occurrence of each type of subsample in a respectively different one of said prototype clusters;
      
      calculating, for each pair of prototype clusters, a log-likelihood ratio that the histograms for each prototype cluster in said pair match the expected frequencies of subsamples in a single probabilistic model representing a combination of the respective probabilistic models for said pair of prototype clusters, wherein a sign inverted version of said log-likelihood ratio is the statistical difference value for said pair of clusters.
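
The test-question generation of claim 11 can be sketched as a search for a set of first-attribute values (here, a neighbouring phoneme) whose membership test splits the samples so that the combined entropy of the second attribute (here, a cluster label) falls. The greedy set-growing loop below is a simplification chosen for brevity, not the transfer procedure of claim 13; `entropy`, `split_entropy`, `grow_question`, and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Sample = Tuple[str, str]  # (context value, cluster label)


def entropy(labels: List[str]) -> float:
    """Shannon entropy in bits of a list of cluster labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(samples: List[Sample], members: Set[str]) -> float:
    """Size-weighted entropy of the two groups induced by 'context in members?'."""
    yes = [cl for ctx, cl in samples if ctx in members]
    no = [cl for ctx, cl in samples if ctx not in members]
    n = len(samples)
    return (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)


def grow_question(samples: List[Sample]) -> Tuple[Set[str], float]:
    """Greedily add context values to the question while entropy keeps dropping."""
    values = {ctx for ctx, _ in samples}
    chosen: Set[str] = set()
    best = entropy([cl for _, cl in samples])  # entropy before any split
    improved = True
    while improved:
        improved = False
        for value in sorted(values - chosen):
            h = split_entropy(samples, chosen | {value})
            if h < best - 1e-12:
                best, pick, improved = h, value, True
        if improved:
            chosen.add(pick)
    return chosen, best


samples = [
    ("N", "c1"), ("M", "c1"), ("N", "c1"),
    ("T", "c2"), ("K", "c2"), ("T", "c2"), ("S", "c2"),
]
question, remaining = grow_question(samples)
```

For this toy data the search ends with the question "is the context one of {M, N}?", which separates the two cluster labels completely, so the combined entropy after the split is zero.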

17. In a voice recognition system, a method for grouping a plurality of vocalizations, represented by respective sequences of samples having respective sample values, into clusters comprising the steps of:
- A) assigning each sequence of samples to a respectively different prototype cluster;
  
  B) calculating an expected frequency value for each sample value in each prototype cluster;
  
  C) statistically comparing the expected frequency value of each sample value of each prototype cluster to the expected frequencies of all other sample values in all other prototype clusters to generate a plurality of statistical difference values, one for each pair of clusters;
  
  D) combining pairs of prototype clusters that exhibit a statistical difference value which is less than a threshold value to generate new prototype clusters.
- - 18. The method of grouping a plurality of sequences of samples set forth in claim 17 further including the step of repeating steps B) through D) until no prototype clusters can be combined.
  - 19. The method of grouping a plurality of vocalizations set forth in claim 18 wherein the step of statistically comparing the expected frequency of each sample value of each prototype cluster to the expected frequencies of all other sample values of all other prototype clusters includes the steps of:
    - generating a plurality of probabilistic models, each model representing a Markov model of the sequences of samples represented by a respectively different one of said prototype clusters;
      
      generating a plurality of histograms, each histogram representing the relative frequency of occurrence of each sample value in a respectively different one of said prototype clusters;
      
      calculating, for each pair of prototype clusters, a log-likelihood ratio that the histograms for each prototype cluster in said pair match the expected frequency values of sample values of a single probabilistic model representing a combination of the respective probabilistic models for said pair of prototype clusters, wherein a sign inverted version of said log-likelihood ratio is the statistical difference value for said pair of clusters.
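
The bottom-up clustering of claims 17 to 19 can be approximated in a few lines of Python. This sketch swaps in a simple multinomial (label-frequency) model for the Markov models named in the claims, and it merges only the single closest pair per pass instead of all pairs under the threshold; both are assumptions made to keep the example short. The names `log_lik`, `difference`, and `cluster` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def log_lik(hist: Counter) -> float:
    """Log-likelihood of a histogram under its own maximum-likelihood frequencies."""
    n = sum(hist.values())
    return sum(c * math.log(c / n) for c in hist.values() if c > 0)


def difference(h1: Counter, h2: Counter) -> float:
    """Sign-inverted log-likelihood ratio: merged model versus separate models."""
    merged = h1 + h2
    llr = log_lik(merged) - (log_lik(h1) + log_lik(h2))  # always <= 0
    return -llr


def cluster(sequences: Sequence[Sequence[str]], threshold: float) -> List[List[int]]:
    """Start with one prototype cluster per sequence; merge while pairs are close."""
    hists = [Counter(seq) for seq in sequences]
    members = [[i] for i in range(len(sequences))]
    while len(hists) > 1:
        d, a, b = min(
            (difference(hists[i], hists[j]), i, j)
            for i, j in combinations(range(len(hists)), 2)
        )
        if d >= threshold:
            break  # no remaining pair is similar enough to combine
        hists[a] = hists[a] + hists[b]
        members[a] = members[a] + members[b]
        del hists[b], members[b]  # b > a, so index a is unaffected
    return members


sequences = [
    ["f1", "f1", "f2"],
    ["f1", "f1", "f1", "f2"],
    ["f7", "f8", "f8"],
    ["f8", "f8", "f7", "f8"],
]
result = cluster(sequences, threshold=1.0)
```

In this toy run the first two sequences share one label distribution and the last two share another, so the procedure stops with two clusters. The threshold value of 1.0 is arbitrary and would need tuning on real data.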

20. Apparatus for automatically separating vocalizations of language components into a plurality of groups representing pronunciations of the language components in respectively different contexts, comprising:
- sampling means for converting a training text and vocalizations of the training text into a plurality of samples representing the language components of said vocalizations;
  
  means for selecting, from among the plurality of samples, a set of samples representing respective instances of a selected language component in the vocalizations;
  
  processing means including means for annotating each of said selected samples with a context indicator representing at least one language component in a contextual relationship with the selected sample, to produce annotated samples;
  
  means for separating the selected samples into respectively different leaf groups based on the respective context indicators of said annotated samples, each of said leaf groups representing a pronunciation of said selected language component in a respectively different context.
- - 21. The apparatus set forth in claim 20 wherein:
    - the processing means further includes:
      
      means for grouping said annotated selected samples into a plurality of clusters, each cluster representing a respectively different pronunciation of said selected language component; and
      
      means for further annotating each selected sample with an indicator of the cluster to which it belongs.
  - 22. The apparatus set forth in claim 21 wherein:
    - said language components are phonemes;
      
      said samples are sequences of fenemes having distinct types, each of said sequences of fenemes corresponding to a respective one of said phonemes; and
      
      the means for grouping said annotated selected samples includes:
      
      means for assigning each sequence of fenemes to a respectively different prototype cluster;
      
      means for calculating respective expected frequency values for each type of feneme in each of the prototype clusters;
      
      statistical comparison means for statistically comparing the expected frequency values for the respective types of fenemes in each prototype cluster to the expected frequency values for the respective types of fenemes in all other prototype clusters to generate a plurality of statistical difference values, one for each pair of prototype clusters; and
      
      means for combining pairs of prototype clusters that exhibit a statistical difference value which is less than a threshold value to generate new prototype clusters.
  - 23. The apparatus of claim 22 wherein the statistical comparison means includes:
    - means for generating a plurality of probabilistic models, each model representing a Markov model of the pronunciation of the phoneme represented by a respectively different one of said prototype clusters;
      
      means for generating a plurality of histograms, each histogram representing the relative frequency of occurrence of each feneme in a respectively different one of said prototype clusters;
      
      means for calculating, for each pair of prototype clusters, a log-likelihood ratio that the histograms for each prototype cluster in said pair match expected frequencies of the respective types of fenemes in a single probabilistic model representing a combination of the respective probabilistic models for said pair of prototype clusters, wherein a sign inverted version of said log-likelihood ratio is the statistical difference value for said pair of clusters.
  - 24. The apparatus of claim 20 further comprising means for generating a probabilistic model for each leaf group of the decision tree, including:
    - means for grouping the samples of the leaf group into a plurality of clusters, each cluster representing a respectively different pronunciation of the language component represented by the samples of the leaf group;
      
      means for generating, from said plurality of clusters, a respective plurality of statistical models each of said statistical models having the form of a Markov model; and
      
      means for augmenting said plurality of statistical models by adding a common initial state and a common final state to each model to generate said probabilistic model.
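
The compound model of claims 7 and 24 joins the per-cluster models through a common initial state and a common final state. If the sub-models are otherwise disjoint, every path through the compound model passes through exactly one sub-model, so the compound likelihood is a weighted sum over sub-models. The sketch below illustrates only that idea and rests on two assumptions that are not from the patent: each sub-model is a label-frequency table rather than a true Markov model, and the entry weight of a sub-model is the share of training sequences in its cluster. The names `build_compound`, `sub_model_prob`, and `compound_prob` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

SubModel = Tuple[float, Dict[str, float]]  # (entry weight, label probabilities)


def build_compound(clusters: Sequence[Sequence[Sequence[str]]]) -> List[SubModel]:
    """One sub-model per pronunciation cluster, weighted by cluster size."""
    total = sum(len(cluster) for cluster in clusters)
    compound = []
    for cluster in clusters:
        counts = Counter(label for seq in cluster for label in seq)
        n = sum(counts.values())
        emissions = {label: c / n for label, c in counts.items()}
        compound.append((len(cluster) / total, emissions))
    return compound


def sub_model_prob(emissions: Dict[str, float], labels: Sequence[str],
                   floor: float = 1e-6) -> float:
    """Probability of the labels under one sub-model (independent draws)."""
    p = 1.0
    for label in labels:
        p *= emissions.get(label, floor)  # floor avoids zero for unseen labels
    return p


def compound_prob(compound: List[SubModel], labels: Sequence[str]) -> float:
    """Sum over sub-models: entry weight times sub-model probability."""
    return sum(w * sub_model_prob(e, labels) for w, e in compound)


# Two pronunciation clusters of one leaf group (toy label sequences).
clusters = [[["f1", "f1"], ["f1", "f2"]], [["f9"]]]
compound = build_compound(clusters)
p = compound_prob(compound, ["f1", "f1"])
```

A real implementation would replace `sub_model_prob` with a forward-algorithm computation over each Markov sub-model; the weighted sum over sub-models would stay the same.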

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bahl, Lalit R.; Brown, Peter F.; Mercer, Robert L.; DeSouza, Peter V.
Primary Examiner(s): Not defined
Assistant Examiner(s): Not defined
Application Number: US07/323,479
Time in Patent Office: 854 Days
Field of Search: 381/41-46, 364/513.5
US Class Current: 704/256.5
CPC Class Codes: G10L 15/14 using statistical models, e...