Apparatuses and methods for developing and using models for speech recognition
Abstract
A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned against speech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group. A decision tree classifies speech sounds into such groups, and related speech sound groups descend from common tree nodes. New speech samples time aligned against a given speech sound group's model update models of related speech sound groups, decreasing the training data required to adapt the system. The phonetic context classifications can be based on knowledge of which contextual features are associated with acoustic similarity. The computerized system samples speech sounds using a first, larger, parameter set; automatically selects combinations of phonetic context classifications which divide the speech sounds into groups whose frames are acoustically similar, such as by use of a decision tree; selects a second, smaller, set of parameters based on that set's ability to separate the frames aligned with each speech sound group, such as by use of linear discriminant analysis; and then uses these new parameters to represent frames and speech sound models. Then, using the new parameters, a decision tree classifier can be used to re-classify the speech sounds and to calculate new acoustic models for the resulting groups of speech sounds.
16 Claims
1. A computerized method for automatically creating models of speech sounds to be used in speech recognition comprising the steps of:
receiving training signals representing the sound of spoken words;
storing a plurality of phonetic context units, each representing a speech sound in a phonetic context defined by one or more phonetic features, and associating with each phonetic context unit an initial acoustic model to represent its associated speech sound;
time aligning successive time frames of the training signals against the initial models of the phonetic context units of the words corresponding to those training signals, to associate each frame with the phonetic context unit whose sound it represents;
storing a set of classifications, each representing a possible set of one or more of the phonetic features which can be associated with one of said phonetic context units;
using an automatic classification routine to select a plurality of sub-sets of said classifications which divide the phonetic context units into phonetic context groups, such that the phonetic context units in each such phonetic context group tend to be time aligned against acoustically similar frames;
developing shared acoustic model components for a plurality of phonetic context groups whose associated sub-sets of classifications share a sub-sub-set of classifications and whose frames have a certain acoustic similarity, which shared acoustic model components contain statistical information derived from frames time aligned against the phonetic context units in different ones of said plurality of phonetic context groups; and
developing an acoustic model for each given phonetic context group in said plurality of phonetic context groups which contains a combination of said statistical information contained in the model components shared by said plurality of groups and more specific statistical information representing the frames time aligned against the phonetic context units in the given individual phonetic context group.
(Dependent claim: 2)
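The classification step recited above, selecting phonetic-feature classifications so that each resulting phonetic context group tends to be time aligned against acoustically similar frames, can be sketched as a single decision-tree-style split. Everything below (unit names, feature labels, scalar "frame" values, and the within-group variance criterion) is an illustrative assumption for the sketch, not the patent's specified procedure:

```python
def variance(frames):
    """Within-group variance of scalar frame values."""
    if not frames:
        return 0.0
    m = sum(frames) / len(frames)
    return sum((x - m) ** 2 for x in frames) / len(frames)

def frames_of(group):
    """All frames time aligned against the units in a group."""
    return [f for u in group for f in u["frames"]]

def best_split(units, questions):
    """Pick the phonetic-feature question whose yes/no partition of the
    units gives the lowest total within-group frame variance, so each
    resulting group's frames are acoustically similar."""
    best = None
    for question in questions:
        yes = [u for u in units if question in u["features"]]
        no = [u for u in units if question not in u["features"]]
        if not yes or not no:
            continue  # a split must produce two non-empty groups
        fy, fn = frames_of(yes), frames_of(no)
        cost = variance(fy) * len(fy) + variance(fn) * len(fn)
        if best is None or cost < best[0]:
            best = (cost, question, yes, no)
    return best

# Toy triphone-like phonetic context units with hypothetical scalar frames.
units = [
    {"name": "t/a_a", "features": {"left-vowel"}, "frames": [1.0, 1.2, 0.9]},
    {"name": "t/i_i", "features": {"left-vowel", "left-front"}, "frames": [1.1, 1.3]},
    {"name": "t/s_s", "features": {"left-fricative"}, "frames": [5.0, 5.2]},
]
cost, question, yes, no = best_split(units, ["left-vowel", "left-front"])
```

On this toy data the routine separates the fricative-context unit (frames near 5) from the vowel-context units (frames near 1), because that partition minimizes within-group variance.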
3. A computerized method for automatically creating models of speech sounds to be used in speech recognition comprising the steps of:
receiving training signals representing the sound of spoken words;
storing a plurality of phonetic context units, each representing a speech sound in a phonetic context defined by one or more phonetic features, and associating with each phonetic context unit an initial acoustic model to represent its associated speech sound;
time aligning successive time frames of the training signals against the initial models of the phonetic context units of the words corresponding to those training signals, to associate each frame with the phonetic context unit whose sound it represents;
storing a set of classifications, each representing a possible set of one or more of the phonetic features which can be associated with one of said phonetic context units;
using an automatic classification routine to select a plurality of sub-sets of said classifications which divide the phonetic context units into phonetic context groups, such that the phonetic context units in each such phonetic context group tend to be time aligned against acoustically similar frames;
developing shared acoustic model components for a plurality of phonetic context groups whose associated sub-sets of classifications share a sub-sub-set of classifications, and whose frames have a certain acoustic similarity; and
developing an acoustic model for each phonetic context group in said plurality of phonetic context groups based both on the model components shared by said plurality of groups and on the frames associated with the phonetic context units in the individual phonetic context group;
wherein:
said classification routine builds a decision tree to select the plurality of sub-sets of classifications used to divide the phonetic context units into phonetic context groups;
said plurality of phonetic context groups which share a sub-sub-set of classifications are descendants from a common ancestor node in a common decision tree;
the acoustic model components shared by a set of phonetic context groups having a common ancestor node are a set of probability distribution models, each representing a possible distribution of multi-dimensional acoustic values associated with frames time aligned against the phonetic context units of said set of phonetic context groups; and
the acoustic model developed for an individual phonetic context group is a mixture model made up of a weighted sum of such a set of distribution models, with each distribution model in the set being weighted so the sum of such models better represents the distribution of frame values associated with the individual phonetic context group.
(Dependent claims: 4, 5, 6, 7, 8)
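The mixture model described in the wherein clause, a weighted sum of distribution models shared by groups under a common ancestor node, can be sketched as follows. The Gaussian parameters, the single responsibility pass used to estimate the weights, and all data values are illustrative assumptions, not the claimed implementation:

```python
import math

def gauss(x, mean, var):
    """Scalar Gaussian density (a stand-in for the claim's
    multi-dimensional distribution models)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical codebook of distribution models shared by all phonetic
# context groups descending from one ancestor node: (mean, variance).
shared = [(0.0, 1.0), (3.0, 1.0)]

def estimate_weights(frames, shared):
    """One EM-style responsibility pass: weight each shared distribution
    by how much of this group's frames it explains (a minimal sketch)."""
    counts = [0.0] * len(shared)
    for x in frames:
        dens = [gauss(x, m, v) for m, v in shared]
        total = sum(dens)
        for i, d in enumerate(dens):
            counts[i] += d / total
    s = sum(counts)
    return [c / s for c in counts]

def mixture_density(x, weights, shared):
    """The group's acoustic model: a weighted sum of the shared models."""
    return sum(w * gauss(x, m, v) for w, (m, v) in zip(weights, shared))

# A group whose frames cluster near 0 puts most weight on the first
# shared Gaussian, so the weighted sum fits its own frame distribution.
w = estimate_weights([-0.2, 0.1, 0.0, 0.3], shared)
```

Two groups under the same ancestor share the `shared` codebook but get different `w` vectors, which is the tying scheme the claim describes: shared statistics plus group-specific weighting.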
9. A computerized method for automatically creating models of speech sounds to be used in speech recognition comprising the steps of:
receiving training signals representing the sound of spoken words;
storing a plurality of phonetic context units, each representing a speech sound in a phonetic context defined by one or more phonetic features, and associating with each phonetic context unit an initial acoustic model to represent its associated speech sound;
time aligning successive time frames of the training signals against the initial models of the phonetic context units of the words corresponding to those training signals, to associate each frame with the phonetic context unit whose sound it represents;
storing a set of classifications, each representing a possible set of one or more of the phonetic features which can be associated with one of said phonetic context units;
using an automatic classification routine to select a plurality of sub-sets of said classifications which divide the phonetic context units into phonetic context groups, such that the phonetic context units in each such phonetic context group tend to be time aligned against acoustically similar frames;
developing an acoustic model for each phonetic context group in said plurality of phonetic context groups based on the frames associated with the phonetic context units in that phonetic context group;
receiving additional signals representing additional sounds of spoken words;
time aligning successive time frames of the additional signals against the models of phonetic context groups corresponding to the phonetic context units of the words corresponding to those additional signals, to associate each such additional frame with a phonetic context group; and
automatically combining acoustic data from an additional frame time aligned against a given phonetic context group into the acoustic models of a different phonetic context group which shares a sub-sub-set of said classifications with, and whose associated frames have a given acoustic similarity to, said given phonetic context group.
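The final combining step above, letting adaptation frames aligned against one phonetic context group also update the models of acoustically similar, classification-sharing groups, might look like the following minimal sketch. The mean-and-count model form, the `share` weight, and the group names are all hypothetical; the claim itself does not fix how the combination is weighted:

```python
def adapt(models, leaf, new_frames, related, share=0.5):
    """Fold new adaptation frames into `leaf`'s model at full weight,
    and into each related group's model at a reduced weight, so related
    groups also benefit from the new data (a minimal sketch).
    `models` maps group name -> (mean, effective frame count)."""
    def update(name, frames, weight):
        mean, count = models[name]
        n = len(frames) * weight
        new_mean = (mean * count + sum(frames) * weight) / (count + n)
        models[name] = (new_mean, count + n)
    update(leaf, new_frames, 1.0)
    for r in related:
        update(r, new_frames, share)

# Hypothetical sibling groups under one decision-tree ancestor node.
models = {"t+front": (1.0, 10.0), "t+back": (1.1, 10.0)}
adapt(models, "t+front", [2.0, 2.0], related=["t+back"], share=0.5)
```

After the call, both means move toward the new data (2.0), with the directly aligned group moving further; this illustrates how sharing reduces the adaptation data each individual group needs.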
10. A computerized method for automatically creating models of speech sounds to be used in speech recognition comprising the steps of:
receiving training signals representing the sound of spoken words;
storing a plurality of phonetic context units, each representing a speech sound in a phonetic context defined by one or more phonetic features, and associating with each phonetic context unit an initial acoustic model to represent its associated speech sound using p parameters;
time aligning successive time frames of the training signals, represented in said p parameters, against the initial models of the phonetic context units of the words corresponding to those training signals, to associate each frame with the phonetic context unit whose sound it represents;
storing a set of classifications, each representing a possible set of one or more of the phonetic features which can be associated with one of said phonetic context units;
using an automatic classification routine to select a plurality of sub-sets of said classifications which divide the phonetic context units into phonetic context groups, such that the phonetic context units in each such phonetic context group tend to be time aligned against acoustically similar p-parameter frames;
using an automatic parameter selection routine to select a new set of q acoustic parameters derived from said p parameters, where q is less than p, which, for a given q, produces a relatively optimal separation between the sets of frames associated with each phonetic context group; and
using said q parameters to build a set of second acoustic models, for use in speech recognition, to represent said phonetic context units.
(Dependent claim: 11)
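The parameter selection step above, keeping the q of p parameters that best separate the frame sets of the phonetic context groups, can be illustrated with a simple per-parameter separation score. The Fisher-style between/within variance ratio and the toy data are assumptions for the sketch; the claim names only a routine producing relatively optimal separation, and this one-parameter-at-a-time selection is a simplification of methods (like linear discriminant analysis) that form linear combinations of parameters:

```python
def fisher_score(groups, dim):
    """Between-group vs. within-group variance of one parameter (dim)
    across groups of frame vectors. Higher = better separation."""
    means, variances, sizes = [], [], []
    for frames in groups:
        vals = [f[dim] for f in frames]
        m = sum(vals) / len(vals)
        means.append(m)
        variances.append(sum((v - m) ** 2 for v in vals) / len(vals))
        sizes.append(len(vals))
    grand = sum(m * n for m, n in zip(means, sizes)) / sum(sizes)
    between = sum(n * (m - grand) ** 2 for m, n in zip(means, sizes))
    within = sum(n * v for v, n in zip(variances, sizes)) or 1e-12
    return between / within

def select_q_of_p(groups, p, q):
    """Keep the q parameter indices with the best separation scores."""
    scores = [(fisher_score(groups, d), d) for d in range(p)]
    return sorted(d for _, d in sorted(scores, reverse=True)[:q])

# p=3 parameters: dimension 0 separates the two hypothetical phonetic
# context groups, while dimensions 1 and 2 carry little information.
g1 = [(0.0, 5.0, 1.0), (0.2, 4.9, 1.1)]
g2 = [(3.0, 5.1, 0.9), (3.2, 5.0, 1.0)]
kept = select_q_of_p([g1, g2], p=3, q=1)
```

On this data only dimension 0 survives, since it is the only parameter along which the two groups' frames are well separated.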
12. A computerized method for automatically creating models of speech sounds to be used in speech recognition comprising the steps of:
receiving training signals representing the sound of spoken words;
storing a plurality of phonetic context units, each representing a speech sound in a phonetic context defined by one or more phonetic features, and associating with each phonetic context unit an initial acoustic model to represent its associated speech sound using p parameters;
time aligning successive time frames of the training signals, represented in said p parameters, against the initial models of the phonetic context units of the words corresponding to those training signals, to associate each frame with the phonetic context unit whose sound it represents;
storing a set of classifications, each representing a possible set of one or more of the phonetic features which can be associated with one of said phonetic context units;
using an automatic classification routine to select a plurality of sub-sets of said classifications which divide the phonetic context units into phonetic context groups, such that the phonetic context units in each such phonetic context group tend to be time aligned against acoustically similar p-parameter frames;
using an automatic parameter selection routine to select a new set of q acoustic parameters derived from said p parameters, where q is less than p, which, for a given q, produces a relatively optimal separation between the sets of frames associated with each phonetic context group; and
using said q parameters to build a set of second acoustic models, for use in speech recognition, to represent said phonetic context units;
wherein:
said automatic parameter selection routine performs linear discriminant analysis to produce an L matrix which converts said p parameters into said q parameters; and
said classification routine builds a decision tree to select the plurality of sub-sets of classifications used to divide the phonetic context units into phonetic context groups.
(Dependent claims: 13, 14, 15, 16)
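Claim 12's wherein clause names linear discriminant analysis producing an L matrix that converts p parameters to q. Below is a minimal NumPy sketch of such a matrix, assuming the standard scatter-matrix formulation of LDA; the patent's exact computation may differ, and the group data here is synthetic:

```python
import numpy as np

def lda_matrix(groups, q):
    """Build a q x p matrix L projecting p-dimensional frames onto the q
    directions that best separate the groups (standard LDA sketch)."""
    all_frames = np.vstack(groups)
    grand = all_frames.mean(axis=0)
    p = all_frames.shape[1]
    Sw = np.zeros((p, p))  # within-group scatter
    Sb = np.zeros((p, p))  # between-group scatter
    for g in groups:
        mean = g.mean(axis=0)
        centered = g - mean
        Sw += centered.T @ centered
        d = (mean - grand).reshape(-1, 1)
        Sb += len(g) * (d @ d.T)
    # Eigenvectors of Sw^-1 Sb with the largest eigenvalues maximize the
    # between/within variance ratio of the projected frames.
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:q]].T  # shape (q, p)

# Two hypothetical phonetic context groups, separated along one axis.
rng = np.random.default_rng(0)
g1 = rng.normal([0.0, 0.0, 0.0], 0.1, size=(20, 3))
g2 = rng.normal([5.0, 0.0, 0.0], 0.1, size=(20, 3))
L = lda_matrix([g1, g2], q=1)       # converts p=3 parameters to q=1
proj1, proj2 = g1 @ L.T, g2 @ L.T   # frames re-expressed in q parameters
```

The projected groups stay well separated in the single retained dimension, which is what the claim requires of the q-parameter representation used to build the second set of acoustic models.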
Specification