Method and apparatus for automatic text-independent grading of pronunciation for language instruction
2 Assignments
0 Petitions
Abstract
Pronunciation quality is automatically evaluated for an utterance of speech based on one or more pronunciation scores. One type of pronunciation score is based on duration of acoustic units. Examples of acoustic units include phones and syllables. Another type of pronunciation score is based on a posterior probability that a piece of input speech corresponds to a certain model, such as a hidden Markov model, given the piece of input speech. Speech may be segmented into phones and syllables for evaluation with respect to the models. The utterance of speech may be an arbitrary utterance made up of a sequence of words which had not been encountered before. Pronunciation scores are converted into grades as would be assigned by human graders. Pronunciation quality may be evaluated in a client-server language instruction environment.
179 Citations
39 Claims
1. In an automatic speech processing system, a method for assessing pronunciation of a student speech sample using a computerized acoustic segmentation system, the method comprising:
- accepting said student speech sample which comprises a sequence of words spoken by a student speaker;
- operating said computerized acoustic segmentation system to define sample acoustic units within said student speech sample based on speech acoustic models within said segmentation system, said speech acoustic models being established using training speech data from at least one speaker, said training speech data not necessarily including said sequence of spoken words;
- measuring duration of said sample acoustic units; and
- comparing said durations of sample acoustic units to a model of exemplary acoustic unit duration to compute a duration score indicative of similarity between said sample acoustic unit durations and exemplary acoustic unit durations.
Dependent claims: 2-19
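The measuring and comparing steps of claim 1 can be illustrated with a minimal sketch; it is not the patented implementation. It assumes the segmentation system has already produced (phone label, start time, end time) triples, and it assumes a Gaussian model of log-duration per phone type as the "model of exemplary acoustic unit duration", a form the claim does not specify. All names (`Segment`, `duration_score`) and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One acoustic unit as returned by a (hypothetical) segmenter."""
    phone: str
    start: float  # seconds
    end: float    # seconds


def duration_score(segments, duration_model):
    """Average Gaussian log-likelihood of the log-durations.

    duration_model maps phone label -> (mean, std) of log-duration in
    exemplary speech. The Gaussian form is an assumption of this sketch.
    """
    total = 0.0
    for seg in segments:
        duration = seg.end - seg.start  # the "measuring duration" step
        mean, std = duration_model[seg.phone]
        z = (math.log(duration) - mean) / std
        # log N(x; mean, std): higher means closer to exemplary durations
        total += -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))
    return total / len(segments)


# Made-up statistics and segments, for illustration only.
model = {"ae": (math.log(0.10), 0.3), "t": (math.log(0.05), 0.4)}
typical = [Segment("ae", 0.00, 0.10), Segment("t", 0.10, 0.15)]
slow = [Segment("ae", 0.00, 0.30), Segment("t", 0.30, 0.50)]

print(duration_score(typical, model))
print(duration_score(slow, model))
```

With these made-up values, the sample whose units are close to the exemplary durations receives the higher score, which is the sense of "similarity" the comparing step calls for.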
20. A system for assessing pronunciation of a student speech sample, said student speech sample comprising a sequence of words spoken by a student speaker, the system comprising:
- speech acoustic models established using training speech data from at least one speaker, said training speech data not necessarily including said sequence of spoken words;
- a computerized acoustic segmentation system configured to identify acoustic units within said student speech sample based on said speech acoustic models;
- a duration extractor configured to measure duration of said sample acoustic units;
- a model of exemplary acoustic unit duration; and
- a duration scorer configured to compare said sample acoustic unit durations to said model of exemplary acoustic unit duration and compute a duration score indicative of similarity between said sample acoustic unit durations and acoustic unit durations in exemplary speech.
21. In an automatic speech processing system, a method for grading the pronunciation of a student speech sample, the method comprising:
- accepting said student speech sample which comprises a sequence of words spoken by a student speaker;
- operating a set of trained speech models to compute at least one posterior probability from said speech sample, each of said posterior probabilities being a probability that a particular portion of said student speech sample corresponds to a particular known model given said particular portion of said speech sample; and
- computing an evaluation score, herein referred to as the posterior-based evaluation score, of pronunciation quality for said student speech sample from said posterior probabilities, wherein each of said posterior probabilities is derived from a model likelihood by dividing the likelihood that said particular known model generated said particular portion of said student speech sample by the summation of the likelihoods that individual models generated said particular portion of said speech sample.
Dependent claims: 22-29
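The normalization recited at the end of claim 21 can be sketched as follows. This is an illustrative example, not the patented implementation. It assumes the per-model likelihoods for one portion of speech are already available as log-likelihoods (a common representation, assumed here); the function name `posterior_from_log_likelihoods` and the sample values are hypothetical.

```python
import math


def posterior_from_log_likelihoods(log_likelihoods, target):
    """Posterior of the `target` model for one portion of speech.

    log_likelihoods maps model name -> log-likelihood that the model
    generated the portion. The posterior is the target likelihood divided
    by the sum of all model likelihoods, computed with the log-sum-exp
    trick for numerical stability.
    """
    peak = max(log_likelihoods.values())
    log_total = peak + math.log(
        sum(math.exp(v - peak) for v in log_likelihoods.values())
    )
    return math.exp(log_likelihoods[target] - log_total)


# Made-up log-likelihoods for one portion under three phone models.
scores = {"iy": -42.0, "ih": -43.0, "eh": -47.0}
posteriors = {
    name: posterior_from_log_likelihoods(scores, name) for name in scores
}
print(posteriors)
```

Working in the log domain avoids underflow when likelihoods are very small, which is typical for acoustic models evaluated over many frames; the division itself is the same as in the claim.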
30. In an automatic speech processing system, a method for grading the pronunciation of a student speech sample, the method comprising:
- accepting said student speech sample which comprises a sequence of words spoken by a student speaker;
- operating a set of trained speech models to compute at least one posterior probability from said speech sample, each of said posterior probabilities being a probability that a particular portion of said student speech sample corresponds to a particular known model given said particular portion of said speech sample; and
- computing an evaluation score, herein referred to as the posterior-based evaluation score, of pronunciation quality for said student speech sample from said posterior probabilities, wherein:
  - said trained speech models comprise a set of phone models;
  - said student speech sample comprises phones; and
  - the step of operating said speech models comprises computing a frame-based posterior probability for each frame yt within a phone i of a phone type qi;
$$P(q_i \mid y_t) = \frac{p(y_t \mid q_i, \ldots)\, P(q_i)}{\sum_{q} p(y_t \mid q, \ldots)\, P(q)}$$
wherein:
- p(yt|qi, . . . ) is the probability of the frame yt according to a model corresponding to phone type qi;
- the sum over q runs over all phone types; and
- P(qi) represents the prior probability of the phone type qi.
Dependent claims: 31-36
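A small numeric sketch of the frame-based posterior described in claim 30 follows, under stated assumptions. Each phone model is reduced to a one-dimensional Gaussian over a scalar frame feature (real systems would use hidden Markov models over feature vectors), the additional conditioning indicated by ". . ." in the claim is omitted, and all names and numbers are hypothetical. Averaging the log posteriors over the frames of a phone is one plausible way to summarize them and is added here only for illustration; it is not recited in this claim.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for p(y_t | q)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def frame_posterior(frame, phone, models, priors):
    """P(phone | frame) = p(frame | phone) P(phone) / sum_q p(frame | q) P(q)."""
    numerator = gaussian_pdf(frame, *models[phone]) * priors[phone]
    denominator = sum(
        gaussian_pdf(frame, *models[q]) * priors[q] for q in models
    )
    return numerator / denominator


def mean_log_posterior(frames, phone, models, priors):
    """Average log posterior over a phone's frames (an assumed summary)."""
    logs = [math.log(frame_posterior(y, phone, models, priors)) for y in frames]
    return sum(logs) / len(logs)


# Hypothetical phone models (mean, std) and priors that sum to 1.
models = {"aa": (0.0, 1.0), "iy": (3.0, 1.0)}
priors = {"aa": 0.6, "iy": 0.4}

good_frames = [0.1, -0.2, 0.3]  # close to the "aa" model
poor_frames = [2.5, 3.1, 2.8]   # closer to "iy" than to "aa"

print(mean_log_posterior(good_frames, "aa", models, priors))
print(mean_log_posterior(poor_frames, "aa", models, priors))
```

Frames that fit the intended phone model better than competing models yield posteriors near 1 (log posterior near 0), while frames better explained by another phone type yield strongly negative log posteriors.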
37. A system for pronunciation training in a client/server environment wherein there exists a client process for presenting prompts to a student and for accepting student speech elicited by said prompts, the system comprising:
- a server process for sending control information to said client process to specify a prompt to be presented to said student and for receiving a speech sample derived from said student speech elicited by said presented prompt; and
- a pronunciation evaluator invocable by said server process for analyzing said student speech sample, wherein:
  - said pronunciation evaluator is established using training speech data; and
  - said server process is adapted to specify a prompt for eliciting a sequence of words not necessarily found in said training speech data as said student speech sample.
Dependent claims: 38, 39
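The division of roles in claim 37 can be pictured with a minimal in-process sketch. It is not a networked implementation and not the patented system: the classes `Server` and `Client` and the stand-in evaluator are hypothetical, audio is replaced by a text string, and the evaluator returns a placeholder number rather than a real pronunciation score.

```python
class Server:
    """Chooses prompts and hands received speech samples to an evaluator."""

    def __init__(self, prompts, evaluator):
        self.prompts = list(prompts)
        self.evaluator = evaluator  # callable: speech sample -> score

    def next_prompt(self):
        # Control information sent to the client: which prompt to present.
        return self.prompts.pop(0)

    def receive_sample(self, sample):
        # Invoke the pronunciation evaluator on the received sample.
        return self.evaluator(sample)


class Client:
    """Presents a prompt to the student and captures the response."""

    def __init__(self, record):
        self.record = record  # callable: prompt -> speech sample

    def run_turn(self, server):
        prompt = server.next_prompt()
        sample = self.record(prompt)
        return prompt, server.receive_sample(sample)


def placeholder_evaluator(sample):
    """Stand-in score: fraction of non-space characters. Not a real metric."""
    return sum(not c.isspace() for c in sample) / max(len(sample), 1)


server = Server(["Where is the train station?"], placeholder_evaluator)
client = Client(record=lambda prompt: prompt.lower())  # fake "recording"
print(client.run_turn(server))
```

Because the server only passes the sample to an evaluator callable, the prompt text is free to contain word sequences the evaluator never saw during training, which is the property the final limitation of the claim describes.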
Specification