Method and apparatus for automatic text-independent grading of pronunciation for language instruction

US 6,055,498 A
Filed: 10/02/1997
Issued: 04/25/2000
Est. Priority Date: 10/02/1996
Status: Expired due to Term

Abstract

Pronunciation quality is automatically evaluated for an utterance of speech based on one or more pronunciation scores. One type of pronunciation score is based on duration of acoustic units. Examples of acoustic units include phones and syllables. Another type of pronunciation score is based on a posterior probability that a piece of input speech corresponds to a certain model, such as a hidden Markov model, given the piece of input speech. Speech may be segmented into phones and syllables for evaluation with respect to the models. The utterance of speech may be an arbitrary utterance made up of a sequence of words which had not been encountered before. Pronunciation scores are converted into grades as would be assigned by human graders. Pronunciation quality may be evaluated in a client-server language instruction environment.

179 Citations

39 Claims

1. In an automatic speech processing system, a method for assessing pronunciation of a student speech sample using a computerized acoustic segmentation system, the method comprising:
- accepting said student speech sample which comprises a sequence of words spoken by a student speaker;
  
  operating said computerized acoustic segmentation system to define sample acoustic units within said student speech sample based on speech acoustic models within said segmentation system, said speech acoustic models being established using training speech data from at least one speaker, said training speech data not necessarily including said sequence of spoken words;
  
  measuring duration of said sample acoustic units; and
  
  comparing said durations of sample acoustic units to a model of exemplary acoustic unit duration to compute a duration score indicative of similarity between said sample acoustic unit durations and exemplary acoustic unit durations.
- - 2. The method according to claim 1 wherein said exemplary acoustic unit duration model is established using duration-training speech data from at least one exemplary speaker, said duration-training data not necessarily including said sequence of spoken words.
  - 3. The method according to claim 1 wherein each acoustic unit is shorter in duration than a longest word in the language of said spoken words.
  - 4. The method according to claim 1 further comprising:
    - mapping said duration score to a grade; and
      
      presenting said grade to a student.
  - 5. The method according to claim 4 wherein the step of mapping said duration score to a grade comprises:
    - collecting a set of training speech samples from a plurality of language students of various proficiency levels;
      
      computing training duration scores for each of said training speech samples;
      
      collecting at least one human evaluation grade from a human grader for each of said training speech samples; and
      
      adjusting coefficients used in mapping by minimizing an error measurement between said human evaluation grades and said training duration scores.
  - 6. The method according to claim 4 wherein the step of mapping comprises using a mapping function obtained by linear or non-linear regression from training duration scores, alone or in combination with other machine scores, and corresponding human evaluation grades, all of said scores and grades being collected over a representative training data base of student speech.
  - 7. The method according to claim 6 wherein said mapping function is obtained by non-linear regression implemented with a neural net which allows arbitrary mappings from machine scores to human expert grades.
  - 8. The method according to claim 4 wherein the step of mapping comprises using a decision tree or class probability tree whose parameters were established using training duration scores.
  - 9. The method according to claim 1 wherein the step of operating said acoustic segmentation system comprises the steps of:
    - computing a path through trained hidden Markov models (HMMs) from among said speech acoustic models, said path being an allowable path through the HMMs that has maximum likelihood of generating an observed acoustic features sequence from said student speech sample; and
      
      determining from said path at least one boundary or duration of one acoustic unit.
  - 10. The method according to claim 9 wherein:
    - said spoken sequence of words is spoken according to a known script; and
      
      the path computing step comprises using said script in defining allowability of any path through the HMMs.
  - 11. The method according to claim 9 wherein said spoken sequence of words is unknown, and the path computing step comprises operating a computerized speech recognition system that determines said spoken sequence of words.
  - 12. The method according to claim 9 wherein:
    - said sample acoustic units are syllables; and
      
      the step of determining at least one acoustic unit boundary or duration comprises the steps of:
      
      extracting boundaries or durations of at least two phones from said path; and
      
      combining portions of at least two phones to obtain a boundary or duration of a syllable acoustic unit.
  - 13. The method according to claim 12 wherein the step of combining portions of at least two phones comprises measuring the time difference between centers of vowel phones from among said phones to obtain a duration of a syllable acoustic unit.
  - 14. The method according to claim 1 wherein said sample acoustic units are phones.
  - 15. The method according to claim 1 wherein said sample acoustic units are syllables.
  - 16. The method according to claim 1 wherein:
    - said exemplary acoustic unit duration distribution model is a model of speaker-normalized acoustic unit durations, and the duration measuring step comprises the steps of:
      
      analyzing said student speech sample to determine a student speaker normalization factor; and
      
      employing said student speaker normalization factor to measure speaker-normalized durations as said measured sample acoustic unit durations, whereby the comparing step compares said speaker-normalized sample acoustic unit durations to said exemplary speaker-normalized acoustic unit duration distribution model.
  - 17. The method according to claim 16 wherein said student speaker normalization factor is rate of speech.
  - 18. The method according to claim 1 wherein the step of operating said segmentation system excludes acoustic units in context with silence from analysis.
  - 19. The method according to claim 1 wherein the step of operating said segmentation system comprises operating a speech recognition system as said acoustic segmentation system.
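
The duration scoring of claims 1, 16 and 17 can be illustrated with a minimal sketch. It assumes, for illustration only, that the exemplary duration model is a per-phone Gaussian over the logarithm of rate-of-speech-normalized durations; the claims do not fix the form of the model, and the names `rate_of_speech`, `duration_score` and `duration_model` are hypothetical.

```python
import math
from statistics import mean


def rate_of_speech(durations_s):
    """Acoustic units per second over the sample (assumed normalization factor)."""
    return len(durations_s) / sum(durations_s)


def duration_score(segments, duration_model):
    """Average log-probability of normalized durations under an exemplary model.

    segments: list of (phone, duration_in_seconds) from the segmentation step.
    duration_model: phone -> (mean, std) of log normalized duration.
                    The Gaussian form is an assumption, not taken from the claims.
    """
    ros = rate_of_speech([d for _, d in segments])
    log_probs = []
    for phone, dur in segments:
        mu, sigma = duration_model[phone]
        x = math.log(dur * ros)  # speaker-normalized duration, log domain
        log_probs.append(
            -0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2)
        )
    return mean(log_probs)
```

A higher (less negative) score indicates durations closer to the exemplary model. Because the score is computed per acoustic unit rather than per word or sentence, it does not depend on the particular word sequence spoken.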

20. A system for assessing pronunciation of a student speech sample, said student speech sample comprising a sequence of words spoken by a student speaker, the system comprising:
- speech acoustic models established using training speech data from at least one speaker, said training speech data not necessarily including said sequence of spoken words;
  
  a computerized acoustic segmentation system configured to identify acoustic units within said student speech sample based on said speech acoustic models;
  
  a duration extractor configured to measure duration of said sample acoustic units;
  
  a model of exemplary acoustic unit duration; and
  
  a duration scorer configured to compare said sample acoustic unit durations to said model of exemplary acoustic unit duration and compute a duration score indicative of similarity between said sample acoustic unit durations and acoustic unit durations in exemplary speech.
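
As a rough illustration of the duration extractor element of claim 20, the sketch below turns a frame-level phone alignment into per-unit durations. The 10 ms frame shift, the `sil` label for silence, and the function name `extract_durations` are assumptions made for this example and do not come from the patent.

```python
from itertools import groupby

FRAME_SHIFT_S = 0.01  # assumed 10 ms frame shift


def extract_durations(frame_labels, frame_shift_s=FRAME_SHIFT_S, skip=("sil",)):
    """Collapse a frame-by-frame phone labeling into (phone, duration) pairs.

    frame_labels: one phone label per frame, e.g. read off an alignment path.
    skip: labels dropped from the output (silence, by assumption).
    """
    segments = []
    for label, run in groupby(frame_labels):
        n_frames = sum(1 for _ in run)
        if label not in skip:
            segments.append((label, n_frames * frame_shift_s))
    return segments
```

This simplification merges two adjacent instances of the same phone into one segment; an alignment that carries explicit segment boundaries would avoid that.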

21. In an automatic speech processing system, a method for grading the pronunciation of a student speech sample, the method comprising:
- accepting said student speech sample which comprises a sequence of words spoken by a student speaker;
  
  operating a set of trained speech models to compute at least one posterior probability from said speech sample, each of said posterior probabilities being a probability that a particular portion of said student speech sample corresponds to a particular known model given said particular portion of said speech sample; and
  
  computing an evaluation score, herein referred to as the posterior-based evaluation score, of pronunciation quality for said student speech sample from said posterior probabilities, wherein each of said posterior probabilities is derived from a model likelihood by dividing the likelihood that said particular known model generated said particular portion of said student speech sample by the summation of the likelihoods that individual models generated said particular portion of said speech sample.
- - 22. The method according to claim 21 wherein:
    - said particular known model is a context-dependent model; and
      
      individual models are context-dependent or context-independent models.
  - 23. The method according to claim 21 wherein said particular portion of said speech sample is a phone.
  - 24. The method according to claim 21 further comprising:
    - mapping said posterior-based evaluation score to a grade as would be assigned by human listener; and
      
      presenting said grade to said student speaker.
  - 25. The method according to claim 24 wherein said step of mapping said posterior-based evaluation score to a grade comprises:
    - collecting a set of training speech samples from a plurality of language students of various proficiency levels;
      
      collecting a set of human evaluation grades for each of said training samples from human expert listeners listening to said samples; and
      
      adjusting coefficients used in mapping by minimizing the squared-error between the human expert grades and said evaluation score.
  - 26. The method according to claim 21 wherein said student speech sample comprises an acoustic features sequence, the method further comprising the steps of:
    - computing a path through a set of trained hidden Markov models (HMMs) from among said trained speech models, said path being an allowable path through the HMMs that has maximum likelihood of generating said acoustic features sequence; and
      
      identifying transitions between phones within said path, thereby defining phones.
  - 27. The method according to claim 26 wherein the path computing step is performed using the Viterbi search technique.
  - 28. The method according to claim 26 wherein said spoken sequence of words is unknown, and the path computing step is performed using a computerized speech recognition system that determines said spoken sequence of words.
  - 29. The method according to claim 21 wherein segments in context with silence are excluded from said student speech sample and from training data used to train said speech models.
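
The coefficient adjustment of claim 25 (minimizing the squared error between human grades and machine scores) can be sketched as an ordinary least-squares line fit. This is only one possible mapping; the 1-to-5 grade scale and the names `fit_linear_mapping` and `to_grade` are assumptions for illustration.

```python
def fit_linear_mapping(scores, grades):
    """Least-squares fit of grade ~ slope * score + intercept."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_g = sum(grades) / n
    var_s = sum((s - mean_s) ** 2 for s in scores)
    cov_sg = sum((s - mean_s) * (g - mean_g) for s, g in zip(scores, grades))
    slope = cov_sg / var_s
    intercept = mean_g - slope * mean_s
    return slope, intercept


def to_grade(score, slope, intercept, lowest=1.0, highest=5.0):
    """Map a machine score to a grade, clamped to an assumed 1-5 scale."""
    return min(highest, max(lowest, slope * score + intercept))
```

The fit is done once on training samples that have both machine scores and human grades; `to_grade` is then applied to new student samples.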

30. In an automatic speech processing system, a method for grading the pronunciation of a student speech sample, the method comprising:
- accepting said student speech sample which comprises a sequence of words spoken by a student speaker;
  
  operating a set of trained speech models to compute at least one posterior probability from said speech sample, each of said posterior probabilities being a probability that a particular portion of said student speech sample corresponds to a particular known model given said particular portion of said speech sample; and
  
  computing an evaluation score, herein referred to as the posterior-based evaluation score, of pronunciation quality for said student speech sample from said posterior probabilities, wherein:
  
  said trained speech models comprise a set of phone models;
  
  said student speech sample comprises phones; and
  
  the step of operating said speech models comprises computing a frame-based posterior probability for each frame y_t within a phone i of a phone type q_i:
  
  $$P(q_i \mid y_t) = \frac{p(y_t \mid q_i, \ldots)\, P(q_i)}{\sum_{q} p(y_t \mid q, \ldots)\, P(q)}$$
  
  wherein:
  
  p(y_t | q_i, ...) is the probability of the frame y_t according to a model corresponding to phone type q_i;
  
  the sum over q runs over all phone types; and
  
  P(q_i) represents the prior probability of the phone type q_i.
- - 31. The method according to claim 30 wherein the step of computing a frame-based posterior probability uses context-dependent models corresponding to each phone type q_i in the numerator, whereby said p(y_t | q_i, ...) is a context-dependent likelihood p(y_t | q_i, ctx_i), wherein ctx_i represents context.
  - 32. The method according to claim 30 wherein the step of computing said posterior-based evaluation score for said student speech sample comprises computing for a phone i an average of the logarithm of the frame-based posterior probabilities of all frames within said phone i, said average herein referred to as a phone score ρ_i, which is expressible as:
    
    $$\rho_i = \frac{1}{d_i} \sum_{t} \log P(q_i \mid y_t)$$
    
    wherein the sum runs over all d_i frames of said phone i.
  - 33. The method according to claim 32 wherein said posterior-based evaluation score for said student speech sample is defined as an average of the individual phone scores ρ_i for each phone i within said student speech sample:
    
    $$\rho = \frac{1}{N} \sum_{i=1}^{N} \rho_i$$
    
    wherein the sum runs over the N phones in said student speech sample.
  - 34. The method according to claim 30 wherein the model corresponding to each phone type is a Gaussian mixture phone model.
  - 35. The method according to claim 30 wherein the model corresponding to each phone type is a context-independent phone model.
  - 36. The method according to claim 30 wherein the model corresponding to each phone type is a hidden Markov model.
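
The posterior-based score of claims 30, 32 and 33 can be sketched as follows. The sketch assumes that per-frame likelihoods p(y_t | q) for every phone type have already been computed by some acoustic model; the data layout and the function names are hypothetical.

```python
import math


def frame_log_posterior(frame_likelihoods, priors, phone):
    """log P(q_i | y_t): likelihood times prior, normalized over all phone types.

    frame_likelihoods: phone type -> p(y_t | q) for one frame.
    priors: phone type -> P(q).
    """
    numerator = frame_likelihoods[phone] * priors[phone]
    denominator = sum(frame_likelihoods[q] * priors[q] for q in priors)
    return math.log(numerator / denominator)


def phone_score(frames, priors, phone):
    """Average log posterior over the frames of one phone segment."""
    return sum(frame_log_posterior(f, priors, phone) for f in frames) / len(frames)


def utterance_score(phone_segments, priors):
    """Average of phone scores over all phone segments in the sample.

    phone_segments: list of (phone type, list of per-frame likelihood dicts).
    """
    scores = [phone_score(frames, priors, phone) for phone, frames in phone_segments]
    return sum(scores) / len(scores)
```

Averaging per phone before averaging over the utterance gives every phone equal weight regardless of its length in frames. A practical implementation would likely work with log-likelihoods throughout to avoid numerical underflow.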

37. A system for pronunciation training in a client/server environment wherein there exists a client process for presenting prompts to a student and for accepting student speech elicited by said prompts, the system comprising:
- a server process for sending control information to said client process to specify a prompt to be presented to said student and for receiving a speech sample derived from said student speech elicited by said presented prompt; and
  
  a pronunciation evaluator invocable by said server process for analyzing said student speech sample wherein:
  
  said pronunciation evaluator is established using training speech data; and
  
  said server process is adapted to specify a prompt for eliciting a sequence of words not necessarily found in said training speech data as said student speech sample.
- - 38. The system according to claim 37 wherein said server process receives said speech sample over a speech channel that is separate from a communication channel through which said server process and said client process communicate.
  - 39. The system according to claim 37 wherein said client process and said server process are located on two separate computer processors and communicate via a network.


Current Assignee: SRI International, Inc.
Original Assignee: SRI International, Inc.
Inventors: Neumeyer, Leonardo; Franco, Horacio; Weintraub, Mitchel; Price, Patti; Digalakis, Vassilios
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Storm, Donald L.
Application Number: US08/942,780
Time in Patent Office: 936 Days
Field of Search: 704/246, 704/249, 704/254, 704/276, 704/200, 434/185
US Class Current: 704/246
CPC Class Codes:
G09B 19/04   Speaking with audible prese...
G10L 15/04   Segmentation; Word boundary...
G10L 15/26   Speech to text systems G10L...
H04L 67/01   Protocols
