Method and apparatus for voice-interactive language instruction

US 5,634,086 A
Filed: 09/18/1995
Issued: 05/27/1997
Est. Priority Date: 03/12/1993
Status: Expired due to Term

Abstract

Spoken-language instruction method and apparatus employ context-based speech recognition for instruction and evaluation, particularly language instruction and language fluency evaluation. A system can administer a lesson, and particularly a language lesson, and evaluate performance in a natural interactive manner while tolerating strong foreign accents, and produce as an output a reading quality score. A finite state grammar set corresponding to the range of word sequence patterns in the lesson is employed as a constraint on a hidden Markov model (HMM) search apparatus in an HMM speech recognizer which includes a set of hidden Markov models of target-language narrations produced by native speakers of the target language. The invention is preferably based on use of a linguistic context-sensitive speech recognizer. The invention includes a system with an interactive decision mechanism which employs at least three levels of error tolerance to simulate a natural level of patience in human-based interactive instruction. A system for a reading phase is implemented through a finite state machine having at least four states which recognizes reading error at any position in a script and which employs a first set of actions. A related system for an interactive question phase is implemented through a finite state machine, but which recognizes reading errors as well as incorrect answers while invoking a second set of actions. A linguistically-sensitive utterance endpoint detector is provided for judging termination of a spoken utterance to simulate human turn-taking in conversational speech.

20 Claims

1. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
- generating a grammar model from the preselected script;
  
  imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
  
  generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
  
  parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence;
  
  evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite; and
  
  outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.
- - 2. The method of claim 1, further comprising the steps of:
    - digitizing the input speech and storing digitized input speech in a digital memory;
      
      storing the grammar model and the altered grammar model in the digital memory; and
      
      using a digital computer to compare the input speech with the stored grammar models.
  - 3. The method of claim 1, further comprising a step of, in response to the input speech, prompting the speaker to re-recite the preselected script with phonetic and semantic accuracy, according to at least three levels of patience.
  - 20. The language instruction and evaluation method of claim 1, wherein the step of outputting an indication is a step of indirectly outputting an indication and comprises the steps of:
    - inputting the indication to a lesson program; and
      
      indicating, using the lesson program, to the speaker the accuracy of the speaker's recitation by taking an action consistent with the accuracy input to the lesson program.
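
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the token labels `<alt>`, `<pause>` and `<reject>`, the function names, and the accuracy formula are all assumptions made for illustration, and the sketch handles alts between words only and ignores word order.

```python
ALT = "<alt>"        # hypothetical marker for an alt slot in the grammar
PAUSE = "<pause>"    # hypothetical recognizer label for silence
REJECT = "<reject>"  # hypothetical recognizer label for nonscripted speech


def build_altered_grammar(script_words):
    """Place an alt slot before, between, and after the scripted words."""
    grammar = [ALT]
    for word in script_words:
        grammar += [word, ALT]
    return grammar


def parse_hypothesis(hypothesis, script_words):
    """Label each token as a scripted word, nonscripted speech, or silence."""
    script = set(script_words)
    parsed = []
    for token in hypothesis:
        if token == PAUSE:
            parsed.append(("silence", token))
        elif token in script:
            parsed.append(("word", token))
        else:
            parsed.append(("nonscripted", token))
    return parsed


def accuracy(parsed, script_words):
    """Assumed measure: scripted words found, penalized by nonscripted alts."""
    if not script_words:
        raise ValueError("script must contain at least one word")
    words = sum(1 for kind, _ in parsed if kind == "word")
    nonscripted = sum(1 for kind, _ in parsed if kind == "nonscripted")
    return words / (len(script_words) + nonscripted)
```

For the script "the cat sat" and a hypothesis containing one reject and one pause between the three scripted words, this assumed measure gives 3 / (3 + 1) = 0.75.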

4. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
- generating a grammar model from the preselected script;
  
  imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
  
  generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
  
  parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence;
  
  evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
  
  outputting an indication of the accuracy of the input speech to the speaker, wherein the preselected script includes alternative texts, the method further comprising a step of generating an interactive conversation grammar model for the alternative texts, the interactive conversation grammar model comprising a first common alt element disposed before a selection of alternative phrases and a second common alt element disposed after the selection of an alternative phrase, thereby permitting alternative responses having phonetic accuracy and semantic inaccuracy.
- - 5. The method of claim 4, further comprising a step of structuring an alt element as a plurality of transition arcs for events, including prolonged silence, prolonged out-of-script speech, speech alternating between periods of silence and periods of out-of-script speech, and speech without pauses or out-of-script speech.
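
One way to picture the grammar of claims 4 and 5 is as a small data structure: a shared alt element, a set of alternative phrases, and a second shared alt element, where each alt element offers four arcs. The sketch below is a hypothetical rendering; the arc names, the dictionary layout, and the `pause`/`reject` segment labels are assumptions, not terms defined by the claims.

```python
# Hypothetical names for the four events listed in claim 5.
ALT_ARCS = ("silence", "out_of_script", "mixed", "skip")


def conversation_grammar(alternatives):
    """Wrap the alternative phrases between two common alt elements."""
    return {
        "pre_alt": ALT_ARCS,
        "choices": [phrase.split() for phrase in alternatives],
        "post_alt": ALT_ARCS,
    }


def classify_alt(segments):
    """Pick the arc that matches the segment labels seen in one alt slot.

    `segments` is a list of "pause" and "reject" labels (assumed encoding).
    """
    kinds = set(segments)
    if not kinds <= {"pause", "reject"}:
        raise ValueError("unknown segment label")
    if not kinds:
        return "skip"            # speech without pauses or out-of-script speech
    if kinds == {"pause"}:
        return "silence"         # prolonged silence
    if kinds == {"reject"}:
        return "out_of_script"   # prolonged out-of-script speech
    return "mixed"               # silence alternating with out-of-script speech
```

Because the same alt element sits before and after every alternative, a speaker can pick any listed phrase and still be tracked, which is what allows a phonetically accurate but semantically wrong answer to be recognized.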

6. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
- generating a grammar model from the preselected script;
  
  imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
  
  generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
  
  parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:
  
  a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;
  
  b) determining reject density for the current segment; and
  
  c) denoting the current segment as out-of-script speech if the reject density exceeds a reject density threshold;
  
  evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
  
  outputting an indication of the accuracy of the input speech to the speaker.
- - 7. The method of claim 6, wherein the step of determining the reject density for the current segment comprises the step of dividing a reject phone count returned by the speech recognizer for a preselected number of consecutive scripted words by a sum of the reject phone count and a count of the preselected number of consecutive scripted words.
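
The reject density of claim 7 is a single ratio, shown here as a short Python sketch. The function names and the example threshold of 0.3 are assumptions; the claims do not give a threshold value.

```python
def reject_density(reject_phone_count, n_script_words):
    """Reject phones divided by (reject phones + scripted words), per claim 7."""
    total = reject_phone_count + n_script_words
    if total <= 0:
        raise ValueError("segment must contain at least one phone or word")
    return reject_phone_count / total


def is_out_of_script(reject_phone_count, n_script_words, threshold):
    """Claim 6, step c): flag the segment when the density exceeds the threshold."""
    return reject_density(reject_phone_count, n_script_words) > threshold
```

With 3 reject phones over a window of 5 scripted words the density is 3 / 8 = 0.375, so an assumed threshold of 0.3 would mark the segment as out-of-script speech.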

8. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
- generating a grammar model from the preselected script;
  
  imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
  
  generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
  
  parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:
  
  a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;
  
  b) determining reject indicator for the current segment; and
  
  c) denoting the current segment as out-of-script speech if the reject indicator exceeds a reject density threshold;
  
  evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
  
  outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.
- - 9. The method of claim 8, wherein the step of determining the reject indicator for the current segment comprises the step of summing a reject phone count returned by the speech recognizer for a preselected number of consecutive scripted words.

10. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
- generating a grammar model from the preselected script;
  
  imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
  
  generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
  
  parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:
  
  a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;
  
  b) determining a pause indicator for the current segment; and
  
  c) denoting the current segment as an actionable pause if the pause indicator exceeds a pause indicator threshold, the actionable pause representing a turn-taking point in interaction between the automatic speech recognizer and the speaker;
  
  evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
  
  outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.
- - 11. The method of claim 10, further comprising a step of generating the pause indicator threshold as a threshold dependent upon linguistic context of the current segment and position of the current segment in the preselected script, the pause indicator threshold being smaller at ends of sentences and major clauses than elsewhere among words of sentences of the preselected script.
  - 12. The method of claim 10, wherein the pause indicator determining step comprises a step of summing pause phones returned by the speech recognizer out of a preselected number of consecutive words of the preselected script.
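
The pause handling of claims 10 to 12 can be sketched as follows. The per-word pause counts, the window size, and both threshold values are hypothetical; the claims only require that the threshold be smaller at the ends of sentences and major clauses than elsewhere.

```python
def pause_indicator(pause_phones_per_word, window):
    """Sum the pause phones over the last `window` scripted words (claim 12)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return sum(pause_phones_per_word[-window:])


def pause_threshold(at_boundary, boundary_threshold=2, mid_threshold=6):
    """Context-dependent threshold (claim 11); both default values are assumed."""
    return boundary_threshold if at_boundary else mid_threshold


def is_actionable_pause(pause_phones_per_word, window, at_boundary):
    """Claim 10, step c): a turn-taking point when the indicator exceeds the threshold."""
    indicator = pause_indicator(pause_phones_per_word, window)
    return indicator > pause_threshold(at_boundary)
```

With these assumed values, the same pause indicator of 5 ends the speaker's turn at a sentence boundary but not in the middle of a sentence, which mirrors how a human listener waits longer mid-sentence.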

13. A system for tracking speech of a speaker using an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from a grammar model and input speech spoken by a speaker prompted to recite a preselected script, the system comprising:
- presentation means for presenting information to the speaker about a subject and the preselected script and for prompting the speaker to recite the preselected script;
  
  means for electronically capturing the input speech spoken in response to prompts of the presentation means, wherein captured input speech is stored in a computer memory;
  
  means for analyzing the captured input speech to determine a sequence of words and alts corresponding to the captured input speech, wherein a word is identified as being part of the preselected speech and alts represent nonscripted speech and pauses;
  
  assessing means coupled to the analyzing means for assessing completeness of an utterance to determine accuracy of the recitation of the preselected script, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite; and
  
  producing means coupled to the assessing means for producing a response, if the recitation is not accurate, instructing the speaker to correctly recite the preselected script.
- - 14. The system according to claim 13, wherein the system for tracking is used for instruction in a language foreign to the speaker and wherein the producing means includes means for generating an audible response as an example of native pronunciation and rendition of speech in the language.
  - 15. The system according to claim 13, further comprising means for measuring recitation speed comprising:
    - means for counting words recited to determine a recited word count;
      
      means for measuring time duration of a recitation of scripted words; and
      
      means for dividing the recited word count by the measured time elapsed.
  - 16. The system according to claim 13, further comprising means (192) for measuring recitation quality, thereby obtaining a recitation quality score (230), the means for measuring recitation quality comprising:
    - means (194) for counting words (195) in the preselected script to determine a preselected script word count;
      
      means (196) for determining an optimum recitation time (197);
      
      means (198) for counting reject phones (199) to determine a reject phone count;
      
      means (200) for measuring a total time (201) elapsed during recitation of the preselected script;
      
      means (202) for measuring good time (203) elapsed during recitation of phrases deemed acceptable by the analyzing means;
      
      means (204) for dividing the good time (203) by the total time (201) to obtain a first quotient (205);
      
      means (210) for outputting a preferred maximum value (211) which is a maximum of the optimum recitation time (197) and the good time (203);
      
      means (212) for dividing the optimum recitation time (197) by the preferred maximum value (211) to obtain a second quotient (213);
      
      means (218) for summing the reject phone count (199) and the preselected script word count (195) to obtain a quality value (219);
      
      means (220) for dividing the preselected script word count (195) by the quality value (219) to obtain a third quotient (221); and
      
      means for calculating the recitation quality score (230) as a weighted sum of the first quotient (208), the second score quotient (216) and the third score quotient (224).
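
The recitation speed of claim 15 reduces to a word count divided by a duration. A minimal sketch, assuming timestamps in seconds and a hypothetical function name:

```python
def recitation_speed(recited_words, start_time_s, end_time_s):
    """Recited word count divided by the elapsed recitation time (claim 15)."""
    duration = end_time_s - start_time_s
    if duration <= 0:
        raise ValueError("end time must be later than start time")
    return len(recited_words) / duration
```

The result is in words per second; multiplying by 60 gives words per minute.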

17. A system for tracking speech of a speaker using an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from a grammar model and input speech spoken by a speaker prompted to recite a preselected script, the system comprising:
- presentation means for presenting information to the speaker about a subject and the preselected script and for prompting the speaker to recite the preselected script;
  
  means for electronically capturing the input speech spoken in response to prompts of the presentation means, wherein captured input speech is stored in a computer memory;
  
  means for analyzing the captured input speech to determine a sequence of words and alts corresponding to the captured input speech, wherein a word is identified as being part of the preselected speech and alts represent nonscripted speech and pauses;
  
  assessing means coupled to the analyzing means for assessing completeness of an utterance to determine accuracy of the recitation of the preselected script;
  
  producing means coupled to the assessing means for producing a response, if the recitation is not accurate, instructing the speaker to correctly recite the preselected script;
  
  means (192) for measuring recitation quality, thereby obtaining a recitation quality score (230), the means for measuring recitation quality comprising:
  
  a) means (194) for counting words (195) in the preselected script to determine a preselected script word count;
  
  b) means (196) for determining an optimum recitation time (197);
  
  c) means (198) for counting reject phones (199) to determine a reject phone count;
  
  d) means (200) for measuring a total time (201) elapsed during recitation of the preselected script;
  
  e) means (202) for measuring good time (203) elapsed during recitation of phrases deemed acceptable by the analyzing means;
  
  f) means (204) for dividing the good time (203) by the total time (201) to obtain a first quotient (205);
  
  g) means (210) for outputting a preferred maximum value (211) which is a maximum of the optimum recitation time (197) and the good time (203);
  
  h) means (212) for dividing the optimum recitation time (197) by the preferred maximum value (211) to obtain a second quotient (213);
  
  i) means (218) for summing the reject phone count (199) and the preselected script word count (195) to obtain a quality value (219);
  
  j) means (220) for dividing the preselected script word count (195) by the quality value (219) to obtain a third quotient (221); and
  
  k) means for calculating the recitation quality score (230) as a weighted sum of the first quotient (208), the second score quotient (216) and the third score quotient (224), the means for calculating further comprising:
  
  1) means (206) for weighting the first quotient (205) by a first weighting parameter (a) to obtain a first score component (208);
  
  2) means (214) for weighting the second quotient (213) by a second weighting parameter (b) to obtain a second score component (216);
  
  3) means (222) for weighting the third quotient (221) by a third weighting parameter (c) to obtain a third score component (224);
  
  4) means (226) for summing the first score component (208), the second score component (216) and the third score component (224) to produce a score sum (227); and
  
  5) means for weighting the score sum (227) by a scale factor (228) to obtain the recitation quality score (230).
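
The arithmetic of elements a) through k) in claim 17 can be followed in a short Python sketch. It assumes that "weighting" means multiplication and uses placeholder defaults of 1.0 for the weights a, b, c and the scale factor; the claim does not fix any of these values, and the function and parameter names are hypothetical.

```python
def recitation_quality_score(script_word_count, optimum_time, reject_phone_count,
                             total_time, good_time,
                             a=1.0, b=1.0, c=1.0, scale=1.0):
    """Combine the three quotients of claim 17 into one score.

    Times share one unit (e.g. seconds). Weights and scale are assumed values.
    """
    if total_time <= 0 or optimum_time <= 0 or script_word_count <= 0:
        raise ValueError("times and word count must be positive")

    first_quotient = good_time / total_time                    # (205)
    preferred_max = max(optimum_time, good_time)               # (211)
    second_quotient = optimum_time / preferred_max             # (213)
    quality_value = reject_phone_count + script_word_count     # (219)
    third_quotient = script_word_count / quality_value         # (221)

    score_sum = (a * first_quotient                            # (208)
                 + b * second_quotient                         # (216)
                 + c * third_quotient)                         # (224)
    return scale * score_sum                                   # (230)
```

For an 8-word script with an optimum time of 4, 2 reject phones, a total time of 10 and a good time of 5, the quotients are 0.5, 0.8 and 0.8. Each quotient lies between 0 and 1, so the weights set the relative importance of fluent time, pacing, and reject-free reading.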

18. A system for tracking speech and interacting with a speaker using spoken and graphic outputs and an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from input speech spoken by the speaker after being prompted to recite from a preselected script which includes a plurality of preselected script alternatives and from a grammar model, the system comprising:
- presentation means for presenting information to the speaker about a subject and prompting the speaker to recite one of the plurality of preselected script alternatives;
  
  sensing means for electronically capturing the input speech, wherein the captured input speech is stored in a computer memory;
  
  analyzing means for analyzing the captured input speech to determine an input hypothesis corresponding to the input speech spoken by the speaker;
  
  identifying means, coupled to the analyzing means, for identifying which preselected script alternative from the plurality of preselected script alternatives best corresponds to the input hypothesis;
  
  assessing means, coupled to the identifying means, for assessing completeness of an utterance to determine accuracy of recitation of the identified preselected script alternative, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite;
  
  output means, coupled to the assessing means, for outputting a response upon the completion of the utterance, the response indicating to the speaker the accuracy of the recitation of the identified preselected script alternative and the semantic appropriateness of the identified preselected script alternative.
- - 19. The system according to claim 18, wherein the interacting system is for instruction in a language foreign to the speaker and wherein the producing means includes means for generating an audible response as an example of native pronunciation and rendition.
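
The identifying means of claim 18 selects the script alternative that best corresponds to the input hypothesis. The claim does not say how the match is scored; the sketch below substitutes a generic word-sequence similarity from Python's standard `difflib` module purely as a stand-in, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def identify_alternative(hypothesis_words, alternatives):
    """Return (best_alternative, similarity) for a recognized word sequence.

    Similarity is difflib's ratio over word lists, an assumed stand-in
    for whatever matching the recognizer actually performs.
    """
    if not alternatives:
        raise ValueError("at least one script alternative is required")

    def similarity(alternative):
        return SequenceMatcher(None, alternative.split(), hypothesis_words).ratio()

    best = max(alternatives, key=similarity)
    return best, similarity(best)
```

A lesson program could then pass the selected alternative to the assessing means and respond both to how completely it was recited and to whether it was the semantically appropriate choice.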

Current Assignee: SRI International, Inc.
Original Assignee: SRI International, Inc.
Inventors: Chen, George T.; Butzberger, John W.; Rtischev, Dimitry; Bernstein, Jared C.
Primary Examiner(s): Hafiz, Tariq R.
Application Number: US08/529,376
Time in Patent Office: 617 Days
Field of Search: 364/419, 381/41-43, 381/47, 395/2, 395/2.1, 395/2.4, 395/2.44, 395/2.42, 395/2.43, 395/2.41, 395/2.6, 395/2.64, 395/2.65, 395/2.75, 395/2.76, 395/2.55, 395/2.59, 395/22, 395/2.66
US Class Current: 704/270
CPC Class Codes:
G09B 19/06   Foreign languages with audi...
G10L 15/183   using context dependencies,...
G10L 15/193   Formal grammars, e.g. finit...
