Method and apparatus for voice-interactive language instruction
Abstract
A spoken-language instruction method and apparatus employ context-based speech recognition for instruction and evaluation, particularly language instruction and language fluency evaluation. A system can administer a lesson, particularly a language lesson, and evaluate performance in a natural interactive manner while tolerating strong foreign accents, producing a reading quality score as output. A finite state grammar set corresponding to the range of word sequence patterns in the lesson is employed as a constraint on a hidden Markov model (HMM) search apparatus in an HMM speech recognizer, which includes a set of hidden Markov models of target-language narrations produced by native speakers of the target language. The invention is preferably based on a linguistic context-sensitive speech recognizer. It includes a system with an interactive decision mechanism that employs at least three levels of error tolerance to simulate a natural level of patience in human-based interactive instruction. A system for a reading phase is implemented through a finite state machine having at least four states, which recognizes reading errors at any position in a script and employs a first set of actions. A related system for an interactive question phase is also implemented through a finite state machine, but it recognizes incorrect answers as well as reading errors and invokes a second set of actions. A linguistically sensitive utterance endpoint detector judges the termination of a spoken utterance to simulate human turn-taking in conversational speech.
20 Claims
1. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
generating a grammar model from the preselected script;
imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence;
evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite; and
outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.
Dependent claims: 2, 3, 20

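The grammar-altering step of claim 1 can be illustrated with a minimal sketch. It assumes the grammar is a flat token sequence, that an alt slot is also placed before the first word and after the last one, and that the share of alts in a hypothesis serves as a crude stand-in for the "distribution of alts". The token name `<alt>` and both function names are hypothetical; the claim does not specify any of them.

```python
from typing import List

ALT = "<alt>"  # hypothetical token standing for an alt element


def build_altered_grammar(script_words: List[str]) -> List[str]:
    """Interleave ALT slots with the scripted words (assumed flat grammar)."""
    grammar = [ALT]  # assumption: an alt slot is also allowed before the first word
    for word in script_words:
        grammar.extend([word, ALT])
    return grammar


def alt_share(hypothesis: List[str]) -> float:
    """Fraction of hypothesis tokens that are alts (illustrative measure only)."""
    if not hypothesis:
        return 0.0
    return sum(1 for token in hypothesis if token == ALT) / len(hypothesis)


if __name__ == "__main__":
    print(build_altered_grammar(["the", "cat", "sat"]))
    print(alt_share(["the", ALT, "cat", "sat"]))
```

A lower alt share suggests a recitation closer to the script; how the alts are actually weighted is left open by the claim.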
4. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
generating a grammar model from the preselected script;
imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence;
evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
outputting an indication of the accuracy of the input speech to the speaker,
wherein the preselected script includes alternative texts, the method further comprising a step of generating an interactive conversation grammar model for the alternative texts, the interactive conversation grammar model comprising a first common alt element disposed before a selection of alternative phrases and a second common alt element disposed after the selection of an alternative phrase, thereby permitting alternative responses having phonetic accuracy and semantic inaccuracy.
Dependent claims: 5

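The interactive conversation grammar of claim 4 places one common alt before the choice of alternative phrases and one after it. A minimal sketch of that shape follows, assuming a dictionary representation; the field names, the `<alt>` token and both functions are hypothetical and not taken from the patent.

```python
from typing import Dict, List, Sequence

ALT = "<alt>"  # hypothetical token standing for a common alt element


def build_conversation_grammar(alternatives: Sequence[str]) -> Dict[str, object]:
    """One shared alt before the choice point and one shared alt after it."""
    return {
        "pre_alt": ALT,
        "choices": [phrase.split() for phrase in alternatives],
        "post_alt": ALT,
    }


def expand(grammar: Dict[str, object]) -> List[List[str]]:
    """List every token path the grammar allows, one per alternative phrase."""
    return [
        [grammar["pre_alt"], *choice, grammar["post_alt"]]
        for choice in grammar["choices"]
    ]


if __name__ == "__main__":
    for path in expand(build_conversation_grammar(["yes please", "no thanks"])):
        print(path)
```

Because every alternative is a valid path, a well-pronounced but semantically wrong answer is still recognized, which is the behavior the claim describes.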
6. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
generating a grammar model from the preselected script;
imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:
    a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;
    b) determining reject density for the current segment; and
    c) denoting the current segment as out-of-script speech if the reject density exceeds a reject density threshold;
evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
outputting an indication of the accuracy of the input speech to the speaker.
Dependent claims: 7

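Parsing steps a) to c) of claim 6 can be sketched as follows. The sketch assumes a segment is a list of unit labels and that reject density is the fraction of units labeled as reject phones. That definition, the label names and the default threshold of 0.5 are assumptions made for illustration, since the claim gives no formula.

```python
from typing import List


def reject_density(segment: List[str]) -> float:
    """Fraction of units in the segment labeled 'reject' (assumed definition)."""
    if not segment:
        return 0.0
    return segment.count("reject") / len(segment)


def is_out_of_script(segment: List[str], threshold: float = 0.5) -> bool:
    """Step c): flag the segment when its reject density exceeds the threshold."""
    return reject_density(segment) > threshold


if __name__ == "__main__":
    segment = ["reject", "reject", "word", "reject"]
    print(reject_density(segment), is_out_of_script(segment))
```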
8. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
generating a grammar model from the preselected script;
imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:
    a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;
    b) determining reject indicator for the current segment; and
    c) denoting the current segment as out-of-script speech if the reject indicator exceeds a reject density threshold;
evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.
Dependent claims: 9

10. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:
generating a grammar model from the preselected script;
imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;
generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;
parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:
    a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;
    b) determining a pause indicator for the current segment; and
    c) denoting the current segment as an actionable pause if the pause indicator exceeds a pause indicator threshold, the actionable pause representing a turn-taking point in interaction between the automatic speech recognizer and the speaker;
evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and
outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.
Dependent claims: 11, 12

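The actionable-pause test in claim 10 can be sketched in the same style. Here the pause indicator is assumed to be the total duration of the pause phones that end the current segment, and the one-second default threshold is arbitrary. The claim itself does not say how the indicator is computed, so the definition and all names below are hypothetical.

```python
from typing import List, Tuple

Phone = Tuple[str, float]  # (label, duration in seconds); labels are hypothetical


def pause_indicator(segment: List[Phone]) -> float:
    """Total duration of the run of pause phones at the end of the segment."""
    total = 0.0
    for label, duration in reversed(segment):
        if label != "pause":
            break  # stop at the first non-pause phone, counting from the end
        total += duration
    return total


def is_actionable_pause(segment: List[Phone], threshold_s: float = 1.0) -> bool:
    """Step c): treat the segment as a turn-taking point above the threshold."""
    return pause_indicator(segment) > threshold_s


if __name__ == "__main__":
    segment = [("word", 0.4), ("pause", 0.75), ("pause", 0.5)]
    print(pause_indicator(segment), is_actionable_pause(segment))
```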
13. A system for tracking speech of a speaker using an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from a grammar model and input speech spoken by a speaker prompted to recite a preselected script, the system comprising:
presentation means for presenting information to the speaker about a subject and the preselected script and for prompting the speaker to recite the preselected script;
means for electronically capturing the input speech spoken in response to prompts of the presentation means, wherein captured input speech is stored in a computer memory;
means for analyzing the captured input speech to determine a sequence of words and alts corresponding to the captured input speech, wherein a word is identified as being part of the preselected speech and alts represent nonscripted speech and pauses;
assessing means coupled to the analyzing means for assessing completeness of an utterance to determine accuracy of the recitation of the preselected script, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite; and
producing means coupled to the assessing means for producing a response, if the recitation is not accurate, instructing the speaker to correctly recite the preselected script.
Dependent claims: 14, 15, 16

17. A system for tracking speech of a speaker using an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from a grammar model and input speech spoken by a speaker prompted to recite a preselected script, the system comprising:
presentation means for presenting information to the speaker about a subject and the preselected script and for prompting the speaker to recite the preselected script;
means for electronically capturing the input speech spoken in response to prompts of the presentation means, wherein captured input speech is stored in a computer memory;
means for analyzing the captured input speech to determine a sequence of words and alts corresponding to the captured input speech, wherein a word is identified as being part of the preselected speech and alts represent nonscripted speech and pauses;
assessing means coupled to the analyzing means for assessing completeness of an utterance to determine accuracy of the recitation of the preselected script;
producing means coupled to the assessing means for producing a response, if the recitation is not accurate, instructing the speaker to correctly recite the preselected script;
means (192) for measuring recitation quality, thereby obtaining a recitation quality score (230), the means for measuring recitation quality comprising:
    a) means (194) for counting words (195) in the preselected script to determine a preselected script word count;
    b) means (196) for determining an optimum recitation time (197);
    c) means (198) for counting reject phones (199) to determine a reject phone count;
    d) means (200) for measuring a total time (201) elapsed during recitation of the preselected script;
    e) means (202) for measuring good time (203) elapsed during recitation of phrases deemed acceptable by the analyzing means;
    f) means (204) for dividing the good time (203) by the total time (201) to obtain a first quotient (205);
    g) means (210) for outputting a preferred maximum value (211) which is a maximum of the optimum recitation time (197) and the good time (203);
    h) means (212) for dividing the optimum recitation time (197) by the preferred maximum value (211) to obtain a second quotient (213);
    i) means (218) for summing the reject phone count (199) and the preselected script word count (195) to obtain a quality value (219);
    j) means (220) for dividing the preselected script word count (195) by the quality value (219) to obtain a third quotient (221); and
    k) means for calculating the recitation quality score (230) as a weighted sum of the first quotient (208), the second score quotient (216) and the third score quotient (224), the means for calculating further comprising:
        1) means (206) for weighting the first quotient (205) by a first weighting parameter (a) to obtain a first score component (208);
        2) means (214) for weighting the second quotient (213) by a second weighting parameter (b) to obtain a second score component (216);
        3) means (222) for weighting the third quotient (221) by a third weighting parameter (c) to obtain a third score component (224);
        4) means (226) for summing the first score component (208), the second score component (216) and the third score component (224) to produce a score sum (227); and
        5) means for weighting the score sum (227) by a scale factor (228) to obtain the recitation quality score (230).

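Elements a) through k) of claim 17 describe an arithmetic pipeline that can be written as a single function. The sketch below follows the quotients as recited; the function name, the parameter names, the default weights of 1/3 and the scale factor of 100 are assumptions, and the optimum recitation time is taken as an input because the claim does not say how it is determined.

```python
def recitation_quality_score(
    word_count: int,       # (195) words in the preselected script
    optimum_time: float,   # (197) optimum recitation time, in seconds
    reject_count: int,     # (199) reject phone count
    total_time: float,     # (201) total time elapsed during recitation
    good_time: float,      # (203) time spent on phrases deemed acceptable
    a: float = 1 / 3,      # first weighting parameter (assumed default)
    b: float = 1 / 3,      # second weighting parameter (assumed default)
    c: float = 1 / 3,      # third weighting parameter (assumed default)
    scale: float = 100.0,  # (228) scale factor (assumed default)
) -> float:
    """Weighted sum of the three quotients in claim 17, times a scale factor."""
    if word_count <= 0 or optimum_time <= 0 or total_time <= 0:
        raise ValueError("word_count, optimum_time and total_time must be positive")

    first = good_time / total_time                 # (205) good time / total time
    preferred_max = max(optimum_time, good_time)   # (211)
    second = optimum_time / preferred_max          # (213)
    quality_value = reject_count + word_count      # (219)
    third = word_count / quality_value             # (221)

    score_sum = a * first + b * second + c * third  # (227)
    return scale * score_sum                        # (230)


if __name__ == "__main__":
    print(recitation_quality_score(10, 5.0, 2, 8.0, 6.0))
```

Each quotient lies between 0 and 1 when the good time does not exceed the total time, so with weights that sum to 1 the score is bounded by the scale factor.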
18. A system for tracking speech and interacting with a speaker using spoken and graphic outputs and an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from input speech spoken by the speaker after being prompted to recite from a preselected script which includes a plurality of preselected script alternatives and from a grammar model, the system comprising:
presentation means for presenting information to the speaker about a subject and prompting the speaker to recite one of the plurality of preselected script alternatives;
sensing means for electronically capturing the input speech, wherein the captured input speech is stored in a computer memory;
analyzing means for analyzing the captured input speech to determine an input hypothesis corresponding to the input speech spoken by the speaker;
identifying means, coupled to the analyzing means, for identifying which preselected script alternative from the plurality of preselected script alternatives best corresponds to the input hypothesis;
assessing means, coupled to the identifying means, for assessing completeness of an utterance to determine accuracy of recitation of the identified preselected script alternative, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite;
output means, coupled to the assessing means, for outputting a response upon the completion of the utterance, the response indicating to the speaker the accuracy of the recitation of the identified preselected script alternative and the semantic appropriateness of the identified preselected script alternative.
Dependent claims: 19

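The identifying means of claim 18 picks the script alternative that best corresponds to the input hypothesis. The claim does not state a matching criterion, so the sketch below substitutes a generic word-sequence similarity from Python's standard `difflib` module. This is a swapped-in technique for illustration, not the patent's method, and the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def best_alternative(
    hypothesis: List[str], alternatives: Sequence[List[str]]
) -> Tuple[int, float]:
    """Return (index, similarity) of the alternative closest to the hypothesis."""
    if not alternatives:
        raise ValueError("at least one script alternative is required")
    best_index, best_ratio = 0, -1.0
    for index, words in enumerate(alternatives):
        # ratio() is 2 * matches / total length, computed over word tokens
        ratio = SequenceMatcher(None, hypothesis, words).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index, best_ratio


if __name__ == "__main__":
    alternatives = [
        "i would like tea".split(),
        "i would like coffee please".split(),
    ]
    print(best_alternative("i would like coffee".split(), alternatives))
```

The returned similarity could then feed an accuracy assessment, while the returned index identifies which alternative to judge for semantic appropriateness.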
Specification