Systems and methods for automated evaluation of human speech

US 9,947,322 B2
Filed: 02/25/2016
Issued: 04/17/2018
Est. Priority Date: 02/26/2015
Status: Active Grant

Abstract

Systems and methods for evaluating human speech. Implementations may include a microphone coupled with a computing device comprising a microprocessor, a memory, and a display operatively coupled together. The microphone may be configured to receive an audible unconstrained speech utterance from a user whose proficiency in a language is being tested and provide a corresponding audio signal to the computing device. The microprocessor and memory may receive the audio signal and process it by recognizing a plurality of phones and a plurality of pauses and calculating a plurality of suprasegmental parameters using the plurality of pauses and the plurality of phones. The microprocessor and memory may use the plurality of suprasegmental parameters to calculate a language proficiency rating for the user and display the language proficiency rating of the user on the display associated with the computing device.

20 Claims

1. A system for performing automated proficiency scoring of speech, the system comprising:
- a microphone coupled with a computing device comprising a microprocessor, a memory, and a display operatively coupled together;
  
  wherein the microphone is configured to receive an audible unconstrained speech utterance from a user whose proficiency in a language is being tested and provide a corresponding audio signal to the computing device; and
  
  wherein the microprocessor and memory are configured to:
  
  receive the audio signal; and
  
  process the audio signal by:
  
  recognizing a plurality of phones and a plurality of pauses comprised in the audio signal corresponding with the utterance;
  
  dividing the plurality of phones and plurality of pauses into a plurality of tone units;
  
  grouping the plurality of phones into a plurality of syllables;
  
  identifying a plurality of filled pauses from among the plurality of pauses;
  
  detecting a plurality of prominent syllables from among the plurality of syllables;
  
  identifying, from among the plurality of prominent syllables, a plurality of tonic syllables;
  
  identifying a tone choice for each of the tonic syllables of the plurality of tonic syllables to form a plurality of tone choices;
  
  calculating a relative pitch for each of the tonic syllables of the plurality of tonic syllables to form a plurality of relative pitch values;
  
  calculating a plurality of suprasegmental parameters using one of the plurality of pauses, the plurality of filled pauses, the plurality of tone units, the plurality of syllables, the plurality of prominent syllables, the plurality of tone choices, the plurality of relative pitch values, and any combination thereof;
  
  using the plurality of suprasegmental parameters, calculating a language proficiency rating for the user; and
  
  displaying the language proficiency rating of the user on the display associated with the computing device using the microprocessor and the memory.
  - 2. The system of claim 1, wherein recognizing a plurality of phones and a plurality of pauses further comprises recognizing using an automatic speech recognition system (ASR) and the microprocessor wherein the ASR is trained using a speech corpus.
  - 3. The system of claim 2, further comprising identifying a plurality of silent pauses of the plurality of pauses after recognizing using the ASR.
  - 4. The system of claim 3, wherein dividing the plurality of phones and plurality of pauses into a plurality of tone units further comprises using the plurality of silent pauses and one of a plurality of pitch resets and a plurality of slow pace values.
  - 5. The system of claim 1, wherein grouping the plurality of phones into a plurality of syllables further comprises using a predetermined bias value.
  - 6. The system of claim 1, wherein detecting a plurality of prominent syllables from among the plurality of syllables further comprises detecting using a bagging ensemble of decision tree learners, two or more speech corpora, and the microprocessor.
  - 7. The system of claim 1, wherein identifying a tone choice for each of the tonic syllables further comprises identifying using a rule-based classifier comprising a 4-point model, two or more speech corpora, and the microprocessor.
  - 8. The system of claim 1, wherein the plurality of suprasegmental parameters are selected from the group consisting of articulation rate (ARTI), phonation time ratio (PHTR), tone unit mean length (RNLN), syllables per second (SYPS), filled pause mean length (FPLN), filled pauses per second (FPRT), silent pause mean length (SPLN), silent pauses per second (SPRT), prominent syllables per tone unit (PACE), percent of tone units containing at least one prominent syllable (PCHR), percent of syllables that are prominent (SPAC), overall pitch range (PRAN), non-prominent syllable mean pitch (AVNP), prominent syllable mean pitch (AVPP), falling-high rate (FALH), falling-low rate (FALL), falling-mid rate (FALM), fall-rise-high rate (FRSH), fall-rise-low rate (FRSL), fall-rise-mid rate (FRSM), neutral-high rate (NEUH), neutral-low rate (NEUL), neutral-mid rate (NEUM), rise-fall-high rate (RFAH), rise-fall-low rate (RFAL), rise-fall-mid rate (RFAM), rising-high rate (RISH), rising-low rate (RISL), rising-mid rate (RISM), given lexical item mean pitch (GIVP), new lexical item mean pitch (NEWP), paratone boundary onset pitch mean height (OPTH), paratone boundaries per second (PARA), paratone boundary mean pause length (PPLN), paratone boundary mean termination pitch height (TPTH), and any combination thereof.
  - 9. The system of claim 1, wherein calculating the language proficiency rating for the user further comprises calculating using the plurality of suprasegmental parameters and a pairwise coupled ensemble of decision tree learners and the microprocessor.
  - 10. The system of claim 1, wherein the language is English and the language proficiency rating is based on a Cambridge English Language Assessment rating system.
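
The rate and length measures listed in claim 8 can be illustrated with a minimal sketch. It assumes the earlier processing steps have already produced counts and pause durations; the `Utterance` container and its field names are hypothetical and not taken from the patent. The formulas for ARTI and PHTR follow common usage (syllables per second of phonation time, and phonation time over total time) and are assumptions here, as is the choice to exclude both silent and filled pauses from phonation time.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    """Hypothetical container for values produced by the earlier steps."""
    total_seconds: float
    syllable_count: int
    prominent_syllable_count: int
    tone_unit_count: int
    silent_pauses: List[float]  # duration of each silent pause, in seconds
    filled_pauses: List[float]  # duration of each filled pause, in seconds


def mean(values: List[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def rate_measures(u: Utterance) -> Dict[str, float]:
    """Compute a subset of the claim 8 measures (assumes non-zero denominators)."""
    pause_time = sum(u.silent_pauses) + sum(u.filled_pauses)
    phonation_time = u.total_seconds - pause_time  # assumed definition
    return {
        "SYPS": u.syllable_count / u.total_seconds,
        "ARTI": u.syllable_count / phonation_time,  # assumed definition
        "PHTR": phonation_time / u.total_seconds,   # assumed definition
        "FPLN": mean(u.filled_pauses),
        "FPRT": len(u.filled_pauses) / u.total_seconds,
        "SPLN": mean(u.silent_pauses),
        "SPRT": len(u.silent_pauses) / u.total_seconds,
        "PACE": u.prominent_syllable_count / u.tone_unit_count,
        "SPAC": 100.0 * u.prominent_syllable_count / u.syllable_count,
    }


example = Utterance(
    total_seconds=10.0,
    syllable_count=30,
    prominent_syllable_count=6,
    tone_unit_count=4,
    silent_pauses=[0.5, 0.5, 1.0],
    filled_pauses=[0.25, 0.25],
)
measures = rate_measures(example)
```

The pitch-based and tone-choice measures in the claim (for example PRAN or FALH) would need per-syllable pitch data and are left out of this sketch.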

11. A method of performing automated proficiency scoring of speech, the method comprising:
- generating an audio signal using a microphone by receiving an audible unconstrained speech utterance from a user whose proficiency in a language is being tested;
  
  providing the audio signal to a computing device coupled with the microphone, the computing device comprising a microprocessor, a memory, and a display operatively coupled together;
  
  processing the audio signal using the microprocessor and memory by:
  
  recognizing a plurality of phones and a plurality of pauses comprised in the audio signal corresponding with the utterance;
  
  dividing the plurality of phones and plurality of pauses into a plurality of tone units;
  
  grouping the plurality of phones into a plurality of syllables;
  
  identifying a plurality of filled pauses from among the plurality of pauses;
  
  detecting a plurality of prominent syllables from among the plurality of syllables;
  
  identifying, from among the plurality of prominent syllables, a plurality of tonic syllables;
  
  identifying a tone choice for each of the tonic syllables of the plurality of tonic syllables to form a plurality of tone choices;
  
  calculating a relative pitch for each of the tonic syllables of the plurality of tonic syllables to form a plurality of relative pitch values;
  
  calculating a plurality of suprasegmental parameters using one of the plurality of pauses, the plurality of filled pauses, the plurality of tone units, the plurality of syllables, the plurality of prominent syllables, the plurality of tone choices, the plurality of relative pitch values, and any combination thereof;
  
  using the plurality of suprasegmental parameters, calculating a language proficiency rating for the user; and
  
  displaying the language proficiency rating of the user on the display associated with the computing device using the microprocessor and the memory.
  - 12. The method of claim 11, wherein recognizing a plurality of phones and a plurality of pauses further comprises recognizing using an automatic speech recognition system (ASR) and the microprocessor wherein the ASR is trained using a speech corpus.
  - 13. The method of claim 12, further comprising identifying a plurality of silent pauses of the plurality of pauses after recognizing using the ASR.
  - 14. The method of claim 13, wherein dividing the plurality of phones and plurality of pauses into a plurality of tone units further comprises using the plurality of silent pauses and one of a plurality of pitch resets and a plurality of slow pace values.
  - 15. The method of claim 11, wherein detecting a plurality of prominent syllables from among the plurality of syllables further comprises detecting using a bagging ensemble of decision tree learners, two or more speech corpora, and the microprocessor.
  - 16. The method of claim 11, wherein identifying a tone choice for each of the tonic syllables further comprises identifying using a rule-based classifier comprising a 4-point model, two or more speech corpora, and the microprocessor.
  - 17. The system of claim 11, wherein the plurality of suprasegmental parameters are selected from the group consisting of articulation rate (ARTI), phonation time ratio (PHTR), tone unit mean length (RNLN), syllables per second (SYPS), filled pause mean length (FPLN), filled pauses per second (FPRT), silent pause mean length (SPLN), silent pauses per second (SPRT), prominent syllables per tone unit (PACE), percent of tone units containing at least one prominent syllable (PCHR), percent of syllables that are prominent (SPAC), overall pitch range (PRAN), non-prominent syllable mean pitch (AVNP), prominent syllable mean pitch (AVPP), falling-high rate (FALH), falling-low rate (FALL), falling-mid rate (FALM), fall-rise-high rate (FRSH), fall-rise-low rate (FRSL), fall-rise-mid rate (FRSM), neutral-high rate (NEUH), neutral-low rate (NEUL), neutral-mid rate (NEUM), rise-fall-high rate (RFAH), rise-fall-low rate (RFAL), rise-fall-mid rate (RFAM), rising-high rate (RISH), rising-low rate (RISL), rising-mid rate (RISM), given lexical item mean pitch (GIVP), new lexical item mean pitch (NEWP), paratone boundary onset pitch mean height (OPTH), paratone boundaries per second (PARA), paratone boundary mean pause length (PPLN), paratone boundary mean termination pitch height (TPTH), and any combination thereof.
  - 18. The system of claim 11, wherein calculating a language proficiency rating for the user further comprises calculating using the plurality of suprasegmental parameters and a pairwise coupled ensemble of decision tree learners and the microprocessor.
  - 19. The system of claim 11, wherein the language is English and the language proficiency rating is based on a Cambridge English Language Assessment rating system.
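
Claims 6 and 15 refer to a bagging ensemble of decision tree learners for detecting prominent syllables. The following is a toy sketch of the general bagging technique only, not the patented classifier: each learner is a depth-1 decision tree (a stump) trained on a bootstrap resample, and the ensemble predicts by majority vote. The three per-syllable features (pitch in Hz, intensity in dB, duration in seconds), the training rows, and all function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a depth-1 decision tree: the (feature, threshold, side) with fewest errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above in (True, False):
                # Predict "prominent" when (value > t) matches the chosen side.
                errors = sum(((r[f] > t) == above) != y for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda row: (row[f] > t) == above


def train_bagged(rows, labels, n_learners=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        learners.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return learners


def predict(learners, row):
    """Majority vote over the ensemble."""
    votes = Counter(learner(row) for learner in learners)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (pitch_hz, intensity_db, duration_s) per syllable.
ROWS = [
    (220.0, 72.0, 0.24), (235.0, 75.0, 0.28), (210.0, 70.0, 0.22),
    (240.0, 74.0, 0.30), (228.0, 73.0, 0.26),
    (150.0, 60.0, 0.10), (165.0, 62.0, 0.12), (140.0, 58.0, 0.09),
    (170.0, 64.0, 0.14), (158.0, 61.0, 0.11),
]
LABELS = [True] * 5 + [False] * 5  # True marks a prominent syllable

ensemble = train_bagged(ROWS, LABELS)
is_prominent = predict(ensemble, (260.0, 78.0, 0.32))
```

A real system would train on annotated speech corpora, as the claims state, and would likely use deeper trees and a richer feature set than this sketch.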

20. A method of calculating a plurality of suprasegmental values for an utterance, the method comprising:
- generating an audio signal using a microphone by receiving an audible unconstrained speech utterance from a user;
  
  providing the audio signal to a computing device coupled with the microphone, the computing device comprising a microprocessor, a memory, and a display operatively coupled together;
  
  processing the audio signal using the microprocessor and memory by:
  
  recognizing a plurality of phones and a plurality of pauses comprised in the audio signal corresponding with the utterance;
  
  dividing the plurality of phones and plurality of pauses into a plurality of tone units;
  
  grouping the plurality of phones into a plurality of syllables;
  
  identifying a plurality of filled pauses from among the plurality of pauses;
  
  detecting a plurality of prominent syllables from among the plurality of syllables;
  
  identifying, from among the plurality of prominent syllables, a plurality of tonic syllables;
  
  identifying a tone choice for each of the tonic syllables of the plurality of tonic syllables to form a plurality of tone choices;
  
  calculating a relative pitch for each of the tonic syllables of the plurality of tonic syllables to form a plurality of relative pitch values; and
  
  calculating a plurality of suprasegmental parameters using one of the plurality of pauses, the plurality of filled pauses, the plurality of tone units, the plurality of syllables, the plurality of prominent syllables, the plurality of tone choices, the plurality of relative pitch values, and any combination thereof.
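
The step of dividing phones and pauses into tone units can be sketched in simplified form. This sketch assumes the recognizer output is a list of `(label, duration)` pairs, that silent pauses carry the label `"sil"`, and that a silent pause of at least 0.2 seconds marks a tone unit boundary. The label, the threshold, and the function name are all hypothetical. Claims 4 and 14 also mention pitch resets and slow pace values as boundary cues; those are omitted here.

```python
from typing import List, Tuple

# (label, duration in seconds); "sil" is an assumed label for a silent pause.
Segment = Tuple[str, float]


def split_tone_units(segments: List[Segment], min_pause: float = 0.2) -> List[List[Segment]]:
    """Split a segment sequence into tone units at long silent pauses.

    Pauses shorter than min_pause stay inside the current tone unit;
    boundary pauses themselves are not included in any unit.
    """
    units: List[List[Segment]] = []
    current: List[Segment] = []
    for label, duration in segments:
        if label == "sil" and duration >= min_pause:
            if current:
                units.append(current)
            current = []
        else:
            current.append((label, duration))
    if current:
        units.append(current)
    return units


segments = [
    ("dh", 0.05), ("ax", 0.06), ("sil", 0.50),
    ("k", 0.07), ("ae", 0.10), ("t", 0.05), ("sil", 0.05), ("s", 0.08),
    ("sil", 0.40),
]
tone_units = split_tone_units(segments)
```

With the sample input, the 0.50 s and 0.40 s pauses act as boundaries, while the 0.05 s pause remains inside the second tone unit.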

Current Assignee: Arizona Board of Regents (University of Arizona)
Original Assignee: Arizona Board of Regents (University of Arizona)
Inventors: Kang, Okim; Johnson, David O.
Primary Examiner(s): He, Jialong
Application Number: US15/054,128
Publication Number: US 20160253999A1
Time in Patent Office: 782 Days

CPC Class Codes
- G09B 19/04: Speaking with audible prese...
- G10L 15/00: Speech recognition G10L17/0...
- G10L 17/00: Speaker identification or v...
- G10L 2015/025: Phonemes, fenemes or fenone...
- G10L 2015/027: Syllables being the recogni...
- G10L 25/60: for measuring the quality o...
- G10L 25/75: for modelling vocal tract p...
- G10L 25/87: Detection of discrete point...
