Assessing speech prosody

US 9,368,126 B2
Filed: 04/29/2011
Issued: 06/14/2016
Est. Priority Date: 04/30/2010
Status: Active Grant

Abstract

A method, system and computer readable storage medium for assessing speech prosody. The method includes the steps of: receiving input speech data; acquiring a prosody constraint; assessing prosody of the input speech data according to the prosody constraint; and providing an assessment result, where at least one of the steps is carried out using a computer device.

22 Claims

1. A method for assessing speech prosody, comprising:
- receiving, by a computing device, spoken speech, the spoken speech being converted into input speech data representing the spoken speech;
  
  processing, by the computing device, the input speech data to acquire an input language structure that corresponds to the input speech data and that represents part of speech role of words of the spoken speech;
  
  obtaining, from a corpus of standard speech data comprising at least one example of standard speech data having a matching language structure as at least a portion of the input speech data, a language structure of standard speech;
  
  traversing a decision tree that corresponds to the language structure of standard speech based on at least a portion of the input language structure to identify, for a word in the input language structure, an occurrence probability of phrase boundary location at the word, wherein a leaf node of the decision tree identifies a determined occurrence probability of phrase boundary location for a part of speech based on a first adjacent part of speech to the left of the part of speech and a second adjacent part of speech to the right of the part of speech;
  
  acquiring a rhythm feature and a fluency feature of the input speech data based, at least in part, on the occurrence probability of phrase boundary location for the word;
  
  acquiring, from the corpus of standard speech data, a prosody constraint based on the rhythm feature and the fluency feature;
  
  assessing prosody of the input speech data according to the prosody constraint;
  
  providing an assessment result based on the prosody constraint; and
  
  based on the assessment result, either adding the input speech data to the corpus of standard speech data or outputting reference speech that indicates a correct way to say the spoken speech.
- - 2. The method according to claim 1 further comprising:
    - acquiring a standard rhythm feature for the input speech data; and
      
      wherein acquiring the prosody constraint comprises comparing the rhythm feature to the standard rhythm feature.
  - 3. The method according to claim 2, wherein the rhythm feature is represented as a phrase boundary location of the input speech data.
  - 4. The method according to claim 3, wherein comparing the rhythm feature to the standard rhythm feature comprises determining whether the phrase boundary location matches with a standard phrase boundary location.
  - 5. The method according to claim 3, wherein acquiring the rhythm feature comprises:
    - acquiring input text data corresponding to the input speech data;
      
      aligning the input text data with the input speech data; and
      
      determining the phrase boundary location based on alignment of the input text data with the input speech data.
  - 6. The method according to claim 5, wherein acquiring the standard rhythm feature comprises:
    - matching the input language structure with the standard language structure of standard speech; and
      
      selecting a standard phrase boundary location for the input language structure as the standard rhythm feature based on a plurality of occurrence probabilities of phrase boundary locations wherein individual occurrence probabilities of phrase boundary locations in the plurality of occurrence probabilities of phrase boundary locations correspond to individual words in the input speech data.
  - 7. The method according to claim 6, wherein selecting the standard phrase boundary location for the input language structure as the standard rhythm feature comprises:
    - determining that the occurrence probability is above a predetermined threshold.
  - 8. The method according to claim 6, wherein matching the input language structure with the standard language structure comprises traversing the decision tree and determining, for each word in the input speech data, an occurrence probability of phrase boundary location of that word.
  - 9. The method according to claim 1, wherein acquiring the fluency feature comprises:
    - acquiring input text data corresponding to the input speech data; and
      
      aligning the input text data with the input speech data.
  - 10. The method according to claim 9, wherein:
    - the fluency feature comprises a total number of phrase boundaries within a sentence of the input text data;
      
      the phrase boundary comprises a characteristic selected from the group consisting of silence and pitch reset; and
      
      acquiring the prosody constraint comprises predicting a total number of phrase boundaries based on a length of the sentence and comparing the total number of phrase boundaries to a predicted total number of phrase boundaries.
  - 11. The method according to claim 9, wherein:
    - the fluency feature comprises a silence duration within a first phrase boundary;
      
      acquiring the prosody constraint comprises determining a standard silence duration for the input speech data and comparing the silence duration to the standard silence duration; and
      
      the first phrase boundary is a phrase boundary of at least one word of the input text data.
  - 12. The method according to claim 11, wherein determining the standard silence duration comprises:
    - matching the input language structure with the language structure of standard speech to determine the standard silence duration.
  - 13. The method according to claim 12, wherein matching the input language structure with a standard language structure comprises:
    - traversing the decision tree to determine the standard silence duration of the first phrase boundary; and
      
      wherein the standard silence duration is an average value of a silence duration of a second phrase boundary of the language structure of standard speech.
  - 14. The method according to claim 1, wherein:
    - the fluency feature comprises a repetition number wherein the repetition number represents a number of times a word is repeated within the input speech data; and
      
      acquiring the prosody constraint comprises acquiring a value indicating a permissible number of repetitions and comparing the repetition number to the value.
  - 15. The method according to claim 1, wherein:
    - the fluency feature comprises a phone hesitation degree wherein the phone hesitation degree includes a metric selected from the group consisting of a count of phone hesitations and a phone hesitation duration; and
      
      acquiring the prosody constraint comprises acquiring a value indicating a permissible phone hesitation degree and comparing the phone hesitation degree to the value.
  - 16. The method according to claim 1, wherein the assessment result comprises a result selected from the group consisting of a score of prosody of the input speech data and a detailed analysis on prosody of the input speech data.
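
The decision-tree step of claim 1 and the threshold test of claim 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` layout, the toy tree, its probabilities, the part-of-speech tag names, the sentence padding tokens, and the 0.5 threshold are all assumptions chosen for illustration. Each leaf holds an occurrence probability of a phrase boundary for a word, selected from the word's own part of speech and those of its left and right neighbours.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    """Decision-tree node (hypothetical layout). Internal nodes test one
    part-of-speech (POS) tag; leaf nodes carry a boundary probability."""
    feature: Optional[str] = None        # "left", "current", or "right"
    tag: Optional[str] = None            # POS tag the question tests for
    yes: Optional["Node"] = None         # branch taken when the tag matches
    no: Optional["Node"] = None          # branch taken otherwise
    probability: Optional[float] = None  # set only on leaf nodes


def boundary_probability(tree: Node, left: str, current: str, right: str) -> float:
    """Walk the tree for one word and return the probability at its leaf."""
    context = {"left": left, "current": current, "right": right}
    node = tree
    while node.probability is None:
        node = node.yes if context[node.feature] == node.tag else node.no
    return node.probability


def standard_boundaries(tree: Node, pos_tags: List[str], threshold: float = 0.5) -> Set[int]:
    """Indices of words whose boundary probability exceeds the threshold."""
    padded = ["<s>", *pos_tags, "</s>"]  # assumed sentence-edge padding
    return {
        i for i in range(len(pos_tags))
        if boundary_probability(tree, padded[i], padded[i + 1], padded[i + 2]) > threshold
    }


# Toy tree with made-up probabilities: a boundary is likely after a noun
# that is followed by a verb.
TOY_TREE = Node(
    feature="current", tag="NOUN",
    yes=Node(feature="right", tag="VERB",
             yes=Node(probability=0.8),
             no=Node(probability=0.3)),
    no=Node(probability=0.1),
)

tags = ["DET", "NOUN", "VERB", "ADV"]
expected = standard_boundaries(TOY_TREE, tags)  # standard rhythm feature
observed = {1, 2}                               # boundaries heard in the input
missing = expected - observed                   # expected pauses the speaker skipped
extra = observed - expected                     # pauses the standard does not predict
```

In this toy run only the noun at index 1 clears the threshold, so the pause the speaker made after index 2 is reported in `extra`. Comparing the two sets in this way corresponds loosely to the boundary matching described in claim 4.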

17. A system for assessing speech prosody, comprising:
- one or more processors;
  
  an audio receiver configured to receive spoken speech; and
  
  memory storing instructions that, when executed by one of the processors, cause the system to
    - convert the spoken speech into input speech data representing the spoken speech,
      
      process the input speech data to acquire an input language structure that corresponds to the input speech data and that represents part of speech role of words of the spoken speech,
      
      obtain, from a corpus of standard speech data comprising at least one example of standard speech data having a matching language structure as at least a portion of the input speech data, a language structure of standard speech,
      
      traverse a decision tree that corresponds to the language structure of standard speech based on at least a portion of the input language structure to identify, for a word in the input language structure, an occurrence probability of phrase boundary location at the word, wherein a leaf node of the decision tree identifies a determined occurrence probability of phrase boundary location for a part of speech based on a first adjacent part of speech to the left of the part of speech and a second adjacent part of speech to the right of the part of speech,
      
      acquire a rhythm feature and a fluency feature of the input speech data based, at least in part, on the occurrence probability of phrase boundary location for the word,
      
      acquire, from the corpus of standard speech data, a prosody constraint based on the rhythm feature and the fluency feature,
      
      assess prosody of the input speech data according to the prosody constraint,
      
      provide an assessment result based on the prosody constraint, and
      
      based on the assessment result, either add the input speech data to the corpus of standard speech data or output reference speech that indicates a correct way to say the spoken speech.
- - 18. The system according to claim 17 wherein:
    - the instructions, when executed, further cause the system to acquire a standard rhythm feature for the input speech data; and
      
      acquiring the prosody constraint comprises comparing the rhythm feature to the standard rhythm feature.
  - 19. The system according to claim 17, wherein:
    - the instructions, when executed, further cause the system to acquire input text data corresponding to the input speech data, and align the input text data with the input speech data.
  - 20. The system according to claim 19, wherein:
    - the fluency feature is selected from the group consisting of a total number of phrase boundaries, a silence duration of a phrase boundary, a number of repetition times of a word, and a phone hesitation degree; and
      
      the phone hesitation degree includes a metric selected from the group consisting of a total number of phone hesitations and a phone hesitation duration.
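
Several of the fluency features listed in claim 20 can be derived from a word-level alignment of text and speech. The Python sketch below assumes such an alignment is already available as start and end times per word. The `AlignedWord` type, the 0.2 second silence threshold, and the choice to count only immediate word repetitions are assumptions for illustration, not details taken from the claims. Pitch reset, the other boundary cue named in claim 10, and phone hesitation are not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    """One word of the input text with its aligned time span (hypothetical type)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def silence_gaps(words: List[AlignedWord]) -> List[float]:
    """Silence between each word and the next one, in seconds."""
    return [max(0.0, nxt.start - cur.end) for cur, nxt in zip(words, words[1:])]


def count_phrase_boundaries(words: List[AlignedWord], min_silence: float = 0.2) -> int:
    """Count gaps long enough to be treated as a phrase boundary."""
    return sum(gap >= min_silence for gap in silence_gaps(words))


def repetition_number(words: List[AlignedWord]) -> int:
    """Count words that immediately repeat the preceding word."""
    return sum(
        cur.text.lower() == nxt.text.lower()
        for cur, nxt in zip(words, words[1:])
    )


# Made-up alignment for the utterance "the the cat sat".
utterance = [
    AlignedWord("the", 0.00, 0.20),
    AlignedWord("the", 0.25, 0.40),
    AlignedWord("cat", 0.45, 0.80),
    AlignedWord("sat", 1.20, 1.50),
]

boundary_total = count_phrase_boundaries(utterance)  # one long pause, before "sat"
repeats = repetition_number(utterance)               # "the" is said twice in a row
```

Each value would then be compared against a permissible or predicted value, as claims 10, 11 and 14 describe. How those reference values are obtained is outside this sketch.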

21. A computer-implemented method for assessing speech prosody comprising:
- receiving, by a computing device, spoken speech, the spoken speech being converted into input speech data representing the spoken speech;
  
  processing, by the computing device, the input speech data to acquire an input language structure that corresponds to the input speech data and that represents part of speech role of words of the spoken speech;
  
  obtaining, from a corpus of standard speech data comprising at least one example of standard speech data having a matching language structure as at least a portion of the input speech data, a language structure of standard speech;
  
  traversing a decision tree that corresponds to the language structure of standard speech based on at least a portion of the input language structure to identify, for a word in the input language structure, an occurrence probability of phrase boundary location at the word and a silence duration of phrase boundary location at the word, wherein a leaf node of the decision tree identifies a determined occurrence probability of phrase boundary location for a part of speech and a determined average silence duration for the part of speech each based on a first adjacent part of speech to the left of the part of speech and a second adjacent part of speech to the right of the part of speech;
  
  acquiring a rhythm feature and a fluency feature of the input speech data, wherein the rhythm feature is acquired based, at least in part, on the occurrence probability of phrase boundary location for the word and wherein the fluency feature is acquired based, at least in part, on the silence duration of phrase boundary location for the word;
  
  acquiring, from the corpus of standard speech data, a standard rhythm feature and a standard fluency feature based on the decision tree;
  
  performing a first comparison of the rhythm feature to the standard rhythm feature;
  
  performing a second comparison of the fluency feature to the standard fluency feature;
  
  obtaining a prosody assessment result based on the first and second comparisons; and
  
  based on the prosody assessment result, either adding the input speech data to the corpus of standard speech data or outputting reference speech data that indicates a correct way to say the spoken speech.
- - 22. The computer-implemented method of claim 21 further comprising:
    - acquiring input text data corresponding to the input speech data; and
      
      the input language structure corresponding to the input text data.
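
The last three steps of claim 21 (two comparisons, a combined result, and a branch on that result) can be sketched as follows. The claim does not say how the comparisons are scored or combined, so everything numeric here is an assumption: the set-overlap rhythm score, the relative-deviation fluency score, the simple average, and the 0.8 pass mark are hypothetical choices made only to give a runnable example.

```python
from typing import Set


def compare_rhythm(observed: Set[int], standard: Set[int]) -> float:
    """First comparison: overlap between observed and standard phrase
    boundary locations (Jaccard index, an assumed scoring rule)."""
    if not observed and not standard:
        return 1.0
    return len(observed & standard) / len(observed | standard)


def compare_fluency(silence: float, standard_silence: float) -> float:
    """Second comparison: 1.0 when the silence duration equals the
    standard, falling linearly to 0.0 as the deviation grows (assumed rule)."""
    if standard_silence <= 0:
        raise ValueError("standard_silence must be positive")
    return max(0.0, 1.0 - abs(silence - standard_silence) / standard_silence)


def assess(rhythm_score: float, fluency_score: float) -> float:
    """Combine both comparisons into one result (plain average, assumed)."""
    return (rhythm_score + fluency_score) / 2


def next_action(result: float, pass_mark: float = 0.8) -> str:
    """Branch on the result: keep good speech as new standard data,
    otherwise play back a reference rendering."""
    return "add_to_corpus" if result >= pass_mark else "output_reference"


rhythm = compare_rhythm(observed={1, 2}, standard={1})         # one extra pause
fluency = compare_fluency(silence=0.4, standard_silence=0.4)   # pause length matches
result = assess(rhythm, fluency)
action = next_action(result)
```

With these made-up inputs the extra pause lowers the rhythm score enough that the combined result falls below the pass mark, so the sketch selects the reference-speech branch rather than adding the utterance to the corpus.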

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Qin, Yong; Shi, Qin; Shuang, Zhiwei; Zhang, Shi Lei
Primary Examiner(s): Sirjani, Fariba
Application Number: US13/097,191
Publication Number: US 20110270605A1
Time in Patent Office: 1,873 Days
Field of Search: 704 1- 10
US Class Current: 1/1
CPC Class Codes: G10L 25/48 specially adapted for parti...
