Hierarchical transcription and display of input speech

US 20020133340A1
Filed: 03/16/2001
Published: 09/19/2002
Est. Priority Date: 03/16/2001
Status: Active Grant

Abstract

Generally, the present invention provides the ability to present a mixed display of a transcription to a user. The mixed display is preferably organized in a hierarchical fashion. Words, syllables and phones can be placed on the same display by the present invention, and the present invention can select the appropriate symbol transcription based on the parts of speech that meet minimum confidences. Words are displayed if they meet a minimum confidence or else syllables, which make up the word, are displayed. Additionally, if a syllable does not meet a predetermined confidence, then phones, which make up the syllable, may be displayed. A transcription, in one aspect of the present invention, may also be described as a hierarchical transcription, because a unique confidence is derived that accounts for mixed word/syllable/phone data.

41 Claims

1. A method for hierarchical transcription and display of input speech, the method comprising the steps of:
- converting a speech portion to a word;
  
  determining a confidence of the word;
  
  displaying the word if the confidence of the word meets a threshold confidence; and
  
  displaying at least one syllable, corresponding to the word, if the confidence of the word does not meet the threshold confidence.
- - 2. The method of claim 1, wherein the step of displaying at least one syllable comprises the steps of:
    - determining a confidence of the at least one syllable; and
      
      displaying at least one phone that corresponds to the at least one syllable if the confidence of the at least one syllable does not meet a threshold confidence.
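
The word, syllable and phone fallback recited in claims 1 and 2 can be illustrated with a minimal sketch. The `Word` and `Syllable` types, the `render` function, the threshold values and the hyphen separator are all hypothetical choices made for this illustration; the claims only require that a unit be displayed when its confidence meets a threshold and that the next finer unit be displayed otherwise.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Syllable:
    text: str
    confidence: float
    phones: List[str] = field(default_factory=list)


@dataclass
class Word:
    text: str
    confidence: float
    syllables: List[Syllable] = field(default_factory=list)


def render(word: Word, word_threshold: float = 0.8,
           syllable_threshold: float = 0.6) -> str:
    """Show the word if confident; otherwise syllables, with phones
    substituted for any syllable below its own threshold.
    Threshold defaults are illustrative assumptions."""
    if word.confidence >= word_threshold:
        return word.text
    parts = []
    for syl in word.syllables:
        if syl.confidence >= syllable_threshold:
            parts.append(syl.text)
        else:
            parts.append(" ".join(syl.phones))
    return "-".join(parts)


# Made-up example: a low-confidence word with one low-confidence syllable.
example = Word("embodiment", 0.4, [
    Syllable("em", 0.9, ["EH", "M"]),
    Syllable("bod", 0.3, ["B", "AA", "D"]),
    Syllable("i", 0.7, ["IY"]),
    Syllable("ment", 0.8, ["M", "AH", "N", "T"]),
])
print(render(example))  # em-B AA D-i-ment
```

Only the second syllable falls below its threshold, so it alone is shown as phones while the other syllables stay at syllable level.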

3. A method comprising the steps of:
- providing a recognized sentence portion comprising words and syllables;
  
  transforming a plurality of hypothesis scores of the recognized sentence portion to phone level;
  
  determining, by using the transformed hypothesis scores, confidence of the recognized sentence portion as a function of time; and
  
  using the confidence as a function of time to determine confidences for parts of speech in the recognized sentence portion.
- - 4. The method of claim 3, further comprising the steps of:
    - providing a history comprising words already learned by a student; and
      
      using the history to determine how likely it is that a student made a reading mistake.
  - 5. The method of claim 4, wherein:
    - the method further comprises the steps of:
      
      determining, by using confidence as a function of time, a series of phones in the recognized sentence portion, where each phone in the series is selected as a most likely phone;
      
      determining a correct phonetic pronunciation of a word; and
      
      determining if phones in the series that correspond to the word match the correct phonetic pronunciation of the word; and
      
      the step of using the history to determine how likely it is that a student made a reading mistake further comprises the steps of:
      
      if a confidence score of one or more of the phones is below a predetermined phone confidence and if the word is marked as previously learned, displaying the correct phonetic pronunciation; and
      
      if a confidence score of one or more of the phones is below a predetermined phone confidence and if the word is not marked as previously learned, displaying the correct phonetic pronunciation and emphasizing the phones that are incorrect.
  - 6. The method of claim 3, further comprising the steps of:
    - determining, by using confidence as a function of time, a series of phones in the recognized sentence portion, where each phone in the series is selected as a most likely phone;
      
      determining a correct phonetic pronunciation of a word;
      
      determining if phones in the series that correspond to the word match the correct phonetic pronunciation of the word; and
      
      if one or more of the phones are incorrect, displaying the correct phonetic pronunciation of the word and emphasizing the phones that are incorrect.
  - 7. The method of claim 3, wherein:
    - the step of transforming a plurality of hypothesis scores of the recognized sentence portion to phone level further comprises the steps of:
      
      determining a plurality of hypotheses for a recognized sentence portion;
      
      converting the plurality of hypotheses into a sequence of phones;
      
      determining a probability from each hypothesis score; and
      
      determining start and end times for each phone, whereby the probabilities may be assigned to each phone and the hypothesis scores are thereby transformed to phone level; and
      
      wherein the step of determining confidence as a function of time comprises the steps of:
      
      associating a number of hypotheses, associated probabilities and phones with each of a plurality of frames; and
      
      computing, for each frame, a frame confidence by adding probabilities of all hypotheses for which a phone hypothesized at time t matches a phone hypothesized at time t in a top hypothesis.
  - 8. The method of claim 3, wherein the step of using the confidence as a function of time to determine confidences for parts of speech in the recognized sentence portion further comprises the steps of, for each part of speech of interest:
    - selecting a part of speech that spans a time period;
      
      determining an average confidence over that time period; and
      
      equating the average confidence over that time period as the confidence of the part of speech.
  - 9. The method of claim 8, further comprising the step of combining, for each part of speech of interest, the confidence of this part of speech with one or more additional confidences of this part of speech determined through other methods.
  - 10. The method of claim 3, wherein the parts of speech comprise words, syllables and phones.
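
The frame-level confidence of claim 7 and the time-averaged confidence of claim 8 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each hypothesis is assumed to arrive as a probability plus a list of `(phone, start_frame, end_frame)` segments with an exclusive end frame, the "top hypothesis" is assumed to be the one with the highest probability, and the names `phone_at`, `frame_confidences` and `unit_confidence` are hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, int, int]            # (phone, start_frame, end_frame_exclusive)
Hypothesis = Tuple[float, List[Segment]]  # (probability, phone segments)


def phone_at(segments: List[Segment], t: int) -> Optional[str]:
    """Return the phone a hypothesis proposes at frame t, if any."""
    for phone, start, end in segments:
        if start <= t < end:
            return phone
    return None


def frame_confidences(hypotheses: List[Hypothesis], n_frames: int) -> List[float]:
    """For each frame, add the probabilities of all hypotheses whose phone
    at that frame matches the phone of the top hypothesis (claim 7)."""
    top = max(hypotheses, key=lambda h: h[0])  # assumption: top = most probable
    conf = []
    for t in range(n_frames):
        top_phone = phone_at(top[1], t)
        conf.append(sum(p for p, segs in hypotheses
                        if phone_at(segs, t) == top_phone))
    return conf


def unit_confidence(conf: List[float], start: int, end: int) -> float:
    """Average frame confidence over the span of a word, syllable or phone (claim 8)."""
    return sum(conf[start:end]) / (end - start)


# Made-up data: three hypotheses over six frames.
hyps = [
    (0.50, [("K", 0, 2), ("AE", 2, 4), ("T", 4, 6)]),
    (0.25, [("K", 0, 2), ("AH", 2, 4), ("T", 4, 6)]),
    (0.25, [("K", 0, 2), ("AE", 2, 4), ("D", 4, 6)]),
]
conf = frame_confidences(hyps, 6)
print(conf)                         # [1.0, 1.0, 0.75, 0.75, 0.75, 0.75]
print(unit_confidence(conf, 0, 2))  # 1.0, the span of the first phone
```

All three hypotheses agree on the first phone, so its frames score 1.0; the later frames score 0.75 because one competing hypothesis disagrees with the top hypothesis in each.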

11. A method for hierarchical transcription and display of input speech, the method comprising the steps of:
- determining for a speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech; and
  
  displaying the part of speech that meets the predetermined criteria for that part of speech.
- - 12. The method of claim 11, wherein:
    - the step of determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech comprises the step of determining if a word determined from the speech portion meets a predetermined word confidence; and
      
      the step of displaying the part of speech that meets the predetermined criteria for that part of speech comprises the step of displaying the word if the word meets the predetermined word confidence.
  - 13. The method of claim 12, wherein:
    - the step of determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech further comprises the step of if the word does not meet the predetermined word confidence, determining if at least one syllable that comprises the word and that is determined from the speech portion meets a predetermined syllable confidence; and
      
      the step of displaying the part of speech that meets the predetermined criteria for that part of speech comprises the step of if the word does not meet the predetermined word confidence, displaying the at least one syllable if the at least one syllable meets the predetermined syllable confidence.
  - 14. The method of claim 13, wherein:
    - the step of determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech further comprises the step of if the word does not meet the predetermined word confidence, determining if each of the at least one syllables that comprises the word and that is determined from the speech portion meets a predetermined syllable confidence; and
      
      the step of displaying the part of speech that meets the predetermined criteria for that part of speech comprises the steps of if the word does not meet the predetermined word confidence, displaying at least one phone for each of the at least one syllables that does not meet the predetermined syllable confidence.
  - 15. The method of claim 11, wherein the plurality of parts of speech comprise words, syllables or phones.
  - 16. The method of claim 11, wherein:
    - the step of determining for a speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech further comprises the steps of:
      
      determining confidence as a function of time for the speech portion;
      
      determining a confidence for a word by determining an average confidence of a time period spanned by the word; and
      
      determining if the confidence of the word meets a predetermined word confidence; and
      
      the step of displaying the part of speech that meets the predetermined criteria for that part of speech comprises the steps of:
      
      displaying the word if the confidence of the word meets the predetermined word confidence; and
      
      displaying at least one syllable that corresponds to the word if the confidence of the word does not meet the predetermined word confidence.
  - 17. The method of claim 16, wherein:
    - the step of determining for a speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech further comprises the steps of:
      
      determining a confidence for each of the at least one syllables that correspond to the word by determining an average confidence for each syllable, wherein each syllable spans a time period that is equal to or less than the time period spanned by the word; and
      
      determining if the confidence for each syllable meets a predetermined syllable confidence; and
      
      the step of displaying the part of speech that meets the predetermined criteria for that part of speech further comprises the steps of:
      
      for each of the syllables, displaying a syllable if the confidence of the syllable meets the predetermined syllable confidence; and
      
      for each of the syllables, displaying at least one phone that corresponds to the syllable if the confidence of the syllable does not meet the predetermined syllable confidence.
  - 19. The system of claim 18, wherein the computer-readable code is further configured to:
    - provide a history comprising words already learned by a student; and
      
      use the history to determine how likely it is that a student made a reading mistake.
  - 20. The system of claim 19, wherein:
    - the computer-readable code is further configured to:
      
      determine, by using confidence as a function of time, a series of phones in the recognized sentence portion, where each phone in the series is selected as a most likely phone;
      
      determine a correct phonetic pronunciation of a word; and
      
      determine if phones in the series that correspond to the word match the correct phonetic pronunciation of the word; and
      
      the computer-readable code is further configured, when using the history to determine how likely it is that a student made a reading mistake further, to:
      
      if a confidence score of one or more of the phones is below a predetermined phone confidence and if the word is marked as previously learned, display the correct phonetic pronunciation; and
      
      if a confidence score of one or more of the phones is below a predetermined phone confidence and if the word is not marked as previously learned, display the correct phonetic pronunciation and emphasize the phones that are incorrect.
  - 21. The system of claim 18, wherein:
    - the computer-readable code is further configured, when transforming a plurality of hypothesis scores of the recognized sentence portion to phone level, to:
      
      determine a plurality of hypotheses for a recognized sentence portion;
      
      convert the plurality of hypotheses into a sequence of phones;
      
      determine a probability from each hypothesis score; and
      
      determine start and end times for each phone, whereby the probabilities may be assigned to each phone and the hypothesis scores are thereby transformed to phone level; and
      
      the computer-readable code is further configured, when determining confidence as a function of time, to:
      
      associate a number of hypotheses, associated probabilities and phones with each of a plurality of frames; and
      
      compute, for each frame, a frame confidence by adding probabilities of all hypotheses for which a phone hypothesized at time t matches a phone hypothesized at time t in a top hypothesis.
  - 22. The system of claim 18, wherein the computer-readable code is further configured, when using the confidence as a function of time to determine confidences for parts of speech in the recognized sentence portion, to, for each part of speech of interest:
    - select a part of speech that spans a time period;
      
      determine an average confidence over that time period; and
      
      equate the average confidence over that time period as the confidence of the part of speech.
  - 23. The system of claim 22, wherein the computer-readable code is further configured to combine, for each part of speech of interest, the confidence of this part of speech with one or more additional confidences for this part of speech determined through other methods.
  - 24. The system of claim 18, wherein the parts of speech comprise words, syllables and phones.
  - 26. The system of claim 25, wherein:
    - the computer-readable code is further configured, when determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech, to determine if a word determined from the speech portion meets a predetermined word confidence; and
      
      the computer-readable code is further configured, when displaying the part of speech that meets the predetermined criteria for that part of speech, to display the word if the word meets the predetermined word confidence.
  - 27. The system of claim 26, wherein:
    - the computer-readable code is further configured, when determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech, to, if the word does not meet the predetermined word confidence, determine if at least one syllable that comprises the word and that is determined from the speech portion meets a predetermined syllable confidence; and
      
      the computer-readable code is further configured, when displaying the part of speech that meets the predetermined criteria for that part of speech, to, if the word does not meet the predetermined word confidence, display the at least one syllable if the at least one syllable meets the predetermined syllable confidence.
  - 28. The system of claim 27, wherein:
    - the computer-readable code is further configured, when determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech, to, if the word does not meet the predetermined word confidence, determine if each of the at least one syllables that comprises the word and that is determined from the speech portion meets a predetermined syllable confidence; and
      
      the computer-readable code is further configured, when displaying the part of speech that meets the predetermined criteria for that part of speech, to, if the word does not meet the predetermined word confidence, display at least one phone for each of the at least one syllables that does not meet the predetermined syllable confidence.
  - 29. The system of claim 25, wherein the plurality of parts of speech comprise words, syllables or phones.
  - 35. The article of manufacture of claim 22, wherein the computer-readable code further comprises a step to combine, for each part of speech of interest, the confidence of this part of speech with one or more additional confidences for this part of speech determined through other methods.
  - 36. The article of manufacture of claim 22, wherein the parts of speech comprise words, syllables and phones.
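
The history-based reading feedback of claims 19 and 20 (mirroring method claims 4 and 5) can be sketched as below. The function name `reading_feedback`, the phone threshold, the asterisk emphasis and the position-by-position comparison of phones are assumptions made for illustration; the claims do not specify how phones are aligned or how emphasis is rendered.

```python
from typing import List, Optional, Set, Tuple


def reading_feedback(word: str,
                     recognized: List[Tuple[str, float]],
                     correct: List[str],
                     learned: Set[str],
                     phone_threshold: float = 0.5) -> Optional[str]:
    """Return feedback text, or None when every recognized phone is confident.

    recognized: most likely (phone, confidence) pairs for the word.
    correct:    the correct phonetic pronunciation of the word.
    learned:    the history of words the student has already learned.
    """
    if all(conf >= phone_threshold for _, conf in recognized):
        return None
    if word in learned:
        # Previously learned word: show the correct pronunciation only.
        return " ".join(correct)
    # Word not yet learned: also emphasize the phones that differ.
    heard = [phone for phone, _ in recognized]
    marked = []
    for i, phone in enumerate(correct):
        matches = i < len(heard) and heard[i] == phone  # naive positional match
        marked.append(phone if matches else f"*{phone}*")
    return " ".join(marked)


# Made-up example: the student read "cat" with an uncertain vowel.
heard_cat = [("K", 0.9), ("AH", 0.3), ("T", 0.8)]
print(reading_feedback("cat", heard_cat, ["K", "AE", "T"], learned={"dog"}))
# K *AE* T
print(reading_feedback("cat", heard_cat, ["K", "AE", "T"], learned={"cat"}))
# K AE T
```

A production system would likely align recognized and reference phones with an edit-distance alignment rather than by position; the positional comparison here only keeps the sketch short.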

18. A system comprising:
- a memory that stores computer-readable code; and
  
  a processor operatively coupled to the memory, the processor configured to implement the computer-readable code, the computer-readable code configured to:
  
  provide a recognized sentence portion comprising words and syllables;
  
  transform a plurality of hypothesis scores of the recognized sentence portion to phone level;
  
  determine, by using the transformed hypothesis scores, confidence of the recognized sentence portion as a function of time; and
  
  use the confidence as a function of time to determine confidences for parts of speech in the recognized sentence portion.

25. A system for hierarchical transcription and display of input speech, the system comprising:
- a memory that stores computer-readable code; and
  
  a processor operatively coupled to the memory, the processor configured to implement the computer-readable code, the computer-readable code configured to:
  
  determine for a speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech; and
  
  display the part of speech that meets the predetermined criteria for that part of speech.

30. An article of manufacture comprising:
- a computer-readable medium having computer-readable code embodied thereon, the computer-readable code comprising:
  
  a step to provide a recognized sentence portion comprising words and syllables;
  
  a step to transform a plurality of hypothesis scores of the recognized sentence portion to phone level;
  
  a step to determine, by using the transformed hypothesis scores, confidence of the recognized sentence portion as a function of time; and
  
  a step to use the confidence as a function of time to determine confidences for parts of speech in the recognized sentence portion.
- - 31. The article of manufacture of claim 30, wherein the computer-readable code further comprises:
    - a step to provide a history comprising words already learned by a student; and
      
      a step to use the history to determine how likely it is that a student made a reading mistake.
  - 32. The article of manufacture of claim 31, wherein:
    - the computer-readable code further comprises:
      
      a step to determine, by using confidence as a function of time, a series of phones in the recognized sentence portion, where each phone in the series is selected as a most likely phone;
      
      a step to determine a correct phonetic pronunciation of a word; and
      
      a step to determine if phones in the series that correspond to the word match the correct phonetic pronunciation of the word; and
      
      the computer-readable code further comprises, when using the history to determine how likely it is that a student made a reading mistake further:
      
      a step to if a confidence score of one or more of the phones is below a predetermined phone confidence and if the word is marked as previously learned, display the correct phonetic pronunciation; and
      
      a step to if a confidence score of one or more of the phones is below a predetermined phone confidence and if the word is not marked as previously learned, display the correct phonetic pronunciation and emphasize the phones that are incorrect.
  - 33. The article of manufacture of claim 30, wherein:
    - the computer-readable code further comprises, when transforming a plurality of hypothesis scores of the recognized sentence portion to phone level:
      
      a step to determine a plurality of hypotheses for a recognized sentence portion;
      
      a step to convert the plurality of hypotheses into a sequence of phones;
      
      a step to determine a probability from each hypothesis score; and
      
      a step to determine start and end times for each phone, whereby the probabilities may be assigned to each phone and the hypothesis scores are thereby transformed to phone level; and
      
      the computer-readable code further comprises, when determining confidence as a function of time:
      
      a step to associate a number of hypotheses, associated probabilities and phones with each of a plurality of frames; and
      
      a step to compute, for each frame, a frame confidence by adding probabilities of all hypotheses for which a phone hypothesized at time t matches a phone hypothesized at time t in a top hypothesis.
  - 34. The article of manufacture of claim 30, wherein the computer-readable code further comprises, when using the confidence as a function of time to determine confidences for parts of speech in the recognized sentence portion, for each part of speech of interest:
    - a step to select a part of speech that spans a time period;
      
      a step to determine an average confidence over that time period; and
      
      a step to equate the average confidence over that time period as the confidence of the part of speech.
  - 38. The article of manufacture of claim 37, wherein:
    - the computer-readable code further comprises, when determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech, a step to determine if a word determined from the speech portion meets a predetermined word confidence; and
      
      the computer-readable code further comprises, when displaying the part of speech that meets the predetermined criteria for that part of speech, a step to display the word if the word meets the predetermined word confidence.
  - 39. The article of manufacture of claim 38, wherein:
    - the computer-readable code further comprises, when determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech, a step to, if the word does not meet the predetermined word confidence, determine if at least one syllable that comprises the word and that is determined from the speech portion meets a predetermined syllable confidence; and
      
      the computer-readable code further comprises, when displaying the part of speech that meets the predetermined criteria for that part of speech, a step to, if the word does not meet the predetermined word confidence, display the at least one syllable if the at least one syllable meets the predetermined syllable confidence.
  - 40. The article of manufacture of claim 39, wherein:
    - the computer-readable code further comprises, when determining for a decoded speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech, a step to, if the word does not meet the predetermined word confidence, determine if each of the at least one syllables that comprises the word and that is determined from the speech portion meets a predetermined syllable confidence; and
      
      the computer-readable code further comprises, when displaying the part of speech that meets the predetermined criteria for that part of speech, a step to, if the word does not meet the predetermined word confidence, display at least one phone for each of the at least one syllables that does not meet the predetermined syllable confidence.
  - 41. The article of manufacture of claim 37, wherein the plurality of parts of speech comprise words, syllables or phones.
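
Claim 33 recites determining a probability from each hypothesis score and assigning it to each timed phone, without saying how the probability is derived. The sketch below assumes the scores are log-likelihoods and normalizes them with a softmax; both that assumption and the names `scores_to_probabilities` and `to_phone_level` are hypothetical and serve only to make the transformation to phone level concrete.

```python
import math
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (phone, start_frame, end_frame_exclusive)


def scores_to_probabilities(log_scores: List[float]) -> List[float]:
    """Normalize log-domain hypothesis scores into probabilities (softmax).
    Subtracting the maximum first avoids underflow for very negative scores."""
    peak = max(log_scores)
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def to_phone_level(hypotheses: List[Tuple[float, List[Segment]]]
                   ) -> List[List[Tuple[str, int, int, float]]]:
    """Attach each hypothesis probability to every timed phone it contains."""
    probs = scores_to_probabilities([score for score, _ in hypotheses])
    return [[(phone, start, end, p) for phone, start, end in segs]
            for p, (_, segs) in zip(probs, hypotheses)]


# Made-up example: two hypotheses with equal scores share the probability mass.
n_best = [
    (-5.0, [("K", 0, 2), ("AE", 2, 4)]),
    (-5.0, [("K", 0, 2), ("AH", 2, 4)]),
]
print(to_phone_level(n_best)[0])  # [('K', 0, 2, 0.5), ('AE', 2, 4, 0.5)]
```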

37. An article of manufacture comprising:
- a computer-readable medium having computer-readable code embodied thereon, the computer-readable code comprising:
  
  a step to determine for a speech portion which of a plurality of parts of speech meets predetermined criteria for that part of speech; and
  
  a step to display the part of speech that meets the predetermined criteria for that part of speech.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Maison, Benoit Emmanuel; Basson, Sara H.; Kanevsky, Dimitri
Granted Patent: US 6,785,650 B2
US Class Current: 704/235
CPC Class Codes: G10L 15/22 Procedures used during a sp...
