Methodology for implementing a vocabulary set for use in a speech recognition system

US 20030110031A1
Filed: 03/14/2002
Published: 06/12/2003
Est. Priority Date: 12/07/2001
Status: Active Grant

Abstract

The present invention comprises a methodology for implementing a vocabulary set for use in a speech recognition system, and may preferably include a recognizer for analyzing utterances from the vocabulary set to generate N-best lists of recognition candidates. The N-best lists may then be utilized to create an acoustical matrix configured to relate said utterances to top recognition candidates from said N-best lists, as well as a lexical matrix configured to relate the utterances to the top recognition candidates from the N-best lists only when second-highest recognition candidates from the N-best lists are correct recognition results. An utterance ranking may then preferably be created according to composite individual error/accuracy values for each of the utterances. The composite individual error/accuracy values may preferably be derived from both the acoustical matrix and the lexical matrix. Lowest-ranked utterances from the foregoing utterance ranking may preferably be repeatedly eliminated from the vocabulary set when a total error/accuracy value for all of the utterances fails to exceed a predetermined threshold value.
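As a rough illustration of the two matrices described in the abstract, the following Python sketch counts recognition outcomes from N-best lists. It is a minimal sketch under stated assumptions, not the patented implementation: the function name `build_matrices`, the `(spoken, nbest)` trial format, and the sample vocabulary are all hypothetical. Rows index input utterances and columns index recognition results.

```python
def build_matrices(vocabulary, trials):
    """Count top-candidate outcomes (acoustical) and near misses (lexical).

    vocabulary: list of utterance strings (hypothetical sample data below).
    trials: iterable of (spoken, nbest) pairs; nbest is ordered best-first.
    """
    index = {word: k for k, word in enumerate(vocabulary)}
    n = len(vocabulary)
    acoustical = [[0] * n for _ in range(n)]
    lexical = [[0] * n for _ in range(n)]
    for spoken, nbest in trials:
        if not nbest:
            continue  # assumed: skip trials where no candidate was returned
        row, col = index[spoken], index[nbest[0]]
        # Acoustical matrix: add 1 for every top recognition candidate.
        acoustical[row][col] += 1
        # Lexical matrix: add 1 only when the top candidate is wrong and
        # the second-highest candidate is the correct result.
        top_is_wrong = nbest[0] != spoken
        second_is_right = len(nbest) > 1 and nbest[1] == spoken
        if top_is_wrong and second_is_right:
            lexical[row][col] += 1
    return acoustical, lexical


vocabulary = ["play", "pause", "stop"]
trials = [
    ("play", ["play", "pause", "stop"]),   # correct
    ("play", ["pause", "play", "stop"]),   # wrong, but second-best is right
    ("pause", ["pause", "play", "stop"]),  # correct
    ("stop", ["pause", "play", "stop"]),   # wrong, second-best also wrong
]
acoustical, lexical = build_matrices(vocabulary, trials)
```

In this sample, the second trial contributes to both matrices, while the last trial contributes only to the acoustical matrix because its second-highest candidate is not the correct result.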

43 Claims

1. A system for implementing a vocabulary set for a speech recognizer, comprising:
- a recognizer for analyzing utterances from said vocabulary set to generate N-best lists of recognition candidates;
  
  an acoustical matrix configured to relate said utterances to top recognition candidates from said N-best lists;
  
  a lexical matrix configured to relate said utterances to said top recognition candidates from said N-best lists only when second-highest recognition candidates from said N-best lists are correct recognition results; and
  
  an utterance ranking created according to composite individual error/accuracy values for each of said utterances, said composite individual error/accuracy values being derived from both said acoustical matrix and said lexical matrix, a lowest-ranked utterance being eliminated from said vocabulary set when a total error/accuracy value for all of said utterances does not exceed a predetermined threshold.
- - 2. The system of claim 1 wherein an initial set of said utterances from said vocabulary set are defined based upon intended tasks of said speech recognizer.
  - 3. The system of claim 2 wherein each of said intended tasks of said speech recognizer is associated with one or more alternative commands for requesting said intended tasks.
  - 4. The system of claim 1 wherein said recognizer analyzes each of said utterances by comparing said utterances with word models of said vocabulary set to generate recognition scores.
  - 5. The system of claim 4 wherein said recognizer creates said N-best lists to rank said recognition candidates for each of said utterances according to said recognition scores.
  - 6. The system of claim 1 wherein said acoustical matrix includes input utterances that are vertically configured in said acoustical matrix, and recognition results that are horizontally configured in said acoustical matrix, said acoustical matrix being populated by acoustical matrix values by adding a value of 1 to a corresponding acoustical matrix location each time one of said top recognition candidates is identified as a recognition result for a corresponding one of said input utterances.
  - 7. The system of claim 1 wherein an individual acoustical error value, Acoustical Error_i, for an input utterance may be calculated with information from an acoustical matrix row by utilizing a formula:
    - $$\text{Acoustical Error}_i = \frac{\sum \text{Incorrect}_i}{\text{Correct}_i + \sum \text{Incorrect}_i}$$
      where said $\text{Correct}_i$ is an acoustical matrix value for a correctly-identified recognition result from said input utterance, and said $\sum \text{Incorrect}_i$ is a sum of all acoustical matrix values for incorrectly-identified recognition results from said input utterance.
  - 8. The system of claim 1 wherein said lexical matrix includes input utterances that are vertically configured in said lexical matrix, and recognition results that are horizontally configured in said lexical matrix, said lexical matrix being populated by lexical matrix values by adding a value of 1 to a lexical matrix location for a recognition result of one of said top recognition candidates and an input utterance, but only when said one of said top recognition candidates is incorrectly identified by said recognizer, and a corresponding one of said second-highest recognition candidates is a correct recognition result for said input utterance.
  - 9. The system of claim 1 wherein an individual lexical error value, Lexical Error_j, for one of said recognition results may be calculated from a lexical matrix column by utilizing a formula:
    - $$\text{Lexical Error}_j = \frac{\sum \text{Incorrect}_j}{\text{Correct}_i + \sum \text{Incorrect}_i}$$
      where said $\sum \text{Incorrect}_j$ is a sum of all lexical matrix values for incorrectly-identified input utterances for a particular recognition result that have the correct recognition result as a second-highest recognition candidate, said $\text{Correct}_i$ is an acoustical matrix value for a correctly-identified recognition result from an individual input utterance, and said $\sum \text{Incorrect}_i$ is a sum of all acoustical matrix values for incorrectly-identified recognition results from said individual input utterance.
  - 10. The system of claim 1 wherein said composite individual error/accuracy values for each of said utterances are implemented as a composite Acoustical-Lexical Error that is calculated according to a formula:
    - $$\text{Acoustical-Lexical Error} = \text{Acoustical Error}_i + \text{Lexical Error}_j$$
      where said $\text{Acoustical Error}_i$ is an individual acoustical error value for one of said utterances from said acoustical matrix, and said $\text{Lexical Error}_j$ is an individual lexical error value for said one of said utterances from said lexical matrix.
  - 11. The system of claim 1 wherein said composite individual error/accuracy values for each of said utterances are implemented as an Acoustical-Lexical Accuracy that is calculated according to a formula:
    - $$\text{Acoustical-Lexical Accuracy} = \frac{\text{Correct}_i - \sum \text{Incorrect}_j}{\text{Correct}_i + \sum \text{Incorrect}_i}$$
      where said $\text{Correct}_i$ is an acoustical matrix value for a correctly-identified recognition result and an input utterance, said $\sum \text{Incorrect}_j$ is a sum of all lexical matrix values for incorrectly-identified input utterances for a recognition result that has a correct recognition result as one of said second-highest recognition candidates, and said $\sum \text{Incorrect}_i$ is a summation of all acoustical matrix values for incorrectly-identified recognition results from said input utterance.
  - 12. The system of claim 1 wherein said utterances of said utterance ranking are preferably ranked by respective individual composite acoustical-lexical error values, or by respective individual composite acoustical-lexical accuracy values, with said lowest-ranked utterance having a lowest individual composite acoustical-lexical error value, or a lowest individual composite acoustical-lexical accuracy value.
  - 13. The system of claim 1 wherein said total error/accuracy value for all of said utterances is implemented as a total acoustical error value, Acoustical Error_T, that is calculated according to a formula:
    - $$\text{Acoustical Error}_T = \frac{\sum \text{Incorrect}_T}{\sum \text{Correct}_T + \sum \text{Incorrect}_T}$$
      where said $\text{Correct}_T$ is a sum from said acoustical matrix for correctly-identified recognition results from all input utterances, and said $\sum \text{Incorrect}_T$ is a summation from said acoustical matrix for incorrectly-identified recognition results from said all input utterances.
  - 14. The system of claim 1 wherein accuracy values, Accuracy, may be calculated from corresponding error values, Error, to implement said composite individual error/accuracy values or said total error/accuracy value according to a formula:
    - $$\text{Error} = 1 - \text{Accuracy}$$
      where either said error values or said accuracy values are alternately utilized to evaluate individual or total utterance recognition characteristics for said speech recognizer.
  - 15. The system of claim 1 wherein said total error/accuracy value is compared to said predetermined threshold to determine whether said vocabulary set is optimized, said predetermined threshold being selected to produce desired speech recognition performance characteristics in said speech recognizer.
  - 16. The system of claim 1 wherein said vocabulary set is finalized when said total error/accuracy value is implemented as a total error value and said predetermined threshold is greater than said total error value, or when said total error/accuracy value is implemented as a total accuracy value and said predetermined threshold is less than said total accuracy value.
  - 17. The system of claim 1 wherein multiple lower-ranked utterances are eliminated from said utterance ranking when said total error/accuracy value for all of said utterances does not exceed said predetermined threshold.
  - 18. The system of claim 1 wherein acoustical matrix values from said acoustical matrix and lexical matrix values from said lexical matrix are set to zero for said lowest-ranked utterance to thereby generate an updated acoustical matrix and an updated lexical matrix.
  - 19. The system of claim 18 wherein said total error/accuracy value for all remaining utterances is repeatedly recalculated by using revised acoustical matrix values from said updated acoustical matrix, said total error/accuracy value then being iteratively recalculated to eliminate lower ranked utterances from said vocabulary set until said predetermined threshold value is exceeded.
  - 20. The system of claim 1 wherein, after eliminating said lowest-ranked utterance from said vocabulary set, said recognizer reanalyzes remaining utterances from said vocabulary set, and responsively generates new N-best lists which may then be utilized to create a new acoustical matrix and a new lexical matrix for ranking said remaining utterances.
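The formulas in claims 7, 9, 11 and 13 can be sketched in Python as follows. This is an illustrative reading, not a definitive implementation: the function names and sample matrices are hypothetical, and the sketch assumes that the lexical error for utterance `k` uses column `k` of the lexical matrix together with row `k` of the acoustical matrix (that is, it takes `i = j = k`), which the claims do not state explicitly.

```python
def acoustical_error(acoustical, k):
    """Claim 7: incorrect results in row k divided by all results in row k."""
    total = sum(acoustical[k])
    if total == 0:
        return 0.0  # assumed convention for an utterance with no trials
    return (total - acoustical[k][k]) / total


def lexical_error(acoustical, lexical, k):
    """Claim 9 (assumed i = j = k): lexical column sum over the row total."""
    total = sum(acoustical[k])
    if total == 0:
        return 0.0
    column_sum = sum(lexical[r][k] for r in range(len(lexical)) if r != k)
    return column_sum / total


def composite_accuracy(acoustical, lexical, k):
    """Claim 11: (Correct_i - sum Incorrect_j) / (Correct_i + sum Incorrect_i)."""
    total = sum(acoustical[k])
    if total == 0:
        return 0.0
    column_sum = sum(lexical[r][k] for r in range(len(lexical)) if r != k)
    return (acoustical[k][k] - column_sum) / total


def total_acoustical_error(acoustical):
    """Claim 13: off-diagonal counts divided by all counts."""
    total = sum(sum(row) for row in acoustical)
    correct = sum(acoustical[k][k] for k in range(len(acoustical)))
    return (total - correct) / total if total else 0.0


def rank_utterances(vocabulary, acoustical, lexical):
    """Order utterances from highest to lowest composite accuracy."""
    order = sorted(
        range(len(vocabulary)),
        key=lambda k: composite_accuracy(acoustical, lexical, k),
        reverse=True,
    )
    return [vocabulary[k] for k in order]


# Hypothetical sample data: rows are input utterances, columns are results.
vocabulary = ["play", "pause", "stop"]
acoustical = [[8, 2, 0], [1, 9, 0], [0, 0, 10]]
lexical = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
ranking = rank_utterances(vocabulary, acoustical, lexical)
```

With this reading, the composite accuracy equals one minus the sum of the acoustical and lexical errors, which matches the relationship in claim 14. The last entry of `ranking` is the candidate for elimination when ranking by accuracy.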

21. A method for implementing a vocabulary set for a speech recognizer, comprising the steps of:
- analyzing utterances from said vocabulary set with a recognizer to generate N-best lists of recognition candidates;
  
  relating said utterances to top recognition candidates from said N-best lists with an acoustical matrix;
  
  compiling a lexical matrix that relates said utterances to said top recognition candidates from said N-best lists only when second-highest recognition candidates from said N-best lists are correct recognition results; and
  
  creating an utterance ranking according to composite individual error/accuracy values for each of said utterances, said composite individual error/accuracy values being derived from both said acoustical matrix and said lexical matrix, a lowest-ranked utterance being eliminated from said vocabulary set when a total error/accuracy value for all of said utterances does not exceed a predetermined threshold.
- - 22. The method of claim 21 wherein an initial set of said utterances from said vocabulary set are defined based upon intended tasks of said speech recognizer.
  - 23. The method of claim 22 wherein each of said intended tasks of said speech recognizer is associated with one or more alternative commands for requesting said intended tasks.
  - 24. The method of claim 21 wherein said recognizer analyzes each of said utterances by comparing said utterances with word models of said vocabulary set to generate recognition scores.
  - 25. The method of claim 24 wherein said recognizer creates said N-best lists to rank said recognition candidates for each of said utterances according to said recognition scores.
  - 26. The method of claim 21 wherein said acoustical matrix includes input utterances that are vertically configured in said acoustical matrix, and recognition results that are horizontally configured in said acoustical matrix, said acoustical matrix being populated by acoustical matrix values by adding a value of 1 to a corresponding acoustical matrix location each time one of said top recognition candidates is identified as a recognition result for a corresponding one of said input utterances.
  - 27. The method of claim 21 wherein an individual acoustical error value, Acoustical Error_i, for an input utterance may be calculated with information from an acoustical matrix row by utilizing a formula:
    - $$\text{Acoustical Error}_i = \frac{\sum \text{Incorrect}_i}{\text{Correct}_i + \sum \text{Incorrect}_i}$$
      where said $\text{Correct}_i$ is an acoustical matrix value for a correctly-identified recognition result from said input utterance, and said $\sum \text{Incorrect}_i$ is a sum of all acoustical matrix values for incorrectly-identified recognition results from said input utterance.
  - 28. The method of claim 21 wherein said lexical matrix includes input utterances that are vertically configured in said lexical matrix, and recognition results that are horizontally configured in said lexical matrix, said lexical matrix being populated by lexical matrix values by adding a value of 1 to a lexical matrix location for a recognition result of one of said top recognition candidates and an input utterance, but only when said one of said top recognition candidates is incorrectly identified by said recognizer, and a corresponding one of said second-highest recognition candidates is a correct recognition result for said input utterance.
  - 29. The method of claim 21 wherein an individual lexical error value, Lexical Error_j, for one of said recognition results may be calculated from a lexical matrix column by utilizing a formula:
    - $$\text{Lexical Error}_j = \frac{\sum \text{Incorrect}_j}{\text{Correct}_i + \sum \text{Incorrect}_i}$$
      where said $\sum \text{Incorrect}_j$ is a sum of all lexical matrix values for incorrectly-identified input utterances for a particular recognition result that have the correct recognition result as a second-highest recognition candidate, said $\text{Correct}_i$ is an acoustical matrix value for a correctly-identified recognition result from an individual input utterance, and said $\sum \text{Incorrect}_i$ is a sum of all acoustical matrix values for incorrectly-identified recognition results from said individual input utterance.
  - 30. The method of claim 21 wherein said composite individual error/accuracy values for each of said utterances are implemented as a composite Acoustical-Lexical Error that is calculated according to a formula:
    - $$\text{Acoustical-Lexical Error} = \text{Acoustical Error}_i + \text{Lexical Error}_j$$
      where said $\text{Acoustical Error}_i$ is an individual acoustical error value for one of said utterances from said acoustical matrix, and said $\text{Lexical Error}_j$ is an individual lexical error value for said one of said utterances from said lexical matrix.
  - 31. The method of claim 21 wherein said composite individual error/accuracy values for each of said utterances are implemented as an Acoustical-Lexical Accuracy that is calculated according to a formula:
    - $$\text{Acoustical-Lexical Accuracy} = \frac{\text{Correct}_i - \sum \text{Incorrect}_j}{\text{Correct}_i + \sum \text{Incorrect}_i}$$
      where said $\text{Correct}_i$ is an acoustical matrix value for a correctly-identified recognition result and an input utterance, said $\sum \text{Incorrect}_j$ is a sum of all lexical matrix values for incorrectly-identified input utterances for a recognition result that has a correct recognition result as one of said second-highest recognition candidates, and said $\sum \text{Incorrect}_i$ is a summation of all acoustical matrix values for incorrectly-identified recognition results from said input utterance.
  - 32. The method of claim 21 wherein said utterances of said utterance ranking are preferably ranked by respective individual composite acoustical-lexical error values, or by respective individual composite acoustical-lexical accuracy values, with said lowest-ranked utterance having a lowest individual composite acoustical-lexical error value, or a lowest individual composite acoustical-lexical accuracy value.
  - 33. The method of claim 21 wherein said total error/accuracy value for all of said utterances is implemented as a total acoustical error value, Acoustical Error_T, that is calculated according to a formula:
    - $$\text{Acoustical Error}_T = \frac{\sum \text{Incorrect}_T}{\sum \text{Correct}_T + \sum \text{Incorrect}_T}$$
      where said $\text{Correct}_T$ is a sum from said acoustical matrix for correctly-identified recognition results from all input utterances, and said $\sum \text{Incorrect}_T$ is a summation from said acoustical matrix for incorrectly-identified recognition results from said all input utterances.
  - 34. The method of claim 21 wherein accuracy values, Accuracy, may be calculated from corresponding error values, Error, to implement said composite individual error/accuracy values or said total error/accuracy value according to a formula:
    - $$\text{Error} = 1 - \text{Accuracy}$$
      where either said error values or said accuracy values are alternately utilized to evaluate individual or total utterance recognition characteristics for said speech recognizer.
  - 35. The method of claim 21 wherein said total error/accuracy value is compared to said predetermined threshold to determine whether said vocabulary set is optimized, said predetermined threshold being selected to produce desired speech recognition performance characteristics in said speech recognizer.
  - 36. The method of claim 21 wherein said vocabulary set is finalized when said total error/accuracy value is implemented as a total error value and said predetermined threshold is greater than said total error value, or when said total error/accuracy value is implemented as a total accuracy value and said predetermined threshold is less than said total accuracy value.
  - 37. The method of claim 21 wherein multiple lower-ranked utterances are eliminated from said utterance ranking when said total error/accuracy value for all of said utterances does not exceed said predetermined threshold.
  - 38. The method of claim 21 wherein acoustical matrix values from said acoustical matrix and lexical matrix values from said lexical matrix are set to zero for said lowest-ranked utterance to thereby generate an updated acoustical matrix and an updated lexical matrix.
  - 39. The method of claim 38 wherein said total error/accuracy value for all remaining utterances is repeatedly recalculated by using revised acoustical matrix values from said updated acoustical matrix, said total error/accuracy value then being iteratively recalculated to eliminate lower ranked utterances from said vocabulary set until said predetermined threshold value is exceeded.
  - 40. The method of claim 21 wherein, after eliminating said lowest-ranked utterance from said vocabulary set, said recognizer reanalyzes remaining utterances from said vocabulary set, and responsively generates new N-best lists which may then be utilized to create a new acoustical matrix and a new lexical matrix for ranking said remaining utterances.
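The iterative elimination in claims 38 and 39 might look like the Python sketch below. Several points are assumptions rather than statements from the claims: the function names are hypothetical, ranking uses the accuracy form of claim 31, "set to zero for said lowest-ranked utterance" is read as clearing both that utterance's row and its column, and the loop stops once total accuracy (one minus the total acoustical error of claim 33) exceeds the threshold, following claim 36.

```python
def total_accuracy(acoustical):
    """One minus the total acoustical error: diagonal counts over all counts."""
    total = sum(sum(row) for row in acoustical)
    correct = sum(acoustical[k][k] for k in range(len(acoustical)))
    return correct / total if total else 1.0


def composite_accuracy(acoustical, lexical, k):
    """Claim 31 form, assuming lexical column k pairs with acoustical row k."""
    total = sum(acoustical[k])
    if total == 0:
        return 0.0
    column_sum = sum(lexical[r][k] for r in range(len(lexical)) if r != k)
    return (acoustical[k][k] - column_sum) / total


def prune_vocabulary(vocabulary, acoustical, lexical, threshold):
    """Drop the lowest-accuracy utterance until total accuracy > threshold."""
    # Work on copies so the caller's matrices are left untouched.
    acoustical = [row[:] for row in acoustical]
    lexical = [row[:] for row in lexical]
    active = list(range(len(vocabulary)))
    while len(active) > 1 and total_accuracy(acoustical) <= threshold:
        worst = min(
            active, key=lambda k: composite_accuracy(acoustical, lexical, k)
        )
        # Assumed reading of claim 38: clear the row and the column.
        for matrix in (acoustical, lexical):
            matrix[worst] = [0] * len(vocabulary)
            for row in matrix:
                row[worst] = 0
        active.remove(worst)
    return [vocabulary[k] for k in active]


# Hypothetical sample data: "pause" is often confused with "play".
vocabulary = ["play", "pause", "stop"]
acoustical = [[6, 4, 0], [3, 7, 0], [0, 0, 10]]
lexical = [[0, 3, 0], [1, 0, 0], [0, 0, 0]]
pruned = prune_vocabulary(vocabulary, acoustical, lexical, threshold=0.9)
```

Zeroing matrix entries only approximates how the recognizer would behave on the reduced vocabulary. Claim 40 describes the alternative of rerunning the recognizer on the remaining utterances to generate new N-best lists and new matrices.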

41. A computer-readable medium comprising program instructions for implementing a vocabulary set for a speech recognizer, by performing the steps of:
- analyzing utterances from said vocabulary set with a recognizer to generate N-best lists of recognition candidates;
  
  relating said utterances to top recognition candidates from said N-best lists with an acoustical matrix;
  
  compiling a lexical matrix that relates said utterances to said top recognition candidates from said N-best lists only when second-highest recognition candidates from said N-best lists are correct recognition results; and
  
  creating an utterance ranking according to composite individual error/accuracy values for each of said utterances, said composite individual error/accuracy values being derived from both said acoustical matrix and said lexical matrix, a lowest-ranked utterance being eliminated from said vocabulary set when a total error/accuracy value for all of said utterances does not exceed a predetermined threshold.

42. A system for implementing a vocabulary set for a speech recognizer, comprising the steps of:
- means for analyzing utterances from said vocabulary set to generate N-best lists of recognition candidates;
  
  means for relating said utterances to top recognition candidates from said N-best lists;
  
  means for correlating said utterances to said top recognition candidates from said N-best lists only when second-highest recognition candidates from said N-best lists are correct recognition results; and
  
  means for ranking said utterances according to composite individual error/accuracy values for each of said utterances, said composite individual error/accuracy values being derived from both said means for relating and said means for correlating, a lowest-ranked utterance being eliminated from said vocabulary set when a total error/accuracy value for all of said utterances does not exceed a predetermined threshold.

43. A system for implementing a vocabulary set for a speech recognizer, comprising:
- a recognizer for analyzing utterances from said vocabulary set to generate recognition candidates;
  
  an acoustical matrix configured to relate said utterances to top recognition candidates;
  
  a lexical matrix configured to relate said utterances to said top recognition candidates only when second-highest recognition candidates are correct recognition results; and
  
  an utterance ranking of said utterances based upon both said acoustical matrix and said lexical matrix, a lowest-ranked utterance being eliminated from said vocabulary set when a recognition accuracy for all of said utterances fails to exceed a predetermined threshold.

Current Assignee
Sony Corporation (Sony Group Corp.), Sony Electronics Inc. (Sony Group Corp.)
Original Assignee
Sony Corporation (Sony Group Corp.)
Inventors
Menendez-Pidal, Xavier; Olorenshaw, Lex S.

Granted Patent

US 6,970,818 B2
US Class Current

704/240
CPC Class Codes

G10L 15/10 using distance or distortio...
