Method of assessing degree of acoustic confusability, and system therefor
Abstract
Speech recognizer confusion is predicted for utterances that can be represented by any combination of text form and audio file. The utterances are represented with an intermediate representation that directly reflects the acoustic characteristics of the utterances. Text representations of the utterances can be used directly to predict confusability, without access to audio file examples of the utterances. In a first embodiment, two text utterances are represented with strings of phonemes, and the least cost of transforming one string of phonemes into the other serves as a confusability measure. In a second embodiment, two utterances are represented with an intermediate representation of sequences of acoustic events based on phonetic capabilities of speakers obtained from acoustic signals of the utterances, and the acoustic events are compared. Confusability of the utterances is predicted according to a formula 2K/(T), where K is the number of matched acoustic events and T is the total number of acoustic events.
33 Claims
1. A process comprising:
representing each text form of two spoken phrases with a first and second strings of phonemes, respectively; assigning costs to transform the first string to the second string of phonemes according to a speech recognizer confusability of arbitrary pairs of audio files; and calculating an acoustic confusability measure as a least cost, according to the cost assigning, to transform the first string of phonemes to the second string of phonemes, thereby predicting when a speech recognizer will confuse the two spoken phrases by directly using text forms of the spoken phrases. (Dependent claims: 2, 3, 4)
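The least-cost transformation recited in claim 1 can be illustrated with a weighted string edit distance over phoneme strings. The following is a minimal sketch under stated assumptions, not the patented implementation: the phoneme symbols, the `SUB_COST` table, and the unit insertion/deletion cost are all hypothetical values chosen for illustration. In the claim, the costs would instead be assigned according to recognizer confusability of pairs of audio files.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical substitution costs (assumption): a lower cost means the
# recognizer is assumed to confuse the two phonemes more readily.
SUB_COST: Dict[Tuple[str, str], float] = {
    ("M", "N"): 0.3,
    ("B", "P"): 0.4,
    ("S", "Z"): 0.4,
}
DEFAULT_SUB = 1.0  # assumed cost for any pair not listed above
INS_DEL = 1.0      # assumed cost of inserting or deleting one phoneme


def sub_cost(a: str, b: str) -> float:
    """Cost of replacing phoneme a with phoneme b (symmetric lookup)."""
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), DEFAULT_SUB))


def least_cost(first: Sequence[str], second: Sequence[str]) -> float:
    """Least total cost to transform `first` into `second` (dynamic programming)."""
    rows, cols = len(first) + 1, len(second) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + INS_DEL  # delete all of `first`
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + INS_DEL  # insert all of `second`
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + INS_DEL,                                   # deletion
                d[i][j - 1] + INS_DEL,                                   # insertion
                d[i - 1][j - 1] + sub_cost(first[i - 1], second[j - 1]),  # substitution or match
            )
    return d[-1][-1]


# Illustrative phoneme strings for "mat" and "gnat" (assumed transcriptions).
print(least_cost(["M", "AE", "T"], ["N", "AE", "T"]))
```

Under these assumed costs, a small result indicates phrases that are cheap to transform into one another and are therefore more likely to be confused.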
5. A process comprising:
representing text phrases with corresponding strings of phonemes; calculating an acoustic confusability measure as a least cost to transform one of the strings of phonemes into another of the strings of phonemes; determining and adjusting, using a speech recognizer, a threshold value according to confusability of example pairs of spoken phrases represented as audio files; and comparing the least cost to the threshold value to determine acoustic confusability of the text phrases when spoken to a speech recognizer.
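The threshold step of claim 5 can be sketched as follows. This is a hypothetical illustration, not the claimed procedure: it assumes that each example pair has already been reduced to a least cost and a boolean recording whether a speech recognizer confused the pair, and it selects the threshold by minimizing misclassifications. The function names and the example data are assumptions.

```python
from typing import List, Tuple


def choose_threshold(examples: List[Tuple[float, bool]]) -> float:
    """Pick the cost threshold that best separates confused from distinct pairs.

    Each example is (least_cost, confused), where `confused` is the recognizer's
    observed behavior on the audio pair. A pair is predicted confusable when
    its least cost is at or below the threshold.
    """
    if not examples:
        raise ValueError("at least one example pair is required")
    candidates = sorted({cost for cost, _ in examples})
    best_t, best_errors = candidates[0], len(examples) + 1
    for t in candidates:
        # Count examples where the prediction disagrees with the recognizer.
        errors = sum((cost <= t) != confused for cost, confused in examples)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def is_confusable(cost: float, threshold: float) -> bool:
    """Compare a least cost to the threshold, as in the final claim step."""
    return cost <= threshold


# Made-up calibration data (assumption): (least cost, recognizer confused?).
examples = [(0.3, True), (0.8, True), (1.0, True), (1.5, False), (2.0, False)]
threshold = choose_threshold(examples)
print(threshold, is_confusable(0.9, threshold))
```

Re-running `choose_threshold` as new example pairs are collected is one simple way to model the "adjusting" of the threshold.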
6. A process, comprising:
representing text phrases with corresponding strings of phonemes; calculating a distance between the strings of phonemes using a string edit distance algorithm as a least cost to transform according to a series of transformations one of the strings of phonemes into another of the strings of phonemes; determining and adjusting, using a speech recognizer, a threshold value according to confusability of example pairs of spoken phrases represented as audio files; and comparing the least cost to the threshold value to predict when a speech recognizer will confuse spoken phrases of the text phrases. (Dependent claims: 7, 8)
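The "series of transformations" in claim 6 can be made concrete by recovering the edit operations behind a string edit distance. The sketch below uses unit costs (plain Levenshtein distance) as a simplifying assumption; the claim itself does not fix the costs, and the function name `edit_script` and the phoneme symbols are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_script(src: Sequence[str], dst: Sequence[str]) -> Tuple[int, List[str]]:
    """Return the unit-cost edit distance and one least-cost series of edits."""
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == dst[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)

    # Walk back from the final cell to recover the transformations.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and src[i - 1] == dst[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub:
            if sub:
                ops.append(f"substitute {src[i - 1]} -> {dst[j - 1]}")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(f"delete {src[i - 1]}")
            i -= 1
        else:
            ops.append(f"insert {dst[j - 1]}")
            j -= 1
    ops.reverse()
    return d[n][m], ops


# Assumed transcriptions for "cat" and "cap".
print(edit_script(["K", "AE", "T"], ["K", "AE", "P"]))
```

With unit costs, the number of recovered operations equals the distance, which is then the value compared against the threshold.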
9. A process comprising calculating an acoustic confusability measure as a least cost to transform a first string of phonemes corresponding to a first text phrase to a second string of phonemes corresponding to a second text phrase, the process further comprising:
determining confusability of example pairs of audio files using a speech recognizer; calculating a confusability threshold value based upon the determined confusability, the threshold value compared to the least cost to determine when a speech recognizer will confuse spoken phrases of the two text phrases.
10. A computer system, comprising:
a processor programmed to control the computer system according to a process comprising: representing each text form of two spoken phrases with a first and second strings of phonemes, respectively, assigning costs to transform the first string to the second string of phonemes according to a speech recognizer confusability of arbitrary pairs of audio files, and calculating an acoustic confusability measure as a least cost, according to the cost assigning, to transform the first string of phonemes to the second string of phonemes, thereby predicting when a speech recognizer will confuse two spoken phrases by directly using text representations of the spoken phrases.
11. A computer system, comprising:
a processor programmed to represent text phrases with corresponding strings of phonemes, to calculate an acoustic confusability measure as a least cost to transform one of the strings of phonemes into another of the strings of phonemes, to determine and adjust, using a speech recognizer, a threshold value according to confusability of example pairs of spoken phrases represented as audio files, and to compare the least cost to the threshold value to determine acoustic confusability of the text phrases when spoken to a speech recognizer.
12. A computer system, comprising:
a processor programmed to represent text phrases with corresponding strings of phonemes, to calculate a distance between the strings of phonemes using a string edit distance algorithm as a least cost to transform according to a series of transformations one of the strings of phonemes into another of the strings of phonemes, to determine and adjust, using a speech recognizer, a threshold value according to confusability of example pairs of spoken phrases represented as audio files, and to compare the least cost to the threshold value to predict when a speech recognizer will confuse spoken forms of the text phrases. (Dependent claim: 13)
14. A computer system programmed to calculate an acoustic confusability measure as a least cost to transform a first string of phonemes corresponding to a first text phrase to a second string of phonemes corresponding to a second text phrase, the system further comprising:
a processor programmed to determine confusability of example pairs of audio files using a speech recognizer, and to calculate a confusability threshold value based upon the determined confusability, the threshold value compared to the least cost to determine when a speech recognizer will confuse spoken form of the two text phrases.
15. A process of predicting when a speech recognizer will confuse two spoken phrases where two utterances corresponding to the spoken phrases in any combination of text and audio file are available to a confusability prediction algorithm, comprising:
representing each utterance with corresponding sequences of acoustic characteristics based on phonetic capabilities of speakers obtained from acoustic signals of the utterances; aligning the sequences of acoustic characteristics; comparing the sequences of acoustic characteristics; and calculating a metric of acoustic confusability according to a formula 2K/(T), wherein K is a number of acoustic characteristics that match from the comparing and T is a total number of acoustic characteristics in the sequences of acoustic characteristics for both utterances. (Dependent claims: 16, 17)
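The align, compare, and 2K/(T) steps of claim 15 can be sketched as below. The alignment method is an assumption: the sketch aligns the two sequences by their longest common subsequence and takes its length as K, whereas the claim does not specify how the alignment is performed. The acoustic characteristic labels (`"stop"`, `"vowel"`, and so on) and the function name are hypothetical.

```python
from typing import Sequence


def confusability_metric(a: Sequence[str], b: Sequence[str]) -> float:
    """Compute 2K/T, with K matched characteristics and T = len(a) + len(b).

    K is taken as the length of the longest common subsequence, which is one
    possible order-preserving alignment (an assumption of this sketch).
    """
    n, m = len(a), len(b)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1  # characteristics match
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    k = lcs[n][m]
    t = n + m
    # Returning 0.0 for two empty sequences is an arbitrary convention.
    return 2 * k / t if t else 0.0


# Made-up characteristic sequences for two utterances.
first = ["stop", "vowel", "nasal"]
second = ["stop", "vowel", "fricative"]
print(confusability_metric(first, second))
```

The metric ranges from 0 (no matching characteristics) to 1 (identical sequences), so higher values indicate greater predicted confusability.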
18. A process comprising:
representing two utterances that are in any combination of a text form and an audio file with corresponding sequences of acoustic characteristics based on phonetic capabilities of speakers obtained from acoustic signals of the utterances; aligning the sequences of acoustic characteristics; comparing the sequences of acoustic characteristics; and calculating a metric of acoustic confusability according to the comparing. (Dependent claim: 19)
20. A process comprising:
representing two utterances with corresponding sequences of acoustic characteristics based on phonetic capabilities of speakers obtained from acoustic signals of the utterances; aligning the sequences of acoustic characteristics; comparing the sequences of acoustic characteristics; and calculating a metric of acoustic confusability according to a formula 2K/(T), wherein K is a number of acoustic characteristics that match from the comparing and T is a total number of acoustic characteristics in the sequences of acoustic characteristics for both utterances. (Dependent claim: 21)
22. A process, comprising:
representing a first phrase and a second phrase, respectively, with a corresponding sequence of acoustic features and acoustic measures obtained from a dictionary of acoustic features and a database of prerecorded audio files; aligning the sequence of acoustic features and acoustic measures of the first phrase with the sequence of acoustic features and acoustic measures of the second phrase; comparing the aligned first phrase and second phrase sequences; and calculating a metric of acoustic confusability according to the comparing according to a formula 2K/(N+M), wherein K is a number of acoustic features and acoustic measures that match from the comparing and N is a number of acoustic features and acoustic measures in the first phrase and M is a number of acoustic features and acoustic measures in the second phrase.
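The formula 2K/(N+M) in claim 22 can be worked through with assumed counts. The symbol C for the metric and the numbers below are illustrative assumptions, and the upper bound holds only if each feature or measure is matched at most once, so that K cannot exceed the shorter sequence.

```latex
% C denotes the confusability metric (symbol introduced here for illustration).
\[
C = \frac{2K}{N + M}, \qquad
0 \le K \le \min(N, M) \;\Longrightarrow\; 0 \le C \le 1 .
\]
% Worked example with assumed counts N = 5, M = 4, K = 3:
\[
C = \frac{2 \cdot 3}{5 + 4} = \frac{6}{9} \approx 0.67 .
\]
```

Two identical sequences (K = N = M) give a value of 1, and sequences with no matches give 0.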
23. A computer system, comprising:
a processor programmed to predict when a speech recognizer will confuse two spoken phrases where two phrases in any combination of text and spoken form are available to a confusability prediction algorithm, according to a process comprising: representing each phrase with corresponding sequences of acoustic characteristics based on phonetic capabilities of speakers obtained from acoustic signals of the phrases; aligning the sequences of acoustic characteristics; comparing the sequences of acoustic characteristics; and calculating a metric of acoustic confusability according to a formula 2K/(T), wherein K is a number of acoustic characteristics that match from the comparing and T is a total number of acoustic characteristics in the sequences of acoustic characteristics for both phrases. (Dependent claims: 24, 25)
26. A computer system, comprising:
a processor programmed to represent two spoken phrases that are in any combination of a text form and an audio file with corresponding sequences of acoustic characteristics based on phonetic capabilities of speakers obtained from acoustic signals of the phrases, to align the sequences of acoustic characteristics, to compare the sequences of acoustic characteristics, and to calculate a metric of acoustic confusability according to the comparing. (Dependent claim: 27)
28. A computer system, comprising:
a processor programmed to represent a first phrase and a second phrase, respectively, with a corresponding sequence of acoustic features and acoustic measures obtained from a dictionary of acoustic features and a database of prerecorded audio files, to align the sequence of acoustic features and acoustic measures of the first phrase with the sequence of acoustic features and acoustic measures of the second phrase, to compare the aligned first phrase and second phrase sequences, and to calculate a metric of acoustic confusability according to the comparing according to a formula 2K/(N+M), wherein K is a number of acoustic features and acoustic measures that match from the comparing and N is a number of acoustic features and acoustic measures in the first phrase and M is a number of acoustic features and acoustic measures in the second phrase.
29. A process, comprising:
receiving two utterances in any combination of a text form and an audio file, the audio file having a speech signal recorded by any speaker and under any acoustic condition; representing the two utterances with an intermediate representation that directly reflects the salient acoustic characteristics of the two utterances when spoken; and predicting when a speech recognizer will confuse the two utterances when spoken by using the intermediate representation of the two utterances. (Dependent claim: 30)
31. A computer system, comprising:
a processor programmed to receive two phrases in any combination of text and spoken form, the spoken form spoken by any speaker and under any acoustic condition, to represent the two phrases with an intermediate representation that directly reflects the salient acoustic characteristics of the two phrases when spoken, and to predict when a speech recognizer will confuse the two phrases when spoken by using the intermediate representation of the two phrases. (Dependent claim: 32)
33. A computer program, embodied on a computer-readable medium, comprising:
an input segment receiving two utterances in any combination of a text form and an audio file, the audio file having speech signals recorded by any speaker and under any acoustic condition; a transformation segment representing the two utterances with an intermediate representation that directly reflects the salient acoustic characteristics of the two utterances when spoken; and a predicting segment predicting acoustic confusability of the two utterances when spoken to a speech recognizer using the intermediate representation of the two utterances.
Specification