Method and apparatus for speech recognition and generation of speech recognition engine
1 Assignment
0 Petitions
Abstract
A method and apparatus for speech recognition and for generating a speech recognition engine, as well as a speech recognition engine, are provided. The method of speech recognition involves receiving a speech input, transmitting the speech input to a speech recognition engine, and receiving a speech recognition result from the speech recognition engine. The speech recognition engine obtains a phoneme sequence from the speech input and provides the speech recognition result based on a phonetic distance of the phoneme sequence.
11 Citations
19 Claims
1. A method of speech recognition, the method comprising:
receiving a speech input;
transmitting the speech input to a speech recognition engine; and
receiving a speech recognition result from the speech recognition engine,
wherein the speech recognition engine is configured to
  obtain a phoneme sequence from the speech input,
  identify an embedding vector representative of a phoneme sequence that is closest in a phonetic distance to the obtained phoneme sequence among embedding vectors arranged on an N-dimensional embedding space, and
  determine, based on the identified embedding vector, the speech recognition result based on a previous phoneme sequence mapping into the N-dimensional embedding space corresponding to the identified embedding vector,
wherein the identifying of the embedding vector includes identifying the embedding vector from the obtained phoneme sequence using a recognition model that is trained based on probabilities of respective phonemes of phoneme sequences being substituted by different phonemes when pronounced, and
wherein embedding vectors to which words phonetically similar to one another are mapped among the embedding vectors in the N-dimensional embedding space are positioned closer to one another than other embedding vectors on the N-dimensional embedding space.
Dependent claims: 2, 3, 4
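The lookup step recited in claim 1 (identifying the embedding vector closest to the obtained phoneme sequence) can be illustrated with a minimal Python sketch. This is not the patented implementation: the word list, the two-dimensional coordinates, and the name nearest_word are hypothetical, the query vector stands in for whatever a trained recognition model would output, and the use of Euclidean distance in the embedding space is an assumption.

```python
import math

# Hypothetical toy embedding space (N = 2). The coordinates are invented for
# illustration; phonetically similar words are placed close to one another.
EMBEDDINGS = {
    "bat": (0.10, 0.90),
    "pat": (0.14, 0.86),
    "dog": (0.80, 0.20),
}


def nearest_word(query, embeddings):
    """Return the word whose embedding vector is closest to `query`."""
    return min(embeddings, key=lambda word: math.dist(query, embeddings[word]))


# `query` stands in for the vector a trained recognition model would emit
# for the phoneme sequence obtained from the speech input (an assumption).
query = (0.79, 0.25)
print(nearest_word(query, EMBEDDINGS))
```

A linear scan is enough for a toy vocabulary; a larger vocabulary would likely call for a spatial index, which is outside the scope of this sketch.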
5. A processor implemented method of generating a speech recognition engine, the method comprising:
comparing phonemes comprised in training phoneme sequences for words;
calculating a substitution probability between each phoneme of the phonemes and all other phonemes of the phonemes;
determining phonetic similarities between the training phoneme sequences based on the substitution probability calculated for each phoneme of the training phoneme sequences;
calculating phonetic distances between the words based on the determined phonetic similarities between the training phoneme sequences;
generating embedding vectors by implementing a multidimensional scaling algorithm to convert the calculated phonetic distances between the words to the embedding vectors arranged on an N-dimensional embedding space; and
generating the recognition engine by training, using the generated embedding vectors as training outputs and the training phoneme sequences as training inputs, a recognition model to identify an embedding vector representative of a phoneme sequence that is closest in a phonetic distance to an input phoneme sequence among other positions in the N-dimensional embedding space representative of other phoneme sequences,
wherein the training of the recognition model includes repeatedly applying an input training phoneme sequence by the recognition model to identify respective resulting embedding vectors until the recognition model is trained to generate the identified embedding vector representative of a word recognition result for the input phoneme sequence.
Dependent claims: 6, 7, 8, 9, 10
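One way to picture how substitution probabilities could yield a phonetic distance between two phoneme sequences, as recited in claim 5, is a weighted edit distance. The sketch below is an assumption, not the claimed method: the claim text does not say how similarities are combined, so the cost rule (1 minus the substitution probability), the symmetric lookup, the insertion/deletion cost of 1.0, and every probability value are hypothetical.

```python
# Hypothetical substitution probabilities: SUB_PROB[(a, b)] is the assumed
# chance that phoneme a is pronounced as phoneme b. Values are invented.
SUB_PROB = {("b", "p"): 0.4, ("d", "t"): 0.3}


def sub_cost(a, b):
    """Substitution cost: 0 for identical phonemes, lower when confusable."""
    if a == b:
        return 0.0
    # Assumption: treat confusability as symmetric by taking the larger value.
    prob = max(SUB_PROB.get((a, b), 0.0), SUB_PROB.get((b, a), 0.0))
    return 1.0 - prob


def phonetic_distance(seq_a, seq_b, indel_cost=1.0):
    """Weighted edit distance between two phoneme sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,      # delete from seq_a
                dist[i][j - 1] + indel_cost,      # insert into seq_a
                dist[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),
            )
    return dist[-1][-1]


bat = ["b", "ae", "t"]
pat = ["p", "ae", "t"]
dog = ["d", "ao", "g"]
print(phonetic_distance(bat, pat))  # confusable pair, small distance
print(phonetic_distance(bat, dog))  # unrelated pair, large distance
```

Evaluating this function for every pair of words would give the matrix of inter-word phonetic distances that the multidimensional scaling step takes as input.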
11. A method of speech recognition, the method comprising:
receiving a speech input;
obtaining a phoneme sequence from the speech input;
selecting an embedding vector representative of a phoneme sequence that is closest in a phonetic distance to the phoneme sequence among embedding vectors arranged on an N-dimensional embedding space; and
identifying a word of the speech input based on the selected embedding vector,
wherein embedding vectors to which words phonetically similar to one another are mapped among the embedding vectors in the N-dimensional embedding space are positioned closer to one another than other embedding vectors on the N-dimensional embedding space, and
wherein the selecting of the embedding vector includes selecting the embedding vector using a recognition model that is trained using training phoneme data and trained to identify corresponding embedding vectors representing dimensional scale reductions of inter-word distance information.
Dependent claims: 12, 13
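The "dimensional scale reductions of inter-word distance information" in claim 11 can be illustrated with classical multidimensional scaling, which turns a matrix of pairwise distances into coordinates. The sketch below is one textbook variant written with the Python standard library only; the claims do not say which scaling algorithm is used, so the choice of classical MDS, the power-iteration eigen-solver, the function names, and the distance values are all assumptions. It also assumes the distances are at least approximately Euclidean.

```python
import math


def top_eigenpair(matrix, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(matrix)
    v = [1.0 + i for i in range(n)]  # fixed, non-constant starting vector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, w)), v


def classical_mds(distances, n_dims):
    """Map a symmetric distance matrix to coordinates in n_dims dimensions."""
    n = len(distances)
    sq = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        eigval, vec = top_eigenpair(gram)
        if eigval <= 1e-9:  # remaining axes carry no information
            break
        for i in range(n):
            coords[i][k] = math.sqrt(eigval) * vec[i]
        # Deflate so the next pass finds the next-largest eigenpair.
        gram = [[gram[i][j] - eigval * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords


# Hypothetical phonetic distances between three words (a 3-4-5 triangle).
distances = [[0.0, 3.0, 4.0],
             [3.0, 0.0, 5.0],
             [4.0, 5.0, 0.0]]
vectors = classical_mds(distances, n_dims=2)
for a in range(3):
    for b in range(a + 1, 3):
        print(a, b, round(math.dist(vectors[a], vectors[b]), 6))
```

The printed pairwise distances between the generated vectors can be compared with the input matrix to check how much of the inter-word distance information survives the reduction to N dimensions.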
14. An apparatus comprising:
a microphone configured to receive a speech input;
a processor configured to
  obtain a phoneme sequence from the speech input,
  identify an embedding vector representative of a phoneme sequence that is closest in the phonetic distance to the obtained phoneme sequence among embedding vectors arranged on an N-dimensional embedding space, and
  determine, based on the identified embedding vector, a speech recognition result based on a previous phoneme sequence mapping into the N-dimensional embedding space corresponding to the identified embedding vector,
wherein the identifying of the embedding vector includes identifying the embedding vector from the obtained phoneme sequence using a recognition model that is trained based on probabilities of respective phonemes of phoneme sequences being substituted by different phonemes when pronounced, and
wherein embedding vectors to which words phonetically similar to one another are mapped among the embedding vectors in the N-dimensional embedding space are positioned closer to one another than other embedding vectors on the N-dimensional embedding space.
Dependent claims: 15, 16, 17
18. A speech recognition engine generator, comprising:
a processor; and
a memory having instructions stored thereon executed by the at least one processor to perform:
  comparing phonemes comprised in training phoneme sequences of words;
  calculating a substitution probability between each phoneme of the phonemes and all other phonemes of the phonemes;
  determining phonetic similarities between the training phoneme sequences based on the substitution probability calculated for each phoneme of the training phoneme sequences;
  calculating phonetic distances between the words based on the determined phonetic similarities between the training phoneme sequences; and
  generating embedding vectors by implementing a multidimensional scaling algorithm to convert the calculated phonetic distances between the words to the embedding vectors arranged on an N-dimensional embedding space; and
  generating the recognition engine by training, using the generated embedding vectors, a recognition model to identify an embedding vector representative of a phoneme sequence that is closest in a phonetic distance to an input phoneme sequence among other positions in the N-dimensional embedding space representative of other phoneme sequences,
wherein the training of the recognition model includes repeatedly applying an input training phoneme sequence by the recognition model to identify respective resulting embedding vectors until the recognition model is trained to generate the identified embedding vector representative of a word recognition result for the input phoneme sequence.
Dependent claims: 19
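The training loop recited in claim 18, in which training phoneme sequences are applied repeatedly until the model produces the target embedding vectors, can be sketched as follows. The claims do not specify the model type, so the linear model, the order-ignoring bag-of-phonemes features, the learning rate, the stopping tolerance, and all training data below are hypothetical simplifications chosen to keep the example short.

```python
# Hypothetical phoneme inventory and training data; all values are invented.
PHONEMES = ["b", "p", "d", "ae", "ao", "t", "g"]
TRAIN_INPUTS = [["b", "ae", "t"], ["p", "ae", "t"], ["d", "ao", "g"]]
TRAIN_OUTPUTS = [[0.0, 0.0], [0.6, 0.0], [1.5, 2.6]]  # assumed embedding vectors


def featurize(seq):
    """Bag-of-phonemes counts; a simplification that ignores phoneme order."""
    return [float(seq.count(p)) for p in PHONEMES]


def predict(weights, seq):
    """Map a phoneme sequence to a point in the embedding space."""
    features = featurize(seq)
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def train(inputs, targets, lr=0.05, tol=1e-6, max_epochs=10000):
    """Apply the training sequences repeatedly until outputs match targets."""
    n_dims = len(targets[0])
    weights = [[0.0] * len(PHONEMES) for _ in range(n_dims)]
    for _ in range(max_epochs):
        worst = 0.0
        for seq, target in zip(inputs, targets):
            features = featurize(seq)
            output = predict(weights, seq)
            for d in range(n_dims):
                err = output[d] - target[d]
                worst = max(worst, abs(err))
                for k, x in enumerate(features):
                    weights[d][k] -= lr * err * x  # gradient step on squared error
        if worst < tol:  # every training sequence now lands on its vector
            break
    return weights


weights = train(TRAIN_INPUTS, TRAIN_OUTPUTS)
for seq in TRAIN_INPUTS:
    print(seq, [round(v, 3) for v in predict(weights, seq)])
```

Because the features ignore order, this toy model cannot tell apart words built from the same phonemes in a different order; a sequence model would be needed for that.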
Specification