Keyword detection with international phonetic alphabet by foreground model and background model
First Claim
1. A method of detecting keywords, comprising:
at an electronic device with one or more processors and memory:
training an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes:
a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and
a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword;
after training the acoustic model, generating a phone decoder based on the trained acoustic model;
obtaining a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, including:
collecting a set of keyword audio samples for the respective keyword in the respective language;
decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and
selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence;
after obtaining the keyword phoneme sequence, detecting one or more keywords in the input audio signal with the trained acoustic model, including:
matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and
filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.
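The selecting step recited above (choosing the highest-confidence phoneme at each location across the candidates and assembling the choices in order) can be illustrated with a minimal sketch. The claim does not say how candidates are represented or how candidates of different lengths are aligned, so the `(phoneme, confidence)` pair layout, the function name, and the rule of skipping candidates that are too short for a given location are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed representation: one candidate per keyword audio sample,
# each a list of (phoneme, confidence) pairs indexed by location.
Candidate = List[Tuple[str, float]]


def select_keyword_phoneme_sequence(candidates: List[Candidate]) -> List[str]:
    """At each location, keep the phoneme with the highest confidence."""
    if not candidates:
        return []
    # Assumption: locations are compared index by index, and a candidate
    # that is too short for a location is skipped at that location.
    length = max(len(c) for c in candidates)
    selected = []
    for pos in range(length):
        options = [c[pos] for c in candidates if pos < len(c)]
        best_phoneme, _ = max(options, key=lambda pair: pair[1])
        selected.append(best_phoneme)
    return selected


# Placeholder phoneme symbols and confidences, not taken from the patent.
candidates = [
    [("h", 0.9), ("e", 0.4), ("l", 0.8)],
    [("h", 0.7), ("@", 0.6), ("l", 0.5), ("o", 0.9)],
]
sequence = select_keyword_phoneme_sequence(candidates)
```

Here the assembled sequence mixes phonemes from both candidates, which is what distinguishes this per-location choice from picking a single best candidate outright.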
Abstract
An electronic device with one or more processors and memory trains an acoustic model with an international phonetic alphabet (IPA) phoneme mapping collection and audio samples in different languages, where the acoustic model includes: a foreground model; and a background model. The device generates a phone decoder based on the trained acoustic model. The device collects keyword audio samples, decodes the keyword audio samples with the phone decoder to generate phoneme sequence candidates, and selects a keyword phoneme sequence from the phoneme sequence candidates. After obtaining the keyword phoneme sequence, the device detects one or more keywords in an input audio signal with the trained acoustic model, including: matching phonemic keyword portions of the input audio signal with phonemes in the keyword phoneme sequence with the foreground model; and filtering out phonemic non-keyword portions of the input audio signal with the background model.
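As a rough illustration of the detection step described in the abstract, the sketch below compares per-frame scores from a foreground model and a background model and keeps only the stretches where the foreground model scores higher. The abstract does not specify a scoring rule, so the log-likelihood inputs, the difference threshold, and both function names are assumptions rather than the patented method.

```python
from typing import List, Sequence, Tuple


def detect_keyword_frames(
    fg_loglik: Sequence[float],
    bg_loglik: Sequence[float],
    threshold: float = 0.0,
) -> List[bool]:
    """Flag frames where the foreground score beats the background score."""
    if len(fg_loglik) != len(bg_loglik):
        raise ValueError("score sequences must have the same length")
    # Assumed rule: a frame is a keyword frame when the log-likelihood
    # difference exceeds the threshold; other frames are filtered out.
    return [fg - bg > threshold for fg, bg in zip(fg_loglik, bg_loglik)]


def keyword_segments(flags: Sequence[bool]) -> List[Tuple[int, int]]:
    """Merge consecutive keyword frames into half-open (start, end) ranges."""
    segments = []
    start = None
    for i, is_keyword in enumerate(flags):
        if is_keyword and start is None:
            start = i
        elif not is_keyword and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


# Made-up scores for four frames.
fg = [-5.0, -1.0, -0.5, -6.0]
bg = [-2.0, -3.0, -3.0, -1.0]
flags = detect_keyword_frames(fg, bg)
segments = keyword_segments(flags)
```

Frames 1 and 2 survive as one candidate keyword portion, while frames 0 and 3 are treated as non-keyword audio.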
20 Claims
1. A method of detecting keywords, comprising:
at an electronic device with one or more processors and memory:
training an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes:
a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and
a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword;
after training the acoustic model, generating a phone decoder based on the trained acoustic model;
obtaining a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, including:
collecting a set of keyword audio samples for the respective keyword in the respective language;
decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and
selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence;
after obtaining the keyword phoneme sequence, detecting one or more keywords in the input audio signal with the trained acoustic model, including:
matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and
filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. An electronic device, comprising:
one or more processors; and
memory storing one or more programs to be executed by the one or more processors, the one or more programs comprising instructions for:
training an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes:
a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and
a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword;
after training the acoustic model, generating a phone decoder based on the trained acoustic model;
obtaining a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, including:
collecting a set of keyword audio samples for the respective keyword in the respective language;
decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and
selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence;
after obtaining the keyword phoneme sequence, detecting one or more keywords in the input audio signal with the trained acoustic model, including:
matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and
filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.
Dependent claims: 12, 13, 14, 15
16. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which, when executed by an electronic device with one or more processors, cause the device to:
train an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes:
a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and
a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword;
after training the acoustic model, generate a phone decoder based on the trained acoustic model;
obtain a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, wherein the obtaining includes:
collecting a set of keyword audio samples for the respective keyword in the respective language;
decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and
selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence;
after obtaining the keyword phoneme sequence, detect one or more keywords in the input audio signal with the trained acoustic model, wherein the detecting includes:
matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and
filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.
Dependent claims: 17, 18, 19, 20
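Step (ii) of the foreground-model training recited in each independent claim, generating triphones by linking phonemes from a language's phoneme collection, might look like the following sketch. The claims do not state which combinations are linked or how a triphone is written, so enumerating every ordered triple and using a `left-center+right` label are assumptions made only to show the idea, as is the function name.

```python
from itertools import product
from typing import Iterable, List


def generate_triphones(phonemes: Iterable[str]) -> List[str]:
    """Link phonemes into context-dependent triphone labels."""
    inventory = list(phonemes)
    # Assumption: every ordered (left, center, right) triple is produced;
    # a real system would likely keep only triples seen in training data.
    return [
        f"{left}-{center}+{right}"
        for left, center, right in product(inventory, repeat=3)
    ]


# Placeholder two-phoneme collection for one language.
triphones = generate_triphones(["a", "b"])
```

The number of labels grows with the cube of the collection size, which is one reason triphones are typically clustered before further training, as step (iii) does with a decision tree.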
Specification