Keyword detection with international phonetic alphabet by foreground model and background model

US 9,466,289 B2
Filed: 12/11/2013
Issued: 10/11/2016
Est. Priority Date: 01/29/2013
Status: Active Grant

Abstract

An electronic device with one or more processors and memory trains an acoustic model with an international phonetic alphabet (IPA) phoneme mapping collection and audio samples in different languages, where the acoustic model includes: a foreground model; and a background model. The device generates a phone decoder based on the trained acoustic model. The device collects keyword audio samples, decodes the keyword audio samples with the phone decoder to generate phoneme sequence candidates, and selects a keyword phoneme sequence from the phoneme sequence candidates. After obtaining the keyword phoneme sequence, the device detects one or more keywords in an input audio signal with the trained acoustic model, including: matching phonemic keyword portions of the input audio signal with phonemes in the keyword phoneme sequence with the foreground model; and filtering out phonemic non-keyword portions of the input audio signal with the background model.
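
The selection step summarized above (and spelled out in claim 1 as choosing the phoneme with the highest confidence measure at each location) can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Candidate` type, the function name `select_keyword_sequence`, the confidence values, and the assumption that all candidates are already aligned to the same length are hypothetical choices made only for illustration.

```python
from typing import List, Tuple

# Assumption: a decoded candidate is one (phoneme, confidence) pair per position.
Candidate = List[Tuple[str, float]]


def select_keyword_sequence(candidates: List[Candidate]) -> List[str]:
    """Pick, at each position, the phoneme with the highest confidence."""
    if not candidates:
        raise ValueError("need at least one candidate")
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        # Assumption of this sketch: candidates are pre-aligned and equal-length.
        raise ValueError("candidates must have equal length")
    sequence = []
    for position in range(length):
        # Compare the (phoneme, confidence) pairs of all candidates at this position.
        phoneme, _ = max((c[position] for c in candidates), key=lambda pc: pc[1])
        sequence.append(phoneme)
    return sequence


# Made-up decoder output for two recordings of the same keyword.
candidates = [
    [("n", 0.9), ("i", 0.6), ("h", 0.7), ("a", 0.8), ("o", 0.5)],
    [("n", 0.8), ("e", 0.7), ("h", 0.9), ("a", 0.6), ("u", 0.4)],
]
result = select_keyword_sequence(candidates)
print(result)
```

With a single candidate the function simply returns that candidate's phonemes, which is consistent with the one-sample case described in claim 2.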

20 Claims

1. A method of detecting keywords, comprising:
- at an electronic device with one or more processors and memory:
  
  training an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes:
  
  a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and
  
  a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword;
  
  after training the acoustic model, generating a phone decoder based on the trained acoustic model;
  
  obtaining a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, including:
  
  collecting a set of keyword audio samples for the respective keyword in the respective language;
  
  decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and
  
  selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence;
  
  after obtaining the keyword phoneme sequence, detecting one or more keywords in the input audio signal with the trained acoustic model, including:
  
  matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and
  
  filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.
- - 2. The method of claim 1, wherein selecting the keyword phoneme sequence from the set of phoneme sequence candidates includes:
    - in accordance with a determination that the set of keyword audio samples includes one collected keyword audio sample, selecting the phoneme sequence candidate generated from decoding the one collected keyword audio sample as the keyword phoneme sequence; and
      
      in accordance with a determination that the set of audio samples includes two or more collected keyword audio samples, selecting one of the two or more phoneme sequence candidates generated from decoding the two or more collected keyword audio samples as the keyword phoneme sequence.
  - 3. The method of claim 1, further including:
    - collecting the plurality of audio samples in the plurality of different languages and labeled data for the plurality of audio samples;
      
      mapping phonemes from each phoneme collection to phonemes in the IPA so as to generate the IPA phoneme mapping collection; and
      
      wherein the acoustic model is trained based on the collected plurality of audio samples in the plurality of different languages, the collected labeled data for the plurality of audio samples, and the generated IPA phoneme mapping collection.
  - 4. The method of claim 3, further including:
    - processing the collected plurality of audio samples with a predetermined characteristic extraction protocol so as to obtain a plurality of corresponding audio characteristic sequences;
      
      obtaining a characteristic phoneme collection corresponding to the plurality of audio characteristic sequences based on the IPA phoneme mapping collection;
      
      training the foreground and background models based on the characteristic phoneme collection and the collected labeled data; and
      
      integrating the trained foreground and background models into the acoustic model.
  - 5. The method of claim 4, wherein:
    - generating the plurality of triphones by linking phonemes in the phoneme collection corresponding to the language includes, for each phoneme in the phoneme collection:
      
      obtaining a context phoneme; and
      
      generating a triphone by linking the context phoneme to a corresponding monophone for the phoneme;
      
      performing Gaussian splitting training on the triphones that are clustered with the decision tree corresponding to the language updates a parameter of the clustered triphone; and
      
      training the foreground model further includes:
      
      for each phoneme in the characteristic phoneme collection:
      
      training an initial hidden Markov model (HMM) for three statuses of a respective phoneme in the characteristic phoneme collection;
      
      obtaining data related to the respective phoneme from the collected labeled data;
      
      updating the initial HMM with the obtained data so as to obtain a monophone model for the respective phoneme; and
      
      after performing the Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language, performing minimum phoneme error discriminative training so as to obtain triphone models for respective phonemes in the phoneme collection corresponding to the language; and
      
      training the foreground model based on the obtained monophone and triphone models.
  - 6. The method of claim 5, further including:
    - calculating a Gaussian Mixed Model (GMM) distance between two monophone models;
      
      comparing the calculated GMM distance with a predefined similarity threshold value; and
      
      in accordance with a determination that the calculated GMM distance is larger than the predefined similarity threshold value:
      
      clustering the two monophones corresponding to the two monophone models; and
      
      recording the two monophones in a confusion matrix, wherein the confusion matrix is configured to describe similar monophones.
  - 7. The method of claim 6, wherein training the background model includes:
    - generating a confusion phoneme collection by processing the phonemes in the characteristic phoneme collection with the confusion matrix; and
      
      training the background model with the generated confusion phoneme collection.
  - 8. The method of claim 1, wherein:
    - the set of audio samples includes two or more collected keyword audio samples; and
      
      the set of phoneme sequence candidates generated from decoding the two or more collected keyword audio samples include two or more phoneme sequence candidates; and
      
      the method including:
      
      integrating the phoneme sequence candidates into a linear structure, including:
      
      mapping the corresponding phonemes of the phoneme sequence candidates to an edge of the linear structure; and
      
      categorizing the edges corresponding to similar phonemes into a same slot of the linear structure, wherein the slots form a linear connection relation with each other; and
      
      selecting a path of the linear structure, wherein the phonemes corresponding to the edges of the selected path comprise the keyword phoneme sequence.
  - 9. The method of claim 8, wherein selecting the path from the linear structure includes:
    - calculating an occurrence frequency for the phoneme corresponding to each edge of the linear structure;
      
      calculating a score for each path of the linear structure by summing the occurrence frequencies of the phonemes corresponding to each edge in a respective path;
      
      sorting the scores for the paths of the linear structure from high to low, and selecting the first N paths as alternative paths, wherein N is an integer larger than 1; and
      
      calculating a confidence measure for each of the N alternative paths, and selecting one of the N alternative paths according to the corresponding calculated confidence measures, wherein the phonemes corresponding to the edges of the selected one of the N alternative paths comprise the keyword phoneme sequence.
  - 10. The method of claim 9, wherein calculating the confidence measure for each of the N alternative paths includes:
    - for each of the N alternative paths:
      
      aligning a respective alternative path with each of the phoneme sequence candidates and calculating an intermediate confidence measure; and
      
      after aligning the respective alternative path with each of the phoneme sequence candidates, calculating the average of the intermediate confidence measures, wherein the average of the calculated intermediate confidence measures is the confidence measure for the alternative path; and
      
      wherein selecting one of the N alternative paths includes selecting the alternative path with the highest confidence measure, wherein the phonemes corresponding to the edges of the selected alternative path with the highest confidence measure comprise the keyword phoneme sequence.
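
The path selection recited in claims 8 to 10 can be sketched as follows. This is a hedged illustration rather than the claimed implementation: the function names, the assumption that candidates are already aligned position by position (so each position forms one slot), and the use of "fraction of agreeing positions" as the intermediate confidence measure are all assumptions made here, since the claims do not define these details.

```python
from collections import Counter
from itertools import product
from typing import List

Sequence = List[str]


def build_slots(candidates: List[Sequence]) -> List[Counter]:
    """One slot per position; each slot counts how often each phoneme occurs."""
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        # Assumption of this sketch: candidates are pre-aligned and equal-length.
        raise ValueError("candidates must have equal length")
    return [Counter(c[i] for c in candidates) for i in range(length)]


def top_paths(slots: List[Counter], n: int) -> List[Sequence]:
    """Score each path by summing occurrence frequencies; keep the best n."""
    total = sum(slots[0].values())  # number of candidates
    scored = []
    for path in product(*(slot.keys() for slot in slots)):
        score = sum(slot[p] / total for slot, p in zip(slots, path))
        scored.append((score, list(path)))
    scored.sort(key=lambda sp: sp[0], reverse=True)  # high to low
    return [path for _, path in scored[:n]]


def path_confidence(path: Sequence, candidates: List[Sequence]) -> float:
    """Average, over all candidates, the fraction of positions that agree."""
    per_candidate = [
        sum(a == b for a, b in zip(path, c)) / len(path) for c in candidates
    ]
    return sum(per_candidate) / len(per_candidate)


def select_path(candidates: List[Sequence], n: int = 3) -> Sequence:
    """Keep the n best-scoring paths, then return the most confident one."""
    alternatives = top_paths(build_slots(candidates), n)
    return max(alternatives, key=lambda p: path_confidence(p, candidates))


# Made-up phoneme sequence candidates from three recordings of one keyword.
candidates = [list("nihao"), list("nehao"), list("nihau")]
best = select_path(candidates, n=3)
print(best)
```

Enumerating every path with `itertools.product` grows exponentially with sequence length, so a practical system would likely use a beam or dynamic-programming search instead; the exhaustive version is kept only because it mirrors the wording of claim 9 most directly.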

11. An electronic device, comprising:
- one or more processors; and
  
  memory storing one or more programs to be executed by the one or more processors, the one or more programs comprising instructions for:
  
  training an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes:
  
  a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and
  
  a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword;
  
  after training the acoustic model, generating a phone decoder based on the trained acoustic model;
  
  obtaining a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, including:
  
  collecting a set of keyword audio samples for the respective keyword in the respective language;
  
  decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and
  
  selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence;
  
  after obtaining the keyword phoneme sequence, detecting one or more keywords in the input audio signal with the trained acoustic model, including:
  
  matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and
  
  filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.
- - 12. The device of claim 11, wherein selecting the keyword phoneme sequence from the set of phoneme sequence candidates includes:
    - in accordance with a determination that the set of keyword audio samples includes one collected keyword audio sample, selecting the phoneme sequence candidate generated from decoding the one collected keyword audio sample as the keyword phoneme sequence; and
      
      in accordance with a determination that the set of audio samples includes two or more collected keyword audio samples, selecting one of the two or more phoneme sequence candidates generated from decoding the two or more collected keyword audio samples as the keyword phoneme sequence.
  - 13. The device of claim 11, wherein the one or more programs comprising instructions for:
    - collecting the plurality of audio samples in the plurality of different languages and labeled data for the plurality of audio samples;
      
      obtaining a phoneme collection for each of the plurality of different languages;
      
      mapping phonemes from each phoneme collection to phonemes in the IPA so as to generate the IPA phoneme mapping collection; and
      
      wherein the acoustic model is trained based on the collected plurality of audio samples in the plurality of different languages, the collected labeled data for the plurality of audio samples, and the generated IPA phoneme mapping collection.
  - 14. The device of claim 13, wherein the one or more programs comprising instructions for:
    - processing the collected plurality of audio samples with a predetermined characteristic extraction protocol so as to obtain a plurality of corresponding audio characteristic sequences;
      
      obtaining a characteristic phoneme collection corresponding to the plurality of audio characteristic sequences based on the IPA phoneme mapping collection;
      
      training the foreground and background models based on the characteristic phoneme collection and the collected labeled data; and
      
      integrating the trained foreground and background models into the acoustic model.
  - 15. The device of claim 11, wherein:
    - the set of audio samples includes two or more collected keyword audio samples; and
      
      the set of phoneme sequence candidates generated from decoding the two or more collected keyword audio samples include two or more phoneme sequence candidates; and
      
      the method including:
      
      integrating the phoneme sequence candidates into a linear structure, including:
      
      mapping the corresponding phonemes of the phoneme sequence candidates to an edge of the linear structure; and
      
      categorizing the edges corresponding to similar phonemes into a same slot of the linear structure, wherein the slots form a linear connection relation with each other; and
      
      selecting a path of the linear structure, wherein the phonemes corresponding to the edges of the selected path comprise the keyword phoneme sequence.
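
The IPA phoneme mapping collection described in claims 3 and 13 can be pictured as a table that groups language-specific phoneme labels under a shared IPA symbol. The sketch below is only an illustration under stated assumptions: the tiny ARPABET and Pinyin tables, the language tags `"en"` and `"zh"`, and the function name `build_ipa_mapping` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

# Illustrative, deliberately tiny per-language phoneme tables (assumptions).
ARPABET_TO_IPA = {"SH": "ʃ", "IY": "i", "AA": "ɑ", "M": "m"}
PINYIN_TO_IPA = {"sh": "ʂ", "x": "ɕ", "m": "m"}


def build_ipa_mapping(
    tables: Dict[str, Dict[str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (language, phoneme) pairs under the IPA symbol they map to."""
    mapping: Dict[str, List[Tuple[str, str]]] = {}
    for language, table in tables.items():
        for phoneme, ipa in table.items():
            mapping.setdefault(ipa, []).append((language, phoneme))
    return mapping


mapping = build_ipa_mapping({"en": ARPABET_TO_IPA, "zh": PINYIN_TO_IPA})
# Phonemes from different languages that share an IPA symbol land in one entry,
# so audio samples from both languages can contribute to the same model unit.
print(mapping["m"])
```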

16. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which, when executed by an electronic device with one or more processors, cause the device to:
- train an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes:
  
  a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and
  
  a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword;
  
  after training the acoustic model, generate a phone decoder based on the trained acoustic model;
  
  obtain a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, wherein the obtaining includes:
  
  collecting a set of keyword audio samples for the respective keyword in the respective language;
  
  decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and
  
  selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence;
  
  after obtaining the keyword phoneme sequence, detect one or more keywords in the input audio signal with the trained acoustic model, wherein the detecting includes:
  
  matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and
  
  filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.
- - 17. The non-transitory computer readable storage medium of claim 16, wherein selecting the keyword phoneme sequence from the set of phoneme sequence candidates includes:
    - in accordance with a determination that the set of keyword audio samples includes one collected keyword audio sample, selecting the phoneme sequence candidate generated from decoding the one collected keyword audio sample as the keyword phoneme sequence; and
      
      in accordance with a determination that the set of audio samples includes two or more collected keyword audio samples, selecting one of the two or more phoneme sequence candidates generated from decoding the two or more collected keyword audio samples as the keyword phoneme sequence.
  - 18. The non-transitory computer readable storage medium of claim 16, wherein the one or more programs comprising instructions, which further cause the device to:
    - collect the plurality of audio samples in the plurality of different languages and labeled data for the plurality of audio samples;
      
      obtain a phoneme collection for each of the plurality of different languages;
      
      map phonemes from each phoneme collection to phonemes in the IPA so as to generate the IPA phoneme mapping collection; and
      
      wherein the acoustic model is trained based on the collected plurality of audio samples in the plurality of different languages, the collected labeled data for the plurality of audio samples, and the generated IPA phoneme mapping collection.
  - 19. The non-transitory computer readable storage medium of claim 18, wherein the one or more programs comprising instructions, which further cause the device to:
    - processing the collected plurality of audio samples with a predetermined characteristic extraction protocol so as to obtain a plurality of corresponding audio characteristic sequences;
      
      obtaining a characteristic phoneme collection corresponding to the plurality of audio characteristic sequences based on the IPA phoneme mapping collection;
      
      training the foreground and background models based on the characteristic phoneme collection and the collected labeled data; and
      
      integrating the trained foreground and background models into the acoustic model.
  - 20. The non-transitory computer readable storage medium of claim 16, wherein:
    - the set of audio samples includes two or more collected keyword audio samples;
      
      the set of phoneme sequence candidates generated from decoding the two or more collected keyword audio samples include two or more phoneme sequence candidates; and
      
      the one or more programs comprising instructions, which further cause the device to:
      
      integrate the phoneme sequence candidates into a linear structure, wherein the integrating includes:
      
      mapping the corresponding phonemes of the phoneme sequence candidates to an edge of the linear structure; and
      
      categorizing the edges corresponding to similar phonemes into a same slot of the linear structure, wherein the slots form a linear connection relation with each other; and
      
      select a path of the linear structure, wherein the phonemes corresponding to the edges of the selected path comprise the keyword phoneme sequence.
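
The triphone generation step in the independent claims (linking each phoneme to its context phonemes) can be sketched as below. The `left-center+right` label format is a notation commonly used in HMM-based speech toolkits, not something the claims specify, and the `"sil"` boundary symbol and the function name `make_triphones` are assumptions made for this example.

```python
from typing import List


def make_triphones(phonemes: List[str], boundary: str = "sil") -> List[str]:
    """Link each phoneme to its left and right neighbours as 'l-c+r' labels."""
    # Assumption: pad with a boundary symbol so edge phonemes also get context.
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


triphones = make_triphones(["n", "i", "h"])
print(triphones)
```

Because the number of distinct triphones grows roughly with the cube of the phoneme inventory, many of them have little training data, which is the usual motivation for the decision-tree clustering that the claims apply before Gaussian splitting.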

Current Assignee: Tencent Technology Shenzhen Company Limited (Tencent Holdings Limited)
Original Assignee: Tencent Technology Shenzhen Company Limited (Tencent Holdings Limited)
Inventors: Lu, Li; Zhang, Xiang; Yue, Shuai; Rao, Feng; Wang, Eryu; Li, Lu
Primary Examiner(s): Lerner, Martin
Application Number: US14/103,775
Publication Number: US 20140236600A1
Time in Patent Office: 1,035 Days
Field of Search: 704/239, 704/240, 704/242, 704/243, 704/245, 704/254, 704/255, 704/257
US Class Current: 1/1
CPC Class Codes:
G10L 15/063   Training
G10L 15/08   Speech classification or se...
G10L 2015/088   Word spotting