Method and system for robust pattern matching in continuous speech for spotting a keyword of interest using orthogonal matching pursuit

  • US 9,293,130 B2
  • Filed: 05/02/2008
  • Issued: 03/22/2016
  • Est. Priority Date: 05/02/2008
  • Status: Active Grant
First Claim
1. A method for speech recognition in mismatched environments, the method comprising:

  • extracting, by a computing device, time-frequency speech features from a series of reference speech elements in a first series of sampling windows spanning each occurrence of a keyword of interest, wherein the time-frequency speech features represent each reference speech element as a two-dimensional image in the time-frequency plane, wherein extracting time-frequency speech features from the series of reference speech elements is obtained from a feature domain comprising Perceptual Linear Predictive (PLP) modified power spectrum;

  • aligning, by the computing device, the extracted time-frequency speech features when the reference speech elements from the series of speech elements are not of equal time span duration;

  • constructing, by the computing device, a sparse representation model common to the aligned extracted time-frequency speech features, using simultaneous sparse approximation of reference speech signals in a time-frequency domain, wherein the simultaneous sparse approximation determines an approximation of a reference speech signal as a linear combination of reference speech signals drawn from a large, linearly dependent collection of reference speech signals;

  • determining, by the computing device, a first set of coefficient vectors for the aligned extracted time-frequency speech features;

  • extracting, by the computing device, a time-frequency feature image from a test speech stream spanned by a second sampling window, wherein the reference speech elements and the test speech stream are obtained under mismatched conditions where the test speech stream contains background noise;

  • approximating, by the computing device, the extracted time-frequency feature image in the sparse representation model for the aligned extracted time-frequency speech features with a second coefficient vector;

  • computing, by the computing device, a similarity measure between the first set of coefficient vectors and the second coefficient vector;

  • determining, by the computing device, if the similarity measure is below a predefined threshold; and

  • wherein a match between the reference speech elements and a portion of the test speech stream spanned by the second sampling window is made in response to the similarity measure being below the predefined threshold, the match indicating the presence of the keyword of interest in the second sampling window;

  • wherein Simultaneous Orthogonal Matching Pursuit (SOMP) is used for constructing the sparse representation model for the aligned extracted time-frequency speech features by extracting a subspace of common time-frequency structures from different occurrences of the keyword of interest.
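The SOMP step named in the claim can be sketched in a few lines. This is a minimal, illustrative implementation, not the patent's method: the dictionary, the toy "keyword occurrence" signals, and the Euclidean distance used as the similarity measure are all assumptions chosen for the demonstration; the patent itself would use PLP-derived time-frequency feature images and its own dictionary and threshold.

```python
import numpy as np

def somp(dictionary, signals, n_atoms):
    """Simultaneous Orthogonal Matching Pursuit (illustrative sketch).

    dictionary : (d, K) array whose columns are unit-norm atoms.
    signals    : (d, S) array whose columns are approximated jointly.
    Returns (support, coeffs): the selected atom indices and the
    (n_atoms, S) coefficient matrix of the joint approximation.
    """
    residual = signals.astype(float).copy()
    support, coeffs = [], np.zeros((0, signals.shape[1]))
    for _ in range(n_atoms):
        # Rank atoms by the summed magnitude of their correlation with
        # every residual column, so the chosen atom is one that is
        # *shared* across all occurrences of the keyword.
        scores = np.abs(dictionary.T @ residual).sum(axis=1)
        scores[support] = -np.inf  # never reselect an atom
        support.append(int(np.argmax(scores)))
        # Orthogonally project all signals onto the atoms chosen so far.
        sub = dictionary[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, signals, rcond=None)
        residual = signals - sub @ coeffs
    return support, coeffs

# Toy demonstration: three "keyword occurrences" sharing the same
# two-atom structure (atoms 2 and 7 of an orthonormal dictionary)
# with different weights, standing in for the common time-frequency
# subspace the claim describes.
rng = np.random.default_rng(0)
dictionary, _ = np.linalg.qr(rng.standard_normal((8, 8)))
true_coeffs = np.array([[1.0, -0.8, 1.2],
                        [0.7,  1.1, -0.6]])
references = dictionary[:, [2, 7]] @ true_coeffs

support, ref_coeffs = somp(dictionary, references, n_atoms=2)

# A test segment is approximated on the same support; a small distance
# between its coefficient vector and the reference coefficients plays
# the role of the claim's "similarity measure below a threshold".
test = dictionary[:, [2, 7]] @ np.array([0.9, 0.8])
test_coeffs, *_ = np.linalg.lstsq(dictionary[:, support], test, rcond=None)
distance = np.linalg.norm(test_coeffs - ref_coeffs.mean(axis=1))
```

Because the toy dictionary is orthonormal and the signals lie exactly in the span of atoms 2 and 7, SOMP recovers that support and the coefficients exactly; with real noisy features the greedy selection is only approximate, which is why the claim pairs it with a tunable threshold.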

  • 3 Assignments