Method and system for robust pattern matching in continuous speech for spotting a keyword of interest using orthogonal matching pursuit

US 9,293,130 B2
Filed: 05/02/2008
Issued: 03/22/2016
Est. Priority Date: 05/02/2008
Status: Active Grant

First Claim

Patent Images

1. A method for speech recognition in mismatched environments, the method comprising:

extracting, by a computing device, time-frequency speech features from a series of reference speech elements in a first series of sampling windows spanning each occurrence of a keyword of interest, wherein the time-frequency speech features represent each reference speech element as a two-dimensional image in the time-frequency plane, wherein extracting time-frequency speech features from the series of reference speech elements is obtained from a feature domain comprising Perceptual Linear Predictive (PLP) modified power spectrum;

aligning, by the computing device, the extracted time-frequency speech features when the reference speech elements from the series of speech elements are not of equal time span duration;

constructing, by the computing device, a sparse representation model common to the aligned extracted time-frequency speech features, using simultaneous sparse approximation of reference speech signals in a time-frequency domain, wherein the simultaneous sparse approximation determines an approximation of a reference speech signal as a linear combination of reference speech signals drawn from a large, linearly dependent collection of reference speech signals;

determining, by the computing device, a first set of coefficient vectors for the aligned extracted time-frequency speech features;

extracting, by the computing device, a time-frequency feature image from a test speech stream spanned by a second sampling window, wherein the reference speech elements and the test speech stream are obtained under mismatched conditions where the test speech stream contains background noise;

approximating, by the computing device, the extracted time-frequency feature image in the sparse representation model for the aligned extracted time-frequency speech features with a second coefficient vector;

computing, by the computing device, a similarity measure between the first set of coefficient vectors and the second coefficient vector;

determining, by the computing device, if the similarity measure is below a predefined threshold; and

wherein a match between the reference speech elements and a portion of the test speech stream spanned by the second sampling window is made in response to the similarity measure being below the predefined threshold, the match indicating the presence of the keyword of interest in the second sampling window;

wherein Simultaneous Orthogonal Matching Pursuit (SOMP) is used for constructing the sparse representation model for the aligned extracted time-frequency speech features by extracting a subspace of common time-frequency structures from different occurrences of the keyword of interest.

View all claims

3 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method for speech recognition, the method includes: extracting time-frequency speech features from a series of reference speech elements in a first series of sampling windows; aligning reference speech elements that are not of equal time span duration; constructing a common subspace for the aligned speech features; determining a first set of coefficient vectors; extracting a time-frequency feature image from a test speech stream spanned by a second sampling window; approximating the extracted image in the common subspace for the aligned extracted time-frequency speech features with a second coefficient vector; computing a similarity measure between the first and the second coefficient vector; determining if the similarity measure is below a predefined threshold; and wherein a match between the reference speech elements and a portion of the test speech stream is made in response to a similarity measure below a predefined threshold. The said reference speech elements correspond to a keyword of interest, wherein Simultaneous Orthogonal Matching Pursuit (SOMP) is used in their alignment.

15 Citations

View as Search Results

12 Claims

1. A method for speech recognition in mismatched environments, the method comprising:
- extracting, by a computing device, time-frequency speech features from a series of reference speech elements in a first series of sampling windows spanning each occurrence of a keyword of interest, wherein the time-frequency speech features represent each reference speech element as a two-dimensional image in the time-frequency plane, wherein extracting time-frequency speech features from the series of reference speech elements is obtained from a feature domain comprising Perceptual Linear Predictive (PLP) modified power spectrum;
  
  aligning, by the computing device, the extracted time-frequency speech features when the reference speech elements from the series of speech elements are not of equal time span duration;
  
  constructing, by the computing device, a sparse representation model common to the aligned extracted time-frequency speech features, using simultaneous sparse approximation of reference speech signals in a time-frequency domain, wherein the simultaneous sparse approximation determines an approximation of a reference speech signal as a linear combination of reference speech signals drawn from a large, linearly dependent collection of reference speech signals;
  
  determining, by the computing device, a first set of coefficient vectors for the aligned extracted time-frequency speech features;
  
  extracting, by the computing device, a time-frequency feature image from a test speech stream spanned by a second sampling window, wherein the reference speech elements and the test speech stream are obtained under mismatched conditions where the test speech stream contains background noise;
  
  approximating, by the computing device, the extracted time-frequency feature image in the sparse representation model for the aligned extracted time-frequency speech features with a second coefficient vector;
  
  computing, by the computing device, a similarity measure between the first set of coefficient vectors and the second coefficient vector;
  
  determining, by the computing device, if the similarity measure is below a predefined threshold; and
  
  wherein a match between the reference speech elements and a portion of the test speech stream spanned by the second sampling window is made in response to the similarity measure being below the predefined threshold, the match indicating the presence of the keyword of interest in the second sampling window;
  
  wherein Simultaneous Orthogonal Matching Pursuit (SOMP) is used for constructing the sparse representation model for the aligned extracted time-frequency speech features by extracting a subspace of common time-frequency structures from different occurrences of the keyword of interest.
- View Dependent Claims (2, 3, 4)
- - 2. The method of claim 1, wherein the method further comprises:
    - incrementing the second sampling window by a unit of time and extracting a new time-frequency feature image from the test speech stream spanned by the incremented second sampling window;
      
      approximating the extracted new time-frequency feature image in the sparse representation model for the aligned extracted time-frequency speech features with a new coefficient vector;
      
      computing a similarity measure between the first set of coefficient vectors and the new coefficient vector;
      
      determining if the similarity measure is below the predefined threshold;
      
      wherein a match between the reference speech elements and a portion of the test speech stream spanned by the incremented second sampling window is made in response to a similarity measure below the predefined threshold; and
      
      wherein the process of incrementing the second sampling window and computing additional similarity measures continues until the test speech stream is exhausted.
  - 3. The method of claim 1, wherein the aligning the extracted time-frequency speech features employs Dynamic Time Warping (DTW).
  - 4. The method of claim 1, wherein the first set of coefficient vectors are obtained by solving least squares equations.

5. A computer-readable storage device encoded with computer instructions that, when executed by a computing device, perform a method for speech recognition in mismatched environments, the method comprising:
- extracting time-frequency speech features from a series of reference speech elements in a first series of sampling windows spanning each occurrence of a keyword of interest, wherein the time-frequency speech features represent each reference speech element as a two-dimensional image in the time-frequency plane, wherein extracting time-frequency speech features from the series of reference speech elements is obtained from a feature domain comprising Perceptual Linear Predictive (PLP) modified power spectrum;
  
  aligning the extracted time-frequency speech features when the reference speech elements from the series of speech elements are not of equal time duration;
  
  constructing a sparse representation model common to the aligned extracted time-frequency speech features, using simultaneous sparse approximation of reference speech signals in a time-frequency domain, wherein the simultaneous sparse approximation determines an approximation of a reference speech signal as a linear combination of reference speech signals drawn from a large, linearly dependent collection of reference speech signals;
  
  determining a first set of coefficient vectors for the aligned extracted time-frequency speech features;
  
  extracting a time-frequency feature image from a test speech stream spanned by a second sampling window, wherein the reference speech elements and the test speech stream are obtained under mismatched conditions where the test speech stream contains background noise;
  
  approximating the extracted time-frequency feature image in the sparse representation model for the aligned extracted time-frequency speech features with a second coefficient vector;
  
  computing a similarity measure between the first set of coefficient vectors and the second coefficient vector; and
  
  determining if the similarity measure is below a predefined threshold, wherein a match between the reference speech elements and a portion of the test speech stream spanned by the second sampling window is made in response to the similarity measure being below the predefined threshold, the match indicating the presence of the keyword of interest in the second sampling window;
  
  wherein Simultaneous Orthogonal Matching Pursuit (SOMP) is used for constructing the sparse representation model for the aligned extracted time-frequency speech features by extracting a subspace of common time-frequency structures from different occurrences of the keyword of interest.
- View Dependent Claims (6, 7, 8)
- - 6. A computer-readable storage device as defined in claim 5, wherein the method further comprises:
    - incrementing the second sampling window by a unit of time and extracting a new time-frequency feature image from the test speech stream spanned by the incremented second sampling window;
      
      approximating the extracted new time-frequency feature image in the sparse representation model for the aligned extracted time-frequency speech features with a new coefficient vector;
      
      computing a similarity measure between the first set of coefficient vectors and the new coefficient vector;
      
      determining if the similarity measure is below the predefined threshold, wherein a match between the reference speech elements and a portion of the test speech stream spanned by the incremented second sampling window is identified in response to a similarity measure below the predefined threshold; and
      
      continuing the process of incrementing the second sampling window and computing additional similarity measures until the test speech stream is exhausted.
  - 7. A computer-readable storage device as defined in claim 5, wherein aligning the extracted time-frequency speech features uses Dynamic Time Warping (DTW).
  - 8. A computer-readable storage device as defined in claim 5, wherein determining the first set of coefficient vectors comprises solving least squares equations.

9. A system comprising a computing device and a storage device encoded with instructions that, when executed by the computing device, perform a method for speech recognition in mismatched environments, the instructions configured to:
- extract time-frequency speech features from a series of reference speech elements in a first series of sampling windows spanning each occurrence of a keyword of interest, wherein the time-frequency speech features represent each reference speech element as a two-dimensional image in the time-frequency plane, wherein extracting time-frequency speech features from the series of reference speech elements is obtained from a feature domain comprising Perceptual Linear Predictive (PLP) modified power spectrum;
  
  align the extracted time-frequency speech features when the reference speech elements from the series of speech elements are not of equal time duration;
  
  construct a sparse representation model common to the aligned extracted time-frequency speech features, using simultaneous sparse approximation of reference speech signals in a time-frequency domain, wherein the simultaneous sparse approximation determines an approximation of a reference speech signal as a linear combination of reference speech signals drawn from a large, linearly dependent collection of reference speech signals;
  
  determine a first set of coefficient vectors for the aligned extracted time-frequency speech features;
  
  extract a time-frequency feature image from a test speech stream spanned by a second sampling window, wherein the reference speech elements and the test speech stream are obtained under mismatched conditions where the test speech stream contains background noise;
  
  approximate the extracted time-frequency feature image in the sparse representation model for the aligned extracted time-frequency speech features with a second coefficient vector;
  
  compute a similarity measure between the first set of coefficient vectors and the second coefficient vector; and
  
  determine if the similarity measure is below a predefined threshold, wherein a match between the reference speech elements and a portion of the test speech stream spanned by the second sampling window is made in response to the similarity measure being below the predefined threshold, the match indicating the presence of the keyword of interest in the second sampling window;
  
  wherein Simultaneous Orthogonal Matching Pursuit (SOMP) is used to construct the sparse representation model for the aligned extracted time-frequency speech features by extracting a subspace of common time-frequency structures from different occurrences of the keyword of interest.
- View Dependent Claims (10, 11, 12)
- - 10. The system of claim 9, wherein the instructions are further configured to:
    - increment the second sampling window by a unit of time and extract a new time-frequency feature image from the test speech stream spanned by the incremented second sampling window;
      
      approximate the extracted new time-frequency speech feature image in the sparse representation model for the aligned extracted time-frequency speech features with a new coefficient vector;
      
      compute a similarity measure between the first set of coefficient vectors and the new coefficient vector;
      
      determine if the similarity measure is below the predefined threshold, wherein a match between the reference speech elements and a portion of the test speech stream spanned by the incremented second sampling window is identified in response to a similarity measure below the predefined threshold; and
      
      continue the process of incrementing the second sampling window and computing additional similarity measures until the test speech stream is exhausted.
  - 11. A system as defined in claim 9, wherein the instructions are configured to align the extracted time-frequency speech features using Dynamic Time Warping (DTW).
  - 12. A system as defined in claim 9, wherein the instructions are configured to determine the first set of coefficients by solving least squares equations.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Amini, Lisa, Frossard, Pascal, Kokiopoulou, Effrosyni, Verscheure, Oliver
Primary Examiner(s)
Kazeminezhad, Farzad

Application Number

US12/114,128
Publication Number

US 20090276216A1
Time in Patent Office

2,881 Days
Field of Search

704/270, 704/232, 704/236, 704/275
US Class Current

1/1
CPC Class Codes

G10L 15/02 Feature extraction for spee...

G10L 15/22 Procedures used during a sp...

Method and system for robust pattern matching in continuous speech for spotting a keyword of interest using orthogonal matching pursuit

First Claim

3 Assignments

0 Petitions

Accused Products

Abstract

15 Citations

12 Claims

Specification

Solutions

Use Cases

Quick Links

Method and system for robust pattern matching in continuous speech for spotting a keyword of interest using orthogonal matching pursuit

First Claim

3 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

15 Citations

12 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links