Estimating speaker-specific affine transforms for neural network based speech recognition systems
First Claim
1. A computer-implemented method comprising:
under control of one or more computing devices configured with specific computer-executable instructions, obtaining a Gaussian mixture model-based (“GMM-based”) acoustic model;
obtaining a neural network-based (“NN-based”) acoustic model;
receiving an audio signal comprising speech;
computing a first sequence of feature vectors from the audio signal;
computing a GMM-based transform using the GMM-based acoustic model and the first sequence of feature vectors, wherein the GMM-based transform comprises a first linear portion and a first bias portion;
computing a second linear portion of a NN-based transform by minimizing a first least squares difference function, wherein the first least squares difference function comprises a difference between the second linear portion and the first linear portion;
computing a second bias portion of the NN-based transform by minimizing a second least squares difference function, wherein the second least squares difference function comprises a difference between the second bias portion and the first bias portion;
computing a second sequence of feature vectors from the audio signal;
computing a third sequence of feature vectors by applying the second linear portion and the second bias portion of the NN-based transform to the second sequence of feature vectors;
performing speech recognition using the third sequence of feature vectors and the NN-based acoustic model to generate speech processing results; and
determining, using the speech processing results, an action to perform.
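The claimed flow can be sketched in code. This is a minimal NumPy sketch, not the patented implementation: it assumes the GMM-based and NN-based transforms share one feature dimensionality, in which case minimizing the unconstrained least-squares differences simply returns the GMM-derived linear and bias parts, which are then applied frame by frame; the function names (`estimate_nn_transform`, `apply_affine`) are illustrative.

```python
import numpy as np

def estimate_nn_transform(A_gmm, b_gmm):
    # Minimize ||A_nn - A_gmm||_F^2 and ||b_nn - b_gmm||^2.
    # With no further structural constraints the minimizers are the
    # GMM-derived parts themselves; a real system would impose extra
    # structure (e.g. over stacked LFBE frames of a different size).
    return A_gmm.copy(), b_gmm.copy()

def apply_affine(features, A, b):
    # features: (num_frames, dim); returns A @ x + b for each frame x
    return features @ A.T + b

rng = np.random.default_rng(0)
A_gmm = rng.standard_normal((20, 20))   # first linear portion (cMLLR-style)
b_gmm = rng.standard_normal(20)         # first bias portion
feats = rng.standard_normal((100, 20))  # second sequence of feature vectors

A_nn, b_nn = estimate_nn_transform(A_gmm, b_gmm)
adapted = apply_affine(feats, A_nn, b_nn)  # third sequence of feature vectors
```

The `adapted` features would then be passed to the NN-based acoustic model for recognition.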
Abstract
Features are disclosed for estimating affine transforms in Log Filter-Bank Energy Space (“LFBE” space) in order to adapt artificial neural network-based acoustic models to a new speaker or environment. Neural network-based acoustic models may be trained using concatenated LFBEs as input features. The affine transform may be estimated by minimizing the least squares error between corresponding linear and bias transform parts for the resultant neural network feature vector and some standard speaker-specific feature vector obtained for a GMM-based acoustic model using constrained Maximum Likelihood Linear Regression (“cMLLR”) techniques. Alternatively, the affine transform may be estimated by minimizing the least squares error between the resultant transformed neural network feature and some standard speaker-specific feature obtained for a GMM-based acoustic model.
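The alternative estimation mentioned at the end of the abstract, minimizing the least squares error between the transformed neural network features and the speaker-specific features from the GMM-based model, amounts to an ordinary affine least-squares fit. A hedged NumPy sketch (the helper name `fit_affine_lstsq` and the synthetic data are illustrative, not taken from the patent):

```python
import numpy as np

def fit_affine_lstsq(X, Y):
    """Least-squares affine map so that Y ~= X @ A.T + b.

    X: (T, d_in) neural-network-side feature vectors.
    Y: (T, d_out) target speaker-specific (e.g. cMLLR-adapted) features.
    """
    T = X.shape[0]
    X_aug = np.hstack([X, np.ones((T, 1))])        # append a bias column
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)  # (d_in + 1, d_out)
    A, b = W[:-1].T, W[-1]                         # linear part, bias part
    return A, b

# Synthetic check: recover a known affine transform from noiseless data.
rng = np.random.default_rng(1)
A_true = rng.standard_normal((6, 6))
b_true = rng.standard_normal(6)
X = rng.standard_normal((500, 6))
Y = X @ A_true.T + b_true
A, b = fit_affine_lstsq(X, Y)
```

With enough frames and full-rank features, the fit recovers the underlying linear and bias portions up to numerical precision.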
20 Claims
1. (Independent claim, reproduced above under “First Claim”; dependent claims 2-6 not shown.)
7. A system comprising:
a computer-readable memory storing executable instructions; and
one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
obtain a Gaussian mixture model-based (“GMM-based”) acoustic model;
obtain a neural network-based (“NN-based”) acoustic model;
compute a first sequence of feature vectors from audio data;
compute a first transform using the GMM-based acoustic model and the first sequence of feature vectors, wherein the first transform comprises a first linear portion and a first bias portion;
compute a second linear portion of a NN-based transform by minimizing a first least squares difference function, wherein the first least squares difference function comprises a difference between the first linear portion and the second linear portion;
compute a second bias portion of the NN-based transform by minimizing a second least squares difference function, wherein the second least squares difference function comprises a difference between the first bias portion and the second bias portion;
compute a second sequence of feature vectors from an audio signal comprising user speech;
compute a third sequence of feature vectors by applying the second linear portion and the second bias portion to the second sequence of feature vectors;
perform speech recognition on the audio signal using the third sequence of feature vectors and the NN-based acoustic model; and
determine, using results of the speech recognition, an action to perform.
(Dependent claims 8-13 not shown.)
14. One or more non-transitory computer-readable media comprising executable code that, when executed, causes one or more computing devices to perform a process comprising:
obtaining a Gaussian mixture model-based (“GMM-based”) acoustic model;
obtaining a neural network-based (“NN-based”) acoustic model;
computing a first sequence of feature vectors from audio data;
computing a first transform using the GMM-based acoustic model and the first sequence of feature vectors, wherein the first transform comprises a first linear portion and a first bias portion;
computing a second linear portion of a NN-based transform by minimizing a first least squares difference function, wherein the first least squares difference function comprises a difference between the first linear portion and the second linear portion;
computing a second bias portion of the NN-based transform by minimizing a second least squares difference function, wherein the second least squares difference function comprises a difference between the first bias portion and the second bias portion;
computing a second sequence of feature vectors from an audio signal comprising speech;
computing a third sequence of feature vectors by applying the second linear portion and the second bias portion to the second sequence of feature vectors;
performing speech recognition on the audio signal using the third sequence of feature vectors and the NN-based acoustic model; and
determining, using results of the speech recognition, an action to perform.
(Dependent claims 15-20 not shown.)
Specification