Method and Apparatus for Speech Recognition Using Neural Networks with Speaker Adaptation

US 20150161994A1
Filed: 12/05/2013
Published: 06/11/2015
Est. Priority Date: 12/05/2013
Status: Active Grant

Abstract

In a speech recognition system, deep neural networks (DNNs) are employed in phoneme recognition. While DNNs typically provide better phoneme recognition performance than other techniques, such as Gaussian mixture models (GMM), adapting a DNN to a particular speaker is a real challenge. According to at least one example embodiment, speech data and corresponding speaker data are both applied as input to a DNN. In response, the DNN generates a prediction of a phoneme based on the input speech data and the corresponding speaker data. The speaker data may be generated from the corresponding speech data.
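
The flow described in the abstract can be sketched roughly as follows: the speech features and the speaker data are concatenated into a single input vector, passed through the network, and turned into a probability per phoneme. This is only an illustrative sketch, not the patented implementation. The layer sizes, the ReLU activation, the random weights, the toy phoneme set, and every function name (`predict_phoneme`, `init_layer`, and so on) are assumptions introduced for the example.

```python
import math
import random


def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Hidden-layer nonlinearity (an assumption; the source does not name one)."""
    return [max(0.0, v) for v in values]


def init_layer(n_out, n_in, rng):
    """Randomly initialised (untrained) layer, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_phoneme(speech_features, speaker_vector, layers, phonemes):
    """Feed speech data and speaker data jointly to the network."""
    # Both inputs enter the network together as one concatenated vector.
    activations = list(speech_features) + list(speaker_vector)
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    probs = softmax(dense(activations, out_weights, out_biases))
    best = max(range(len(probs)), key=lambda k: probs[k])
    return phonemes[best], probs


rng = random.Random(0)
PHONEMES = ["aa", "iy", "s", "t"]          # hypothetical phoneme inventory
speech = [0.2, -0.1, 0.4, 0.0, 0.3, -0.5]  # hypothetical acoustic features
speaker = [0.7, -0.2, 0.1]                 # hypothetical speaker data
layers = [
    init_layer(8, len(speech) + len(speaker), rng),
    init_layer(8, 8, rng),
    init_layer(len(PHONEMES), 8, rng),
]
label, probs = predict_phoneme(speech, speaker, layers, PHONEMES)
print(label, [round(p, 3) for p in probs])
```

Because the weights here are untrained, the printed prediction is meaningless; the point is only that the speaker data reaches the network through the same input as the speech data.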

20 Claims

1. A method for speech recognition, the method comprising:
- receiving, by a deep neural network, input speech data and corresponding speaker data; and
  
  generating, by the deep neural network, a prediction of a phoneme corresponding to the input speech data based on the corresponding speaker data.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The method as recited in claim 1, wherein the input speech data is part of training data and said receiving and generating are repeated for input speech data and corresponding speaker data associated with multiple speakers, the method further comprising:
    - iteratively updating weighting coefficients of the deep neural network based on the prediction of the phoneme generated and information in the training data.
  - 3. The method as recited in claim 1, wherein the input speech data is deployment speech data collected by a speech recognition system.
  - 4. The method as recited in claim 1, wherein the speaker data is generated from the corresponding input speech data using maximum likelihood linear regression (MLLR).
  - 5. The method as recited in claim 1, wherein the speaker data is generated from the corresponding input speech data using constrained maximum likelihood linear regression (CMLLR).
  - 6. The method as recited in claim 1, wherein the prediction of the phoneme generated includes a probability score.
  - 7. The method as recited in claim 1, wherein the prediction of the phoneme generated includes an indication of a phoneme.
  - 8. The method as recited in claim 1 further comprising training a Gaussian mixture models' module using the prediction of the phoneme generated by the deep neural network.
  - 9. The method as recited in claim 8, wherein the Gaussian mixture models' module is adapted using the speaker data.
  - 10. The method as recited in claim 1 further comprising reducing dimensionality of the speaker data using principal component analysis (PCA) prior to reception by the deep neural network.
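
The iterative weight update recited in claim 2 can be illustrated with a small gradient-descent loop over training examples from more than one speaker. This is a hedged sketch under strong simplifying assumptions: a single softmax layer stands in for the deep neural network, the loss is cross-entropy, and the data, learning rate, and names (`train_step`, `training_data`) are all invented for the example. The claims do not specify a loss function or an optimiser.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, biases, x):
    """Phoneme probabilities for one concatenated input vector."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + b
                    for row, b in zip(weights, biases)])


def train_step(weights, biases, x, target, lr):
    """One stochastic gradient step on the cross-entropy loss.

    Returns the loss measured before the update.
    """
    probs = forward(weights, biases, x)
    for k in range(len(weights)):
        # Gradient of cross-entropy w.r.t. the k-th output score.
        grad = probs[k] - (1.0 if k == target else 0.0)
        for j in range(len(x)):
            weights[k][j] -= lr * grad * x[j]
        biases[k] -= lr * grad
    return -math.log(probs[target])


def predict(weights, biases, x):
    probs = forward(weights, biases, x)
    return max(range(len(probs)), key=lambda k: probs[k])


# Toy training data: (speech features, speaker data, phoneme index).
# The first two rows share one hypothetical speaker, the last two another.
training_data = [
    ([1.0, 0.0], [0.5, 0.1], 0),
    ([0.0, 1.0], [0.5, 0.1], 1),
    ([0.9, 0.1], [-0.4, 0.3], 0),
    ([0.1, 0.8], [-0.4, 0.3], 1),
]

weights = [[0.0] * 4 for _ in range(2)]  # 2 phonemes, 2 + 2 input values
biases = [0.0, 0.0]
epoch_losses = []
for epoch in range(50):
    total = 0.0
    for speech, speaker, target in training_data:
        total += train_step(weights, biases, speech + speaker, target, lr=0.5)
    epoch_losses.append(total / len(training_data))

predictions = [predict(weights, biases, s + spk) for s, spk, _ in training_data]
print(round(epoch_losses[0], 3), round(epoch_losses[-1], 3), predictions)
```

The per-phoneme values returned by `forward` correspond to the probability score of claim 6, and the index returned by `predict` corresponds to the phoneme indication of claim 7.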

11. An apparatus for speech recognition comprising:
- at least one processor; and
  
  at least one memory with computer code instructions stored thereon, the at least one processor and the at least one memory with computer code instructions being configured to cause the apparatus to:
  
  receive, by an input layer of a deep neural network, input speech data and corresponding speaker data; and
  
  generate, at an output layer of the deep neural network, a prediction of a phoneme corresponding to the input speech data based on the corresponding speaker data.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19)
- - 12. The apparatus as recited in claim 11, wherein the input speech data is part of training data and wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to:
    - repeat receiving input speech data and corresponding speaker data and generating the prediction of the phoneme for multiple speakers; and
      
      iteratively update weighting coefficients of the deep neural network based on the prediction of the phoneme generated and information in the training data.
  - 13. The apparatus as recited in claim 11, wherein the input speech data is deployment speech data.
  - 14. The apparatus as recited in claim 11, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to generate the speaker data from the corresponding input speech data using maximum likelihood linear regression (MLLR).
  - 15. The apparatus as recited in claim 11, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to generate the speaker data from the corresponding input speech data using constrained maximum likelihood linear regression (CMLLR).
  - 16. The apparatus as recited in claim 11, wherein the prediction of the phoneme generated includes a probability score or an indication of a phoneme.
  - 17. The apparatus as recited in claim 11, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to train a Gaussian mixture models' module using the prediction of the phoneme generated.
  - 18. The apparatus as recited in claim 17, wherein the Gaussian mixture models' module is adapted using the speaker data.
  - 19. The apparatus as recited in claim 17, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to reduce the dimensionality of the speaker data using principal component analysis (PCA) prior to reception by the input layer of the deep neural network.
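
Claims 10 and 19 recite reducing the dimensionality of the speaker data with principal component analysis (PCA) before it reaches the network. A minimal sketch of that step is shown here. The claims name PCA but not how it is computed, so the power-iteration-with-deflation approach, the toy speaker vectors, and the function names (`pca_fit`, `pca_transform`) are assumptions made for illustration.

```python
import math


def mean_center(vectors):
    """Subtract the per-dimension mean from every vector."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    return mean, [[v[j] - mean[j] for j in range(dim)] for v in vectors]


def covariance(centered):
    """Sample covariance matrix of mean-centred vectors."""
    n, dim = len(centered), len(centered[0])
    return [[sum(v[i] * v[j] for v in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def mat_vec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def top_eigenvector(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    dim = len(matrix)
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec


def pca_fit(vectors, n_components):
    """Estimate the mean and the leading principal directions."""
    mean, centered = mean_center(vectors)
    cov = covariance(centered)
    dim = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, vec = top_eigenvector(cov)
        components.append(vec)
        # Deflation: remove the direction just found so that the next
        # power iteration converges to the following component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    return mean, components


def pca_transform(vector, mean, components):
    """Project one speaker vector onto the retained components."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]


# Hypothetical 4-dimensional speaker data for five speakers.
speaker_vectors = [
    [2.0, 1.9, 0.1, 0.0],
    [1.0, 1.1, -0.2, 0.1],
    [0.0, 0.1, 0.3, -0.1],
    [-1.0, -0.9, -0.1, 0.0],
    [-2.0, -2.2, 0.0, 0.05],
]
mean, components = pca_fit(speaker_vectors, n_components=2)
reduced = [pca_transform(v, mean, components) for v in speaker_vectors]
print([[round(x, 3) for x in r] for r in reduced])
```

Each reduced vector would then take the place of the full speaker data in the network input, which shrinks the input layer.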

20. A non-transitory computer-readable medium with computer code instructions stored thereon, the computer code instructions being configured, when executed by a processor, to cause an apparatus to:
- receive, by a deep neural network, input speech data and corresponding speaker data; and
  
  generate, by the deep neural network, a prediction of a phoneme corresponding to the input speech data based on the corresponding speaker data.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Tang, Yun; Nagesha, Venkatesh; Fan, Xing

Granted Patent

US 9,721,561 B2
US Class Current

1/1
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 15/063   Training

G10L 15/07   to the speaker

G10L 15/16   using artificial neural net...

G10L 2015/025   Phonemes, fenemes or fenone...
