DOMAIN ADAPTATION IN SPEECH RECOGNITION VIA TEACHER-STUDENT LEARNING

US 20190051290A1
Filed: 08/11/2017
Published: 02/14/2019
Est. Priority Date: 08/11/2017
Status: Active Grant


Abstract

Improvements in speech recognition in a new domain are provided via student/teacher training of models for different speech domains. A student model for a new domain is created based on a teacher model trained in an existing domain. The student model is trained in parallel with the operation of the teacher model, with inputs in the new and existing domains respectively, to develop a neural network that is adapted to recognize speech in the new domain. The data in the new domain may lack transcription labels; instead, they are parallel to the data in the existing domain that the teacher model analyzes. The outputs of the teacher model are compared with the outputs of the student model, and the differences are used to adjust the parameters of the student model to better recognize speech in the new domain.
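
The training procedure summarized in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this listing: a toy linear-softmax classifier stands in for the neural network, Kullback-Leibler divergence is used as the measure of difference between teacher and student outputs, and the names `posteriors`, `mean_divergence`, and `adapt_student` are hypothetical. No transcription labels appear anywhere; the teacher posteriors are the only training signal.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posteriors(weights, frame):
    """Class posteriors of a toy linear model (one weight row per class)."""
    return softmax([sum(w * x for w, x in zip(row, frame)) for row in weights])


def kl_divergence(p_teacher, p_student):
    """KL(teacher || student); an assumed choice of divergence."""
    return sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)


def mean_divergence(teacher, student, source_frames, target_frames):
    """Average divergence over parallel source/target frames."""
    scores = [kl_divergence(posteriors(teacher, src), posteriors(student, tgt))
              for src, tgt in zip(source_frames, target_frames)]
    return sum(scores) / len(scores)


def adapt_student(teacher, source_frames, target_frames,
                  lr=0.5, threshold=1e-4, max_epochs=300):
    """Clone the teacher, then train the clone on parallel target-domain data."""
    student = [row[:] for row in teacher]  # the student starts as a copy
    for _ in range(max_epochs):
        score = mean_divergence(teacher, student, source_frames, target_frames)
        if score <= threshold:
            break  # posteriors have converged: finalize the student
        for src, tgt in zip(source_frames, target_frames):
            p_t = posteriors(teacher, src)   # teacher sees source-domain input
            p_s = posteriors(student, tgt)   # student sees target-domain input
            # Gradient of KL(p_t || p_s) w.r.t. student logit k is p_s[k] - p_t[k].
            for k, row in enumerate(student):
                grad = p_s[k] - p_t[k]
                for j, x in enumerate(tgt):
                    row[j] -= lr * grad * x
    return student
```

Calling `adapt_student(teacher, source_frames, target_frames)` with target frames that are a distorted copy of the source frames returns a new weight matrix and leaves the teacher unchanged.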

20 Claims

1. A system providing for adaption of speech recognition models for speech recognition in new domains, comprising:
- a processor; and
  
  a memory storage device including instructions that when executed by the processor enable the system to:
  
  select a teacher model configured for speech recognition of utterances in a source domain;
  
  produce a student model based on the teacher model for speech recognition of utterances in a target domain;
  
  provide source domain utterances to the teacher model to produce teacher posteriors for the source domain utterances;
  
  provide, in parallel to providing the source domain utterances, target domain utterances to the student model to produce student posteriors for the target domain utterances;
  
  determine whether student posteriors converge with the teacher posteriors;
  
  in response to determining that the student posteriors and the teacher posteriors converge, finalize the student model for use in speech recognition in the target domain; and
  
  in response to determining that the student posteriors and the teacher posteriors do not converge, update parameters of the student model based on divergences in the student posteriors and the teacher posteriors.
- - 2. The system of claim 1, wherein the teacher model is selected based on at least one of:
    - a selected language;
      
      a selected dialect; and
      
      a selected accent.
  - 3. The system of claim 1, wherein the parameters of the student model are updated according to a back propagation of the student posteriors.
  - 4. The system of claim 1, wherein, in providing the target domain utterances in parallel with the source domain utterances, the system is further operable to:
    - receive a target domain definition specifying at least one of:
      
      a Signal to Noise Ratio;
      
      a codec by which the utterances are encoded;
      
      a frequency band for the utterances;
      
      a volume level; and
      
      an average speech frequency for the utterances; and
      
      transform the source domain utterances according to the target domain definition to produce the target domain utterances that simulate utterances according to the target domain definition.
  - 5. The system of claim 1, wherein the source domain utterances and the target domain utterances comprise un-transcribed data.
  - 6. The system of claim 1, wherein in response to updating the student model, the system is further operable to:
    - provide successive source domain utterances to the teacher model to produce successive teacher posteriors for the successive source domain utterances;
      
      provide, in parallel to providing the successive source domain utterances, successive target domain utterances to the updated student model to produce successive student posteriors for the successive target domain utterances;
      
      determine whether successive student posteriors converge with the successive teacher posteriors;
      
      in response to determining that the successive posteriors converge, finalize the updated student model for use in speech recognition in the target domain; and
      
      in response to determining that the successive posteriors do not converge, update parameters of the updated student model based on divergences in the successive posteriors.
  - 7. The system of claim 1, wherein when updating the student model, the system is further operable to adjust parameters of the student model to minimize a divergence score between the student posteriors and the teacher posteriors.
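
The transformation recited in claim 4 can be sketched for two of the listed properties, the Signal to Noise Ratio and the volume level. This is an illustration under assumptions: the overlaid signal is white Gaussian noise, the utterance is a plain list of samples, and the function name `simulate_target` and its parameters are hypothetical rather than taken from the patent.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def simulate_target(samples, snr_db, gain=1.0, seed=0):
    """Overlay white noise at the requested SNR, then apply a volume gain.

    The noise type and the order of operations are assumptions made for
    this sketch; the gain is applied last so that it does not alter the SNR.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    # Scale the noise so that 20 * log10(rms(signal) / rms(noise)) == snr_db.
    scale = rms(samples) / (rms(noise) * 10 ** (snr_db / 20.0))
    return [gain * (s + scale * n) for s, n in zip(samples, noise)]
```

Each output sample stays aligned with its input sample, so the simulated utterance remains parallel to the source utterance that the teacher model receives.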

8. A method for adaption of speech recognition models for speech recognition in new domains, comprising:
- receiving a selection of a first speech recognition model adapted for speech recognition of utterances in a first domain;
  
  cloning the first speech recognition model to thereby produce a second speech recognition model;
  
  providing a first dataset of utterances to the first speech recognition model and a second dataset of utterances to the second speech recognition model, wherein the first dataset includes utterances defined according to the first domain and the second dataset includes parallel utterances to those included in the first dataset that are defined according to a second domain;
  
  determining whether posteriors produced by the second speech recognition model from the second dataset converge with posteriors produced by the first speech recognition model from the first dataset;
  
  in response to determining that the posteriors converge, finalizing the second speech recognition model for use in speech recognition in the second domain; and
  
  in response to determining that the posteriors do not converge, updating parameters of the second speech recognition model based on the posteriors.
- - 9. The method of claim 8, wherein the second dataset comprises the utterances of the first dataset transformed from the first domain into the second domain.
  - 10. The method of claim 9, wherein transforming the utterances of the first dataset from the first domain into the second domain comprises at least one of:
    - overlaying a new signal to the utterances of the first dataset;
      
      adjusting a parameter of the utterances of the first dataset; and
      
      frequency warping the utterances of the first dataset.
  - 11. The method of claim 8, wherein the first speech recognition model provides a supervisory signal by which the second speech recognition model is updated.
  - 12. The method of claim 8, wherein determining whether the posteriors produced by the second speech recognition model from the second dataset converge with the posteriors produced by the first speech recognition model from the first dataset further comprises:
    - calculating a divergence score between the posteriors produced by the second speech recognition model and the posteriors produced by the first speech recognition model;
      
      comparing the divergence score to a convergence threshold;
      
      in response to the divergence score satisfying the convergence threshold, determining that the posteriors converge; and
      
      in response to the divergence score not satisfying the convergence threshold, determining that the posteriors do not converge.
  - 13. The method of claim 8, wherein the posteriors indicate probabilities of which senones are present in a given frame of a given utterance.
  - 14. The method of claim 8, wherein the second domain is defined relative to the first domain as having at least one of:
    - a different Signal to Noise Ratio than the first domain;
      
      a different encoding codec than the first domain;
      
      a different frequency band of utterances than the first domain;
      
      a different field depth for utterances than the first domain;
      
      a different volume than the first domain; and
      
      a different average pitch than the first domain.
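
Of the transformations listed in claim 10, frequency warping is the least self-explanatory, so a minimal sketch may help. The claims do not specify a warping function; this example assumes a simple linear warp applied to one frame of spectral magnitudes, with linear interpolation between bins. The name `warp_spectrum` and the factor `alpha` are hypothetical.

```python
def warp_spectrum(magnitudes, alpha):
    """Linearly warp the frequency axis of one magnitude-spectrum frame.

    Output bin k takes its value from input position k / alpha, so
    alpha > 1 shifts energy toward higher bins and alpha < 1 toward
    lower bins. Positions past the last input bin are treated as silence.
    """
    n = len(magnitudes)
    warped = []
    for k in range(n):
        pos = k / alpha
        lo = int(pos)
        if pos > n - 1:
            warped.append(0.0)             # beyond the available band
        elif lo == n - 1:
            warped.append(magnitudes[-1])  # exactly on the last bin
        else:
            frac = pos - lo
            warped.append((1 - frac) * magnitudes[lo]
                          + frac * magnitudes[lo + 1])
    return warped
```

A warp with `alpha=2` moves a spectral peak from bin 2 to bin 4, which is one rough way to simulate a change in average pitch between the first and second domains.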

15. A computer readable storage device including instructions that when executed by a processor provide for adaption of speech recognition models for speech recognition in new domains, comprising:
- receiving a selection of a teacher model adapted for speech recognition of utterances in a source domain;
  
  cloning the teacher model to produce a student model;
  
  providing utterances according to the source domain to the teacher model in parallel to providing utterances according to a target domain to the student model;
  
  determining whether posteriors produced by the student model from the target domain utterances converge with posteriors produced by the teacher model from the source domain utterances;
  
  in response to determining that the posteriors converge, finalizing the student model for use in speech recognition in the target domain; and
  
  in response to determining that the posteriors do not converge, updating parameters of the student model based on the posteriors.
- - 16. The computer readable storage device of claim 15, wherein the target domain utterances comprise transformed utterances of the source dataset.
  - 17. The computer readable storage device of claim 16, wherein the target domain utterances are transformed from the utterances of the source dataset according to at least one of:
    - overlaying a new signal with the source domain utterances;
      
      adjusting a parameter of the source domain utterances; and
      
      frequency warping the source domain utterances.
  - 18. The computer readable storage device of claim 15, wherein determining whether the posteriors produced by the student model converge with the posteriors produced by the teacher model further comprises:
    - calculating a divergence score between the posteriors produced by the student model and the posteriors produced by the teacher model;
      
      comparing the divergence score to a convergence threshold;
      
      in response to the divergence score satisfying the convergence threshold, determining that the posteriors converge; and
      
      in response to the divergence score not satisfying the convergence threshold, determining that the posteriors do not converge.
  - 19. The computer readable storage device of claim 15, wherein the posteriors indicate probabilities of which senones are present in a given frame of a given utterance.
  - 20. The computer readable storage device of claim 15, wherein the target domain is defined as having at least one of:
    - a different Signal to Noise Ratio than the source domain;
      
      a different codec by which utterances are encoded than the source domain;
      
      a different frequency band for the utterances than the source domain;
      
      a different field depth than the source domain;
      
      a different volume than the source domain; and
      
      a different average pitch than the source domain.
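
Claims 18 and 19 together describe a divergence score computed from per-frame senone posteriors and compared with a convergence threshold. The claims do not name a particular score. One plausible reading, assumed here, is the Kullback-Leibler divergence averaged over frames, where $T$ is the number of frames, $S$ the number of senones, $x^{\mathrm{src}}_t$ and $x^{\mathrm{tgt}}_t$ the parallel source and target frames, $P_T$ and $P_S$ the teacher and student posteriors, and $\epsilon$ the threshold (all symbols are introduced for this illustration):

```latex
D = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{S}
    P_T\!\left(s \mid x^{\mathrm{src}}_t\right)
    \log \frac{P_T\!\left(s \mid x^{\mathrm{src}}_t\right)}
              {P_S\!\left(s \mid x^{\mathrm{tgt}}_t\right)},
\qquad
\text{posteriors converge} \iff D \le \epsilon
```

Under this reading, $D = 0$ exactly when the student reproduces the teacher's senone posteriors on every frame, and a larger $\epsilon$ finalizes the student model earlier.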

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Li, Jinyu; Seltzer, Michael Lewis; Wang, Xi; Zhao, Rui; Gong, Yifan

Granted Patent

US 10,885,900 B2
CPC Class Codes

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/084   Backpropagation, e.g. using...

G06N 3/126   Evolutionary algorithms, e....

G06N 5/01   Dynamic search techniques; ...

G10L 15/063   Training

G10L 15/065   Adaptation

G10L 15/16   using artificial neural net...

G10L 15/183   using context dependencies,...

G10L 25/30   using neural networks
