MODEL TRAINING FOR AUTOMATIC SPEECH RECOGNITION FROM IMPERFECT TRANSCRIPTION DATA

US 20100318355A1
Filed: 06/10/2009
Published: 12/16/2010
Est. Priority Date: 06/10/2009
Status: Active Grant

Abstract

Techniques and systems for training an acoustic model are described. In an embodiment, a technique for training an acoustic model includes dividing a corpus of training data that includes transcription errors into N parts, and on each part, decoding an utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription. The technique may further include inserting silence between a pair of words into the decoded transcription and aligning an original transcription corresponding to the utterance with the decoded transcription according to time for each part. The technique may further include selecting a segment from the utterance having at least Q contiguous matching aligned words, and training the incremental acoustic model with the selected segment. The trained incremental acoustic model may then be used on a subsequent part of the training data. Other embodiments are described and claimed.
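The part-by-part loop described in the abstract can be sketched as follows. This is only an illustrative outline, not the patented implementation: `split_into_parts`, `incremental_training`, and the `select_segments` and `train` callables are hypothetical names, and the corpus is split by utterance count rather than by duration for simplicity.

```python
from typing import Callable, List, Sequence, TypeVar

Utt = TypeVar("Utt")
Model = TypeVar("Model")
Seg = TypeVar("Seg")


def split_into_parts(corpus: Sequence[Utt], n_parts: int) -> List[List[Utt]]:
    """Divide the corpus into n_parts contiguous chunks of near-equal size.

    Assumption: parts are sized by utterance count, not by audio duration.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be a positive integer")
    size, extra = divmod(len(corpus), n_parts)
    parts: List[List[Utt]] = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional utterance.
        end = start + size + (1 if i < extra else 0)
        parts.append(list(corpus[start:end]))
        start = end
    return parts


def incremental_training(
    corpus: Sequence[Utt],
    n_parts: int,
    initial_model: Model,
    select_segments: Callable[[Model, Utt], List[Seg]],
    train: Callable[[Model, List[Seg]], Model],
) -> Model:
    """Train on one part at a time, reusing the updated model on the next part.

    `select_segments` stands in for decoding, alignment, and segment
    selection; `train` stands in for acoustic model training. Both are
    hypothetical callables supplied by the caller.
    """
    model = initial_model
    for part in split_into_parts(corpus, n_parts):
        selected: List[Seg] = []
        for utterance in part:
            selected.extend(select_segments(model, utterance))
        # The model trained on this part is used for the subsequent part.
        model = train(model, selected)
    return model
```

Passing the decoding and training steps in as callables keeps the sketch independent of any particular recognizer toolkit.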

20 Claims

1. A computer-implemented method, comprising:
- a. aligning an utterance from a set of training data with a corresponding original transcription from the set of training data to produce a time-aligned transcription with time alignment information for each word in the utterance, wherein the set of training data includes transcription errors;
  
  b. decoding the same utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription with time alignment information for each word;
  
  c. aligning the time-aligned and decoded transcriptions according to time alignment information;
  
  d. selecting all segments from the utterance having at least Q contiguous matching aligned words, where Q is a positive integer; and
  
  e. training the incremental acoustic model with the selected segments.
- - 2. The computer-implemented method of claim 1, comprising:
    - dividing training data comprising audio data and transcription data corresponding to the audio data into N parts of M duration, wherein each part includes one or more utterances each comprising a plurality of words, and wherein N and M are positive integers; and
      
      f. iterating 1.a. through 1.d. for each utterance in one of the N parts; and
      
      g. iterating 2.f. for each of the N parts.
  - 3. The computer-implemented method of claim 2, comprising:
    - during a first iteration on a first part, building the incremental language model from the original transcription corresponding to the first part; and
      
      during a subsequent iteration on a subsequent part, building L incremental language models, where M/L is less than or equal to one, and where each of the L incremental language models uses a portion of M/L duration of the original transcription corresponding to the subsequent part.
  - 4. The computer-implemented method of claim 1, comprising:
    - f. evaluating the accuracy of the incremental acoustic model compared to the accuracy of an acoustic model built from a similar amount of training data having no transcription errors.
  - 5. The computer-implemented method of claim 1, wherein selecting a segment from the utterance comprises:
    - including a silence in a selected segment comprising the Q matching aligned words when the selected segment is preceded or followed by a silence; and
      
      if there is no silence preceding or succeeding the selected segment;
      
      selecting the selected segment according to the original transcription with time alignment information; and
      
      inserting part of a silence segment from the beginning of the utterance into the beginning of the selected segment, and appending a part of a silence segment from the end of the utterance to the end of the selected segment.
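
The segment selection step of claim 1 (keeping runs of at least Q contiguous matching aligned words) might look like the sketch below. It assumes the original and decoded transcriptions have already been paired word by word; the `Word` type, the `words_match` rule, and its `tolerance` parameter are illustrative assumptions, since the claims do not state how a match is decided. Silence handling from claim 5 is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def words_match(ref: Word, hyp: Optional[Word], tolerance: float = 0.05) -> bool:
    """Assumed matching rule: same text and boundaries within `tolerance` seconds."""
    if hyp is None:
        return False
    return (
        ref.text == hyp.text
        and abs(ref.start - hyp.start) <= tolerance
        and abs(ref.end - hyp.end) <= tolerance
    )


def select_segments(
    pairs: List[Tuple[Word, Optional[Word]]], q: int
) -> List[List[Word]]:
    """Return every run of at least q contiguous matching aligned words.

    `pairs` holds (original word, decoded word) tuples in time order; the
    decoded word is None when nothing was aligned to the original word.
    """
    if q < 1:
        raise ValueError("q must be a positive integer")
    segments: List[List[Word]] = []
    run: List[Word] = []
    for ref, hyp in pairs:
        if words_match(ref, hyp):
            run.append(ref)
        else:
            # A mismatch closes the current run; keep it only if long enough.
            if len(run) >= q:
                segments.append(run)
            run = []
    if len(run) >= q:
        segments.append(run)
    return segments
```

A larger Q discards more audio but makes it less likely that a selected segment contains a transcription error.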

6. A computer-readable storage medium storing computer-executable program instructions that when executed cause a computing system to:
- compute a frame posterior for each word in an utterance from a corpus comprising audio data and a corresponding transcription that contains transcription errors;
  
  train an acoustic model with confidence-based maximum likelihood estimation (MLE) training using the frame posterior;
  
  estimate acoustic model parameters with confidence-based discriminative training using the frame posterior; and
  
  generate a finalized acoustic model.
- - 7. The computer-readable storage medium of claim 6, wherein the instructions to compute a frame posterior include instructions that when executed cause the computing system to:
    - decode the audio data using an existing acoustic model to generate a lattice;
      
      merging the decoded lattice with the transcription;
      
      labeling each word in the merged lattice as one of correct or incorrect by examining a degree to which the word is overlapped with the transcription;
      
      computing a posterior probability for each word in the merged lattice; and
      
      computing the frame posterior q(t) of time t by summing the posterior probabilities of all the correct words passing time t.
  - 8. The computer-readable storage medium of claim 6, wherein the instructions to train an acoustic model with confidence-based MLE training include instructions that when executed cause the computing system to:
    - estimate model parameters using the transcription, the audio data and the frame posterior.
  - 9. The computer-readable storage medium of claim 8, wherein the instructions to estimate model parameters include instructions that when executed cause the computing system to:
    - calculate the update formulas for mean ($\mu_{jk}$) and variance ($\sigma_{jk}^2$) for a jth state and a kth mixture model as;
  - 10. The computer-readable storage medium of claim 9, wherein $\zeta_{jk}(t)$ is adjusted according to soft confidence training, wherein $\zeta_{jk}(t) = q(t)\,\zeta_{jk}(t)$
  - 11. The computer-readable storage medium of claim 9, wherein $\zeta_{jk}(t)$ is adjusted according to hard confidence training, wherein
  - 12. The computer-readable storage medium of claim 6, wherein the instructions to estimate acoustic model parameters with confidence-based discriminative training include instructions that when executed cause the computing system to:
    - estimate model parameters by separating statistics for a numerator lattice corresponding to the original transcription from the statistics of a decoding lattice generated by decoding the audio data with an existing acoustic model to generate the decoding lattice.
  - 13. The computer-readable storage medium of claim 12, wherein the instructions to estimate model parameters include instructions that when executed cause the computing system to:
    - calculate the update formulas for mean ($\mu_{jk}$) and variance ($\sigma_{jk}^2$) for a jth state and a kth mixture model as;
  - 14. The computer-readable storage medium of claim 13, wherein $\gamma_{qjk}^{\mathrm{den}}(t)$ is adjusted according to soft confidence training, wherein $\gamma_{qjk}^{\mathrm{den}}(t) = q(t)\,\gamma_{qjk}^{\mathrm{den}}(t)$
  - 15. The computer-readable storage medium of claim 13, wherein $\gamma_{qjk}^{\mathrm{den}}(t)$ is adjusted according to hard confidence training, wherein;
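
The frame posterior of claim 7 and the soft confidence scaling of claims 10 and 14 can be illustrated with a short sketch. The `LatticeWord` type, the frame-index convention (start inclusive, end exclusive), and the function names are assumptions made for illustration; the hard confidence rule and the mean and variance update formulas are not reproduced in this text, so they are left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LatticeWord:
    start_frame: int   # first frame covered by the word (inclusive)
    end_frame: int     # one past the last frame covered (exclusive)
    posterior: float   # word posterior probability from the merged lattice
    correct: bool      # label assigned by comparison with the transcription


def frame_posteriors(words: Sequence[LatticeWord], n_frames: int) -> List[float]:
    """q(t): sum of the posteriors of all correct words passing frame t."""
    q = [0.0] * n_frames
    for word in words:
        if not word.correct:
            continue
        # Clip the word span to the utterance length before accumulating.
        for t in range(max(0, word.start_frame), min(n_frames, word.end_frame)):
            q[t] += word.posterior
    return q


def soft_confidence(occupancy: Sequence[float], q: Sequence[float]) -> List[float]:
    """Scale per-frame occupancies by q(t), as in the soft confidence claims."""
    if len(occupancy) != len(q):
        raise ValueError("occupancy and q must cover the same frames")
    return [q_t * occ_t for q_t, occ_t in zip(q, occupancy)]
```

Frames covered only by incorrect words receive q(t) = 0, so under soft confidence scaling they contribute nothing to the accumulated statistics.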

16. A system, comprising:
- an alignment component operative to align an utterance from a corpus of training data including transcription errors with a corresponding original transcription from the corpus of training data to produce a time-aligned transcription with time alignment information for each word in the utterance;
  
  a decoding component operative to decode the utterance from the corpus of training data using an incremental acoustic model and an incremental language model to produce a decoded transcription;
  
  wherein the alignment component is operative to align the time-aligned transcription with the decoded transcription;
  
  a segment selecting component operative to select a segment from the utterance having at least Q contiguous matching aligned words, where Q is a positive integer; and
  
  a training component to train the incremental acoustic model with the selected segment and to generate a final acoustic model.
- - 17. The system of claim 16, wherein the decoding component uses a trained incremental acoustic model from the training component.
  - 18. The system of claim 16, wherein the segment selecting component is operative to:
    - include a silence in a selected segment comprising the Q matching aligned words when the selected segment is preceded or followed by a silence; and
      
      if there is no silence preceding or succeeding the selected segment, to;
      
      select the selected segment according to the original transcription with time alignment information; and
      
      insert part of a silence segment from the beginning of the utterance into the beginning of the selected segment, and appending a part of a silence segment from the end of the utterance to the end of the selected segment.
  - 19. The system of claim 16, wherein the decoding component, alignment component, segment selecting component, and training component are operative to iterate over N parts of the corpus of training data.
  - 20. The system of claim 19, wherein on a first iteration, the decoding component uses the incremental language model from the original transcription corresponding to the first part;
    - and the training component is operative to build L incremental language models for subsequent iterations, where M/L is less than or equal to one, and where each of the L incremental language models uses a portion of M/L duration of the original transcription corresponding to a subsequent part.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Yao, Kaisheng; Liu, Chaojun; Gong, Yifan; Li, Jinyu

Granted Patent

US 9,280,969 B2
Field of Search
US Class Current

704/244
CPC Class Codes

G10L 15/063 Training

G10L 15/065 Adaptation
