LOW LATENCY REAL-TIME VOCAL TRACT LENGTH NORMALIZATION

US 20090259465A1
Filed: 06/24/2009
Published: 10/15/2009
Est. Priority Date: 01/12/2005
Status: Active Grant


Abstract

A method and system for training an automatic speech recognition system are provided. The method includes separating training data into speaker specific segments and, for each speaker specific segment, performing the following acts: generating spectral data, selecting a first warping factor and warping the spectral data, and comparing the warped spectral data with a speech model. The method also includes iteratively selecting another warping factor and generating other warped spectral data, comparing the other warped spectral data with the speech model, and, if the other warping factor produces a closer match to the speech model, saving the other warping factor as the best warping factor for the speaker specific segment. The system includes modules configured to control a processor in the system to perform the steps of the method.
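
The selection loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: `warp_spectrum`, `distance`, and `best_warping_factor` are hypothetical names, the warp is a plain linear rescaling of the bin index, and squared Euclidean distance stands in for whatever speech-model comparison a real system would use.

```python
from typing import List, Sequence


def warp_spectrum(spectrum: Sequence[float], alpha: float) -> List[float]:
    """Hypothetical linear warp: output bin k samples the input at position k * alpha."""
    last = len(spectrum) - 1
    warped = []
    for k in range(len(spectrum)):
        pos = min(k * alpha, last)           # clamp reads past the last bin
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Linear interpolation between the two neighbouring input bins.
        warped.append((1 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance; an assumed stand-in for a speech-model score."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_warping_factor(spectrum, model, candidates):
    """Keep the candidate whose warped spectrum is closest to the model."""
    best = candidates[0]                     # the first factor starts as the best
    best_dist = distance(warp_spectrum(spectrum, best), model)
    for alpha in candidates[1:]:             # try each other warping factor
        d = distance(warp_spectrum(spectrum, alpha), model)
        if d < best_dist:                    # closer match: save it as the best
            best, best_dist = alpha, d
    return best


# Toy data (assumed): the "model" is a triangular peak at bin 10, and the
# "speaker" is the same shape stretched by 1.1 along the spectral axis.
model = [max(0.0, 5 - abs(k - 10)) for k in range(32)]
speaker = [max(0.0, 5 - abs(k / 1.1 - 10)) for k in range(32)]
best = best_warping_factor(speaker, model, [0.9, 1.0, 1.1, 1.2])
```

With this toy data, the factor that undoes the stretch is the one retained in `best`. A real system would score warped features against an acoustic model rather than a single template spectrum.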

19 Claims

1. A computer-implemented method for training an automatic speech recognition system, the method comprising:
- separating training data into speaker specific segments; and
- performing, for each speaker specific segment, the acts of:
  - generating spectral data representative of the speaker specific segment;
  - selecting a first warping factor as a best warping factor, and generating a warped spectral data representation of the spectral data;
  - comparing the warped spectral data representation to a predetermined speech model; and
  - iteratively performing, until an end condition is satisfied, the acts of:
    - selecting an other warping factor and generating an other warped spectral data representation;
    - comparing the warped spectral data representation to a respective speech model for a given iteration; and
    - if the other warping factor produces a closer match to the respective speech model, saving the other warping factor as the best warping factor for the respective speaker specific segment.
- 2. The computer-implemented method of claim 1, wherein the first warping factor is in a range from about 0.8 to about 1.2.
- 3. The computer-implemented method of claim 2, wherein the range includes increments of about 0.02 between each of the warping factors.
- 4. The computer-implemented method of claim 1, wherein the end condition includes a predetermined amount of total speech having been used to select the best warping factor.
- 5. The computer-implemented method of claim 1, wherein the end condition includes a difference between a latest warping factor and a preceding warping factor being smaller than a predetermined amount.
- 6. The computer-implemented method of claim 1, wherein the spectral data is a short-term magnitude spectrum of the speaker specific segment.
- 7. The computer-implemented method of claim 1, wherein the spectral data comprises a spectral axis modified by the warping factor.
- 8. The computer-implemented method of claim 7, further comprising generating a Vocal Tract Length Normalized acoustic model based on the spectral axis modified by the warping factor.
- 9. The computer-implemented method of claim 8, wherein the respective speech model of a second or later iteration is the Vocal Tract Length Normalized acoustic model.
- 10. The computer-implemented method of claim 1, wherein the respective speech model of a first iteration is the predetermined speech model.
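
As an illustration of claims 2 to 5, the sketch below builds the candidate grid (about 0.8 to about 1.2 in steps of about 0.02) and checks the two end conditions. It is only one possible reading: `warping_grid` and `run_until_stable` are hypothetical names, the 30-second speech budget and the 0.02 stability threshold are assumed values, and the stability test is taken to compare successive best-factor estimates.

```python
from typing import Iterable, List, Optional, Tuple


def warping_grid(low: float = 0.8, high: float = 1.2, step: float = 0.02) -> List[float]:
    """Candidate factors from low to high inclusive.

    Values are derived from an integer index and rounded to two decimals
    (which suits the default step) so that floating-point drift does not
    accumulate across the grid.
    """
    count = round((high - low) / step)
    return [round(low + i * step, 2) for i in range(count + 1)]


def run_until_stable(
    estimates: Iterable[Tuple[float, float]],
    max_speech_sec: float = 30.0,    # assumed speech budget (claim 4)
    min_change: float = 0.02,        # assumed stability threshold (claim 5)
) -> Tuple[Optional[float], str]:
    """Consume (seconds_of_speech_added, best_factor_so_far) pairs until an end condition holds."""
    used = 0.0
    previous: Optional[float] = None
    latest: Optional[float] = None
    for seconds, factor in estimates:
        used += seconds
        previous, latest = latest, factor
        if used >= max_speech_sec:
            return latest, "speech budget reached"
        if previous is not None and abs(latest - previous) < min_change:
            return latest, "factor stable"
    return latest, "data exhausted"


grid = warping_grid()
result = run_until_stable([(3.0, 1.0), (3.0, 1.06), (3.0, 1.1), (3.0, 1.1)])
```

Here `grid` holds 21 candidates, and `result` reports that the estimate stopped changing after the fourth update. Because the grid step equals the assumed threshold, "stable" in this sketch means two consecutive updates picked the same factor.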

11. A system for training an automatic speech recognition system, the system comprising:
- a processor;
- a module configured to control the processor to generate spectral data from at least a portion of training data;
- a module configured to control the processor to generate a plurality of warped spectral axes for the spectral data using a range of warping factors;
- a module configured to control the processor to determine which one of the plurality of warped spectral axes best matches one of a generic speech model or a Vocal Tract Length Normalized acoustic model;
- a module configured to control the processor to generate the Vocal Tract Length Normalized Acoustic model using a warping factor corresponding to the determined one of the plurality of warped spectral axes; and
- a module configured to control the processor to rescore lattices based on the Vocal Tract Length Normalized Acoustic model.
- 12. The system of claim 11, wherein the warping factors are in a range from about 0.8 to about 1.2.
- 13. The system of claim 12, wherein the range includes increments of about 0.02 between each of the warping factors.
- 14. The system of claim 11, further comprising a module configured to control the processor to rescore the lattices based on determined one of the plurality of warped spectral axes.
- 15. The system of claim 11, further comprising a module configured to control the processor to determine if the determined one of the plurality of warped spectral axes is stable.
- 16. The system of claim 11, wherein the module configured to control the processor to determine which one of the plurality of warped spectral axes best matches one of a generic speech model or a Vocal Tract Length Normalized model further comprises the module configured to control the processor to iteratively perform the steps of, until an end condition is met:
  - selecting an other warping factor and generating an other warped spectral data representation based on the respective warped spectral axes;
  - comparing the warped spectral data representation to a respective speech model for a given iteration; and
  - if the other warping factor produces a closer match to the respective speech model, saving the other warping factor as the best warping factor for the respective speaker specific segment.
- 17. The system of claim 16, wherein the end condition includes a predetermined amount of total speech having been used to select the best warping factor.
- 18. The system of claim 16, wherein the end condition includes a difference between a latest warping factor and a preceding warping factor being smaller than a predetermined amount.
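
The "plurality of warped spectral axes" in claim 11 can be pictured as one rescaled frequency axis per warping factor. The sketch below is an assumption-laden illustration: `warped_axis` and `warped_axes` are hypothetical names, the claims do not specify the warping function, and a simple linear scaling clamped at the Nyquist frequency is used here for brevity (practical systems often use a piecewise-linear warp instead).

```python
from typing import Dict, List, Sequence


def warped_axis(num_bins: int, sample_rate_hz: float, alpha: float) -> List[float]:
    """Centre frequency (Hz) of each bin after scaling the axis by alpha.

    Frequencies that would exceed the Nyquist frequency are clamped to it.
    """
    nyquist = sample_rate_hz / 2
    spacing = nyquist / (num_bins - 1)       # unwarped distance between bins
    return [min(alpha * k * spacing, nyquist) for k in range(num_bins)]


def warped_axes(
    num_bins: int, sample_rate_hz: float, factors: Sequence[float]
) -> Dict[float, List[float]]:
    """One warped axis per candidate warping factor, keyed by the factor."""
    return {alpha: warped_axis(num_bins, sample_rate_hz, alpha) for alpha in factors}


# Assumed example: 5 bins at an 8 kHz sample rate, three candidate factors.
axes = warped_axes(5, 8000, [0.8, 1.0, 1.2])
```

Each entry of `axes` could then be used to resample the spectral data before it is compared against the generic or Vocal Tract Length Normalized model; the factor whose axis gives the best match is the one carried forward.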

19. A tangible computer-readable storage medium storing a computer program having instructions for training an automatic speech recognition system, the instructions comprising:
- separating training data into speaker specific segments; and
- performing, for each speaker specific segment, the acts of:
  - generating spectral data representative of the speaker specific segment;
  - selecting a first warping factor as a best warping factor, and generating a warped spectral data representation of the spectral data;
  - comparing the warped spectral data representation to a predetermined speech model; and
  - iteratively performing, until an end condition is satisfied, the acts of:
    - selecting an other warping factor and generating an other warped spectral data representation;
    - comparing the warped spectral data representation to a respective speech model for a given iteration; and
    - if the other warping factor produces a closer match to the respective speech model, saving the other warping factor as the best warping factor for the respective speaker specific segment.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Saraclar, Murat; Goffin, Vincent; Ljolje, Andrej
Granted Patent: US 8,909,527 B2
US Class Current: 704/234
CPC Class Codes:
- G10L 15/063: Training
- G10L 15/10: using distance or distortio...
- G10L 15/12: using dynamic programming t...
- G10L 17/04: Training, enrolment or mode...
- G10L 17/08: Use of distortion metrics o...
