Low latency real-time vocal tract length normalization

US 9,165,555 B2
Filed: 11/26/2014
Issued: 10/20/2015
Est. Priority Date: 01/12/2005
Status: Expired due to Fees



Abstract

A method and system for training an automatic speech recognition system are provided. The method includes separating training data into speaker specific segments, and for each speaker specific segment, performing the following acts: generating spectral data, selecting a first warping factor and warping the spectral data, and comparing the warped spectral data with a speech model. The method also includes iteratively performing the steps of selecting another warping factor and generating another warped spectral data, comparing the other warped spectral data with the speech model, and if the other warping factor produces a closer match to the speech model, saving the other warping factor as the best warping factor for the speaker specific segment. The system includes modules configured to control a processor in the system to perform the steps of the method.
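
The per-speaker search described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the spectral data is taken to be a list of magnitude-spectrum frames, the warp is a plain linear rescaling of the bin axis, and the "speech model" is reduced to a single reference spectrum scored by mean squared distance. The names `warp_frame`, `distance_to_model`, and `select_warping_factor` are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]


def warp_frame(frame: Sequence[float], alpha: float) -> Frame:
    """Resample one magnitude spectrum along a linearly scaled bin axis (assumed warp)."""
    n = len(frame)
    warped = []
    for k in range(n):
        pos = min(k * alpha, n - 1)   # source position, clamped to the last bin
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * frame[lo] + frac * frame[hi])
    return warped


def distance_to_model(frames: Sequence[Sequence[float]],
                      model_mean: Sequence[float]) -> float:
    """Mean squared distance to a reference spectrum (toy stand-in for a speech model)."""
    total = 0.0
    for frame in frames:
        total += sum((x - m) ** 2 for x, m in zip(frame, model_mean))
    return total / len(frames)


def select_warping_factor(frames: Sequence[Sequence[float]],
                          model_mean: Sequence[float],
                          candidates: Sequence[float]) -> float:
    """Keep the candidate whose warped frames match the model most closely."""
    if not frames or not candidates:
        raise ValueError("need at least one frame and one candidate factor")
    best_alpha = candidates[0]
    best_dist = distance_to_model([warp_frame(f, best_alpha) for f in frames],
                                  model_mean)
    for alpha in candidates[1:]:
        dist = distance_to_model([warp_frame(f, alpha) for f in frames], model_mean)
        if dist < best_dist:          # closer match: save as the best warping factor
            best_alpha, best_dist = alpha, dist
    return best_alpha
```

Calling `select_warping_factor` once per speaker specific segment would yield one saved factor per segment, which is the quantity the abstract says is retained.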


20 Claims

1. A method comprising:
- performing, for a speaker specific segment of training data:
  
  generating spectral data representative of the speaker specific segment, the spectral data comprising a plurality of warping factors;
  
  selecting a first warping factor as a best warping factor from the plurality of warping factors based on a determination made during speech recognition of the speaker specific segment; and
  
  generating a warped spectral data representation of the spectral data using the first warping factor;
  
  iteratively carrying out, until a comparison indicates a warping factor difference below a threshold, the operations of:
  
  generating another warped spectral data representation using another warping factor;
  
  comparing the other warped spectral data representation to the warped spectral data representation, to yield the comparison; and
  
  when the other warping factor produces a closer match to the warped spectral data representation, saving the other warping factor as a best warping factor for the speaker specific segment; and
  
  training a new acoustic model using a warped spectral data representation of all the training data that is generated using the best warping factor for each of the speaker specific segments.
- - 2. The method of claim 1, wherein the comparison further requires a predetermined amount of total speech having been used to select the best warping factor.
  - 3. The method of claim 1, wherein the comparison further requires a difference between a latest warping factor and a preceding warping factor being smaller than a predetermined amount.
  - 4. The method of claim 1, wherein the spectral data is a short-term magnitude spectrum of the speaker specific segment.
  - 5. The method of claim 1, wherein the spectral data comprises a spectral axis modified by the warping factor.
  - 6. The method of claim 5, wherein the new acoustic model comprises a vocal tract length normalized acoustic model based on the spectral axis modified by the warping factor.
  - 7. The method of claim 1, wherein the vocal tract length normalized acoustic model is initially a generic acoustic model.
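
The two stopping conditions recited in claims 2 and 3 (a predetermined amount of total speech, and a small difference between the latest and preceding warping factors) can be sketched as a small tracker. This is only an illustration: the class name, the field names, the use of seconds as the unit of speech, and the choice to require both conditions together are assumptions, not details taken from the claims.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarpEstimateTracker:
    """Hypothetical bookkeeping for deciding when a warping factor estimate is stable."""
    min_speech_seconds: float                 # assumed "predetermined amount of total speech"
    max_factor_change: float                  # assumed "predetermined amount" for the difference
    speech_seconds: float = 0.0
    previous_factor: Optional[float] = None
    latest_factor: Optional[float] = None

    def update(self, factor: float, segment_seconds: float) -> None:
        """Record the newest estimate and the amount of speech used to obtain it."""
        self.previous_factor = self.latest_factor
        self.latest_factor = factor
        self.speech_seconds += segment_seconds

    def converged(self) -> bool:
        """True once enough speech has been seen and the estimate has stopped moving."""
        if self.previous_factor is None or self.latest_factor is None:
            return False
        enough_speech = self.speech_seconds >= self.min_speech_seconds
        small_change = (abs(self.latest_factor - self.previous_factor)
                        < self.max_factor_change)
        return enough_speech and small_change
```

A caller would invoke `update` after each new estimate and stop searching once `converged()` returns true; either condition could also be used on its own, as each appears in a separate dependent claim.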

8. A system comprising:
- a processor; and
  
  a computer-readable storage medium having instructions stored which, when executed by the processor, result in the processor performing operations comprising:
  
  performing, for a speaker specific segment of training data:
  
  generating spectral data representative of the speaker specific segment, the spectral data comprising a plurality of warping factors;
  
  selecting a first warping factor as a best warping factor from the plurality of warping factors based on a determination made during speech recognition of the speaker specific segment; and
  
  generating a warped spectral data representation of the spectral data using the first warping factor;
  
  iteratively carrying out, until a comparison indicates a warping factor difference below a threshold, the operations of:
  
  generating another warped spectral data representation using another warping factor;
  
  comparing the other warped spectral data representation to the warped spectral data representation, to yield the comparison; and
  
  when the other warping factor produces a closer match to the warped spectral data representation, saving the other warping factor as a best warping factor for the speaker specific segment; and
  
  training a new acoustic model using a warped spectral data representation of all the training data that is generated using the best warping factor for each of the speaker specific segments.
- - 9. The system of claim 8, wherein the comparison further requires a predetermined amount of total speech having been used to select the best warping factor.
  - 10. The system of claim 8, wherein the comparison further requires a difference between a latest warping factor and a preceding warping factor being smaller than a predetermined amount.
  - 11. The system of claim 8, wherein the spectral data is a short-term magnitude spectrum of the speaker specific segment.
  - 12. The system of claim 8, wherein the spectral data comprises a spectral axis modified by the warping factor.
  - 13. The system of claim 12, wherein the new acoustic model comprises a vocal tract length normalized acoustic model based on the spectral axis modified by the warping factor.
  - 14. The system of claim 8, wherein the vocal tract length normalized acoustic model is initially a generic acoustic model.

15. A computer-readable storage device having instructions stored which, when executed by a computing device, result in the computing device performing operations comprising:
- performing, for a speaker specific segment of training data:
  
  generating spectral data representative of the speaker specific segment, the spectral data comprising a plurality of warping factors;
  
  selecting a first warping factor as a best warping factor from the plurality of warping factors based on a determination made during speech recognition of the speaker specific segment; and
  
  generating a warped spectral data representation of the spectral data using the first warping factor;
  
  iteratively carrying out, until a comparison indicates a warping factor difference below a threshold, the operations of:
  
  generating another warped spectral data representation using another warping factor;
  
  comparing the other warped spectral data representation to the warped spectral data representation, to yield the comparison; and
  
  when the other warping factor produces a closer match to the warped spectral data representation, saving the other warping factor as a best warping factor for the speaker specific segment; and
  
  training a new acoustic model using a warped spectral data representation of all the training data that is generated using the best warping factor for each of the speaker specific segments.
- - 16. The computer-readable storage device of claim 15, wherein the comparison further requires a predetermined amount of total speech having been used to select the best warping factor.
  - 17. The computer-readable storage device of claim 15, wherein the comparison further requires a difference between a latest warping factor and a preceding warping factor being smaller than a predetermined amount.
  - 18. The computer-readable storage device of claim 15, wherein the spectral data is a short-term magnitude spectrum of the speaker specific segment.
  - 19. The computer-readable storage device of claim 15, wherein the spectral data comprises a spectral axis modified by the warping factor.
  - 20. The computer-readable storage device of claim 19, wherein the new acoustic model comprises a vocal tract length normalized acoustic model based on the spectral axis modified by the warping factor.
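
Claims 5, 12, and 19 recite a spectral axis modified by the warping factor without fixing the form of the modification. One common choice in vocal tract length normalization is a piecewise-linear map that scales low frequencies by the factor and then bends so that the top of the band stays in place. The sketch below assumes that form; the function name `warp_frequency` and the `knee` parameter are hypothetical and do not come from the patent.

```python
def warp_frequency(freq_hz: float, alpha: float, nyquist_hz: float,
                   knee: float = 0.85) -> float:
    """Map a frequency through a piecewise-linear warp that fixes 0 Hz and Nyquist.

    Below the breakpoint the axis is scaled by alpha; above it a second
    straight segment runs to (nyquist_hz, nyquist_hz) so the band edge is kept.
    """
    if alpha <= 0.0 or not 0.0 < knee < 1.0:
        raise ValueError("alpha must be positive and knee must lie in (0, 1)")
    # Place the breakpoint so the scaled segment never reaches past Nyquist.
    break_hz = knee * nyquist_hz * min(1.0, 1.0 / alpha)
    if freq_hz <= break_hz:
        return alpha * freq_hz
    # Second segment: from (break_hz, alpha * break_hz) to (nyquist_hz, nyquist_hz).
    slope = (nyquist_hz - alpha * break_hz) / (nyquist_hz - break_hz)
    return alpha * break_hz + slope * (freq_hz - break_hz)
```

With a factor of 1.0 the map is the identity, so an unwarped spectrum is the special case of this function rather than a separate code path.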


Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors: Goffin, Vincent; Ljolje, Andrej; Saraclar, Murat
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Kovacek, David M

Application Number: US14/554,339
Publication Number: US 20150088498A1
Time in Patent Office: 328 Days
Field of Search: 704/200, 704/211-218, 704/220, 704/233, 704/236, 704/241-245, 704/246-250, 704/270-271, 379/88.01-88.28, 379/156-166, 379/207.01
US Class Current: 1/1
CPC Class Codes:
- G10L 15/063   Training
- G10L 15/10   using distance or distortio...
- G10L 15/12   using dynamic programming t...
- G10L 17/04   Training, enrolment or mode...
- G10L 17/08   Use of distortion metrics o...
