Method and apparatus for constructing voice templates for a speaker-independent voice recognition system

US 6,735,563 B1
Filed: 07/13/2000
Issued: 05/11/2004
Est. Priority Date: 07/13/2000
Status: Expired due to Term

First Claim

1. A method of creating speech templates for use in a speaker-independent speech recognition system, the method comprising:

segmenting each utterance of a first plurality of utterances to generate a plurality of time-clustered segments for each utterance, each time-clustered segment being represented by a spectral mean;

quantizing the plurality of spectral means for all of the first plurality of utterances to generate a plurality of template vectors;

comparing each one of the plurality of template vectors with a second plurality of utterances using a dynamic time warping calculation to generate at least one comparison result;

matching the first plurality of utterances with the plurality of template vectors if the at least one comparison result exceeds at least one predefined threshold value, to generate an optimal matching path result;

partitioning the first plurality of utterances in time in accordance with the optimal matching path result; and

repeating the quantizing, comparing, matching, and partitioning until the at least one comparison result does not exceed any at least one predefined threshold value.

1 Assignment

0 Petitions

Accused Products

Abstract

A method and apparatus for constructing voice templates for a speaker-independent voice recognition system includes segmenting a training utterance to generate time-clustered segments, each segment being represented by a mean. The means for all utterances of a given word are quantized to generate template vectors. Each template vector is compared with testing utterances to generate a comparison result. The comparison is typically a dynamic time warping computation. The training utterances are matched with the template vectors if the comparison result exceeds at least one predefined threshold value, to generate an optimal path result, and the training utterances are partitioned in accordance with the optimal path result. The partitioning is typically a K-means segmentation computation. The partitioned utterances may then be re-quantized and re-compared with the testing utterances until the at least one predefined threshold value is not exceeded.
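
The abstract names dynamic time warping as the typical comparison. A minimal sketch of such a computation is given below, assuming utterances and templates are lists of feature vectors and that frames are compared with Euclidean distance. The function names and the symmetric step pattern are illustrative assumptions, not details taken from the patent.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(template, utterance):
    """Accumulated cost of the best warping path between two sequences.

    Assumption: a basic symmetric step pattern (insert, delete, match)
    with no slope constraints or path-length normalization.
    """
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning the first i template
    # vectors with the first j utterance frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(template[i - 1], utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template advances
                                 cost[i][j - 1],      # utterance advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

A slower rendition of the same word, such as `[[0.0], [0.0], [1.0]]` against the template `[[0.0], [1.0]]`, warps onto the template at zero cost, which is the property that makes the comparison tolerant of speaking rate.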

36 Citations

26 Claims

1. A method of creating speech templates for use in a speaker-independent speech recognition system, the method comprising:
- segmenting each utterance of a first plurality of utterances to generate a plurality of time-clustered segments for each utterance, each time-clustered segment being represented by a spectral mean;
  
  quantizing the plurality of spectral means for all of the first plurality of utterances to generate a plurality of template vectors;
  
  comparing each one of the plurality of template vectors with a second plurality of utterances using a dynamic time warping calculation to generate at least one comparison result;
  
  matching the first plurality of utterances with the plurality of template vectors if the at least one comparison result exceeds at least one predefined threshold value, to generate an optimal matching path result;
  
  partitioning the first plurality of utterances in time in accordance with the optimal matching path result; and
  
  repeating the quantizing, comparing, matching, and partitioning until the at least one comparison result does not exceed any at least one predefined threshold value.
  - 2. The method of claim 1, wherein the comparing comprises calculating a variance measure.
  - 3. The method of claim 1, wherein the comparing comprises calculating an accuracy measure.
  - 4. The method of claim 1, wherein the comparing comprises first calculating a variance measure, and second, if the variance measure does not exceed a first predefined threshold value, calculating an accuracy measure.
  - 5. The method of claim 4, wherein the matching comprises matching the first utterance with the plurality of template vectors if either the variance measure exceeds the first predefined threshold value or the accuracy measure exceeds a second predefined threshold value.
  - 6. The method of claim 1, wherein the matching comprises performing a dynamic time warping computation.
  - 7. The method of claim 1, wherein the matching and the partitioning comprise performing a K-means segmentation computation.
  - 8. The method of claim 1, further comprising detecting endpoints of the first utterance.
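
Claims 4 and 5 describe a two-stage check: the variance measure is calculated first, and the accuracy measure is calculated only if the variance measure does not exceed its threshold. Either measure exceeding its threshold triggers another matching pass. The sketch below illustrates that control flow only. The function name, the callable argument, and the numeric values are hypothetical, and how the two measures are actually computed is not stated in the claims.

```python
def should_repartition(variance_measure, compute_accuracy_measure,
                       variance_threshold, accuracy_threshold):
    """Return True if another quantize/compare/match/partition pass is needed.

    compute_accuracy_measure is a zero-argument callable (an assumption made
    here) so that the second measure is evaluated only when it is needed.
    """
    # Stage 1: the variance measure alone can force another iteration.
    if variance_measure > variance_threshold:
        return True
    # Stage 2: reached only when the variance measure is within its threshold.
    return compute_accuracy_measure() > accuracy_threshold


# Hypothetical values for illustration only.
calls = []


def accuracy():
    calls.append("evaluated")
    return 0.5


first = should_repartition(2.0, accuracy, 1.0, 1.0)   # variance too high
second = should_repartition(0.5, accuracy, 1.0, 1.0)  # both within limits
```

In the first call the accuracy measure is never evaluated, which mirrors the "first ... and second, if ..." ordering of claim 4.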

9. An apparatus configured to create speech templates for use in a speaker-independent speech recognition system, the apparatus comprising:
- means for segmenting each utterance of a first plurality of utterances to generate a plurality of time-clustered segments for each utterance, each time-clustered segment being represented by a spectral mean;
  
  means for quantizing the plurality of spectral means for all of the first plurality of utterances to generate a plurality of template vectors;
  
  means for using a dynamic time warping calculation to compare each one of the plurality of template vectors with a second plurality of utterances to generate at least one comparison result;
  
  means for matching the first plurality of utterances with the plurality of template vectors if the at least one comparison result exceeds at least one predefined threshold value, to generate an optimal matching path result;
  
  means for partitioning the first plurality of utterances in time in accordance with the optimal matching path result; and
  
  means for repeating the quantizing, comparing, matching, and partitioning until the at least one comparison result does not exceed any at least one predefined threshold value.
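
The quantizing element pools the spectral means from all training utterances into a set of template vectors. As a rough illustration, the sketch below averages the k-th segment mean across utterances to form the k-th template vector. This per-segment centroid is a deliberate simplification assumed here for clarity; the claims do not specify the quantizer, and the function name is hypothetical.

```python
def quantize_means(per_utterance_means):
    """Collapse segment means from many utterances into template vectors.

    per_utterance_means[u][k] is the mean vector of segment k in utterance u.
    Assumption: every utterance has the same number of segments, and the
    template vector for segment k is the centroid of that segment's means.
    """
    count = len(per_utterance_means)
    num_segments = len(per_utterance_means[0])
    dim = len(per_utterance_means[0][0])
    templates = []
    for k in range(num_segments):
        centroid = [
            sum(utt[k][d] for utt in per_utterance_means) / count
            for d in range(dim)
        ]
        templates.append(centroid)
    return templates
```

For two utterances with segment means `[[0.0], [4.0]]` and `[[2.0], [6.0]]`, the resulting template vectors are `[[1.0], [5.0]]`.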

10. An apparatus configured to create speech templates for use in a speaker-independent speech recognition system, the apparatus comprising:
- segmentation logic configured to segment each utterance of a first plurality of utterances to generate a plurality of time-clustered segments for each utterance, each time-clustered segment being represented by a spectral mean;
  
  a quantizer coupled to the segmentation logic and configured to quantize the plurality of spectral means for all of the first plurality of utterances to generate a plurality of template vectors;
  
  a convergence tester coupled to the quantizer and configured to compare each one of the plurality of template vectors with a second plurality of utterances using a dynamic time warping calculation to generate at least one comparison result; and
  
  partitioning logic coupled to the quantizer and the convergence tester, and configured to match the first plurality of utterances with the plurality of template vectors if the at least one comparison result exceeds at least one predefined threshold value, to generate an optimal matching path result, and to partition the first plurality of utterances in time in accordance with the optimal matching path result, wherein the quantizer, the convergence tester, and the partitioning logic are further configured to repeat the quantizing, comparing, matching, and partitioning until the at least one comparison result does not exceed any at least one predefined threshold value.
  - 11. The apparatus of claim 10, wherein the at least one comparison result is a variance measure.
  - 12. The apparatus of claim 10, wherein the at least one comparison result is an accuracy measure.
  - 13. The apparatus of claim 10, wherein the at least one comparison result is a variance measure and an accuracy measure, and wherein the convergence tester is configured to first calculate the variance measure, and second, if the variance measure does not exceed a first predefined threshold value, calculate the accuracy measure.
  - 14. The apparatus of claim 13, wherein the matching comprises matching the first utterance with the plurality of template vectors if either the variance measure exceeds the first predefined threshold value or the accuracy measure exceeds a second predefined threshold value.
  - 15. The apparatus of claim 10, wherein the partitioning logic is configured to perform a dynamic time warping computation.
  - 16. The apparatus of claim 10, wherein the partitioning logic comprises K-means speech segmentation logic.
  - 17. The apparatus of claim 10, further comprising an endpoint detector coupled to the segmentation logic and configured to detect endpoints of the first utterance.
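
The segmentation logic turns each utterance into a fixed number of time-clustered segments, each represented by a spectral mean. The sketch below shows one simple way to produce such a representation, assuming an utterance is a list of per-frame feature vectors and assuming a uniform split in time as the starting partition. The uniform split and the function name are assumptions for illustration; the claims do not say how the initial segments are chosen.

```python
def segment_means(frames, num_segments):
    """Split frames into contiguous segments and return each segment's mean.

    Assumption: the initial partition is uniform in time. The iterative
    procedure in the claims would later move these boundaries.
    """
    n = len(frames)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and len(frames)")
    means = []
    for k in range(num_segments):
        # Integer arithmetic keeps segments contiguous and non-empty.
        start = k * n // num_segments
        end = (k + 1) * n // num_segments
        segment = frames[start:end]
        dim = len(segment[0])
        means.append([
            sum(frame[d] for frame in segment) / len(segment)
            for d in range(dim)
        ])
    return means
```

Four one-dimensional frames `[[0.0], [2.0], [4.0], [6.0]]` split into two segments give the means `[[1.0], [5.0]]`.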

18. An apparatus configured to create speech templates for use in a speaker-independent speech recognition system, the apparatus comprising:
- a processor, and a storage medium coupled to the processor and containing a set of instructions executable by the processor to segment each utterance of a first plurality of utterances to generate a plurality of time-clustered segments for each utterance, each time-clustered segment being represented by a spectral mean, quantize the plurality of spectral means for all of the first plurality of utterances to generate a plurality of template vectors, compare each one of the plurality of template vectors with a second plurality of utterances using a dynamic time warping calculation to generate at least one comparison result, match the first plurality of utterances with the plurality of template vectors if the at least one comparison result exceeds at least one predefined threshold value, to generate an optimal matching path result, partition the first plurality of utterances in time in accordance with the optimal matching path result, and repeat the quantizing, comparing, matching, and partitioning until the at least one comparison result does not exceed any at least one predefined threshold value.
  - 19. The apparatus of claim 18, wherein the at least one comparison result is a variance measure.
  - 20. The apparatus of claim 18, wherein the at least one comparison result is an accuracy measure.
  - 21. The apparatus of claim 18, wherein the at least one comparison result is a variance measure and an accuracy measure, and wherein the set of instructions is executable by the processor to first calculate the variance measure, and second, if the variance measure does not exceed a first predefined threshold value, calculate the accuracy measure.
  - 22. The apparatus of claim 21, wherein the set of instructions is further executable by the processor to match the first utterance with the plurality of template vectors if either the variance measure exceeds the first predefined threshold value or the accuracy measure exceeds a second predefined threshold value.
  - 23. The apparatus of claim 18, wherein the set of instructions is executable by the processor to match the first utterance with the plurality of template vectors by performing a dynamic time warping computation.
  - 24. The apparatus of claim 18, wherein the set of instructions is executable by the processor to partition the first utterance by performing a K-means speech segmentation computation.
  - 25. The apparatus of claim 18, wherein the set of instructions is further executable by the processor to detect endpoints of the first utterance.

26. A processor-readable medium containing a set of instructions executable by a processor to:
- segment each utterance of a first plurality of utterances to generate a plurality of time-clustered segments for each utterance, each time-clustered segment being represented by a spectral mean;
  
  quantize the plurality of spectral means for all of the first plurality of utterances to generate a plurality of template vectors;
  
  compare each one of the plurality of template vectors with a second plurality of utterances using a dynamic time warping calculation to generate at least one comparison result;
  
  match the first plurality of utterances with the plurality of template vectors if the at least one comparison result exceeds at least one predefined threshold value, to generate an optimal matching path result;
  
  partition the first plurality of utterances in time in accordance with the optimal matching path result; and
  
  repeat the quantizing, comparing, matching, and partitioning until the at least one comparison result does not exceed any at least one predefined threshold value.
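
The match and partition steps can be pictured as aligning a training utterance's frames with the current template vectors and then cutting the utterance wherever the alignment moves to the next template vector. The sketch below is a simplified monotonic alignment in that spirit, assuming squared Euclidean frame distance and that every template vector covers at least one frame. The function name and these constraints are assumptions, not the patent's stated procedure.

```python
def partition_by_alignment(frames, templates):
    """Label each frame with the template index on the lowest-cost path.

    Labels are non-decreasing, start at 0, and end at len(templates) - 1,
    so runs of equal labels give the new time partition of the utterance.
    """
    n, k = len(frames), len(templates)
    if k > n:
        raise ValueError("need at least one frame per template vector")
    inf = float("inf")

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    cost = [[inf] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    cost[0][0] = dist(frames[0], templates[0])
    for i in range(1, n):
        for j in range(k):
            stay = cost[i - 1][j]                       # same template
            advance = cost[i - 1][j - 1] if j else inf  # next template
            if advance < stay:
                best, back[i][j] = advance, j - 1
            else:
                best, back[i][j] = stay, j
            if best < inf:
                cost[i][j] = best + dist(frames[i], templates[j])
    # Trace the optimal path backwards from the last frame and template.
    labels = [0] * n
    j = k - 1
    for i in range(n - 1, -1, -1):
        labels[i] = j
        j = back[i][j]
    return labels
```

With frames `[[0.0], [0.1], [5.0], [5.2], [4.9]]` and templates `[[0.0], [5.0]]`, the boundary falls after the second frame. Recomputing each segment's mean from the new partition would then feed the next quantizing pass.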

Specification

Current Assignee: Qualcomm, Inc.
Original Assignee: Qualcomm, Inc.
Inventors: Bi, Ning
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Lerner, Martin
Application Number: US09/615,572
Time in Patent Office: 1,398 Days
Field of Search: 704/238, 704/239, 704/241, 704/245, 704/253, 704/236, 704/254
US Class Current: 704/241
CPC Class Codes: G10L 15/063 Training
