Speech recognition system using normalized voiced segment spectrogram analysis

US 7,233,899 B2
Filed: 03/07/2002
Issued: 06/19/2007
Est. Priority Date: 03/12/2001
Status: Expired due to Fees

Abstract

Computer comparison of one or more dictionary entries with a sound record of a human utterance to determine whether and where each dictionary entry is contained within the sound record. The record is segmented; for each vocalized segment a spectrogram is obtained, and for other segments symbolic and numeric data are obtained. The spectrogram of a vocalized segment is then processed using a method selected from a group consisting of a triple time transform, a triple frequency transform, a linear-piecewise-linear transform, and combinations thereof, to decrease noise and to eliminate variations in pronunciation. Each entry in the dictionary is then compared with every sequence of segments of substantially the same length in the sound record. The comparison takes into account the formant profiles within each vocalized segment and the symbolic and numeric data obtained for other segments in the record and in the dictionary entries.
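
As a rough illustration of the comparison step in the abstract (each dictionary entry is tested against every stretch of segments of substantially the same length), here is a minimal Python sketch. The names `candidate_stretches` and `locate_entry`, the `tolerance` parameter, the segment labels, and the equality-based `matches` callback are all hypothetical; the patent text does not specify them.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def candidate_stretches(n_segments: int, expected_len: int,
                        tolerance: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for each contiguous stretch of segments whose
    length is within `tolerance` of `expected_len` (end is exclusive)."""
    shortest = max(1, expected_len - tolerance)
    for length in range(shortest, expected_len + tolerance + 1):
        # range() is empty when the stretch is longer than the record.
        for start in range(n_segments - length + 1):
            yield start, start + length


def locate_entry(segments: Sequence[str], entry: Sequence[str],
                 matches: Callable[[Sequence[str], Sequence[str]], bool],
                 tolerance: int = 0) -> List[Tuple[int, int]]:
    """Return every stretch of `segments` that `matches` the entry."""
    return [(start, end)
            for start, end in candidate_stretches(len(segments), len(entry), tolerance)
            if matches(segments[start:end], entry)]


# Hypothetical segment labels; a real comparison would score spectrograms
# and symbolic data rather than test labels for equality.
record = ["pause", "vowel", "fricative", "vowel", "pause"]
entry = ["vowel", "fricative", "vowel"]
hits = locate_entry(record, entry, lambda a, b: list(a) == list(b))
print(hits)  # [(1, 4)]
```

The `matches` callback is kept separate so that the window enumeration does not depend on how a stretch is actually scored.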

27 Citations

41 Claims

1. A data processing method for recognizing a sound record of a human utterance, comprising:
- dividing the sound record into a sequence of one or more segments;
  
  comparing a plurality of dictionary entries with the sound record, each dictionary entry being incrementally compared with a continuous stretch of segments of the sound record; and
  
  wherein vocalized parts of the sound record are represented as a spectrogram, optimized for comparison with the dictionary entries using a method selected from a group consisting of a triple time transform, a triple frequency transform, a linear-piecewise-linear transform, and combinations thereof.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
- - 2. The method of claim 1 further comprising:
    - for a dictionary entry, determining an expected number of segments associated with the dictionary entry.
  - 3. The method of claim 2 wherein comparing comprises:
    - for the dictionary entry, testing each continuous stretch of the sound record having a segment length substantially equal to the expected number of segments.
  - 4. The method of claim 1 wherein dividing is based on phonemes.
  - 5. The method of claim 1 wherein dividing includes detecting segments comprising at least one of the following types:
    - vowel stressed, vowel unstressed, adjacent voiced consonant, voiced fricative, voiceless fricative, voiced plosive, voiceless plosive, pause or unrecognized.
  - 6. The method of claim 1 wherein the triple time transform comprises:
    - scaling the vocalized parts of the sound record by a scaling factor in the time dimension;
      
      obtaining a spectrogram of the scaled vocalized parts of the sound record using a method optimized for a reference frequency;
      
      scaling the spectrogram by the inverse of the scaling factor in the time dimension; and
      
      scaling the spectrogram by the scaling factor in the frequency dimension.
  - 7. The method of claim 6 further comprising:
    - calculating a characteristic pitch frequency of the scaled vocalized parts of the sound record; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 8. The method of claim 6 wherein the scaling the vocalized parts of the sound record comprises scaling a sound record of a voiced segment of a human utterance.
  - 9. The method of claim 8 further comprising:
    - selecting a characteristic formant within the voiced segment;
      
      calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 10. The method of claim 1 wherein the triple frequency transform comprises:
    - obtaining a scaled set of frequencies by multiplying each frequency in a reference set of frequencies by a scaling factor; and
      
      obtaining a spectrogram of a sound record using the scaled set of frequencies.
  - 11. The method of claim 10 further comprising:
    - calculating a characteristic pitch frequency of a sound for the sound record; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 12. The method of claim 10 wherein obtaining the spectrogram of the sound record comprises scaling a spectrogram of a sound record of a voiced segment of a human utterance.
  - 13. The method of claim 12 further comprising:
    - selecting a characteristic formant within the voiced segment;
      
      calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 14. The method of claim 1 where the linear-piecewise-linear transform comprises:
    - scaling an analyzed spectrogram in the time and frequency dimensions using a scaling factor;
      
      dividing the scaled spectrogram into one or more non-overlapping formant areas, each formant area essentially spanning the duration of the scaled spectrograph;
      
      for each formant area, calculating a characteristic frequency;
      
      for each formant area, choosing a reference frequency from at least one reference frequency associated with a prototype continuous voiced segment;
      
      moving each formant area along the frequency axis on the spectrogram so that the characteristic frequency of each formant area in its moved state is equal to the reference frequency chosen for the formant area;
      
      assigning to each overlap point on the spectrogram, where a plurality of formant areas overlap after the movement of the formant areas, a value equal to an average of the spectrogram values of the overlapping formant areas at the overlap point after the movement;
      
      locating on the spectrogram a gap point to which no value is assigned after the movement of the formant areas;
      
      for the gap point, choosing a non-gap point; and
      
      assigning to the gap point a value equal to the value of the spectrogram at the non-gap point.
  - 15. The method of claim 14 wherein choosing the non-gap point for the gap point comprises choosing a non-gap point on the spectrogram at the same time as and at higher frequency than the gap point, so that all the points on a straight line connecting the gap point and the chosen non-gap point on the spectrogram are gap points.
  - 16. The method of claim 14 further comprising calculating the scaling factor by comparing the duration of the spectrogram with the duration of the prototype continuous voiced segment.
  - 17. The method of claim 14 wherein each formant area includes only one formant crest.
  - 18. The method of claim 14 wherein each formant crest spans the entire duration of the spectrogram.
  - 19. The method of claim 14 wherein the border between any two adjacent formant areas is equidistant from formant crests in the adjacent formant areas.
  - 20. The method of claim 1, wherein comparing includes comparing the spectrogram of a continuous voiced segment with a prototype continuous voiced segment by:
    - locating one or more formants on an analyzed spectrogram;
      
      calculating a characteristic frequency for each formant;
      
      assigning to each formant on the analyzed spectrogram a corresponding formant in a prototype continuous voiced segment; and
      
      for each characteristic frequency, determining whether the characteristic frequency falls within a frequency interval associated with the corresponding formant.
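
The formant test recited in claim 20 can be sketched as follows. This is only a sketch under stated assumptions: analyzed formants are paired with prototype formants by rank order (lowest frequency first), each prototype formant carries a closed frequency interval in Hz, and the function name `formants_match` and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple


def formants_match(characteristic_hz: Sequence[float],
                   prototype_intervals: Sequence[Tuple[float, float]]) -> bool:
    """Return True when every characteristic frequency falls inside the
    interval of its corresponding prototype formant.

    Assumption: correspondence is by position, so the i-th analyzed
    formant is assigned to the i-th prototype formant.
    """
    if len(characteristic_hz) != len(prototype_intervals):
        return False  # no one-to-one assignment is possible
    return all(low <= freq <= high
               for freq, (low, high) in zip(characteristic_hz, prototype_intervals))


# Illustrative prototype with two formant intervals (values are made up).
prototype: List[Tuple[float, float]] = [(600.0, 900.0), (1000.0, 1400.0)]

print(formants_match([730.0, 1090.0], prototype))   # True
print(formants_match([300.0, 2300.0], prototype))   # False
print(formants_match([730.0], prototype))           # False
```

A pass/fail test like this mirrors the wording "determining whether the characteristic frequency falls within a frequency interval"; a graded score would be a different design.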

21. A data processing system for recognizing a sound record of a human utterance, comprising:
- a segmentation engine for dividing the sound record into a sequence of one or more segments;
  
  a comparison engine for comparing a plurality of dictionary entries with the sound record, each dictionary entry being incrementally compared with a continuous stretch of segments of the sound record; and
  
  wherein vocalized parts of the sound record are represented as a spectrogram, optimized for comparison with the dictionary entries using a method selected from a group consisting of a triple time transform, a triple frequency transform, a linear-piecewise-linear transform and combinations thereof.
- Dependent claims (22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40)
- - 22. The system of claim 21 further comprising:
    - for a dictionary entry, an algorithm for determining an expected number of segments associated with the dictionary entry.
  - 23. The system of claim 22 wherein the comparison engine comprises:
    - for the dictionary entry process, testing each continuous stretch of the sound record having a segment length substantially equal to the expected number of segments.
  - 24. The system of claim 21 wherein the segmentation engine divides based on phonemes.
  - 25. The system of claim 21 wherein the segmentation engine detects segments comprising at least one of the following types:
    - vowel stressed, vowel unstressed, adjacent voiced consonant, voiced fricative, voiceless fricative, voiced plosive, voiceless plosive, pause, or unrecognized.
  - 26. The system of claim 21 wherein the triple time transform comprises:
    - a scaling factor for scaling the vocalized parts of the sound record in the time dimension;
      
      a spectrogram of the scaled vocalized parts of the sound record optimized for a reference frequency;
      
      an algorithm for scaling the spectrogram by the inverse of the scaling factor in the time dimension; and
      
      an algorithm for scaling the spectrogram by the scaling factor in the frequency dimension.
  - 27. The system of claim 26 further comprising:
    - an algorithm for calculating a characteristic pitch frequency of sound for the sound record; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 28. The system of claim 26 wherein the scaling the vocalized parts of the sound record comprises scaling a sound record of a voiced segment of a human utterance.
  - 29. The system of claim 28 further comprising:
    - a characteristic formant selected from within the voiced segment;
      
      an algorithm for calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 30. The system of claim 21 wherein the triple frequency transform comprises:
    - a scaled set of frequencies obtained by multiplying each frequency in a reference set of frequencies by a scaling factor; and
      
      a spectrogram of a vocalized segment of a sound record obtained using the scaled set of frequencies.
  - 31. The system of claim 30 further comprising:
    - an algorithm for calculating a characteristic pitch frequency of sound for the sound record; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 32. The system of claim 30 wherein the spectrogram of the vocalized segment of the sound record comprises a spectrogram of a sound record of a voiced segment of a human utterance.
  - 33. The system of claim 32 further comprising:
    - a characteristic formant within the voiced segment;
      
      an algorithm for calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 34. The system of claim 21 wherein the linear-piecewise-linear transform comprises:
    - an analyzed spectrogram scaled in the time and frequency dimensions by a scaling factor;
      
      an algorithm for dividing the scaled spectrogram into one or more non-overlapping formant areas, each formant area essentially spanning the duration of the scaled spectrograph;
      
      for each area, a calculated characteristic frequency;
      
      for each formant area, a reference frequency chosen from at least one reference frequency associated with a prototype continuous voiced segment;
      
      an algorithm for moving each formant area along the frequency axis on the spectrogram so that the characteristic frequency of each formant area in its moved state is equal to the reference frequency chosen for the formant area;
      
      an algorithm for assigning to each overlap point on the spectrogram, where a plurality of formant areas overlap after the movement of the formant areas, a value equal to an average of the spectrogram values of the overlapping formant areas at the overlap point after the movement;
      
      an algorithm for locating on the spectrogram a gap point to which no value is assigned after the movement of the formant areas;
      
      for the gap point a chosen non-gap point; and
      
      a value assigned to the gap point equal to the value of the spectrogram at the non-gap point.
  - 35. The system of claim 34 wherein the non-gap point chosen for the gap point comprises a non-gap point on the spectrogram at the same time as and at higher frequency than the gap point, so that all the points on a straight line connecting the gap point and the chosen non-gap point on the spectrogram are gap points.
  - 36. The system of claim 34 further comprising calculating the scaling factor by comparing the duration of the spectrogram with the duration of the prototype continuous voiced segment.
  - 37. The system of claim 34 wherein each formant area includes only one formant crest.
  - 38. The system of claim 34 wherein each formant crest spans the entire duration of the spectrogram.
  - 39. The system of claim 34 wherein the border between any two adjacent formant areas is equidistant from formant crests in the adjacent formant areas.
  - 40. The system of claim 21 wherein the comparison includes a comparison of spectrogram of a continuous voiced segment with a prototype continuous voiced segment by:
    - locating one or more formants on an analyzed spectrogram;
      
      calculating a characteristic frequency for each formant;
      
      assigning to each formant on the analyzed spectrogram a corresponding formant in a prototype continuous voiced segment; and
      
      for each characteristic frequency, determining whether the characteristic frequency falls within a frequency interval associated with the corresponding formant.
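
The overlap-averaging and gap-filling steps of the linear-piecewise-linear transform (claims 14 and 15, mirrored in claims 34 and 35) can be sketched for a single time frame as below. The assumptions are loud ones: the frame is a plain list indexed by frequency bin, each formant area is a hypothetical `(start_bin, end_bin, shift)` triple with an integer shift already chosen, and gap bins with no non-gap bin above them are left as `None`, since the claims do not say how those are handled.

```python
from typing import List, Optional, Sequence, Tuple


def shift_and_merge(frame: Sequence[float],
                    areas: Sequence[Tuple[int, int, int]]) -> List[Optional[float]]:
    """Move formant areas along the frequency axis within one time frame.

    frame: spectrogram values for one time step, indexed by frequency bin.
    areas: (start_bin, end_bin_exclusive, shift_in_bins) per formant area.
    """
    n = len(frame)
    sums = [0.0] * n
    counts = [0] * n
    for start, end, shift in areas:
        for source_bin in range(start, end):
            target_bin = source_bin + shift
            if 0 <= target_bin < n:          # drop values moved off the axis
                sums[target_bin] += frame[source_bin]
                counts[target_bin] += 1

    # Overlap points get the average of the areas that landed on them;
    # bins that received nothing are gap points (None for now).
    merged: List[Optional[float]] = [
        total / count if count else None for total, count in zip(sums, counts)
    ]

    # Fill each gap from the nearest non-gap bin at higher frequency,
    # scanning from the top bin downward.
    fill: Optional[float] = None
    for b in range(n - 1, -1, -1):
        if merged[b] is None:
            merged[b] = fill
        else:
            fill = merged[b]
    return merged


frame = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Lower area (bins 0-2) moves up one bin; upper area (bins 3-5) moves down one.
result = shift_and_merge(frame, [(0, 3, 1), (3, 6, -1)])
print(result)  # [1.0, 1.0, 3.0, 4.0, 6.0, None]
```

In the example, bins 2 and 3 are overlap points and receive averages, bin 0 is a gap filled from bin 1, and bin 5 stays `None` because no higher-frequency non-gap bin exists.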

41. A computer program product comprising:
- A computer usable medium; and
  
  A data processing method stored on the medium for recognizing a sound record of a human utterance, comprising computer instructions for:
  
  dividing the sound record into a sequence of one or more segments;
  
  comparing a plurality of dictionary entries with the sound record, each dictionary entry being incrementally compared with a continuous stretch of segments of the sound record; and
  
  wherein vocalized parts of the sound record are represented as a spectrogram, optimized for comparison with the dictionary entries using a method selected from a group consisting of a triple time transform, a triple frequency transform, a linear-piecewise-linear transform, and combinations thereof.
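
The time and frequency scalings of the triple time transform (claims 6 and 26) undo each other in a simple way: stretching a signal in time by a factor k divides its frequencies by k, so multiplying the measured frequencies by k restores them. The sketch below checks that arithmetic on a pure tone. It is not the patented transform: the linear-interpolation resampler, the brute-force single-frequency DFT probe, and the choice k = pitch / reference are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def stretch(signal, k):
    """Resample by linear interpolation so the signal lasts k times as long
    at the same sample rate (k > 1 slows it down and lowers its pitch)."""
    last = len(signal) - 1
    out = []
    for i in range(int(round(len(signal) * k))):
        pos = i / k                      # position in the original signal
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return out


def strongest_frequency(signal, sample_rate, candidates_hz):
    """Return the candidate frequency with the largest DFT power."""
    def power(freq):
        w = 2.0 * math.pi * freq / sample_rate
        re = sum(x * math.cos(w * n) for n, x in enumerate(signal))
        im = sum(x * math.sin(w * n) for n, x in enumerate(signal))
        return re * re + im * im
    return max(candidates_hz, key=power)


sample_rate = 8000
pitch_hz, reference_hz = 200.0, 100.0
tone = [math.sin(2.0 * math.pi * pitch_hz * n / sample_rate) for n in range(800)]

k = pitch_hz / reference_hz          # assumed definition of the scaling factor
stretched = stretch(tone, k)         # scale the sound by k in the time dimension
measured = strongest_frequency(stretched, sample_rate, range(50, 401, 10))
restored = measured * k              # scale the frequency axis back by k
# A frame at time t in the stretched signal maps back to time t / k.
print(len(stretched), measured, restored)  # 1600 100 200.0
```

After stretching, the tone sits at the reference frequency, which is where an analysis "optimized for a reference frequency" would be applied; the final multiplication maps the result back to the original frequency scale.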

Current Assignee
Fain Systems, Inc.
Original Assignee
Fain Systems, Inc.
Inventors
Fain, Samuel V., Fain, Vitaliy S.
Primary Examiner(s)
Šmits; Talivaldis Ivars
Assistant Examiner(s)
Pierre; Myriam

Application Number

US10/094,696
Publication Number

US 20020128834A1
Time in Patent Office

1,930 Days
Field of Search

704/251, 704/208, 704/254, 381/94.2
US Class Current

704/251
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 15/10   using distance or distortio...

G10L 15/22   Procedures used during a sp...

G10L 2015/025   Phonemes, fenemes or fenone...
