Speech recognition system using spectrogram analysis

US 20020128834A1
Filed: 03/07/2002
Published: 09/12/2002
Est. Priority Date: 03/12/2001
Status: Active Grant

Abstract

Computer comparison of one or more dictionary entries with a sound record of a human utterance to determine whether and where each dictionary entry is contained within the sound record. The record is segmented; for each vocalized segment a spectrogram is obtained, and for the other segments symbolic and numeric data are obtained. The spectrogram of a vocalized segment is then processed to decrease noise and to eliminate variations in pronunciation. Each entry in the dictionary is then compared with every sequence of segments of substantially the same length in the sound record. The comparison takes into account the formant profiles within each vocalized segment, together with the symbolic and numeric data obtained for the other segments, both in the record and in the dictionary entries.
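
One way the claims approach pronunciation variation is the triple frequency transform of claims 14 and 15, in which a reference set of analysis frequencies is multiplied by a scaling factor derived from the speaker's pitch. The following is a minimal sketch under stated assumptions: the claims say only that the characteristic pitch frequency is "compared" with the reference pitch frequency, so the plain ratio used here, the function names, and the sample values are all hypothetical.

```python
from typing import List, Sequence


def scaling_factor(characteristic_pitch_hz: float, reference_pitch_hz: float) -> float:
    """Assumed form of the comparison: ratio of speaker pitch to reference pitch."""
    if characteristic_pitch_hz <= 0 or reference_pitch_hz <= 0:
        raise ValueError("pitch frequencies must be positive")
    return characteristic_pitch_hz / reference_pitch_hz


def scaled_frequencies(reference_frequencies_hz: Sequence[float], factor: float) -> List[float]:
    """Multiply each frequency in the reference set by the scaling factor."""
    return [f * factor for f in reference_frequencies_hz]


# Hypothetical values: a speaker with a 200 Hz pitch and a 100 Hz reference pitch.
factor = scaling_factor(200.0, 100.0)
analysis_frequencies = scaled_frequencies([250.0, 500.0, 1000.0], factor)
```

A spectrogram would then be computed at `analysis_frequencies` rather than at the reference set; that step is outside this sketch.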

50 Claims

1. A data processing method for recognizing a sound record of a human utterance, comprising:
- dividing the sound record into a sequence of one or more segments; and
  
  comparing a plurality of dictionary entries with the sound record, each dictionary entry being incrementally compared with a continuous stretch of segments of the sound record.
- - 2. The method of claim 1 further comprising:
    - for a dictionary entry, determining an expected number of segments associated with the dictionary entry.
  - 3. The method of claim 2 wherein comparing comprises:
    - for the dictionary entry, testing each continuous stretch of the sound record having a segment length substantially equal to the expected number of segments.
  - 4. The method of claim 1 wherein dividing is based on phonemes.
  - 5. The method of claim 1 wherein dividing includes detecting segments comprising at least one of the following types:
    - vowel stressed, vowel unstressed, adjacent voiced consonant, voiced fricative, voiceless fricative, voiced plosive, voiceless plosive, pause, or unrecognized.
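
The incremental comparison of claims 1 to 3 can be pictured as a sliding window over the segment sequence. The sketch below is illustrative only: it assumes a dictionary entry is stored as a list of segment-type labels, takes the expected number of segments to be the length of that list, and models "substantially equal" with a hypothetical `tolerance` parameter. The `matches` callback stands in for whatever per-stretch comparison is actually used.

```python
from typing import Callable, List, Sequence, Tuple


def candidate_stretches(num_segments: int, expected: int, tolerance: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of every continuous stretch whose
    length is within `tolerance` of the expected number of segments."""
    stretches = []
    for length in range(max(1, expected - tolerance), expected + tolerance + 1):
        for start in range(0, num_segments - length + 1):
            stretches.append((start, start + length))
    return stretches


def find_entry(
    segments: Sequence[str],
    entry: Sequence[str],
    matches: Callable[[Sequence[str], Sequence[str]], bool],
    tolerance: int = 0,
) -> List[Tuple[int, int]]:
    """Test the entry against each candidate stretch in turn and return the
    (start, end) pairs of the stretches that match."""
    hits = []
    for start, end in candidate_stretches(len(segments), len(entry), tolerance):
        if matches(segments[start:end], entry):
            hits.append((start, end))
    return hits


# Hypothetical sound record and dictionary entry, using the segment types of claim 5.
record = ["pause", "voiceless plosive", "vowel stressed", "voiced plosive", "pause"]
entry = ["voiceless plosive", "vowel stressed", "voiced plosive"]
hits = find_entry(record, entry, lambda stretch, e: list(stretch) == list(e))
print(hits)  # [(1, 4)]
```

Exact label equality is used here only to keep the example short; it is not the comparison described in the claims.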

6. A data processing method for recognizing a sound record of a human utterance, comprising:
- dividing the sound record into a sequence of at least one segment;
  
  processing a plurality of stored dictionary entries against the sound record, comprising for each of a plurality of dictionary entries;
  
  determining an expected number of segments associated with the dictionary entry; and
  
  comparing the dictionary entry against the sound record by incrementally testing each continuous stretch of the sound record having a length substantially equal to the expected number of segments.
- - 7. The method of claim 6 wherein dividing is based on phonemes.
  - 8. The method of claim 6 wherein dividing includes detecting segments comprising at least one of the following types:
    - vowel stressed, vowel unstressed, adjacent voiced consonant, voiced fricative, voiceless fricative, voiced plosive, voiceless plosive, pause, or unrecognized.
  - 9. The method of claim 6 wherein vocalized parts of the sound record are represented as a spectrogram, optimized for comparison with the dictionary entries using at least one of:
    - a triple time transform and a linear-piecewise-linear transform, or a triple frequency transform and a linear-piecewise-linear transform.
  - 10. The method of claim 9 wherein the triple time transform comprises:
    - scaling the vocalized parts of the sound record by a scaling factor in the time dimension;
      
      obtaining a spectrogram of the scaled vocalized parts of the sound record using a method optimized for a reference frequency;
      
      scaling the spectrogram by the inverse of the scaling factor in the time dimension; and
      
      scaling the spectrogram by the scaling factor in the frequency dimension.
  - 11. The method of claim 10 further comprising:
    - calculating a characteristic pitch frequency of the scaled vocalized parts of the sound record; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 12. The method of claim 10 wherein the scaling the vocalized parts of the sound record comprises scaling a sound record of a voiced segment of a human utterance.
  - 13. The method of claim 12 further comprising:
    - selecting a characteristic formant within the voiced segment;
      
      calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 14. The method of claim 9 wherein the triple frequency transform comprises:
    - obtaining a scaled set of frequencies by multiplying each frequency in a reference set of frequencies by a scaling factor; and
      
      obtaining a spectrogram of a sound record using the scaled set of frequencies.
  - 15. The method of claim 14 further comprising:
    - calculating a characteristic pitch frequency of sound for the sound record; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 16. The method of claim 14 wherein obtaining the spectrogram of the sound record comprises scaling a spectrogram of a sound record of a voiced segment of a human utterance.
  - 17. The method of claim 16 further comprising:
    - selecting a characteristic formant within the voiced segment;
      
      calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 18. The method of claim 9 wherein the linear-piecewise-linear transform comprises:
    - scaling an analyzed spectrogram in the time and frequency dimensions using a scaling factor;
      
      dividing the scaled spectrogram into one or more non-overlapping formant areas, each formant area essentially spanning the duration of the scaled spectrograph;
      
      for each formant area, calculating a characteristic frequency;
      
      for each formant area, choosing a reference frequency from at least one reference frequency associated with a prototype continuous voiced segment;
      
      moving each formant area along the frequency axis on the spectrogram so that the characteristic frequency of each formant area in its moved state is equal to the reference frequency chosen for the formant area;
      
      assigning to each overlap point on the spectrogram, where a plurality of formant areas overlap after the movement of the formant areas, a value equal to an average of the spectrogram values of the overlapping formant areas at the overlap point after the movement;
      
      locating on the spectrogram a gap point to which no value is assigned after the movement of the formant areas;
      
      for the gap point, choosing a non-gap point; and
      
      assigning to the gap point a value equal to the value of the spectrogram at the non-gap point.
  - 19. The method of claim 18 wherein choosing the non-gap point for the gap point comprises choosing a non-gap point on the spectrogram at the same time as and at higher frequency than the gap point, so that all the points on a straight line connecting the gap point and the chosen non-gap point on the spectrogram are gap points.
  - 20. The method of claim 18 further comprising calculating the scaling factor by comparing the duration of the spectrogram with the duration of the prototype continuous voiced segment.
  - 21. The method of claim 18 wherein each formant area includes only one formant crest.
  - 22. The method of claim 18 wherein each formant crest spans the entire duration of the spectrogram.
  - 23. The method of claim 18 wherein the border between any two adjacent formant areas is equidistant from formant crests in the adjacent formant areas.
  - 24. The method of claim 9, wherein comparing includes comparing the spectrogram of a continuous voiced segment with a prototype continuous voiced segment by:
    - locating one or more formants on an analyzed spectrogram;
      
      calculating a characteristic frequency for each formant;
      
      assigning to each formant on the analyzed spectrogram a corresponding formant in a prototype continuous voiced segment; and
      
      for each characteristic frequency, determining whether the characteristic frequency falls within a frequency interval associated with the corresponding formant.

25. A data processing system for recognizing a sound record of a human utterance, comprising:
- a segmentation engine for dividing the sound record into a sequence of one or more segments; and
  
  a comparison engine for comparing a plurality of dictionary entries with the sound record, each dictionary entry being incrementally compared with a continuous stretch of segments of the sound record.
- - 26. The system of claim 25 further comprising:
    - for a dictionary entry, an algorithm for determining an expected number of segments associated with the dictionary entry.
  - 27. The system of claim 26 wherein the comparison engine comprises:
    - for the dictionary entry process, testing each continuous stretch of the sound record having a segment length substantially equal to the expected number of segments.
  - 28. The system of claim 25 wherein the segmentation engine divides based on phonemes.
  - 29. The system of claim 25 wherein the segmentation engine detects segments comprising at least one of the following types:
    - vowel stressed, vowel unstressed, adjacent voiced consonant, voiced fricative, voiceless fricative, voiced plosive, voiceless plosive, pause, or unrecognized.

30. A data processing system for recognizing a sound record of a human utterance, comprising:
- a segmentation engine for dividing the sound record into a sequence of at least one segment;
  
  an algorithm for processing a plurality of stored dictionary entries against the sound record, comprising for each of a plurality of dictionary entries;
  
  determining an expected number of segments associated with the dictionary entry; and
  
  comparing the dictionary entry against the sound record by incrementally testing each continuous stretch of the sound record having a length substantially equal to the expected number of segments.
- - 31. The system of claim 30 wherein the segmentation engine divides based on phonemes.
  - 32. The system of claim 30 wherein the segmentation engine detects segments comprising at least one of the following types:
    - vowel stressed, vowel unstressed, adjacent voiced consonant, voiced fricative, voiceless fricative, voiced plosive, voiceless plosive, pause, or unrecognized.
  - 33. The system of claim 30 wherein vocalized parts of the sound record are represented as a spectrogram, optimized for comparison with the dictionary entries using at least one of:
    - a triple time transform and a linear-piecewise-linear transform, or a triple frequency transform and a linear-piecewise-linear transform.
  - 34. The system of claim 33 wherein the triple time transform comprises:
    - a scaling factor for scaling the vocalized parts of the sound record in the time dimension;
      
      a spectrogram of the scaled vocalized parts of the sound record optimized for a reference frequency;
      
      an algorithm for scaling the spectrogram by the inverse of the scaling factor in the time dimension; and
      
      an algorithm for scaling the spectrogram by the scaling factor in the frequency dimension.
  - 35. The system of claim 34 further comprising:
    - an algorithm for calculating a characteristic pitch frequency of sound for the sound record; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 36. The system of claim 34 wherein the scaling the vocalized parts of the sound record comprises scaling a sound record of a voiced segment of a human utterance.
  - 37. The system of claim 36 further comprising:
    - a characteristic formant selected from within the voiced segment;
      
      an algorithm for calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 38. The system of claim 33 wherein the triple frequency transform comprises:
    - a scaled set of frequencies obtained by multiplying each frequency in a reference set of frequencies by a scaling factor; and
      
      a spectrogram of a vocalized segment of a sound record obtained using the scaled set of frequencies.
  - 39. The system of claim 38 further comprising:
    - an algorithm for calculating a characteristic pitch frequency of sound for the sound record; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 40. The system of claim 38 wherein the spectrogram of the vocalized segment of the sound record comprises a spectrogram of a sound record of a voiced segment of a human utterance.
  - 41. The system of claim 40 further comprising:
    - a characteristic formant within the voiced segment;
      
      an algorithm for calculating a characteristic frequency of the characteristic formant of the voiced segment; and
      
      an algorithm for calculating the scaling factor by comparing the characteristic pitch frequency with the reference pitch frequency.
  - 42. The system of claim 33 wherein the linear-piecewise-linear transform comprises:
    - an analyzed spectrogram scaled in the time and frequency dimensions by a scaling factor;
      
      an algorithm for dividing the scaled spectrogram into one or more non-overlapping formant areas, each formant area essentially spanning the duration of the scaled spectrograph;
      
      for each formant area, a calculated characteristic frequency;
      
      for each formant area, a reference frequency chosen from at least one reference frequency associated with a prototype continuous voiced segment;
      
      an algorithm for moving each formant area along the frequency axis on the spectrogram so that the characteristic frequency of each formant area in its moved state is equal to the reference frequency chosen for the formant area;
      
      an algorithm for assigning to each overlap point on the spectrogram, where a plurality of formant areas overlap after the movement of the formant areas, a value equal to an average of the spectrogram values of the overlapping formant areas at the overlap point after the movement;
      
      an algorithm for locating on the spectrogram a gap point to which no value is assigned after the movement of the formant areas;
      
      for the gap point, a chosen non-gap point; and
      
      a value assigned to the gap point equal to the value of the spectrogram at the non-gap point.
  - 43. The system of claim 42 wherein the non-gap point chosen for the gap point comprises a non-gap point on the spectrogram at the same time as and at higher frequency than the gap point, so that all the points on a straight line connecting the gap point and the chosen non-gap point on the spectrogram are gap points.
  - 44. The system of claim 42 further comprising calculating the scaling factor by comparing the duration of the spectrogram with the duration of the prototype continuous voiced segment.
  - 45. The system of claim 42 wherein each formant area includes only one formant crest.
  - 46. The system of claim 42 wherein each formant crest spans the entire duration of the spectrogram.
  - 47. The system of claim 42 wherein the border between any two adjacent formant areas is equidistant from formant crests in the adjacent formant areas.
  - 48. The system of claim 33, wherein the comparison includes a comparison of spectrogram of a continuous voiced segment with a prototype continuous voiced segment by:
    - locating one or more formants on an analyzed spectrogram;
      
      calculating a characteristic frequency for each formant;
      
      assigning to each formant on the analyzed spectrogram a corresponding formant in a prototype continuous voiced segment; and
      
      for each characteristic frequency, determining whether the characteristic frequency falls within a frequency interval associated with the corresponding formant.
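
The formant comparison of claims 24 and 48 reduces, at its last step, to an interval test per formant. The sketch below is a hedged illustration rather than the patented procedure: it assumes the formants have already been located and paired by position with the prototype's formants, treats the interval bounds as inclusive, and assumes a segment matches only when every formant passes. The interval values are made up for the example.

```python
from typing import List, Tuple


def formants_match(
    characteristic_hz: List[float],
    prototype_intervals_hz: List[Tuple[float, float]],
) -> bool:
    """Return True when each characteristic frequency falls inside the
    frequency interval of its corresponding prototype formant."""
    if len(characteristic_hz) != len(prototype_intervals_hz):
        # Assumption: an unequal formant count cannot be paired, so no match.
        return False
    return all(
        low <= freq <= high
        for freq, (low, high) in zip(characteristic_hz, prototype_intervals_hz)
    )


# Hypothetical prototype with two formant intervals, in Hz.
prototype = [(600.0, 900.0), (1000.0, 1400.0)]
print(formants_match([730.0, 1090.0], prototype))  # True
print(formants_match([730.0, 1500.0], prototype))  # False
```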

49. A computer program product comprising:
- a computer-usable medium; and
  
  a data processing method stored on the medium for recognizing a sound record of a human utterance, comprising computer instructions for;
  
  dividing the sound record into a sequence of one or more segments; and
  
  comparing a plurality of dictionary entries with the sound record, each dictionary entry being incrementally compared with a continuous stretch of segments of the sound record.

50. A computer program product, comprising:
- a computer-usable medium; and
  
  a data processing method stored on the medium for recognizing a sound record of a human utterance, comprising computer instructions for;
  
  dividing the sound record into a sequence of at least one segment;
  
  processing a plurality of stored dictionary entries against the sound record, comprising for each of a plurality of dictionary entries;
  
  determining an expected number of segments associated with the dictionary entry; and
  
  comparing the dictionary entry against the sound record by incrementally testing each continuous stretch of the sound record having a length substantially equal to the expected number of segments.

Current Assignee: Fain Systems, Inc.
Original Assignee: Fain Systems, Inc.
Inventors: Fain, Samuel V.; Fain, Vitaliy S.
Granted Patent: US 7,233,899 B2
US Class Current: 704/246
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/10   using distance or distortio...
G10L 15/22   Procedures used during a sp...
G10L 2015/025   Phonemes, fenemes or fenone...
