Engine For Speech Recognition

US 20090216535A1
Filed: 02/22/2008
Published: 08/27/2009
Est. Priority Date: 02/22/2008
Status: Abandoned Application

Abstract

A computerized method for speech recognition in a computer system. Reference word segments are stored in memory. The reference word segments when concatenated form spoken words in a language. Each of the reference word segments is a combination of at least two phonemes, including a vowel sound in the language. A temporal speech signal is input and digitized to produce a digitized temporal speech signal. The digitized temporal speech signal is transformed piecewise into the frequency domain to produce a time and frequency dependent transform function. The energy spectral density of the temporal speech signal is proportional to the absolute value squared of the transform function. The energy spectral density is cut into input time segments of the energy spectral density. Each of the input time segments includes at least two phonemes including at least one vowel sound of the temporal speech signal. For each of the input time segments, (i) a fundamental frequency is extracted from the energy spectral density during the input time segment, (ii) a target segment is selected from the reference word segments and thereby a target energy spectral density of the target segment is input, and (iii) a correlation between the energy spectral density during the time segment and the target energy spectral density of the target segment is performed after calibrating the fundamental frequency to the target energy spectral density, thereby improving the correlation.
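
The piecewise transform and energy spectral density described in the abstract can be illustrated with a minimal sketch. It assumes a short-time discrete Fourier transform with a Hann window as the piecewise transform; the abstract does not name a specific transform, and the function names, frame length, and hop size below are illustrative choices rather than anything taken from the application.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split the digitized signal into overlapping frames (the 'piecewise' step)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def dft(frame):
    """Naive discrete Fourier transform, returning bins 0..N/2 of a real frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]


def energy_spectral_density(samples, frame_len=64, hop=32):
    """Return esd[t][k] = |X_t[k]|^2, one row per time frame.

    Assumption: a Hann window is applied to each frame before the transform.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    esd = []
    for frame in frame_signal(samples, frame_len, hop):
        spectrum = dft([w * x for w, x in zip(window, frame)])
        esd.append([abs(c) ** 2 for c in spectrum])
    return esd
```

For example, a 1000 Hz tone sampled at 8000 Hz and analyzed with 64-sample frames has a bin spacing of 125 Hz, so its energy concentrates around bin 8 of each row of `esd`.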

21 Citations

22 Claims

1. A computerized method for speech recognition in a computer system, the method comprising the steps of:
- (a) storing a plurality of reference word segments, wherein said reference word segments when concatenated form a plurality of spoken words in a language;
  
  wherein each of said reference word segments is a combination of at least two phonemes including at least one vowel sound in said language;
  
  (b) inputting and digitizing a temporal speech signal, thereby producing a digitized temporal speech signal;
  
  (c) transforming piecewise said digitized temporal speech signal into the frequency domain, thereby producing a time and frequency dependent transform function;
  
  wherein the energy spectral density of said temporal speech signal is proportional to the absolute value squared of said transform function;
  
  (d) cutting the energy spectral density into a plurality of input time segments of the energy spectral density;
  
  wherein each of said input time segments includes at least two phonemes including at least one vowel sound of the temporal speech signal; and
  
  (e) for each of said input time segments:
  
  (i) extracting a fundamental frequency from the energy spectral density during the input time segment;
  
  (ii) selecting a target segment from the reference word segments thereby inputting a target energy spectral density of said target segment;
  
  (iii) performing a correlation between the energy spectral density during said time segment and said target energy spectral density of said target segment after calibrating said fundamental frequency to said target energy spectral density thereby improving said correlation.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 21):
  - 2. The computerized method, according to claim 1, wherein said time-dependent transform function is dependent on a scale of discrete frequencies, wherein said calibrating is performed by interpolating said fundamental frequency between said discrete frequencies to match the target fundamental frequency.
  - 3. The computerized method, according to claim 1, wherein said fundamental frequency and at least one harmonic frequency of said fundamental frequency form an array of frequencies, wherein said calibrating is performed using a single adjustable parameter which adjusts said array of frequencies, maintaining the relationship between the fundamental frequency and said at least one harmonic frequency, wherein said adjusting includes:
    - (A) multiplying said frequency array by the target energy spectral density of said target segment thereby forming a product; and
      
      (B) adjusting said single adjustable parameter until the product is a maximum.
  - 4. The computerized method, according to claim 1, wherein said fundamental frequency undergoes a monotonic change during the input time segment, wherein said calibrating includes compensating for said monotonic change.
  - 5. The computerized method, according to claim 1, further comprising the step of:
    - (f) classifying said reference word segments into a plurality of classes;
      
      (g) inputting a correlation result of said correlation;
      
      (h) second selecting a second target segment from at least one of said classes based on said correlation result.
  - 6. The computerized method, according to claim 5, wherein said classifying said reference word segments into classes is based on said at least one vowel sound.
  - 7. The computerized method, according to claim 5, wherein said classifying said reference word segments into classes is based on relative time duration of said reference word segments.
  - 8. The computerized method, according to claim 5, wherein said classifying said reference word segments into classes is based on relative energy levels of said reference word segments.
  - 9. The computerized method, according to claim 5, wherein said classifying said reference word segments into classes is based on energy spectral density ratio, wherein said energy spectral density is divided into at least two frequency ranges, and said energy spectral density ratio is between the respective energies in said at least two frequency ranges.
  - 10. The computerized method, according to claim 5, wherein said classifying said reference word segments into classes is based on normalized peak energy of said reference word segments.
  - 11. The computerized method, according to claim 5, wherein said classifying said reference word segments into classes is based on relative phonetic distance between said reference word segments.
  - 21. A computer readable medium encoded with processing instructions for causing a processor to execute the method of claim 1.
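
The calibration recited in claims 1 to 3 can be sketched as follows. This is one possible reading, not the applicants' implementation: the "array of frequencies" is taken as a comb at the fundamental and its harmonics, the "product" as the sum of the target energy spectral density sampled at the comb positions, and the single adjustable parameter as a frequency scale factor found by grid search. Linear interpolation between discrete bins stands in for the interpolation of claim 2. All function names, the search range, and the synthetic Gaussian spectra are assumptions made for illustration.

```python
import math


def interp(spectrum, bin_pos):
    """Linearly interpolate a spectrum at a fractional bin position."""
    if bin_pos <= 0:
        return spectrum[0]
    if bin_pos >= len(spectrum) - 1:
        return spectrum[-1]
    lo = int(bin_pos)
    frac = bin_pos - lo
    return (1 - frac) * spectrum[lo] + frac * spectrum[lo + 1]


def comb_score(target, f0_bin, scale, n_harmonics):
    """Sum the target spectrum at the scaled fundamental and its harmonics."""
    return sum(interp(target, scale * f0_bin * h)
               for h in range(1, n_harmonics + 1))


def calibrate_scale(target, f0_bin, n_harmonics=3, lo=0.5, hi=2.0, steps=151):
    """Grid-search the single scale parameter that maximizes the comb score."""
    candidates = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(candidates,
               key=lambda s: comb_score(target, f0_bin, s, n_harmonics))


def warp(spectrum, scale):
    """Resample a spectrum so that energy at bin k moves to bin scale * k."""
    return [interp(spectrum, k / scale) for k in range(len(spectrum))]


def correlation(a, b):
    """Pearson correlation between two equal-length spectra."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def harmonic_spectrum(f0_bin, n_bins=64, n_harmonics=3):
    """Synthetic test spectrum: Gaussian bumps at f0 and its harmonics."""
    return [sum(math.exp(-0.5 * (k - h * f0_bin) ** 2)
                for h in range(1, n_harmonics + 1))
            for k in range(n_bins)]


# Usage: a speaker with fundamental at bin 8 matched against a reference at bin 10.
target = harmonic_spectrum(10)
spoken = harmonic_spectrum(8)
scale = calibrate_scale(target, f0_bin=8)
before = correlation(spoken, target)
after = correlation(warp(spoken, scale), target)
```

A single scale factor keeps the harmonics at integer multiples of the fundamental, which is the relationship claim 3 requires the adjustment to maintain. In the usage lines, `after` is expected to exceed `before`, because warping aligns the harmonic peaks of the input with those of the reference.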

12. A computerized method for speech recognition in a computer system, the method comprising the steps of:
- (a) storing a plurality of reference word segments, wherein said reference word segments when concatenated form a plurality of spoken words in a language;
  
  wherein each of said reference word segments is a combination of at least two phonemes including at least one vowel sound in said language;
  
  (b) classifying said reference word segments into a plurality of classes;
  
  (c) inputting and digitizing a temporal speech signal, thereby producing a digitized temporal speech signal;
  
  (d) transforming piecewise said digitized temporal speech signal into the frequency domain, thereby producing a time and frequency dependent transform function;
  
  wherein the energy spectral density of said temporal speech signal is proportional to the absolute value squared of said transform function;
  
  (e) cutting the energy spectral density into a plurality of input time segments of the energy spectral density;
  
  wherein each of said input time segments includes at least two phonemes including at least one vowel sound of the temporal speech signal;
  
  (f) for each of said input time segments:
  
  (i) selecting a target segment from the reference word segments thereby inputting a target energy spectral density of said target segment;
  
  (ii) performing a correlation between the energy spectral density during said time segment and said target energy spectral density of said target segment;
  
  (g) based on a correlation result of said correlation, second selecting a second target segment from at least one of said classes.
- Dependent claims (13, 14, 15, 16, 17, 18, 19, 20, 22):
  - 13. The computerized method, according to claim 12, wherein said cutting is based on at least two signals selected from the group consisting of:
    - (i) autocorrelation in time domain of temporal speech signal;
      
      (ii) average energy as calculated by integrating energy spectral density over frequency;
      
      (iii) normalized peak energy calculated by the peak energy as a function of frequency divided by the mean energy averaged over a range of frequencies.
  - 14. The computerized method, according to claim 12, (h) for each of said input time segments:
    - (i) extracting a fundamental frequency from the energy spectral density during the input time segment;
      
      (ii) performing said correlation between the energy spectral density during said time segment and said target energy spectral density of said target segment after calibrating said fundamental frequency to said target energy spectral density thereby improving said correlation.
  - 15. The computerized method, according to claim 12, wherein said classifying said reference word segments into classes is based on said at least one vowel sound.
  - 16. The computerized method, according to claim 12, wherein said classifying said reference word segments into classes is based on relative time duration of said reference word segments.
  - 17. The computerized method, according to claim 12, wherein said classifying said reference word segments into classes is based on relative energy levels of said reference word segments.
  - 18. The computerized method, according to claim 12, wherein said classifying said reference word segments into classes is based on energy spectral density ratio, wherein said energy spectral density is divided into at least two frequency ranges, and said energy spectral density ratio is between the respective energies in said at least two frequency ranges.
  - 19. The computerized method, according to claim 12, wherein said classifying said reference word segments into classes is based on normalized peak energy of said reference word segments.
  - 20. The computerized method, according to claim 12, wherein said classifying said reference word segments into classes is based on relative phonetic distance between said reference word segments.
  - 22. A computer readable medium encoded with processing instructions for causing a processor to execute the method of claim 12.
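
The classification and second selection of claims 12 and 18 can be sketched as follows. The claims do not say how the correlation result steers the second selection, so the policy here (stay in the same class after a good match, switch class after a poor one) is purely a hypothetical illustration, as are the two-class split, the thresholds, the function names, and the toy reference data.

```python
def band_energy_ratio(esd, split_bin):
    """Ratio of energy below split_bin to energy at or above it.

    esd is a list of time frames, each a list of per-bin energies.
    """
    low = sum(sum(frame[:split_bin]) for frame in esd)
    high = sum(sum(frame[split_bin:]) for frame in esd)
    return low / high if high else float("inf")


def classify(references, split_bin, threshold=1.0):
    """Group reference segment names into 'low' and 'high' classes by ratio."""
    classes = {"low": [], "high": []}
    for name, esd in references.items():
        ratio = band_energy_ratio(esd, split_bin)
        classes["low" if ratio >= threshold else "high"].append(name)
    return classes


def second_selection(classes, first_target, corr, accept=0.5):
    """Hypothetical policy for picking the second target segment.

    If the first correlation is at least `accept`, try another member of the
    same class; otherwise try a member of a different class.
    """
    current = next(label for label, names in classes.items()
                   if first_target in names)
    if corr >= accept:
        pool = [n for n in classes[current] if n != first_target]
    else:
        pool = [n for label, names in classes.items()
                if label != current for n in names]
    return pool[0] if pool else None


# Usage with toy single-frame reference spectra (four frequency bins each).
references = {
    "ba": [[4, 3, 1, 0]],
    "bo": [[5, 2, 1, 1]],
    "si": [[0, 1, 3, 4]],
}
classes = classify(references, split_bin=2)
next_if_good = second_selection(classes, "ba", corr=0.8)
next_if_poor = second_selection(classes, "ba", corr=0.1)
```

Grouping references into classes lets the search skip reference segments that are unlikely to match, instead of correlating each input time segment against every stored segment.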

Current Assignee
LNTS Linguistech Solutions Limited
Original Assignee
LNTS Linguistech Solutions Limited
Inventors
Cohen-Tov, Rabin; Meller, Izhak; Bognim, Shlomi; Entlis, Avraham; Budovnich, Roman; Simone, Adam

Application Number

US12/035,715
Publication Number

US 20090216535A1
US Class Current

704/254
CPC Class Codes

G10L 15/02 Feature extraction for spee...

G10L 25/90 Pitch determination of spee...
