System for recognizing speech

US 5,377,302 A
Filed: 09/01/1992
Issued: 12/27/1994
Est. Priority Date: 09/01/1992
Status: Expired due to Term

Abstract

A pattern recognition system particularly useful for recognizing speech or handwriting. An input signal is first filtered by a filter bank having two stages, where the output of the first stage is fed forward to the second stage of a significant number of the filters and the output of the second stage is fed back to the first stage of a significant number of the filters. Such feedback enhances the signal-to-noise ratio and resembles the coupling between the different sections of the basilar membrane of the cochlea. The output of the filter bank is a two-dimensional frequency-time representation of the original signal. A second set of filters, which takes two-dimensional signals as input, detects the presence of elementary tonotopic features such as the onset, rise, fall and frequency of any significant tones in a speech signal. A third set of filters detects any contrasts in the elementary features at various levels of resolution. After such filtering, a neural network is employed to learn patterns formed from the multi-resolution contrasts in the identified features, so that the system recognizes symbols from an input signal that is continuous in time. In the case of speech, the system recognizes continuous speech in a speaker-independent manner and is also tolerant of noise.

75 Claims

1. A system for recognizing speech features in an input speech signal, said input speech signal changing over time and containing tonotopic information, said system comprising:
- first means for filtering the input speech signal to provide an output having amplitudes that are functions of both tonotopy and time in a first two dimensional representation, said output indicating the tonotopic information of said input speech signal over a time period; and
  
  second means for filtering said output to provide an output that, over time, indicates a second two dimensional representation in tonotopy and time of one or more elementary tonotopic features of the input speech signal, said features including onset, rise and fall of any significant tones of the input speech signal over time.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
- - 2. The system of claim 1, wherein said output of the second filtering means indicates at least two of the three features, namely, the onset, rise and fall of any significant tones of the input speech signal over time.
  - 3. (Amended) The system of claim 2, wherein said output of the second filtering means also indicates frequency of any significant tones of the input speech signal.
  - 4. The system of claim 1, further comprising a neural network responsive to said second two dimensional representation of said input speech signal for identifying phonemes, groups of phonemes or suprasegmentals in said input signal.
  - 5. The system of claim 4, wherein said network includes at least one pair of phoneme- or suprasegmental-related formation and deformation layers to provide formation and deformation maps, said deformation layer performing a local-averaging function on said formation map to provide the deformation map.
  - 6. The system of claim 4, wherein said first filtering means is a filter bank having coefficients that are functions of selected dimensions of a representative biological cochlea.
  - 7. The system of claim 6, said representative biological cochlea having a basilar membrane with a breadth, wherein said coefficients are functions of the breadth of the basilar membrane.
  - 8. The system of claim 1, wherein said first filtering means includes a bank of M filters having different frequency pass bands, M being a positive integer greater than 1, each filter providing an output, said filters being indexed from 1 to M by pass band frequencies of the filters along a frequency axis and the outputs of the filters at N different times being indexed from 1 to N along a time axis, N being a positive integer greater than 1, forming an M by N array of filter outputs, said second filtering means including:
    - means for superposing at least one p by q array of filter coefficients upon a portion of said M by N array of filtered outputs, p and q being positive integers, so that each coefficient corresponds to a filter output in said M by N array; and
      
      means for multiplying each coefficient by the corresponding filter output to obtain products and summing said products to obtain a first processed value for said superposition upon said one portion.
  - 9. The system of claim 8, wherein said superposing means includes p(q-1) delay elements for delaying p outputs of the M filter outputs and pq gain or amplifying elements for amplifying the outputs.
  - 10. The system of claim 8, wherein said superposition means superposes said p by q array upon each of mn different portions of the M by N array, m and n being positive integers, where the portions form an m by n array along the frequency and time axes, and wherein, for each superposition upon a portion, said multiplying means multiplies each coefficient by the corresponding filter output to obtain products and sums said products to obtain a first processed value, thereby obtaining from the mn superpositions and multiplications an m by n array of first processed values along said two axes.
  - 11. The system of claim 10, wherein said portions are selected such that spacing between adjacent portions is the same for the m by n array of portions.
  - 12. The system of claim 10, wherein said superposing means superposes each of 2 p by q arrays of coefficients upon one or more portions of said M by N array successively so that each coefficient corresponds to a filter output in said M by N array, and wherein, for each p by q array and for each superposition, said multiplying means multiplies each coefficient of such p by q array by the corresponding filter output to obtain products and sums said products for such p by q array to obtain one processed value for such superposition, and to obtain one or more such processed values for each p by q array.
  - 13. The system of claim 12, wherein the coefficients of each p by q array are such that the processed values obtained for such array indicate at least two of the following four elementary tonotopic features:
    - the onset, frequency, rise and fall of any significant tones in the input signal.
  - 14. The system of claim 13, wherein said superposing means superposes each of 4 pairs of p by q arrays of coefficients upon one or more portions of said M by N array successively so that each coefficient corresponds to a filter output in said M by N array, and wherein the coefficients of each pair of p by q arrays are such that the processed values obtained for such pairs of arrays indicate respectively one of the following four elementary tonotopic features:
    - the onset, frequency, rise and fall of any significant tones in the input signal.
  - 15. The system of claim 8, wherein the coefficients of said p by q array are such that the processed value obtained indicates one of the following elementary tonotopic features:
    - the onset, frequency, rise or fall of any significant tones in the input signal.
  - 16. The system of claim 8, wherein said superposition means superposes said p by q array upon different portions of the M by N array so that each of the M by N filter outputs is superposed upon at least one coefficient, and wherein said multiplying means multiplies each of the M by N filter outputs by at least one coefficient.
  - 17. The system of claim 8, wherein said coefficients are symmetric or anti-symmetric functions of frequency and time.
  - 18. The system of claim 8, wherein said coefficients are symmetric or anti-symmetric functions of frequency and time modulated by a Gaussian function.
  - 19. The system of claim 18, wherein said coefficients are Gabor functions of frequency and time.
  - 20. The system of claim 1, said second filtering means including at least two filters at different resolutions for filtering the output of the first filtering means to provide speech context information for the recognition of phonemes.
  - 21. The system of claim 20, further comprising a neural network responsive to said two dimensional representation of said input speech signal for identifying phonemes, groups of phonemes or suprasegmentals in said input signal.
  - 22. The system of claim 21, wherein said network includes at least one pair of phoneme- or suprasegmental-related formation and deformation layers to provide formation and deformation maps, said deformation layer performing a local-averaging function on said formation map to provide the deformation map.
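
Claims 8, 10 and 17 to 19 describe superposing a p by q array of coefficients on portions of the M by N array of filter outputs, then multiplying and summing. The following is a minimal Python sketch of that operation, assuming NumPy. The function names `gabor_kernel` and `superpose`, the `step` spacing parameter, and every kernel parameter are illustrative assumptions and do not come from the patent.

```python
import numpy as np

def gabor_kernel(p, q, freq_cycles=1.0, theta=0.0, sigma=None, phase=0.0):
    """Return a p-by-q array of Gabor coefficients over (frequency, time).

    phase=0 gives a symmetric (cosine) kernel; phase=pi/2 gives an
    anti-symmetric (sine) kernel. All parameter names are hypothetical.
    """
    f = np.arange(p) - (p - 1) / 2.0          # centred frequency-axis offsets
    t = np.arange(q) - (q - 1) / 2.0          # centred time-axis offsets
    F, T = np.meshgrid(f, t, indexing="ij")
    sigma = sigma if sigma is not None else max(p, q) / 4.0
    envelope = np.exp(-(F**2 + T**2) / (2.0 * sigma**2))   # Gaussian modulation
    along = T * np.cos(theta) + F * np.sin(theta)          # carrier direction
    carrier = np.cos(2.0 * np.pi * freq_cycles * along / max(p, q) - phase)
    return envelope * carrier

def superpose(outputs, kernel, step=1):
    """Superpose `kernel` on evenly spaced portions of the M-by-N `outputs`,
    multiplying coefficient by filter output and summing for each portion."""
    M, N = outputs.shape
    p, q = kernel.shape
    rows = range(0, M - p + 1, step)
    cols = range(0, N - q + 1, step)
    result = np.empty((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            result[a, b] = np.sum(kernel * outputs[i:i + p, j:j + q])
    return result

outputs = np.ones((6, 8))                       # stand-in M-by-N array (M=6, N=8)
even = gabor_kernel(3, 3)                       # symmetric coefficients
odd = gabor_kernel(3, 3, phase=np.pi / 2)       # anti-symmetric coefficients
values = superpose(outputs, even)               # m-by-n array of processed values
```

With `step=1` the portions are equally spaced, as in claim 11; a larger `step` yields a smaller m by n array.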

23. A filter bank for improving signal to noise ratio in processing an input signal comprising:
- a plurality of M filters arranged in parallel, M being a positive integer, each filter having a first stage and a second stage;
  
  wherein the first stage of each filter includes;
  
  (a) first delay means for delaying the input signal, and
  
  (b) means for subtracting from the input signal or a signal derived therefrom the delayed input signal or a signal derived therefrom and adding thereto feedback signals from at least some of the second stages of the M filters or signals derived therefrom to derive an output signal;
  
  wherein the second stage of each filter provides an output signal and includes;
  
  (c) first means for adding the output signals of the first stages of at least some of the filters or signals derived therefrom to obtain a first sum signal;
  
  (d) second means for delaying said first sum signal and supplying said delayed first sum signal or a signal derived therefrom to the first stages of at least some of the filters;
  
  (e) second means for adding the sum signal and the delayed sum signal or signals derived therefrom to obtain a second sum signal;
  
  (f) third means for delaying the output signal of the second stage and supplying said delayed output signal of the second stage or a signal derived therefrom to the first stages of at least some of the filters; and
  
  (g) means for adding to the second sum signal or a signal derived therefrom the delayed output signal of the second stage or a signal derived therefrom to derive the output signal of the second stage.
- View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31)
- - 24. The filter bank of claim 23, wherein M is greater than or equal to 10, and wherein the subtracting and adding means (b) adds feedback signals from at least half of the second stages of the M filters or signals derived therefrom in deriving the output signal, and wherein the first adding means adds the output signals of the first stages of at least half of the filters or signals derived therefrom to obtain said first sum signal.
  - 25. The filter bank of claim 23, wherein said first, second and third delaying means introduce substantially the same delay.
  - 26. The filter bank of claim 23, further comprising means for sampling the input signal, the first and second sum signals, and the output signals of the two stages, at a substantially constant sampling rate and at sampling interval T to obtain samples of such signals, wherein at (n-1)th and at nth time instants when such signals are sampled, n being a positive integer, an (n-1)th sample and an nth sample of each of such signals are obtained.
  - 27. The filter bank of claim 26, i and j being integers in the range 1 to M, wherein the nth sample zⁱ (n) of said output signal of said first stage of the ith filter is given by:
    - $$z^{i}(n) = a_{1}^{i}\, z_{1}^{i}(n-1) + c^{i} \sum_{j} d^{j}\, \Theta^{i}_{j}\, z_{1}^{j}(n-1) + a_{2}^{i}\, z_{2}^{i}(n-1) + b^{i}\,\bigl(u(n) - u(n-1)\bigr)$$
      
      where the (n-1)th sample z₁ⁱ(n-1) and the nth sample z₁ⁱ(n) of the first sum signal for the ith filter are respectively given by:
      
      $$z_{1}^{i}(n-1) = \sum_{j} e^{i}_{j}\, z^{j}(n-1);$$
      
      $$z_{1}^{i}(n) = \sum_{j} e^{i}_{j}\, z^{j}(n);$$
      
      where the nth sample z₂ⁱ(n) of the output signal of the second stage of the ith filter is given by:
      
      $$z_{2}^{i}(n) = z_{2}^{i}(n-1) + \frac{T}{2}\,\bigl(z_{1}^{i}(n) + z_{1}^{i}(n-1)\bigr);$$
      
      where z₂ⁱ(n-1) is the (n-1)th sample of the output signal of the second stage of the ith filter;
      
      where the (n-1)th sample and nth sample of the input signal are respectively u(n-1) and u(n);
      
      where ##EQU7##
      
      $$\sum_{k} \bigl[\delta^{i}_{k} + c^{i}\, \Theta^{i}_{k}\, d^{k}\bigr]\, e^{k}_{j} = \delta^{i}_{j};$$
      
      and where a₁ⁱ, a₂ⁱ, bⁱ, cⁱ, dⁱ are constants.
  - 28. The filter bank of claim 27, said bank suitable for use in recognizing speech features in speech input signals, wherein when aⁱ₁ is in the range between 0 and about 1, the coefficients a₂ⁱ, bⁱ, cⁱ, dⁱ are respectively in the ranges between about -1,000,000,000 and 0, about -100,000 and 0, 0 and about 10, and 0 and about 10,000, for all values of i between 1 and M.
  - 29. The filter bank of claim 27, said bank suitable for use in recognizing speech features in speech input signals, wherein a₁ⁱ, a₂ⁱ, bⁱ, cⁱ, dⁱ are related to characteristics of a representative human cochlea according to the following relations, where the height and width of the scalae vestibuli and tympani of the cochlea are H, where the density of a fluid in the scalae is ρ, where the basilar membrane of length L of the cochlea comprises M sections, and where the ith section has mass mⁱ, breadth Bⁱ, damping coefficient qⁱ, stiffness coefficient Kⁱ:
    - $$\mu^{i} = m^{i} + \tfrac{1}{3}\,\rho H B^{i} + 2\rho\,(D/H)^{2}\,(B^{i})^{2} \cdot i + q^{i}\,(T/2) + K^{i}\,(T/2)^{2};$$
      
      $$a_{1}^{i} = \frac{m^{i} + \tfrac{1}{3}\,\rho H B^{i} + 2\rho\,(D/H)^{2}\,(B^{i})^{2} \cdot i - q^{i}\,(T/2) - K^{i}\,(T/2)^{2}}{\mu^{i}};$$
      
      $$a_{2}^{i} = -\frac{2 K^{i}\,(T/2)}{\mu^{i}};$$
      
      $$b^{i} = -\frac{2\rho\, D B^{i} \cdot i}{\mu^{i}};$$
      
      $$c^{i} = \frac{2\rho\, D^{2}\,(B^{i}/H)}{\mu^{i}};$$
      
      $$d^{i} = B^{i}/H;$$
      
      and
      
      $$D = L/M.$$
  - 30. The filter bank of claim 29, wherein the quantity Bⁱ varies linearly with the distance of the ith section from one end of the basilar membrane.
  - 31. The filter bank of claim 23, wherein said M filters have substantially non-overlapping frequency pass bands.
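
The difference equations of claim 27 can be iterated directly. The following is a minimal Python sketch, assuming NumPy and zero initial conditions. The function name `run_filter_bank` is hypothetical, and the coupling matrix `theta` is taken as a caller-supplied input because its definition (##EQU7##) is missing from the source. The matrix `e` is obtained by inverting the constraint Σₖ[δⁱₖ + cⁱΘⁱₖdᵏ]eᵏⱼ = δⁱⱼ. The coefficient values in the usage lines are arbitrary placeholders, not values from the patent.

```python
import numpy as np

def run_filter_bank(u, a1, a2, b, c, d, theta, T):
    """Iterate the claim-27 recurrences for M coupled two-stage filters.

    u           : sequence of input samples u(n)
    a1,a2,b,c,d : length-M coefficient vectors
    theta       : M-by-M coupling matrix (assumed input; not defined in the source)
    T           : sampling interval
    Returns an M-by-len(u) array of second-stage outputs z2.
    """
    a1, a2, b, c, d = (np.asarray(x, dtype=float) for x in (a1, a2, b, c, d))
    theta = np.asarray(theta, dtype=float)
    M = a1.size
    # e solves sum_k [delta_ik + c_i * theta_ik * d_k] e_kj = delta_ij.
    e = np.linalg.inv(np.eye(M) + c[:, None] * theta * d[None, :])
    z1_prev = np.zeros(M)      # first sum signal at n-1 (assumed zero at start)
    z2_prev = np.zeros(M)      # second-stage output at n-1 (assumed zero at start)
    u_prev = 0.0
    out = np.empty((M, len(u)))
    for n, u_n in enumerate(u):
        # First-stage output z(n).
        z = (a1 * z1_prev + c * (theta @ (d * z1_prev))
             + a2 * z2_prev + b * (u_n - u_prev))
        z1 = e @ z                                      # first sum signal z1(n)
        z2 = z2_prev + (T / 2.0) * (z1 + z1_prev)       # trapezoidal accumulation
        out[:, n] = z2
        z1_prev, z2_prev, u_prev = z1, z2, u_n
    return out

M = 3
z2 = run_filter_bank(
    u=np.sin(0.3 * np.arange(50)),
    a1=np.full(M, 0.9), a2=np.full(M, -0.1), b=np.full(M, -1.0),
    c=np.full(M, 0.1), d=np.full(M, 0.5),
    theta=0.1 * np.ones((M, M)), T=1.0,
)
```

Each row of `z2` is one filter's output over time, so the result is the kind of two dimensional tonotopy by time array that the later claims operate on.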

32. A system for recognizing speech features in an input speech signal that has time and frequency dependent amplitudes, said system comprising means for filtering said input speech signal or a signal derived therefrom in a two dimensional representation in tonotopy and time to provide an output indicating contrast information in the representation, said contrast information in turn indicating the presence of any significant speech features in the input speech signal.
- View Dependent Claims (33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54)
- - 33. The system of claim 32, said filtering means including at least two contrast filters at different resolutions.
  - 34. The system of claim 33, wherein the input speech signal contains significant local variations in the two dimensional representation of the input speech signal, said two dimensional representation having a tonotopic direction and a time direction, said filtering means further comprising an orientation filter providing an output indicating orientations of said local variations with respect to the two directions.
  - 35. The system of claim 34, wherein said orientation filter has a resolution different from those of the contrast filters.
  - 36. The system of claim 34, wherein said local variations define boundaries with said orientations, said contrast filters providing outputs indicating the presence of significant speech features in the input speech signal by indicating said boundaries.
  - 37. The system of claim 34, said two dimensional representation representing tonotopic characteristics of said input signal in tonotopy and time, said system further comprising:
    - a neural network including;
      
      at least one phoneme- or suprasegmental-related formation layer for processing the contrast information and the output of the orientation filter to provide a plurality of phoneme- or suprasegmental-related feature formation maps.
  - 38. The system of claim 32, wherein the input speech signal contains local variations in the two dimensional representation of the input speech signal, said variations having orientations and defining boundaries in said representation, said two dimensional representation having a tonotopic direction and a time direction, said filtering means comprising:
    - one or more orientation filters each providing an output indicating orientations of said local variations with respect to the two directions; and
      
      one or more contrast filters, each filtering the output of one of the orientation filters or a signal derived therefrom to provide an output indicating the locations of the boundaries.
  - 39. The system of claim 38, said system comprising two or more orientation filters providing outputs at different resolutions and two or more contrast filters, wherein each contrast filter filters the output of one of the orientation filters to provide an output indicating the locations of the boundaries.
  - 40. The system of claim 38, said system further comprising one or more smoothing filters connected in series in at least one sequence, each sequence of smoothing filters filtering the output of one of said one or more orientation filters or a signal derived therefrom, defining a corresponding orientation filter, to each provide an output, so that the output of each smoothing filter in a sequence contains average value information indicating orientations of said local variations at a coarser resolution than those indicated by the output of the corresponding orientation filter and by the output of a different smoothing filter upstream in the same sequence, and wherein each contrast filter filters either the output of an orientation filter or the output of a smoothing filter, or a signal derived therefrom to provide an output, said contrast filters each providing an output that is coarser in resolution than the output of the orientation filter or smoothing filter filtered by said each contrast filter, said output of each contrast filter containing contrast value information indicating the locations of the boundaries at different resolutions.
  - 41. The system of claim 40, wherein the output of each smoothing filter in the sequence indicates orientations of said local variations at a resolution coarser than those indicated by the output of the immediately preceding orientation filter or smoothing filter by a factor of 2, and said contrast filters each provides an output that is coarser in resolution than the output of the orientation filter or smoothing filter filtered by said each contrast filter by a factor of 2.
  - 42. The system of claim 40, wherein said sequence includes at least three smoothing filters, said system including at least three contrast filters, one of said contrast filters filtering the output of the orientation filters and each of the remaining contrast filters filtering the output of a corresponding smoothing filter, except for the last smoothing filter in the sequence.
  - 43. The system of claim 40, said system comprising at least a first and a second orientation filter for detecting local variations with a first and a second orientation respectively, wherein the output of the contrast filter filtering the output of the first orientation filter indicates the presence of a boundary with said first orientation, defining a first type of boundary, different from a second type of boundary, the presence of which is indicated by the output of the contrast filter filtering the output of the second orientation filter.
  - 44. The system of claim 43, wherein the orientation of each of the two types of orientation filters is vertical, horizontal, or inclined with respect to the vertical and horizontal directions of the two dimensional representation.
  - 45. The system of claim 43, wherein said smoothing filters are connected in series in at least two sequences, each sequence filtering the output of its corresponding orientation filter to provide outputs indicating orientation of the first or second type of boundaries at coarser resolutions, said contrast filters filtering the outputs of said orientation or smoothing filters in each sequence providing outputs indicating the locations of said first or second type of boundaries.
  - 46. The system of claim 40, said two variables being frequency and time, said two dimensional representation representing tonotopic characteristics of said input signal in tonotopy and time, said system further comprising:
    - a neural network including;
      
      at least one phoneme- or suprasegmental-related formation layer for processing the contrast value information provided by at least some of the contrast filters and the average value information provided by at least one smoothing filter in the sequence to provide a plurality of phoneme- or suprasegmental-related feature formation maps.
  - 47. The system of claim 46, wherein said neural network processes the average value information provided by the last smoothing filter in the sequence to provide said plurality of phoneme- or suprasegmental-related feature formation maps.
  - 48. The system of claim 47, wherein said layer provides said phoneme- and suprasegmental-related feature formation maps by processing contrast value information derived by the contrast filters of all the resolutions except the contrast filter of the highest resolution and the average value information derived by the last smoothing filter of the sequence, and provides consonant-related feature formation maps by processing contrast value information provided by all of the contrast filters and the average value information provided by the last smoothing filter in the sequence.
  - 49. The system of claim 48, said neural network further comprising at least a first phoneme- or suprasegmental-related deformation layer which performs a local-averaging operation on each phoneme- or suprasegmental-related feature formation map to provide a phoneme- or suprasegmental-related deformation map, and on each consonant-related formation feature map to provide a consonant-related deformation map, said formation and deformation layers defining a first pair of formation and deformation layers.
  - 50. The system of claim 49, said neural network further comprising additional pairs of formation and deformation layers, said pairs connected in a sequence of pairs with the first pair as the first in the sequence, wherein at least one additional pair including:
    - an additional phoneme- or suprasegmental-related formation layer for processing said phoneme- or suprasegmental-related deformation map provided by the deformation layer of a preceding pair to provide an additional set of phoneme- or suprasegmental-related feature formation maps and for processing said consonant-related and phoneme- or suprasegmental-related deformation maps provided by the deformation layer of a preceding pair to provide an additional group of consonant-related feature formation maps.
  - 51. The system of claim 50, said at least one additional pair further comprising:
    - an additional phoneme- or suprasegmental-related deformation layer for processing said additional set of phoneme-related or suprasegmental-related feature formation maps to provide an additional phoneme- or suprasegmental-related deformation map.
  - 52. The system of claim 51, wherein the last pair of formation and deformation layers in the sequence of pairs of formation and deformation layers provide a vowel-related deformation map, a suprasegmental-related deformation map, and a consonant-related deformation map that together provide information concerning a phoneme, a group of phonemes, or a suprasegmental.
  - 53. The system of claim 32, said filtering means providing an output indicating the presence of one of three types of boundaries:
    - vertical, horizontal or inclined boundaries.
  - 54. The system of claim 32, wherein the filtering means output indicates at least one of the following:
    - onset, frequency, rise or fall of significant tones in the input signal.
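
Claims 40 to 42 describe a chain of smoothing filters, each halving the resolution, with a contrast filter applied to the orientation filter output and to every smoothing filter except the last. The following is a minimal Python sketch of that arrangement, assuming NumPy. The choice of 2 by 2 block averaging for smoothing and of the max-minus-min spread for contrast is an assumption made for illustration; the claims do not specify these kernels, and the function names are hypothetical.

```python
import numpy as np

def block_view(x):
    """View an even-sized 2-D array as non-overlapping 2x2 blocks."""
    r, c = x.shape
    return x.reshape(r // 2, 2, c // 2, 2).swapaxes(1, 2)   # (r/2, c/2, 2, 2)

def smooth(x):
    """Assumed smoothing filter: mean of each 2x2 block (coarser by 2)."""
    return block_view(x).mean(axis=(2, 3))

def contrast(x):
    """Assumed contrast filter: max minus min in each 2x2 block (coarser by 2)."""
    blocks = block_view(x)
    return blocks.max(axis=(2, 3)) - blocks.min(axis=(2, 3))

def pyramid(orientation_map, levels=3):
    """Return the contrast maps at each resolution and the last smoothed map."""
    contrasts, current = [], orientation_map
    for _ in range(levels):
        contrasts.append(contrast(current))   # boundaries at this resolution
        current = smooth(current)             # average values for the next level
    return contrasts, current

# Stand-in orientation-filter output with a boundary between time columns 6 and 7.
orientation_map = np.zeros((16, 16))
orientation_map[:, 7:] = 1.0
contrasts, coarsest = pyramid(orientation_map)
```

Here `contrasts` holds three maps of sizes 8 by 8, 4 by 4 and 2 by 2, and `coarsest` plays the role of the average value information from the last smoothing filter.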

55. A system for recognizing speech features in an input speech signal, said input speech signal containing tonotopic information, said system comprising:
- means for filtering the input speech signal to provide a filtered output, said output indicating the tonotopic information of said input speech signal over a time period and identifying any significant speech features therein; and
  
  a neural network comprising;
  
  at least one pair of phoneme- or suprasegmental-related formation and deformation layers for processing the output of the filtering means to provide formation and deformation maps, said formation layer processing the output of the filtering means or of the deformation maps to identify phoneme- or suprasegmental-related features in said input speech signal, said deformation layer performing a local-averaging function on said formation map to provide the deformation map to enable the recognition of phonemes or suprasegmentals in said input speech signal irrespective of variability of speech of different speakers.
- View Dependent Claims (56, 57, 58, 59, 60, 61, 62)
- - 56. The system of claim 55, said phoneme- or suprasegmental-related formation layer processing the output of the filtering means to provide a plurality of phoneme- or suprasegmental-related feature formation maps;
    - and said phoneme- or suprasegmental-related deformation layer performing a local averaging operation on said plurality of phoneme- or suprasegmental-related feature formation maps to provide invariant phoneme- or suprasegmental-related feature maps.
  - 57. The system of claim 56, wherein said filtering means includes at least one bank of R filters, R being a positive integer greater than 1, each filter providing an output, said filters being indexed from 1 to R along a tone axis and the outputs of the filters at S different times being indexed from 1 to S along a time axis, forming an R by S array of filter outputs;
    - wherein said at least one formation layer includes a plurality of tracts, each tract including;
      
      means for superposing at least one t by u array of weights upon a portion of said R by S array, t and u being positive integers, so that each weight corresponds to a filter output in said R by S array; and
      
      means for multiplying each weight by the corresponding filter output to obtain products and summing said products to obtain a first processed value for said superposition upon said one portion.
  - 58. The system of claim 57, wherein said superposition means superposes said t by u array upon each of rs different portions of the R by S array, r and s being positive integers smaller than R and S respectively, where the portions form an r by s array along the frequency and time axes, and wherein, for each superposition upon a portion, said multiplying means multiplies each weight by the corresponding filter output to obtain products and sums said products to obtain a first processed value, thereby obtaining from the rs superpositions an r by s array of first processed values along said two axes.
  - 59. The system of claim 58, wherein said at least one deformation layer includes a plurality of tracts each comprising:
    - second means for superposing at least one v by w second array of coefficients upon a portion of said r by s array of first processed values, v and w being positive integers smaller than r and s respectively, so that each weight in the second array corresponds to a processed value in said r by s array of first processed values; and
      
      second means for multiplying each coefficient in the second array by the corresponding first processed value to obtain products and summing said products to obtain a second processed value for said superposition upon said one portion, said coefficients of the second array being such that said second processed value is an average value of the corresponding first processed values.
  - 60. The system of claim 59, wherein said v by w second array of coefficients is a matrix with most of its coefficients substantially equal to each other.
  - 61. The system of claim 57, wherein said superposing means includes t(u-1) delay elements for delaying t outputs of the R filter outputs and tu amplifying elements for amplifying the tu outputs of the R filter outputs.
  - 62. The system of claim 55, said output of the filtering means indicating one or more elementary tonotopic features of the input speech signal in a two dimensional representation of the signal in tonotopy and time, said features including onset, rise and fall of any significant tones in the input speech signal over time.
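
The formation and deformation layers of claims 57 to 60 can be pictured as a weighted-sum stage followed by a local-averaging stage. The following is a minimal Python sketch, assuming NumPy. The function names, the all-ones weight template, and the use of non-overlapping averaging windows are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def formation(outputs, weights):
    """Formation tract: weighted sum over every t-by-u window of the R-by-S array."""
    R, S = outputs.shape
    t, u = weights.shape
    fmap = np.empty((R - t + 1, S - u + 1))
    for i in range(fmap.shape[0]):
        for j in range(fmap.shape[1]):
            fmap[i, j] = np.sum(weights * outputs[i:i + t, j:j + u])
    return fmap

def deformation(fmap, v, w):
    """Deformation tract: average over non-overlapping v-by-w windows.

    Equivalent to superposing a v-by-w array whose coefficients all
    equal 1/(v*w), as in claim 60.
    """
    r, s = fmap.shape
    r, s = r - r % v, s - s % w                 # drop any ragged edge
    blocks = fmap[:r, :s].reshape(r // v, v, s // w, w)
    return blocks.mean(axis=(1, 3))

weights = np.ones((2, 2))                       # hypothetical feature template

early = np.zeros((5, 9))
early[1:3, 1:3] = 1.0                           # feature at time columns 1-2
late = np.zeros((5, 9))
late[1:3, 2:4] = 1.0                            # same feature one time step later

f_early, f_late = formation(early, weights), formation(late, weights)
d_early, d_late = deformation(f_early, 2, 4), deformation(f_late, 2, 4)
```

In this toy case the two formation maps differ, but the deformation maps coincide because both features fall inside the same averaging window. That is one way to read how local averaging could give tolerance to small variations between speakers.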

63. A method for recognizing speech features in an input speech signal, said input speech signal changing over time and containing tonotopic information, said method comprising:
- (a) filtering the input speech signal to provide an output having amplitudes that are functions of both tonotopy and time in a first two dimensional representation, said output indicating the tonotopic information of said input speech signal over a time period; and
  
  (b) filtering said output to provide an output that, over time, indicates a second two dimensional representation in tonotopy and time of one or more elementary tonotopic features of the input speech signal, said features including onset, rise and fall of any significant tones of the input signal over time.
- View Dependent Claims (64, 65, 66, 67, 68)
- - 64. The method of claim 63, wherein said filtering step in step (b) provides an output that indicates at least two of the three features, namely, the onset, rise and fall of any significant tones of the input signal over time.
  - 65. The method of claim 63, wherein said filtering step in step (b) provides an output that indicates also frequency of any significant tones of the input signal.
  - 66. The method of claim 63, wherein said filtering step in step (a) employs a bank of M filters having different frequency pass bands, M being a positive integer greater than 1, each filter providing an output, said filters being indexed from 1 to M by pass band frequencies of the filters along a frequency axis and the outputs of the filters at N different times being indexed from 1 to N along a time axis, N being a positive integer greater than 1, forming an M by N array of filter outputs, said filtering step in step (b) including:
    - superposing at least one p by q array of filter coefficients upon a portion of said M by N array of filtered outputs, p and q being positive integers, so that each coefficient corresponds to a filter output in said M by N array; and
      
      multiplying each coefficient by the corresponding filter output to obtain products and summing said products to obtain a first processed value for said superposition upon said one portion.
  - 67. The method of claim 66, wherein the coefficients of said p by q array in the superposing step indicates one of the following elementary tonotopic features:
    - the onset, frequency, rise or fall of any significant tones in the input signal.
  - 68. The method of claim 63, said filtering step in step (b) including filtering the output of the filtering step in step (a) at different resolutions to provide speech context information for the recognition of phonemes.
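
The superposing and multiply-and-sum steps recited in claim 66 can be read as a two dimensional correlation of a small coefficient array with the M by N array of filter outputs. The following is a minimal sketch under that reading, not the patentee's implementation: the function names, the toy data, and in particular the `ONSET` coefficient values are hypothetical, since the claims do not give actual coefficients.

```python
def superpose(outputs, coeffs, row, col):
    """Place the p-by-q coefficient array with its top-left corner at
    (row, col) of the M-by-N array, multiply each coefficient by the
    filter output it covers, and sum the products."""
    total = 0.0
    for i, coeff_row in enumerate(coeffs):
        for j, coeff in enumerate(coeff_row):
            total += coeff * outputs[row + i][col + j]
    return total


def feature_map(outputs, coeffs):
    """Repeat the superposition at every position where the coefficient
    array fits entirely inside the M-by-N array."""
    m, n = len(outputs), len(outputs[0])
    p, q = len(coeffs), len(coeffs[0])
    return [
        [superpose(outputs, coeffs, r, c) for c in range(n - q + 1)]
        for r in range(m - p + 1)
    ]


# Hypothetical 1-by-2 coefficient array (an assumption, not from the patent):
# it responds where a channel's output increases from one time index to the next.
ONSET = [[-1.0, 1.0]]

# Toy M=3 channels (rows, tonotopic axis) by N=5 time indices (columns).
filter_outputs = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],  # a tone switches on at the third time index
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

onset_map = feature_map(filter_outputs, ONSET)
```

In this toy case the only nonzero first processed value lies in the middle channel, at the position where the tone begins. Different coefficient arrays would be needed for the rise, fall and frequency features named in claim 67.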

69. A method for recognizing speech features in an input speech signal that has time and frequency dependent amplitudes, said method comprising filtering said signal or a signal derived therefrom in a two dimensional representation in tonotopy and time to provide an output indicating contrast information in the representation, said contrast information in turn indicating the presence of any significant speech features in the input speech signal.
- View Dependent Claims (70, 71, 72, 73)
- - 70. The method of claim 69, said filtering including filtering at different resolutions to derive contrast information.
  - 71. The method of claim 70, wherein the input speech signal contains significant local variations in the two dimensional representation of the input speech signal, said two dimensional representation having a vertical tonotopic direction and a horizontal time direction, said filtering further comprising providing an output indicating orientations of said local variations with respect to the two directions.
  - 72. The method of claim 71, wherein said providing step provides an output at a resolution different from those of the contrast information derived by the filtering.
  - 73. The method of claim 71, wherein said local variations define boundaries with said orientations, said filtering at different resolutions generating outputs that indicate the presence of significant speech features in the input speech signal by indicating said boundaries.
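
Claims 69 and 70 recite filtering a tonotopy-time representation to obtain contrast information at different resolutions, without fixing a particular filter. As one illustrative possibility, the sketch below takes contrast to be the difference between a cell and the mean of its neighborhood, and treats the neighborhood radius as the resolution. Both choices, and all names used, are assumptions made for the example.

```python
def local_mean(grid, r, c, radius):
    """Mean of the cells within `radius` of (r, c), clipped at the edges."""
    rows, cols = len(grid), len(grid[0])
    window = [
        grid[i][j]
        for i in range(max(0, r - radius), min(rows, r + radius + 1))
        for j in range(max(0, c - radius), min(cols, c + radius + 1))
    ]
    return sum(window) / len(window)


def contrast_map(grid, radius):
    """Contrast at one resolution: each cell minus its neighborhood mean."""
    return [
        [grid[r][c] - local_mean(grid, r, c, radius) for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]


def multi_resolution_contrast(grid, radii=(1, 2)):
    """Contrast maps keyed by neighborhood radius (the assumed resolution)."""
    return {radius: contrast_map(grid, radius) for radius in radii}


# Toy representation: rows are the tonotopic axis, columns are time.
flat = [[2.0, 2.0, 2.0] for _ in range(3)]
spike = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]

flat_contrast = multi_resolution_contrast(flat)
spike_contrast = multi_resolution_contrast(spike)
```

A uniform representation yields zero contrast at every resolution, while an isolated peak yields a positive value at its own position and negative values around it.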

74. A method for recognizing speech features in an input speech signal, said input speech signal containing tonotopic information, said system comprising:
- filtering the input speech signal to provide a filtered output, said output indicating the tonotopic information of said input speech signal over a time period and identifying any significant speech features therein; and
  
  processing the output of the filtering step to provide a formation map to enable the identification of phoneme- or suprasegmental-related features in said input speech signal, and performing a local-averaging function on said formation map to provide a deformation map to enable the recognition of phonemes or suprasegmentals in said input speech signal irrespective of variability of speech of different speakers.
- View Dependent Claims (75)
- - 75. The method of claim 74, said processing and performing step processing the output of the filtering step to provide a plurality of phoneme- or suprasegmental-related feature formation maps and performing a local averaging operation on said plurality of phoneme- or suprasegmental-related feature formation maps to provide invariant phoneme- or suprasegmental-related feature maps.
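
Claim 74 recites a local-averaging function that turns a formation map into a deformation map so that recognition tolerates speaker variability. The sketch below illustrates the general idea with a plain box average over a square window. The window size, the function names and the two toy "speaker" maps are hypothetical and are not taken from the patent.

```python
def local_average(formation_map, radius=1):
    """Replace each cell by the mean of its neighborhood (clipped at edges)."""
    rows, cols = len(formation_map), len(formation_map[0])
    averaged = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                formation_map[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(sum(window) / len(window))
        averaged.append(out_row)
    return averaged


def overlap(map_a, map_b):
    """Sum of cell-wise products: zero when the two maps share no active cell."""
    return sum(a * b for row_a, row_b in zip(map_a, map_b)
               for a, b in zip(row_a, row_b))


# The same feature, displaced by one time index between two toy speakers.
speaker_a = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
speaker_b = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

raw_overlap = overlap(speaker_a, speaker_b)
smoothed_overlap = overlap(local_average(speaker_a), local_average(speaker_b))
```

The unaveraged maps do not overlap at all, whereas the averaged maps do, which is one way a small displacement of a feature between speakers can be absorbed before pattern matching.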

Current Assignee: Monowave Partners LP
Original Assignee: Monowave Corporation LP
Inventors: Tsiang, Elaine Y. L.
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): Doerrler, Michelle
Application Number: US07/938,862
Time in Patent Office: 847 Days
Field of Search: 381/41-49, 395/2, 395/2.4, 395/2.44, 395/2.6, 395/2.11, 395/2.41, 395/2.64, 382/22, 382/30
US Class Current: 704/235
CPC Class Codes: G10L 15/16 using artificial neural net...
