Method and apparatus for determining articulatory parameters from speech data

US 4,980,917 A
Filed: 12/27/1988
Issued: 12/25/1990
Est. Priority Date: 11/18/1987
Status: Expired due to Fees

Abstract

A system and method for determining, from continuous speech, the instantaneous values of a set of articulatory parameters. The continuous speech data is a sequence of spectral profiles obtained by spectrally sampling continuous speech. The spectral samples are presented in sequence to a plurality of class transforms, each establishing a respective speech phoneme class which includes a plurality of speech phonemes having similar spectral and articulatory characteristics. Each class transform converts a speech segment included in its class and contained in a spectral sample into a predetermined set of articulatory parameter values. A class-discriminating transform operates in parallel with the class transforms to produce a set of probability values, each indicating the probability that the spectral sample being transformed represents a phoneme in a respective speech phoneme class. An array of multipliers adjusts the predetermined values of the sets produced by the class transforms by multiplying the values of each set by the probability value produced for that set by the class-discriminating transform. The adjusted articulatory parameter value sets are combined by adding corresponding elements to produce a set of adjusted articulatory parameter values indicative of an articulatory tract configuration appropriate for producing the sampled speech.
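
The multiply-and-add stage described in the abstract can be illustrated with a minimal Python sketch. The function name `combine_class_outputs`, the two example classes, and every numeric value are hypothetical and chosen only for illustration; the patent text does not supply them.

```python
from typing import List, Sequence


def combine_class_outputs(
    class_params: Sequence[Sequence[float]],
    class_probs: Sequence[float],
) -> List[float]:
    """Scale each class's parameter set by its probability, then add element-wise."""
    if len(class_params) != len(class_probs):
        raise ValueError("need exactly one probability value per class")
    combined = [0.0] * len(class_params[0])
    for params, prob in zip(class_params, class_probs):
        for i, value in enumerate(params):
            # One multiplier per class output, all feeding a shared adder.
            combined[i] += prob * value
    return combined


# Two hypothetical classes, each producing three hypothetical parameter values.
class_a_output = [0.8, 0.2, 0.5]
class_b_output = [0.1, 0.9, 0.3]
print(combine_class_outputs([class_a_output, class_b_output], [0.75, 0.25]))
```

With these made-up inputs the first class dominates the result because it carries the larger probability value, which is the behavior the abstract attributes to the array of multipliers.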

38 Claims

1. A method of determining the values of a series of N articulatory parameters from speech data, comprising the steps of:
- creating a plurality of speech phoneme classes, each of said speech phoneme classes including a plurality of speech phonemes sharing similar spectral and articulatory characteristics;
  
  providing a digital speech data signal representative of speech;
  
  selecting data segments of said speech data signal at predetermined sampling intervals according to predefined changes in energy levels in said speech data signal;
  
  transforming said selected data segments into spectral data segments;
  
  converting each of said spectral data segments into said speech phoneme classes so as to generate a weight for the probability that said segment corresponds to phonemes within each of said classes;
  
  converting each of said spectral data segments into a plurality of articulatory parameters for each of said speech phoneme classes so as to generate a series of N parameter values representative of articulatory characteristics in each speech phoneme class; and
  
  combining the weight for the probability that spectral data segments correspond to a given speech phoneme class with the output parameter values from each speech phoneme class so as to form a single series of N parameter values for selected data segments.
  - 2. The method of claim 1, further comprising the steps of:
    - digitizing speech data at a predetermined sampling rate to form digital speech data;
      
      monitoring the energy level of said digital speech data;
      
      selecting segments of said digital speech data for processing at predetermined intervals according to the energy level of said digital signal, said segments comprising a plurality of digital data samples;
      
      boosting the high frequency level of said selected data segments;
      
      applying a window function to said selected data segments;
      
      applying a Fast Fourier Transform to samples in said selected data segments so as to form spectral data segments;
      
      generating the log of the magnitude of said spectral data segments to produce log segments having a logarithmic amplitude scale;
      
      applying a threshold condition to said log segments;
      
      converting each of said log segments into a plurality of speech phoneme vectors so as to generate a weight for the probability that said log segments correspond to spectra within each of said speech phoneme classes;
      
      converting each of said log segments into a plurality of articulatory parameter values for each of said plurality of speech phoneme classes so as to generate a series of N parameter values representative of parameters in each speech phoneme class to which spectra represented by said log segments correspond; and
      
      combining the weight for the probability that log segments correspond to a given speech formant class with the output parameter values from each speech formant class so as to form a single series of N parameter values.
  - 3. The method of claim 2 wherein said step of digitizing speech data further comprises the steps of:
    - receiving audio speech and converting it to analog speech signals;
      
      applying a high-frequency boosting filter to said analog speech signals; and
      
      applying a low frequency filter to said analog speech signals to remove frequencies below about 50 Hz.
  - 4. The method of claim 2 wherein said steps of monitoring the energy level and selecting segments of said digital signal for analysis further comprises the steps of:
    - rectifying said digital speech signal to form an absolute value rectified speech signal;
      
      smoothing said rectified speech signal;
      
      generating a log signal representing the log of the magnitude of said rectified speech signal;
      
      applying said log signal to a delay element to generate an output delayed by a predetermined period;
      
      subtracting the delay element output from the log signal to form a difference signal; and
      
      selecting segments from said digital speech signals when said difference signal increases for a predetermined period.
  - 5. The method of claim 4 wherein said step of selecting segments further comprises the steps of:
    - establishing a sample count;
      
      setting said sample count to zero;
      
      detecting a relative change in said difference signal as each digital sample is presented for analysis;
      
      incrementing the sample count by one;
      
      establishing a rise time count;
      
      recording a count in said rise time count each time said difference indicates an increase in level;
      
      comparing the values of said sample and rise time counts to predetermined count limits; and
      
      establishing a predetermined number of digital samples as a segment when said limits are reached; and
      
      resetting said counts to zero.
  - 6. The method of claim 4 wherein said smoothing step further comprises applying a relationship defined by:
    - $Y_n = (15/16)\,Y_{n-1} + X_n/16$
      where X_n represents an input digital signal and Y_n represents an output digital signal.
  - 7. The method of claim 2 wherein the step of digitizing comprises sampling an analog speech signal at a sampling rate on the order of at least twice the frequency of interest.
  - 8. The method of claim 2 wherein the step of applying a threshold condition comprises applying a condition that for any input signal {Z_k} having a maximum value over a given period of max{Z_k}, there is a corresponding output signal P_k which is defined as
    
    $P_k = Z_k - (\max\{Z_k\}\; N)$
    where N represents a dynamic range relative to the maximum value to be retained.
- 9. The method of claim 2 wherein the step of converting each of said log segments into a plurality of speech phoneme classes comprises the step of multiplying spectral samples in each log segment by a class distinction matrix in the form of a linear transformation matrix having Q columns by R rows, Q being a predetermined number of spectral ranges for sampling purposes and R being a number of spectral classes used, and each element representing a weighting factor for the probability that a given spectral component falls within a given one of said speech phoneme classes, said multiplying producing a raw class vector.
- 10. The method of claim 9 wherein the step of converting each of said log segments into a plurality of articulation parameters, comprises the step of multiplying spectral samples in each log segment by a plurality of class matrixes, each class matrix being in the form of a linear transformation matrix having S columns by P rows, S being a number of predetermined spectral ranges for sampling purposes and P being equal to the number of articulatory parameters used, and each element represents a weighting factor proportional to the probability that a given spectral component represents a given one of said parameters in the class, said multiplying producing a plurality of class parameter vectors.
- 11. The method of claim 10 wherein the step of combining comprises the steps of:
  - normalizing the raw class vector;
    
    multiplying log segments by each of said normalized raw class vector elements separately before multiplying by a class matrix corresponding to said normalized vector element so as to produce a weighted segment input for each class matrix; and
    
    adding all of said parameter vectors to form a single output parameter vector.
- 12. The method of claim 10 wherein the step of combining further comprises the steps of:
  - normalizing the raw class vector;
    
    multiplying each of said class parameter vectors by a single element of said normalized raw class vector elements corresponding to a class matrix the parameter vector originates from to produce a plurality of weighted parameter vectors; and
    
    adding all of said weighted parameter vectors to form a single output parameter vector.
- 13. The method of claim 2 wherein the step of boosting high frequency comprises the step of applying a relationship:
  - $Y_n = X_n - \alpha X_{n-1}$
    where Y_n is an output signal, X_n is an input signal and α is typically between 0.5 and 0.7.
- 14. The method of claim 2 wherein the step of transforming comprises the step of applying a function defined by
  
  $W_n = 0.5 - 0.49 \cos[(\pi/16)\,n]$ for $n = 0, \ldots, 31$
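
The relationships of claims 13 and 14, followed by the transform and log-magnitude steps named in claim 2, can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the default `alpha = 0.6` is simply a value inside the 0.5 to 0.7 range of claim 13, and the zero initial condition, the base-10 logarithm, the small `floor` guard, and keeping only bins 0 through N/2 are all assumptions. A direct DFT stands in for the Fast Fourier Transform named in claim 2.

```python
import cmath
import math
from typing import List, Sequence


def pre_emphasize(x: Sequence[float], alpha: float = 0.6) -> List[float]:
    """High-frequency boost of claim 13: Y_n = X_n - alpha * X_{n-1}."""
    out: List[float] = []
    prev = 0.0  # assumption: the sample before the segment is taken as zero
    for sample in x:
        out.append(sample - alpha * prev)
        prev = sample
    return out


def window_value(n: int) -> float:
    """Window of claim 14: W_n = 0.5 - 0.49 * cos((pi / 16) * n), n = 0..31."""
    return 0.5 - 0.49 * math.cos((math.pi / 16.0) * n)


def log_magnitude_spectrum(segment: Sequence[float], floor: float = 1e-12) -> List[float]:
    """Window a segment, take a direct DFT, and return log10 magnitudes."""
    size = len(segment)
    windowed = [window_value(n) * s for n, s in enumerate(segment)]
    spectrum: List[float] = []
    for k in range(size // 2 + 1):  # assumption: keep non-negative frequencies only
        acc = sum(
            windowed[n] * cmath.exp(-2j * math.pi * k * n / size)
            for n in range(size)
        )
        # The floor keeps log10 defined when a bin magnitude is exactly zero.
        spectrum.append(math.log10(max(abs(acc), floor)))
    return spectrum


# Hypothetical test tone: 4 cycles across one 32-sample segment.
tone = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(32)]
log_spec = log_magnitude_spectrum(pre_emphasize(tone))
peak_bin = max(range(len(log_spec)), key=lambda k: log_spec[k])
print(peak_bin)
```

The 32-sample segment length matches the value of D given in claim 18, which is also the span over which the window of claim 14 is defined.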
15. The method of claim 2 wherein said step of transforming comprises the step of transforming data samples according to relationship defined by:
- ##EQU6## where Z_k represents an output signal and Y_n represents an input signal.
16. The method of claim 1 wherein the step of monitoring the energy level of said digital speech signal further comprises the step of tracking pitch variations in said digital speech signal.
17. The method of claim 1 wherein said step of selecting comprises the step of transferring a predetermined number, D, of digital samples at a time.
18. The method of claim 17 wherein D=32.
19. The method of claim 1 further comprising the steps of:
- generating an image representative of a mid-sagital view of a human articulatory tract;
  
  associating said articulatory parameters with corresponding anatomical points on said image; and
  
  altering said image according to variations in said articulatory parameter values.
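
The energy-based segment selection of claims 4 through 6 can be sketched in Python as follows. The smoothing relationship is the one given in claim 6; the control flow around the two counts is only one assumed reading of claim 5. The parameter names `delay`, `rise_limit`, `sample_limit`, and `floor`, their default values, and the use of the natural logarithm are hypothetical.

```python
import math
from typing import List, Sequence


def select_segment_starts(
    samples: Sequence[float],
    delay: int = 16,         # hypothetical delay-element length, in samples
    rise_limit: int = 8,     # hypothetical rise time count limit
    sample_limit: int = 32,  # hypothetical sample count limit
    floor: float = 1e-6,     # guard so log() never receives zero
) -> List[int]:
    """Return the sample indices at which a segment would be triggered."""
    triggers: List[int] = []
    log_levels: List[float] = []
    smoothed = 0.0
    previous_diff = 0.0
    sample_count = 0
    rise_count = 0
    for n, x in enumerate(samples):
        rectified = abs(x)                                       # claim 4: rectify
        smoothed = (15.0 / 16.0) * smoothed + rectified / 16.0   # claim 6: smooth
        log_level = math.log(max(smoothed, floor))               # claim 4: log
        log_levels.append(log_level)
        # Delay element: compare against the level `delay` samples earlier.
        delayed = log_levels[n - delay] if n >= delay else log_level
        diff = log_level - delayed                               # difference signal
        sample_count += 1
        if diff > previous_diff:                                 # still rising
            rise_count += 1
        previous_diff = diff
        if rise_count >= rise_limit:
            triggers.append(n)             # sustained rise: select a segment
            sample_count = rise_count = 0
        elif sample_count >= sample_limit:
            sample_count = rise_count = 0  # no sustained rise in this span
    return triggers


silence = [0.0] * 64
burst = [(-1.0) ** n for n in range(64)]  # alternating +/-1 exercises rectification
print(select_segment_starts(silence + burst))
```

In this sketch the silent stretch produces no triggers because the difference signal stays at zero, while the onset of the burst produces triggers shortly after sample 64.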

20. An apparatus for determining the status of a plurality of articulatory parameters from speech data, comprising:
- sampling means for sampling speech data at a predetermined sampling rate and for providing speech data sample segments of predetermined length at predetermined sampling intervals based upon changes in energy in said speech data;
  
  a transformation processor connected in series with said sampling means for receiving said speech data sample segments and transforming them from time varying amplitude data into spectral data segments;
  
  first mapping means connected to said transformation processor for associating spectral data in each of said spectral data segments with one or more of a plurality of predefined speech phoneme classes so as to generate a weight for the probability that said segments correspond to spectra within each of said classes;
  
  second mapping means connected in series with said transformation processor and in parallel with said first mapping means for transforming spectral data in each of said spectral data segments into a plurality of articulatory parameters for each of said plurality of classes so as to generate a series of N articulatory parameter values representative of parameters in each class to which spectra represented by said segments correspond; and
  
  combination means connected to said first and second mapping means for combining said weight for the probability of a given class with the series of N articulatory parameters so as to generate a single weighted N parameter output.
  - 21. The apparatus of claim 20 wherein said sampling means comprises:
    - digitizing means for sampling speech data at a predetermined sampling rate and for forming digital speech data therefrom;
      
      energy monitoring means connected to receive said digital speech data for monitoring changes in energy therein; and
      
      segment selection means connected to said energy monitoring means for selecting segments of said digital speech data of predetermined length at predetermined sampling intervals based upon changes in energy of said digital speech data.
  - 22. The apparatus of claim 21 wherein said energy monitoring means comprises a pitch tracker for tracking pitch variations in the digital speech signals and for providing an output in response to predetermined pitch variations.
  - 23. The apparatus of claim 21 wherein said energy monitoring means comprises:
    - scaling means for converting said digital speech data to a logarithmic amplitude scale;
      
      a delay line in series with said scaling means for receiving logarithmic scaled digital speech signals and applying a predetermined delay thereto;
      
      summation means connected to an output of said delay means and to said scaling means for adding speech signals to delayed speech data segments; and
      
      trigger means connected between said summation means and said segment selection means for providing a selection signal to said selection means in response to an increase in the energy of said data segments for predetermined numbers of sampling periods.
  - 24. The apparatus of claim 21 further comprising frequency boosting means connected between said segment selection means and said energy monitor means for boosting high frequency components of said speech signals over a predetermined frequency range.
  - 25. The apparatus of claim 24 further comprising windowing means connected in series with said frequency boosting means for applying a predefined windowing function to said selected data segments.
  - 26. The apparatus of claim 24 further comprising log means connected between said transformation means and said first and second mapping means for converting an amplitude of spectral data segments to a logarithmic amplitude scale.
  - 27. The apparatus of claim 20 further comprising threshold means connected between said transformation means and said mapping means for removing spectral data outside of a predefined dynamic range which is measured from a maximum value for data in each group of said spectral values.
  - 28. The apparatus of claim 20 wherein said first mapping means comprises first matrix multiplication means for multiplying said spectral data segments by a predefined class distinction matrix.
  - 29. The apparatus of claim 28 further comprising vector normalization means for receiving an output from said first mapping means and generating a normalized class vector therefrom.
  - 30. The apparatus of claim 20 wherein said second mapping means comprises second matrix multiplication means for multiplying said spectral data segments substantially simultaneously by a plurality of predefined class matrixes.
  - 31. The apparatus of claim 30 wherein said summation means comprises:
    - a plurality of digital multipliers connected at a first input to said first mapping means and at a second input to said second mapping means so as to receive results of multiplying spectra data by said class association matrix at said first input and of multiplying by each of said class matrixes at a second input with one adder being connected to receive its second input from one class multiplication; and
      
      a digital adder connected to an output of all of said plurality of digital multipliers.
  - 32. The apparatus of claim 20 further comprising third mapping means connected between said transformation processor and said first mapping means for associating spectral data in each of said spectral data segments with one or more of a plurality of predefined spectral subclasses before association with said classes.
  - 33. The apparatus of claim 20 further comprising visual display means connected to said combination means for receiving said articulation parameters and displaying alterations in magnitudes of said parameters substantially simultaneously with an animated visual representation of an anatomical view of a vocal tract.
  - 34. The apparatus of claim 33 wherein said visual display means comprises:
    - graphics display means for displaying a predefined graphic pattern in the form of a human articulatory system on a visual screen; and
      
      animation means for altering said graphic pattern in response to changes in said articulatory parameters.
  - 35. The apparatus of claim 34 wherein said display means further comprises:
    - a display area for displaying said human articulatory system in a sectional view.
  - 36. The apparatus of claim 34 further comprising means for displaying numerical values for said articulatory parameters.
  - 37. The apparatus of claim 34 further comprising recording means for storing speech data and for replaying said data when desired.
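
The threshold means of claim 27, which removes spectral data outside a dynamic range measured from the maximum value, can be sketched in Python as follows. Claim 8 prints its relationship with no visible operator between max{Z_k} and N; this sketch assumes subtraction, that is P_k = Z_k - (max{Z_k} - N), and it assumes that values below the retained range are clipped to zero. Both are assumptions rather than statements of the patent, and the function name is hypothetical.

```python
from typing import List, Sequence


def apply_dynamic_range_threshold(
    log_segment: Sequence[float],
    dynamic_range: float,
) -> List[float]:
    """Keep only the part of each value lying within `dynamic_range` of the peak."""
    peak = max(log_segment)
    cutoff = peak - dynamic_range  # lowest level that is retained
    # Assumed reading: P_k = Z_k - (max{Z_k} - N), clipped at zero below the cutoff.
    return [max(z - cutoff, 0.0) for z in log_segment]


print(apply_dynamic_range_threshold([5.0, 3.0, 1.0, 4.5], dynamic_range=2.0))
# prints [2.0, 0.0, 0.0, 1.5]
```

Under this reading the output is always measured upward from the cutoff, so the largest value in a segment maps to `dynamic_range` regardless of the absolute input level.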

38. A system for determining values of articulatory parameters that are representative of articulation tract configuration during the production of speech, comprising:
- a speech converter for generating a series of speech spectral samples representative of continuous speech;
  
  a plurality of spectral transform means connected in parallel to said speech converter, each of said spectral transform means for establishing a respective speech phoneme class including a plurality of speech phonemes having corresponding spectral and articulatory characteristics and for converting a speech spectrum in its established class into a predetermined set of articulatory parameter values;
  
  a class distinction transform means connected to said speech converter for producing a set of probability values, each probability value of said set representing the probability that a respective speech phoneme class has a speech phoneme represented by said speech spectral sample;
  
  an arrayed combinatory modality connected to said plurality of spectral transform means and to said class distinction transform means for combining each of said articulatory parameter value sets with a respective probability value to produce a plurality of adjusted articulatory parameter value sets; and
  
  a single combinatory modality for combining said plurality of adjusted articulatory parameter value sets into a set of adjusted articulatory parameter values representative of an articulatory tract configuration.
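
The parallel structure of claim 38 can be read alongside the matrix formulation of claims 9 through 12: a class distinction matrix with R rows and Q columns yields a raw class vector, and each class matrix with P rows and S columns yields one class parameter vector. The Python sketch below is illustrative only. The sum-to-one normalization is an assumption, since the claims say only "normalizing", and the matrices and segment values are made up. Because each class matrix is a linear transformation, weighting the segment before the class matrix (claim 11) and weighting the class parameter vector afterwards (claim 12) give the same output vector.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Multiply a rows-by-columns matrix by a column vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def normalize(raw: Sequence[float]) -> Vector:
    """Assumed normalization: scale the raw class vector so it sums to one."""
    total = sum(raw)
    if total == 0:
        raise ValueError("raw class vector sums to zero")
    return [value / total for value in raw]


def weight_outputs(segment: Vector, distinction: Matrix, class_matrices: List[Matrix]) -> Vector:
    """Claim 12 ordering: transform first, then weight each class parameter vector."""
    weights = normalize(matvec(distinction, segment))
    result = [0.0] * len(class_matrices[0])
    for weight, class_matrix in zip(weights, class_matrices):
        params = matvec(class_matrix, segment)
        result = [r + weight * p for r, p in zip(result, params)]
    return result


def weight_inputs(segment: Vector, distinction: Matrix, class_matrices: List[Matrix]) -> Vector:
    """Claim 11 ordering: weight the segment first, then apply the class matrix."""
    weights = normalize(matvec(distinction, segment))
    result = [0.0] * len(class_matrices[0])
    for weight, class_matrix in zip(weights, class_matrices):
        params = matvec(class_matrix, [weight * s for s in segment])
        result = [r + p for r, p in zip(result, params)]
    return result


# Hypothetical sizes: Q = S = 4 spectral ranges, R = 2 classes, P = 3 parameters.
segment = [1.0, 0.5, 0.0, 2.0]
distinction = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
class_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
class_b = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(weight_outputs(segment, distinction, [class_a, class_b]))
print(weight_inputs(segment, distinction, [class_a, class_b]))
```

The two printed vectors agree up to floating-point rounding, so in this sketch the choice between the two orderings affects only where the multipliers sit, not the result.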

Current Assignee
Emerson & Stern Associates, Inc.
Original Assignee
Emerson & Stern Associates, Inc.
Inventors
Hutchins, Sandra E.
Primary Examiner(s)
Budd, Mark O.
Assistant Examiner(s)
Voeltz, Emanuel Todd

Application Number

US07/289,540
Time in Patent Office

728 Days
Field of Search

381/36-43, 364/513.5
US Class Current

704/254
CPC Class Codes

G10L 15/02 Feature extraction for spee...

G10L 2015/025 Phonemes, fenemes or fenone...
