SYSTEM-EFFECTED METHODS FOR ANALYZING, PREDICTING, AND/OR MODIFYING ACOUSTIC UNITS OF HUMAN UTTERANCES FOR USE IN SPEECH SYNTHESIS AND RECOGNITION

US 20100161327A1
Filed: 12/16/2009
Published: 06/24/2010
Est. Priority Date: 12/18/2008
Status: Active Grant

Abstract

A computer-implemented method for automatically analyzing, predicting, and/or modifying acoustic units of prosodic human speech utterances for use in speech synthesis or speech recognition. Possible steps include: initiating analysis of acoustic wave data representing the human speech utterances, via the phase state of the acoustic wave data; using one or more phase state defined acoustic wave metrics as common elements for analyzing, and optionally modifying, pitch, amplitude, duration, and other measurable acoustic parameters of the acoustic wave data, at predetermined time intervals; analyzing acoustic wave data representing a selected acoustic unit to determine the phase state of the acoustic unit; and analyzing the acoustic wave data representing the selected acoustic unit to determine at least one acoustic parameter of the acoustic unit with reference to the determined phase state of the selected acoustic unit. Also included are systems for implementing the described and related methods.
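The abstract does not say how the "phase state" of the wave data is computed. As a hedged illustration only, the sketch below assumes it can be approximated by the instantaneous phase of the analytic signal, from which amplitude and frequency are then read sample by sample. The function names (`dft`, `analytic_signal`, `phase_metrics`) and the 120 Hz test tone are assumptions made for this example and do not come from the patent.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for short frames."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(v * cmath.exp(sign * 2j * math.pi * k * i / n)
                    for i, v in enumerate(values))
        for k in range(n)
    ]

def analytic_signal(samples):
    """Suppress negative frequencies so every sample carries a phase angle."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(1, n):
        if k < (n + 1) // 2:
            spectrum[k] *= 2.0      # positive frequencies are doubled
        elif not (n % 2 == 0 and k == n // 2):
            spectrum[k] = 0.0       # negative frequencies are removed
    return dft(spectrum, inverse=True)

def phase_metrics(samples, sample_rate):
    """Return per-sample amplitude, phase (radians), and frequency (Hz)."""
    z = analytic_signal(samples)
    amplitude = [abs(c) for c in z]
    phase = [cmath.phase(c) for c in z]
    frequency = []
    for prev, cur in zip(phase, phase[1:]):
        step = (cur - prev) % (2.0 * math.pi)   # wrap the step into [0, 2*pi)
        frequency.append(step * sample_rate / (2.0 * math.pi))
    return amplitude, phase, frequency

# Hypothetical test tone: 120 Hz, amplitude 0.5, six full cycles.
sample_rate = 4000
tone = [0.5 * math.sin(2.0 * math.pi * 120.0 * i / sample_rate) for i in range(200)]
amplitude, phase, frequency = phase_metrics(tone, sample_rate)
print(round(frequency[100], 3), round(amplitude[100], 3))
```

Because pitch (frequency) and amplitude are both read from the same phase-indexed representation, they can be inspected at the same instants, which is the property the abstract attributes to phase-state metrics.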

29 Claims

1. A computer-implemented method for analyzing, predicting, and/or modifying acoustic units of prosodic human speech utterances for use in speech synthesis or speech recognition, the method comprising:
- (a) initiating analysis of acoustic wave data representing the human speech utterances, via the phase state of the acoustic wave data, the acoustic wave data being in constrained or unconstrained form;
  
  (b) using one or more phase state defined acoustic wave metrics as common elements for analyzing, and optionally modifying, one or more measurable acoustic parameters selected from the group consisting of pitch, amplitude, duration, and other measurable acoustic parameters of the acoustic wave data, at predetermined time intervals, two or more of the acoustic parameters optionally being analyzed and/or modified simultaneously;
  
  (c) analyzing acoustic wave data representing a selected one of the acoustic units to determine the phase state of the acoustic unit; and
  
  (d) analyzing the acoustic wave data representing the selected acoustic unit to determine at least one acoustic parameter of the acoustic unit with reference to the determined phase state of the selected acoustic unit.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 21, 22, 24, 25, 26, 27, 28, 29)
  - 2. A method according to claim 1 comprising concatenating the selected acoustic unit with an additional acoustic unit having at least one acoustic parameter compatible with the respective selected acoustic unit parameter as determined in a phase state of the additional acoustic unit similar to, or identical with, the phase state of the selected acoustic unit.
  - 3. A method according to claim 1 comprising concatenating the selected acoustic unit with an additional acoustic unit having a pitch, amplitude and duration compatible with the pitch, amplitude and duration of the selected acoustic unit, as determined in a phase state of the additional acoustic unit similar to, or identical with, the phase state of the selected acoustic unit.
  - 4. A method according to claim 1 comprising matching a sequence of the acoustic units with a sequence of text capable of visually representing the speech utterances.
  - 5. A method according to claim 4 wherein the speech utterances have an identifiable prosody and the method comprises labeling the text with prosodic phonetic units to represent the sequence of text with the prosody identified in the speech signal.
  - 6. A method according to claim 5 comprising tagging each prosodic phonetic unit with a bundle of acoustic feature values to describe the prosodic phonetic unit acoustically.
  - 7. A method according to claim 6 wherein the acoustic feature values comprise values for context-independent prosodic phonetic unit features and, optionally, context-dependent prosodic phonetic unit features determined by applying linguistic rules to the sequence of text.
  - 8. A method according to claim 1 comprising assembling the sequence of acoustic units from available acoustic units in a database of acoustic units, and wherein, optionally, each available acoustic unit in the database comprises a recorded element of speech voiced by a human speaker.
  - 9. A method according to claim 1 comprising determining a desired acoustic unit pathway comprising a sequence of acoustic feature vectors corresponding with a sequence of the acoustic units or with a sequence of text representing the speech utterances.
  - 10. A method according to claim 9 wherein the acoustic feature vectors each comprise a bundle of feature values selected for closeness to a statistical mean of the values of candidate acoustic units to be matched with a prosodic phonetic unit or one of the prosodic phonetic units and, optionally, for acoustic continuity with at least one adjacent acoustic feature vector.
  - 20. A method according to claim 1 comprising modifying a candidate acoustic unit by analyzing the candidate acoustic unit wave data into a vocal tract resonance signal and a residual glottal signal, modifying an acoustic parameter of the glottal signal and recombining the vocal tract resonance signal with the modified glottal signal to provide a modified candidate acoustic unit.
  - 21. A method according to claim 20 comprising analyzing the glottal signal to determine the time-dependent amplitude of the glottal signal with reference to the phase state of the glottal signal, determining the fundamental frequency of the glottal signal in the phase state and modifying the fundamental frequency of the glottal signal in the phase state to have a desired value.
  - 22. A method according to claim 20 wherein the vocal tract resonance signal comprises partial correlation coefficients and the method comprises modifying the vocal tract resonance signal by converting the partial correlation coefficients to log area ratios and altering or interpolating the log area ratios.
  - 24. A method according to claim 1 wherein the acoustic metrics or acoustic parameters comprise one or more metrics or parameters selected from the group consisting of pitch, amplitude, duration, fundamental frequency, formants, mel-frequency cepstral coefficients, energy, and time.
  - 25. A method according to claim 1 wherein the speech signal is a synthesized speech signal comprising a sequence of acoustic units concatenated to audibilize the sequence of text.
  - 26. A method according to claim 1 wherein the acoustic units are derived from prosodic speech recordings generated by human speakers pronouncing text annotated according to specific rules for a defined prosody.
  - 27. A method according to claim 1 wherein the sequence of text is generated to visually represent the speech recognized in the speech signal.
  - 28. A method according to claim 1 wherein the sequence of text is selected from the group consisting of a phrase, a sentence, multiple sentences, a paragraph, multiple paragraphs, a discourse, and a written work.
  - 29. A computerized system comprising software for performing a method according to claim 1, the software being stored in, or resident in, computer readable media.
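Claim 22 modifies the vocal tract resonance signal by converting partial correlation (PARCOR) coefficients to log area ratios (LARs) and altering or interpolating them. A minimal sketch of that conversion follows. It assumes the convention LAR = ln((1 + k) / (1 - k)); some texts use the reciprocal ratio, which only flips the sign. The function names and the two example frames are hypothetical.

```python
import math

def parcor_to_lar(k):
    """Log area ratio of one PARCOR coefficient; requires |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def lar_to_parcor(g):
    """Inverse mapping; tanh keeps the result strictly inside (-1, 1)."""
    return math.tanh(g / 2.0)

def interpolate_parcor(frame_a, frame_b, alpha):
    """Blend two PARCOR frames in the LAR domain (alpha=0 gives a, 1 gives b)."""
    blended = []
    for ka, kb in zip(frame_a, frame_b):
        g = (1.0 - alpha) * parcor_to_lar(ka) + alpha * parcor_to_lar(kb)
        blended.append(lar_to_parcor(g))
    return blended

# Hypothetical third-order frames from two neighboring acoustic units.
frame_a = [0.90, -0.40, 0.10]
frame_b = [0.50, -0.80, 0.30]
midpoint = interpolate_parcor(frame_a, frame_b, 0.5)
print([round(k, 3) for k in midpoint])
```

One reason to interpolate in the LAR domain is that any real-valued LAR maps back to a coefficient with magnitude below one, so the blended frame still describes a stable all-pole filter.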

11. A method for categorically mapping the relationship of at least one text unit in a sequence of text to at least one corresponding prosodic phonetic unit, to at least one linguistic feature category in the sequence of text, and to at least one speech utterance represented in a synthesized speech signal, the method comprising:
- (a) identifying, and optionally modifying, acoustic data representing the at least one speech utterance, to provide the synthesized speech signal;
  
  (b) identifying, and optionally modifying, the acoustic data representing the at least one utterance to provide the at least one speech utterance with an expressive prosody determined according to prosodic rules; and
  
  (c) identifying acoustic unit feature vectors for each of the at least one prosodic phonetic units, each acoustic unit feature vector comprising a bundle of feature values selected according to proximity to a statistical mean of the values of acoustic unit candidates available for matching with the respective prosodic phonetic unit and, optionally, for acoustic continuity with at least one adjacent acoustic feature vector.
- View Dependent Claims (12, 13, 14, 15, 16, 17)
  - 12. A method according to claim 11 comprising selecting an acoustic unit for the sequence of acoustic units from the database of candidate acoustic units according to the proximity of the feature values of the selected acoustic unit to a corresponding acoustic feature vector on the acoustic unit pathway and, optionally, for acoustic continuity of the selected acoustic unit with the acoustic feature values of one or more neighboring acoustic units in the sequence of acoustic units.
  - 13. A method according to claim 11 comprising selecting each acoustic unit for the sequence of acoustic units according to a rank ordering of acoustic units available to represent a specific prosodic phonetic unit in the sequence of text, the rank ordering being determined by the differences between the feature values of available acoustic units and the feature values of an acoustic feature vector on the acoustic unit pathway.
  - 14. A method according to claim 11 comprising determining individual linguistic and acoustic weights for each prosodic phonetic unit according to linguistic feature hierarchies related to a prior adjacent prosodic phonetic unit and to a next adjacent prosodic phonetic unit, wherein each candidate acoustic unit can have different target and join weights for each respective end of the candidate acoustic unit.
  - 15. A method according to claim 14 comprising:
    - (d) measuring one or more acoustic parameters, optionally F0, F1, F2, F3, energy, and the like, across a particular acoustic unit corresponding to a particular prosodic phonetic unit to determine time related changes in the one or more acoustic parameters; and
      
      (e) modeling the particular acoustic unit and the relevant acoustic parameter values of the prior adjacent prosodic phonetic unit and the next adjacent prosodic phonetic unit.
  - 16. A method according to claim 15 wherein the modeling comprises applying combinations of fourth-order polynomials and second- and third-order polynomials to represent n-dimensional trajectories of the modeled acoustic units through unconstrained acoustic space; or comprises applying a lower-order polynomial to constrained acoustic space; and optionally diphones and triphones.
  - 17. A method according to claim 11 comprising using linguistically selected acoustical candidates for calculating acoustic features for synthesizing speech utterances from text by:
    - calculating an absolute and/or relative desired acoustic parameter value, optionally in terms of fundamental frequency and/or a change in fundamental frequency over the duration of the acoustic unit, the duration optionally being represented by a single point, multiple points, Hermite splines or another suitable representation, the desired acoustic parameter value being based on a weighted average of the actual acoustic parameters for a set of candidate acoustic units selected according to their correspondence to a particular linguistic context, wherein the weighting favors acoustic unit candidates more closely corresponding to the particular linguistic context.
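Claims 12, 13, and 17 together describe computing a desired acoustic value as a weighted average over linguistically selected candidates and then rank-ordering candidates by their distance from that target. The sketch below is one possible reading. The dictionary layout, the `context_weight` field, the feature names, and the unnormalized Euclidean distance are all assumptions for illustration; a practical system would likely scale features before comparing them.

```python
import math

def weighted_target(candidates):
    """Weighted mean per feature; weights express linguistic-context match."""
    total = sum(c["context_weight"] for c in candidates)
    names = candidates[0]["features"].keys()
    return {
        name: sum(c["context_weight"] * c["features"][name] for c in candidates) / total
        for name in names
    }

def rank_candidates(candidates, target):
    """Order candidates by Euclidean distance to the target feature vector."""
    def distance(c):
        return math.sqrt(sum((c["features"][n] - target[n]) ** 2 for n in target))
    return sorted(candidates, key=distance)

# Hypothetical candidates for one prosodic phonetic unit.
candidates = [
    {"id": "u1", "context_weight": 3.0, "features": {"f0_hz": 200.0, "delta_f0_hz": 10.0}},
    {"id": "u2", "context_weight": 1.0, "features": {"f0_hz": 180.0, "delta_f0_hz": -10.0}},
    {"id": "u3", "context_weight": 1.0, "features": {"f0_hz": 220.0, "delta_f0_hz": 0.0}},
]
target = weighted_target(candidates)
ranked = rank_candidates(candidates, target)
print(target, [c["id"] for c in ranked])
```

Here the candidate with the closest linguistic context (`u1`) carries three times the weight of the others, so it pulls the target toward its own values and ranks first.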

18. A method for assigning linguistic and acoustic weights to prosodic phonetic units useful for concatenation into synthetic speech or for speech recognition, the method comprising:
- determining individual linguistic and acoustic weights for each prosodic phonetic unit according to linguistic feature hierarchies related to a prior adjacent prosodic phonetic unit and to a next adjacent prosodic phonetic unit, wherein each candidate acoustic unit can have different target and join weights for each respective end of the candidate acoustic unit;
  
  measuring one or more acoustic parameters, optionally F0, F1, F2, F3, energy, and the like, across a particular acoustic unit corresponding to a particular prosodic phonetic unit to determine time related changes in the one or more acoustic parameters; and
  
  modeling the particular acoustic unit and the relevant acoustic parameter values of the prior adjacent prosodic phonetic unit and the next adjacent prosodic phonetic unit.
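The measuring and modeling steps can be illustrated with a polynomial fit, the model family that claim 16 names for acoustic trajectories. The sketch below fits a second-order polynomial to F0 samples taken across one acoustic unit using ordinary least squares. The sample values, the normalized time axis, and the function names are hypothetical, and the patent does not specify this particular fitting procedure.

```python
def fit_polynomial(times, values, order):
    """Least-squares fit; returns coefficients c[0] + c[1]*t + c[2]*t^2 + ..."""
    size = order + 1
    # Normal equations A c = b with A[i][j] = sum t^(i+j), b[i] = sum v * t^i.
    a = [[sum(t ** (i + j) for t in times) for j in range(size)] for i in range(size)]
    b = [sum(v * t ** i for t, v in zip(times, values)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for j in range(col, size):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, size))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs

def evaluate(coeffs, t):
    """Evaluate the fitted polynomial at time t."""
    return sum(c * t ** i for i, c in enumerate(coeffs))

# Hypothetical F0 measurements (Hz) at normalized times across one unit.
times = [0.0, 0.25, 0.5, 0.75, 1.0]
f0 = [110.0, 116.0, 120.0, 122.0, 122.0]
coeffs = fit_polynomial(times, f0, order=2)
print([round(c, 3) for c in coeffs])
```

The fitted coefficients summarize the time-related change in the parameter compactly, so the same representation can be compared against the trajectories of the prior and next adjacent units.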

19. A method for deriving a path through acoustic space, the acoustic path comprising desired acoustic feature values for each sequential unit of a sequence of acoustic units to be employed in synthesizing speech from text, the method comprising:
- calculating the acoustic path in absolute and/or relative coordinates, optionally in terms of fundamental frequency and/or a change in fundamental frequency over the duration of the synthesizing of the text, for the sequence of acoustic units, each desired sequential acoustic unit being represented by a representation, optionally a single point, multiple points, Hermite splines or another suitable acoustic unit representation, according to a weighted average of the acoustic parameters of the acoustic unit representation, wherein the weighted average is based on a degree of accuracy with which the acoustic parameters for each such sequentially desired acoustic unit are known, and on a degree of influence ascribed to each sequential acoustic unit according to the context of the acoustic unit in the sequence of desired acoustic units.
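Claim 19 allows each unit on the acoustic path to be represented by Hermite splines. As a sketch under stated assumptions, the function below evaluates a standard cubic Hermite segment for F0 from a start value, an end value, and the slope at each end. The parameter names and the example numbers are illustrative and are not taken from the patent.

```python
def hermite_f0(f0_start, slope_start, f0_end, slope_end, duration, t):
    """Cubic Hermite interpolation of F0 (Hz) at time t in [0, duration].

    Slopes are in Hz per second, so they are scaled by the unit duration.
    """
    s = t / duration
    h00 = 2 * s ** 3 - 3 * s ** 2 + 1
    h10 = s ** 3 - 2 * s ** 2 + s
    h01 = -2 * s ** 3 + 3 * s ** 2
    h11 = s ** 3 - s ** 2
    return (h00 * f0_start + h10 * duration * slope_start
            + h01 * f0_end + h11 * duration * slope_end)

# Hypothetical unit: 0.2 s long, rising from 110 Hz to 130 Hz, flat at both ends.
duration = 0.2
contour = [hermite_f0(110.0, 0.0, 130.0, 0.0, duration, duration * i / 4)
           for i in range(5)]
print([round(v, 2) for v in contour])
```

Because each segment stores values and slopes at its ends, adjacent units that share an end value and slope join without a jump in F0 or in its rate of change.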

23. A method of deriving an acoustic path comprising a sequence of desired acoustic units extending through unconstrained acoustic space, the acoustic path being useful for synthesizing speech from text with a desired style of speech prosody by concatenating the sequence of desired acoustic units, the method comprising:
- (a) providing a database of acoustic units wherein each acoustic unit is identified according to a prosodic phonetic unit name and at least one additional linguistic feature;
  
  and wherein each acoustic unit has been analyzed according to phase-state metrics so that pitch, energy, and spectral wave data can be modified simultaneously at one or more instants in time;
  
  (b) mapping each acoustic unit to prosodic phonetic unit categorizations and additional linguistic categorizations enabling the acoustic unit to be specified and/or altered to provide one or more acoustic units for incorporation into expressively synthesized speech according to prosodic rules;
  
  (c) calculating weighted absolute and/or relative acoustic values for a set of candidate acoustic units to match each desired acoustic unit, one candidate set per desired acoustic unit, matching being in terms of linguistic features for the corresponding mapped prosodic phonetic unit or a substitute for the corresponding mapped prosodic phonetic unit;
  
  (d) calculating an acoustic path through n-dimensional acoustic space to be sequenced as an utterance of synthesized speech, the acoustic path being defined by the weighted average values for each candidate set of acoustic units;
  
  (e) selecting and modifying, as needed, a sequence of acoustic units, or sub-units, for the synthesized speech according to the differences between the weighted acoustic values for a candidate acoustic unit, or sub-unit, and the weighted acoustic values of a point on the calculated acoustic path.
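Step (e) selects units according to how far each candidate lies from the calculated path. A common way to make such a selection over a whole sequence is dynamic programming over target and join costs, sketched below. This is an assumed technique for illustration, not the procedure recited in the claim. The cost definitions, the `f0_start` and `f0_end` fields, and `join_weight` are all hypothetical.

```python
def target_cost(cand, path_f0):
    """Distance between the candidate's mean F0 and the path point."""
    return abs((cand["f0_start"] + cand["f0_end"]) / 2.0 - path_f0)

def join_cost(prev, cand):
    """F0 mismatch at the boundary between two consecutive candidates."""
    return abs(prev["f0_end"] - cand["f0_start"])

def select_units(path, candidate_sets, join_weight=1.0):
    """Choose one candidate per path point, minimizing total cost (Viterbi)."""
    costs = [target_cost(c, path[0]) for c in candidate_sets[0]]
    back = []
    for i in range(1, len(path)):
        new_costs, pointers = [], []
        for cand in candidate_sets[i]:
            options = [costs[j] + join_weight * join_cost(prev, cand)
                       for j, prev in enumerate(candidate_sets[i - 1])]
            j_best = min(range(len(options)), key=options.__getitem__)
            new_costs.append(options[j_best] + target_cost(cand, path[i]))
            pointers.append(j_best)
        costs = new_costs
        back.append(pointers)
    # Trace the cheapest sequence back from the final position.
    j = min(range(len(costs)), key=costs.__getitem__)
    chosen = [j]
    for pointers in reversed(back):
        j = pointers[j]
        chosen.append(j)
    chosen.reverse()
    return [candidate_sets[i][j]["id"] for i, j in enumerate(chosen)]

# Hypothetical path (mean F0 in Hz) and one candidate set per path point.
path = [120.0, 130.0]
candidate_sets = [
    [{"id": "a1", "f0_start": 118.0, "f0_end": 122.0},
     {"id": "a2", "f0_start": 110.0, "f0_end": 130.0}],
    [{"id": "b1", "f0_start": 140.0, "f0_end": 120.0},
     {"id": "b2", "f0_start": 124.0, "f0_end": 134.0}],
]
print(select_units(path, candidate_sets))
print(select_units(path, candidate_sets, join_weight=0.0))
```

With the join cost switched off, the second position falls to the candidate closest to the path (`b1`); with it enabled, the slightly worse target match `b2` wins because it joins more smoothly to `a1`.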

Current Assignee
Lessac Technologies, Inc.
Original Assignee
Lessac Technologies, Inc.
Inventors
Reichenbach, John B.; Marple, Gary A.; Wilhelms-Tricarico, Reiner; Mottershead, Brian; Chandra, Nishant; Nitisaroj, Rattima

Granted Patent

US 8,401,849 B2
US Class Current

704/235
CPC Class Codes

G10L 13/027   Concept to speech synthesis...

G10L 13/06   Elementary speech units use...

G10L 13/10   Prosody rules derived from ...

G10L 15/1807   using prosody or stress

G10L 17/02   Preprocessing operations, e...

G10L 25/48   specially adapted for parti...
