Method and apparatus for improved duration modeling of phonemes

US 6,785,652 B2
Filed: 12/19/2002
Issued: 08/31/2004
Est. Priority Date: 12/18/1997
Status: Expired due to Term

Abstract

A method and an apparatus for improved duration modeling of phonemes in a speech synthesis system are provided. According to one aspect, text is received into a processor of a speech synthesis system. The received text is processed using a sum-of-products phoneme duration model that is used in either the formant method or the concatenative method of speech generation. The phoneme duration model, which is used along with a phoneme pitch model, is produced by developing a non-exponential functional transformation form for use with a generalized additive model. The non-exponential functional transformation form comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration. The minimum and maximum phoneme durations are observed in training data. The received text is processed by specifying at least one of a number of contextual factors for the generalized additive model. An inverse of the non-exponential functional transformation is applied to duration observations, or training data. Coefficients are generated for use with the generalized additive model. The generalized additive model comprising the coefficients is applied to at least one phoneme of the received text resulting in the generation of at least one phoneme having a duration. An acoustic sequence is generated comprising speech signals that are representative of the received text.
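The training workflow described in the abstract (apply the inverse transformation to observed durations, fit coefficients, then map model output back through the transformation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the transformation is the cosine expression recited in claim 2 with α = β = 1, the contextual factors and durations are made-up values, and ordinary least squares on one-hot factor levels stands in for whatever fitting procedure the specification actually uses.

```python
import numpy as np

def forward(x, A, B):
    """Cosine transformation (claim 2 form, assuming alpha = beta = 1); output lies in [A, B]."""
    return (B - A) / 2.0 * np.cos(np.pi * (x - A) / (B - A)) + (A + B) / 2.0

def inverse(d, A, B):
    """Inverse of forward() for durations in [A, B]; clipping guards against rounding error."""
    u = np.clip((2.0 * d - A - B) / (B - A), -1.0, 1.0)
    return A + (B - A) / np.pi * np.arccos(u)

def one_hot(levels, n_levels):
    """Encode integer factor levels as indicator columns."""
    out = np.zeros((len(levels), n_levels))
    out[np.arange(len(levels)), levels] = 1.0
    return out

# Hypothetical training data: durations in ms and two invented contextual factors.
durations = np.array([62.0, 85.0, 110.0, 140.0, 95.0, 70.0])
accent = np.array([0, 1, 1, 1, 0, 0])      # 0 = unaccented, 1 = accented
position = np.array([0, 0, 1, 2, 2, 1])    # 0/1/2 = initial/medial/final

# Minimum and maximum durations observed in the training data control the transformation.
A, B = durations.min(), durations.max()

# Step 1: apply the inverse transformation to the duration observations.
z = inverse(durations, A, B)

# Step 2: fit additive coefficients in the transformed domain (least squares is an assumption).
X = np.hstack([one_hot(accent, 2), one_hot(position, 3)])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)

# Step 3: map the additive model output back to a duration.
predicted = forward(X @ coef, A, B)
```

One consequence of this choice of transformation is that every predicted duration stays between the observed minimum and maximum, whatever the fitted coefficients are.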

39 Citations

41 Claims

1. A method comprising:
- identifying a non-exponential functional transformation that defines a shape containing an inflection point, wherein the functional transformation comprises a root sinusoidal transformation; and
- incorporating the functional transformation into a generalized additive model for modeling phoneme durations.

2. The method of claim 1, wherein the functional transformation comprises:
- $F(x) = \left\{\left\{\frac{B - A}{2}\left[\cos\left(\pi \frac{x - A}{B - A}\right)\right]\right\}^{\alpha} + \frac{A + B}{2}\right\}^{\beta}$
3. The method of claim 1 further comprising:
- determining control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
4. The method of claim 3 further comprising:
- measuring a duration range for each phoneme in the training data.
5. The method of claim 3 further comprising:
- measuring a duration range for a plurality of phonemes in the training data.
6. The method of claim 1, wherein the shape contains a plurality of inflection points.
7. The method of claim 1 further comprising:
- selecting a contextual factor that influences phoneme durations.
8. The method of claim 7, wherein selecting a contextual factor comprises:
- choosing at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
9. The method of claim 1 further comprising:
- generating a duration for a phoneme using the generalized additive model.
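
As a worked check on the inflection-point language of claims 1 and 3, consider the expression of claim 2 as printed in this record with $\alpha = \beta = 1$. Those exponent values are an assumption made only to keep the algebra short; the claims do not fix them.

$$F(x) = \frac{B - A}{2}\cos\left(\pi\frac{x - A}{B - A}\right) + \frac{A + B}{2}$$

$$F'(x) = -\frac{\pi}{2}\sin\left(\pi\frac{x - A}{B - A}\right), \qquad F''(x) = -\frac{\pi^{2}}{2(B - A)}\cos\left(\pi\frac{x - A}{B - A}\right)$$

Under that assumption, $F''(x)$ changes sign on $(A, B)$ only at $x = (A + B)/2$, so the curve has a single inflection point at the midpoint of the duration range, where $F = (A + B)/2$ and the slope is $F' = -\pi/2$. How other values of $\alpha$ and $\beta$ move that point or change the slope is not worked out here.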

10. A computer-readable medium having executable instructions to cause a processor to perform a method comprising:
- identifying a non-exponential functional transformation that defines a shape containing an inflection point, wherein the functional transformation comprises a root sinusoidal transformation; and
- incorporating the functional transformation into a generalized additive model for modeling phoneme durations.

11. The computer-readable medium of claim 10, wherein the functional transformation comprises:
- $F(x) = \left\{\left\{\frac{B - A}{2}\left[\cos\left(\pi \frac{x - A}{B - A}\right)\right]\right\}^{\alpha} + \frac{A + B}{2}\right\}^{\beta}$
12. The computer-readable medium of claim 10, wherein the method further comprises:
- determining control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
13. The computer-readable medium of claim 12, wherein the method further comprises:
- measuring a duration range for each phoneme in the training data.
14. The computer-readable medium of claim 12, wherein the method further comprises:
- measuring a duration range for a plurality of phonemes in the training data.
15. The computer-readable medium of claim 10, wherein the shape contains a plurality of inflection points.
16. The computer-readable medium of claim 10, wherein the method further comprises:
- selecting a contextual factor that influences phoneme durations.
17. The computer-readable medium of claim 16, wherein selecting a contextual factor comprises:
- choosing at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
18. The computer-readable medium of claim 10, wherein the method further comprises:
- generating a duration for a phoneme using the generalized additive model.

19. A system comprising:
- a processor coupled to a memory through a bus; and
- a process executed from the memory by the processor to cause the processor to identify a non-exponential functional transformation that defines a shape containing an inflection point, and incorporate the functional transformation into a generalized additive model for modeling phoneme durations, wherein the functional transformation comprises a root sinusoidal transformation.

20. The system of claim 19, wherein the functional transformation comprises:
- $F(x) = \left\{\left\{\frac{B - A}{2}\left[\cos\left(\pi \frac{x - A}{B - A}\right)\right]\right\}^{\alpha} + \frac{A + B}{2}\right\}^{\beta}$
21. The system of claim 19, wherein the process further causes the processor to determine control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
22. The system of claim 21, wherein the process further causes the processor to measure a duration range for each phoneme in the training data.
23. The system of claim 21, wherein the process further causes the processor to measure a duration range for a plurality of phonemes in the training data.
24. The system of claim 19, wherein the shape contains a plurality of inflection points.
25. The system of claim 19, wherein the process further causes the processor to select a contextual factor that influences phoneme durations.
26. The system of claim 25, wherein the process further causes the processor, when selecting a contextual factor, to choose at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
27. The system of claim 19, wherein the process further causes the processor to generate a duration for a phoneme using the generalized additive model.

28. An apparatus comprising:
- means for identifying a non-exponential functional transformation that defines a shape containing an inflection point, wherein the functional transformation comprises a root sinusoidal transformation; and
- means for incorporating the functional transformation into a generalized additive model for modeling phoneme durations.

29. The apparatus of claim 28, wherein the functional transformation comprises:
- $F(x) = \left\{\left\{\frac{B - A}{2}\left[\cos\left(\pi \frac{x - A}{B - A}\right)\right]\right\}^{\alpha} + \frac{A + B}{2}\right\}^{\beta}$
30. The apparatus of claim 28 further comprising:
- means for determining control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
31. The apparatus of claim 30 further comprising:
- means for measuring a duration range for each phoneme in the training data.
32. The apparatus of claim 30 further comprising means for measuring a duration range for a plurality of phonemes in the training data.
33. The apparatus of claim 28, wherein the shape contains a plurality of inflection points.
34. The apparatus of claim 28 further comprising:
- means for selecting a contextual factor that influences phoneme durations.
35. The apparatus of claim 34, wherein the means for selecting a contextual factor chooses at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
36. The apparatus of claim 28 further comprising:
- means for generating a duration for a phoneme using the generalized additive model.

37. An apparatus comprising:
- means for receiving text signals;
- means for synthesizing an acoustic sequence from the text signals using a phoneme duration model, the phoneme duration model produced by incorporating a functional transformation form with an inflection point into a generalized additive model that calculates phoneme durations, wherein the functional transformation form comprises a root sinusoidal transformation, the root sinusoidal transformation controlled in response to a minimum phoneme duration and a maximum phoneme duration; and
- means for providing speech signals representative of the received text.

38. The apparatus of claim 37, wherein the means for synthesizing comprises:
39. The apparatus of claim 37, wherein the phoneme duration model is used in a formant method and a concatenative method of speech generation.
40. The apparatus of claim 37, wherein the phoneme duration model is a sum of products model, and wherein the means for synthesizing further comprises means for modeling phoneme pitch.
41. The apparatus of claim 37, wherein the generalized additive model is expressed by
- $D(f_1, f_2, \ldots, f_N) = F\left[\sum_{i=1}^{N} \prod_{j=1}^{M_i} a_{i,j} f_i(j)\right]$, where $D$ is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing $D$, $M_i$ is the number of values that $f_i$ can take, $a_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and $F$ is the functional transformation form.
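
The expression in claim 41 (a transformation F applied to a sum over factors i of a product over values j of factor scale times factor value) can be evaluated literally with a few lines of Python. This is only a sketch of the arithmetic: the function name `gam_duration`, the nested-list layout, and every number below are hypothetical, and the record does not say how the values f_i(j) are encoded numerically, so the inputs are placeholders rather than real linguistic features.

```python
import math
from typing import Callable, Sequence

def gam_duration(
    a: Sequence[Sequence[float]],
    f: Sequence[Sequence[float]],
    F: Callable[[float], float],
) -> float:
    """Evaluate F(sum_i prod_j a[i][j] * f[i][j]) for ragged inputs (M_i may differ per factor)."""
    total = 0.0
    for a_i, f_i in zip(a, f):
        if len(a_i) != len(f_i):
            raise ValueError("each factor needs one scale per value")
        # Product over the M_i values of factor i.
        total += math.prod(a_ij * f_ij for a_ij, f_ij in zip(a_i, f_i))
    return F(total)

# Hypothetical example with N = 2 factors, M_1 = 2 and M_2 = 3.
scales = [[2.0, 0.5], [1.0, 3.0, 1.0]]
values = [[1.0, 4.0], [2.0, 1.0, 0.5]]

# With the identity in place of F, the result is the bare sum of products: 4.0 + 3.0 = 7.0.
raw = gam_duration(scales, values, lambda s: s)
```

Passing F as a callable keeps the sum-of-products arithmetic separate from the choice of transformation, so a root sinusoidal form or any other candidate can be swapped in without touching the inner loop.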

Current Assignee: Apple Inc.
Original Assignee: Apple Computer Incorporated (Apple Inc.)
Inventors: Bellegarda, Jerome R.; Silverman, Kim
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Lerner, Martin
Application Number: US10/325,425
Publication Number: US 20030093277A1
Time in Patent Office: 621 Days
Field of Search: 704/211, 704/236, 704/258, 704/266, 704/267, 704/269, 704/260
US Class Current: 704/266
CPC Class Codes:
G10L 13/04   Details of speech synthesis...
G10L 13/08   Text analysis or generation...
G10L 13/10   Prosody rules derived from ...
