Speech synthesis using concatenation of speech waveforms

US 6,665,641 B1
Filed: 11/12/1999
Issued: 12/16/2003
Est. Priority Date: 11/13/1998
Status: Expired due to Term


14 Assignments

0 Petitions

Abstract

A high quality speech synthesizer in various embodiments concatenates speech waveforms referenced by a large speech database. Speech quality is further improved by speech unit selection and concatenation smoothing.

458 Citations

108 Claims

1. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms;
  
  b. a speech waveform selector in communication with the speech database that selects waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the criteria include a requirement favoring waveform candidates having pitch within a range determined as a function of high-level linguistic features, and wherein the criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.

2. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms;
  
  b. a speech waveform selector in communication with the speech database that selects waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the criteria include a requirement favoring waveform candidates having a duration within a range determined as a function of high-level linguistic features, and wherein the criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.

3. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms;
  
  b. a speech waveform selector in communication with the speech database that selects waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the criteria include a requirement favoring waveform candidates having coarse pitch continuity within a range determined as a function of high-level linguistic features, and wherein the criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
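
Claims 1 to 3 each recite a cost function with steep sides and a roughly flat bottom, applied to a prosodic value whose acceptable range depends on high-level linguistic features. A minimal sketch of that idea for pitch follows. The feature names (`stressed`, `phrase_final`), the numeric ranges, and the slope are assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinguisticContext:
    """Hypothetical high-level linguistic features of one target unit."""
    stressed: bool
    phrase_final: bool


def pitch_range_hz(ctx):
    """Map high-level features to an acceptable pitch range (made-up numbers)."""
    low, high = 90.0, 180.0
    if ctx.stressed:
        low, high = low + 20.0, high + 40.0
    if ctx.phrase_final:
        low, high = low - 20.0, high - 30.0
    return low, high


def flat_bottom_cost(value, low, high, slope=5.0):
    """Zero cost inside [low, high]; cost rises linearly with `slope` outside."""
    if value < low:
        return slope * (low - value)
    if value > high:
        return slope * (value - high)
    return 0.0


def pitch_cost(candidate_pitch_hz, ctx):
    """Cost of a candidate's pitch given the target's linguistic context."""
    low, high = pitch_range_hz(ctx)
    return flat_bottom_cost(candidate_pitch_hz, low, high)
```

Every candidate whose pitch falls inside the range costs the same (zero), so the selector does not prefer one plausible pitch over another; only unlikely values are penalized. The same `flat_bottom_cost` could be reused for duration (claim 2) with a different range function.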

4. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a target generator for generating a sequence of target feature vectors responsive to a phonetic transcription input;
  
  c. a waveform selector that selects a sequence of waveforms referenced by the database, each waveform in the sequence corresponding to a first non-null set of target feature vectors, wherein the waveform selector attributes, to any waveform candidate, a node cost, wherein the node cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined using a cost function that varies nontrivially according to a second non-null set of target feature vectors in the sequence; and
  
  d. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
- Dependent claims (5, 6, 7):
  - 5. A synthesizer according to claim 4, wherein the first and second sets are identical.
  - 6. A synthesizer according to claim 4, wherein the second set is proximate to the first set in the sequence.
  - 7. A synthesizer according to claim 4, wherein the second set is a function of the first set.
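
One possible reading of claims 4 to 7 is a node cost in which the shape of an individual cost function depends on neighbouring target vectors. The sketch below widens the duration tolerance when the following target is a pause. The dictionary keys, the `"pau"` label, the tolerances, and the pitch weight are all hypothetical.

```python
def duration_cost(candidate_ms, target_ms, tolerance_ms):
    """Flat within +/- tolerance_ms of the target, linear beyond it."""
    return max(0.0, abs(candidate_ms - target_ms) - tolerance_ms)


def node_cost(candidate, targets, i):
    """Sum of individual feature costs for one candidate at target position i."""
    target = targets[i]
    # The "second set" of target vectors is taken here to be the following
    # target, i.e. one that is proximate to the first set (as in claim 6).
    following = targets[i + 1] if i + 1 < len(targets) else None
    before_pause = following is None or following["phoneme"] == "pau"
    # Assumed rule: durations before a pause may vary more freely.
    tolerance_ms = 40.0 if before_pause else 15.0
    cost = duration_cost(candidate["duration_ms"], target["duration_ms"], tolerance_ms)
    cost += 0.1 * abs(candidate["pitch_hz"] - target["pitch_hz"])
    return cost
```

The same candidate can therefore receive different node costs at the same target, depending only on what follows that target in the sequence.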

8. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a target generator for generating a sequence of target feature vectors responsive to a phonetic transcription input;
  
  c. a waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to pairs of adjacent waveform candidates, a transition cost, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined using a cost function that varies nontrivially according to the features of a region in the phonetic transcription input that corresponds to adjacent waveform candidates; and
  
  d. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
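
Claim 8 describes a transition cost that depends on the region of the phonetic transcription where two candidates meet. The sketch below relaxes a pitch-jump penalty when the join falls in an unvoiced region, and picks the cheapest sequence with a dynamic-programming search. The search method is an assumption, since the claim does not name one, and the field names, context labels, and weights are hypothetical.

```python
def transition_cost(left, right, join_context):
    """Pitch-jump penalty whose weight depends on the context of the join."""
    weight = 0.2 if join_context == "unvoiced" else 1.0
    return weight * abs(left["end_pitch"] - right["start_pitch"])


def select_sequence(candidates, node_costs, join_contexts):
    """Return indices of the cheapest candidate per position.

    candidates[i]    : list of candidate dicts for target position i
    node_costs[i][k] : node cost of candidate k at position i
    join_contexts[i] : context label for the join between positions i and i+1
    """
    # best[i][k] = (cheapest total cost ending in candidate k, back-pointer)
    best = [[(node_costs[0][k], None) for k in range(len(candidates[0]))]]
    for i in range(1, len(candidates)):
        row = []
        for k, cand in enumerate(candidates[i]):
            total, prev = min(
                (best[i - 1][j][0] + transition_cost(p, cand, join_contexts[i - 1]), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((total + node_costs[i][k], prev))
        best.append(row)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(k)
        k = best[i][k][1]
    return path[::-1]
```

With identical candidates and node costs, changing only the join context can change which sequence wins, because a pitch jump that is expensive in a voiced region becomes cheap in an unvoiced one.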

9. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a speech waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a cost function having a plurality of steep sides; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
- Dependent claims (10, 11, 12):
  - 10. A speech synthesizer according to claim 9, wherein the at least one individual cost function is piecewise linear.
  - 11. A speech synthesizer according to claim 9, wherein the at least one individual cost function is asymmetric.
  - 12. A speech synthesizer according to claim 9, wherein the cost function includes a region that approximates a flat bottom.

13. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a speech waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a piecewise linear cost function that has a region that approximates a flat bottom; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.

14. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a speech waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using an asymmetric cost function that has a region that approximates a flat bottom; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
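
Claims 9 to 14 combine several properties of a numeric cost function: steep sides, piecewise linearity, asymmetry, and a flat bottom. A short sketch that has all four at once is given here, built from a list of breakpoints. The breakpoint values and the choice of duration ratio as the feature are illustrative assumptions.

```python
from bisect import bisect_right


def piecewise_linear(breakpoints):
    """Build a cost function that interpolates linearly between (x, cost) points.

    Outside the first and last breakpoint the cost is held constant.
    """
    xs = [x for x, _ in breakpoints]
    ys = [y for _, y in breakpoints]

    def cost(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        j = bisect_right(xs, x)  # xs[j - 1] <= x < xs[j]
        x0, x1 = xs[j - 1], xs[j]
        y0, y1 = ys[j - 1], ys[j]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return cost


# Hypothetical cost on the ratio candidate duration / target duration:
# flat bottom on [0.8, 1.3], a steep side for units that are too short,
# and a gentler side for units that are too long (hence asymmetric).
duration_ratio_cost = piecewise_linear(
    [(0.5, 10.0), (0.8, 0.0), (1.3, 0.0), (2.0, 3.5)]
)
```

Describing the function by breakpoints keeps tuning simple: moving one point changes one side of the curve without affecting the other.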

15. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost of a symbolic feature is determined using a non-binary numeric function determined by recourse to a table; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.

16. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost of a symbolic feature is determined using a non-binary numeric function determined by recourse to a set of rules; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
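
Claims 15 and 16 cover symbolic features whose mismatch cost is graded rather than binary, obtained either from a table or from a set of rules. Both variants are sketched below. The feature names, labels, and cost values are hypothetical.

```python
# Table variant: hypothetical costs for (target stress, candidate stress).
STRESS_COST = {
    ("primary", "primary"): 0.0,
    ("primary", "secondary"): 0.3,
    ("primary", "none"): 1.0,
    ("secondary", "primary"): 0.4,
    ("secondary", "secondary"): 0.0,
    ("secondary", "none"): 0.5,
    ("none", "primary"): 1.0,
    ("none", "secondary"): 0.6,
    ("none", "none"): 0.0,
}


def stress_cost(target, candidate):
    """Look up the graded mismatch cost between two stress labels."""
    return STRESS_COST[(target, candidate)]


def position_cost(target_pos, candidate_pos):
    """Rule variant: cost for a symbolic 'position in phrase' feature."""
    if target_pos == candidate_pos:
        return 0.0
    if "final" in (target_pos, candidate_pos):
        return 1.0  # assumed: phrase-final units differ strongly from the rest
    return 0.25  # assumed: initial vs. medial is a milder mismatch
```

Note that the table need not be symmetric: using an unstressed unit where primary stress is wanted may be penalized differently from the reverse.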

17. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a target generator for generating a sequence of target feature vectors responsive to a phonetic transcription input;
  
  c. a waveform selector that selects a sequence of waveforms referenced by the database, each waveform in the sequence corresponding to a first non-null set of target feature vectors, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of weighted individual costs associated with each of a plurality of features, and wherein the weight associated with at least one of the individual costs varies nontrivially according to a second non-null set of target feature vectors in the sequence, such target features including at least one feature other than target phoneme identity; and
  
  d. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
- Dependent claims (18, 19, 20):
  - 18. A synthesizer according to claim 17, wherein the first and second sets are identical.
  - 19. A synthesizer according to claim 17, wherein the second set is proximate to the first set in the sequence.
  - 20. A synthesizer according to claim 17, wherein the second set is a function of the first set.
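
In claims 17 to 20 it is the weights, rather than the cost functions, that vary with the target feature vectors. A small sketch under assumed features follows: `stressed` and `phrase_final` are hypothetical target features other than phoneme identity, and the weight values are made up.

```python
def feature_weights(context_targets):
    """Choose feature weights from a set of target vectors (assumed rules)."""
    weights = {"pitch": 1.0, "duration": 1.0}
    if any(t["stressed"] for t in context_targets):
        weights["pitch"] = 2.0  # pitch errors are assumed more audible here
    if any(t["phrase_final"] for t in context_targets):
        weights["duration"] = 3.0  # final lengthening is assumed to matter more
    return weights


def weighted_cost(individual_costs, weights):
    """Combine named individual costs into one weighted sum."""
    return sum(weights[name] * cost for name, cost in individual_costs.items())
```

Passing only the current target as `context_targets` corresponds to the case of claim 18; passing its neighbours corresponds to claim 19.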

21. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a waveform cost, wherein the waveform cost is a function of individual costs associated with each of a plurality of features, and wherein calculation of the waveform cost is aborted after it is determined that the waveform cost will exceed a threshold; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
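
The early abort in claim 21 can be sketched as a running sum that stops as soon as the threshold is exceeded. The sketch assumes that all individual costs are non-negative, so the total can only grow, and it represents each individual cost as a zero-argument callable so that skipped costs are never computed. Both choices are assumptions of this example.

```python
import math


def pruned_cost(individual_costs, threshold):
    """Sum non-negative costs, aborting once the total exceeds `threshold`.

    `individual_costs` is a sequence of zero-argument callables.
    Returns (cost, evaluated); cost is math.inf when the sum was aborted.
    """
    total = 0.0
    evaluated = 0
    for cost_fn in individual_costs:
        total += cost_fn()
        evaluated += 1
        if total > threshold:
            return math.inf, evaluated
    return total, evaluated
```

Ordering the callables so that cheap, highly discriminating features come first makes the abort trigger earlier on poor candidates.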

22. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms and associated symbolic prosodic features, wherein the database is accessed by speech waveform designators, each designator being associated with a sequence of diphones, the sequence having at least one diphone;
  
  b. a speech waveform selector, in communication with the speech database, that selects, based, at least in part, on the symbolic prosodic features, waveforms referenced by the database using speech waveform designators that correspond to a phonetic transcription input wherein the waveform selector attributes, to pairs of adjacent waveform candidates, a transition cost, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined by using, as an argument, an acoustic distance value selected from one of a first set of tables, each table in the first set corresponding to a non-null set of phonemes; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
- Dependent claims (23, 24):
  - 23. A speech synthesizer according to claim 22, wherein the acoustic distance is spectral distance and each table in the first set corresponds to a different phoneme.
  - 24. A speech synthesizer according to claim 22, wherein the first set of tables is the result of vector quantization of spectra.
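
Claims 22 to 24 replace a run-time spectral comparison with a table lookup: spectra are vector-quantized, and distances between codewords are stored in one table per phoneme. A toy sketch follows. The two-dimensional codebooks, the Euclidean distance, and the linear weighting are assumptions made for illustration.

```python
import math


def build_distance_table(codebook):
    """Precompute Euclidean distances between all pairs of codewords."""
    return [[math.dist(a, b) for b in codebook] for a in codebook]


# Hypothetical codebooks, one per phoneme; real ones would hold spectral vectors.
CODEBOOKS = {
    "a": [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
    "s": [(1.0, 1.0), (1.0, 2.0)],
}
DISTANCE_TABLES = {ph: build_distance_table(cb) for ph, cb in CODEBOOKS.items()}


def spectral_transition_cost(phoneme, left_code, right_code, weight=1.0):
    """Cost of joining two units whose edge spectra have the given codeword indices."""
    distance = DISTANCE_TABLES[phoneme][left_code][right_code]
    return weight * distance
```

Each candidate then only needs to store the codeword indices of its edges, and the transition cost becomes two index operations instead of a spectral comparison.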

25. A speech synthesizer comprising:
- a. a speech database referencing speech waveforms;
  
  b. a speech waveform selector, in communication with the speech database, that selects waveforms referenced by the database using designators that correspond to a phonetic transcription input; and
  
  c. a speech waveform concatenator, in communication with the speech database, that concatenates waveforms selected by the speech waveform selector to produce a speech signal output, wherein, for at least one ordered sequence of a first waveform and a second waveform, the concatenator selects (i) a location of a trailing edge of the first waveform and (ii) a location of a leading edge of the second waveform, each location being selected so as to produce an optimization of a phase match between the first and second waveforms in regions near the locations, the optimization being determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
- Dependent claims (28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40):
  - 28. A speech synthesizer according to any of claims 25 through 27, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.
  - 29. A speech synthesizer according to any of claims 25 through 27, wherein the optimization is determined on the basis of similarity in shape of the first and second waveforms in the regions near the locations.
  - 30. A speech synthesizer according to claim 29, wherein the optimization is determined using at least one non-rectangular window.
  - 31. A speech synthesizer according to claim 29, wherein the optimization is determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
  - 32. A speech synthesizer according to claim 31, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.
  - 33. A speech synthesizer according to claim 29, wherein similarity is determined using a cross-correlation technique.
  - 34. A speech synthesizer according to claim 33, wherein the optimization is determined using at least one non-rectangular window.
  - 35. A speech synthesizer according to claim 33, wherein the optimization is determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
  - 36. A speech synthesizer according to claim 35, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.
  - 37. A speech synthesizer according to claim 33, wherein the technique is normalized cross correlation.
  - 38. A speech synthesizer according to claim 37, wherein the optimization is determined using at least one non-rectangular window.
  - 39. A speech synthesizer according to claim 37, wherein the optimization is determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
  - 40. A speech synthesizer according to claim 39, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.

26. A speech synthesizer comprising:
- a. a speech database referencing speech waveforms;
  
  b. a speech waveform selector, in communication with the speech database, that selects waveforms referenced by the database using designators that correspond to a phonetic transcription input; and
  
  c. a speech waveform concatenator, in communication with the speech database, that concatenates waveforms selected by the speech waveform selector to produce a speech signal output, wherein, for at least one ordered sequence of a first waveform and a second waveform, the second waveform having a leading edge, the concatenator selects the location of a trailing edge of the first waveform, the location being selected so as to produce an optimization of a phase match between the first and second waveforms in regions near the location and the leading edge, the optimization being determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.

27. A speech synthesizer comprising:
- a. a speech database referencing speech waveforms;
  
  b. a speech waveform selector, in communication with the speech database, that selects waveforms referenced by the database using designators that correspond to a phonetic transcription input; and
  
  c. a speech waveform concatenator, in communication with the speech database, that concatenates waveforms selected by the speech waveform selector to produce a speech signal output, wherein, for at least one ordered sequence of a first waveform and a second waveform, the first waveform having a trailing edge, the concatenator selects the location of a leading edge of the second waveform, the location being selected so as to produce an optimization of a phase match between the first and second waveforms in regions near the location and the trailing edge, the optimization being determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
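
The coarse-to-fine phase matching of claims 25 to 40 can be sketched as follows, using normalized cross-correlation (claim 37) and power-of-2 downsampling factors (claim 28). The sketch is closest to claim 27: the end of the first waveform is fixed as a reference and only the leading edge of the second waveform is searched. Plain decimation without low-pass filtering, the rectangular window, the search radius of two samples, and all function names are simplifying assumptions.

```python
import math


def ncc(x, y):
    """Normalized cross-correlation of two equal-length sequences."""
    num = sum(p * q for p, q in zip(x, y))
    den = math.sqrt(sum(p * p for p in x) * sum(q * q for q in y))
    return num / den if den > 0 else 0.0


def best_offset(ref, sig, offsets, win):
    """Offset o among `offsets` where sig[o:o+win] best matches ref[:win]."""
    valid = [o for o in offsets if o >= 0 and o + win <= len(sig)]
    return max(valid, key=lambda o: ncc(ref[:win], sig[o:o + win]))


def coarse_to_fine_offset(ref, sig, win, factors=(4, 2, 1)):
    """Find the leading-edge offset in `sig`, refining the time resolution.

    The first stage searches every offset at the coarsest resolution; each
    later stage searches only a few samples around the previous estimate.
    """
    center = None  # current estimate, in full-resolution samples
    for f in factors:
        r, s, w = ref[::f], sig[::f], win // f
        if center is None:
            offsets = range(0, len(s) - w + 1)
        else:
            c = center // f
            offsets = range(c - 2, c + 3)
        center = best_offset(r, s, offsets, w) * f
    return center
```

A possible usage, with a synthetic periodic signal standing in for voiced speech:

```python
ref = [math.sin(2 * math.pi * n / 32) for n in range(64)]
sig = [math.sin(2 * math.pi * (n - 11) / 32) for n in range(95)]
offset = coarse_to_fine_offset(ref, sig, win=64)
```

The coarse stage evaluates far fewer and shorter correlations than a full-resolution search, and the later stages touch only five offsets each.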

41. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms and associated symbolic prosodic features, wherein the database is accessed by speech waveform designators, each designator being associated with a sequence of diphones, the sequence having at least one diphone;
  
  b. speech waveform selecting means, in communication with the speech database, for selecting, based, at least in part, on the symbolic prosodic features, waveforms referenced by the database using speech waveform designators that correspond to a phonetic transcription input, and wherein the waveform selecting means attributes, to pairs of adjacent waveform candidates, a transition cost, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined by using, as an argument, an acoustic distance value selected from one of a first set of tables, each table in the first set corresponding to a non-null set of phonemes; and
  
  c. speech waveform concatenating means in communication with the speech database for concatenating the waveforms selected by the speech waveform selecting means to produce a speech signal output.
- Dependent claims (42, 43):
  - 42. A speech synthesizer according to claim 41, wherein the acoustic distance is spectral distance and each table in the first set corresponds to a different phoneme.
  - 43. A speech synthesizer according to claim 41, wherein the first set of tables is the result of vector quantization of spectra.

44. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms, wherein the database is accessed by speech waveform designators;
  
  b. speech waveform selecting means, in communication with the speech database, for selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the criteria include a requirement favoring waveform candidates having pitch within a range determined as a function of high-level linguistic features, and wherein the criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. speech waveform concatenating means in communication with the speech database for concatenating the waveforms selected by the speech waveform selecting means to produce a speech signal output.

45. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms, wherein the database is accessed by speech waveform designators;
  
  b. speech waveform selecting means, in communication with the speech database, for selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the criteria include a requirement favoring waveform candidates having a duration within a range determined as a function of high-level linguistic features, and wherein the criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. speech waveform concatenating means in communication with the speech database for concatenating the waveforms selected by the speech waveform selecting means to produce a speech signal output.

46. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms, wherein the database is accessed by speech waveform designators;
  
  b. speech waveform selecting means, in communication with the speech database, for selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the criteria include a requirement favoring waveform candidates having coarse pitch continuity within a range determined as a function of high-level linguistic features, and wherein the criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. speech waveform concatenating means in communication with the speech database for concatenating the waveforms selected by the speech waveform selecting means to produce a speech signal output.

47. A speech synthesizer comprising:
- a. a large speech database;
  
  b. target generating means for generating a sequence of target feature vectors responsive to a phonetic transcription input;
  
  c. waveform selecting means for selecting a sequence of waveforms referenced by the database, each waveform in the sequence corresponding to a first non-null set of target feature vectors, wherein the waveform selecting means attributes, to any waveform candidate, a node cost, wherein the node cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined using a cost function that varies nontrivially according to a second non-null set of target feature vectors in the sequence; and
  
  d. speech waveform concatenating means in communication with the speech database for concatenating the waveforms selected by the speech waveform selecting means to produce a speech signal output.
- Dependent claims (48, 49, 50):
  - 48. A synthesizer according to claim 47, wherein the first and second sets are identical.
  - 49. A synthesizer according to claim 47, wherein the second set is proximate to the first set in the sequence.
  - 50. A synthesizer according to claim 47, wherein the second set is a function of the first set.

51. A method of speech synthesis comprising:
- a. providing a large speech database referencing speech waveforms and associated symbolic prosodic features, wherein the database is accessed by speech waveform designators, each designator being associated with a sequence of diphones, the sequence having at least one diphone;
  
  b. selecting, based, at least in part, on the symbolic prosodic features, waveforms referenced by the database using speech waveform designators that correspond to a phonetic transcription input, wherein the selecting attributes a transition cost to pairs of adjacent waveform candidates, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined by using, as an argument, an acoustic distance value selected from one of a first set of tables, each table in the first set corresponding to a non-null set of phonemes; and
  
  c. concatenating the selected waveforms to produce a speech signal output.
- Dependent claims (52, 53):
  - 52. A method of speech synthesis according to claim 51, wherein the acoustic distance is spectral distance and each table in the first set corresponds to a different phoneme.
  - 53. A method of speech synthesis according to claim 51, wherein the first set of tables is the result of vector quantization of spectra.

54. A method of speech synthesis comprising:
- a. providing a large speech database referencing speech waveforms;
  
  b. selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the selecting criteria include a requirement favoring waveform candidates having pitch within a range determined as a function of high-level linguistic features, and wherein the selecting criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

55. A method of speech synthesis comprising:
- a. providing a large speech database referencing speech waveforms;
  
  b. selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the selecting criteria include a requirement favoring waveform candidates having a duration within a range determined as a function of high-level linguistic features, and wherein the selecting criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

56. A method of speech synthesis comprising:
- a. providing a large speech database referencing speech waveforms;
  
  b. selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, wherein the selecting criteria include a requirement favoring waveform candidates having coarse pitch continuity within a range determined as a function of high-level linguistic features, and wherein the selecting criteria are implemented by cost functions, and the requirement is implemented using a function having steep sides and a region that approximates a flat bottom; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

57. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. generating a sequence of target feature vectors responsive to a phonetic transcription input;
  
  c. selecting a sequence of waveforms referenced by the database, each waveform in the sequence corresponding to a first non-null set of target feature vectors, wherein the selecting attributes a node cost to any waveform candidate, wherein the node cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined using a cost function that varies nontrivially according to a second non-null set of target feature vectors in the sequence; and
  
  d. concatenating the selected waveforms to produce a speech signal output.
- Dependent claims (58, 59, 60):
  - 58. A synthesizer according to claim 57, wherein the first and second sets are identical.
  - 59. A synthesizer according to claim 57, wherein the second set is proximate to the first set in the sequence.
  - 60. A synthesizer according to claim 57, wherein the second set is a function of the first set.

61. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. generating a sequence of target feature vectors responsive to a phonetic transcription input;
  
  c. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a transition cost to pairs of adjacent waveform candidates, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined using a cost function that varies nontrivially according to the features of a region in the phonetic transcription input that corresponds to adjacent waveform candidates; and
  
  d. concatenating the selected waveforms to produce a speech signal output.
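
Claims 57 and 61 describe a node cost for each waveform candidate and a transition cost for each pair of adjacent candidates, but they do not name a search procedure. The sketch below shows one way such costs could be combined, assuming a dynamic-programming (Viterbi-style) search. The function name `select_units`, the callback signatures, and the position index `i` passed to both callbacks (so that a cost can consult the target context around that position) are hypothetical and are not taken from the patent.

```python
def select_units(candidates, node_cost, transition_cost):
    """Return (total_cost, path) minimizing summed node and transition costs.

    candidates[i] lists the candidate units for target position i.
    node_cost(i, unit) and transition_cost(i, prev_unit, unit) are
    caller-supplied (hypothetical) cost callbacks.
    """
    if not candidates or any(not column for column in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[j] = (cost of the cheapest path ending in candidates[i][j], path)
    best = [(node_cost(0, unit), [unit]) for unit in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for unit in candidates[i]:
            # Cheapest way to reach `unit` from any candidate at position i-1.
            cost, path = min(
                ((prev_cost + transition_cost(i, prev_path[-1], unit), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + node_cost(i, unit), path + [unit]))
        best = new_best
    return min(best, key=lambda item: item[0])
```

The work per position is proportional to the product of the candidate counts at that position and the previous one, so the search stays tractable even when the database offers many candidates per target.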

62. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a cost function that has at least one steep side; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

63. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a cost function that has a plurality of steep sides; and
  
  c. concatenating the selected waveforms to produce a speech signal output.
- View Dependent Claims (64, 65, 66)
- - 64. A method of speech synthesis according to claim 63, wherein the at least one individual cost function is piecewise linear.
  - 65. A method of speech synthesis according to claim 63, wherein the at least one individual cost function is asymmetric.
  - 66. A method of speech synthesis according to claim 63, wherein the at least one individual cost function has a region that approximates a flat bottom.
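
Claims 63 through 66 describe a cost function for a numeric feature with several steep sides, optionally piecewise linear, asymmetric, and with a region approximating a flat bottom. A minimal sketch of a function with all four properties follows. The parameter names and the numbers used in the example are illustrative assumptions, not values from the patent.

```python
def flat_bottom_cost(x, low, high, left_slope, right_slope):
    """Piecewise-linear cost: zero on [low, high], rising linearly outside.

    Unequal left and right slopes make the function asymmetric; large
    slopes give the steep sides.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if x < low:
        return left_slope * (low - x)
    if x > high:
        return right_slope * (x - high)
    return 0.0


# Hypothetical pitch range of 100-140 Hz; overshoot is penalized more
# heavily than undershoot.
inside = flat_bottom_cost(110.0, 100.0, 140.0, 2.0, 5.0)
below = flat_bottom_cost(90.0, 100.0, 140.0, 2.0, 5.0)
above = flat_bottom_cost(150.0, 100.0, 140.0, 2.0, 5.0)
```

Every candidate whose feature value falls inside the flat region costs the same, so the selector does not discriminate among plausible values and only rejects unlikely ones.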

67. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a cost function that has a region that approximates a flat bottom; and
  
  c. concatenating the selected waveforms to produce a speech signal output.
- View Dependent Claims (68, 69)
- - 68. A method of speech synthesis according to claim 67, wherein the at least one individual cost function is piecewise linear.
  - 69. A method of speech synthesis according to claim 67, wherein the at least one individual cost function is asymmetric.

70. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost of a symbolic feature is determined using a non-binary numeric function; and
  
  c. concatenating the selected waveforms to produce a speech signal output.
- View Dependent Claims (71)
- - 71. A method of speech synthesis according to claim 70, wherein the symbolic feature is one of the following:
    - (i) prominence, (ii) stress, (iii) syllable position in the phrase;
      
      (iv) sentence type, (v) boundary type, and (vi) phonetic context.

72. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost of a symbolic feature is determined using a non-binary numeric function determined by recourse to a table; and
  
  c. concatenating the selected waveforms to produce a speech signal output.
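
Claim 72 recites a non-binary numeric cost for a symbolic feature that is obtained from a table. The sketch below assumes a three-valued stress feature and a nested dictionary as the table. The feature values and every number in `STRESS_COST` are hypothetical.

```python
# Hypothetical mismatch table: STRESS_COST[target][candidate].
# The entries are graded rather than 0/1, and the table need not be symmetric.
STRESS_COST = {
    "primary":   {"primary": 0.0, "secondary": 0.3, "none": 1.0},
    "secondary": {"primary": 0.4, "secondary": 0.0, "none": 0.5},
    "none":      {"primary": 1.0, "secondary": 0.6, "none": 0.0},
}


def symbolic_cost(table, target_value, candidate_value):
    """Look up the cost of using candidate_value where target_value is wanted."""
    return table[target_value][candidate_value]
```

A graded table lets a near miss (secondary stress where primary was wanted) cost less than a complete mismatch, which a binary match/no-match test cannot express.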

73. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost of a symbolic feature is determined using a non-binary numeric function determined by recourse to a set of rules; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

74. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. generating a sequence of target feature vectors responsive to a phonetic transcription input;
  
  c. selecting a sequence of waveforms referenced by the database, each waveform in the sequence corresponding to a first non-null set of target feature vectors, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of weighted individual costs associated with each of a plurality of features, and wherein the weight associated with at least one of the individual costs varies nontrivially according to a second non-null set of target feature vectors in the sequence, such target features including at least one feature other than target phoneme identity; and
  
  d. concatenating the selected waveforms to produce a speech signal output.
- View Dependent Claims (75, 76, 77)
- - 75. A method of speech synthesis according to claim 74, wherein the first and second sets are identical.
  - 76. A method of speech synthesis according to claim 74, wherein the second set is proximate to the first set in the sequence.
  - 77. A method of speech synthesis according to claim 74, wherein the second set is a function of the first set.
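
Claims 74 through 77 describe weights on individual costs that vary with target features other than phoneme identity. The sketch below assumes the case of claim 75, where the weight depends on the same target whose candidates are being scored. The `Target` fields, the weighting rule, and the multipliers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    phoneme: str
    stressed: bool
    phrase_final: bool


def pitch_weight(target):
    """Hypothetical rule: pitch matters more on stressed or phrase-final targets."""
    weight = 1.0
    if target.stressed:
        weight *= 2.0
    if target.phrase_final:
        weight *= 1.5
    return weight


def weighted_cost(target, individual_costs):
    """Sum the individual costs, weighting pitch by target context."""
    weights = {"pitch": pitch_weight(target)}
    return sum(weights.get(name, 1.0) * cost
               for name, cost in individual_costs.items())
```

With the same individual costs, a stressed target yields a higher total than an unstressed one, so pitch mismatches are tolerated less where they are more audible.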

78. A method of speech synthesis comprising:
- a. providing a speech database referencing speech waveforms;
  
  b. selecting waveforms referenced by the database using designators that correspond to a phonetic transcription input; and
  
  c. concatenating the selected waveforms to produce a speech signal output, wherein, for at least one ordered sequence of a first waveform and a second waveform, the concatenating selects (i) a location of a trailing edge of the first waveform and (ii) a location of a leading edge of the second waveform, each location being selected so as to produce an optimization of a phase match between the first and second waveforms in regions near the locations, the optimization being determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
- View Dependent Claims (81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93)
- - 81. A method of speech synthesis according to any of claims 78 through 80, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.
  - 82. A method of speech synthesis according to any of claims 78 through 80, wherein the optimization is determined on the basis of similarity in shape of the first and second waveforms in the regions near the locations.
  - 83. A method of speech synthesis according to claim 82, wherein the optimization is determined using at least one non-rectangular window.
  - 84. A method of speech synthesis according to claim 82, wherein the optimization is determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
  - 85. A method of speech synthesis according to claim 84, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.
  - 86. A method of speech synthesis according to claim 82, wherein similarity is determined using a cross-correlation technique.
  - 87. A method of speech synthesis according to claim 86, wherein the optimization is determined using at least one non-rectangular window.
  - 88. A method of speech synthesis according to claim 86, wherein the optimization is determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
  - 89. A method of speech synthesis according to claim 88, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.
  - 90. A method of speech synthesis according to claim 86, wherein the technique is normalized cross correlation.
  - 91. A method of speech synthesis according to claim 90, wherein the optimization is determined using at least one non-rectangular window.
  - 92. A method of speech synthesis according to claim 90, wherein the optimization is determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
  - 93. A method of speech synthesis according to claim 92, wherein the time resolution associated with the first and second waveforms in an initial one of the stages is downsampled by a factor that is a power of 2.

79. A method of speech synthesis comprising:
- a. providing a speech database referencing speech waveforms;
  
  b. selecting waveforms referenced by the database using designators that correspond to a phonetic transcription input; and
  
  c. concatenating the selected waveforms to produce a speech signal output, wherein, for at least one ordered sequence of a first waveform and a second waveform, the second waveform having a leading edge, the concatenating selects the location of a trailing edge of the first waveform, the location being selected so as to produce an optimization of a phase match between the first and second waveforms in regions near the location and the leading edge, the optimization being determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.

80. A method of speech synthesis comprising:
- a. providing a speech database referencing speech waveforms;
  
  b. selecting waveforms referenced by the database using designators that correspond to a phonetic transcription input; and
  
  c. concatenating the selected waveforms to produce a speech signal output, wherein, for at least one ordered sequence of a first waveform and a second waveform, the first waveform having a trailing edge, the concatenating selects the location of a leading edge of the second waveform, the location being selected so as to produce an optimization of a phase match between the first and second waveforms in regions near the location and the trailing edge, the optimization being determined in a plurality of successive stages in which time resolution associated with the first and second waveforms is made successively finer.
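
Claims 78 through 93 describe choosing join locations by optimizing a phase match in successive stages of increasing time resolution, optionally starting from a power-of-2 downsampling and using normalized cross-correlation. The sketch below assumes the variant of claim 80 (fixed trailing edge, searched leading edge), two stages, and a rectangular window. The function names are hypothetical, and the plain decimation without an anti-aliasing filter is a simplification for brevity.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0


def best_offset(ref, sig, offsets):
    """Offset into sig whose window of len(ref) samples best matches ref."""
    n = len(ref)
    valid = [k for k in offsets if 0 <= k <= len(sig) - n]
    return max(valid, key=lambda k: ncc(ref, sig[k:k + n]))


def coarse_to_fine_offset(ref, sig, factor=4):
    """Two-stage search: decimated by `factor` first, then at full rate.

    Assumes len(sig) >= len(ref) and that `factor` is a power of 2.
    """
    coarse_ref, coarse_sig = ref[::factor], sig[::factor]
    coarse = best_offset(coarse_ref, coarse_sig,
                         range(len(coarse_sig) - len(coarse_ref) + 1))
    centre = coarse * factor
    # Refine only in the neighbourhood of the coarse estimate.
    return best_offset(ref, sig, range(centre - factor, centre + factor + 1))
```

The coarse stage evaluates roughly `1/factor` as many offsets on windows `1/factor` as long, and the fine stage examines only `2 * factor + 1` offsets, which is where the saving over an exhaustive full-rate search comes from.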

94. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a speech waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a piecewise linear cost function that has at least one steep side; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.

95. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a speech waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using an asymmetric cost function that has at least one steep side; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.

96. A speech synthesizer comprising:
- a. a large speech database;
  
  b. a speech waveform selector that selects a sequence of waveforms referenced by the database, wherein the waveform selector attributes, to any waveform candidate, a cost, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a cost function that has at least one steep side and a region that approximates a flat bottom; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.

97. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a piecewise linear cost function that has at least one steep side; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

98. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using an asymmetric cost function that has at least one steep side; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

99. A method of speech synthesis comprising:
- a. providing a large speech database;
  
  b. selecting a sequence of waveforms referenced by the database, wherein the selecting attributes a cost to any waveform candidate, wherein the cost is a function of individual costs associated with each of a plurality of features, and wherein, for at least one numeric feature, an individual cost is determined using a cost function that has at least one steep side and a region that approximates a flat bottom; and
  
  c. concatenating the selected waveforms to produce a speech signal output.

100. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms;
  
  b. a speech waveform selector in communication with the speech database that selects waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, and wherein the waveform selector attributes, to pairs of adjacent waveform candidates, a transition cost, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined by using, as an argument, an acoustic distance value selected from one of a first set of tables, each table in the first set corresponding to a non-null set of phonemes; and
  
  c. a speech waveform concatenator in communication with the speech database that concatenates the waveforms selected by the speech waveform selector to produce a speech signal output.
- View Dependent Claims (102)
- - 102. A speech synthesizer according to claim 100, wherein the first set of tables is the result of vector quantization of spectra.

101. A speech synthesizer according to claim 100, wherein the acoustic distance is spectral distance and each table in the first set corresponds to a different phoneme.
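
Claims 100 through 102 describe a transition cost whose argument is an acoustic distance read from a table, with one table per phoneme set, the tables resulting from vector quantization of spectra. The sketch below assumes one table per phoneme, Euclidean distance between codebook vectors, and tiny two-dimensional codebooks. The codebook contents and all names are hypothetical.

```python
import math


def build_distance_table(codebook):
    """Precompute Euclidean distances between all pairs of codebook vectors."""
    return [[math.dist(a, b) for b in codebook] for a in codebook]


# Hypothetical two-dimensional "spectral" codebooks, one per phoneme.
CODEBOOKS = {
    "a": [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
    "s": [(1.0, 1.0), (1.0, 2.0)],
}
TABLES = {ph: build_distance_table(cb) for ph, cb in CODEBOOKS.items()}


def spectral_distance(phoneme, left_code, right_code):
    """Table lookup replacing a run-time spectral comparison at a join."""
    return TABLES[phoneme][left_code][right_code]
```

Storing only a codebook index per unit edge and precomputing the pairwise distances turns each spectral comparison during selection into a single table lookup.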

103. A speech synthesizer comprising:
- a. a large speech database referencing speech waveforms, wherein the database is accessed by speech waveform designators;
  
  b. speech waveform selecting means, in communication with the speech database, for selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, and wherein the waveform selector attributes, to pairs of adjacent waveform candidates, a transition cost, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined by using, as an argument, an acoustic distance value selected from one of a first set of tables, each table in the first set corresponding to a non-null set of phonemes; and
  
  c. speech waveform concatenating means in communication with the speech database for concatenating the waveforms selected by the speech waveform selecting means to produce a speech signal output.
- View Dependent Claims (104, 105)
- - 104. A speech synthesizer according to claim 103, wherein the acoustic distance is spectral distance and each table in the first set corresponds to a different phoneme.
  - 105. A speech synthesizer according to claim 103, wherein the first set of tables is the result of vector quantization of spectra.

106. A method of speech synthesis comprising:
- a. providing a large speech database referencing speech waveforms;
  
  b. selecting waveforms referenced by the database using criteria that (i) favor waveform candidates based, at least in part, directly on high-level linguistic features, and (ii) favor approximately equally all waveform candidates in respect to low-level prosody features except those wherein the low-level prosody features are unlikely, and wherein the selecting attributes a transition cost to any waveform candidate, wherein the transition cost is a function of individual costs associated with each of a plurality of features, and wherein at least one individual cost is determined by using, as an argument, an acoustic distance value selected from one of a first set of tables, each table in the first set corresponding to a non-null set of phonemes; and
  
  c. concatenating the selected waveforms to produce a speech signal output.
- View Dependent Claims (107, 108)
- - 107. A method of speech synthesis according to claim 106, wherein the acoustic distance is spectral distance and each table in the first set corresponds to a different phoneme.
  - 108. A method of speech synthesis according to claim 106, wherein the first set of tables is the result of vector quantization of spectra.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: ScanSoft, Inc. n/k/a Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Deprez, Filip; DeMoortel, Jan; Schenk, Andre; Fackrell, Justin; Leys, Steven; De Bock, Mario; Rutten, Peter; Coorman, Geert; Coile, Bert Van
Primary Examiner(s): ABEBE, DANIEL DEMELASH
Application Number: US09/438,603
Time in Patent Office: 1,495 Days
Field of Search: 704/258, 704/260, 704/259, 704/263, 704/243, 704/245, 704/254
US Class Current: 704/260
CPC Class Codes:
G10L 13/06 Elementary speech units use...
G10L 13/07 Concatenation rules
