Corpus-based speech synthesis based on segment recombination

US 20050182629A1
Filed: 01/18/2005
Published: 08/18/2005
Est. Priority Date: 01/16/2004
Status: Active Grant



Abstract

A system and method generate synthesized speech by concatenating speech segments derived from a large, prosodically rich corpus of speech segments, with the additional use of a dictionary of speech segment identifier sequences.


74 Claims

1. A speech synthesis system for producing synthesized speech comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a segmental transcription database referencing segmental transcriptions associated with sequences of one or more segment designators and accessed by message designators, each message designator being associated with a fixed message;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of a sequence of segment designators corresponding to a segmental transcription generated responsive to a message designator input; and
  
  a speech segment concatenator in communication with the large speech segment database for concatenating the sequence of speech segments selected by the speech segment selector to produce a speech signal output corresponding to the message designator input.
- - 2. A speech synthesis system according to claim 1, in which the segment designators are selected from the group including (i) diphone designators, (ii) demi-phone designators, (iii) phone designators, (iv) triphone designators, (v) demi-syllable designators, and (vi) syllable designators.
  - 3. A speech synthesis system according to claim 1, in which the speech segment concatenator concatenates the sequence of speech segments without altering their prosody.
  - 4. A speech synthesis system according to claim 1, in which the speech segment concatenator smoothes energy at concatenation boundaries of the speech segments when concatenating the sequence of speech segments.
  - 5. A speech synthesis system according to claim 1, in which the speech segment concatenator smoothes pitch at concatenation boundaries of the speech segments when concatenating the sequence of speech segments.
  - 6. A speech synthesis system according to claim 1, in which the speech segment selector is tunable and alternative speech segments can be selected by a user for the selected sequence of speech segments.
  - 7. A speech synthesis system according to claim 1, in which the segment selector is trained on a given segment transcriptor database and alternative speech segments can be selected by a user for the selected sequence of speech segments.
  - 8. A speech synthesis system according to claim 1, adapted for use in a talking dictionary application.
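
The data flow of claim 1 can be illustrated with a minimal sketch. It assumes in-memory dictionaries in place of the two databases, and every name and value in it (`SEGMENT_DB`, `TRANSCRIPTION_DB`, `MSG_HELLO`, the sample numbers) is hypothetical rather than taken from the patent. Segments are joined without prosody modification, as in claim 3.

```python
from typing import Dict, List

# Hypothetical speech segment database: segment designator -> waveform samples.
SEGMENT_DB: Dict[str, List[float]] = {
    "h-e": [0.1, 0.2],
    "e-l": [0.3],
    "l-o": [0.4, 0.5],
}

# Hypothetical segmental transcription database:
# message designator -> stored sequence of segment designators.
TRANSCRIPTION_DB: Dict[str, List[str]] = {
    "MSG_HELLO": ["h-e", "e-l", "l-o"],
}


def select_segments(message_id: str) -> List[List[float]]:
    """Look up the stored transcription and fetch each referenced segment."""
    return [SEGMENT_DB[designator] for designator in TRANSCRIPTION_DB[message_id]]


def concatenate(segments: List[List[float]]) -> List[float]:
    """Join the segments end to end without altering them."""
    signal: List[float] = []
    for segment in segments:
        signal.extend(segment)
    return signal


def synthesize(message_id: str) -> List[float]:
    """Produce the output signal for a fixed message designator."""
    return concatenate(select_segments(message_id))
```

Because the transcription is stored ahead of time, no unit search runs at synthesis time; the selector reduces to a lookup.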

9. A speech synthesis system for producing synthesized speech from input text and from input message designators, the system comprising:
- first and second large speech segment databases referencing speech segments and accessed by segment designators, each speech segment designator being associated with a sequence of one or more speech segments;
  
  a segmental transcription database referencing segmental transcriptions associated with sequences of one or more segment designators of the first large speech segment database and accessed by message designators, each message designator being associated with a fixed message;
  
  a text message database referencing text messages that correspond to orthographic representations of the segmental transcriptions referenced by the segmental transcription database;
  
  a first speech segment selector for selecting a sequence of speech segments referenced by the first large speech segment database and representative of a sequence of segment designators corresponding to a segmental transcription generated responsive to a message designator input;
  
  a text analyzer for converting an input text into a representative sequence of symbolic segment identifiers;
  
  a second speech segment selector for selecting, based at least in part on prosodic and acoustic features, a sequence of speech segments from the second large speech segment database and representative of a sequence of symbolic identifiers generated responsive to a text input;
  
  a message decoder for activating the first speech segment selector if a text input corresponds to a text message referenced by the text message database, or the second speech segment selector if a text input does not correspond to a message from the text message database; and
  
  a speech segment concatenator in communication with the first and second large speech segment databases for concatenating the sequence of speech segments designated by a segmental transcription from the segmental transcription database to produce a speech signal output.
- - 10. A speech synthesis system according to claim 9, in which the first and second large speech segment databases are the same.
  - 11. A speech synthesis system according to claim 9, in which the first large speech segment database is a subset of the second large speech segment database.
  - 12. A speech synthesis system according to claim 9, in which the first and second large speech segment databases are disjoint.
  - 13. A speech synthesis system according to claim 9, wherein the first and second large speech segment databases are in different locations and an output data stream of segment transcriptions, speech transformation descriptors, and control codes from one location to the other allows distributed speech synthesis.
  - 14. A speech synthesis system according to claim 9 adapted for use in a talking dictionary application.
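
The routing performed by the message decoder of claim 9 can be sketched as follows. This is an illustration under assumptions: the two selectors are passed in as plain callables, the text message database is a dictionary from text to message designator, and the name `make_message_decoder` is hypothetical.

```python
from typing import Callable, Dict, List

Selector = Callable[[str], List[str]]


def make_message_decoder(
    text_messages: Dict[str, str],
    fixed_selector: Selector,
    general_selector: Selector,
) -> Selector:
    """Build a decoder that routes an input text to one of two selectors.

    text_messages maps a known text to its message designator.
    """

    def decode(text: str) -> List[str]:
        message_id = text_messages.get(text)
        if message_id is not None:
            # Known fixed message: use the stored segmental transcription.
            return fixed_selector(message_id)
        # Unknown text: fall back to general unit selection.
        return general_selector(text)

    return decode
```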

15. A system to create compound speech units from an input text comprising:
- a speech segment database referencing speech waveform segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the speech segment database and representative of an input text; and
  
  a speech segment sequence validator for validating the selected sequence of speech segments; and
  
  a linguistic feature vector extractor for extracting linguistic feature vectors from the validated sequence of speech segments; and
  
  a segment descriptor generator for linking an extracted linguistic feature vector to a speech waveform segment from the speech segment database.
- - 16. A system according to claim 15, wherein the validated synthesized speech comes from a dataset of synthesized messages classified according to one or more perceptual distance measurements.
  - 17. A speech segment database enhancing system to increase feature variation comprising:
    - a system according to claim 15 to generate compound speech units from a text corpus; and
      
      a database engine for creating a database of compound speech units.
  - 18. A speech segment database enhancing system according to claim 17, wherein a single set of acoustic features is stored for each speech waveform segment referenced by the speech segment database and wherein at least one speech waveform segment has two or more associated linguistic feature vectors.

19. A speech synthesis system for producing synthesized speech from input text comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a basic speech unit descriptor database including linguistic feature vectors descriptive of individual speech segments referenced by the speech segment database;
  
  a compound speech unit database including linguistic feature vectors descriptive of speech segments referenced by the speech segment database, at least one speech segment from the speech segment database has two or more linguistic feature vectors as linguistic descriptors;
  
  a speech segment selector for selecting, based on a reduced set of features and cost functions, a sequence of speech segments referenced by the speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.
- - 20. A first speech synthesis system according to claim 19, wherein the speech segment selector is adapted to imitate the unit selection behavior of a second more complex speech synthesis system based on at least one of a richer feature set and more complex cost functions, by integrating into the compound speech unit database of the first synthesis system data derived from the output of the second more complex speech synthesis system.
  - 21. A speech synthesis system according to claim 20, wherein the compound speech unit database includes linguistic feature vectors from compound speech units derived from synthesized speech validated by an algorithm of perceptual measures.
  - 22. A speech synthesis system according to claim 21, wherein the validation takes into account as side products from the speech segment selector at least one cost selected from the group of a normalized path cost, a peak cost, and a cost distribution along a best path.

23. A method for training a corpus-based speech synthesizer comprising:
- feeding at least one text corpus to the corpus-based speech synthesizer to produce synthesized speech; and
  
  validating speech synthesis data based on at least one of listening experiments and automatic perceptual distance measures; and
  
  augmenting a compound speech unit database with compound speech units derived from the validated speech synthesis data.

24. A method for minimizing the size of a speech segment database comprising:
- determining acoustically redundant speech segments in the speech segment database; and
  
  removing acoustically redundant speech segments that have the same linguistic feature vector, and replacing the acoustically redundant speech segments and their descriptors in the speech segment database with compound speech unit representations and their descriptors.
- - 25. A method according to claim 24, wherein the redundancy is determined by means of acoustical clustering techniques, where speech segment clusters are represented by a smaller set of representative speech segments.
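
One way to picture the redundancy removal of claims 24 and 25 is a greedy "leader" clustering, used here only as a simple stand-in for whatever clustering technique an implementation would choose. The tuple layout, the Euclidean distance, and the name `prune_redundant` are all assumptions made for this sketch.

```python
import math
from typing import Dict, List, Tuple

# (segment id, linguistic feature vector, acoustic feature vector)
Segment = Tuple[str, Tuple[str, ...], Tuple[float, ...]]


def prune_redundant(
    segments: List[Segment], threshold: float
) -> Tuple[List[Segment], Dict[str, str]]:
    """Keep one representative per group of acoustically close segments.

    Two segments are merged only if they share the same linguistic feature
    vector and their acoustic vectors lie within `threshold` of each other.
    Returns the kept segments and a map from removed id to representative id.
    """
    kept: List[Segment] = []
    replaced_by: Dict[str, str] = {}
    for seg_id, linguistic, acoustic in segments:
        for rep_id, rep_linguistic, rep_acoustic in kept:
            if (rep_linguistic == linguistic
                    and math.dist(rep_acoustic, acoustic) <= threshold):
                replaced_by[seg_id] = rep_id
                break
        else:
            # No close representative found: this segment starts a new cluster.
            kept.append((seg_id, linguistic, acoustic))
    return kept, replaced_by
```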

26. A speech synthesis system for producing more than one alternative of synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments; and
  
  a set of two or more speech segment selectors selecting two or more sequences of speech segments referenced by the large speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating one of the selected sequences of speech segments to produce a speech signal output corresponding to the input text.
- - 27. A speech synthesis system according to claim 26, wherein each unit selector uses a different set of weights.
  - 28. A speech synthesis system according to claim 26, wherein each unit selector uses different cost functions.
  - 29. A speech synthesis system according to claim 26, wherein each unit selector uses a different set of weights and cost functions.
  - 30. A speech synthesis system according to claim 26, wherein only one alternative segment sequence is selected from a number of alternatives based upon an automatic measure.
  - 31. A speech synthesis system according to claim 30, wherein the automatic measure is based on a classifier which is trained on data generated by validating numerous synthesis results.
  - 32. A speech synthesis system according to claim 31, wherein the classifier is implemented as a CART.
  - 33. A speech synthesis system according to claim 32, wherein the decision tree uses the output of one or more cost functions and statistics of different cost components along the selected path in the DP grid as input parameters.
  - 34. A speech synthesis system according to claim 30, wherein the selecting in at least one of the speech segment selectors is based at least in part on introduction of stochastic variation on at least one of an individual cost function and a masking function associated to a cost.
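
The idea of running several selectors with different weights (claims 26 and 27) can be sketched with a small dynamic-programming search. The sketch assumes each candidate unit is described by a single number (for instance a pitch value), a target cost of absolute distance to a desired value, and a join cost of absolute distance between neighbours. These cost functions and the names `viterbi_select` and `alternatives` are illustrative assumptions, not the patent's.

```python
from typing import List, Sequence, Tuple


def viterbi_select(
    candidates: Sequence[Sequence[float]],
    targets: Sequence[float],
    w_target: float,
    w_join: float,
) -> Tuple[List[float], float]:
    """Return the lowest-cost path through the candidate grid and its cost."""
    # best holds one (accumulated cost, path) entry per candidate at the
    # current position.
    best = [(w_target * abs(c - targets[0]), [c]) for c in candidates[0]]
    for i in range(1, len(candidates)):
        step = []
        for c in candidates[i]:
            # Cheapest way to arrive at candidate c from the previous position.
            cost, path = min(
                ((pc + w_join * abs(c - pp[-1]), pp) for pc, pp in best),
                key=lambda entry: entry[0],
            )
            step.append((cost + w_target * abs(c - targets[i]), path + [c]))
        best = step
    cost, path = min(best, key=lambda entry: entry[0])
    return path, cost


def alternatives(
    candidates: Sequence[Sequence[float]],
    targets: Sequence[float],
    weight_sets: Sequence[Tuple[float, float]],
) -> List[Tuple[List[float], float]]:
    """Run one selector per (target weight, join weight) pair."""
    return [viterbi_select(candidates, targets, wt, wj) for wt, wj in weight_sets]
```

With candidates `[[100.0, 120.0], [100.0, 130.0]]` and targets `[120.0, 130.0]`, a selector weighted only on target cost picks `[120.0, 130.0]`, while one weighted only on join cost picks `[100.0, 100.0]`, giving two distinct alternatives for a later automatic measure to choose between.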

35. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text, the selecting being based at least in part on introduction of stochastic variation on at least one of an individual cost function and a masking function associated to a cost; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.
- - 36. A speech synthesis system according to claim 35, wherein the stochastic variation is relatively small with respect to the complete dynamic behavior of the cost function.
  - 37. A speech synthesis system according to claim 35, wherein the stochastic variation is implemented as at least one of an additive noise component and a multiplicative noise component.
  - 38. A speech synthesis system according to claim 35, wherein at least one cost function is implemented as a steerable noise generator having a probability density function reflecting the average cost and an allowed variation.
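
A minimal sketch of the stochastic variation in claims 35 to 37 follows. It wraps an arbitrary cost function with an additive and a multiplicative noise component. The choice of Gaussian noise, the parameter names, and the wrapper `noisy_cost` are assumptions for illustration; the claims do not fix a noise distribution.

```python
import random
from typing import Callable, Optional


def noisy_cost(
    cost_fn: Callable[..., float],
    additive_sigma: float = 0.0,
    multiplicative_sigma: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Callable[..., float]:
    """Wrap cost_fn so each evaluation is perturbed by small random noise.

    Keeping both sigmas small relative to the range of cost_fn matches the
    requirement of claim 36 that the variation stay relatively small.
    """
    generator = rng if rng is not None else random.Random()

    def wrapped(*args: float) -> float:
        base = cost_fn(*args)
        scale = 1.0 + generator.gauss(0.0, multiplicative_sigma)
        return base * scale + generator.gauss(0.0, additive_sigma)

    return wrapped
```

Repeated searches with such a cost can then yield different segment sequences for the same input text.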

39. A self tuning speech segment selector for producing speech segment sequences from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text, the selecting being based at least in part on iterative searching, where at each iteration step at least one of unit selector weights and cost functions are adjusted.

40. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text, the selecting being based at least in part on iterative searching, where at each iteration step at least one of unit selector weights and cost functions are adjusted; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.
- - 41. A speech synthesis system according to claim 40, wherein the iterative searching is based on closed loop iterative reducing of transition cost weights so as to not exceed a maximum threshold for inter-segment discontinuity for a given feature.
  - 42. A speech synthesis system according to claim 40, wherein the iterative searching is based on closed loop iterative reducing of transition cost weights so as to reach without exceeding a maximum threshold for average inter-segment discontinuity for a given feature.
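
The closed-loop adjustment of claims 41 and 42 can be sketched as a loop that keeps lowering the transition cost weight while the measured inter-segment discontinuity stays within the threshold. The `search` callable, the halving factor, and the name `tune_transition_weight` are hypothetical; a real system would run its unit search inside `search`.

```python
from typing import Callable, Optional, Tuple


def tune_transition_weight(
    search: Callable[[float], Tuple[str, float]],
    max_discontinuity: float,
    start_weight: float = 1.0,
    factor: float = 0.5,
    max_iters: int = 10,
) -> Optional[Tuple[float, str]]:
    """Lower the transition weight until the discontinuity limit is reached.

    search(weight) runs one selection pass and returns (sequence, measured
    discontinuity). The result is the last (weight, sequence) pair that did
    not exceed the limit, or None if even the starting weight exceeds it.
    """
    weight = start_weight
    accepted: Optional[Tuple[float, str]] = None
    for _ in range(max_iters):
        sequence, discontinuity = search(weight)
        if discontinuity > max_discontinuity:
            break  # Going lower would violate the threshold.
        accepted = (weight, sequence)
        weight *= factor
    return accepted
```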

43. A speech synthesis system for producing synthesized speech from input text comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting among candidate sequences of speech segments referenced by the speech segment database and representative of an input text, the selecting being based on evaluating by a cost obtained through dynamic time warping of the spectral representation of the candidate sequences with the spectral representation of one or more recorded speech signals; and
  
  a speech segment concatenator, in communication with the speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.
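
The dynamic time warping cost named in claim 43 can be computed with the textbook recurrence shown below. The sketch assumes each spectral representation is a sequence of equal-length feature vectors compared by Euclidean distance; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Frames = Sequence[Sequence[float]]


def dtw_cost(a: Frames, b: Frames) -> float:
    """Accumulated distance of the best monotonic alignment of a and b."""
    n, m = len(a), len(b)
    # d[i][j] is the cost of aligning the first i frames of a with the
    # first j frames of b.
    d: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def pick_candidate(candidates: Sequence[Frames], reference: Frames) -> Frames:
    """Choose the candidate sequence closest to a recorded reference."""
    return min(candidates, key=lambda c: dtw_cost(c, reference))
```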

44. A speech synthesis system for producing synthesized speech from input text comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting among candidate sequences of speech segments referenced by the speech segment database and representative of an input text, the selecting including use of a composition table containing pairs of segment designators to minimize adjacency feature mismatch effects; and
  
  a speech segment concatenator, in communication with the speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.

45. A speech synthesis system for producing synthesized speech from input text comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a user dictionary of compound speech units referenced by the speech segment database and accessed by phoneme sequences;
  
  a speech segment selector for selecting among candidate sequences of speech segments referenced by the speech segment database and representative of an input text, the selecting including use of compound speech units from the user dictionary; and
  
  a speech segment concatenator, in communication with the speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.
- - 46. A speech synthesis system according to claim 45, wherein grapheme sequences are used instead of phoneme sequences.

47. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a carrier database containing carriers for a carrier and slot speech synthesis application, each carrier represented as a sequence of segment descriptors; and
  
  a speech carrier selector for selecting the carrier from the carrier database;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of a slot argument in a carrier and slot speech synthesis message; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments with the carrier portion of a carrier and slot speech synthesis message to produce a speech signal output corresponding to the carrier and slot speech synthesis message.

48. A restricted domain speech synthesis system for producing synthesized speech from a restricted domain input comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments; and
  
  a segment sequence database containing sequences of speech segment designators;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database from the segment sequence database; and
  
  a speech segment concatenator, in communication with the large speech segment database and the segment sequence database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the restricted domain input.
- - 49. A restricted domain speech synthesis system according to claim 48, wherein the large speech segment database and the segment sequence database are constructed by means of a validation process.

50. A segment database construction system for corpus based speech synthesis comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a set of two or more speech segment selectors selecting two or more sequences of speech segments referenced by the large speech segment database and representative of an input text;
  
  a speech segment concatenator, in communication with the speech segment database, for concatenating one of the selected sequences of speech segments to produce a speech signal output corresponding to the input text; and
  
  an automatic segment sequence validator that automatically selects between the outputs of the different speech segment selectors.
- - 51. A segment database construction system according to claim 50 for corpus based speech synthesis wherein the speech segment selectors use at least one of a different set of weights and cost functions to select a sequence of speech segments.

52. A segment database construction system for corpus based speech synthesis comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector using introduction of stochastic variation on at least one of an individual cost function and a masking function to select a sequence of speech segments; and
  
  a speech segment concatenator, in communication with the speech segment database, for concatenating one of the selected sequence of speech segments to produce a speech signal output corresponding to the input text.

53. A segment database construction system for corpus based speech synthesis comprising:
- a speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for generating an N-best list of speech segment sequences;
  
  a speech segment concatenator, in communication with the speech segment database, for concatenating one of the selected sequence of speech segments to produce a speech signal output corresponding to a synthesis input; and
  
  an automatic speech segment sequence validator that automatically selects a speech segment sequence from the N-best list.
- - 54. A restricted domain speech synthesis system according to claim 53, wherein the speech segment selector selects a sequence of speech segments without use of linguistic processing.
  - 55. A restricted domain speech synthesis system according to claim 53, wherein the input is a segmental transcription.
  - 56. A restricted domain speech synthesis system according to claim 53, wherein the segment designators are diphone identifiers arranged in convex partitions, each partition representing a set of diphone identifiers corresponding to diphones that begin with the same phoneme.
  - 57. A restricted domain speech synthesis system according to claim 53, wherein run-length encoding is used to represent consecutive segment designators.
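
The run-length encoding mentioned in claim 57 can be sketched as follows. The sketch assumes segment designators are integers and that "consecutive" means each designator is one greater than the previous, as would be the case for units that were adjacent in the recorded corpus. That reading, and the function names, are assumptions.

```python
from typing import List, Sequence, Tuple


def rle_encode(designators: Sequence[int]) -> List[Tuple[int, int]]:
    """Collapse runs of consecutive designators into (start, length) pairs."""
    runs: List[Tuple[int, int]] = []
    for designator in designators:
        if runs and runs[-1][0] + runs[-1][1] == designator:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # Extend the current run.
        else:
            runs.append((designator, 1))    # Start a new run.
    return runs


def rle_decode(runs: Sequence[Tuple[int, int]]) -> List[int]:
    """Expand (start, length) pairs back into the designator sequence."""
    return [start + k for start, length in runs for k in range(length)]
```

A stored sequence such as `[10, 11, 12, 40, 41, 7]` then occupies three pairs instead of six designators.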

58. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text;
  
  wherein compound speech units are used to increase the match between a grapheme-to-phoneme conversion of the input text and the segment designators.

59. A method for speech synthesis comprising:
- using speech synthesis to create a sequence of segment designators referencing speech segments in a database that are representative of an input text;
  
  validating the sequence of segment designators for synthesis quality; and
  
  storing the sequence of validated segment designators for use by an application in synthesizing speech corresponding to the input text.
- - 60. A method of speech synthesis according to claim 59, wherein the application uses the same database as the speech synthesis uses.

61. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text;
  
  wherein the database includes at least one spectral segment that is linked to a plurality of stored trajectories for at least one of pitch, energy, and rate so as to generate from the spectral segment more than one speech segment during synthesis.
- - 62. A speech synthesis system according to claim 61, wherein a plurality of prosodic trajectories are generated by constructing a time mapping function through dynamic time warping of a speech segment spectrum to the spectrum of the corresponding spectrally redundant speech segments.
  - 63. A speech synthesis system according to claim 62, wherein the time mapping function is efficiently represented by a repeat vector.
  - 64. A speech synthesis system according to claim 63, wherein the repeat vector is constructed relative to the variable frame rate compressed frames.
  - 65. A speech synthesis system according to claim 62, wherein the time mapping function is represented differentially.
  - 66. A speech synthesis system according to claim 62, wherein the pitch track is represented as a piece-wise linear representation.

67. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments, where at least one speech segment includes spectral parameters which are represented differentially with respect to at least one other speech segment having a full spectral representation;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.

68. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments, where spectral representation of each speech segment uses variable frame rate compression;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.

69. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments, where coding of the speech segments approximates the variation of the prosody parameters over time by piece-wise linear functions that are stored as breakpoint-slope pairs;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text.
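
The piece-wise linear coding of claim 69 can be illustrated by a small decoder. It assumes the stored form is an initial value plus a sorted list of (breakpoint, slope) pairs, where each slope applies from its breakpoint up to the next one. The claim only states that breakpoint-slope pairs are stored, so the initial value and the name `evaluate` are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def evaluate(
    start_value: float,
    pairs: Sequence[Tuple[float, float]],
    t: float,
) -> float:
    """Reconstruct the parameter value at time t from breakpoint-slope pairs."""
    value = start_value
    for k, (breakpoint, slope) in enumerate(pairs):
        if t <= breakpoint:
            break  # Later linear pieces have not started yet.
        # A piece ends at the next breakpoint, or never for the last piece.
        end = pairs[k + 1][0] if k + 1 < len(pairs) else float("inf")
        value += slope * (min(t, end) - breakpoint)
    return value
```

For example, a pitch track starting at 100 that rises by 5 per frame for two frames and then falls by 2.5 per frame is stored as `100.0` and `[(0, 5.0), (2, -2.5)]`, however many frames it spans.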

70. A method for speech synthesis comprising:
- exciting a time sequence of digital filters with a synthetic pulse, the synthetic pulse being applied at every pitch period in voiced speech;
  
  calculating the time-domain pulse response of at least one of the filters;
  
  weighting the time domain pulse response by a monotonically decaying function; and
  
  truncating the pulse response length to a predetermined length.
- - 71. A method according to claim 70, wherein each pulse response is calculated by using a synthetic pulse as input to a selected digital filter from the time sequence of digital filters with zero filter states.
  - 72. A method according to claim 70, wherein the speech synthesis is realized by overlap-and-add of the sequence of pulse responses.
  - 73. A method according to claim 70, wherein the monotonically decaying weighting function that is applied to the pulse response is initially constant over a time interval equal to the pitch period and decays after it.
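
The steps of claims 70 to 73 can be sketched in plain Python. The sketch assumes the synthetic pulse is a unit impulse, each digital filter is all-pole with coefficients `a` (denominator 1 + a[0] z^-1 + a[1] z^-2 + ...), and the decaying weight falls linearly to zero after one pitch period. None of these specifics come from the claims, and the function names are hypothetical.

```python
from typing import List, Sequence


def pulse_response(a: Sequence[float], length: int) -> List[float]:
    """Impulse response of an all-pole filter from zero state, truncated."""
    y: List[float] = []
    for n in range(length):
        x = 1.0 if n == 0 else 0.0
        feedback = sum(a[k] * y[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        y.append(x - feedback)
    return y


def decay_window(length: int, pitch_period: int) -> List[float]:
    """Weight of 1 over the first pitch period, then a linear decay to 0."""
    tail = max(length - pitch_period, 1)
    return [
        1.0 if n < pitch_period else max(0.0, 1.0 - (n - pitch_period + 1) / tail)
        for n in range(length)
    ]


def overlap_add(
    filters: Sequence[Sequence[float]],
    pitch_marks: Sequence[int],
    pitch_periods: Sequence[int],
    length: int,
    total: int,
) -> List[float]:
    """Sum one weighted, truncated pulse response at each pitch mark."""
    out = [0.0] * total
    for a, mark, period in zip(filters, pitch_marks, pitch_periods):
        response = pulse_response(a, length)
        window = decay_window(length, period)
        for n in range(length):
            if mark + n < total:
                out[mark + n] += response[n] * window[n]
    return out
```

Truncating every response to the same predetermined length bounds the work per pitch period, whatever the decay time of the underlying filter.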

74. A speech synthesis system for producing synthesized speech from input text comprising:
- a large speech segment database referencing speech segments and accessed by segment designators, each segment designator being associated with a sequence of one or more speech segments;
  
  a speech segment selector for selecting a sequence of speech segments referenced by the large speech segment database and representative of an input text; and
  
  a speech segment concatenator, in communication with the large speech segment database, for concatenating the selected sequence of speech segments to produce a speech signal output corresponding to the input text;
  
  wherein voice characteristics of the speech signal output can be changed by applying different spectral warping functions on the spectrum of the selected speech segments depending on their segment designators or on segment designator classes to which they belong.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Van Coile, Bert; Pollet, Vincent; De Bock, Mario; De Moortel, Jan; Coorman, Geert; Van Gerven, Stefaan

Granted Patent

US 7,567,896 B2
US Class Current

704/266
CPC Class Codes

G10L 13/06 Elementary speech units use...

G10L 13/07 Concatenation rules
