SYSTEM AND METHOD FOR HYBRID SPEECH SYNTHESIS

US 20080270140A1
Filed: 04/24/2007
Published: 10/30/2008
Est. Priority Date: 04/24/2007
Status: Active Grant

3 Assignments

0 Petitions

Abstract

A speech synthesis system receives symbolic input describing an utterance to be synthesized. In one embodiment, different portions of the utterance are constructed from different sources, one of which is a speech corpus recorded from a human speaker whose voice is to be modeled. The other sources may include other human speech corpora or speech produced using Rule-Based Speech Synthesis (RBSS). At least some portions of the utterance may be constructed by modifying prototype speech units to produce adapted speech units that are contextually appropriate for the utterance. The system concatenates the adapted speech units with the other speech units to produce a speech waveform. In another embodiment, a speech unit of a speech corpus recorded from a human speaker lacks transitions at one or both of its edges. A transition is synthesized using RBSS and concatenated with the speech unit in producing a speech waveform for the utterance.
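
The flow described in the abstract can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the function name `synthesize`, the dictionary-based corpus, and the `adapt` and `other_source` callables are all hypothetical, and waveforms are reduced to plain lists of floats.

```python
from typing import Callable, Dict, List, Tuple

Samples = List[float]


def synthesize(
    symbols: List[str],
    target_corpus: Dict[str, Samples],
    adapt: Callable[[Samples], Samples],
    other_source: Callable[[str], Samples],
) -> Tuple[Samples, List[str]]:
    """Build a waveform from two kinds of sources (hypothetical sketch).

    Symbols found in the target voice corpus are adapted prototypes;
    every other symbol is obtained from `other_source`, which stands in
    for a shared corpus or a rule-based synthesizer.
    """
    waveform: Samples = []
    origins: List[str] = []
    for symbol in symbols:
        if symbol in target_corpus:
            unit = adapt(target_corpus[symbol])
            origins.append("target")
        else:
            unit = other_source(symbol)
            origins.append("other")
        # Plain end-to-end concatenation; no smoothing is attempted here.
        waveform.extend(unit)
    return waveform, origins


# Illustrative values only.
target = {"a": [0.5, 1.0]}
wave, origins = synthesize(
    ["a", "s"],
    target,
    adapt=lambda unit: [x * 0.5 for x in unit],
    other_source=lambda symbol: [0.0, 0.0],
)
```

Here the first symbol comes from the target corpus and is adapted, while the second falls through to the other source, mirroring the mixed-source construction the abstract describes.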

240 Citations

58 Claims

1. A method for synthesizing a target voice, the method comprising:
- receiving symbolic input descriptive of an utterance to be synthesized;
  
  selecting one or more portions of the utterance to be constructed from prototype speech units of a target voice corpus, the target voice corpus including speech units recorded from a human speaker, the target voice corpus configured to provide characteristics of the target voice;
  
  applying adaptations to selected ones of the prototype speech units of the target voice corpus, to produce adapted units that are contextually appropriate for the utterance;
  
  obtaining at least some speech units from a source other than the target voice corpus; and
  
  concatenating at least the adapted speech units from the target voice corpus and the speech units from the source other than the target voice corpus to produce a speech waveform for the utterance.
  - 2. The method of claim 1 wherein the adaptations are Phone-and-Transition (P&T) adaptations and the prototype speech units are P&T speech units that comprise one or more phones and transitions.
  - 3. The method of claim 1 wherein at least some of the prototype speech units represent syllable nuclei.
  - 4. The method of claim 1 wherein all the speech units of the target voice corpus are recorded from one particular human speaker whose voice is the basis for the target voice.
  - 5. The method of claim 1 wherein the speech units of the target voice corpus are recorded from two or more different human speakers.
  - 6. The method of claim 1 wherein the adaptations comprise an adaptation that extracts and uses only a selected portion of a phone or a transition of one of the stored prototype speech units.
  - 7. The method of claim 1 wherein the adaptations comprise an adaptation that extracts and uses only a selected portion of one of the stored prototype speech units.
  - 8. The method of claim 1 wherein the adaptations comprise an adaptation that adjusts the duration of at least a portion of one of the stored speech units.
  - 9. The method of claim 1 wherein the adaptations comprise an adaptation that modifies the amplitude of at least a portion of one of the stored prototype speech units.
  - 10. The method of claim 1 wherein the adaptations comprise an adaptation that time reverses at least a portion of one of the stored prototype speech units.
  - 11. The method of claim 1 wherein the adaptations comprise an adaptation that uses a portion of one of the stored prototype speech units to realize a phoneme other than one realized in the original utterance from which the prototype was extracted.
  - 12. The method of claim 1 wherein the source other than the target voice corpus comprises a shared corpus that includes speech units recorded from a different human speaker than the human speaker used to record the target voice corpus, and wherein the shared corpus is configured to be used in synthesizing multiple different target voices.
  - 13. The method of claim 12 wherein the shared corpus further includes synthesized speech units.
  - 14. The method of claim 12 wherein the shared corpus includes a plurality of prototype speech units, and the method further comprises:
    - applying adaptations to selected ones of the prototype speech units of the shared corpus, to produce adapted speech units that are contextually appropriate for the utterance.
  - 15. The method of claim 1 wherein the source other than the target voice corpus is a plurality of shared corpora that are each recorded from a different human speaker, and wherein each shared corpus is configured to be used in synthesizing multiple different target voices.
  - 16. The method of claim 1 wherein the step of obtaining at least some speech units from a source other than the target voice corpus further comprises:
    - synthesizing the at least some speech units with Rule-Based Speech Synthesis (RBSS) rules.
  - 17. The method of claim 1 wherein the target voice corpus further includes synthesized speech units.
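
Several of the adaptations named in claims 6 through 10 (using only a portion of a unit, adjusting duration, modifying amplitude, time reversal) can be sketched as simple list operations. The following Python is a rough illustration under stated assumptions: the function names are hypothetical, a unit is a bare list of samples, and the duration change is naive sample repetition or dropping, which a real synthesizer would likely replace with a pitch-preserving method.

```python
from typing import List

Samples = List[float]


def extract_portion(samples: Samples, start: int, end: int) -> Samples:
    """Use only samples[start:end] of a stored prototype unit."""
    return samples[start:end]


def adjust_duration(samples: Samples, new_length: int) -> Samples:
    """Stretch or shrink a unit to new_length samples.

    Naive nearest-index resampling, for illustration only.
    """
    if new_length <= 0 or not samples:
        return []
    n = len(samples)
    return [samples[min(n - 1, i * n // new_length)] for i in range(new_length)]


def scale_amplitude(samples: Samples, gain: float) -> Samples:
    """Multiply every sample by a constant gain."""
    return [s * gain for s in samples]


def time_reverse(samples: Samples) -> Samples:
    """Return the samples in reverse order."""
    return samples[::-1]
```

Because each helper returns a new list, adaptations can be chained, for example `time_reverse(extract_portion(unit, 0, 100))`, without altering the stored prototype.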

18. A method for speech synthesis, the method comprising:
- receiving symbolic input descriptive of an utterance to be synthesized;
  
  selecting one or more portions of the utterance to be constructed from prototype speech units of a speech corpus, the speech corpus including speech units recorded from a human speaker;
  
  applying Phone-and-Transition (P&T) adaptations to selected ones of the prototype speech units of the speech corpus, to produce adapted speech units that are contextually appropriate for the utterance; and
  
  concatenating at least the adapted speech units from the speech corpus to produce a speech waveform for the utterance.
  - 19. The method of claim 18 wherein the prototype speech units are P&T speech units that comprise one or more phones and transitions.
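
One way to picture a P&T speech unit "that comprise[s] one or more phones and transitions" is as an ordered list of labeled segments. The sketch below is a hypothetical data model, not the representation used in the patent: the class names `Segment` and `PTUnit`, the `kind` strings, and the edge check are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    kind: str             # "phone" or "transition" (assumed labels)
    label: str            # e.g. a phone name
    samples: List[float]  # waveform samples for this segment


@dataclass
class PTUnit:
    segments: List[Segment] = field(default_factory=list)

    def has_edge_transition(self, edge: str) -> bool:
        """Report whether the unit starts ("left") or ends ("right") with a transition."""
        if edge not in ("left", "right"):
            raise ValueError("edge must be 'left' or 'right'")
        if not self.segments:
            return False
        segment = self.segments[0] if edge == "left" else self.segments[-1]
        return segment.kind == "transition"

    def samples(self) -> List[float]:
        """Flatten all segments into one sample list."""
        out: List[float] = []
        for segment in self.segments:
            out.extend(segment.samples)
        return out


# Illustrative unit: a phone followed by a transition.
unit = PTUnit([
    Segment("phone", "a", [0.1]),
    Segment("transition", "a>t", [0.2, 0.3]),
])
```

A check such as `has_edge_transition` is the kind of test a system could use to decide whether a transition still has to be supplied at a unit's edge.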

20. A system for synthesizing a target voice, comprising:
- a front end module configured to receive symbolic input descriptive of an utterance to be synthesized;
  
  a back end module configured to select one or more portions of the utterance to be constructed from prototype speech units of a target voice corpus, the target voice corpus including speech units recorded from a human speaker, the target voice corpus configured to provide characteristics of the target voice;
  
  a unit engine of the back end module configured to apply adaptations to selected ones of the prototype speech units of the target voice corpus, to produce adapted speech units that are contextually appropriate for the utterance; and
  
  a concatenation engine of the back end module configured to concatenate at least the adapted speech units from the target voice corpus and speech units from a source other than the target voice corpus, to produce a speech waveform for the utterance.
  - 21. The system of claim 20 wherein the adaptations are Phone-and-Transition (P&T) adaptations and the prototype speech units are P&T speech units that comprise one or more phones and transitions.
  - 22. The system of claim 20 wherein at least some of the prototype speech units represent syllable nuclei.
  - 23. The system of claim 20 wherein all the speech units of the target voice corpus are recorded from one particular human speaker whose voice is the basis for the target voice.
  - 24. The system of claim 20 wherein the speech units of the target voice corpus are recorded from two or more different human speakers.
  - 25. The system of claim 20 wherein the adaptations comprise an adaptation that extracts and uses only a selected portion of a phone or a transition of one of the stored prototype speech units.
  - 26. The system of claim 20 wherein the P&T adaptations comprise an adaptation that extracts and uses only a selected portion of one of the stored prototype speech units.
  - 27. The system of claim 20 wherein the adaptations comprise an adaptation that adjusts the duration of at least a portion of one of the stored prototype speech units.
  - 28. The system of claim 20 wherein the adaptations comprise an adaptation that modifies the amplitude of at least a portion of one of the stored prototype speech units.
  - 29. The system of claim 20 wherein the adaptations comprise an adaptation that time reverses at least a portion of one of the stored prototype speech units.
  - 30. The system of claim 20 wherein the adaptations comprise an adaptation that uses a portion of one of the stored prototype speech units to realize a phoneme other than one realized in the original utterance from which the prototype was extracted.
  - 31. The system of claim 20 wherein the source other than the target voice corpus comprises a shared corpus that includes speech units recorded from a different human speaker than the human speaker used to record the target voice corpus, and wherein the shared corpus is configured to be used in synthesizing multiple different target voices.
  - 32. The system of claim 31 wherein the shared corpus further includes synthesized speech units.
  - 33. The system of claim 31 wherein the shared corpus includes a plurality of prototype speech units, and the unit engine of the back end module is further configured to apply adaptations to selected ones of the prototype speech units of the shared corpus, to produce adapted speech units that are contextually appropriate for the utterance.
  - 34. The system of claim 20 wherein the source other than the target voice corpus comprises a plurality of shared corpora that are each recorded from a different human speaker, and wherein each shared corpus is configured to be used in synthesizing multiple different target voices.
  - 35. The system of claim 20 wherein the source other than the target voice corpus is a Rule-Based Speech Synthesizer configured to synthesize at least some speech units with Rule-Based Speech Synthesis (RBSS) rules.
  - 36. The system of claim 20 wherein the target voice corpus further includes synthesized speech units.
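
The module arrangement of claim 20, together with the shared corpus of claim 31, might be organized along the following lines. This is a loose sketch, not the claimed system: the class names echo the claim language, but their methods, the whitespace-separated symbolic input, and the constant-gain adaptation are all assumptions.

```python
from typing import Dict, List

Samples = List[float]
Corpus = Dict[str, Samples]


class FrontEnd:
    def parse(self, text: str) -> List[str]:
        """Assume the symbolic input is a whitespace-separated label string."""
        return text.split()


class UnitEngine:
    def __init__(self, gain: float = 1.0) -> None:
        self.gain = gain

    def adapt(self, prototype: Samples) -> Samples:
        """Stand-in adaptation: scale the prototype by a constant gain."""
        return [s * self.gain for s in prototype]


class ConcatenationEngine:
    def join(self, units: List[Samples]) -> Samples:
        out: Samples = []
        for unit in units:
            out.extend(unit)
        return out


class BackEnd:
    def __init__(self, target_corpus: Corpus, shared_corpus: Corpus) -> None:
        self.target_corpus = target_corpus
        self.shared_corpus = shared_corpus
        self.unit_engine = UnitEngine()
        self.concatenation_engine = ConcatenationEngine()

    def render(self, labels: List[str]) -> Samples:
        units: List[Samples] = []
        for label in labels:
            if label in self.target_corpus:
                units.append(self.unit_engine.adapt(self.target_corpus[label]))
            else:
                # Raises KeyError if the label is in neither corpus.
                units.append(list(self.shared_corpus[label]))
        return self.concatenation_engine.join(units)


# One shared corpus serving two different target voices (illustrative values).
shared = {"s": [0.0, 0.0]}
voice_a = BackEnd({"a": [0.5]}, shared)
voice_b = BackEnd({"a": [0.25]}, shared)
labels = FrontEnd().parse("s a")
```

Passing the same `shared` dictionary to both back ends reflects the idea that a shared corpus is recorded once and reused across target voices, while each target corpus supplies the voice-specific units.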

37. A system for speech synthesis comprising:
- a front end module configured to receive symbolic input descriptive of an utterance to be synthesized;
  
  a back end module configured to select one or more portions of the utterance to be constructed from prototype speech units of a speech corpus, the speech corpus including speech units recorded from a human speaker;
  
  a unit engine of the back end module configured to apply Phone-and-Transition (P&T) adaptations to selected ones of the prototype speech units of the speech corpus, to produce adapted speech units that are contextually appropriate for the utterance; and
  
  a concatenation engine of the back end module configured to concatenate at least the adapted speech units from the speech corpus to produce a speech waveform for the utterance.
  - 38. The system of claim 37 wherein the prototype speech units are P&T speech units that comprise one or more phones and transitions.

39. A method for speech synthesis comprising:
- receiving symbolic input descriptive of an utterance to be synthesized;
  
  selecting a portion of the utterance to be constructed from a speech unit of a speech corpus, the speech unit recorded from a human speaker, the speech unit lacking transitions at one or both of the speech unit's edges;
  
  synthesizing a transition for use at an edge of the speech unit using Rule-Based Speech Synthesis (RBSS) rules; and
  
  concatenating the speech unit with the synthesized transition in producing a speech waveform for the utterance.
  - 40. The method of claim 39 wherein the step of synthesizing further comprises:
    - obtaining one or more transition properties from the speech corpus for the transition to be synthesized.
  - 41. The method of claim 40 wherein the one or more transition properties comprise at least one property selected from the group consisting of:
    - transition duration, formant frequencies, formant bandwidths, amplitudes, and fundamental frequencies.
  - 42. The method of claim 39 wherein the RBSS rules are Rule Based Formant Synthesis (RBFS) rules.
  - 43. The method of claim 39 wherein the speech unit of the speech corpus is a Phone-and-Transition (P&T) speech unit that comprises at least a phone segment.
  - 44. The method of claim 43 wherein the speech unit of the speech corpus is adapted by application of one or more P&T adaptations prior to the step of concatenating.
  - 45. The method of claim 39 wherein the speech corpus is a target voice corpus recorded from a target speaker and configured to provide characteristics of a target voice.
  - 46. The method of claim 39 wherein the speech corpus is a shared corpus, and wherein the shared corpus is configured to be used in synthesizing multiple different target voices.
  - 47. The method of claim 39 wherein the step of concatenating further comprises:
    - concatenating the speech unit and the synthesized transition with one or more other speech units synthesized by RBSS rules.
  - 48. The method of claim 39 wherein the step of synthesizing further comprises:
    - creating an extension segment at an edge of the synthesized transition, the extension segment to overlap another speech unit when the synthesized transition is concatenated.
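
Claim 48 refers to an extension segment that overlaps a neighboring speech unit at concatenation time. One plausible reading, offered here only as an assumption, is that the overlapping region is blended rather than butted end to end. The Python sketch below uses a linear cross-fade; the function name `crossfade_concat` and the choice of linear weights are hypothetical and are not taken from the claims.

```python
from typing import List

Samples = List[float]


def crossfade_concat(left: Samples, right: Samples, overlap: int) -> Samples:
    """Join two units so the last `overlap` samples of `left` overlap
    the first `overlap` samples of `right`, blended linearly."""
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        return left + right
    head = left[: len(left) - overlap]
    tail = right[overlap:]
    mixed: Samples = []
    for i in range(overlap):
        # Weight of the right-hand unit rises across the overlap region.
        w = (i + 1) / (overlap + 1)
        mixed.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + mixed + tail
```

The result is `overlap` samples shorter than plain concatenation, since the extension region is shared between the two units instead of being played twice.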

49. A system for speech synthesis comprising:
- a front end module configured to receive symbolic input descriptive of an utterance to be synthesized;
  
  a back end module configured to select a portion of the utterance to be constructed from a speech unit of a speech corpus, the speech unit recorded from a human speaker, the speech unit lacking transitions at one or both of the speech unit's edges;
  
  a synthesis module configured to synthesize a transition for use at an edge of the speech unit by use of Rule-Based Speech Synthesis (RBSS) rules; and
  
  a concatenation engine of the back end module configured to concatenate the speech unit with the synthesized transition in production of a speech waveform for the utterance.
  - 50. The system of claim 49 wherein a synthesis module is further configured to obtain one or more transition properties from the speech corpus for the transition to be synthesized.
  - 51. The system of claim 50 wherein the one or more transition properties comprise at least one property selected from the group consisting of:
    - transition duration, formant frequencies, formant bandwidths, amplitudes, and fundamental frequencies.
  - 52. The system of claim 49 wherein the RBSS rules are Rule Based Formant Synthesis (RBFS) rules.
  - 53. The system of claim 49 wherein the speech unit of the speech corpus is a Phone-and-Transition (P&T) speech unit comprising at least a phone segment.
  - 54. The system of claim 53 wherein the speech unit of the speech corpus is adapted by application of one or more P&T adaptations prior to the step of concatenating.
  - 55. The system of claim 49 wherein the speech corpus is a target voice corpus recorded from a target speaker and configured to provide characteristics of a target voice.
  - 56. The system of claim 49 wherein the speech corpus is a shared corpus, and wherein the shared corpus is configured to be used in synthesizing multiple different target voices.
  - 57. The system of claim 49 wherein the concatenation engine is further configured to concatenate the speech unit and the synthesized transition with one or more other speech units synthesized by RBSS rules.
  - 58. The system of claim 49 wherein the synthesis module is further configured to create an extension segment at an edge of the synthesized transition, the extension segment to overlap another speech unit when the synthesized transition is concatenated.
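
Claims 41 and 51 list transition properties such as transition duration and formant frequencies. As a rough illustration of how such properties could drive a rule-based transition, the sketch below interpolates formant frequencies linearly from one set of values to another over a given number of frames. The linear rule, the class `TransitionProperties`, and the example frequencies are assumptions; the claims do not specify the actual RBSS rules, and this code produces parameter tracks only, not audio.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TransitionProperties:
    duration_frames: int                    # transition duration, in frames
    start_formants_hz: Tuple[float, ...]    # formants at the left edge
    end_formants_hz: Tuple[float, ...]      # formants at the right edge


def formant_tracks(props: TransitionProperties) -> List[Tuple[float, ...]]:
    """Return one tuple of formant frequencies per frame (linear interpolation)."""
    n = props.duration_frames
    if n < 2:
        raise ValueError("need at least two frames to interpolate")
    if len(props.start_formants_hz) != len(props.end_formants_hz):
        raise ValueError("start and end must list the same number of formants")
    frames: List[Tuple[float, ...]] = []
    for i in range(n):
        t = i / (n - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append(tuple(
            s + t * (e - s)
            for s, e in zip(props.start_formants_hz, props.end_formants_hz)
        ))
    return frames


# Illustrative values only, not measurements from any corpus.
props = TransitionProperties(3, (500.0, 1500.0), (700.0, 1100.0))
tracks = formant_tracks(props)
```

In a fuller system the start and end values would be among the properties obtained from the speech corpus, and the resulting tracks would feed a formant synthesizer.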

Current Assignee: Synfonica LLC
Original Assignee: Novaspeech LLC
Inventors: Mills, Harold G.; Hertz, Susan R.
Granted Patent: US 7,953,600 B2
US Class Current: 704/267
CPC Class Codes:
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/06   Elementary speech units use...
G10L 25/15   the extracted parameters be...
