System and method for automatic detection of abnormal stress patterns in unit selection synthesis
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for detecting and correcting abnormal stress patterns in unit-selection speech synthesis. A system practicing the method detects incorrect stress patterns in selected acoustic units representing speech to be synthesized, and corrects the incorrect stress patterns in the selected acoustic units to yield corrected stress patterns. The system can further synthesize speech based on the corrected stress patterns. In one aspect, the system also classifies the incorrect stress patterns using a machine learning algorithm such as a classification and regression tree, adaptive boosting, a support vector machine, or maximum entropy. In this way, a text-to-speech unit selection speech synthesizer can produce more natural-sounding speech with suitable stress patterns regardless of the stress of units in a unit selection database.
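The detect-then-correct flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `Syllable` record, the `prominence` score (energy times duration), and the rule in `has_incorrect_stress`, which is a hand-written stand-in for the trained classifier (CART, AdaBoost, SVM, or maximum entropy) that the abstract mentions.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Syllable:
    """Hypothetical per-syllable record for a selected unit sequence."""
    text: str
    energy: float    # normalized energy, arbitrary units
    duration: float  # seconds, assumed to be positive
    stressed: bool   # True if this syllable is expected to carry stress


def prominence(s: Syllable) -> float:
    # Assumed prominence score: energy weighted by duration.
    return s.energy * s.duration


def has_incorrect_stress(word: List[Syllable]) -> bool:
    # Stand-in for the trained classifier: flag the word when its most
    # prominent syllable is not one expected to carry stress.
    return not max(word, key=prominence).stressed


def correct_stress(word: List[Syllable], margin: float = 1.2) -> List[Syllable]:
    # Applied before waveform synthesis: raise the energy of each
    # expected-stress syllable so its prominence exceeds that of every
    # unstressed syllable by the factor `margin`.
    if not has_incorrect_stress(word):
        return list(word)
    target = max(prominence(s) for s in word if not s.stressed) * margin
    return [
        replace(s, energy=target / s.duration) if s.stressed else s
        for s in word
    ]
```

For example, if the units selected for the noun "record" make the second syllable more prominent than the first, `has_incorrect_stress` flags the word and `correct_stress` returns a copy in which only the first syllable's energy has been raised.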
20 Claims
1. A method comprising:
- detecting, via a machine learning algorithm modeling human perception and trained with acoustic parameters from each syllable in a word, incorrect stress patterns in selected acoustic units representing speech to be synthesized, wherein the selected acoustic units comprise phonemes and come from a database of energy-normalized acoustic units that are normalized on a sentence basis;
- performing a word level analysis of the incorrect stress patterns, a phrase level analysis of the incorrect stress patterns and a sentence level analysis of the incorrect stress patterns to yield analyses, wherein the analyses are performed in series; and
- modifying, via a processor and prior to waveform synthesis, the incorrect stress patterns in the selected acoustic units according to the analyses, to yield corrected stress patterns.

Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system comprising:
- a processor; and
- a computer-readable storage medium having instructions stored which, when executed by the processor, result in the processor performing operations comprising:
  - detecting, via a machine learning algorithm modeling human perception and trained with acoustic parameters from each syllable in a word, incorrect stress patterns in selected acoustic units representing speech to be synthesized, wherein the selected acoustic units comprise phonemes and come from a database of energy-normalized acoustic units that are normalized on a sentence basis;
  - performing a word level analysis of the incorrect stress patterns, a phrase level analysis of the incorrect stress patterns, and a sentence level analysis of the incorrect stress patterns to yield analyses, wherein the analyses are performed in series; and
  - modifying, via the processor and prior to waveform synthesis, the incorrect stress patterns in the selected acoustic units according to the analyses, to yield corrected stress patterns.

Dependent claims: 10, 11, 12, 13, 14, 15
16. A computer-readable storage device having instructions stored which, when executed by a processor, result in the processor performing operations comprising:
- detecting, via a machine learning algorithm modeling human perception and trained with acoustic parameters from each syllable in a word, incorrect stress patterns in selected acoustic units representing speech to be synthesized, wherein the selected acoustic units comprise phonemes and come from a database of energy-normalized acoustic units that are normalized on a sentence basis;
- performing a word level analysis of the incorrect stress patterns, a phrase level analysis of the incorrect stress patterns, and a sentence level analysis of the incorrect stress patterns to yield analyses, wherein the analyses are performed in series; and
- modifying, via the processor and prior to waveform synthesis, the incorrect stress patterns in the selected acoustic units according to the analyses, to yield corrected stress patterns.

Dependent claims: 17, 18, 19, 20
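Each independent claim refers to a database of acoustic units whose energy is "normalized on a sentence basis". The claims do not say which statistic is used, so the Python sketch below shows only one possible reading: dividing each unit's energy by the mean energy of the sentence it was recorded in. The function name `normalize_energy_per_sentence` and the choice of the mean are assumptions made for illustration.

```python
from typing import List


def normalize_energy_per_sentence(energies: List[float]) -> List[float]:
    """Scale unit energies so that the sentence's mean energy becomes 1.0.

    `energies` holds one raw energy value per acoustic unit in a single
    recorded sentence. Mean normalization is an assumption; the source
    only states that normalization is done on a sentence basis.
    """
    if not energies:
        raise ValueError("sentence contains no units")
    mean = sum(energies) / len(energies)
    if mean == 0:
        raise ValueError("sentence has zero mean energy")
    return [e / mean for e in energies]
```

Normalizing per sentence rather than over the whole database would make units from loudly and softly recorded sentences comparable, which is presumably what lets a detector judge relative stress within a word.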
Specification