Method and apparatus for improving spontaneous speech recognition performance
Abstract
The present invention relates to a method and apparatus for improving spontaneous speech recognition performance. The present invention is directed to providing a method and apparatus for improving spontaneous speech recognition performance by extracting a phase feature as well as a magnitude feature of a voice signal transformed to the frequency domain, detecting a syllabic nucleus on the basis of a deep neural network using a multi-frame output, determining a speaking rate by dividing the number of syllabic nuclei by a voice section interval detected by a voice detector, calculating a length variation or an overlap factor according to the speaking rate, and performing cepstrum length normalization or time scale modification with a voice length appropriate for an acoustic model.
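The speaking-rate step in the abstract (number of syllabic nuclei divided by the detected voice section interval) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `reference_rate` of 4.0 syllables per second, and the clamping bounds are all assumptions, and the abstract does not state how the rate is mapped to a length variation.

```python
def speaking_rate(nucleus_times, voice_sections):
    """Syllabic nuclei per second of detected speech.

    nucleus_times: times (seconds) of detected syllabic nuclei.
    voice_sections: list of (start, end) pairs in seconds from a voice detector.
    """
    total_voiced = sum(end - start for start, end in voice_sections)
    if total_voiced <= 0:
        raise ValueError("no voiced audio detected")
    return len(nucleus_times) / total_voiced


def stretch_factor(rate, reference_rate=4.0, lower=0.5, upper=2.0):
    """Hypothetical mapping from speaking rate to a length ratio.

    Returns output_length / input_length. Speech faster than the assumed
    reference rate is lengthened (> 1), slower speech is shortened (< 1).
    The result is clamped so that one bad estimate cannot distort the audio.
    """
    return max(lower, min(upper, rate / reference_rate))


# Six nuclei over 2.0 seconds of voiced audio.
rate = speaking_rate([0.2, 0.5, 0.9, 1.6, 1.9, 2.3], [(0.0, 1.0), (1.5, 2.5)])
factor = stretch_factor(rate)
```

Only the voiced intervals enter the denominator, so pauses between utterances do not lower the estimated rate.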
16 Claims
1. An apparatus for improving spontaneous speech recognition performance, the apparatus comprising a computer including a processor and memory, the processor comprising:
a frequency transformer that divides a voice signal into frames and applies a discrete Fourier transform (DFT) to transform the voice signal from the time domain to the frequency domain;
a magnitude feature extractor that extracts a magnitude feature from a magnitude of the voice signal transformed to the frequency domain;
a phase feature extractor that extracts a phase feature from a phase of the voice signal transformed to the frequency domain;
a syllabic nucleus detector that detects a syllabic nucleus by using the magnitude feature and the phase feature as an input of a deep neural network;
a voice detector that detects a voice section and a non-voice section from the voice signal;
a speaking rate determiner that determines a speaking rate by using the detected syllabic nucleus and an interval of the detected voice section;
a calculator that calculates a degree of time scale modification by using the speaking rate; and
a time scale modifier that converts a voice into a length appropriate for an acoustic model by using the degree of time scale modification,
and the deep neural network of the computer detects a syllabic nucleus from the syllabic nucleus detector and outputs a phoneme classification item as a multi-frame output.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
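The first three elements of claim 1 (framing, DFT, and separate magnitude and phase extraction) can be illustrated with a short standard-library sketch. The helper names are hypothetical, the DFT is the naive O(N^2) form for readability, and the raw magnitude and phase of each bin stand in for whatever magnitude and phase features the specification actually derives.

```python
import cmath
import math


def split_frames(samples, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def magnitude_and_phase(frame):
    """Return (magnitudes, phases) for the non-negative frequency bins."""
    spectrum = dft(frame)[: len(frame) // 2 + 1]
    magnitudes = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]
    return magnitudes, phases


# A cosine with exactly two cycles per 8-sample frame.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
frames = split_frames(signal, frame_len=8, hop=4)
mags, phases = magnitude_and_phase(frames[0])
```

For this test signal the energy lands in bin 2 with near-zero phase, which makes the split between the magnitude path and the phase path easy to check by hand.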
9. A computer implemented method for improving spontaneous speech recognition performance, the computer including a processor and a memory, the computer implemented method comprising:
dividing a voice signal into a plurality of frames at predetermined intervals and applying a discrete Fourier transform (DFT) to transform the voice signal from the time domain to the frequency domain;
extracting a magnitude feature from a magnitude of the voice signal transformed to the frequency domain;
extracting a phase feature from a phase of the voice signal transformed to the frequency domain;
detecting a syllabic nucleus by using the magnitude feature and the phase feature as an input of a deep neural network;
detecting a voice section and a non-voice section from the voice signal;
determining a speaking rate by using an interval of the detected voice section;
calculating a degree of time scale modification by using the speaking rate;
converting a voice into a length appropriate for an acoustic model by using the degree of time scale modification; and
detecting, by the deep neural network of the computer, a syllabic nucleus and outputting a phoneme classification item as a multi-frame output.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
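The converting step of claim 9 can be illustrated with plain overlap-add (OLA) time scale modification. This is a sketch under stated assumptions rather than the method the specification uses: the function name, the Hann window, the 50% synthesis overlap, and the default frame length are all illustrative choices, and `factor` is taken to mean output length divided by input length.

```python
import math


def ola_time_stretch(samples, factor, frame_len=256):
    """Change duration by roughly `factor` using plain overlap-add.

    Frames are read every hop_in samples and written every hop_out samples,
    so factor > 1 lengthens the signal and factor < 1 shortens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    hop_out = frame_len // 2          # 50% overlap on the output side
    hop_in = hop_out / factor         # input stride (may be fractional)
    # Periodic Hann window: copies shifted by frame_len / 2 sum to 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    n_frames = max(1, int((len(samples) - frame_len) / hop_in) + 1)
    out = [0.0] * (hop_out * (n_frames - 1) + frame_len)
    for m in range(n_frames):
        start_in = int(round(m * hop_in))
        start_out = m * hop_out
        for i in range(frame_len):
            j = start_in + i
            if j < len(samples):      # treat samples past the end as zero
                out[start_out + i] += window[i] * samples[j]
    return out


# A constant signal makes the window normalization easy to verify.
slow = ola_time_stretch([1.0] * 2048, factor=2.0)
fast = ola_time_stretch([1.0] * 2048, factor=0.5)
```

Plain OLA does not align the waveforms of overlapping frames, so it can introduce audible artifacts on real speech. Variants such as SOLA and WSOLA search for a better-matching frame offset before adding, and a practical system would more likely use one of those.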
Specification