Method and apparatus for improving spontaneous speech recognition performance

US 10,388,275 B2
Filed: 09/07/2017
Issued: 08/20/2019
Est. Priority Date: 02/27/2017
Status: Active Grant

Abstract

The present invention relates to a method and apparatus for improving spontaneous speech recognition performance. The present invention is directed to providing a method and apparatus for improving spontaneous speech recognition performance by extracting a phase feature as well as a magnitude feature of a voice signal transformed to the frequency domain, detecting a syllabic nucleus on the basis of a deep neural network using a multi-frame output, determining a speaking rate by dividing the number of syllabic nuclei by a voice section interval detected by a voice detector, calculating a length variation or an overlap factor according to the speaking rate, and performing cepstrum length normalization or time scale modification with a voice length appropriate for an acoustic model.
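The speaking-rate step described in the abstract reduces to a ratio of syllabic nuclei to voiced time. The following is a minimal Python sketch, assuming the nucleus positions and voice sections are already available as times in seconds. The function names, the reference rate of 4.0 syllables per second, and the use of a simple ratio as the degree of time scale modification are illustrative assumptions and are not taken from the patent.

```python
def speaking_rate(nucleus_times, voice_sections):
    """Syllabic nuclei per second of detected voice.

    nucleus_times: times (s) of detected syllabic nuclei.
    voice_sections: list of (start, end) voice intervals in seconds.
    """
    voiced = sum(end - start for start, end in voice_sections)
    if voiced <= 0:
        raise ValueError("no voiced interval detected")
    # Count only nuclei that fall inside a detected voice section.
    count = sum(
        any(start <= t <= end for start, end in voice_sections)
        for t in nucleus_times
    )
    return count / voiced


def time_scale_factor(rate, reference_rate=4.0):
    """Hypothetical stretch factor: above 1 for fast speech, below 1 for slow speech."""
    return rate / reference_rate


sections = [(0.0, 1.0), (1.5, 2.5)]  # 2.0 s of voice in total
nuclei = [0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 1.6, 1.8, 2.0, 2.2, 2.4]
rate = speaking_rate(nuclei, sections)   # the nucleus at 1.2 s lies outside any voice section
factor = time_scale_factor(rate)
```

Under these assumptions, a factor above 1 would mean the utterance is faster than the rate the acoustic model expects and should be lengthened; how the patent maps the rate to a length variation or overlap factor is not specified in this listing.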

16 Claims

1. An apparatus for improving spontaneous speech recognition performance, the apparatus comprising a computer including a processor and memory, the processor comprising:
- a frequency transformer that divides a voice signal into frames and applies a discrete Fourier transform (DFT) to transform the voice signal from the time domain to the frequency domain;
  
  a magnitude feature extractor that extracts a magnitude feature from a magnitude of the voice signal transformed to the frequency domain;
  
  a phase feature extractor that extracts a phase feature from a phase of the voice signal transformed to the frequency domain;
  
  a syllabic nucleus detector that detects a syllabic nucleus by using the magnitude feature and the phase feature as an input of a deep neural network;
  
  a voice detector that detects a voice section and a non-voice section from the voice signal;
  
  a speaking rate determiner that determines a speaking rate by using the detected syllabic nucleus and an interval of the detected voice section;
  
  a calculator that calculates a degree of time scale modification by using the speaking rate; and
  
  a time scale modifier that converts a voice into a length appropriate for an acoustic model by using the degree of time scale modification, and the deep neural network of the computer detects a syllabic nucleus from the syllabic nucleus detector and outputs a phoneme classification item as a multi-frame output.
  - 2. The apparatus of claim 1, wherein the magnitude feature includes at least one of Mel filter bank log energy (MFLE), a Mel frequency cepstrum coefficient (MFCC), a linear prediction coefficient (LPC), a pitch, a harmonic component, and a spectral flatness.
  - 3. The apparatus of claim 1, wherein the phase feature includes at least one of a delta-phase spectrum, a phase distortion deviation, a group delay, and a circular variance.
  - 4. The apparatus of claim 1, wherein the degree of time scale modification is any one of a variation and an overlap factor.
  - 5. The apparatus of claim 1, wherein the voice detector models a DFT coefficient distribution of a clean voice and noise as a normal distribution and performs a likelihood ratio test (LRT).
  - 6. The apparatus of claim 1, wherein the deep neural network used by the syllabic nucleus detector uses a training voice signal and transcription information of the training voice signal, transforms the training voice signal to the frequency domain, extracts a magnitude feature and a phase feature, configures the phoneme classification item from the transcription information of the training voice signal as a multi-frame output, trains the deep neural network to have the magnitude feature and the phase feature as an input and the phoneme classification item configured as the multi-frame output as an output, and trains the deep neural network through a back-propagation algorithm by using cross entropy (CE).
  - 7. The apparatus of claim 6, wherein the phoneme classification item includes “silent,” “consonant,” “syllabic nucleus,” and “consecutive syllabic nucleus”.
  - 8. The apparatus of claim 6, wherein the multi-frame output includes performing forced alignment by using the transcription information of the voice signal and the voice recognizer to estimate a voice signal section corresponding to the phoneme classification item, group phoneme classification items of neighboring frames, and output multiple frames.
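
The frequency transformer and the magnitude and phase paths of claim 1 can be illustrated with a short sketch. This is a minimal Python example using only the standard library; the function names, frame length, hop size, and the absence of a window function are assumptions made for illustration, and the naive DFT stands in for the FFT a real system would use. Features such as MFLE or group delay would be derived from these raw magnitudes and phases and are not shown.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a sample sequence into overlapping frames (trailing partial frame dropped)."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def magnitude_and_phase(frame):
    """Return per-bin magnitude and phase (radians) for the non-negative frequencies."""
    spectrum = dft(frame)[: len(frame) // 2 + 1]
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


# Toy input: a cosine with exactly 2 cycles per 16-sample frame.
signal = [math.cos(2 * math.pi * 2 * t / 16) for t in range(48)]
frames = frame_signal(signal, frame_len=16, hop=8)
mags, phases = magnitude_and_phase(frames[0])
peak_bin = max(range(len(mags)), key=mags.__getitem__)
```

For this toy input the energy concentrates in bin 2 of each frame, which is a quick sanity check that framing and the transform agree.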

9. A computer implemented method for improving spontaneous speech recognition performance, the computer including a processor and a memory, the computer implemented method comprising:
- dividing a voice signal into a plurality of frames at predetermined intervals and applying a discrete Fourier transform (DFT) to transform the voice signal from the time domain to the frequency domain;
  
  extracting a magnitude feature from a magnitude of the voice signal transformed to the frequency domain;
  
  extracting a phase feature from a phase of the voice signal transformed to the frequency domain;
  
  detecting a syllabic nucleus by using the magnitude feature and the phase feature as an input of a deep neural network;
  
  detecting a voice section and a non-voice section from the voice signal;
  
  determining a speaking rate by using an interval of the detected voice section;
  
  calculating a degree of time scale modification by using the speaking rate;
  
  converting a voice into a length appropriate for an acoustic model by using the degree of time scale modification; and
  
  detecting, by the deep neural network of the computer, a syllabic nucleus and outputting a phoneme classification item as a multi-frame output.
  - 10. The method of claim 9, wherein the extracting of a magnitude feature comprises extracting Mel filter bank log energy (MFLE), a Mel frequency cepstrum coefficient (MFCC), a linear prediction coefficient (LPC), a pitch, a harmonic component, and a spectral flatness as the magnitude feature.
  - 11. The method of claim 9, wherein the extracting of a phase feature comprises extracting a delta-phase spectrum, a phase distortion deviation, a group delay, and a circular variance as the phase feature.
  - 12. The method of claim 9, wherein the calculating of a degree of time scale modification comprises calculating any one of a variation and an overlap factor as the degree of time scale modification.
  - 13. The method of claim 9, wherein the detecting of a voice section and a non-voice section comprises modeling a DFT coefficient distribution of a clean voice and noise as a normal distribution and performing a likelihood ratio test (LRT).
  - 14. The method of claim 9, wherein the detecting of a syllabic nucleus comprises using a training voice signal and transcription information of the training voice signal, transforming the training voice signal to the frequency domain to extract a magnitude feature and a phase feature, configuring the phoneme classification item from the transcription information of the training voice signal as a multi-frame output, training the deep neural network to have the magnitude feature and the phase feature as an input and the phoneme classification item configured as the multi-frame output as an output, and training the deep neural network through a back-propagation algorithm by using cross entropy (CE).
  - 15. The method of claim 14, wherein the phoneme classification item includes “silent,” “consonant,” “syllabic nucleus,” and “consecutive syllabic nucleus”.
  - 16. The method of claim 14, wherein the multi-frame output indicates performing forced alignment by using the transcription information of the voice signal and the voice recognizer to estimate a voice signal section corresponding to the phoneme classification item, group phoneme classification items of neighboring frames, and output multiple frames.
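
The likelihood ratio test named in claims 5 and 13 can be sketched in a common textbook form. The Python example below assumes zero-mean complex Gaussian DFT coefficients whose noise and speech variances are already known per bin; in practice these would have to be estimated. The function names, the averaging over bins, and the threshold of 0.0 are illustrative assumptions and may differ from the detector the patent actually uses.

```python
import math


def log_likelihood_ratio(power, noise_var, speech_var):
    """Per-bin log likelihood ratio of 'speech + noise' versus 'noise only'.

    power: squared magnitude of the DFT coefficient in this bin.
    """
    gamma = power / noise_var      # a posteriori SNR
    xi = speech_var / noise_var    # a priori SNR
    return gamma * xi / (1.0 + xi) - math.log1p(xi)


def is_voice(frame_power, noise_var, speech_var, threshold=0.0):
    """Average the per-bin log ratios over the frame and compare with a threshold."""
    llrs = [
        log_likelihood_ratio(p, n, s)
        for p, n, s in zip(frame_power, noise_var, speech_var)
    ]
    return sum(llrs) / len(llrs) > threshold


noise_var = [1.0, 1.0, 1.0, 1.0]
speech_var = [3.0, 3.0, 3.0, 3.0]
quiet_frame = [0.8, 1.1, 0.9, 1.2]  # power close to the noise floor
loud_frame = [5.0, 4.0, 6.0, 3.5]   # power well above the noise floor

quiet_is_voice = is_voice(quiet_frame, noise_var, speech_var)
loud_is_voice = is_voice(loud_frame, noise_var, speech_var)
```

Frames classified as voice would form the voice sections whose total interval feeds the speaking-rate determination.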

Current Assignee: Electronics and Telecommunications Research Institute
Original Assignee: Electronics and Telecommunications Research Institute
Inventors: Kim, Hyun Woo; Jung, Ho Young; Park, Jeon Gue; Lee, Yun Keun
Primary Examiner(s): Sharma, Neeraj
Application Number: US15/697,923
Publication Number: US 20180247642A1
Time in Patent Office: 712 Days
Field of Search: None
US Class Current
CPC Class Codes

G06N 3/08   Learning methods

G06N 3/084   Backpropagation, e.g. using...

G10L 15/02   Feature extraction for spee...

G10L 15/04   Segmentation; Word boundary...

G10L 15/16   using artificial neural net...

G10L 2015/025   Phonemes, fenemes or fenone...

G10L 2015/027   Syllables being the recogni...

G10L 21/04   Time compression or expansion

G10L 25/84   for discriminating voice fr...
