METHOD AND APPARATUS FOR DETECTING SPEECH ENDPOINT USING WEIGHTED FINITE STATE TRANSDUCER

US 20140379345A1
Filed: 03/25/2014
Published: 12/25/2014
Est. Priority Date: 06/20/2013
Status: Active Grant

Abstract

Disclosed are an apparatus and a method for detecting a speech endpoint using a WFST. The apparatus in accordance with an embodiment of the present invention includes: a speech decision portion configured to receive frame units of feature vector converted from a speech signal and to analyze and classify the received feature vector into a speech class or a noise class; a frame level WFST configured to receive the speech class and the noise class and to convert the speech class and the noise class to a WFST format; a speech level WFST configured to detect a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state; a WFST combination portion configured to combine the frame level WFST with the speech level WFST; and an optimization portion configured to optimize the combined WFST having the frame level WFST and the speech level WFST combined therein to have a minimum route.

41 Citations

12 Claims

1. An apparatus for detecting a speech endpoint, comprising:
- a speech decision portion configured to receive frame units of feature vector converted from a speech signal and to analyze and classify the received feature vector into a speech class or a noise class;
  
  a frame level WFST configured to receive the speech class and the noise class and to convert the speech class and the noise class to a WFST format;
  
  a speech level WFST configured to detect a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
  
  a WFST combination portion configured to combine the frame level WFST with the speech level WFST; and
  
  an optimization portion configured to optimize the combined WFST having the frame level WFST and the speech level WFST combined therein to have a minimum route.
- - 2. The apparatus of claim 1, wherein the WFST combination portion is configured to combine the frame level WFST with the speech level WFST by use of a join operation ($\circ$), among basic operations of a WFST, according to mathematical equation
    $C = F \circ U$, whereas C denotes a combined WFST, F denotes a frame level WFST, and U denotes a speech level WFST.
  - 3. The apparatus of claim 2, wherein the optimization portion is configured to optimize the combined WFST by use of a minimize operation (min), among basic operations of the WFST, according to mathematical equation
    $D = \min(C)$, whereas D denotes an optimized WFST.
  - 4. The apparatus of claim 3, wherein the speech level WFST includes 6 states of NOISE, SPEECH, Sn, Nn, BOU (begin of utterance), and EOU (end of utterance) in accordance with the speech class and the noise class and is implemented according to mathematical equation
    $A = (\Sigma, Q, i, F, E, \lambda, \rho)$
    $\Sigma = (H_0, H_1)$
    $Q = (\mathrm{NOISE}, \mathrm{SPEECH}, \mathrm{BOU}, \mathrm{EOU}, \mathrm{Sn}, \mathrm{Nn})$,
    whereas NOISE denotes a noise state, SPEECH denotes a speech state, BOU denotes a speech start state, EOU denotes a speech end state, Sn denotes an nth (n being a natural number) speech waiting state, and Nn denotes an nth noise waiting state, and whereas i is an initial, NOISE state and F is a final state set, which is EOU, and whereas E denotes a transition function set, and $\lambda$ and $\rho$ denote a speech class (H) weight and a noise class weight, respectively.
  - 5. The apparatus of claim 4, wherein the speech level WFST is configured to set a number of a speech waiting state Sn corresponding to a preset minimum speech frame count T_m, and to set a number of noise waiting state Nn corresponding to a latter part silent frame count T_b.
  - 6. The apparatus of claim 5, wherein the speech level WFST is configured to apply a hang-over technique additionally in order to prevent errors of misclassifying the speech class and the noise class from being generated and is implemented according to mathematical equation
    $A = (\Sigma, Q, i, F, E, \lambda, \rho)$
    $\Sigma = (H_0, H_1)$
    $Q = (\mathrm{NOISE}, \mathrm{SPEECH}, \mathrm{BOU}, \mathrm{EOU}, \mathrm{Sn}, \mathrm{Nn}, \mathrm{Vn})$,
    whereas Vn is an nth hang-over state.
  - 7. The apparatus of claim 4, wherein the speech level WFST is configured to set the number of hang-over states for each speech waiting state to be smaller than the latter part silent frame count T_b, and to set the number of hang-over states for each noise waiting state to be smaller than the minimum speech frame count T_m.
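
As a rough illustration of the speech level state set described in claims 4 and 5, the sketch below walks a sequence of frame classes through NOISE, speech waiting, SPEECH, noise waiting, and EOU. The claims list the states but do not spell out the transitions, so the transition rules here are an assumption. The function name `detect_endpoints`, the integer encoding of H₀ and H₁, and the use of a run counter in place of explicit Sn and Nn states are likewise hypothetical. Weights and hang-over states are left out.

```python
H0, H1 = 0, 1  # assumed encoding: H0 = noise class, H1 = speech class


def detect_endpoints(frames, t_m=3, t_b=4):
    """Return (bou_frame, eou_frame): the frame indices at which the BOU and
    EOU states are reached. Either value is None if that state is not reached.

    t_m: minimum speech frame count (number of speech waiting states Sn).
    t_b: latter part silent frame count (number of noise waiting states Nn).
    """
    state, run = "NOISE", 0  # `run` stands in for the index n of Sn / Nn
    bou = eou = None
    for idx, h in enumerate(frames):
        if state == "NOISE" and h == H1:
            state, run = "S", 1            # enter first speech waiting state
        elif state == "S":
            if h == H1:
                run += 1                   # advance to next speech waiting state
            else:
                state, run = "NOISE", 0    # assumed: a noise frame resets the wait
        elif state == "SPEECH" and h == H0:
            state, run = "N", 1            # enter first noise waiting state
        elif state == "N":
            if h == H0:
                run += 1                   # advance to next noise waiting state
            else:
                state, run = "SPEECH", 0   # assumed: a speech frame resets the wait

        if state == "S" and run >= t_m:
            bou = idx                      # BOU reached, then continue in SPEECH
            state, run = "SPEECH", 0
        elif state == "N" and run >= t_b:
            eou = idx                      # EOU is the final state
            break
    return bou, eou


frames = [0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1]
print(detect_endpoints(frames, t_m=3, t_b=4))  # (6, 14)
```

In this example the isolated speech frame at index 2 and the isolated noise frame at index 8 do not trigger a start or an end, which is the kind of behaviour the waiting states are meant to provide.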

8. A method for detecting a speech endpoint by receiving frame units of feature vector converted from a speech signal and detecting a speech endpoint by use of an apparatus for detecting a speech endpoint, the apparatus for detecting a speech endpoint executing:
- analyzing and classifying the feature vector into a speech class and a noise class;
  
  creating a frame level WFST by converting the speech class and the noise class to a WFST format after receiving the speech class and the noise class;
  
  creating a speech level WFST detecting a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
  
  obtaining a combined WFST by combining the frame level WFST with the speech level WFST; and
  
  optimizing the combined WFST.
- - 9. The method of claim 8, wherein, in the step of obtaining the combined WFST, the frame level WFST and the speech level WFST are combined by use of a join operation ($\circ$), among basic operations of a WFST, according to mathematical equation
    $C = F \circ U$, whereas C denotes a combined WFST, F denotes a frame level WFST, and U denotes a speech level WFST.
  - 10. The method of claim 9, wherein, in the step of optimizing the combined WFST, the combined WFST is optimized by use of a minimize operation (min), among basic operations of the WFST, according to mathematical equation
    $D = \min(C)$, whereas D denotes an optimized WFST.
  - 11. The method of claim 9, wherein the creating of a speech level WFST includes 6 states of NOISE, SPEECH, Sn, Nn, BOU (begin of utterance), and EOU (end of utterance) in accordance with the speech class and the noise class and is implemented according to mathematical equation
    $A = (\Sigma, Q, i, F, E, \lambda, \rho)$
    $\Sigma = (H_0, H_1)$
    $Q = (\mathrm{NOISE}, \mathrm{SPEECH}, \mathrm{BOU}, \mathrm{EOU}, \mathrm{Sn}, \mathrm{Nn})$,
    whereas NOISE denotes a noise state, SPEECH denotes a speech state, BOU denotes a speech start state, EOU denotes a speech end state, Sn denotes an nth (n being a natural number) speech waiting state, and Nn denotes an nth noise waiting state, and whereas i is an initial, NOISE state and F is a final state set, which is EOU, and whereas E denotes a transition function set, and $\lambda$ and $\rho$ denote a speech class (H) weight and a noise class weight, respectively.
  - 12. A recording media having a computer-readable program written therein for executing the method for detecting a speech endpoint in accordance with claim 8.
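
The combination step $C = F \circ U$ in claims 2 and 9 can be sketched as follows, assuming the "join operation" refers to standard epsilon-free WFST composition with additive (tropical) weights. The dictionary layout, the `compose` function, the toy transducers `F` and `U`, the `END` state, the `-` placeholder output label, and all weights are hypothetical and chosen only for illustration. The minimize step $D = \min(C)$ is not shown.

```python
from collections import deque


def compose(f, u):
    """Epsilon-free composition: match output labels of f with input labels of u.
    Each WFST is a dict with 'start', 'finals' (set) and 'arcs', where
    arcs[state] is a list of (in_label, out_label, weight, next_state)."""
    start = (f["start"], u["start"])
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        qf, qu = state
        arcs[state] = []
        if qf in f["finals"] and qu in u["finals"]:
            finals.add(state)
        for i1, o1, w1, n1 in f["arcs"].get(qf, []):
            for i2, o2, w2, n2 in u["arcs"].get(qu, []):
                if o1 == i2:  # labels must agree for the paths to combine
                    nxt = (n1, n2)
                    arcs[state].append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# Hypothetical frame level WFST: a linear chain over the decisions H1, H1, H0.
F = {"start": 0, "finals": {3}, "arcs": {
    0: [("H1", "H1", 1.0, 1)],
    1: [("H1", "H1", 1.0, 2)],
    2: [("H0", "H0", 1.0, 3)],
}}

# Hypothetical, heavily reduced speech level WFST (no waiting states).
U = {"start": "NOISE", "finals": {"END"}, "arcs": {
    "NOISE": [("H0", "-", 0.0, "NOISE"), ("H1", "BOU", 0.0, "SPEECH")],
    "SPEECH": [("H1", "-", 0.0, "SPEECH"), ("H0", "EOU", 0.0, "END")],
}}

C = compose(F, U)

# Read off the single path through C and accumulate its weight.
state, outputs, total = C["start"], [], 0.0
while C["arcs"][state]:
    _, out, w, state = C["arcs"][state][0]
    outputs.append(out)
    total += w
print(outputs, total)  # ['BOU', '-', 'EOU'] 3.0
```

Because `F` is a single chain, the composed machine has exactly one path, and its output labels mark where the begin and end of utterance fall in the frame sequence.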

Current Assignee
Electronics and Telecommunications Research Institute
Original Assignee
Electronics and Telecommunications Research Institute
Inventors
CHUNG, Hoon; Lee, Sung-Joo; Lee, Yun-Keun

Granted Patent

US 9,396,722 B2
US Class Current

704/248
CPC Class Codes

G10L 15/05   Word boundary detection

G10L 25/27   characterised by the analys...

G10L 25/87   Detection of discrete point...
