Method and apparatus for detecting speech endpoint using weighted finite state transducer

US 9,396,722 B2
Filed: 03/25/2014
Issued: 07/19/2016
Est. Priority Date: 06/20/2013
Status: Expired due to Fees

Abstract

Disclosed are an apparatus and a method for detecting a speech endpoint using a WFST. The apparatus in accordance with an embodiment of the present invention includes: a speech decision portion configured to receive frame units of a feature vector converted from a speech signal and to analyze and classify the received feature vector into a speech class or a noise class; a frame level WFST configured to receive the speech class and the noise class and to convert the speech class and the noise class to a WFST format; a speech level WFST configured to detect a speech endpoint by analyzing a relationship between the speech class and the noise class and a preset state; a WFST combination portion configured to combine the frame level WFST with the speech level WFST; and an optimization portion configured to optimize the combined WFST, in which the frame level WFST and the speech level WFST are combined, to have a minimum route.
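The speech decision step in the abstract can be illustrated with a minimal sketch. The abstract does not say which features or classifier are used, so the log-energy feature, the `frame_len` of 160 samples, and the `threshold` of -5.0 below are all assumptions made for illustration, not the patented method.

```python
import math

def frame_log_energy(frame):
    """Log of the mean squared amplitude; the small floor avoids log(0)."""
    power = sum(x * x for x in frame) / len(frame)
    return math.log(power + 1e-12)

def classify_frames(samples, frame_len=160, threshold=-5.0):
    """Split samples into non-overlapping frames and label each one
    H1 (speech class) or H0 (noise class).

    frame_len and threshold are hypothetical values chosen for this sketch.
    """
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        labels.append("H1" if frame_log_energy(frame) > threshold else "H0")
    return labels

# Two quiet frames followed by two loud frames.
quiet = [0.001] * 320
loud = [0.5, -0.5] * 160
print(classify_frames(quiet + loud))  # ['H0', 'H0', 'H1', 'H1']
```

A real system would more likely classify a multi-dimensional feature vector with a statistical model; the single energy feature only keeps the example short.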

162 Citations

12 Claims

1. An apparatus for detecting a speech endpoint, comprising:
- a voice-operated user interface to receive a speech signal using a speech input device;
  
  a storage to store the speech signal received by the voice-operated user interface; and
  
  a special purpose computer comprising at least one specially programmed processor to execute one or more programs to perform speech recognition by detecting a speech endpoint of the speech signal, the at least one specially programmed processor comprising:
  
  a speech decision portion configured to receive frame units of a feature vector converted from the speech signal and to analyze and classify the received feature vector into a speech class or a noise class;
  
  a frame level weighted finite state transducer (WFST) configured to receive the speech class and the noise class and to convert the speech class and the noise class to a WFST format;
  
  a speech level WFST configured to detect a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
  
  a WFST combination portion configured to combine the frame level WFST with the speech level WFST; and
  
  an optimization portion configured to optimize the combined WFST having the frame level WFST and the speech level WFST combined therein to have a minimum route, wherein the special purpose computer performs speech recognition based on the detected speech endpoint of the speech signal.
  - 2. The apparatus of claim 1, wherein the WFST combination portion is configured to combine the frame level WFST with the speech level WFST by use of a join operation (∘), among basic operations of a WFST, according to mathematical equation
    C = F ∘ U, whereas C denotes a combined WFST, F denotes a frame level WFST, and U denotes a speech level WFST.
  - 3. The apparatus of claim 2, wherein the optimization portion is configured to optimize the combined WFST by use of a minimize operation (min), among basic operations of the WFST, according to mathematical equation
    D = min(C), whereas D denotes an optimized WFST.
  - 4. The apparatus of claim 3, wherein the speech level WFST includes six states of NOISE, SPEECH, Sn, Nn, BOU (begin of utterance), and EOU (end of utterance) in accordance with the speech class H₁ and the noise class H₀ and is implemented according to mathematical equation
    A = (Σ, Q, i, F, E, λ, ρ)
    Σ ∈ (H₀, H₁)
    Q = (NOISE, SPEECH, BOU, EOU, Sn, Nn),
    whereas NOISE denotes a noise state, SPEECH denotes a speech state, BOU denotes a speech start state, EOU denotes a speech end state, Sn denotes an nth (n being a natural number) speech waiting state, and Nn denotes an nth noise waiting state, and whereas i is an initial state set for a NOISE state and F is a final state set, which is EOU, and whereas E denotes a transition function set, and λ and ρ denote a speech class weight and a noise class weight, respectively.
  - 5. The apparatus of claim 4, wherein the speech level WFST is configured to set a number of a speech waiting state Sn corresponding to a preset minimum speech frame count T_m, and to set a number of noise waiting state Nn corresponding to a latter part silent frame count T_b.
  - 6. The apparatus of claim 5, wherein the speech level WFST is configured to apply a hang-over technique additionally in order to prevent errors of misclassifying the speech class and the noise class from being generated and is implemented according to mathematical equation
    A = (Σ, Q, i, F, E, λ, ρ)
    Σ ∈ (H₀, H₁)
    Q = (NOISE, SPEECH, BOU, EOU, Sn, Nn, Vn),
    whereas Vn is an nth hang-over state.
  - 7. The apparatus of claim 6, wherein the speech level WFST is configured to set the number of hang-over states for each speech waiting state to be smaller than the latter part silent frame count T_b, and to set the number of hang-over states for each noise waiting state to be smaller than the minimum speech frame count T_m.
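
The state set recited in claims 4 and 5 can be read as a small state machine driven by the per-frame class labels. The sketch below is one plausible reading, not the patented implementation: the claims name the states but do not spell out the transitions, so the transition rules, the function name `detect_endpoints`, and the default values of `t_m` and `t_b` are assumptions. The waiting states Sn and Nn are modeled as a state name plus a counter, weights are omitted, and the hang-over states Vn of claims 6 and 7 are not modeled.

```python
def detect_endpoints(labels, t_m=3, t_b=4):
    """Return (begin_frame, last_speech_frame); an entry is None if not found.

    labels: sequence of "H1" (speech class) or "H0" (noise class).
    t_m: minimum number of consecutive speech frames before BOU is declared.
    t_b: number of consecutive trailing noise frames before EOU is declared.
    """
    state, count = "NOISE", 0          # NOISE is the initial state
    begin = end = None
    for idx, label in enumerate(labels):
        if state == "NOISE":
            if label == "H1":
                state, count = "S", 1  # enter speech waiting state S1
        elif state == "S":
            if label == "H1":
                count += 1             # advance S(count) to S(count + 1)
            else:
                state, count = "NOISE", 0
        elif state == "SPEECH":
            if label == "H0":
                state, count = "N", 1  # enter noise waiting state N1
        elif state == "N":
            if label == "H0":
                count += 1             # advance N(count) to N(count + 1)
            else:
                state, count = "SPEECH", 0

        if state == "S" and count >= t_m:
            begin = idx - t_m + 1      # BOU: first frame of the speech run
            state, count = "SPEECH", 0
        elif state == "N" and count >= t_b:
            end = idx - t_b            # EOU: last frame before the silence
            state = "EOU"              # EOU is the final state
            break
    return begin, end

labels = ["H0", "H0", "H1", "H1", "H1", "H1", "H0",
          "H1", "H1", "H0", "H0", "H0", "H0", "H0"]
print(detect_endpoints(labels))  # (2, 8)
```

In this run the single noise frame at index 6 is shorter than `t_b`, so the machine returns to SPEECH and the utterance is reported as frames 2 through 8.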

8. A method for performing speech recognition by detecting a speech endpoint, the method comprising:
- receiving a speech signal obtained through a speech input device of a voice-operated user interface;
  
  storing the received speech signal in a storage;
  
  receiving frame units of a feature vector converted from the speech signal stored in the storage;
  
  analyzing and classifying the feature vector into a speech class and a noise class;
  
  creating a frame level weighted finite state transducer (WFST) by converting the speech class and the noise class to a WFST format after receiving the speech class and the noise class;
  
  creating a speech level WFST detecting a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
  
  obtaining a combined WFST by combining the frame level WFST with the speech level WFST;
  
  optimizing the combined WFST; and
  
  performing speech recognition based on the detected speech endpoint of the speech signal, wherein the analyzing and classifying the feature vector, creating the frame level WFST, creating the speech level WFST, obtaining the combined WFST, optimizing the combined WFST, and performing speech recognition are performed using at least one specially programmed processor of a special purpose computer.
  - 9. The method of claim 8, wherein obtaining the combined WFST comprises combining the frame level WFST and the speech level WFST by use of a join operation (∘), among basic operations of a WFST, according to mathematical equation
    C = F ∘ U, whereas C denotes a combined WFST, F denotes a frame level WFST, and U denotes a speech level WFST.
  - 10. The method of claim 9, wherein optimizing the combined WFST comprises using a minimize operation (min), among basic operations of the WFST, according to mathematical equation
    D = min(C), whereas D denotes an optimized WFST.
  - 11. The method of claim 9, wherein the creating of a speech level WFST includes six states of NOISE, SPEECH, Sn, Nn, BOU (begin of utterance), and EOU (end of utterance) in accordance with the speech class H₁ and the noise class H₀ and is implemented according to mathematical equation
    A = (Σ, Q, i, F, E, λ, ρ)
    Σ ∈ (H₀, H₁)
    Q = (NOISE, SPEECH, BOU, EOU, Sn, Nn),
    whereas NOISE denotes a noise state, SPEECH denotes a speech state, BOU denotes a speech start state, EOU denotes a speech end state, Sn denotes an nth (n being a natural number) speech waiting state, and Nn denotes an nth noise waiting state, and whereas i is an initial state set for a NOISE state and F is a final state set, which is EOU, and whereas E denotes a transition function set, and λ and ρ denote a speech class weight and a noise class weight, respectively.
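
The combination C = F ∘ U in claim 9 can be sketched in plain Python for a toy case. Everything below is an illustration under stated assumptions: the dictionary layout, the helper names `compose`, `best_path`, and `frame_level`, the per-frame costs, and the choice of one waiting state on each side are all hypothetical. Weights are added as in the tropical semiring and there are no epsilon arcs. The last step is a lowest-cost path search used to read the endpoint events off the combined machine; it is a stand-in and does not implement the minimize operation of claim 10.

```python
import heapq

def compose(f, u):
    """Epsilon-free composition: an arc of f pairs with an arc of u when
    f's output label equals u's input label; weights are added."""
    start = (f["start"], u["start"])
    arcs, finals, seen, stack = [], {}, {start}, [start]
    while stack:
        p, q = stack.pop()
        if p in f["finals"] and q in u["finals"]:
            finals[(p, q)] = f["finals"][p] + u["finals"][q]
        for (ps, fi, fo, fw, pd) in f["arcs"]:
            if ps != p:
                continue
            for (qs, ui, uo, uw, qd) in u["arcs"]:
                if qs == q and ui == fo:
                    dst = (pd, qd)
                    arcs.append(((p, q), fi, uo, fw + uw, dst))
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    return {"start": start, "finals": finals, "arcs": arcs}

def best_path(t):
    """Dijkstra search (non-negative weights); returns (cost, output labels)."""
    counter = 0  # tie-breaker so the heap never compares states
    heap = [(0.0, counter, t["start"], (), False)]
    done = set()
    while heap:
        cost, _, state, out, finished = heapq.heappop(heap)
        if finished:
            return cost, list(out)
        if state in done:
            continue
        done.add(state)
        if state in t["finals"]:
            counter += 1
            heapq.heappush(heap, (cost + t["finals"][state], counter, state, out, True))
        for (src, _ilabel, olabel, w, dst) in t["arcs"]:
            if src == state and dst not in done:
                counter += 1
                heapq.heappush(heap, (cost + w, counter, dst, out + (olabel,), False))
    return None

def frame_level(costs):
    """Linear lattice F: frame t offers an H0 arc and an H1 arc."""
    arcs = []
    for t, (c0, c1) in enumerate(costs):
        arcs.append((t, "H0", "H0", c0, t + 1))
        arcs.append((t, "H1", "H1", c1, t + 1))
    return {"start": 0, "finals": {len(costs): 0.0}, "arcs": arcs}

# Speech-level machine U with one waiting state per side (hypothetical
# T_m = 2, T_b = 2). "-" means no event; all weights are 0 for simplicity.
U = {
    "start": "NOISE",
    "finals": {"EOU": 0.0},
    "arcs": [
        ("NOISE", "H0", "-", 0.0, "NOISE"),
        ("NOISE", "H1", "-", 0.0, "S1"),
        ("S1", "H0", "-", 0.0, "NOISE"),
        ("S1", "H1", "BOU", 0.0, "SPEECH"),
        ("SPEECH", "H1", "-", 0.0, "SPEECH"),
        ("SPEECH", "H0", "-", 0.0, "N1"),
        ("N1", "H1", "-", 0.0, "SPEECH"),
        ("N1", "H0", "EOU", 0.0, "EOU"),
        ("EOU", "H0", "-", 0.0, "EOU"),
        ("EOU", "H1", "-", 0.0, "EOU"),
    ],
}

# Hypothetical per-frame costs (cost of H0, cost of H1).
costs = [(0.1, 2.0), (1.5, 0.2), (1.2, 0.3), (0.9, 0.6), (0.2, 1.8), (0.1, 2.2)]
C = compose(frame_level(costs), U)
cost, outputs = best_path(C)
events = [(i, label) for i, label in enumerate(outputs) if label != "-"]
print(events)  # [(2, 'BOU'), (5, 'EOU')]
```

The event indices mark the frame on which each decision is emitted, so BOU at index 2 confirms a speech run that began at frame 1. A production system would normally rely on an existing WFST toolkit rather than hand-written composition.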

12. A non-transitory recording media having one or more computer-readable programs written therein, that when executed performs a method for performing speech recognition by detecting a speech endpoint, the method comprising:
- receiving a speech signal obtained through a speech input device of a voice-operated user interface;
  
  storing the received speech signal in a storage;
  
  receiving frame units of a feature vector converted from the speech signal stored in the storage;
  
  analyzing and classifying the feature vector into a speech class and a noise class;
  
  creating a frame level weighted finite state transducer (WFST) by converting the speech class and the noise class to a WFST format after receiving the speech class and the noise class;
  
  creating a speech level WFST detecting a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
  
  obtaining a combined WFST by combining the frame level WFST with the speech level WFST;
  
  optimizing the combined WFST; and
  
  performing speech recognition based on the detected speech endpoint of the speech signal, wherein the analyzing and classifying the feature vector, creating the frame level WFST, creating the speech level WFST, obtaining the combined WFST, and optimizing the combined WFST, and performing speech recognition are performed using at least one specially programmed processor of a special purpose computer.

Current Assignee: Electronics and Telecommunications Research Institute
Original Assignee: Electronics and Telecommunications Research Institute
Inventors: Chung, Hoon; Lee, Sung-Joo; Lee, Yun-Keun
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Kim, Jonathan
Application Number: US14/224,626
Publication Number: US 20140379345A1
Time in Patent Office: 847 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G10L 15/05   Word boundary detection
G10L 25/27   characterised by the analys...
G10L 25/87   Detection of discrete point...
