Method and apparatus for detecting speech endpoint using weighted finite state transducer
First Claim
1. An apparatus for detecting a speech endpoint, comprising:
a voice-operated user interface to receive a speech signal using a speech input device;
a storage to store the speech signal received by the voice-operated user interface; and
a special purpose computer comprising at least one specially programmed processor to execute one or more programs to perform speech recognition by detecting a speech endpoint of the speech signal, the at least one specially programmed processor comprising:
a speech decision portion configured to receive frame units of a feature vector converted from the speech signal and to analyze and classify the received feature vector into a speech class or a noise class;
a frame level weighted finite state transducer (WFST) configured to receive the speech class and the noise class and to convert the speech class and the noise class to a WFST format;
a speech level WFST configured to detect a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
a WFST combination portion configured to combine the frame level WFST with the speech level WFST; and
an optimization portion configured to optimize the combined WFST having the frame level WFST and the speech level WFST combined therein to have a minimum route, wherein the special purpose computer performs speech recognition based on the detected speech endpoint of the speech signal.
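As an illustration of the speech decision portion, the following minimal sketch classifies each frame as speech ("S") or noise ("N"). It assumes a one-dimensional feature (frame log-energy) and a fixed threshold; the claim does not specify the feature vector or the classifier, so `frame_log_energy`, `classify_frames`, and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence


def frame_log_energy(frame: Sequence[float]) -> float:
    """Log of the mean squared sample value (assumed stand-in for a feature vector)."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-12)  # small floor avoids log(0) on silent frames


def classify_frames(frames: Sequence[Sequence[float]], threshold: float = -4.0) -> List[str]:
    """Label each frame 'S' (speech class) or 'N' (noise class).

    The threshold is an arbitrary illustrative value, not taken from the patent.
    """
    return ["S" if frame_log_energy(f) > threshold else "N" for f in frames]


frames = [
    [0.001, 0.001, 0.001, 0.001],   # very low energy
    [0.5, -0.5, 0.5, -0.5],         # high energy
    [0.0, 0.0, 0.0, 0.0],           # digital silence
]
print(classify_frames(frames))
```

A real system would more likely use a trained classifier over spectral features; the point here is only the shape of the output, one class label per frame.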
Abstract
Disclosed are an apparatus and a method for detecting a speech endpoint using a WFST. The apparatus in accordance with an embodiment of the present invention includes: a speech decision portion configured to receive frame units of a feature vector converted from a speech signal and to analyze and classify the received feature vector into a speech class or a noise class; a frame level WFST configured to receive the speech class and the noise class and to convert the speech class and the noise class to a WFST format; a speech level WFST configured to detect a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state; a WFST combination portion configured to combine the frame level WFST with the speech level WFST; and an optimization portion configured to optimize the combined WFST having the frame level WFST and the speech level WFST combined therein to have a minimum route.
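The idea of relating the per-frame speech/noise classes to preset states can be sketched as a small state machine. This is an assumption-laden illustration, not the patented transducer: the state names (`NOISE`, `SPEECH_WAIT`, `IN_SPEECH`, `NOISE_WAIT`) and the two thresholds `min_speech` and `min_trailing_noise` are hypothetical.

```python
from typing import List, Optional, Tuple


def detect_endpoints(
    labels: List[str],
    min_speech: int = 3,
    min_trailing_noise: int = 4,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, end exclusive.

    Speech is confirmed after `min_speech` consecutive 'S' frames; the endpoint
    is confirmed after `min_trailing_noise` consecutive 'N' frames.
    """
    state, run, start = "NOISE", 0, None
    for i, label in enumerate(labels):
        if state == "NOISE":
            if label == "S":
                state, run, start = "SPEECH_WAIT", 1, i
        elif state == "SPEECH_WAIT":
            if label == "S":
                run += 1
            else:                      # burst too short: treat it as noise
                state, run, start = "NOISE", 0, None
        elif state == "IN_SPEECH":
            if label == "N":
                state, run = "NOISE_WAIT", 1
        elif state == "NOISE_WAIT":
            if label == "N":
                run += 1
            else:                      # pause too short: still inside speech
                state, run = "IN_SPEECH", 0

        # Promotions are checked after the update so thresholds of 1 also work.
        if state == "SPEECH_WAIT" and run >= min_speech:
            state, run = "IN_SPEECH", 0
        elif state == "NOISE_WAIT" and run >= min_trailing_noise:
            return start, i - run + 1  # first frame of the trailing noise
    return None                        # no confirmed endpoint in this input


labels = list("NNSNNSSSSNSSNNNNNN")
print(detect_endpoints(labels))
```

In this example the isolated "S" at frame 2 and the one-frame pause at frame 9 are both ignored, which is the kind of robustness a speech-level model is meant to provide over raw per-frame decisions.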
12 Claims
8. A method for performing speech recognition by detecting a speech endpoint, the method comprising:
receiving a speech signal obtained through a speech input device of a voice-operated user interface;
storing the received speech signal in a storage;
receiving frame units of a feature vector converted from the speech signal stored in the storage;
analyzing and classifying the feature vector into a speech class and a noise class;
creating a frame level weighted finite state transducer (WFST) by converting the speech class and the noise class to a WFST format after receiving the speech class and the noise class;
creating a speech level WFST detecting a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
obtaining a combined WFST by combining the frame level WFST with the speech level WFST;
optimizing the combined WFST; and
performing speech recognition based on the detected speech endpoint of the speech signal,
wherein the analyzing and classifying the feature vector, creating the frame level WFST, creating the speech level WFST, obtaining the combined WFST, optimizing the combined WFST, and performing speech recognition are performed using at least one specially programmed processor of a special purpose computer.
12. A non-transitory recording media having one or more computer-readable programs written therein, that when executed performs a method for performing speech recognition by detecting a speech endpoint, the method comprising:
receiving a speech signal obtained through a speech input device of a voice-operated user interface;
storing the received speech signal in a storage;
receiving frame units of a feature vector converted from the speech signal stored in the storage;
analyzing and classifying the feature vector into a speech class and a noise class;
creating a frame level weighted finite state transducer (WFST) by converting the speech class and the noise class to a WFST format after receiving the speech class and the noise class;
creating a speech level WFST detecting a speech endpoint by analyzing a relationship between the speech class and noise class and a preset state;
obtaining a combined WFST by combining the frame level WFST with the speech level WFST;
optimizing the combined WFST; and
performing speech recognition based on the detected speech endpoint of the speech signal,
wherein the analyzing and classifying the feature vector, creating the frame level WFST, creating the speech level WFST, obtaining the combined WFST, and optimizing the combined WFST, and performing speech recognition are performed using at least one specially programmed processor of a special purpose computer.
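The sequence of steps recited in the method claims (build a frame level WFST, build a speech level WFST, combine them, optimize) can be sketched with a toy, epsilon-free transducer in pure Python. Everything here is an assumption made for illustration: the tuple representation, the per-frame costs, the two-state speech level model with a `switch_penalty`, and in particular the reading of "optimizing" as a lowest-cost path search in the tropical semiring. The claims may equally refer to standard WFST minimization, which this sketch does not implement.

```python
import heapq
from collections import defaultdict

# Toy transducer: (start, finals, arcs); arcs[state] is a list of
# (input_label, output_label, weight, next_state). Weights are costs that
# add up along a path. All names below are illustrative only.


def frame_level_wfst(costs):
    """Linear chain: frame t offers an 'S' arc and an 'N' arc with the given costs."""
    arcs = defaultdict(list)
    for t, (speech_cost, noise_cost) in enumerate(costs):
        arcs[t].append(("S", "S", speech_cost, t + 1))
        arcs[t].append(("N", "N", noise_cost, t + 1))
    return 0, {len(costs)}, arcs


def speech_level_wfst(switch_penalty):
    """Two states; staying is free, switching between noise and speech costs a penalty."""
    arcs = {
        "noise": [("N", "noise", 0.0, "noise"), ("S", "speech", switch_penalty, "speech")],
        "speech": [("S", "speech", 0.0, "speech"), ("N", "noise", switch_penalty, "noise")],
    }
    return "noise", {"noise", "speech"}, arcs


def compose(a, b):
    """Product construction: match a's output labels against b's input labels."""
    (a_start, a_finals, a_arcs), (b_start, b_finals, b_arcs) = a, b
    start = (a_start, b_start)
    arcs, finals, stack, seen = defaultdict(list), set(), [start], {start}
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in a_finals and qb in b_finals:
            finals.add(state)
        for ia, oa, wa, na in a_arcs.get(qa, []):
            for ib, ob, wb, nb in b_arcs.get(qb, []):
                if oa == ib:
                    arcs[state].append((ia, ob, wa + wb, (na, nb)))
                    if (na, nb) not in seen:
                        seen.add((na, nb))
                        stack.append((na, nb))
    return start, finals, arcs


def shortest_path(fst):
    """Dijkstra search (non-negative weights) for the cheapest path to a final state."""
    start, finals, arcs = fst
    heap, done, tie = [(0.0, 0, start, [])], set(), 1
    while heap:
        cost, _, state, outputs = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in finals:
            return cost, outputs
        for _inp, out, weight, nxt in arcs.get(state, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + weight, tie, nxt, outputs + [out]))
                tie += 1
    return None


def endpoint(outputs):
    """Index of the first frame after the last 'speech' output, or None."""
    last = max((i for i, o in enumerate(outputs) if o == "speech"), default=None)
    return None if last is None else last + 1


# Hypothetical (speech_cost, noise_cost) pairs; frame 4 weakly favors noise.
costs = [(2.0, 0.1), (2.0, 0.1), (0.1, 2.0), (0.1, 2.0),
         (1.0, 0.6), (0.1, 2.0), (2.0, 0.1), (2.0, 0.1)]
combined = compose(frame_level_wfst(costs), speech_level_wfst(1.0))
total_cost, outputs = shortest_path(combined)
print(outputs, endpoint(outputs))
```

With these toy costs, a frame-by-frame decision would label frame 4 as noise, whereas the lowest-cost path through the combined transducer keeps it inside the speech segment because switching twice costs more than the weak mismatch. A production system would typically use an established WFST toolkit rather than hand-written composition.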
Specification