Audio-visual feature fusion and support vector machine useful for continuous speech recognition
First Claim
1. A method for recognizing speech by fusing audio and visual features, comprising generating an audio vector representing detected audio data of a speech utterance, detecting a face in a video data stream linked to the audio data of the speech utterance, applying a cascade of linear support vector machine classifiers to the detected face to locate a mouth region, generating vector data for the mouth region, training a hidden Markov model (HMM) by fusing audio and visual vector data with the HMM, and recognizing an input speech by extracting audio and visual features and by comparing the extracted audio and visual features with HMMs obtained at least in part through audio and visual fusion.
Abstract
A speech recognition method is described in several embodiments that apply support vector machine analysis to a mouth region. Lip position can be accurately determined and used together with synchronous or asynchronous audio data to improve the probability of correct speech recognition.
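As an illustration of the synchronous case, the following is a minimal sketch of frame-level fusion, assuming that fusion means concatenating each audio feature vector with the visual feature vector nearest to it in time. The function name `fuse_frames`, the frame counts, and the feature values are hypothetical and are not taken from the specification; the nearest-frame alignment is likewise an assumption made to reconcile a video stream that has fewer frames than the audio stream.

```python
from typing import List


def fuse_frames(audio: List[List[float]], visual: List[List[float]]) -> List[List[float]]:
    """Concatenate each audio vector with the visual vector aligned to it in time.

    Assumption: both streams cover the same time span, and the visual stream
    may have fewer frames than the audio stream.
    """
    if not audio or not visual:
        raise ValueError("both streams must contain at least one frame")
    fused = []
    for i, a in enumerate(audio):
        # Map audio frame index i onto the visual stream by proportional position.
        j = min(i * len(visual) // len(audio), len(visual) - 1)
        fused.append(a + visual[j])  # list concatenation gives the joint vector
    return fused


# Hypothetical toy data: 4 audio frames (1 feature each), 2 video frames (2 features each).
audio = [[0.1], [0.2], [0.3], [0.4]]
visual = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_frames(audio, visual)
```

Each fused vector could then serve as one observation for a single HMM. An asynchronous treatment would instead keep the two streams separate and let their states drift relative to each other, which this sketch does not attempt.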
20 Claims
1. A method for recognizing speech by fusing audio and visual features, comprising
generating an audio vector representing detected audio data of a speech utterance, detecting a face in a video data stream linked to the audio data of the speech utterance, applying a cascade of linear support vector machine classifiers to the detected face to locate a mouth region, generating vector data for the mouth region, training a hidden Markov model (HMM) by fusing audio and visual vector data with the HMM, and recognizing an input speech by extracting audio and visual features and by comparing the extracted audio and visual features with HMMs obtained at least in part through audio and visual fusion.
11. An article comprising a computer readable medium to store computer executable instructions, the instructions defined to cause a computer to recognize speech by fusing audio and visual features via operations including:
generating an audio vector representing detected audio data of a speech utterance, detecting a face in a video data stream linked to the audio data of the speech utterance, applying a cascade of linear support vector machine classifiers to the detected face to locate a mouth region, generating vector data for the mouth region, training a hidden Markov model (HMM) by fusing audio and visual vector data with the HMM, and recognizing an input speech by extracting audio and visual features and by comparing the extracted audio and visual features with HMMs obtained at least in part through audio and visual fusion.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20.
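The mouth-location step recited in claims 1 and 11 applies a cascade of linear support vector machine classifiers. The following is a minimal sketch of how such a cascade might evaluate candidate windows, assuming each stage is a linear decision function w·x + b and a window is kept only if every stage scores it as non-negative. The weights are hand-picked for illustration rather than trained, the window is one-dimensional for brevity (a real detector would scan two-dimensional image patches), and the names `cascade_accepts` and `locate_mouth` are hypothetical.

```python
from typing import List, Sequence, Tuple

# One stage of the cascade: (weights w, bias b) of a linear classifier.
Stage = Tuple[Sequence[float], float]


def linear_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Decision value w.x + b of a linear SVM."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def cascade_accepts(stages: List[Stage], x: Sequence[float]) -> bool:
    """Return True only if every stage scores the window as positive."""
    for w, b in stages:
        if linear_score(w, b, x) < 0.0:
            return False  # early rejection: later stages are never evaluated
    return True


def locate_mouth(row: Sequence[float], width: int, stages: List[Stage]) -> List[int]:
    """Slide a window over `row` and return the start offsets the cascade accepts."""
    return [
        start
        for start in range(len(row) - width + 1)
        if cascade_accepts(stages, row[start:start + width])
    ]


# Hypothetical, untrained weights: stage 1 passes windows with a dark centre,
# stage 2 passes windows whose two outer pixels are bright.
stages = [
    ([0.0, -1.0, 0.0], 0.5),   # passes when centre pixel <= 0.5
    ([1.0, 0.0, 1.0], -1.0),   # passes when left + right >= 1.0
]
row = [0.9, 0.8, 0.1, 0.9, 0.7, 0.6]
hits = locate_mouth(row, 3, stages)
```

The point of the cascade structure is that most windows are discarded by the first, cheapest stage, so the later stages run on only a small fraction of candidates.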
Specification