Factorial hidden Markov model for audiovisual speech recognition
First Claim
1. A speech recognition method for audiovisual data comprising
obtaining a first data stream of speech data and a second data stream of face image data while a speaker is speaking;
extracting visual features from the second data stream by masking, resizing, rotating, and normalizing a mouth region in a face image, and by using a two-dimensional discrete cosine transform;
constructing a factorial hidden Markov model for the first data stream and the second data stream, the factorial hidden Markov model including a plurality of hidden Markov models with each hidden Markov model having a plurality of discrete nodes and continuous observable nodes, wherein discrete nodes at a first time for each hidden Markov model are conditioned by discrete nodes at a second time of the plurality of hidden Markov models; and
providing maximum likelihood training for the factorial hidden Markov model to identify words.
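The feature-extraction step in the claim can be illustrated with a minimal sketch. It assumes the mouth region has already been masked, resized, and rotated upstream, and covers only the normalization and two-dimensional discrete cosine transform. The function names, the 4x4 patch, and the choice to keep a small block of low-frequency coefficients are hypothetical and are not taken from the patent.

```python
import math

def dct_1d(values):
    """Orthonormal DCT-II of a sequence of floats."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def dct_2d(image):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major

def normalize(image):
    """Shift and scale pixel values to zero mean and unit variance."""
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    var = sum((p - mean) ** 2 for p in flat) / len(flat)
    std = math.sqrt(var) or 1.0  # avoid division by zero on flat images
    return [[(p - mean) / std for p in row] for row in image]

def visual_features(mouth_region, keep=2):
    """Return the keep x keep low-frequency DCT coefficients as a flat list."""
    coeffs = dct_2d(normalize(mouth_region))
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

# Hypothetical 4x4 grayscale mouth patch (assumed already masked,
# resized, and rotated by earlier processing).
patch = [
    [10, 20, 30, 10],
    [30, 80, 90, 40],
    [20, 70, 80, 30],
    [10, 20, 20, 10],
]
features = visual_features(patch, keep=2)
```

Because the patch is normalized to zero mean, the first (DC) coefficient is approximately zero, so the remaining low-frequency coefficients carry the shape information. A practical system would likely use a larger patch and an optimized transform from a numerical library.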
Abstract
A speech recognition method includes use of synchronous or asynchronous audio and video data to enhance speech recognition probabilities. A two stream factorial hidden Markov model is trained and used to identify speech. At least one stream is derived from audio data and a second stream is derived from mouth pattern data. Gestural or other suitable data streams can optionally be combined to reduce speech recognition error rates in noisy environments.
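As an illustration of how a two-stream model of this kind might score an utterance, the sketch below runs a forward pass over the joint state of an audio chain and a video chain. Following the claim wording, each chain's state at one frame is conditioned on the states of both chains at the previous frame, which lets the streams drift out of step. Everything here is an assumption made for illustration: the one-dimensional Gaussian emissions, the two-state left-to-right chains, the toy parameters, and the word labels "A" and "B". Maximum likelihood training is not shown; the parameters are simply fixed by hand.

```python
import math
from itertools import product

def gaussian(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def forward_likelihood(obs_a, obs_v, prior, trans_a, trans_v, emit_a, emit_v):
    """Likelihood of paired audio/video observations under a two-chain model.

    prior[(i, k)]       : P(audio state i, video state k) at the first frame
    trans_a[(i, k)][j]  : P(audio state j now | audio i, video k one frame ago)
    trans_v[(i, k)][l]  : P(video state l now | audio i, video k one frame ago)
    emit_a[j], emit_v[l]: (mean, variance) of each state's Gaussian emission
    """
    joint = list(product(range(len(emit_a)), range(len(emit_v))))

    def emission(j, l, t):
        return gaussian(obs_a[t], *emit_a[j]) * gaussian(obs_v[t], *emit_v[l])

    alpha = {s: prior[s] * emission(s[0], s[1], 0) for s in joint}
    for t in range(1, len(obs_a)):
        alpha = {
            (j, l): emission(j, l, t) * sum(
                alpha[prev] * trans_a[prev][j] * trans_v[prev][l]
                for prev in joint)
            for (j, l) in joint
        }
    return sum(alpha.values())

def make_model(audio_means, video_means):
    """Build hypothetical parameters for one word model."""
    prior = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    # Left-to-right chains; a chain advances sooner when the other is ahead.
    trans_a = {(0, 0): [0.6, 0.4], (0, 1): [0.3, 0.7],
               (1, 0): [0.0, 1.0], (1, 1): [0.0, 1.0]}
    trans_v = {(0, 0): [0.6, 0.4], (1, 0): [0.3, 0.7],
               (0, 1): [0.0, 1.0], (1, 1): [0.0, 1.0]}
    emit_a = [(m, 1.0) for m in audio_means]
    emit_v = [(m, 1.0) for m in video_means]
    return prior, trans_a, trans_v, emit_a, emit_v

models = {
    "A": make_model((0.0, 3.0), (0.0, 2.0)),
    "B": make_model((3.0, 0.0), (2.0, 0.0)),
}
# Toy observation sequences: the video stream changes one frame earlier.
obs_a = [0.1, 0.2, 2.9, 3.1]
obs_v = [0.0, 1.9, 2.1, 2.0]
scores = {name: forward_likelihood(obs_a, obs_v, *params)
          for name, params in models.items()}
best = max(scores, key=scores.get)  # word model with the highest likelihood
```

Recognition in this sketch amounts to evaluating each candidate word model and picking the highest-scoring one. For longer sequences the recursion would normally be carried out in the log domain or with per-frame scaling to avoid numerical underflow.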
17 Claims
1. A speech recognition method for audiovisual data comprising
obtaining a first data stream of speech data and a second data stream of face image data while a speaker is speaking;
extracting visual features from the second data stream by masking, resizing, rotating, and normalizing a mouth region in a face image, and by using a two-dimensional discrete cosine transform;
constructing a factorial hidden Markov model for the first data stream and the second data stream, the factorial hidden Markov model including a plurality of hidden Markov models with each hidden Markov model having a plurality of discrete nodes and continuous observable nodes, wherein discrete nodes at a first time for each hidden Markov model are conditioned by discrete nodes at a second time of the plurality of hidden Markov models; and
providing maximum likelihood training for the factorial hidden Markov model to identify words.
Dependent claims: 2, 3, 4, 5
6. A speech recognition method comprising
using an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data,
extracting visual features from the second data stream by masking, resizing, rotating, and normalizing a mouth region in a face image, and by using a two-dimensional discrete cosine transform, and
applying a two stream factorial hidden Markov model (“HMM”) to the first and second data streams for speech recognition, the factorial HMM including two HMMs with one corresponding to the first data stream and the other corresponding to the second data stream, each HMM having a plurality of discrete nodes and continuous observable nodes, wherein discrete nodes at a first time for each HMM are conditioned by discrete nodes at a second time of the two HMMs.
Dependent claims: 7, 8, 9
10. An article comprising a computer readable medium to store computer executable instructions, the instructions defined to cause a computer to
use an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data,
extract visual features from the second data stream by masking, resizing, rotating, and normalizing a mouth region in a face image, and by using a two-dimensional discrete cosine transform, and
apply a two stream factorial hidden Markov model (“HMM”) to the first and second data streams for speech recognition, the factorial HMM including two HMMs with one corresponding to the first data stream and the other corresponding to the second data stream, each HMM having a plurality of discrete nodes and continuous observable nodes, wherein discrete nodes at a first time for each HMM are conditioned by discrete nodes at a second time of the two HMMs.
Dependent claims: 11, 12, 13
14. A speech recognition system comprising
an audiovisual capture module to capture an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data,
a feature extraction module to extract visual features from the second data stream by masking, resizing, rotating, and normalizing a mouth region in a face image, and by using a two-dimensional discrete cosine transform, and
a speech recognition module that applies a two stream factorial hidden Markov model (“HMM”) to the first and second data streams for speech recognition, the factorial HMM including two HMMs with one corresponding to the first data stream and the other corresponding to the second data stream, each HMM having a plurality of discrete nodes and continuous observable nodes, wherein discrete nodes at a first time for each HMM are conditioned by discrete nodes at a second time of the two HMMs.
Dependent claims: 15, 16, 17
Specification