Factorial hidden Markov model for audiovisual speech recognition

US 20030212556A1
Filed: 05/09/2002
Published: 11/13/2003
Est. Priority Date: 05/09/2002
Status: Active Grant

Abstract

A speech recognition method includes use of synchronous or asynchronous audio and video data to enhance speech recognition probabilities. A two stream factorial hidden Markov model is trained and used to identify speech. At least one stream is derived from audio data and a second stream is derived from mouth pattern data. Gestural or other suitable data streams can optionally be combined to reduce speech recognition error rates in noisy environments.

27 Claims

1. A speech recognition method for audiovisual data comprising constructing a distributed state representation hidden Markov model for audiovisual data, and providing maximum likelihood training for the distributed state representation hidden Markov model to identify words.
  - 2. The method of claim 1, wherein the distributed state representation hidden Markov model is a factorial hidden Markov model.
  - 3. The method of claim 1, further comprising maximum likelihood initialization using a Viterbi algorithm.
  - 4. The method of claim 1, wherein audiovisual data is separated and processed as at least two data streams.
  - 5. The method of claim 4, wherein the data streams are asynchronous.
  - 6. The method of claim 1, further comprising separation of audiovisual data into an audio stream and a video stream, and use of a Viterbi algorithm to determine optimal sequence of states for the coupled nodes of the audio and video streams that maximizes the observation likelihood during maximum likelihood training.
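
The Viterbi step named in claims 3 and 6 can be illustrated with a minimal sketch. It assumes a factorial HMM with one hidden chain per stream, independent per-chain transition matrices, and a joint emission table; the function name `fhmm_viterbi` and all parameter names are hypothetical and do not come from the patent. The search runs over the product of the audio and video state spaces, which is only one straightforward way to find the best joint state sequence and is not presented as the claimed method.

```python
import math
from itertools import product

def fhmm_viterbi(log_pi_a, log_pi_v, log_A_a, log_A_v, log_emit):
    """Best joint (audio, video) state path for a two-chain factorial HMM.

    log_pi_a, log_pi_v: log initial probabilities for each chain.
    log_A_a, log_A_v:   log transition matrices, indexed [previous][next].
    log_emit[t][i][j]:  log p(o_t | audio state i, video state j) (assumed given).
    """
    n_a, n_v = len(log_pi_a), len(log_pi_v)
    states = list(product(range(n_a), range(n_v)))

    # Initialise with the factorised prior times the joint emission at t = 0.
    delta = {(i, j): log_pi_a[i] + log_pi_v[j] + log_emit[0][i][j]
             for i, j in states}
    back = []

    for t in range(1, len(log_emit)):
        new_delta, ptr = {}, {}
        for i, j in states:
            best_prev, best_score = None, -math.inf
            for pa, pv in states:
                # The joint transition factorises across the two chains.
                score = delta[(pa, pv)] + log_A_a[pa][i] + log_A_v[pv][j]
                if score > best_score:
                    best_prev, best_score = (pa, pv), score
            new_delta[(i, j)] = best_score + log_emit[t][i][j]
            ptr[(i, j)] = best_prev
        delta = new_delta
        back.append(ptr)

    # Backtrack from the best final joint state.
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]
```

The cost per frame grows with the square of the number of joint states, so this exhaustive form is practical only for small per-stream state counts.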

7. A speech recognition method comprising using an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data, and applying a two stream factorial hidden Markov model to the first and second data streams for speech recognition.
  - 8. The method of claim 7, wherein the audio and video data sets providing the first and second data streams are asynchronous.
  - 9. The method of claim 7, further comprising parallel processing of the first and second data streams.
  - 10. The method of claim 7, further comprising visual feature extraction of a mouth region from the video data set.
  - 11. The method of claim 7, further comprising visual feature extraction from the video data set using a variable shape window and application of a two dimensional discrete transform.
  - 12. The method of claim 7, further comprising visual feature extraction from the video data set using linear discriminant analysis.
  - 13. The method of claim 7, further comprising training of the two stream factorial hidden Markov model using a Viterbi algorithm.
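
The visual feature extraction in claims 10 and 11 can be sketched as follows. The claims name only "a two dimensional discrete transform"; the discrete cosine transform used here is an assumed choice, and the rectangular crop is a simplification of the variable shape window. The helper names `dct2` and `mouth_features`, the window coordinates, and the `keep` parameter are all hypothetical.

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a rectangular list-of-lists image block."""
    rows, cols = len(block), len(block[0])

    def alpha(k, n):
        # Normalisation that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            s = 0.0
            for x in range(rows):
                for y in range(cols):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * rows))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * cols)))
            out[u][v] = alpha(u, rows) * alpha(v, cols) * s
    return out

def mouth_features(frame, top, left, height, width, keep=3):
    """Crop an assumed mouth window and keep its low-frequency coefficients."""
    window = [row[left:left + width] for row in frame[top:top + height]]
    coeffs = dct2(window)
    # Flatten the top-left keep x keep corner into one feature vector.
    return [coeffs[u][v]
            for u in range(min(keep, height))
            for v in range(min(keep, width))]
```

Keeping only the low-frequency corner is a common way to shrink the feature vector before a step such as the linear discriminant analysis named in claim 12, which is not shown here.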

14. An article comprising a computer readable medium to store computer executable instructions, the instructions defined to cause a computer to use an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data, and apply a two stream factorial hidden Markov model to the first and second data streams for speech recognition.
  - 15. The article comprising a computer readable medium to store computer executable instructions of claim 14, wherein the instructions further cause a computer to process asynchronous first and second data streams.
  - 16. The article comprising a computer readable medium to store computer executable instructions of claim 14, wherein the instructions further cause a computer to parallel process the first and second data streams.
  - 17. The article comprising a computer readable medium to store computer executable instructions of claim 14, wherein the instructions further cause a computer to provide visual feature extraction of a mouth region from the video data set.
  - 18. The article comprising a computer readable medium to store computer executable instructions of claim 14, wherein the instructions further cause a computer to provide visual feature extraction from the video data set using a variable shape window and application of a two dimensional discrete transform.
  - 19. The article comprising a computer readable medium to store computer executable instructions of claim 14, wherein the instructions further cause a computer to provide visual feature extraction from the video data set using linear discriminant analysis.
  - 20. The article comprising a computer readable medium to store computer executable instructions of claim 14, wherein the instructions further cause a computer to train the two stream factorial hidden Markov model using a Viterbi algorithm.

21. A speech recognition system comprising an audiovisual capture module to capture an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data, and a speech recognition module that applies a two stream factorial hidden Markov model to the first and second data streams for speech recognition.
  - 22. The speech recognition system of claim 21, further comprising asynchronous audio and video data sets.
  - 23. The speech recognition system of claim 21, further comprising parallel processing of the first and second data streams by the speech recognition module.
  - 24. The speech recognition system of claim 21, further comprising visual feature extraction of a mouth region from the video data set by the audiovisual capture module.
  - 25. The speech recognition system of claim 21, further comprising visual feature extraction from the video data set using a variable shape window and application of a two dimensional discrete transform by the audiovisual capture module.
  - 26. The speech recognition system of claim 21, further comprising visual feature extraction from the video data set by the audiovisual capture module using linear discriminant analysis.
  - 27. The speech recognition system of claim 21, further comprising training of the two stream factorial hidden Markov model by the speech recognition module using a Viterbi algorithm.

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Nefian, Ara V.
Granted Patent: US 7,209,883 B2
US Class Current: 704/256

CPC Class Codes
G06F 18/295   Markov models or related mo...
G06V 40/20   Movements or behaviour, e.g...
G10L 15/142   Hidden Markov Models [HMMs]
G10L 15/24   Speech recognition using no...
