Methods and apparatus for audio-visual speaker recognition and utterance verification

US 6,219,640 B1
Filed: 08/06/1999
Issued: 04/17/2001
Est. Priority Date: 08/06/1999
Status: Expired due to Term

Abstract

Methods and apparatus for performing speaker recognition comprise processing a video signal associated with an arbitrary content video source and processing an audio signal associated with the video signal. Then, an identification and/or verification decision is made based on the processed audio signal and the processed video signal. Various decision making embodiments may be employed including, but not limited to, a score combination approach, a feature combination approach, and a re-scoring approach. In another aspect of the invention, a method of verifying a speech utterance comprises processing a video signal associated with a video source and processing an audio signal associated with the video signal. Then, the processed audio signal is compared with the processed video signal to determine a level of correlation between the signals. This is referred to as unsupervised utterance verification. In a supervised utterance verification embodiment, the processed video signal is compared with a script representing an audio signal associated with the video signal to determine a level of correlation between the signals.

61 Claims

1. A method of performing speaker recognition, the method comprising the steps of:
- processing a video signal associated with an arbitrary content video source;
  
  processing an audio signal associated with the video signal; and
  
  making at least one of an identification and verification decision based on the processed audio signal and the processed video signal.
- Dependent claims 2-25:
  - 2. The method of claim 1, wherein the video signal processing operation comprises the step of detecting whether the video signal associated with the arbitrary content video source contains one or more faces.
  - 3. The method of claim 2, wherein the video signal processing operation further comprises the step of detecting one or more facial features on one or more detected faces.
  - 4. The method of claim 3, wherein at least one of face and facial feature detection employ Fisher linear discriminant (FLD) analysis.
  - 5. The method of claim 3, wherein at least one of face and facial feature detection employ a distance from face space (DFFS) measure.
  - 6. The method of claim 3, wherein the video signal processing operation further comprises the step of recognizing one or more faces from the detecting faces using the detected facial features.
  - 7. The method of claim 6, wherein the video signal processing operation further comprises the step of performing a confidence estimation procedure on results of the face recognition operation.
  - 8. The method of claim 6, wherein the audio signal processing operation comprises the step of recognizing a speaker associated with the audio signal.
  - 9. The method of claim 8, wherein the audio signal processing operation further comprises the step of performing a confidence estimation procedure on results of the audio speaker recognition operation.
  - 10. The method of claim 8, wherein respective results of the face recognition and audio speaker recognition operations are used to make at least one of the identification decision and the verification decision.
  - 11. The method of claim 10, wherein the results of one of the recognition operations are used to modify the results of the other of the recognition operations.
  - 12. The method of claim 11, wherein the decision is based on the modified results.
  - 13. The method of claim 10, wherein the results are combined such that one set of top N respective scores are generated for the face recognition and audio speaker recognition operations and used to make the decision.
  - 14. The method of claim 10, wherein the results include the top N respective scores generated during the face recognition and audio speaker recognition operations.
  - 15. The method of claim 14, wherein the top N respective scores are combined using a mixture parameter.
  - 16. The method of claim 15, wherein the mixture parameter is selected within a range which maximizes the highest and the second highest scores.
  - 17. The method of claim 15, wherein the mixture parameter is selected according to a reliability measure associated with the face recognition and audio speaker recognition operations.
  - 18. The method of claim 17, wherein the mixture parameter is optimized in accordance with a cost function representative of an error rate.
  - 19. The method of claim 17, wherein the mixture parameter is optimized in accordance with a cost function representative of a smoothed error rate.
  - 20. The method of claim 1, wherein at least one of the video signal and the audio signal are compressed signals.
  - 21. The method of claim 1, wherein compressed signals are decompressed prior to processing operations.
  - 22. The method of claim 1, wherein the arbitrary content video source provides MPEG-2 standard signals.
  - 23. The method of claim 1, wherein the video signal includes at least one of visible electromagnetic spectrum images, non-visible electromagnetic spectrum images, and images from other sensing techniques.
  - 24. The method of claim 1, further comprising the step of enrolling a user in accordance with at least one of acoustic and visual information.
  - 25. The method of claim 24, wherein the result of the enrollment operation is a combined biometric representing multiple modalities.
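
Claims 13 to 15 describe combining the top N face recognition and audio speaker recognition scores through a mixture parameter. A minimal sketch of one way such a combination could look follows. The linear weighting, the names `fuse_scores` and `verify`, the convention that higher scores are better, and the example scores are all assumptions made for illustration; the claims themselves do not fix the form of the combination.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    audio: Dict[str, float],
    face: Dict[str, float],
    alpha: float,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top-N candidates ranked by a weighted sum of both scores.

    alpha is the (assumed) mixture parameter in [0, 1]: 1.0 trusts only the
    audio score, 0.0 trusts only the face score. Higher scores are better.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Only candidates scored by both modalities can be fused.
    common = audio.keys() & face.keys()
    fused = {s: alpha * audio[s] + (1.0 - alpha) * face[s] for s in common}
    # Sort by descending fused score; break ties by name for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def verify(claimed: str, ranked: List[Tuple[str, float]], threshold: float) -> bool:
    """Accept a claimed identity if it ranks first and clears the threshold."""
    return bool(ranked) and ranked[0][0] == claimed and ranked[0][1] >= threshold


# Hypothetical per-speaker scores from each modality.
audio_scores = {"alice": 0.9, "bob": 0.4, "carol": 0.2}
face_scores = {"alice": 0.3, "bob": 0.8, "carol": 0.1}

ranked = fuse_scores(audio_scores, face_scores, alpha=0.7, top_n=2)
print(ranked[0][0])                  # identification decision
print(verify("bob", ranked, 0.5))    # verification decision for a claimed identity
```

In this sketch the identification decision is the top-ranked fused candidate, and the verification decision checks a claimed identity against that ranking. Choosing `alpha` from a reliability measure or a cost function, as claims 16 to 19 describe, is left out.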

26. A method of verifying a speech utterance, the method comprising the steps of:
- processing a video signal associated with a video source;
  
  processing an audio signal associated with the video signal; and
  
  comparing the processed audio signal with the processed video signal to determine a level of correlation between the signals.
- Dependent claims 27-31:
  - 27. The method of claim 26, wherein the video signal processing operation further comprises the step of extracting visual feature vectors from the video signal.
  - 28. The method of claim 27, wherein the video signal processing operation further comprises the step of associating visemes with the extracted feature vectors.
  - 29. The method of claim 28, wherein the audio signal processing operation further comprises the step of extracting acoustic feature vectors and using the extracted features to generate a decoded script representative of the audio signal.
  - 30. The method of claim 29, wherein the decoded script is aligned with the visemes.
  - 31. The method of claim 30, wherein a likelihood of the alignment is computed and used to make the verification determination.
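
Claims 27 to 31 outline a pipeline: visemes are associated with visual feature vectors, a script is decoded from the audio, the script is aligned with the visemes, and the likelihood of that alignment drives the verification decision. The sketch below illustrates the alignment and likelihood steps only. It assumes that a visual classifier has already produced per-frame viseme log-likelihoods and that the decoded script is a phone sequence. The phone-to-viseme table, the viseme class names, the left-to-right dynamic-programming alignment, and the threshold are hypothetical choices, not details taken from the claims.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical phone-to-viseme table; a real system would use a fuller mapping.
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
    "uw": "rounded", "ow": "rounded",
}


def script_to_visemes(phones: Sequence[str]) -> List[str]:
    """Map a decoded phone script to visemes, merging adjacent repeats."""
    visemes: List[str] = []
    for phone in phones:
        viseme = PHONE_TO_VISEME[phone]
        if not visemes or visemes[-1] != viseme:
            visemes.append(viseme)
    return visemes


def alignment_log_likelihood(
    frame_loglik: Sequence[Dict[str, float]],
    visemes: Sequence[str],
) -> float:
    """Average per-frame log-likelihood of the best monotonic alignment.

    Every viseme must cover at least one frame, in order, and every frame is
    assigned to exactly one viseme.
    """
    neg_inf = float("-inf")
    n_frames, n_vis = len(frame_loglik), len(visemes)
    if n_vis == 0 or n_frames < n_vis:
        return neg_inf
    # prev[k]: best score of any path that ends the previous frame in viseme k.
    prev = [neg_inf] * n_vis
    prev[0] = frame_loglik[0].get(visemes[0], neg_inf)
    for t in range(1, n_frames):
        cur = [neg_inf] * n_vis
        for k in range(n_vis):
            # Either stay in viseme k or advance from viseme k - 1.
            best = prev[k] if k == 0 else max(prev[k], prev[k - 1])
            if best > neg_inf:
                cur[k] = best + frame_loglik[t].get(visemes[k], neg_inf)
        prev = cur
    return prev[-1] / n_frames


def verify_utterance(frame_loglik, phones, threshold: float) -> bool:
    """Accept when the audio-derived script explains the lip motion well enough."""
    score = alignment_log_likelihood(frame_loglik, script_to_visemes(phones))
    return score >= threshold
```

A low score suggests that the lip movement does not match what was heard, for example when a recording is played back in front of the camera. Under the same assumptions, a known script could be passed in place of the decoded one.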

32. A method of verifying a speech utterance, the method comprising the steps of:
- processing a video signal associated with a video source; and
  
  comparing the processed video signal with a script representing an audio signal associated with the video signal to determine a level of correlation between the signals.

33. Apparatus for performing speaker recognition, the apparatus comprising:
- at least one processor operable to:
  
  (i) process a video signal associated with an arbitrary content video source, (ii) process an audio signal associated with the video signal, and (iii) make at least one of an identification and verification decision based on the processed audio signal and the processed video signal.
- Dependent claims 34-57:
  - 34. The apparatus of claim 33, wherein the video signal processing operation comprises the step of detecting whether the video signal associated with the arbitrary content video source contains one or more faces.
  - 35. The apparatus of claim 34, wherein the video signal processing operation further comprises the step of detecting one or more facial features on one or more detected faces.
  - 36. The apparatus of claim 35, wherein the video signal processing operation further comprises the step of recognizing one or more faces from the detecting faces using the detected facial features.
  - 37. The apparatus of claim 35, wherein at least one of face and facial feature detection employ a distance from face space (DFFS) measure.
  - 38. The apparatus of claim 35, wherein at least one of face and facial feature detection employ Fisher linear discriminant (FLD) analysis.
  - 39. The apparatus of claim 38, wherein the video signal processing operation further comprises the step of performing a confidence estimation procedure on results of the face recognition operation.
  - 40. The apparatus of claim 38, wherein the audio signal processing operation comprises the step of recognizing a speaker associated with the audio signal.
  - 41. The apparatus of claim 40, wherein the audio signal processing operation further comprises the step of performing a confidence estimation procedure on results of the audio speaker recognition operation.
  - 42. The apparatus of claim 40, wherein respective results of the face recognition and audio speaker recognition operations are used to make at least one of the identification decision and the verification decision.
  - 43. The apparatus of claim 42, wherein the results are combined such that one set of top N respective scores are generated for the face recognition and audio speaker recognition operations and used to make the decision.
  - 44. The apparatus of claim 42, wherein the results of one of the recognition operations are used to modify the results of the other of the recognition operations.
  - 45. The apparatus of claim 44, wherein the decision is based on the modified results.
  - 46. The apparatus of claim 42, wherein the results include the top N respective scores generated during the face recognition and audio speaker recognition operations.
  - 47. The apparatus of claim 46, wherein the top N respective scores are combined using a mixture parameter.
  - 48. The apparatus of claim 47, wherein the mixture parameter is selected within a range which maximizes the highest and the second highest scores.
  - 49. The apparatus of claim 47, wherein the mixture parameter is selected according to a reliability measure associated with the face recognition and audio speaker recognition operations.
  - 50. The apparatus of claim 49, wherein the mixture parameter is optimized in accordance with a cost function representative of a smoothed error rate.
  - 51. The apparatus of claim 49, wherein the mixture parameter is optimized in accordance with a cost function representative of an error rate.
  - 52. The apparatus of claim 33, wherein at least one of the video signal and the audio signal are compressed signals.
  - 53. The apparatus of claim 33, wherein compressed signals are decompressed prior to processing operations.
  - 54. The apparatus of claim 33, wherein the arbitrary content video source provides MPEG-2 standard signals.
  - 55. The apparatus of claim 33, wherein the video signal includes at least one of visible electromagnetic spectrum images, non-visible electromagnetic spectrum images, and images from other sensing techniques.
  - 56. The apparatus of claim 33, wherein the processor is further operable to enroll a user in accordance with at least one of acoustic and visual information.
  - 57. The apparatus of claim 56, wherein the result of the enrollment operation is a combined biometric representing multiple modalities.
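
Claims 5 and 37 name a distance from face space (DFFS) measure for face and facial feature detection without defining it. The sketch below uses one common definition from the eigenface literature: the squared residual left after projecting a mean-centered image vector onto a face subspace. The function name `dffs`, the assumption that the basis vectors are orthonormal and precomputed (for example by principal component analysis), and the toy three-dimensional vectors are all illustrative assumptions.

```python
from typing import Sequence


def dffs(
    x: Sequence[float],
    mean: Sequence[float],
    basis: Sequence[Sequence[float]],
) -> float:
    """Squared distance between x and its projection onto the face subspace.

    basis must hold orthonormal vectors spanning the face space; a small
    value means the candidate region is well explained by that subspace.
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    # Total energy of the mean-centered vector.
    total = sum(c * c for c in centered)
    # Energy captured by the subspace: sum of squared projection coefficients.
    captured = sum(
        sum(c * b for c, b in zip(centered, vec)) ** 2 for vec in basis
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return max(total - captured, 0.0)


# Toy example: the "face space" is the plane spanned by the first two axes.
mean_face = [0.0, 0.0, 0.0]
face_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(dffs([3.0, 4.0, 0.0], mean_face, face_basis))  # lies in the face space
print(dffs([3.0, 4.0, 2.0], mean_face, face_basis))  # has an out-of-space part
```

A detector built on this measure would compare the value against a threshold chosen on training data and treat low-distance regions as face or facial feature candidates.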

58. Apparatus for verifying a speech utterance, the apparatus comprising:
- at least one processor operable to:
  
  (i) process a video signal associated with a video source, (ii) process an audio signal associated with the video signal, and (iii) compare the processed audio signal with the processed video signal to determine a level of correlation between the signals.

59. Apparatus for verifying a speech utterance, the apparatus comprising:
- at least one processor operable to:
  
  (i) process a video signal associated with a video source, and (ii) compare the processed video signal with a script representing an audio signal associated with the video signal to determine a level of correlation between the signals.

60. A method of performing speaker recognition, the method comprising the steps of:
- processing an image signal associated with an arbitrary content image source;
  
  processing an audio signal associated with the image signal; and
  
  making at least one of an identification and verification decision based on the processed audio signal and the processed image signal.

61. Apparatus for performing speaker recognition, the apparatus comprising:
- at least one processor operable to:
  
  (i) process an image signal associated with an arbitrary content image source, (ii) process an audio signal associated with the image signal, and (iii) make at least one of an identification and verification decision based on the processed audio signal and the processed image signal.

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Maison, Benoit Emmanuel Ghislain; Senior, Andrew William; Maes, Stephane Herman; Beigi, Homayoon S. M.; Neti, Chalapathy Venkata; Basu, Sankar
Primary Examiner(s)
Hudspeth, David
Assistant Examiner(s)
Abebe, Daniel Demelash

Application Number

US09/369,706
Time in Patent Office

620 Days
Field of Search

382/115, 382/118, 379/88.02, 704/273, 704/246, 704/231, 704/251, 704/275
US Class Current

704/246
CPC Class Codes

G06F 18/256   of results relating to diff...

G06V 10/811   the classifiers operating o...

G06V 40/10   Human or animal bodies, e.g...

G06V 40/16   Human faces, e.g. facial pa...

G07C 9/37   using biometric data, e.g. ...

G10L 2015/226   using non-speech characteri...
