Fusion of audio and video based speaker identification for multimedia information access

US 6,567,775 B1
Filed: 04/26/2000
Issued: 05/20/2003
Est. Priority Date: 04/26/2000
Status: Expired due to Term


Abstract

A method and apparatus are disclosed for identifying a speaker in an audio-video source using both audio and video information. An audio-based speaker identification system identifies one or more potential speakers for a given segment using an enrolled speaker database. A video-based speaker identification system identifies one or more potential speakers for a given segment using a face detector/recognizer and an enrolled face database. An audio-video decision fusion process evaluates the individuals identified by the audio-based and video-based speaker identification systems and determines the speaker of an utterance in accordance with the present invention. A linear variation is imposed on the ranked-lists produced using the audio and video information. The decision fusion scheme of the present invention is based on a linear combination of the audio and the video ranked-lists. The line with the higher slope is assumed to convey more discriminative information. The normalized slopes of the two lines are used as the weight of the respective results when combining the scores from the audio-based and video-based speaker analysis. In this manner, the weights are derived from the data itself.
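The weighting idea in the abstract can be illustrated with a minimal sketch. It assumes each modality supplies a confidence score per enrolled candidate, fits a least-squares line to each ranked list, and uses the normalized slope magnitudes as weights. The names `fit_line` and `fuse_scores`, the example scores, the use of absolute slopes (scores sorted in descending order give negative slopes), and the weighted sum of raw confidence scores are all illustrative assumptions, not the formula or implementation of the patent.

```python
def fit_line(scores):
    """Least-squares line through (rank, score) points, with ranks 1..n.

    Assumes at least two scores; returns (slope, intercept).
    """
    n = len(scores)
    ranks = range(1, n + 1)
    mean_x = sum(ranks) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ranks, scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fuse_scores(audio, video):
    """Combine two {candidate: confidence} dicts using slope-derived weights."""
    # Rank each list from best to worst and fit a line to it.
    m1, _ = fit_line(sorted(audio.values(), reverse=True))
    m2, _ = fit_line(sorted(video.values(), reverse=True))
    # Assumption: the steeper line is the more discriminative modality,
    # so weight each modality by its share of the total slope magnitude.
    total = abs(m1) + abs(m2)
    w1 = abs(m1) / total if total else 0.5
    w2 = 1.0 - w1
    # Assumption: the fused score is a weighted sum of the raw scores;
    # a candidate missing from one modality contributes 0 for it.
    fused = {
        name: w1 * audio.get(name, 0.0) + w2 * video.get(name, 0.0)
        for name in set(audio) | set(video)
    }
    return w1, w2, fused


# Hypothetical scores: audio separates candidates sharply, video barely does.
audio = {"alice": 0.9, "bob": 0.5, "carol": 0.1}
video = {"alice": 0.5, "bob": 0.6, "carol": 0.4}
w1, w2, fused = fuse_scores(audio, video)
best = max(fused, key=fused.get)
print(best, round(w1, 3), round(w2, 3))
```

In this example the audio list falls off more steeply than the video list, so the audio scores dominate the fused result even though the video list ranks a different candidate first.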

201 Citations

14 Claims

1. A method for identifying a speaker in an audio-video source, said audio-video source having audio information and video information, said method comprising the steps of:
- processing said audio information to identify a plurality of potential speakers, each of said identified speakers having an associated confidence score;
- processing said video information to identify a plurality of potential individuals in an image, each of said identified individuals having an associated confidence score; and
- identifying said speaker in said audio-video source based on said audio and video information, wherein said audio and video information is weighted based on slope information derived from said confidence scores.

2. The method of claim 1, further comprising the step of imposing a linear variation on a list of said plurality of potential speakers and a list of said potential individuals.

3. The method of claim 2, wherein said step of imposing a linear variation further comprises the steps of removing outliers using a Hough transform and fitting the surviving point set to a line using a least mean square error method.

4. The method of claim 3, further comprising the step of representing said list of potential speakers as a straight line having a slope, m₁, and said list of potential individuals as a straight line having a slope, m₂, and wherein said slopes of said lines when normalized by their sum are used as the weight of the said audio and video information, respectively, during said identifying step.

5. The method of claim 4, wherein said identifying step further comprises the step of computing a fused score, FS_k, for each speaker as follows:
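The fused-score formula referred to in claim 5 is not reproduced in this text, and the following does not stand in for it. As an illustration of the line-fitting step recited in claim 3 only, here is a minimal sketch of outlier removal by a Hough-style vote followed by a least-squares fit. The function names `hough_inliers` and `least_squares_line`, the accumulator resolution (`n_theta`, `rho_step`), and the sample points are hypothetical choices made for the example.

```python
import math


def hough_inliers(points, n_theta=180, rho_step=0.1):
    """Keep the points that vote for the most popular (theta, rho) cell.

    Each point (x, y) votes for every line x*cos(theta) + y*sin(theta) = rho
    that passes through it, with theta and rho quantized into cells.
    """
    votes = {}
    for i, (x, y) in enumerate(points):
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            cell = (t, round(rho / rho_step))
            votes.setdefault(cell, []).append(i)
    # The cell with the most votes corresponds to the dominant line;
    # points that did not vote for it are treated as outliers.
    best = max(votes.values(), key=len)
    return [points[i] for i in best]


def least_squares_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (rank, score) points: five lie on one line, rank 4 is an outlier.
points = [(1, 0.9), (2, 0.8), (3, 0.7), (4, 2.0), (5, 0.5), (6, 0.4)]
inliers = hough_inliers(points)
slope, intercept = least_squares_line(inliers)
print(inliers, round(slope, 3), round(intercept, 3))
```

The cell size trades robustness against precision: a coarse `rho_step` tolerates noisy scores but may admit mild outliers, while a fine one can split a genuinely linear list across neighboring cells.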

6. A method for identifying a speaker in an audio-video source, said audio-video source having audio information and video information, said method comprising the steps of:
- processing said audio information to identify a ranked-list of potential speakers, each of said identified speakers having an associated confidence score;
- processing said video information to identify a ranked-list of potential individuals in an image, each of said identified individuals having an associated confidence score; and
- identifying said speaker in said audio-video source based on said audio and video information, wherein said audio and video information is weighted based on slope information derived from said confidence scores.

7. The method of claim 6, further comprising the step of imposing a linear variation on said ranked-list of potential speakers and said ranked-list of potential individuals.

8. The method of claim 7, wherein said step of imposing a linear variation further comprises the steps of removing outliers using a Hough transform and fitting the surviving point set to a line using a least mean square error method.

9. The method of claim 8, further comprising the step of representing said ranked-list of potential speakers as a straight line having a slope, m₁, and said ranked-list of potential individuals as a straight line having a slope, m₂, and wherein said slopes of said lines are used as the weight of the said audio and video information, respectively, during said identifying step.

10. The method of claim 9, wherein said identifying step further comprises the step of computing a fused score, FS_k, for each speaker as follows:

11. A system for identifying a speaker in an audio-video source, said audio-video source having audio information and video information, said system comprising:
- a memory that stores computer-readable code; and
- a processor operatively coupled to said memory, said processor configured to implement said computer-readable code, said computer-readable code configured to:
  - process said audio information to identify a plurality of potential speakers, each of said identified speakers having an associated confidence score;
  - process said video information to identify a plurality of potential individuals in an image, each of said identified individuals having an associated confidence score; and
  - identify said speaker in said audio-video source based on said audio and video information, wherein said audio and video information is weighted based on slope information derived from said confidence scores.

12. A system for identifying a speaker in an audio-video source, said audio-video source having audio information and video information, said system comprising:
- a memory that stores computer-readable code; and
- a processor operatively coupled to said memory, said processor configured to implement said computer-readable code, said computer-readable code configured to:
  - process said audio information to identify a ranked-list of potential speakers, each of said identified speakers having an associated confidence score;
  - process said video information to identify a ranked-list of potential individuals in an image, each of said identified individuals having an associated confidence score; and
  - identify said speaker in said audio-video source based on said audio and video information, wherein said audio and video information is weighted based on slope information derived from said confidence scores.

13. An article of manufacture for identifying a speaker in an audio-video source, said audio-video source having audio information and video information, said article of manufacture comprising:
- a computer readable medium having computer readable code means embodied thereon, said computer readable program code means comprising:
  - a step to process said audio information to identify a plurality of potential speakers, each of said identified speakers having an associated confidence score;
  - a step to process said video information to identify a plurality of potential individuals in an image, each of said identified individuals having an associated confidence score; and
  - a step to identify said speaker in said audio-video source based on said audio and video information, wherein said audio and video information is weighted based on slope information derived from said confidence scores.

14. An article of manufacture for identifying a speaker in an audio-video source, said audio-video source having audio information and video information, said article of manufacture comprising:
- a computer readable medium having computer readable code means embodied thereon, said computer readable program code means comprising:
  - a step to process said audio information to identify a ranked-list of potential speakers, each of said identified speakers having an associated confidence score;
  - a step to process said video information to identify a ranked-list of potential individuals in an image, each of said identified individuals having an associated confidence score; and
  - a step to identify said speaker in said audio-video source based on said audio and video information, wherein said audio and video information is weighted based on slope information derived from said confidence scores.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Maali, Fereydoun; Viswanathan, Mahesh
Primary Examiner(s): Knepper, David D.
Application Number: US09/558,371
Time in Patent Office: 1,119 Days
Field of Search: 704/200, 704/231, 704/246, 704/251, 704/270, 704/272, 704/273, 704/235, 704/260, 712/10, 382/157, 382/203
US Class Current: 704/231
CPC Class Codes:
- G06F 18/256   of results relating to diff...
- G06V 20/40   in video content extracting...
- G10L 17/10   Multimodal systems, i.e. ba...
