Speaker detection and tracking using audiovisual data

US 7,692,685 B2
Filed: 03/31/2005
Issued: 04/06/2010
Est. Priority Date: 06/27/2002
Status: Expired due to Fees

Abstract

A system and method facilitating object tracking is provided. The invention includes an audio model that receives at least two audio input signals and a video model that receives a video input. The audio model and the video model employ probabilistic generative models which are combined to facilitate object tracking. Expectation maximization can be employed to modify trainable parameters of the audio model and the video model.

18 Claims

1. An object tracker system, comprising:
- a processor that executes the following computer executable components stored on a computer readable medium:
  
  an audio model component that models an original audio signal of an object, a time delay between at least two audio input signals and a variability component of the original audio signal, the audio model employing a probabilistic generative model;
  
  a video model component that models a location of the object, an original image of the object and a variability component of the original image, the video model employing a probabilistic generative model, the video model receiving a video input; and
  
  an audio video tracker component that models the location of the object based, at least in part, upon the audio model and the video model, wherein the audio video tracker provides an output associated with the location of the object based on, at least in part, a linear mapping that approximates the location of the object, wherein the linear mapping is computed as a function of the time delay between the at least two audio input signals, wherein error in approximating the location of the object is modeled by a zero mean Gaussian distribution associated with a precision matrix, and wherein the zero mean Gaussian distribution associated with the precision matrix is based on, at least in part:
  
  a product of a horizontal position of the object and a difference in horizontal position of a first audio input device and a second audio input device;
  
  a product of a vertical position of the object and a difference in vertical position of the first audio input device and the second audio input device; and
  
  a precision matrix of an approximation error modeled by a zero mean Gaussian.
- Dependent claims (2 to 15):
  - 2. The system of claim 1, further comprising a video input device.
  - 3. The system of claim 2, wherein the video input device comprises a camera.
  - 4. The system of claim 1, further comprising at least one audio input device.
  - 5. The system of claim 1, wherein the audio input device includes at least one of a microphone, a telephone, or a speaker phone.
  - 6. The system of claim 1, wherein the original audio signal of the object, the time delay between at least two audio input signals, and the variability component of the original audio signal comprise unobserved variables of the audio model; and wherein the audio model further includes trainable parameters.
  - 7. The system of claim 6, wherein the audio video tracker component employs an expectation maximization algorithm to modify the trainable parameters of the audio model.
  - 8. The system of claim 1, wherein the location of the object, the original image of the object, and the variability component of the original image comprise unobserved variables of the video model; and wherein the video model further includes trainable parameters.
  - 9. The system of claim 8, wherein the audio video tracker component employs an expectation maximization algorithm to modify the trainable parameters of the video model.
  - 10. The system of claim 1, wherein the object is a human speaker.
  - 11. A video conferencing system employing the system of claim 1.
  - 12. A multi-media processing system employing the system of claim 1.
  - 13. The system of claim 1, wherein the audio model employs a hidden Markov model.
  - 14. The system of claim 1, wherein the video model employs a hidden Markov model.
  - 15. The system of claim 1, further comprising a data packet transmitted between two or more computer components that facilitates object tracking including a first data field comprising information associated with a horizontal location of an object and a second data field comprising information associated with a vertical location of the object.
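
Claim 15 describes a data packet with one field for the horizontal location and one for the vertical location. A minimal sketch of such a packet follows. The class name `LocationPacket`, the field names, and the 16-byte wire layout (two big-endian 64-bit floats) are all assumptions made for illustration; the claim does not specify any encoding.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout: two big-endian 64-bit floats, horizontal first.
_PACKET_FORMAT = "!dd"


@dataclass(frozen=True)
class LocationPacket:
    """Illustrative two-field packet carrying an object location."""

    horizontal: float  # first data field: horizontal location of the object
    vertical: float    # second data field: vertical location of the object

    def to_bytes(self) -> bytes:
        """Serialize both fields into a fixed-size payload."""
        return struct.pack(_PACKET_FORMAT, self.horizontal, self.vertical)

    @classmethod
    def from_bytes(cls, payload: bytes) -> "LocationPacket":
        """Rebuild a packet from a payload produced by to_bytes()."""
        horizontal, vertical = struct.unpack(_PACKET_FORMAT, payload)
        return cls(horizontal, vertical)


# Round trip between two components, for example a tracker and a camera controller.
sent = LocationPacket(horizontal=12.5, vertical=-3.0)
received = LocationPacket.from_bytes(sent.to_bytes())
```

A fixed-size binary layout is only one option; any encoding that keeps the two fields distinguishable would match the wording of the claim equally well.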

16. One or more processor-accessible storage media comprising processor-executable instructions for object tracking that, when executed, direct a device to perform actions comprising:
- updating a posterior distribution over unobserved variables of an audio model and a video model;
  
  providing the audio model with trainable parameters and having an original audio signal of the object, a time delay between at least two audio input signals and a variability component of the original audio signal as unobserved variables of the audio model;
  
  providing the video model with trainable parameters and having the location of the object, an original image of the object and a variability component of the original image as unobserved variables of the video model;
  
  updating trainable parameters of the audio model and the video model; and
  
  providing an output associated with a location of an object, wherein the location of the object is based on, at least in part, a linear mapping that approximates the location of the object, wherein the linear mapping is computed as a function of the time delay between the at least two audio input signals, wherein error in approximating the location of the object is modeled by a zero mean Gaussian distribution associated with a precision matrix, and wherein the zero mean Gaussian distribution associated with the precision matrix is based on, at least in part:
  
  a product of a horizontal position of the object and a difference in horizontal position of a first audio input device and a second audio input device;
  
  a product of a vertical position of the object and a difference in vertical position of the first audio input device and the second audio input device; and
  
  a precision matrix of an approximation error modeled by a zero mean Gaussian.
- Dependent claims (17):
  - 17. The one or more processor-accessible storage media method of claim 16, further comprising at least one of the following:
    - receiving at least two audio input signals; and
    - receiving a video input signal.
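
Claim 16 alternates between updating a posterior over unobserved variables and updating trainable parameters, which is the shape of an expectation maximization loop. The sketch below shows that alternation on a deliberately reduced toy problem: a one-dimensional location, a Gaussian prior per frame standing in for the video model, and a scalar delay observation per frame. It is not the patented procedure. The function names `e_step` and `m_step`, the scalar treatment of the precision, and the restriction to one dimension are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def e_step(prior_mean: float, prior_precision: float, tau: float,
           alpha: float, beta: float, nu: float) -> Tuple[float, float]:
    """Posterior mean and variance of a 1-D location l for one frame.

    Assumed toy model: l ~ N(prior_mean, 1/prior_precision) and
    tau | l ~ N(alpha * l + beta, 1/nu).
    """
    post_precision = prior_precision + nu * alpha * alpha
    post_mean = (prior_precision * prior_mean
                 + nu * alpha * (tau - beta)) / post_precision
    return post_mean, 1.0 / post_precision


def m_step(means: Sequence[float], variances: Sequence[float],
           taus: Sequence[float]) -> Tuple[float, float, float]:
    """Re-estimate alpha, beta and nu from per-frame posteriors.

    Minimizes the expected squared error
    sum_t E[(tau_t - alpha * l_t - beta)^2] under the posteriors.
    """
    n = len(taus)
    s_m = sum(means)
    s_mm = sum(m * m + v for m, v in zip(means, variances))  # sum of E[l^2]
    s_t = sum(taus)
    s_mt = sum(m * t for m, t in zip(means, taus))
    det = n * s_mm - s_m * s_m
    if det <= 0.0:
        raise ValueError("locations carry no spread; alpha is not identifiable")
    alpha = (n * s_mt - s_m * s_t) / det
    beta = (s_t - alpha * s_m) / n
    expected_sq_err = sum((t - alpha * m - beta) ** 2 + alpha * alpha * v
                          for m, v, t in zip(means, variances, taus))
    nu = n / max(expected_sq_err, 1e-12)  # floor avoids division by zero
    return alpha, beta, nu
```

One pass of the loop calls `e_step` once per frame with the current parameters, then feeds the collected means and variances to `m_step`. Repeating the two steps until the parameters stop changing gives the usual EM iteration for this toy model.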

18. An object tracker system, comprising:
- means for modeling audio that models an original audio signal of an object, a time delay between at least two audio input signals and a variability component of the original audio signal, the means for modeling audio employing a probabilistic generative model;
  
  means for modeling video that models a location of the object, an original image of the object and a variability component of the original image, the means for modeling video employing a probabilistic generative model; and
  
  means for tracking the location of the object based, at least in part, upon the means for modeling audio, the means for modeling video, and means for fusing audio and video models into a single probabilistic graphical model;
  
  wherein the means for tracking the location of the object includes means for providing an output associated with the location of the object, wherein the single probabilistic graphical model utilizes a zero mean Gaussian distribution associated with a precision matrix to model error in approximating the location of the object, and wherein the zero mean Gaussian distribution associated with the precision matrix comprises the following equation:
  
  $$p(\tau \mid l) = \mathcal{N}(\tau \mid \alpha l_x + \alpha' l_y + \beta,\ \nu_\tau),$$
  
  wherein l is the location of the object, l_x is a horizontal position of the object, l_y is a vertical position of the object, τ is a time delay between a first audio input signal and a second audio input signal, ν_τ is a precision matrix of an approximation error modeled by a zero mean Gaussian, α is a horizontal difference in position between a first audio input device and a second audio input device, α′ is a vertical difference in position between the first audio input device and the second audio input device, and β is a parameter;
  
  wherein instructions associated with one or more of the above means are executed by a processor operatively coupled to memory.
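
The equation in claim 18 says the time delay is approximately a linear function of the object location, with Gaussian error around that linear prediction. A minimal sketch of evaluating that density follows. The function names are hypothetical, and because τ is treated here as a single scalar delay, the precision ν_τ is assumed to be a positive scalar rather than a general matrix.

```python
import math


def expected_delay(l_x: float, l_y: float,
                   alpha: float, alpha_prime: float, beta: float) -> float:
    """Mean of the delay under the linear mapping alpha*l_x + alpha'*l_y + beta."""
    return alpha * l_x + alpha_prime * l_y + beta


def delay_log_likelihood(tau: float, l_x: float, l_y: float,
                         alpha: float, alpha_prime: float, beta: float,
                         nu_tau: float) -> float:
    """log N(tau | alpha*l_x + alpha'*l_y + beta, nu_tau).

    nu_tau is a precision (inverse variance), assumed scalar in this sketch.
    """
    if nu_tau <= 0.0:
        raise ValueError("precision must be positive")
    residual = tau - expected_delay(l_x, l_y, alpha, alpha_prime, beta)
    return 0.5 * math.log(nu_tau / (2.0 * math.pi)) - 0.5 * nu_tau * residual ** 2


# Compare two candidate locations against one observed delay (made-up numbers).
observed_tau = 2.75
near = delay_log_likelihood(observed_tau, 2.0, 3.0, 0.5, 0.25, 1.0, 1.0)
far = delay_log_likelihood(observed_tau, 8.0, 3.0, 0.5, 0.25, 1.0, 1.0)
```

Working in log space keeps the comparison between candidate locations numerically stable; the candidate whose predicted delay is closer to the observed delay receives the higher score.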

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Attias, Hagai; Beal, Matthew James; Jojic, Nebojsa
Primary Examiner(s): VO, TUNG T
Application Number: US11/094,922
Publication Number: US 20050171971A1
Time in Patent Office: 1,832 Days
Field of Search: 348/14.07, 348/152-169, 348/170, 348/14.09, 382/100, 382/116, 382/103, 382/228, 706/10, 381/94.1
US Class Current: 348/169
CPC Class Codes

G06F 18/256   of results relating to diff...

G06F 2218/22   Source localisation; Invers...

G06V 10/24   Aligning, centring, orienta...

G06V 10/811   the classifiers operating o...

H04N 7/15   Conference systems
