Speaker detection and tracking using audiovisual data

US 6,940,540 B2
Filed: 06/27/2002
Issued: 09/06/2005
Est. Priority Date: 06/27/2002
Status: Active Grant

Abstract

A system and method facilitating object tracking is provided. The system includes an audio model that receives at least two audio input signals and a video model that receives a video input. The audio model and the video model employ probabilistic generative models which are combined to facilitate object tracking. Expectation maximization can be employed to modify trainable parameters of the audio model and the video model.

24 Claims

1. An object tracker system, comprising:
- an audio model that models an original audio signal of an object, a time delay between at least two audio input signals and a variability component of the original audio signal, the audio model employing a probabilistic generative model, and employing, at least in part, the following equations:

  $$
  \begin{aligned}
  p(r) &= \pi_r, \\
  p(a \mid r) &= \mathcal{N}(a \mid 0, \eta_r), \\
  p(x_1 \mid a) &= \mathcal{N}(x_1 \mid \lambda_1 a, \nu_1), \\
  p(x_2 \mid a, \tau) &= \mathcal{N}(x_2 \mid \lambda_2 L_\tau a, \nu_2),
  \end{aligned}
  $$

  where $r$ is the variability component of the original audio signal, $\pi_r$ is a prior probability parameter of $r$, $a$ is the original audio signal of the object, $x_1$ is a first audio input signal, $x_2$ is a second audio input signal, $\tau$ is the time delay between $x_1$ and $x_2$, $\lambda_1$ is an attenuation parameter associated with $x_1$, $\lambda_2$ is an attenuation parameter associated with $x_2$, $\eta_r$ is a precision matrix parameter associated with $r$, $\nu_1$ is a precision matrix parameter associated with additive noise of $x_1$, $\nu_2$ is a precision matrix parameter associated with additive noise of $x_2$, and $L_\tau$ denotes a temporal shift operator;
- a video model that models a location of the object, an original image of the object and a variability component of the original image, the video model employing a probabilistic generative model, the video model receiving a video input; and
- an audio video tracker that models the location of the object based, at least in part, upon the audio model and the video model, the audio video tracker providing an output associated with the location of the object.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
  - 2. The system of claim 1, the video model employing, at least in part, the following equations:

    $$
    \begin{aligned}
    p(s) &= \pi_s, \\
    p(\nu \mid s) &= \mathcal{N}(\nu \mid \mu_s, \phi_s), \\
    p(y \mid \nu, l) &= \mathcal{N}(y \mid G_l \nu, \psi),
    \end{aligned}
    $$

    where $\pi_s$ is a prior probability parameter of $s$, $y$ is the video input signal, $l$ is the location of the object, $\nu$ is the original image of the object, $\mu_s$ is a mean parameter associated with $s$, $\phi_s$ is a precision matrix parameter associated with $s$, $\psi$ is a precision matrix parameter associated with additive noise of $y$, and $G_l$ denotes a shift operator.
  - 3. The system of claim 2, the video model employing, at least in part, the following equation:

    $$
    p(\tau \mid l) = \mathcal{N}(\tau \mid \alpha l_x + \alpha' l_y + \beta, \nu_\tau),
    $$

    where $l$ is the location of the object, $l_x$ is a horizontal position of the object, $l_y$ is a vertical position of the object, $\tau$ is a time delay between a first audio input signal and a second audio input signal, $\nu_\tau$ is a precision matrix of an approximation error modeled by a zero mean Gaussian, $\alpha$ is a horizontal difference between a first audio input device and a second audio device, $\alpha'$ is a vertical difference of the first audio input device and the second audio device, and $\beta$ is a parameter.
  - 4. The system of claim 1, further comprising a video input device.
  - 5. The system of claim 4, the video input device comprising a camera.
  - 6. The system of claim 1, further comprising at least one audio input device.
  - 7. The system of claim 1, the audio input device being at least one of a microphone, a telephone and a speaker phone.
  - 8. The system of claim 1, the original audio signal of an object, the time delay between at least two audio input signals and the variability component of the original audio signal being unobserved variables of the audio model, the audio model further including trainable parameters.
  - 9. The system of claim 8, the audio video tracker employing an expectation maximization algorithm to modify the trainable parameters of the audio model.
  - 10. The system of claim 1, the location of the object, the original image of the object and the variability component of the original image being unobserved variables of the video model, the video model further including trainable parameters.
  - 11. The system of claim 10, the audio video tracker employing an expectation maximization algorithm to modify the trainable parameters of the video model.
  - 12. The system of claim 1, the object being a human speaker.
  - 13. A video conferencing system employing the system of claim 1.
  - 14. A multi-media processing system employing the system of claim 1.
  - 15. The system of claim 1, the audio model employing a hidden Markov model.
  - 16. The system of claim 1, the video model employing a hidden Markov model.
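
The audio model recited in claim 1 can be read as a recipe for generating two microphone signals from one hidden source. The following is a minimal sketch of that reading, not the patented implementation. It assumes scalar (isotropic) precisions in place of precision matrices, a circular temporal shift, and an integer delay; the function names `shift`, `sample_audio_model` and `estimate_delay` are hypothetical. Under these assumptions the likelihood of a candidate delay increases with the circular cross-correlation of the two signals, so the delay can be recovered by maximizing that correlation.

```python
import random

def shift(signal, tau):
    """Temporal shift L_tau (assumed circular): delay `signal` by `tau` samples."""
    n = len(signal)
    return [signal[(i - tau) % n] for i in range(n)]

def sample_audio_model(pi, eta, lam1, lam2, nu1, nu2, tau, n, rng):
    """Draw (r, a, x1, x2) from the audio model, using scalar precisions.

    A precision is an inverse variance, so the standard deviation is precision ** -0.5.
    """
    r = rng.choices(range(len(pi)), weights=pi)[0]            # p(r) = pi_r
    a = [rng.gauss(0.0, eta[r] ** -0.5) for _ in range(n)]    # p(a|r) = N(a | 0, eta_r)
    x1 = [lam1 * s + rng.gauss(0.0, nu1 ** -0.5) for s in a]  # p(x1|a) = N(x1 | lam1*a, nu1)
    x2 = [lam2 * s + rng.gauss(0.0, nu2 ** -0.5)              # p(x2|a,tau) = N(x2 | lam2*L_tau a, nu2)
          for s in shift(a, tau)]
    return r, a, x1, x2

def estimate_delay(x1, x2, max_tau):
    """Return the delay in [-max_tau, max_tau] with the largest circular cross-correlation."""
    n = len(x1)

    def score(tau):
        return sum(x2[i] * x1[(i - tau) % n] for i in range(n))

    return max(range(-max_tau, max_tau + 1), key=score)
```

For example, sampling 256 points with `tau=3` and low noise, then calling `estimate_delay(x1, x2, 5)`, should recover the delay used for sampling.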

17. A method for object tracking, comprising:
- updating a posterior distribution over unobserved variables of an audio model and a video model;
- updating trainable parameters of the audio model and the video model;
- employing, at least in part, the following equations in the audio model:

  $$
  \begin{aligned}
  p(r) &= \pi_r, \\
  p(a \mid r) &= \mathcal{N}(a \mid 0, \eta_r), \\
  p(x_1 \mid a) &= \mathcal{N}(x_1 \mid \lambda_1 a, \nu_1), \\
  p(x_2 \mid a, \tau) &= \mathcal{N}(x_2 \mid \lambda_2 L_\tau a, \nu_2),
  \end{aligned}
  $$

  where $r$ is the variability component of the original audio signal, $\pi_r$ is a prior probability parameter of $r$, $a$ is the original audio signal of the object, $x_1$ is a first audio input signal, $x_2$ is a second audio input signal, $\tau$ is the time delay between $x_1$ and $x_2$, $\lambda_1$ is an attenuation parameter associated with $x_1$, $\lambda_2$ is an attenuation parameter associated with $x_2$, $\eta_r$ is a precision matrix parameter associated with $r$, $\nu_1$ is a precision matrix parameter associated with additive noise of $x_1$, $\nu_2$ is a precision matrix parameter associated with additive noise of $x_2$, and $L_\tau$ denotes a temporal shift operator; and
- providing an output associated with a location of an object.
- Dependent claims (18, 19)
  - 18. The method of claim 17, further comprising at least one of the following acts:
    - receiving at least two audio input signals; and
    - receiving a video input signal.
  - 19. The method of claim 17, further comprising at least one of the following acts:
    - providing an audio model having trainable parameters and having an original audio signal of the object, a time delay between at least two audio input signals and a variability component of the original audio signal as unobserved variables of the audio model; and
    - providing a video model having trainable parameters and having the location of the object, an original image of the object and a variability component of the original image as unobserved variables of the video model.
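
The two updating acts of claim 17 correspond to the E-step and M-step of expectation maximization, which the abstract names as one way to modify the trainable parameters. The sketch below illustrates one such iteration for the audio equations only, and is not the method as claimed. It assumes a single variability component (so `eta` is one fixed scalar), scalar precisions, a circular shift, and a flat prior over the candidate delays; `delay_posterior` and `update_lam1` are hypothetical names. With those assumptions the source signal can be integrated out in closed form, leaving a posterior over the delay that is a softmax of scaled cross-correlations.

```python
import math
import random

def delay_posterior(x1, x2, taus, lam1, lam2, nu1, nu2, eta):
    """E-step for tau: q(tau) proportional to exp(c * sum_i x2[i] * x1[i - tau])."""
    n = len(x1)
    prec = eta + lam1 ** 2 * nu1 + lam2 ** 2 * nu2   # posterior precision of a (scalar)
    c = nu1 * nu2 * lam1 * lam2 / prec
    logits = [c * sum(x2[i] * x1[(i - t) % n] for i in range(n)) for t in taus]
    top = max(logits)                                # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def update_lam1(x1, x2, taus, q, lam1, lam2, nu1, nu2, eta):
    """M-step for lambda_1: E[x1 . a] / E[a . a] under the current posterior."""
    n = len(x1)
    prec = eta + lam1 ** 2 * nu1 + lam2 ** 2 * nu2
    num = den = 0.0
    for t, qt in zip(taus, q):
        # Posterior mean of a given tau: (nu1*lam1*x1 + nu2*lam2*L_tau^T x2) / prec
        mean = [(nu1 * lam1 * x1[i] + nu2 * lam2 * x2[(i + t) % n]) / prec
                for i in range(n)]
        num += qt * sum(xi * mi for xi, mi in zip(x1, mean))
        # E[a . a] = |mean|^2 + trace of the posterior covariance (n / prec)
        den += qt * (sum(mi * mi for mi in mean) + n / prec)
    return num / den
```

Alternating the two functions, and adding analogous updates for the remaining parameters, would give a full EM loop for this simplified audio model.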

20. A data packet transmitted between two or more computer components that facilitates object tracking, the data packet comprising:
- a first data field comprising information associated with a horizontal location of an object; and
- a second data field comprising information associated with a vertical location of the object, the horizontal location and the vertical location being based, at least in part, upon an object tracker system receiving at least two audio signal inputs and a video input signal;
- wherein the object tracker system comprising at least an audio model employing, at least in part, the following equations:

  $$
  \begin{aligned}
  p(r) &= \pi_r, \\
  p(a \mid r) &= \mathcal{N}(a \mid 0, \eta_r), \\
  p(x_1 \mid a) &= \mathcal{N}(x_1 \mid \lambda_1 a, \nu_1), \\
  p(x_2 \mid a, \tau) &= \mathcal{N}(x_2 \mid \lambda_2 L_\tau a, \nu_2),
  \end{aligned}
  $$

  where $r$ is the variability component of the original audio signal, $\pi_r$ is a prior probability parameter of $r$, $a$ is the original audio signal of the object, $x_1$ is a first audio input signal, $x_2$ is a second audio input signal, $\tau$ is the time delay between $x_1$ and $x_2$, $\lambda_1$ is an attenuation parameter associated with $x_1$, $\lambda_2$ is an attenuation parameter associated with $x_2$, $\eta_r$ is a precision matrix parameter associated with $r$, $\nu_1$ is a precision matrix parameter associated with additive noise of $x_1$, $\nu_2$ is a precision matrix parameter associated with additive noise of $x_2$, and $L_\tau$ denotes a temporal shift operator.
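
Claim 20 specifies only that the packet carries a horizontal and a vertical location field; it does not fix an encoding. As an illustration, the sketch below assumes a hypothetical wire layout of two big-endian 32-bit floats. The class name `LocationPacket` and the layout are assumptions made for this example, not details from the patent.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout: two big-endian float32 fields (8 bytes in total).
_LAYOUT = struct.Struct("!ff")

@dataclass(frozen=True)
class LocationPacket:
    horizontal: float  # first data field: horizontal location of the object
    vertical: float    # second data field: vertical location of the object

    def to_bytes(self) -> bytes:
        """Serialize both fields for transmission between components."""
        return _LAYOUT.pack(self.horizontal, self.vertical)

    @classmethod
    def from_bytes(cls, payload: bytes) -> "LocationPacket":
        """Rebuild a packet from its 8-byte payload."""
        return cls(*_LAYOUT.unpack(payload))
```

A tracker component could call `LocationPacket(lx, ly).to_bytes()` on each update, and a consumer such as a camera controller could decode the payload with `LocationPacket.from_bytes`.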

21. A computer readable medium storing computer executable components of an object tracker system, comprising:
- an audio model component that models an original audio signal of an object, a time delay between at least two audio input signals and a variability component of the original audio signal, the audio model employing a probabilistic generative model;
- a video model component that models a location of the object, an original image of the object and a variability component of the original image, the video model employing a probabilistic generative model, the video model receiving a video input; and
- employing, at least in part, the following equations:

  $$
  \begin{aligned}
  p(s) &= \pi_s, \\
  p(\nu \mid s) &= \mathcal{N}(\nu \mid \mu_s, \phi_s), \\
  p(y \mid \nu, l) &= \mathcal{N}(y \mid G_l \nu, \psi),
  \end{aligned}
  $$

  where $\pi_s$ is a prior probability parameter of $s$, $y$ is the video input signal, $l$ is the location of the object, $\nu$ is the original image of the object, $\mu_s$ is a mean parameter associated with $s$, $\phi_s$ is a precision matrix parameter associated with $s$, $\psi$ is a precision matrix parameter associated with additive noise of $y$, and $G_l$ denotes a shift operator; and
- an audio video tracker component that models the location of the object based, at least in part, upon the audio model and the video model, the audio video tracker providing an output associated with the location of the object.
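
The video equations of claim 21 describe an observed frame as a shifted copy of a hidden object image plus noise. The sketch below shows how a posterior over the location could be computed from them. It is a simplified illustration under stated assumptions: one appearance class (a single mean image `mu`), scalar precisions `phi` and `psi`, a circular shift standing in for the shift operator, and a flat prior over locations. The names `shift_image` and `location_posterior` are hypothetical. Under these assumptions the hidden image integrates out, and the frame is Gaussian around the shifted mean with precision 1 / (1/phi + 1/psi).

```python
import math

def shift_image(image, lx, ly):
    """Shift operator G_l (assumed circular): translate `image` by (lx, ly) pixels."""
    h, w = len(image), len(image[0])
    return [[image[(row - ly) % h][(col - lx) % w] for col in range(w)]
            for row in range(h)]

def location_posterior(y, mu, phi, psi):
    """Posterior over the location l for one appearance class, flat prior over l."""
    h, w = len(y), len(y[0])
    prec = 1.0 / (1.0 / phi + 1.0 / psi)   # precision of y around the shifted mean
    logp = {}
    for ly in range(h):
        for lx in range(w):
            pred = shift_image(mu, lx, ly)
            err = sum((y[r][c] - pred[r][c]) ** 2
                      for r in range(h) for c in range(w))
            logp[(lx, ly)] = -0.5 * prec * err
    top = max(logp.values())               # subtract the max for numerical stability
    weights = {loc: math.exp(v - top) for loc, v in logp.items()}
    total = sum(weights.values())
    return {loc: wgt / total for loc, wgt in weights.items()}
```

Taking `max(post, key=post.get)` on the returned dictionary gives the most probable `(lx, ly)` pair. The exhaustive search over all shifts is used here for clarity only.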

22. An means for modeling audio that models an original audio signal of an object, a time delay between at least two audio input signals and a variability component of the original audio signal, the means for modeling audio employing a probabilistic generative model, and employing, at least in part, the following equations in the audio model:
  $$
  \begin{aligned}
  p(r) &= \pi_r, \\
  p(a \mid r) &= \mathcal{N}(a \mid 0, \eta_r), \\
  p(x_1 \mid a) &= \mathcal{N}(x_1 \mid \lambda_1 a, \nu_1), \\
  p(x_2 \mid a, \tau) &= \mathcal{N}(x_2 \mid \lambda_2 L_\tau a, \nu_2),
  \end{aligned}
  $$

  where $r$ is the variability component of the original audio signal, $\pi_r$ is a prior probability parameter of $r$, $a$ is the original audio signal of the object, $x_1$ is a first audio input signal, $x_2$ is a second audio input signal, $\tau$ is the time delay between $x_1$ and $x_2$, $\lambda_1$ is an attenuation parameter associated with $x_1$, $\lambda_2$ is an attenuation parameter associated with $x_2$, $\eta_r$ is a precision matrix parameter associated with $r$, $\nu_1$ is a precision matrix parameter associated with additive noise of $x_1$, $\nu_2$ is a precision matrix parameter associated with additive noise of $x_2$, and $L_\tau$ denotes a temporal shift operator;
- means for modeling video that models a location of the object, an original image of the object and a variability component of the original image, the means for modeling video employing a probabilistic generative model; and
- means for tracking the location of the object based, at least in part, upon the means for modeling audio and the means for modeling video, the means for tracking the location of the object providing an output associated with the location of the object.

23. An object tracker system, comprising:
- an audio model that models an original audio signal of an object, a time delay between at least two audio input signals and a variability component of the original audio signal, the audio model employing a probabilistic generative model;
- a video model that models a location of the object, an original image of the object, a variability component of the original image and a background image, the video model employing a probabilistic generative model, the video model receiving a video input, and employing, at least in part, the following equations:

  $$
  \begin{aligned}
  p(s) &= \pi_s, \\
  p(\nu \mid s) &= \mathcal{N}(\nu \mid \mu_s, \phi_s), \\
  p(y \mid \nu, l) &= \mathcal{N}(y \mid G_l \nu, \psi),
  \end{aligned}
  $$

  where $\pi_s$ is a prior probability parameter of $s$, $y$ is the video input signal, $l$ is the location of the object, $\nu$ is the original image of the object, $\mu_s$ is a mean parameter associated with $s$, $\phi_s$ is a precision matrix parameter associated with $s$, $\psi$ is a precision matrix parameter associated with additive noise of $\nu$, and $G_l$ denotes a shift operator; and
- an audio video tracker that models the location of the object based, at least in part, upon the audio model and the video model, the audio video tracker providing an output associated with the location of the object.

24. An object tracker system, comprising:
- an audio model that models an original audio signal of an object, a time delay between at least two audio input signals, a variability component of the original audio signal and a previous original audio signal of the object, the audio model employing a probabilistic generative model, and employing, at least in part, the following equations:

  $$
  \begin{aligned}
  p(r) &= \pi_r, \\
  p(a \mid r) &= \mathcal{N}(a \mid 0, \eta_r), \\
  p(x_1 \mid a) &= \mathcal{N}(x_1 \mid \lambda_1 a, \nu_1), \\
  p(x_2 \mid a, \tau) &= \mathcal{N}(x_2 \mid \lambda_2 L_\tau a, \nu_2),
  \end{aligned}
  $$

  where $r$ is the variability component of the original audio signal, $\pi_r$ is a prior probability parameter of $r$, $a$ is the original audio signal of the object, $x_1$ is a first audio input signal, $x_2$ is a second audio input signal, $\tau$ is the time delay between $x_1$ and $x_2$, $\lambda_1$ is an attenuation parameter associated with $x_1$, $\lambda_2$ is an attenuation parameter associated with $x_2$, $\eta_r$ is a precision matrix parameter associated with $r$, $\nu_1$ is a precision matrix parameter associated with additive noise of $x_1$, $\nu_2$ is a precision matrix parameter associated with additive noise of $x_2$, and $L_\tau$ denotes a temporal shift operator;
- a video model that models a location of the object, an original image of the object and a variability component of the original image, the video model employing a probabilistic generative model, the video model receiving a video input; and
- an audio video tracker that models the location of the object based, at least in part, upon the audio model, the video model and a previous location of the object, the audio video tracker providing an output associated with the location of the object.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Jojic, Nebojsa; Attias, Hagai; Beal, Matthew James
Primary Examiner(s): VO, TUNG T

Application Number: US10/183,575
Publication Number: US 20040001143A1
Time in Patent Office: 1,167 Days
Field of Search: G06/F01.5/00, H04/N00.5/225, 348/14.07, 348152-169, 348243-172, 348/515, 382/100, 382/116, 382/159, 702/181
US Class Current: 348/169
CPC Class Codes

G06F 18/256   of results relating to diff...

G06F 2218/22   Source localisation; Invers...

G06V 10/24   Aligning, centring, orienta...

G06V 10/811   the classifiers operating o...

H04N 7/15   Conference systems
