Speaker detection and tracking using audiovisual data

US 8,842,177 B2
Filed: 03/31/2010
Issued: 09/23/2014
Est. Priority Date: 06/27/2002
Status: Expired due to Fees

2 Assignments

0 Petitions

Abstract

Object tracking includes an audio model that receives at least two audio input signals and a video model that receives a video input. The audio model and the video model employ probabilistic generative models which are combined to facilitate object tracking. Expectation maximization can be employed to modify trainable parameters of the audio model and the video model.
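One way to read "combined to facilitate object tracking" is as a product of the evidence from both modalities over a shared location variable. The Python sketch below is an illustration under stated assumptions, not the patented implementation. It uses a discretized horizontal location l, a discretized inter-microphone delay tau, and a linear-Gaussian link tau = alpha * l + beta + e with e ~ N(0, 1/nu). The link parameters alpha, beta, nu and the function names are hypothetical. The sketch computes p(l | audio, video) ∝ p(l) p(video | l) * sum over tau of p(audio | tau) N(tau; alpha * l + beta, 1/nu).

```python
import math

def gaussian_pdf(x, mean, precision):
    """1-D Gaussian density parameterized by precision (inverse variance)."""
    return math.sqrt(precision / (2 * math.pi)) * math.exp(-0.5 * precision * (x - mean) ** 2)

def fuse_location_posterior(locations, video_evidence, delays, audio_likelihood,
                            alpha, beta, nu):
    """Hypothetical fusion step: posterior over discrete locations l.

    video_evidence[i]   ~ p(l_i) * p(video | l_i)
    audio_likelihood[j] ~ p(audio | tau_j)
    Assumed link: tau = alpha * l + beta + e,  e ~ N(0, 1/nu)
    """
    scores = []
    for l, pv in zip(locations, video_evidence):
        # Marginalize the delay: sum_tau p(audio | tau) * p(tau | l)
        coupling = sum(pa * gaussian_pdf(tau, alpha * l + beta, nu)
                       for tau, pa in zip(delays, audio_likelihood))
        scores.append(pv * coupling)
    total = sum(scores)
    return [s / total for s in scores]

if __name__ == "__main__":
    locs = [0, 1, 2, 3, 4]
    delays = [-2, -1, 0, 1, 2]
    video = [0.2] * 5                          # uninformative video evidence
    audio = [0.01, 0.01, 0.01, 0.96, 0.01]     # audio strongly favours tau = 1
    post = fuse_location_posterior(locs, video, delays, audio, alpha=1.0, beta=-2.0, nu=4.0)
    print(max(locs, key=lambda l: post[l]))    # prints 3
```

With the assumed link tau = l - 2, the audio peak at tau = 1 maps to l = 3. A sharp video peak with flat audio would instead dominate the posterior. Summing over tau rather than taking the best delay keeps the fusion probabilistic, which is what lets either modality correct the other.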

20 Claims

1. One or more processor-accessible storage devices comprising processor-executable instructions for object tracking that, when executed, direct a device to perform actions comprising:
- receiving at least two audio input signals associated to an object;
- receiving a video input signal associated to the object;
- modeling a location of the object based at least in part on the at least two audio input signals and the video input signal; and
- calculating an error in modeling the location of the object based at least in part on:
  - a precision matrix of an approximation error modeled by a zero mean Gaussian;
  - a product of a vertical position of the object and a difference in vertical position of a first audio input device and a second audio input device; and
  - a product of a horizontal position of the object and a difference in horizontal position of the first audio input device and the second audio input device.

2. The one or more processor-accessible storage devices of claim 1, wherein the location of the object is based at least in part on a probabilistic generative model of the video input signal associated to the object.

3. The one or more processor-accessible storage devices of claim 1, wherein the location of the object is based at least in part on a probabilistic generative model of the at least two audio input signals associated to the object.

4. The one or more processor-accessible storage devices of claim 1, wherein the modeling employs a hidden Markov model.

5. The one or more processor-accessible storage devices of claim 1, wherein the location of the object is based at least in part on a probabilistic generative model of the at least two audio input signals and a probabilistic generative model of the video input signal.

6. The one or more processor-accessible storage devices of claim 1, wherein the precision matrix is based at least in part on additive sensor noise.
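The product terms in claim 1 appear naturally when the time delay between two microphones is linearized in the object's position. Let the object be at l = (x, y) and the microphones at r1 = (x1, y1) and r2 = (x2, y2), all in a shared 2-D frame (an assumption). Then ||l - r2||^2 - ||l - r1||^2 = (x2^2 + y2^2) - (x1^2 + y1^2) - 2[x(x2 - x1) + y(y2 - y1)]. Dividing this by ||l - r2|| + ||l - r1||, and approximating that sum by a constant 2D, gives a delay that is linear in x and y. The leftover approximation error is then modeled as a zero-mean Gaussian with precision nu, which for a single microphone pair is a 1x1 precision matrix. The sketch below illustrates that reasoning. The speed-of-sound constant, the reference distance D and the function names are assumptions, not taken from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second; assumed value for air

def exact_delay(obj, mic1, mic2, c=SPEED_OF_SOUND):
    """Time difference of arrival (seconds) at mic2 relative to mic1."""
    return (math.dist(obj, mic2) - math.dist(obj, mic1)) / c

def linear_delay(obj, mic1, mic2, ref_distance, c=SPEED_OF_SOUND):
    """Delay approximation that is linear in the object position (x, y).

    Uses ||l - r2||^2 - ||l - r1||^2
       = (x2^2 + y2^2) - (x1^2 + y1^2) - 2 * (x * (x2 - x1) + y * (y2 - y1))
    and replaces ||l - r2|| + ||l - r1|| by the constant 2 * ref_distance.
    """
    x, y = obj
    (x1, y1), (x2, y2) = mic1, mic2
    horizontal_term = x * (x2 - x1)  # horizontal position times mic x-difference
    vertical_term = y * (y2 - y1)    # vertical position times mic y-difference
    diff_sq = (x2**2 + y2**2) - (x1**2 + y1**2) - 2 * (horizontal_term + vertical_term)
    return diff_sq / (2 * c * ref_distance)

def error_log_density(error, precision):
    """log N(error; 0, 1/precision): zero-mean Gaussian with scalar precision."""
    return 0.5 * math.log(precision / (2 * math.pi)) - 0.5 * precision * error**2

if __name__ == "__main__":
    obj, mic1, mic2 = (1.0, 2.0), (-0.1, 0.0), (0.1, 0.0)
    err = exact_delay(obj, mic1, mic2) - linear_delay(obj, mic1, mic2, ref_distance=2.0)
    print(err, error_log_density(err, precision=1e8))
```

When ref_distance equals the true mean distance to the two microphones, the linear form is exact. With a fixed ref_distance, the mismatch is what the zero-mean Gaussian absorbs.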

7. An object tracker system, comprising:
- a processor that executes the following computer executable components stored on a computer readable medium:
  - an audio model component that models an original audio signal of an object;
  - a video model component that models a location of the object; and
  - an audio video tracker component that models the location of the object based at least in part on the audio model and the video model, wherein the audio video tracker provides an output associated with the location of the object based at least in part on a linear mapping that approximates the location of the object, wherein error in approximating the location of the object is modeled by a zero mean Gaussian distribution associated with a precision matrix, and wherein the zero mean Gaussian distribution associated with the precision matrix is based at least in part on a product of a horizontal position of the object and a difference in horizontal position of a first audio input device and a second audio input device.

8. The system of claim 7, wherein the zero mean Gaussian distribution associated with the precision matrix is based further at least in part on a product of a vertical position of the object and a difference in vertical position of the first audio input device and the second audio input device.

9. The system of claim 7, wherein the zero mean Gaussian distribution associated with the precision matrix is based further on a precision matrix of an approximation error modeled by a zero mean Gaussian.

10. The system of claim 7, further comprising a video input device.

11. The system of claim 10, wherein the video input device comprises a camera.

12. The system of claim 7, further comprising at least one audio input device.

13. The system of claim 12, wherein the audio input device includes at least one of a microphone, a telephone, or a speaker phone.

14. The system of claim 7, wherein the original audio signal of the object, a time delay between at least two audio input signals, and a variability component of the original audio signal comprise unobserved variables of the audio model and wherein the audio model further includes trainable parameters.

15. The system of claim 14, wherein the audio video tracker component employs an expectation maximization algorithm to modify the trainable parameters of the audio model.

16. The system of claim 7, wherein the location of the object, an original image of the object, and a variability component of the original image comprise unobserved variables of the video model, and wherein the video model further includes trainable parameters.

17. The system of claim 16, wherein the audio video tracker component employs an expectation maximization algorithm to modify the trainable parameters of the video model.

18. The system of claim 7, wherein the precision matrix is based at least in part on additive sensor noise that contaminates the original audio signal of the object.
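Claims 14 and 15 describe an audio model with unobserved variables, including the time delay, and trainable parameters refined by expectation maximization. The Python sketch below is a deliberately simplified toy version under assumptions that do not come from the patent. It treats microphone 1's samples as the source signal, whereas the claimed model treats the original audio signal as unobserved. It then models microphone 2 as x2[t] = lam * x1[t - tau] + Gaussian noise with precision nu, using a circular shift. A uniform prior is placed on integer tau. The E-step computes a posterior over tau, and the M-step gives closed-form updates for lam and nu. All names are hypothetical.

```python
import math
import random

def delayed(signal, tau):
    """Circular delay by tau samples: out[t] = signal[t - tau]."""
    n = len(signal)
    return [signal[(t - tau) % n] for t in range(n)]

def audio_em(x1, x2, max_delay, iters=10, lam=1.0, nu=1.0):
    """Toy EM for x2[t] = lam * x1[t - tau] + noise, noise ~ N(0, 1/nu).

    Unobserved: integer delay tau (uniform prior on [-max_delay, max_delay]).
    Trainable: attenuation lam and noise precision nu.
    Returns the last E-step posterior over tau and the updated lam, nu.
    """
    delays = list(range(-max_delay, max_delay + 1))
    n = len(x1)
    cand = {tau: delayed(x1, tau) for tau in delays}

    def sq_resid(tau, gain):
        return sum((b - gain * a) ** 2 for a, b in zip(cand[tau], x2))

    for _ in range(iters):
        # E-step: q(tau) proportional to exp(-nu/2 * ||x2 - lam * delayed(x1, tau)||^2)
        log_w = [-0.5 * nu * sq_resid(tau, lam) for tau in delays]
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        q = [v / total for v in w]
        # M-step: closed-form maximizers of the expected complete-data log-likelihood
        num = sum(qt * sum(a * b for a, b in zip(cand[tau], x2)) for qt, tau in zip(q, delays))
        den = sum(qt * sum(a * a for a in cand[tau]) for qt, tau in zip(q, delays))
        lam = num / den
        nu = n / sum(qt * sq_resid(tau, lam) for qt, tau in zip(q, delays))
    return dict(zip(delays, q)), lam, nu

if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(200)]
    x2 = [0.5 * v + rng.gauss(0.0, 0.01) for v in delayed(x1, 3)]
    posterior, lam, nu = audio_em(x1, x2, max_delay=5)
    print(max(posterior, key=posterior.get), round(lam, 3))
```

The log-sum-exp normalization in the E-step matters in practice. Once nu grows large, the unnormalized weights of wrong delays would otherwise underflow or overflow.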

19. One or more processor-accessible storage devices comprising processor-executable instructions for object tracking that, when executed, direct a device to perform actions comprising:
- updating a posterior distribution over unobserved variables of a probabilistic generative audio model and a probabilistic generative video model;
- providing the probabilistic generative audio model with trainable parameters;
- providing the probabilistic generative video model with trainable parameters;
- updating trainable parameters of the probabilistic generative audio model and the probabilistic generative video model;
- determining a location of an object at least in part by combining the probabilistic generative audio model and the probabilistic generative video model using a probabilistic generative model, wherein an error in determining the location of the object is based at least in part on at least one of:
  - a product of a vertical position of the object and a difference in vertical position of a first audio input device and a second audio input device; and
  - a product of a horizontal position of the object and a difference in horizontal position of the first audio input device and the second audio input device; and
- providing an output associated with the location of the object.

20. The one or more processor-accessible storage devices of claim 19, wherein the error in determining the location of the object is further based on a precision matrix that is based at least in part on additive sensor noise.
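Claim 19 describes a loop: update a posterior over unobserved variables, then update trainable parameters. On the video side, this can be illustrated with a toy 1-D "image". Each frame is assumed to be a circularly shifted copy of an unknown template plus Gaussian noise with precision nu. The E-step computes a posterior over each frame's shift, which plays the role of the object location. The M-step re-estimates the template as the posterior-weighted average of the frames shifted back to the origin, and then updates nu. This Python sketch rests on those assumptions, and its names are hypothetical. Locations are recovered only relative to the initial template, which here is taken from the first frame.

```python
import math
import random

def shift(image, l):
    """Circularly translate a 1-D image by l pixels: out[i] = image[i - l]."""
    n = len(image)
    return [image[(i - l) % n] for i in range(n)]

def video_em(frames, template, nu=1.0, iters=5):
    """Toy EM for frames y_f = shift(template, l_f) + noise, noise ~ N(0, 1/nu).

    Unobserved: the location l_f of each frame (uniform prior over all shifts).
    Trainable: the template (appearance) and the noise precision nu.
    """
    n, num_frames = len(template), len(frames)
    locations = range(n)
    for _ in range(iters):
        # E-step: posterior over each frame's location
        posteriors = []
        for y in frames:
            log_w = [-0.5 * nu * sum((a - b) ** 2 for a, b in zip(y, shift(template, l)))
                     for l in locations]
            top = max(log_w)
            w = [math.exp(v - top) for v in log_w]
            total = sum(w)
            posteriors.append([v / total for v in w])
        # M-step: template = posterior-weighted average of frames shifted back to the origin
        template = [0.0] * n
        for y, q in zip(frames, posteriors):
            for l, ql in zip(locations, q):
                back = shift(y, -l)
                for i in range(n):
                    template[i] += ql * back[i] / num_frames
        # M-step: noise precision from the expected squared residual
        expected_sq = sum(ql * sum((a - b) ** 2 for a, b in zip(y, shift(template, l)))
                          for y, q in zip(frames, posteriors)
                          for l, ql in zip(locations, q))
        nu = n * num_frames / expected_sq
    return posteriors, template, nu

if __name__ == "__main__":
    rng = random.Random(1)
    truth = [rng.gauss(0.0, 1.0) for _ in range(16)]
    true_locations = [0, 2, 5, 1]
    frames = [[v + rng.gauss(0.0, 0.05) for v in shift(truth, l)] for l in true_locations]
    posteriors, learned, nu = video_em(frames, template=frames[0])
    print([max(range(16), key=q.__getitem__) for q in posteriors])
```

Initializing the template from a frame anchors the coordinate system. Without such an anchor, EM over circular shifts can only recover the template up to a translation.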

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Beal, Matthew James; Jojic, Nebojsa; Attias, Hagai
Primary Examiner(s): Vo, Tung
Assistant Examiner(s): CATTUNGAL, ROWINA J
Application Number: US12/751,699
Publication Number: US 20100194881A1
Time in Patent Office: 1,637 Days
Field of Search: 348/169, 348/170, 348/14.09, 348/222.1, 382/103, 382/228, 706/10, 381/94.1, 702/181
US Class Current: 348/135
CPC Class Codes:
- G06F 18/256: of results relating to diff...
- G06F 2218/22: Source localisation; Invers...
- G06V 10/24: Aligning, centring, orienta...
- G06V 10/811: the classifiers operating o...
- H04N 7/15: Conference systems
