Associating Audio with Three-Dimensional Objects in Videos
Abstract
Disclosed is a system and method for generating a model of the geometric relationships between various audio sources recorded by a multi-camera system. A spatial audio scene module associates the source signals of audio sources, extracted from recorded audio, with visual objects identified in videos recorded by one or more cameras. The association may rely on positions of the audio sources estimated from the relative signal gains and delays of each source signal as received at each microphone. The estimated positions of the audio sources are then tracked indirectly by tracking the associated visual objects with computer vision. A virtual microphone module may receive a position for a virtual microphone and synthesize a signal corresponding to that position based on the estimated positions of the audio sources.
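The abstract says that audio source positions may be estimated from the relative gains and delays observed at each microphone, but it does not name an estimator. As a minimal sketch of the delay-based part only, the following assumes a 2D scene, known microphone positions, a fixed speed of sound, and a brute-force grid search. The names `expected_delays` and `estimate_position` and every numeric value are hypothetical illustrations, not taken from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second; assumed constant for this sketch


def expected_delays(source, mics):
    """Arrival-time differences (seconds) at each mic, relative to mics[0]."""
    dists = [math.dist(source, m) for m in mics]
    return [(d - dists[0]) / SPEED_OF_SOUND for d in dists]


def estimate_position(measured_delays, mics, extent=5.0, step=0.1):
    """Grid-search the square [-extent, extent]^2 for the point whose
    predicted relative delays best match the measured ones (least squares)."""
    best, best_err = None, float("inf")
    n = int(round(2 * extent / step))
    for i in range(n + 1):
        for j in range(n + 1):
            cand = (-extent + i * step, -extent + j * step)
            pred = expected_delays(cand, mics)
            err = sum((p - m) ** 2 for p, m in zip(pred, measured_delays))
            if err < best_err:
                best, best_err = cand, err
    return best


# Hypothetical setup: four microphones on a 1 m square, one source.
mics = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
true_source = (2.0, 3.0)
measured = expected_delays(true_source, mics)  # noise-free, for illustration
estimate = estimate_position(measured, mics)
```

With real recordings the delays would first have to be measured, for example by cross-correlating the separated source signal across microphones, and relative gains could be added as a second error term. Both are left out here.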
20 Claims
1. A method for locating and tracking one or more audio sources recorded by a plurality of microphones, the method comprising:
receiving positions and orientations for each of at least one camera;
receiving positions for each of the plurality of microphones;
receiving at least one video recorded by a camera;
receiving a plurality of audio signals, each audio signal recorded by a microphone of the plurality of microphones;
applying source separation to the plurality of audio signals to generate one or more audio source signals, each audio source signal having originated from a respective audio source of the one or more audio sources;
estimating, for each audio source, a position associated with the audio source;
estimating a position with computer vision for each of one or more visual objects based on a visual analysis of the at least one video and the at least one position of the at least one camera;
matching each of the one or more audio sources to a corresponding visual object of the one or more visual objects based on the estimated position of the audio source and the estimated position of the visual object;
tracking movement of the one or more visual objects to generate visual object position data associated with movement of the one or more visual objects; and
storing audio source position data for each of the one or more audio sources based on the visual object position data associated with the visual object to which the audio source was matched.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
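The matching step in claim 1 pairs each audio source with a visual object based on their estimated positions, without fixing a matching criterion. One plausible reading, shown here purely as an assumption, is to choose the one-to-one pairing that minimizes the total Euclidean distance. The function name `match_sources_to_objects`, the identifiers, and the coordinates are hypothetical. The exhaustive search is only practical for a handful of sources; a larger scene would call for an assignment solver.

```python
import math
from itertools import permutations


def match_sources_to_objects(source_positions, object_positions):
    """Map each audio source id to a distinct visual object id so that the
    summed source-to-object distance is minimal (exhaustive search)."""
    source_ids = list(source_positions)
    object_ids = list(object_positions)
    if len(source_ids) > len(object_ids):
        raise ValueError("this sketch assumes at least one object per source")

    best_cost, best_match = float("inf"), {}
    # Try every ordered selection of objects, one per audio source.
    for chosen in permutations(object_ids, len(source_ids)):
        cost = sum(
            math.dist(source_positions[s], object_positions[o])
            for s, o in zip(source_ids, chosen)
        )
        if cost < best_cost:
            best_cost, best_match = cost, dict(zip(source_ids, chosen))
    return best_match


# Hypothetical estimated positions in metres (x, y, z).
sources = {"voice": (1.0, 0.2, 1.6), "guitar": (-2.1, 0.0, 1.0)}
objects = {
    "person_a": (1.1, 0.0, 1.7),
    "person_b": (-2.0, 0.1, 1.1),
    "dog": (4.0, 0.0, 0.3),
}
matches = match_sources_to_objects(sources, objects)
```

In this example each source ends up paired with the object nearest to it, and the third object is left unmatched because no audio source was estimated near it.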
11. A non-transitory computer readable medium storing instructions for locating and tracking one or more audio sources recorded by a plurality of microphones, wherein the instructions when executed by one or more computer processors cause the one or more processors to perform steps comprising:
receiving at least one position for each of at least one camera;
receiving positions for each of the plurality of microphones;
receiving at least one video recorded by a camera;
receiving a plurality of audio signals, each audio signal recorded by a microphone of the plurality of microphones;
applying source separation to the plurality of audio signals to generate one or more audio source signals, each audio source signal having originated from a respective audio source of the one or more audio sources;
estimating, for each audio source, a position associated with the audio source;
estimating a position for each of one or more visual objects based on the at least one video;
matching each of the one or more audio sources to a corresponding visual object of the one or more visual objects based on the estimated position of the audio source and the estimated position of the visual object;
tracking movement of each of the one or more visual objects to generate visual object position data associated with movement of the one or more visual objects; and
storing audio source position data for each of the one or more audio sources based on the visual object position data associated with the visual object to which the audio source was matched.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
19. A method for locating and tracking one or more audio sources recorded by a plurality of microphones, the method comprising:
receiving at least one position for each of at least one camera;
receiving positions for each of the plurality of microphones;
receiving at least one video recorded by a camera;
receiving a plurality of audio signals, each audio signal recorded by a microphone of the plurality of microphones;
estimating a position for each of one or more visual objects based on the at least one video corresponding to a first time in the at least one video;
applying source separation to the plurality of audio signals to generate one or more audio source signals, each audio source signal having originated from a respective audio source of the one or more audio sources;
estimating, for each audio source, a position associated with the audio source corresponding to the first time;
matching each of the one or more audio sources to a corresponding visual object of the one or more visual objects based on the estimated position of the audio source and the estimated position of the visual object;
tracking the position of each of the one or more audio sources by tracking the visual object to which the audio source is matched.
Dependent claims: 20
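Claim 19 tracks each audio source indirectly: once a source is matched to a visual object at a first time, the source's later positions are taken from the visual track of that object. A minimal sketch of that bookkeeping follows, assuming the match and the per-object visual tracks are already available. The function name `audio_tracks_from_visual_tracks`, the data layout, and the sample values are hypothetical.

```python
def audio_tracks_from_visual_tracks(matches, visual_tracks):
    """Derive audio source position data from visual object tracks.

    matches:       dict mapping audio source id -> matched visual object id
    visual_tracks: dict mapping visual object id -> list of
                   (time_seconds, (x, y, z)) samples from a visual tracker
    Returns a dict mapping audio source id -> copy of its object's track.
    """
    return {
        source_id: list(visual_tracks[object_id])
        for source_id, object_id in matches.items()
    }


# Hypothetical match made at the first time, and tracks from later frames.
matches = {"voice": "person_a"}
visual_tracks = {
    "person_a": [(0.0, (1.1, 0.0, 1.7)), (0.5, (1.3, 0.0, 1.7))],
    "dog": [(0.0, (4.0, 0.0, 0.3)), (0.5, (3.6, 0.0, 0.3))],
}
audio_tracks = audio_tracks_from_visual_tracks(matches, visual_tracks)
```

Only matched objects contribute to the stored audio position data, so the unmatched object's track is ignored.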
Specification