Joint acoustic and visual processing
Abstract
An approach to joint acoustic and visual processing associates images with corresponding audio signals, for example, for the retrieval of images according to voice queries. A set of paired images and audio signals is processed without requiring transcription, segmentation, or annotation of either the images or the audio. This processing of the paired images and audio is used to determine parameters of an image processor and an audio processor, with the outputs of these processors being comparable to determine a similarity across acoustic and visual modalities. In some implementations, the image processor and the audio processor make use of deep neural networks. Further embodiments associate parts of images with corresponding parts of audio signals.
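The abstract's core mechanism is a pair of encoders that map each modality into the same fixed-length vector space, where similarity can be computed directly. The following is a minimal sketch of that comparison step, assuming cosine similarity and single random linear layers as placeholders for the trained deep networks; the names (`embed_image`, `embed_audio`, `EMBED_DIM`) are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 128  # fixed-length representation shared by both modalities

# Placeholder "trained" parameters: one random linear map per modality.
# In the described approach these would be deep neural networks whose
# weights were determined from paired image/audio data.
W_image = rng.standard_normal((EMBED_DIM, 2048))  # e.g. pooled image features
W_audio = rng.standard_normal((EMBED_DIM, 1024))  # e.g. pooled audio features

def embed_image(image_features: np.ndarray) -> np.ndarray:
    """Produce the first numerical vector (fixed length) for an image."""
    return W_image @ image_features

def embed_audio(audio_features: np.ndarray) -> np.ndarray:
    """Produce the second numerical vector (fixed length) for an audio signal."""
    return W_audio @ audio_features

def similarity(image_features: np.ndarray, audio_features: np.ndarray) -> float:
    """Quantity representing cross-modal similarity: here, the cosine of
    the angle between the two fixed-length vectors."""
    u, v = embed_image(image_features), embed_audio(audio_features)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

score = similarity(rng.standard_normal(2048), rng.standard_normal(1024))
print(score)  # a scalar in [-1, 1]; higher means more similar
```

Because both encoders emit vectors of the same length, retrieval by voice query reduces to embedding the query audio once and ranking candidate images by this scalar.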
Claims
1. A method for cross-modal media processing comprising:

configuring a cross-modal similarity processor, including processing a first reference set of media that includes a set of corresponding pairs of media items, each pair of the media items including one audio item and one image item, the items of each pair having related content elements;

wherein the configuring of the similarity processor includes setting parameter values for an image processor and for an audio processor, the image processor and the audio processor each being configured to produce a fixed-length numerical representation of an input image and input audio signal, respectively, wherein the image processor is configured to produce a first numerical vector, and the audio processor is configured to produce a second numerical vector,

wherein the image processor and the audio processor each comprises an artificial neural network, and setting parameter values for the image processor and for the audio processor includes applying a neural network weight determination approach to determine the parameter values, and

wherein the similarity processor is configured to output a quantity representing a similarity between the input image and the input audio signal based on the numerical representations, the quantity representing the similarity comprising a similarity between the first numerical vector and the second numerical vector.

View Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15
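The claim's "neural network weight determination approach" over corresponding pairs is often realized with a margin-based ranking objective: each matched image/audio pair should score higher than mismatched pairs in the same batch. The sketch below shows one such objective under that assumption; it is a common choice for paired cross-modal data, not necessarily the specific procedure claimed here.

```python
import numpy as np

def triplet_margin_loss(img_emb: np.ndarray, aud_emb: np.ndarray,
                        margin: float = 1.0) -> float:
    """Ranking loss over a batch of paired embeddings.

    img_emb, aud_emb: (batch, dim) arrays where row i of each array is a
    matched image/audio pair from the reference set.
    """
    sims = img_emb @ aud_emb.T           # (batch, batch) similarity matrix
    pos = np.diag(sims)                  # matched-pair similarities
    # Impostor terms: wrong audio for each image, wrong image for each audio.
    loss_ia = np.maximum(0.0, margin + sims - pos[:, None])  # image anchors
    loss_ai = np.maximum(0.0, margin + sims - pos[None, :])  # audio anchors
    np.fill_diagonal(loss_ia, 0.0)       # matched pairs incur no loss
    np.fill_diagonal(loss_ai, 0.0)
    return float(loss_ia.mean() + loss_ai.mean())

# Usage: a random batch of 8 paired 128-dimensional embeddings.
rng = np.random.default_rng(1)
img = rng.standard_normal((8, 128))
aud = img + 0.1 * rng.standard_normal((8, 128))  # nearly matched pairs
print(triplet_margin_loss(img, aud))
```

Minimizing this quantity with gradient descent over the encoder weights drives matched pairs together and mismatched pairs apart, which is exactly the property the similarity processor needs at query time.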
14. A non-transitory machine-readable medium having instructions stored thereon, the instructions, when executed by a data processing system, causing the system to:

configure a cross-modal similarity processor by processing a first reference set of media that includes a set of corresponding pairs of media items, each pair of the media items including one audio item and one image item, the items of each pair having related content elements;

wherein the configuring of the similarity processor includes setting parameter values for an image processor and for an audio processor, the image processor and the audio processor each being configured to produce a fixed-length numerical representation of an input image and input audio signal, respectively, wherein the image processor is configured to produce a first numerical vector, and the audio processor is configured to produce a second numerical vector,

wherein the image processor and the audio processor each comprises an artificial neural network, and setting parameter values for the image processor and for the audio processor includes applying a neural network weight determination approach to determine the parameter values, and

wherein the similarity processor is configured to output a quantity representing a similarity between the input image and the input audio signal based on the numerical representations, the quantity representing the similarity comprising a similarity between the first numerical vector and the second numerical vector.