Joint acoustic and visual processing
Abstract
An approach to joint acoustic and visual processing associates images with corresponding audio signals, for example, for the retrieval of images according to voice queries. A set of paired images and audio signals is processed without requiring transcription, segmentation, or annotation of either the images or the audio. This processing of the paired images and audio is used to determine parameters of an image processor and an audio processor, with the outputs of these processors being comparable to determine a similarity across acoustic and visual modalities. In some implementations, the image processor and the audio processor make use of deep neural networks. Further embodiments associate parts of images with corresponding parts of audio signals.
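The abstract's core mechanism is a pair of encoders that map each modality into the same fixed-length vector space, where similarity can be computed directly. The following is a minimal sketch of that comparison step, assuming cosine similarity and single random linear layers as placeholders for the trained deep networks; the names (`embed_image`, `embed_audio`, `EMBED_DIM`) are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 128  # fixed-length representation shared by both modalities

# Placeholder "trained" parameters: one random linear map per modality.
# In the described approach these would be deep neural networks whose
# weights were determined from paired image/audio data.
W_image = rng.standard_normal((EMBED_DIM, 2048))  # e.g. pooled image features
W_audio = rng.standard_normal((EMBED_DIM, 1024))  # e.g. pooled audio features

def embed_image(image_features: np.ndarray) -> np.ndarray:
    """Produce the first numerical vector (fixed length) for an image."""
    return W_image @ image_features

def embed_audio(audio_features: np.ndarray) -> np.ndarray:
    """Produce the second numerical vector (fixed length) for an audio signal."""
    return W_audio @ audio_features

def similarity(image_features: np.ndarray, audio_features: np.ndarray) -> float:
    """Quantity representing cross-modal similarity: here, the cosine of
    the angle between the two fixed-length vectors."""
    u, v = embed_image(image_features), embed_audio(audio_features)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

score = similarity(rng.standard_normal(2048), rng.standard_normal(1024))
print(score)  # a scalar in [-1, 1]; higher means more similar
```

Because both encoders emit vectors of the same length, retrieval by voice query reduces to embedding the query audio once and ranking candidate images by this scalar.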
Claims
1. A method for cross-modal media processing comprising:

configuring a cross-modal similarity processor, including processing a first reference set of media that includes a set of corresponding pairs of media items, each pair of the media items including one audio item and one image item, the items of each pair having related content elements;

wherein the configuring of the similarity processor includes setting parameter values for an image processor and for an audio processor, the image processor and the audio processor each being configured to produce a fixed-length numerical representation of an input image and input audio signal, respectively, wherein the image processor is configured to produce a first numerical vector, and the audio processor is configured to produce a second numerical vector,

wherein the image processor and the audio processor each comprises an artificial neural network, and setting parameter values for the image processor and for the audio processor includes applying a neural network weight determination approach to determine the parameter values, and

wherein the similarity processor is configured to output a quantity representing a similarity between the input image and the input audio signal based on the numerical representations, the quantity representing the similarity comprising a similarity between the first numerical vector and the second numerical vector.

View Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15
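The claim's "neural network weight determination approach" over corresponding pairs is often realized with a margin-based ranking objective: each matched image/audio pair should score higher than mismatched pairs in the same batch. The sketch below shows one such objective under that assumption; it is a common choice for paired cross-modal data, not necessarily the specific procedure claimed here.

```python
import numpy as np

def triplet_margin_loss(img_emb: np.ndarray, aud_emb: np.ndarray,
                        margin: float = 1.0) -> float:
    """Ranking loss over a batch of paired embeddings.

    img_emb, aud_emb: (batch, dim) arrays where row i of each array is a
    matched image/audio pair from the reference set.
    """
    sims = img_emb @ aud_emb.T           # (batch, batch) similarity matrix
    pos = np.diag(sims)                  # matched-pair similarities
    # Impostor terms: wrong audio for each image, wrong image for each audio.
    loss_ia = np.maximum(0.0, margin + sims - pos[:, None])  # image anchors
    loss_ai = np.maximum(0.0, margin + sims - pos[None, :])  # audio anchors
    np.fill_diagonal(loss_ia, 0.0)       # matched pairs incur no loss
    np.fill_diagonal(loss_ai, 0.0)
    return float(loss_ia.mean() + loss_ai.mean())

# Usage: a random batch of 8 paired 128-dimensional embeddings.
rng = np.random.default_rng(1)
img = rng.standard_normal((8, 128))
aud = img + 0.1 * rng.standard_normal((8, 128))  # nearly matched pairs
print(triplet_margin_loss(img, aud))
```

Minimizing this quantity with gradient descent over the encoder weights drives matched pairs together and mismatched pairs apart, which is exactly the property the similarity processor needs at query time.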
14. A non-transitory machine-readable medium having instructions stored thereon, the instructions, when executed by a data processing system, causing the system to:

configure a cross-modal similarity processor by processing a first reference set of media that includes a set of corresponding pairs of media items, each pair of the media items including one audio item and one image item, the items of each pair having related content elements;

wherein the configuring of the similarity processor includes setting parameter values for an image processor and for an audio processor, the image processor and the audio processor each being configured to produce a fixed-length numerical representation of an input image and input audio signal, respectively, wherein the image processor is configured to produce a first numerical vector, and the audio processor is configured to produce a second numerical vector,

wherein the image processor and the audio processor each comprises an artificial neural network, and setting parameter values for the image processor and for the audio processor includes applying a neural network weight determination approach to determine the parameter values, and

wherein the similarity processor is configured to output a quantity representing a similarity between the input image and the input audio signal based on the numerical representations, the quantity representing the similarity comprising a similarity between the first numerical vector and the second numerical vector.