Embedding Space for Images with Multiple Text Labels
Abstract
Embedding space for images with multiple text labels is described. Both text labels and image regions are embedded in the embedding space. The embedded text labels describe semantic concepts that can be exhibited in image content. The embedding space is trained to semantically relate the embedded text labels, so that labels like “sun” and “sunset” are more closely related than “sun” and “bird”. Training the embedding space also includes mapping representative images, having image content that exemplifies the semantic concepts, to the respective text labels. Unlike conventional techniques that embed an entire training image into the embedding space for each text label associated with the training image, the techniques described herein process a training image to generate regions that correspond to its multiple text labels. The regions of the training image are then embedded into the embedding space in a manner that maps the regions to their corresponding text labels.
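The joint embedding described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented method: the label vectors and the region vector below are made-up toy values, and cosine similarity stands in for whatever distance the trained space actually uses. It shows the two properties the abstract names: related labels ("sun", "sunset") sit closer together than unrelated ones ("sun", "bird"), and an image region is labeled by the nearest embedded text label.

```python
from math import sqrt

# Hypothetical label embeddings in a shared space (vectors are illustrative).
label_embeddings = {
    "sun":    [0.9, 0.8, 0.1],
    "sunset": [0.8, 0.9, 0.2],
    "bird":   [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Semantically related labels sit closer together in the embedding space:
assert cosine(label_embeddings["sun"], label_embeddings["sunset"]) > \
       cosine(label_embeddings["sun"], label_embeddings["bird"])

def nearest_label(region_vector, embeddings):
    """Map an image-region vector to the closest embedded text label."""
    return max(embeddings, key=lambda lbl: cosine(region_vector, embeddings[lbl]))

# A region whose features land near "sunset" is labeled accordingly.
print(nearest_label([0.75, 0.95, 0.25], label_embeddings))  # -> sunset
```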
20 Claims
1. A method implemented by a computing device to annotate individual images with multiple text labels to describe content of the images, the method comprising:
processing a training image having multiple text labels to generate a set of image regions that correspond to the respective multiple text labels;

embedding, within an embedding space that is configured to embed both text labels and image regions mapped to the text labels, the set of image regions based, in part, on positions at which the multiple text labels that correspond to the image regions of the training image are embedded in the embedding space;

learning a mapping function that maps image regions to the text labels embedded in the embedding space, said learning based, in part, on said embedding the set of image regions within the embedding space;

discovering text labels that correspond to image regions of a query image by mapping the image regions of the query image to the embedding space using the learned mapping function; and

annotating the query image with at least two of the discovered text labels.

- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
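The pipeline that claim 1 recites can be illustrated end to end with toy feature vectors. This is only a sketch under simplifying assumptions: region features are hand-written 2-vectors, the "embedding" of labeled regions is reduced to a per-label mean prototype, and the "learned mapping function" is reduced to nearest-prototype assignment. All names and data are illustrative, not from the patent.

```python
# Minimal sketch of the claimed pipeline with toy region features.

def embed_training_regions(training_regions):
    """Embed each labeled training-image region near its text label,
    here simplified to a mean-feature prototype per label."""
    grouped = {}
    for label, features in training_regions:
        grouped.setdefault(label, []).append(features)
    return {lbl: [sum(col) / len(col) for col in zip(*vecs)]
            for lbl, vecs in grouped.items()}

def map_region(features, prototypes):
    """Stand-in for the learned mapping function: assign a region to the
    nearest label prototype by squared Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda lbl: dist(features, prototypes[lbl]))

def annotate(query_regions, prototypes):
    """Discover a text label per query-image region and annotate the image
    with the set of discovered labels."""
    return sorted({map_region(r, prototypes) for r in query_regions})

# A training image tagged "sun" and "ocean" contributes one region per label.
training = [("sun", [1.0, 0.1]), ("ocean", [0.1, 1.0]),
            ("sun", [0.9, 0.2]), ("ocean", [0.2, 0.8])]
prototypes = embed_training_regions(training)

# The query image's regions map to at least two discovered labels.
print(annotate([[0.95, 0.1], [0.1, 0.9]], prototypes))  # -> ['ocean', 'sun']
```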
15. A system to annotate individual images with multiple text labels to describe content of the images, the system comprising:
one or more processors; and

computer-readable storage media having stored thereon instructions that are executable by the one or more processors to perform operations comprising:

training an embedding space in which both images and text labels are embedded, said training semantically relating text labels configured to describe semantic concepts exhibited in image content and mapping representative images that have image content which exemplifies the semantic concepts to respective text labels;

learning a mapping function based on the training that maps image regions to the text labels embedded in the embedding space;

obtaining an image to annotate;

determining a set of regions of the image using at least one region proposal technique which determines image regions capable of being mapped to corresponding text labels embedded in the embedding space;

mapping the set of regions of the image to corresponding text labels in the embedding space according to the mapping function, the corresponding text labels describing semantic concepts exhibited in image content of the set of regions of the image; and

annotating the image with at least two of the corresponding text labels.

- View Dependent Claims (16, 17)
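Claim 15 leaves the "region proposal technique" open. As one concrete stand-in, a fixed sliding-window grid is the simplest such technique: it emits candidate boxes that could then be mapped into the embedding space. Real systems would typically use a learned or saliency-based proposal method; this grid is only an assumed, illustrative placeholder.

```python
def grid_region_proposals(width, height, window, stride):
    """Toy region-proposal technique: slide a fixed square window over the
    image and emit candidate boxes as (x, y, w, h) tuples."""
    boxes = []
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            boxes.append((x, y, window, window))
    return boxes

# A 4x4 image with a 2x2 window and stride 2 yields four candidate regions.
print(grid_region_proposals(4, 4, 2, 2))
# -> [(0, 0, 2, 2), (2, 0, 2, 2), (0, 2, 2, 2), (2, 2, 2, 2)]
```

Each emitted box would be featurized and passed through the learned mapping function to find its corresponding text label.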
18. A method implemented by a computing device to annotate individual images with multiple text labels to describe content of the images, the method comprising:
training an embedding space in which both images and text labels are embedded, said training semantically relating text labels configured to describe semantic concepts exhibited in image content and mapping representative images that have image content which exemplifies the semantic concepts to respective text labels;

discovering at least two text labels in the trained embedding space that describe image content of an input image, the at least two text labels discovered describing the image content of at least two respective regions of the input image; and

associating the at least two text labels with the input image.

- View Dependent Claims (19, 20)
Specification