Automatically segmenting images based on natural language phrases

US 10,089,742 B1
Filed: 03/14/2017
Issued: 10/02/2018
Est. Priority Date: 03/14/2017
Status: Active Grant


2 Assignments

0 Petitions

Abstract

The invention is directed towards segmenting images based on natural language phrases. An image and an n-gram, including a sequence of tokens, are received. An encoding of image features and a sequence of token vectors are generated. A fully convolutional neural network identifies and encodes the image features. A word embedding model generates the token vectors. A recurrent neural network (RNN) iteratively updates a segmentation map based on combinations of the image feature encoding and the token vectors. The segmentation map identifies which pixels are included in an image region referenced by the n-gram. A segmented image is generated based on the segmentation map. The RNN may be a convolutional multimodal RNN. A separate RNN, such as a long short-term memory network, may iteratively update an encoding of semantic features based on the order of tokens. The first RNN may update the segmentation map based on the semantic feature encoding.
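
The flow summarized in the abstract can be illustrated with a minimal sketch. It assumes the image feature encoding and the token vectors are already available as plain lists, and it replaces the trained convolutional multimodal RNN with a fixed `tanh` recurrence over a per-pixel dot product. The names `update_segmentation` and `segment`, the weights `w_prev` and `w_joint`, and all numeric values are hypothetical and do not come from the patent.

```python
import math

def update_segmentation(prev_state, image_map, token_vec, w_prev=1.0, w_joint=1.0):
    """One iterative update of the per-pixel state (toy stand-in for a trained RNN)."""
    new_state = []
    for prev_row, feat_row in zip(prev_state, image_map):
        new_row = []
        for prev, feats in zip(prev_row, feat_row):
            # Combination of this pixel's image features with the current token
            # vector; a plain dot product is an assumption of this sketch.
            joint = sum(f * t for f, t in zip(feats, token_vec))
            new_row.append(math.tanh(w_prev * prev + w_joint * joint))
        new_state.append(new_row)
    return new_state

def segment(image, image_map, token_vecs, threshold=0.0):
    """Consume token vectors in order, then mask the image with the final map."""
    height, width = len(image_map), len(image_map[0])
    state = [[0.0] * width for _ in range(height)]
    for vec in token_vecs:  # tokens are processed in n-gram order
        state = update_segmentation(state, image_map, vec)
    mask = [[value > threshold for value in row] for row in state]
    # Segmented image: keep pixels inside the referenced region, zero the rest.
    segmented = [[px if keep else 0 for px, keep in zip(img_row, mask_row)]
                 for img_row, mask_row in zip(image, mask)]
    return mask, segmented

# Toy input: a 1x2 grayscale image with a 2-dimensional feature vector per pixel.
image = [[200, 50]]
image_map = [[[1.0, 0.0], [0.0, 1.0]]]
token_vecs = [[1.0, -1.0], [1.0, -1.0]]
mask, segmented = segment(image, image_map, token_vecs)
```

In this toy input the first pixel's features agree with both token vectors and the second pixel's do not, so only the first pixel ends up in the mask.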

13 Citations

20 Claims

1. A computer-readable storage medium having instructions stored thereon for segmenting an image that includes a plurality of pixels, which, when executed by a processor of a computing device cause the computing device to perform actions comprising:
- receiving an ordered set of tokens that references a first region of the image;
  
  generating an image map that represents a correspondence between each of a plurality of image features and a corresponding portion of the plurality of pixels;
  
  generating a set of token data elements, wherein each of the token data elements represents semantic features of a corresponding token of the set of tokens;
  
  iteratively updating a segmentation map that represents whether each of the plurality of pixels is included in the first region of the image, wherein each of a plurality of iterative updates of the segmentation map is based on a previous version of the segmentation map and a combination of the image map and one of the token data elements that is based on an order of the set of tokens; and
  
  generating a segmented image based on the image and the segmentation map.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The computer-readable storage medium of claim 1, wherein the actions further comprise:
    - iteratively updating an n-gram data element that encodes semantic features of the order of the set of tokens, wherein each of a plurality of iterative updates of the n-gram data element is based on a previous version of the n-gram data element and one of the token data elements based on the order of the set of tokens; and
      
      iteratively updating the segmentation map, wherein each of the plurality of iterative updates of the segmentation map is further based on a combination of the image map and an updated n-gram data element corresponding to the order of the set of tokens.
  - 3. The computer-readable storage medium of claim 2, wherein each of the plurality of iterative updates of the n-gram data element is further based on a trained long short-term memory (LSTM) neural network that propagates each of the plurality of iterative updates of the n-gram element.
  - 4. The computer-readable storage medium of claim 1, wherein each of the plurality of iterative updates of the segmentation map is further based on a trained recurrent neural network (RNN) that propagates each of the plurality of iterative updates of the segmentation map.
  - 5. The computer-readable storage medium of claim 1, wherein each of the plurality of iterative updates of the segmentation map is further based on a trained convolutional multimodal recurrent neural network (mRNN) that propagates each of the plurality of iterative updates of the segmentation map.
  - 6. The computer-readable storage medium of claim 1, wherein the image features are identified by an image feature identification model that is implemented on a trained fully convolutional neural network (FCN).
  - 7. The one or more computer-readable storage media of claim 1, wherein a word embedding natural language model that embeds each of the tokens in a multidimensional space and a distance between a pair of tokens embedded within the multidimensional space indicates semantic similarities between the pair of tokens based on semantic distributions within a semantic corpus is employed to identify the semantic features of the tokens.
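
Claim 7 relies on distances between embedded tokens indicating semantic similarity. A minimal sketch of that idea follows, assuming cosine distance as the metric (the claim does not name one) and a hand-written three-dimensional embedding table. The `EMBEDDINGS` values and the `cosine_distance` helper are hypothetical and are not taken from any trained model.

```python
import math

# Hypothetical toy embeddings; a real word embedding model would learn these.
EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

near = cosine_distance(EMBEDDINGS["dog"], EMBEDDINGS["puppy"])
far = cosine_distance(EMBEDDINGS["dog"], EMBEDDINGS["car"])
```

With these toy vectors, `near` is smaller than `far`, which is the property the claim attributes to the embedding space.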

8. A method for segmenting an image, comprising:
- receiving the image, wherein the image includes a plurality of pixels;
  
  generating an n-gram based on a natural language phrase that references an object depicted within a first region of the image, wherein the n-gram includes an ordered set of tokens;
  
  generating an image data structure that encodes a mapping between each of a plurality of image features and a corresponding portion of the plurality of pixels, wherein the plurality of image features are identified within the image based on an image feature identification model;
  
  generating a set of token data structures based on a natural language model, wherein each of the token data structures encodes semantic features of a corresponding token of the set of tokens;
  
  iteratively generating a segmentation map based on a first recurrent neural network (RNN) and a plurality of iteratively generated combinations of the image data structure and portions of the set of token data structures, wherein the first RNN propagates the segmentation map during the iterative generation of the segmentation data structure and the segmentation map identifies a subset of the plurality of pixels that are included in the first region of the image; and
  
  segmenting the image based on the iteratively generated segmentation map.
- Dependent claims (9, 10, 11, 12, 13):
  - 9. The method of claim 8, further comprising:
    - iteratively generating an n-gram data structure based on a second RNN and the set of token data structures, wherein the second RNN propagates the n-gram data structure during the iterative generation of the n-gram data structure; and
      
      iteratively generating the segmentation map further based on a plurality of iteratively generated combinations of the image data structure and the n-gram data structure.
  - 10. The method of claim 9, further comprising:
    - training a long short-term memory (LSTM) neural network based on training data that includes a plurality of other n-grams; and
      
      employing the trained LSTM as the second RNN.
  - 11. The method of claim 8, further comprising:
    - receiving a training image, a training n-gram, and a ground-truth segmentation map;
      
      iteratively generating a training segmentation map based on the training image, the training n-gram, and the first RNN;
      
      determining a loss metric based on a comparison of the ground-truth segmentation map and the training segmentation map; and
      
      updating the first RNN based on the loss metric.
  - 12. The method of claim 8, further comprising:
    - receiving audio data encoding the natural language phrase as spoken by a user;
      
      generating textual data based on the received audio data and a speech-to-text model; and
      
      generating the n-gram based on the generated textual data.
  - 13. The method of claim 8, further comprising:
    - training a convolutional multimodal recurrent neural network (mRNN) based on training data that includes a plurality of other images, a plurality of other n-grams, and a plurality of segmentation maps; and
      
      employing the trained mRNN as the first RNN.
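
The training loop of claim 11 (predict a map, compare it with the ground truth, update the model) can be sketched as follows. The claim does not specify the loss metric, so this sketch assumes a per-pixel binary cross-entropy and a model with a single scalar weight applied to precomputed per-pixel scores. The functions `bce_loss` and `train_step`, the learning rate, and the data are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weight, scores):
    """Toy model: per-pixel probability from one scalar weight."""
    return [[sigmoid(weight * s) for s in row] for row in scores]

def bce_loss(pred_map, truth_map, eps=1e-7):
    """Mean binary cross-entropy between predicted and ground-truth maps."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred_map, truth_map):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

def train_step(weight, scores, truth_map, lr=0.5):
    """One gradient-descent update of the weight on the mean loss."""
    grad, count = 0.0, 0
    for s_row, t_row in zip(scores, truth_map):
        for s, t in zip(s_row, t_row):
            grad += (sigmoid(weight * s) - t) * s  # d(loss)/d(weight) per pixel
            count += 1
    return weight - lr * grad / count

scores = [[2.0, -1.0], [0.5, -3.0]]   # hypothetical per-pixel scores
truth = [[1, 0], [1, 0]]              # ground-truth segmentation map
weight = 0.0
loss_before = bce_loss(predict(weight, scores), truth)
weight = train_step(weight, scores, truth)
loss_after = bce_loss(predict(weight, scores), truth)
```

A real system would backpropagate the loss through the whole recurrent network; the single weight here only shows the compare-then-update cycle.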

14. A computing system for segmenting an image based on an n-gram that references a first region of the image, wherein the image includes a plurality of pixels and the n-gram includes an ordered set of tokens, the system comprising:
- a processor device; and
  
  a computer-readable storage medium, coupled with the processor device, having instructions stored thereon, which, when executed by the processor device, perform actions comprising:
  
  steps for identifying a plurality of image features within the image based on an image feature identification model;
  
  steps for encoding a mapping between each of the plurality of image features and a corresponding portion of the plurality of pixels in an image data structure;
  
  steps for identifying semantic features for each token in the set of tokens based on a natural language model;
  
  steps for encoding the semantic features of each token in the set of tokens as a set of token data structures;
  
  steps for iteratively updating a segmentation map based on the segmentation map and an ordered set of combinations of the image data structure and the set of token data structures based on an order of the set of tokens; and
  
  steps for providing a segmented image based on the image and the segmentation map.
- Dependent claims (15, 16, 17, 18, 19, 20):
  - 15. The computing system of claim 14, the actions further comprising:
    - steps for iteratively encoding semantic features of the order of the set of tokens in an n-gram data structure based on the n-gram data structure and the set of token data structures; and
      
      steps for iteratively updating the segmentation map further based on the iteratively encoded n-gram data structure.
  - 16. The computing system of claim 15, the actions further comprising:
    - steps for updating the n-gram data structure based on a trained recurrent neural network (RNN); and
      
      steps for employing the trained RNN to store an encoding of the n-gram data structure for a subsequent updating of the n-gram data structure.
  - 17. The computing system of claim 14, the actions further comprising:
    - steps for updating the segmentation map based on a trained recurrent neural network (RNN); and
      
      steps for employing the trained RNN to store an encoding of the segmentation map for a subsequent updating of the segmentation map.
  - 18. The computing system of claim 17, wherein the trained RNN is a convolutional multimodal recurrent neural network (mRNN).
  - 19. The computing system of claim 14, wherein the image feature identification model is implemented on a trained fully convolutional neural network (FCN).
  - 20. The computing system of claim 14, wherein the natural language model is a word embedding natural language model.
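
Claims 15 and 16 describe an n-gram data structure that is updated from its own previous version and the next token data structure, so that the encoding depends on token order. The sketch below illustrates that recurrence under simplifying assumptions: it uses a plain `tanh` recurrence with fixed scalar weights in place of a trained RNN or LSTM, and the names `update_ngram_state` and `encode_ngram`, the weights, and the token vectors are hypothetical.

```python
import math

def update_ngram_state(prev_state, token_vec, w_state=0.5, w_token=1.0):
    """New n-gram encoding from the previous encoding and the next token vector."""
    return [math.tanh(w_state * s + w_token * t)
            for s, t in zip(prev_state, token_vec)]

def encode_ngram(token_vecs):
    """Return the n-gram encoding after each token, in token order."""
    state = [0.0] * len(token_vecs[0])
    states = []
    for vec in token_vecs:
        state = update_ngram_state(state, vec)  # stored state feeds the next update
        states.append(state)
    return states

# Two hypothetical token vectors presented in both orders.
token_a, token_b = [1.0, 0.0], [0.0, 1.0]
forward = encode_ngram([token_a, token_b])[-1]
backward = encode_ngram([token_b, token_a])[-1]
```

The two final encodings differ even though the same tokens were seen, which is the order sensitivity that a bag-of-tokens encoding would lose.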


Current Assignee: Adobe Inc.
Original Assignee: Adobe Systems Incorporated (Adobe Inc.)
Inventors: Lin, Zhe; Lu, Xin; Shen, Xiaohui; Yang, Jimei; Liu, Chenxi
Primary Examiner(s): Akhavannik, Hadi
Application Number: US15/458,887
Publication Number: US 20180268548A1
Time in Patent Office: 567 Days
Field of Search: None

CPC Class Codes

G06F 18/24143   Distances to neighbourhood ...

G06F 40/216   using statistical methods

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/30   Semantic analysis

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/084   Backpropagation, e.g. using...

G06T 2207/20084   Artificial neural networks ...

G06T 2207/20101   Interactive definition of p...

G06T 7/11   Region-based segmentation

G06V 10/82   using neural networks

G06V 20/10   Terrestrial scenes scenes u...

G06V 20/70   Labelling scene content, e....

G06V 30/19173   Classification techniques

G10L 15/26   Speech to text systems G10L...
