Visual Language Modeling for Image Classification

US 20090060351A1
Filed: 08/30/2007
Published: 03/05/2009
Est. Priority Date: 08/30/2007
Status: Active Grant

Abstract

Systems and methods for visual language modeling for image classification are described. In one aspect the systems and methods model training images corresponding to multiple image categories as matrices of visual words. Visual language models are generated from the matrices. In view of a given image, for example, provided by a user or from the Web, the systems and methods determine an image category corresponding to the given image. This image categorization is accomplished by maximizing the posterior probability of visual words associated with the given image over the visual language models. The image category, or a result corresponding to the image category, is presented to the user.
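
The decision rule described in the abstract can be written compactly. The notation here is an assumption and is not taken from the patent text: $D$ is the visual document (the matrix of visual words) built from the given image, $w_{ij}$ is the visual word at row $i$ and column $j$, and $C$ ranges over the image categories. The factorization shown is the trigram case recited in the claims (dependency on the immediate vertical and horizontal neighbors). The unigram case drops the conditioning words and the bigram case keeps a single neighbor.

```latex
C^{*} \;=\; \arg\max_{C} P(C \mid D)
      \;=\; \arg\max_{C} P(D \mid C)\, P(C),
\qquad
P(D \mid C) \;\approx\; \prod_{i,j} P\bigl(w_{ij} \mid w_{i-1,j},\; w_{i,j-1},\; C\bigr)
```

The second equality follows from Bayes' rule, because $P(D)$ does not depend on $C$.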

20 Claims

1. A method at least partially implemented by a computing device, the method comprising:
- modeling images representing multiple image categories as respective matrices of visual words;
  
  generating visual language models from the respective matrices of visual words;
  
  estimating an image category for an image in view of the visual language models; and
  
  presenting the image category or a result based on the image category to a user.
- - 2. The method of claim 1, wherein modeling the images as respective matrices of visual words, the images are training images, and wherein the modeling further comprises:
    - for each training image;
      
      dividing the training image into multiple image patches, each image patch being a group of pixels;
      
      for each image patch of the image patches;
      
      extracting features to describe one or more properties of the patch;
      
      representing at least a subset of the features as one or more multidimensional vectors;
      
      transforming, in view of a visual word grammar, the one or more multidimensional vectors into a respective hash code, the respective hash code being a visual word of the visual words, the visual word being in a visual document corresponding to the training image.
  - 3. The method of claim 2, wherein the method further comprises selecting the features to emphasize aspects of a category corresponding to the training image.
  - 4. The method of claim 1, wherein generating the visual language models from the respective matrices of visual words further comprises:
    - correlating visual words in the matrices of visual words according to a visual word grammar indicating conditional distribution of the visual words; and
      
      for each category of the multiple image categories, building respective visual language models based on the conditional distribution of the visual words.
  - 5. The method of claim 4, wherein the visual grammar indicates that visual words are conditionally dependent on only previous visual words.
  - 6. The method of claim 4, wherein a first model of the visual language models treats visual words corresponding to the training image as independent visual words, a second model of the visual language models is based on proximity between neighboring visual words, and a third model of the visual language models is based on visual word dependency on immediate vertical and horizontal neighboring visual words.
  - 7. The method of claim 6, wherein the first model is a unigram model, the second model is a bigram model, and the third model is a trigram model.
  - 8. The method of claim 1, wherein the images are training images, and wherein estimating the image category further comprises:
    - generating a visual document for the image, the visual document comprising a matrix of visual words;
      
      maximizing, for each category of the multiple image categories, conditional distribution of n-grams of individual words of the visual words with respect to the visual language models; and
      
      calculating the image category to be a category associated with a maximum posterior probability to indicate a likelihood that the image is generated by the category, the maximum posterior probability being a generalization probability represented by a product of all conditional probabilities of n-gram.
  - 9. The method of claim 8, wherein “n” in the n-grams comprises 1, 2, and 3.
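
The patch-to-visual-word steps recited in claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims do not name a feature or a hash function, so the mean-subtracted pixel intensities used as the feature vector, the random-hyperplane sign hash, and the names `image_to_visual_document`, `patch`, `n_bits`, and `seed` are all hypothetical.

```python
import numpy as np

def image_to_visual_document(image, patch=8, n_bits=8, seed=0):
    """Map a 2-D grayscale image to a matrix of integer visual words.

    Assumptions (not from the claims): non-overlapping square patches,
    mean-subtracted intensities as features, and a random-hyperplane
    sign hash as the vector-to-hash-code transform.
    """
    rng = np.random.default_rng(seed)
    # One set of hyperplanes for every patch; reusing the same seed for
    # training and test images keeps the visual vocabulary consistent.
    planes = rng.standard_normal((n_bits, patch * patch))
    weights = 2 ** np.arange(n_bits)          # bit position weights
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    doc = np.zeros((rows, cols), dtype=np.int64)
    for i in range(rows):
        for j in range(cols):
            # Divide the image into patches (groups of pixels).
            block = image[i * patch:(i + 1) * patch,
                          j * patch:(j + 1) * patch].astype(float)
            # Represent the patch as a multidimensional feature vector.
            feat = (block - block.mean()).ravel()
            # Transform the vector into a hash code: the visual word.
            bits = (planes @ feat > 0).astype(np.int64)
            doc[i, j] = int(bits @ weights)
    return doc
```

With the default 8x8 patches, a 32x48 image yields a 4x6 matrix of visual words, each in the range 0 to 255.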

10. A computer-readable medium including computer-program instructions executable by a processor encoded thereon, the computer-program instructions when executed by the processor for performing operations comprising:
- building visual language models from matrices of visual words generated from a set of training images, the visual language models being based on a visual word grammar, the training images corresponding to one or more predetermined image classifications;
  
  creating a visual document from an image for image categorization;
  
  determining an image category for the image based on characteristics of the visual document in view of the visual language models, and the image category corresponding to a classification of the one or more predetermined image classifications; and
  
  presenting the image category or a result based on the image category to a user.
- - 11. The computer-readable medium of claim 10, wherein the visual word grammar indicates that visual words are conditionally dependent on other visual words according to a predetermined order of word generation.
  - 12. The computer-readable medium of claim 10, wherein building the visual language models further comprises:
    - for each training image;
      
      dividing the training image into multiple image patches, each image patch being a group of pixels;
      
      for each image patch of the image patches;
      
      extracting features to describe one or more properties of the patch;
      
      representing at least a subset of the features as one or more multidimensional vectors;
      
      transforming, in view of a visual word grammar, the one or more multidimensional vectors into a respective hash code, the respective hash code being a visual word of the visual words.
  - 13. The computer-readable medium of claim 12, wherein the operations further comprise operations for selecting the features to emphasize aspects of a classification corresponding to the training image.
  - 14. The computer-readable medium of claim 12, wherein the operations further comprise operations for:
    - correlating the visual words according to a conditional distribution of the visual words; and
      
      for each classification of the one or more predetermined image classifications, building respective visual language models based on the conditional distribution of the visual words, a first model of the visual language models treating visual words corresponding to the training image as independent visual words, a second model of the visual language models being based on proximity between two neighboring visual words, a third model of the visual language models being based on visual word dependency on immediate vertical and horizontal neighboring visual words.
  - 15. The computer-readable medium of claim 14, wherein the first model is a unigram model, the second model is a bigram model, and the third model is a trigram model.
  - 16. The computer-readable medium of claim 10, wherein determining the image category further comprises:
    - generating a visual document for the image, the visual document comprising a matrix of visual words;
      
      maximizing, for each classification of the one or more predetermined image classifications, conditional distribution of n-grams of individual words of the visual words with respect to the visual language models; and
      
      calculating the image category to be a category associated with a maximum posterior probability to indicate a likelihood that the image is generated by the category, the maximum posterior probability being a generalization probability represented by a product of all conditional probabilities of n-gram.
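
The model building and category estimation recited in claims 14 to 16 can be sketched as below. This is an illustrative sketch under assumptions, not the method as practiced: the class `VisualLanguageModel`, the helper `_context`, the function `classify`, the additive smoothing parameter `alpha`, the choice of the left neighbor for the bigram context, and the use of `None` for neighbors beyond the image border are all hypothetical. The product of conditional probabilities is accumulated in log space to avoid underflow.

```python
import math
from collections import Counter

def _context(doc, i, j, n):
    """Conditioning words for position (i, j): none (unigram),
    the left neighbor (bigram), or left and upper neighbors (trigram)."""
    left = doc[i][j - 1] if j > 0 else None
    up = doc[i - 1][j] if i > 0 else None
    return {1: (), 2: (left,), 3: (left, up)}[n]

class VisualLanguageModel:
    def __init__(self, n, vocab_size, alpha=1.0):
        self.n, self.vocab_size, self.alpha = n, vocab_size, alpha
        self.joint = Counter()    # counts of (context, word) pairs
        self.context = Counter()  # counts of contexts

    def fit(self, docs):
        """Count n-grams over the visual documents of one category."""
        for doc in docs:
            for i, row in enumerate(doc):
                for j, word in enumerate(row):
                    ctx = _context(doc, i, j, self.n)
                    self.joint[(ctx, word)] += 1
                    self.context[ctx] += 1
        return self

    def log_likelihood(self, doc):
        """Sum of log conditional probabilities (log of their product)."""
        total = 0.0
        for i, row in enumerate(doc):
            for j, word in enumerate(row):
                ctx = _context(doc, i, j, self.n)
                num = self.joint[(ctx, word)] + self.alpha
                den = self.context[ctx] + self.alpha * self.vocab_size
                total += math.log(num / den)
        return total

def classify(doc, models, priors=None):
    """Return the category with the largest log posterior.
    A uniform prior is assumed when `priors` is not given."""
    def score(category):
        prior = math.log(priors[category]) if priors else 0.0
        return prior + models[category].log_likelihood(doc)
    return max(models, key=score)
```

A toy usage with a two-word vocabulary: fit one trigram model per category, for example `models = {"stripes": VisualLanguageModel(3, 2).fit([[[0, 1, 0, 1]] * 3]), "flat": VisualLanguageModel(3, 2).fit([[[0, 0, 0, 0]] * 3])}`, then call `classify(doc, models)` on a new visual document.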

17. A computing device comprising:
- a processor; and
  
  a memory coupled to the processor, the memory including computer-program instructions encoded thereon, the computer-program instructions, when executed by the processor, for performing operations comprising:
  
  loading a set of training images associated with corresponding image categories;
  
  for each training image of the training images;
  
  (a) dividing the training image into a respective set of image patches;
  
  (b) generating a visual word for each image patch to form a respective visual document for the training image;
  
  for each category of the one or more image categories, generating visual language model(s);
  
  estimating, using the visual language model(s), an image category for a given image;
  
  presenting the image category or a result corresponding to the image category to a user.
- - 18. The computing device of claim 17, wherein the visual language model(s) comprise a unigram visual language model, a bigram visual language model, and a trigram visual language model.
  - 19. The computing device of claim 17:
    - wherein generating the visual word for each image patch further comprises extracting one or more features from the image patch to describe properties of the image patch, the one or more features being selected according to an image category corresponding to the training image; and
      
      wherein the features are used to generate a model of the visual language model(s).
  - 20. The computing device of claim 17, wherein estimating the image category for the given image further comprises generating a visual document comprising respective visual words from the given image to determine a conditional distribution of the visual words over respective ones of these visual language model(s), a visual language model associated with a largest conditional distribution of the visual words indicating the image category.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Wu, Lei; Ma, Wei-Ying; Li, Mingjing; Li, Zhiwei

Granted Patent

US 8,126,274 B2
US Class Current

382/224
CPC Class Codes

G06F 18/24155   Bayesian classification

G06V 10/424   Syntactic representation, e...

G06V 10/50   by performing operations wi...

G06V 10/764   using classification, e.g. ...

G06V 20/70   Labelling scene content, e....
