Enhanced max margin learning on multimodal data mining in a multimedia database

US 8,463,053 B1
Filed: 08/10/2009
Issued: 06/11/2013
Est. Priority Date: 08/08/2008
Status: Active Grant


Abstract

Multimodal data mining in a multimedia database is addressed as a structured prediction problem, wherein mapping from input to the structured and interdependent output variables is learned. A system and method for multimodal data mining is provided, comprising defining a multimodal data set comprising image information; representing image information of a data object as a set of feature vectors in a feature space; clustering in the feature space to group similar features; associating a non-image representation with a respective image data object based on the clustering; determining a joint feature representation of a respective data object as a mathematical weighted combination of a set of components of the joint feature representation; optimizing a weighting for a plurality of components of the mathematical weighted combination with respect to a prediction error between a predicted classification and a training classification; and employing the mathematical weighted combination for automatically classifying a new data object.
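The clustering and association steps named in the abstract can be illustrated with a minimal sketch. It assumes plain k-means over region feature vectors and a simple co-occurrence rule for attaching annotation words to clusters; the abstract does not specify either choice, and the names `nearest_centroid`, `kmeans`, `words_per_cluster`, `region_to_image`, and `image_words` are hypothetical.

```python
import numpy as np

def nearest_centroid(features, centroids):
    """Index of the closest centroid for every feature vector."""
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed clustering choice); returns (centroids, labels)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest_centroid(features, centroids)
        for c in range(k):
            members = features[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids, nearest_centroid(features, centroids)

def words_per_cluster(labels, region_to_image, image_words, k):
    """For each cluster, collect the annotation words of every image that
    contributed a region to it (an assumed co-occurrence association)."""
    assoc = [set() for _ in range(k)]
    for region, cluster in enumerate(labels):
        assoc[cluster] |= set(image_words[region_to_image[region]])
    return assoc
```

Here each row of `features` stands for one image region, `region_to_image` maps a region to the image it came from, and `image_words` holds the annotation words of each image.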

20 Claims

1. A method for multimodal data mining, comprising:
- defining a multimodal data set comprising image information;

  representing image information of a data object as a set of feature vectors in a feature space representing a plurality of constraints for each training example, wherein the feature vectors comprise a joint feature representation associated with Lagrange multipliers, the feature vectors being partitioned into a dual variable set comprising two partitions and having non-image representations associated with the respective image data object;

  clustering in the feature space to group similar features;

  associating a non-image representation with a respective data object based on the clustering;

  determining a joint feature representation of a respective data object as a mathematical weighted combination of a set of components of the joint feature representation;

  optimizing a weighting for a plurality of components of the mathematical weighted combination with respect to a prediction error between a predicted classification and a training classification by iteratively solving a Lagrange dual problem, with an automated data processor, by partitioning the Lagrange multipliers into an active set and an inactive set, wherein the Lagrange multiplier for a member of the active set is greater than or equal to zero and the Lagrange multiplier for a member of the inactive set is zero, the iteratively solving comprising moving members of the active set having zero-valued Lagrange multipliers to the inactive set without changing an objective function, and moving members of the inactive set to the active set which result in a decrease in the objective function; and

  employing the mathematical weighted combination for automatically classifying a new data object, wherein:

  the set of feature vectors in the feature space represents a plurality of constraints for each training example, the feature vectors comprise joint feature representation defined by $\Phi$, having a Lagrange multiplier $\mu_{i,\bar{y}}$ for each constraint to form the Lagrangian, wherein $i$ and $j$ denote different elements of a respective set, $y_i$ is an annotation and a member of the set $Y$, $x_i$ and $x_j$ are each annotations and members of the set $X$, superscript $T$ denotes a transpose matrix, $n$ is the number of elements in the respective set, $\bar{y}$ represents a prediction of $y_i$ from the set $Y_i$, $\tilde{y}$ is an operator of $\bar{y}$, $\alpha = \sum_{i,\bar{y}} \mu_{i,\bar{y}} \Phi_{i,y_i,\bar{y}}$, $l(\bar{y}, y_i)$ is a loss function defined as the number of the different entries in vectors $\bar{y}$ and $y_i$, $\Phi_{i,y_i,\bar{y}} = \Phi_i(y_i) - \Phi_i(\bar{y})$, and a kernel function $K((x_i,\bar{y}),(x_j,\tilde{y})) = \langle \Phi_{i,y_i,\bar{y}}, \Phi_{j,y_j,\tilde{y}} \rangle$, the feature vectors being partitioned into a dual variable set $\mu$ comprising two partitions, $\mu_B$ and $\mu_N$, and non-image representations $S$ associated with the respective image data object, the dual variable set $\mu$ having $i$ examples such that $\mu = [\mu_1^T \ldots \mu_n^T]^T$ and $S = [S_1^T \ldots S_n^T]^T$, wherein the lengths of $\mu$ and $S$ are the same, and $A_i$ is defined as a vector which has the same length as that of $\mu$, where $A_{i,\bar{y}} = 1$ and $A_{j,\bar{y}} = 0$ for $j \neq i$, such that $A = [A_1 \ldots A_n]^T$, matrix $D$ represents a kernel matrix where each entry is $K((x_i,\bar{y}),(x_j,\tilde{y}))$, and $C$ represents a vector where each entry is a constant $C$;

  the feature vectors comprise a dual variable set $\mu$ comprising labeled examples which is decomposed into two partitions, $\mu_B$ and $\mu_N$; and

  said optimizing comprises iteratively solving for each member of the set;
- - 2. The method according to claim 1, wherein the multimodal data set comprises image information and annotations of the image information.
  - 3. The method according to claim 1, wherein the dual variable set comprises a semantic variable.
  - 4. The method according to claim 1, wherein the dual variable set comprises linguistic data.
  - 5. The method according to claim 1, wherein the dual variable set comprises image data.
  - 6. The method according to claim 1, wherein the dual variable set comprises audio data.
  - 7. The method according to claim 1, wherein the dual variable set comprises a semantic variable and an image variable.
  - 8. The method according to claim 1, further comprising using the partitions to inference between two different examples in the set μ.
  - 9. The method according to claim 1, further comprising annotating at least one example based on the partitions.
  - 10. The method according to claim 1, further comprising receiving a query representing one of the dual variables, and identifying the examples which correspond to the query.
  - 11. The method according to claim 1, wherein one of the variables is a linguistic variable, and the other of the variables comprises image data, further comprising receiving at least one word as a query and responding to the query by retrieving examples having images which correspond to that word.
  - 12. The method according to claim 1, wherein a first of the variables represents a descriptive annotation of the second of the variables.
  - 13. The method according to claim 1, wherein at least one of the labeled examples comprises representations of a plurality of objects, and wherein a label variable comprises a structured output word space.
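The active-set iteration recited in the optimizing step of claim 1 can be sketched as follows. This is a simplified illustration, not the patented procedure: it assumes the objective has the quadratic form 0.5·μᵀDμ − Sᵀμ with a symmetric positive definite D, keeps only the constraint μ ≥ 0, and omits the upper-bound constraint involving A and C. The function name `active_set_nnqp` and its parameters are hypothetical.

```python
import numpy as np

def active_set_nnqp(D, S, tol=1e-10, max_iter=1000):
    """Minimize 0.5*mu@D@mu - S@mu subject to mu >= 0 (assumed objective form).

    Returns (mu, active), where active[k] is True for members of the active set.
    """
    n = len(S)
    mu = np.zeros(n)
    active = np.zeros(n, dtype=bool)  # every multiplier starts in the inactive set
    for _ in range(max_iter):
        grad = D @ mu - S
        # Inactive members whose activation would decrease the objective.
        candidates = ~active & (grad < -tol)
        if not candidates.any():
            break  # optimality conditions hold
        active[np.argmin(np.where(candidates, grad, np.inf))] = True
        while active.any():
            idx = np.flatnonzero(active)
            z = np.zeros(n)
            # Unconstrained minimizer restricted to the active set.
            z[idx] = np.linalg.solve(D[np.ix_(idx, idx)], S[idx])
            blocking = idx[z[idx] <= 0]
            if blocking.size == 0:
                mu = z
                break
            # Largest step toward z that keeps every multiplier nonnegative.
            denom = np.maximum(mu[blocking] - z[blocking], 1e-300)
            step = np.min(mu[blocking] / denom)
            mu = mu + step * (z - mu)
            # Zero-valued multipliers move to the inactive set; this relabeling
            # leaves the objective value unchanged.
            hit_zero = active & (mu <= tol)
            mu[hit_zero] = 0.0
            active[hit_zero] = False
    return mu, active
```

For example, with `D = 2·I` (2×2) and `S = [2, −2]`, the unconstrained minimizer would be `[1, −1]`; the sketch instead returns `[1, 0]`, with only the first multiplier in the active set.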

14. A system for multimodal data mining, comprising:
- an input adapted to receive a multimodal data set comprising image information;

  an automated processor, configured to:

  represent image information of a data object as a set of feature vectors in a feature space, representing a plurality of constraints for each training example, wherein the feature vectors comprise joint feature representation associated with Lagrange multipliers, the feature vectors being partitioned into a dual variable set comprising two partitions and having non-image representations associated with the respective image data object;

  perform clustering in the feature space to group similar features;

  associate a non-image representation with a respective image data object based on the clustering;

  determine a joint feature representation of a respective data object as a mathematical weighted combination of a set of components of the joint feature representation;

  optimize a weighting for a plurality of components of the mathematical weighted combination with respect to a prediction error between a predicted classification and a training classification, by iteratively solving a Lagrange dual problem, with an automated data processor, by partitioning the Lagrange multipliers into an active set and an inactive set, wherein the Lagrange multiplier for a member of the active set is greater than or equal to zero and the Lagrange multiplier for a member of the inactive set is zero, the iteratively solving comprising moving members of the active set having zero-valued Lagrange multipliers to the inactive set without changing an objective function, and moving members of the inactive set to the active set which result in a decrease in the objective function; and

  an output from the automated processor, configured to communicate a classification of a new data object based on the mathematical weighted combination, wherein:

  the set of feature vectors in the feature space represents a plurality of constraints for each training example, the feature vectors comprise joint feature representation defined by $\Phi$, having a Lagrange multiplier $\mu_{i,\bar{y}}$ for each constraint to form the Lagrangian, wherein $i$ and $j$ denote different elements of a respective set, $y_i$ is an annotation and a member of the set $Y$, $x_i$ and $x_j$ are each annotations and members of the set $X$, superscript $T$ denotes a transpose matrix, $n$ is the number of elements in the respective set, $\bar{y}$ represents a prediction of $y_i$ from the set $Y_i$, $\tilde{y}$ is an operator of $\bar{y}$, $\alpha = \sum_{i,\bar{y}} \mu_{i,\bar{y}} \Phi_{i,y_i,\bar{y}}$, $l(\bar{y}, y_i)$ is a loss function defined as the number of the different entries in vectors $\bar{y}$ and $y_i$, $\Phi_{i,y_i,\bar{y}} = \Phi_i(y_i) - \Phi_i(\bar{y})$, and a kernel function $K((x_i,\bar{y}),(x_j,\tilde{y})) = \langle \Phi_{i,y_i,\bar{y}}, \Phi_{j,y_j,\tilde{y}} \rangle$, the feature vectors being partitioned into a dual variable set $\mu$ comprising two partitions, $\mu_B$ and $\mu_N$, and non-image representations $S$ associated with the respective image data object, the dual variable set $\mu$ having $i$ examples such that $\mu = [\mu_1^T \ldots \mu_n^T]^T$ and $S = [S_1^T \ldots S_n^T]^T$, wherein the lengths of $\mu$ and $S$ are the same, and $A_i$ is defined as a vector which has the same length as that of $\mu$, where $A_{i,\bar{y}} = 1$ and $A_{j,\bar{y}} = 0$ for $j \neq i$, such that $A = [A_1 \ldots A_n]^T$, matrix $D$ represents a kernel matrix where each entry is $K((x_i,\bar{y}),(x_j,\tilde{y}))$, and $C$ represents a vector where each entry is a constant $C$;

  the feature vectors comprise a dual variable set $\mu$ comprising labeled examples which is decomposed into two partitions, $\mu_B$ and $\mu_N$; and

  said optimizing comprises iteratively solving for each member of the set;
- - 15. The system according to claim 14, wherein the multimodal data set comprises image information and annotations of the image information.
  - 16. The system according to claim 14, wherein one of the variables is a linguistic variable, and the other of the variables comprises image data, the automated processor being further configured to receive at least one word as a query and to respond to the query by retrieving examples having images which correspond to that word.
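The loss function, the feature difference, and the kernel matrix D defined in the independent claims can be made concrete with a small sketch. The joint feature map used here (a flattened outer product of a binary label vector and an image feature vector) is an assumption for illustration only; the claims do not fix a particular Φ. The helper names `hamming_loss`, `joint_feature`, and `build_kernel_matrix` are hypothetical.

```python
import numpy as np

def hamming_loss(y_bar, y_i):
    """Number of entries in which the label vectors y_bar and y_i differ."""
    return int(np.sum(np.asarray(y_bar) != np.asarray(y_i)))

def joint_feature(x, y):
    """Assumed joint feature map Phi(x, y): flattened outer product of y and x."""
    return np.outer(y, x).ravel().astype(float)

def build_kernel_matrix(X, Y, candidates):
    """Build D and the per-constraint losses.

    One constraint is created for each training example i and each candidate
    labeling y_bar that differs from the true annotation y_i. Each entry of D
    is the inner product of two difference vectors Phi_i(y_i) - Phi_i(y_bar).
    """
    diffs, losses = [], []
    for x_i, y_i, cands in zip(X, Y, candidates):
        for y_bar in cands:
            loss = hamming_loss(y_bar, y_i)
            if loss == 0:
                continue  # y_bar equals y_i, so it yields no constraint
            diffs.append(joint_feature(x_i, y_i) - joint_feature(x_i, y_bar))
            losses.append(loss)
    Phi = np.array(diffs)
    return Phi @ Phi.T, np.array(losses, dtype=float)
```

Because D is computed as a Gram matrix of difference vectors, it is symmetric and positive semidefinite by construction.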

17. A method for multimodal data processing, comprising:
- representing image information of a data object as a set of feature vectors in a feature space, representing a plurality of constraints for each training example, wherein the feature vectors comprise joint feature representation associated with Lagrange multipliers, the feature vectors being partitioned into a dual variable set comprising two partitions and having non-image representations associated with the respective image data object;

  clustering data objects having similar features in the feature space together;

  associating non-image information with a respective image data object based on the clustering;

  representing a respective data object as a mathematical weighted combination of a set of joint feature representation components;

  optimizing a weighting for a plurality of components of the mathematical weighted combination with respect to a prediction error between a predicted classification and a training classification, by iteratively solving a Lagrange dual problem, with an automated data processor, by partitioning the Lagrange multipliers into an active set and an inactive set, wherein the Lagrange multiplier for a member of the active set is greater than or equal to zero and the Lagrange multiplier for a member of the inactive set is zero, the iteratively solving comprising moving members of the active set having zero-valued Lagrange multipliers to the inactive set without changing an objective function, and moving members of the inactive set to the active set which result in a decrease in the objective function; and

  employing the mathematical weighted combination for automatically classifying a new data object, wherein:

  the set of feature vectors in the feature space represents a plurality of constraints for each training example, the feature vectors comprise joint feature representation defined by $\Phi$, having a Lagrange multiplier $\mu_{i,\bar{y}}$ for each constraint to form the Lagrangian, wherein $i$ and $j$ denote different elements of a respective set, $y_i$ is an annotation and a member of the set $Y$, $x_i$ and $x_j$ are each annotations and members of the set $X$, superscript $T$ denotes a transpose matrix, $n$ is the number of elements in the respective set, $\bar{y}$ represents a prediction of $y_i$ from the set $Y_i$, $\tilde{y}$ is an operator of $\bar{y}$, $\alpha = \sum_{i,\bar{y}} \mu_{i,\bar{y}} \Phi_{i,y_i,\bar{y}}$, $l(\bar{y}, y_i)$ is a loss function defined as the number of the different entries in vectors $\bar{y}$ and $y_i$, $\Phi_{i,y_i,\bar{y}} = \Phi_i(y_i) - \Phi_i(\bar{y})$, and a kernel function $K((x_i,\bar{y}),(x_j,\tilde{y})) = \langle \Phi_{i,y_i,\bar{y}}, \Phi_{j,y_j,\tilde{y}} \rangle$, the feature vectors being partitioned into a dual variable set $\mu$ comprising two partitions, $\mu_B$ and $\mu_N$, and non-image representations $S$ associated with the respective image data object, the dual variable set $\mu$ having $i$ examples such that $\mu = [\mu_1^T \ldots \mu_n^T]^T$ and $S = [S_1^T \ldots S_n^T]^T$, wherein the lengths of $\mu$ and $S$ are the same, and $A_i$ is defined as a vector which has the same length as that of $\mu$, where $A_{i,\bar{y}} = 1$ and $A_{j,\bar{y}} = 0$ for $j \neq i$, such that $A = [A_1 \ldots A_n]^T$, matrix $D$ represents a kernel matrix where each entry is $K((x_i,\bar{y}),(x_j,\tilde{y}))$, and $C$ represents a vector where each entry is a constant $C$;

  the feature vectors comprise a dual variable set $\mu$ comprising labeled examples which is decomposed into two partitions, $\mu_B$ and $\mu_N$; and

  said optimizing comprises iteratively solving for each member of the set;
- - 18. The method according to claim 17, wherein the multimodal data set comprises image information and annotations of the image information.
  - 19. The method according to claim 17, wherein one of the variables is a linguistic variable, and the other of the variables comprises image data, further comprising receiving at least one word as a query and responding to the query by retrieving examples having images which correspond to that word.
  - 20. The method according to claim 17, further comprising annotating at least one example based on the partitions.
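The word-query retrieval and annotation uses recited in the dependent claims can be illustrated with a short sketch. It assumes that a trained model has already produced an image-by-word score matrix; the matrix `scores`, the list `vocabulary`, the threshold, and the function names are all hypothetical and are not taken from the patent.

```python
import numpy as np

def retrieve_by_word(word, vocabulary, scores, top_k=3):
    """Indices of up to top_k images with the highest positive score for `word`."""
    col = vocabulary.index(word)
    order = np.argsort(-scores[:, col], kind="stable")
    return [int(i) for i in order[:top_k] if scores[i, col] > 0]

def annotate(image_index, vocabulary, scores, threshold=0.5):
    """Words whose score for the given image reaches the (assumed) threshold."""
    return [w for w, s in zip(vocabulary, scores[image_index]) if s >= threshold]

vocabulary = ["sky", "grass", "water"]
scores = np.array([
    [0.9, 0.1, 0.0],   # image 0
    [0.2, 0.8, 0.0],   # image 1
    [0.7, 0.0, 0.6],   # image 2
])
sky_images = retrieve_by_word("sky", vocabulary, scores, top_k=2)
image_2_words = annotate(2, vocabulary, scores)
```

The same score matrix serves both directions: reading a column answers a word query with images, and reading a row annotates an image with words.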

Current Assignee: The Research Foundation for The State University of New York (State University of New York)
Original Assignee: The Research Foundation for The State University of New York (State University of New York)
Inventors: Guo, Zhen; Zhang, Zhongfei (Mark)
Primary Examiner(s): Bella, Matthew
Assistant Examiner(s): Rosario, Dennis
Application Number: US12/538,845
Time in Patent Office: 1,401 Days
Field of Search: 382/255
US Class Current: 382/225
CPC Class Codes:
G06F 16/00   Information retrieval; Data...
G06F 16/40   of multimedia data, e.g. sl...
G06F 16/45   Clustering; Classification
G06F 16/5838   using colour
G06F 17/10   Complex mathematical operat...
G06F 18/00   Pattern recognition
G06F 18/23   Clustering techniques
G06F 18/2411   based on the proximity to a...
G06F 18/253   of extracted features
G06V 10/764   using classification, e.g. ...
