Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities
Abstract
This invention is a system and method for categorizing (classifying) multimedia items. These items comprise multiple disparate information sources, in particular visual information and textual information. Classifiers are induced by combining textual and visual feature vectors. Textual features are the traditional ones, such as word count vectors. Visual features include, but are not limited to, color properties and motion properties of key intervals. The visual feature vectors are constructed so that they are sparse. Their components are features such as the absence or presence of the color green in spatial regions and the absence or amount of visual flow in spatial regions of the media items. The text and visual representation vectors are combined in a systematic and coherent fashion. This vector representation of a media item lends itself to well-established learning techniques. The resulting system, the subject of this invention, categorizes (or classifies) media items based on both textual and visual features.
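As a rough illustration of the unified representation described in the abstract, the following Python sketch concatenates a binary color-presence vector with a word-count vector. The vocabulary, the number of spatial regions, and all function names are assumptions made for this example; they are not taken from the patent.

```python
from collections import Counter

# Hypothetical vocabulary and region count, chosen only for illustration.
VOCAB = ["goal", "weather", "market"]
NUM_REGIONS = 4  # assume each item is summarized over 4 spatial regions


def text_vector(transcript):
    """Word-count vector over the assumed vocabulary."""
    counts = Counter(transcript.lower().split())
    return [counts[word] for word in VOCAB]


def color_vector(green_regions):
    """1 where green is present in a region, 0 where absent (mostly zeros, so sparse)."""
    return [1 if region in green_regions else 0 for region in range(NUM_REGIONS)]


def unified_vector(transcript, green_regions):
    """Concatenate the visual and textual vectors into one feature vector."""
    return color_vector(green_regions) + text_vector(transcript)


vec = unified_vector("Goal goal for the home side", {0, 2})
print(vec)  # [1, 0, 1, 0, 2, 0, 0]
```

The first four components are visual and the last three are textual; any standard learner that accepts fixed-length vectors could then be trained on such rows.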
24 Claims
-
1. A computer system having one or more memories and one or more central processing units (CPUs), the system comprising:
-
one or more multimedia items, stored in the memories, each multimedia item having two or more disparate modalities, the disparate modalities being at least one or more visual modalities and one or more textual modalities; and
a combining process that creates a visual feature vector for each of the visual modalities and a textual feature vector for each of the textual modalities, and concatenates for each of the one or more multimedia items the visual feature vectors and the textual feature vectors into a unified feature vector.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
-
-
16. A computer system having one or more memories and one or more central processing units (CPUs), the system comprising:
-
one or more multimedia items, stored in the memories, each multimedia item having two or more disparate modalities, the disparate modalities being at least one or more visual modalities and one or more textual modalities;
a block process that divides the multimedia items into blocks of one or more key intervals, each key interval having one or more frames of the multimedia items;
a combining process that creates a visual feature vector for each of the visual modalities and a textual feature vector for each of the textual modalities, and concatenates for each of the blocks the visual feature vectors and the textual feature vectors into a unified feature vector;
one or more classifiers having one or more classes;
an application process that for each of the blocks, uses the classifiers to determine zero or more of the classes to which the respective blocks belong; and
a segmentation process that finds temporally contiguous groups of the blocks and combines the contiguous groups into media segments where all the blocks in the media segment have one or more of the same classes.
Dependent claims: 17, 18, 19, 20, 21
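The segmentation process recited in claim 16 can be sketched in Python as a single pass over per-block class sets. This is a minimal greedy sketch under stated assumptions: blocks arrive in temporal order, each block carries the set of classes assigned by the classifiers (possibly empty), and a segment is extended only while at least one class is shared by all of its blocks. The function name and the output tuple layout are hypothetical.

```python
def segment_blocks(block_classes):
    """Group temporally contiguous blocks that share at least one class.

    block_classes: list of sets, one per block, in temporal order.
    Returns a list of (first_block, last_block, shared_classes) tuples.
    """
    segments = []
    start, shared = None, set()
    for i, classes in enumerate(block_classes):
        # Classes common to the open segment (if any) and the current block.
        common = shared & classes if start is not None else set()
        if common:
            shared = common  # extend the open segment
        else:
            if start is not None:
                segments.append((start, i - 1, shared))  # close the open segment
            # Open a new segment only if this block has at least one class.
            start, shared = (i, set(classes)) if classes else (None, set())
    if start is not None:
        segments.append((start, len(block_classes) - 1, shared))
    return segments


labels = [{"sports"}, {"sports", "news"}, {"news"}, set(), {"weather"}]
segments = segment_blocks(labels)
print(segments)
```

With the sample labels, blocks 0 and 1 form one segment sharing "sports", block 2 forms a "news" segment, the unclassified block 3 is skipped, and block 4 forms a "weather" segment. Other grouping policies would also be consistent with the claim wording.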
-
-
22. A method for segmenting multimedia streams comprising the steps of:
-
storing one or more multimedia items in one or more memories of a computer, each multimedia item having two or more disparate modalities, the disparate modalities being at least one or more visual modalities and one or more textual modalities;
dividing the multimedia items into blocks of one or more key intervals, each key interval having one or more frames of the multimedia items;
for each block, creating a visual feature vector for each of the visual modalities and a textual feature vector for each of the textual modalities;
for each block, concatenating the visual feature vectors and the textual feature vectors into a unified feature vector;
categorizing each of the blocks by categorizing the respective unified feature vector; and
assembling two or more of the categorized blocks into a segment.
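The dividing step of claim 22 can be illustrated with a short Python sketch. It assumes fixed-length key intervals and a fixed number of key intervals per block; the claim does not prescribe either choice, and the function names and parameters are hypothetical.

```python
def key_intervals(num_frames, frames_per_interval):
    """Split frame indices 0..num_frames-1 into consecutive key intervals."""
    return [
        range(start, min(start + frames_per_interval, num_frames))
        for start in range(0, num_frames, frames_per_interval)
    ]


def blocks(intervals, intervals_per_block):
    """Group consecutive key intervals into blocks."""
    return [
        intervals[start:start + intervals_per_block]
        for start in range(0, len(intervals), intervals_per_block)
    ]


# A 10-frame item, 3 frames per key interval, 2 key intervals per block.
intervals = key_intervals(10, 3)
item_blocks = blocks(intervals, 2)
for index, block in enumerate(item_blocks):
    print(index, [list(interval) for interval in block])
```

Each resulting block is the unit for which visual and textual feature vectors would be created and concatenated in the later steps.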
-
-
23. A memory storing a program, the program comprising the steps of:
-
storing one or more multimedia items in one or more memories of a computer, each multimedia item having two or more disparate modalities, the disparate modalities being at least one or more visual modalities and one or more textual modalities;
dividing the multimedia items into blocks of one or more key intervals, each key interval having one or more frames of the multimedia items;
for each block, creating a visual feature vector for each of the visual modalities and a textual feature vector for each of the textual modalities;
for each block, concatenating the visual feature vectors and the textual feature vectors into a unified feature vector;
categorizing each of the blocks by categorizing the respective unified feature vector; and
assembling two or more of the categorized blocks into a segment.
-
-
24. A system for segmenting multimedia streams comprising:
-
means for storing one or more multimedia items in one or more memories of a computer, each multimedia item having two or more disparate modalities, the disparate modalities being at least one or more visual modalities and one or more textual modalities;
means for dividing the multimedia items into blocks of one or more key intervals, each key interval having one or more frames of the multimedia items;
means for creating a visual feature vector for each of the visual modalities and a textual feature vector for each of the textual modalities, block by block;
means for concatenating the visual feature vectors and the textual feature vectors into a unified feature vector, block by block;
means for categorizing each of the blocks by categorizing the respective unified feature vector; and
means for assembling two or more of the categorized blocks into a segment.
-
Specification