Collective media annotation using undirected random field models
First Claim
1. A method for detecting two or more concepts in a source digital video which includes one or more digital video frames comprising:
(a) segmenting the source digital video into a plurality of shots, wherein each shot includes one or more of the digital video frames;
(b) identifying a keyframe within each shot, wherein the keyframe is one of the one or more of the digital video frames;
(c) extracting low level features from the keyframe, wherein the low level features are representative of the two or more concepts, and are related in a graph of concepts, and wherein each concept is semi-automatically generated text associated with the one or more digital video frames, and wherein semi-automatically generated text is automatically generated text which has been manually revised;
(d) training a discriminative classifier for each concept using a set of the low level features, wherein the discriminative classifier is a support vector machine;
(e) building a collective annotation model combining each of the discriminative classifiers;
(f) defining in the collective annotation model one or more interaction potentials to model interdependence between related concepts;
(g) receiving a second source digital video;
(h) applying the discriminative classifiers to the second source digital video; and
(i) determining a probability of a presence or absence of the two or more concepts in the low level features extracted from the second source digital video using the collective annotation model and the defined interaction potentials.
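Steps (a) and (b) above, shot segmentation and keyframe selection, can be sketched in a few lines. The frame-difference cut threshold and the mean-frame keyframe heuristic below are illustrative assumptions, not the specific method claimed:

```python
def segment_into_shots(frames, threshold=30.0):
    """Split a frame sequence into shots at large frame-to-frame changes.

    frames: list of equal-length grayscale pixel lists.
    A new shot starts wherever the mean absolute pixel difference
    exceeds `threshold` (a hypothetical cut-detection heuristic).
    """
    shots, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            shots.append(current)
            current = []
        current.append(cur)
    shots.append(current)
    return shots


def pick_keyframe(shot):
    """Choose the frame closest to the shot's per-pixel mean as the keyframe."""
    mean = [sum(px) / len(shot) for px in zip(*shot)]

    def dist(frame):
        return sum((a - b) ** 2 for a, b in zip(frame, mean))

    return min(shot, key=dist)
```

A sequence of three dark frames followed by three bright frames would yield two shots, with one keyframe selectable from each; low-level features (step (c)) would then be extracted from those keyframes only.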
Abstract
In an embodiment, the present invention relates to a method for semantic analysis of digital multimedia. Low level features are extracted that are representative of one or more concepts, a discriminative classifier is trained for each concept using these low level features, and a collective annotation model is built from the discriminative classifiers. In various embodiments of the invention, the framework is fully generic and can be applied with any number of low-level features or discriminative classifiers. Further, the analysis makes no domain-specific assumptions, and can be applied to activity analysis or other scenarios without modification. The framework admits the inclusion of a broad class of potential functions, hence enabling multi-modal analysis and the fusion of heterogeneous information sources.
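The collective annotation model can be illustrated with a minimal sketch: per-concept unary scores (standing in for the SVM classifier outputs) are combined with pairwise interaction potentials over a concept graph, and exact presence/absence marginals are computed by enumerating all labelings. All concept names, scores, and potential values here are hypothetical, and brute-force enumeration is only feasible for a handful of concepts:

```python
import itertools
import math

# Hypothetical unary scores: per-concept discriminative classifier
# outputs (e.g. SVM decision values) for one keyframe's features.
unary = {"sky": 2.0, "outdoor": 1.5, "indoor": -1.0}

# Interaction potentials over the concept graph: positive weights
# reward co-occurring labels, negative weights penalise them.
pairwise = {("sky", "outdoor"): 1.2, ("outdoor", "indoor"): -2.0}

concepts = list(unary)


def joint_score(assign):
    """Unnormalised log-score of one presence/absence assignment."""
    s = sum(unary[c] * assign[c] for c in concepts)
    for (a, b), w in pairwise.items():
        s += w * assign[a] * assign[b]
    return s


def marginals():
    """Exact marginals P(concept present) via all 2^n labelings."""
    z = 0.0
    present = {c: 0.0 for c in concepts}
    for bits in itertools.product([0, 1], repeat=len(concepts)):
        assign = dict(zip(concepts, bits))
        weight = math.exp(joint_score(assign))
        z += weight
        for c in concepts:
            if assign[c]:
                present[c] += weight
    return {c: present[c] / z for c in concepts}
```

Note how the negative "outdoor"/"indoor" potential pulls the marginal for "indoor" down even further than its unary score alone would, which is the interdependence effect the collective model is built to capture.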
6 Citations
19 Claims
1. A method for detecting two or more concepts in a source digital video which includes one or more digital video frames comprising:
(a) segmenting the source digital video into a plurality of shots, wherein each shot includes one or more of the digital video frames;
(b) identifying a keyframe within each shot, wherein the keyframe is one of the one or more of the digital video frames;
(c) extracting low level features from the keyframe, wherein the low level features are representative of the two or more concepts, and are related in a graph of concepts, and wherein each concept is semi-automatically generated text associated with the one or more digital video frames, and wherein semi-automatically generated text is automatically generated text which has been manually revised;
(d) training a discriminative classifier for each concept using a set of the low level features, wherein the discriminative classifier is a support vector machine;
(e) building a collective annotation model combining each of the discriminative classifiers;
(f) defining in the collective annotation model one or more interaction potentials to model interdependence between related concepts;
(g) receiving a second source digital video;
(h) applying the discriminative classifiers to the second source digital video; and
(i) determining a probability of a presence or absence of the two or more concepts in the low level features extracted from the second source digital video using the collective annotation model and the defined interaction potentials.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19)
17. A system to identify two or more concepts in digital media comprising:
a computer including a processing component for extracting low level features representative of the two or more concepts from a source digital video, wherein each concept is semi-automatically generated text associated with one or more digital video frames of the source digital video, and wherein semi-automatically generated text is automatically generated text which has been manually revised;
a processing component for training a discriminative classifier for each concept using a set of the low level features, wherein the discriminative classifier is a support vector machine, and wherein the two or more concepts are related in a graph of concepts;
a processing component capable of building a collective annotation model based on each of the discriminative classifiers;
one or more interaction potentials defined in the collective annotation model to model interdependence between related concepts; and
a processing component capable of identifying a probability of a presence or absence of the two or more concepts in the low level features extracted from a second source digital video using the collective annotation model and the defined interaction potentials.
18. A non-transitory machine readable medium having instructions stored thereon that when executed by a processor cause a system to:
segment a source digital video into a plurality of shots, wherein each shot includes one or more digital video frames;
identify a keyframe within each shot, wherein the keyframe is one of the one or more digital video frames;
extract low level features from the keyframe, wherein the low level features are representative of two or more concepts, and are related in a graph of concepts, and wherein each concept is semi-automatically generated text associated with one or more digital video frames, and wherein semi-automatically generated text is automatically generated text which has been manually revised;
train a discriminative classifier for each concept using a set of the low level features, wherein the discriminative classifier is a support vector machine;
build a collective annotation model based on each of the discriminative classifiers;
define in the collective annotation model one or more interaction potentials to identify related concepts;
receive a second source digital video;
apply the discriminative classifiers to the second source digital video; and
determine a probability of a presence or absence of the two or more concepts in the low level features extracted from the second source digital video using the collective annotation model and the defined interaction potentials.
Specification