Association of visual labels and event context in image data
First Claim
1. A method, comprising:
generating a first set of contextual dimensions from one or more textual descriptions associated with a given event, wherein the one or more textual descriptions comprise a corpus of text describing one or more aspects of the given event, and the first set of contextual dimensions results in a first taxonomy for the one or more textual descriptions;
generating a second set of contextual dimensions from one or more audio-visual features associated with the given event, wherein the one or more audio-visual features comprise at least one of a video content and an image content that visually depicts the one or more aspects of the given event together with an audio content component, and the second set of contextual dimensions results in a second taxonomy for the one or more audio-visual features;
constructing a similarity structure from the first set of contextual dimensions and the second set of contextual dimensions, wherein the similarity structure comprises a visual and textual concept relationship network that links the first taxonomy and the second taxonomy based on relatedness between elements of the first taxonomy and the second taxonomy; and
matching one or more of the textual descriptions with one or more of the audio-visual features based on the similarity structure such that the one or more textual descriptions that match the one or more audio-visual features serve to annotate the one or more audio-visual features;
wherein the generating, constructing and matching steps are performed via one or more processing devices.
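Purely as an illustration of the four claimed steps (and not the patent's actual implementation), the pipeline of building two taxonomies, linking them in a relatedness network, and matching text to audio-visual features for annotation can be sketched in Python. Every name, the token-frequency taxonomy, and the prefix relatedness heuristic below are invented for the sketch.

```python
# Illustrative sketch only; all function names and heuristics are
# hypothetical and are not taken from the patent.
from collections import Counter

def build_taxonomy(items):
    """Flat taxonomy of 'contextual dimensions': token -> frequency
    across the corpus of descriptions or feature labels."""
    taxonomy = Counter()
    for item in items:
        taxonomy.update(item.lower().split())
    return taxonomy

def relatedness(term_a, term_b):
    """Toy relatedness measure: exact match scores 1.0,
    a shared 4-character prefix scores 0.5."""
    if term_a == term_b:
        return 1.0
    if term_a[:4] == term_b[:4]:
        return 0.5
    return 0.0

def build_similarity_structure(text_taxonomy, av_taxonomy):
    """Concept relationship network linking the two taxonomies:
    edges (text_term, av_term) -> relatedness score."""
    return {
        (t, v): relatedness(t, v)
        for t in text_taxonomy
        for v in av_taxonomy
        if relatedness(t, v) > 0.0
    }

def annotate(descriptions, av_feature_labels, threshold=0.5):
    """Match textual descriptions to audio-visual feature labels;
    matched descriptions serve as annotations."""
    text_tax = build_taxonomy(descriptions)
    av_tax = build_taxonomy(av_feature_labels)
    network = build_similarity_structure(text_tax, av_tax)
    annotations = {}
    for label in av_feature_labels:
        for desc in descriptions:
            score = sum(
                network.get((t, v), 0.0)
                for t in desc.lower().split()
                for v in label.lower().split()
            )
            if score >= threshold:
                annotations.setdefault(label, []).append(desc)
    return annotations

descriptions = ["goal scored in the final minute", "crowd cheering"]
labels = ["goal celebration clip", "cheering crowd audio"]
print(annotate(descriptions, labels))
```

In this toy run, "goal scored in the final minute" annotates the "goal celebration clip" label and "crowd cheering" annotates "cheering crowd audio"; a real system would replace the prefix heuristic with a learned or ontology-based relatedness measure.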
Abstract
A first set of contextual dimensions is generated from one or more textual descriptions associated with a given event, which includes one or more examples. A second set of contextual dimensions is generated from one or more visual features associated with the given event, which includes one or more visual example recordings. A similarity structure is constructed from the first set of contextual dimensions and the second set of contextual dimensions. One or more of the textual descriptions is matched with one or more of the visual features based on the similarity structure.
29 Citations
21 Claims
1. A method, comprising:
generating a first set of contextual dimensions from one or more textual descriptions associated with a given event, wherein the one or more textual descriptions comprise a corpus of text describing one or more aspects of the given event, and the first set of contextual dimensions results in a first taxonomy for the one or more textual descriptions;
generating a second set of contextual dimensions from one or more audio-visual features associated with the given event, wherein the one or more audio-visual features comprise at least one of a video content and an image content that visually depicts the one or more aspects of the given event together with an audio content component, and the second set of contextual dimensions results in a second taxonomy for the one or more audio-visual features;
constructing a similarity structure from the first set of contextual dimensions and the second set of contextual dimensions, wherein the similarity structure comprises a visual and textual concept relationship network that links the first taxonomy and the second taxonomy based on relatedness between elements of the first taxonomy and the second taxonomy; and
matching one or more of the textual descriptions with one or more of the audio-visual features based on the similarity structure such that the one or more textual descriptions that match the one or more audio-visual features serve to annotate the one or more audio-visual features;
wherein the generating, constructing and matching steps are performed via one or more processing devices. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A computer program product comprising a processor-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by one or more processing devices implement steps of:
generating a first set of contextual dimensions from one or more textual descriptions associated with a given event, wherein the one or more textual descriptions comprise a corpus of text describing one or more aspects of the given event, and the first set of contextual dimensions results in a first taxonomy for the one or more textual descriptions;
generating a second set of contextual dimensions from one or more audio-visual features associated with the given event, wherein the one or more audio-visual features comprise at least one of a video content and an image content that visually depicts the one or more aspects of the given event together with an audio content component, and the second set of contextual dimensions results in a second taxonomy for the one or more audio-visual features;
constructing a similarity structure from the first set of contextual dimensions and the second set of contextual dimensions, wherein the similarity structure comprises a visual and textual concept relationship network that links the first taxonomy and the second taxonomy based on relatedness between elements of the first taxonomy and the second taxonomy; and
matching one or more of the textual descriptions with one or more of the audio-visual features based on the similarity structure such that the one or more textual descriptions that match the one or more audio-visual features serve to annotate the one or more audio-visual features.
12. An apparatus, comprising:
a memory; and
a processor operatively coupled to the memory and configured to:
generate a first set of contextual dimensions from one or more textual descriptions associated with a given event, wherein the one or more textual descriptions comprise a corpus of text describing one or more aspects of the given event, and the first set of contextual dimensions results in a first taxonomy for the one or more textual descriptions;
generate a second set of contextual dimensions from one or more audio-visual features associated with the given event, wherein the one or more audio-visual features comprise at least one of a video content and an image content that visually depicts the one or more aspects of the given event together with an audio content component, and the second set of contextual dimensions results in a second taxonomy for the one or more audio-visual features;
construct a similarity structure from the first set of contextual dimensions and the second set of contextual dimensions, wherein the similarity structure comprises a visual and textual concept relationship network that links the first taxonomy and the second taxonomy based on relatedness between elements of the first taxonomy and the second taxonomy; and
match one or more of the textual descriptions with one or more of the audio-visual features based on the similarity structure such that the one or more textual descriptions that match the one or more audio-visual features serve to annotate the one or more audio-visual features. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21)
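The construct and match operations recited in the apparatus claim could equally be realized over vector-valued contextual dimensions, with relatedness measured by cosine similarity. The sketch below is only an illustration under that assumption; the taxonomy names and 3-d vectors are invented toy data, not taken from the patent.

```python
# Hypothetical vector-space variant of the similarity structure; all
# names and numbers are illustrative, not from the patent.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy contextual dimensions: each taxonomy element is a 3-d vector.
text_taxonomy = {"wedding": [1.0, 0.2, 0.0], "speech": [0.1, 1.0, 0.3]}
av_taxonomy = {"bride_visual": [0.9, 0.1, 0.1], "applause_audio": [0.0, 0.8, 0.6]}

# Similarity structure: relatedness between every cross-taxonomy pair.
network = {
    (t, v): cosine(tv, vv)
    for t, tv in text_taxonomy.items()
    for v, vv in av_taxonomy.items()
}

# Matching: annotate each audio-visual element with its closest text concept.
for v in av_taxonomy:
    best = max(text_taxonomy, key=lambda t: network[(t, v)])
    print(f"{v} -> annotated with '{best}'")
# bride_visual -> annotated with 'wedding'
# applause_audio -> annotated with 'speech'
```

The fully enumerated `network` dict plays the role of the claimed visual and textual concept relationship network; at realistic taxonomy sizes a sparse or approximate-nearest-neighbor structure would replace it.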
Specification