Multimedia computer system with story segmentation capability and operating program therefor including finite automaton video parser
Abstract
A story segment retrieval device for a multimedia computer system storing a multimedia signal including a video signal, an associated audio signal and text information as a plurality of individually retrievable story segments, each having associated therewith a finite automaton (FA) model and keywords, at least one of which is associated with each respective node of the FA model. Advantageously, the story segment retrieval device includes a device for selecting a class of FA models corresponding to a desired story segment to thereby generate a selected FA model class, a device for selecting a subclass of the selected FA model class corresponding to the desired story segment to thereby generate a selected FA model subclass, a device for generating a plurality of keywords corresponding to the desired story segment, and a device for sorting a set of the story segments corresponding to the selected FA model subclass using selected keyframes, keywords and query video clips to retrieve ones of the set of the story segments including the desired story segment. Multimedia signal parsing, video story segmentation, and video story categorization methods and corresponding systems, as well as storage media storing computer-readable instructions for performing these methods, are also described.
19 Claims
1. A multimedia signal parsing method for operating a multimedia computer system receiving a multimedia signal including a video shot sequence, an associated audio signal and corresponding text information to permit story segmentation of the multimedia signal into discrete stories, each of which has associated therewith a final finite automaton (FA) model and keywords, at least one of which is associated with a respective node of the FA model, the method comprising steps for:
(a) analyzing the video portion of the received multimedia signal to identify keyframes therein to thereby generate identified keyframes;
(b) comparing said identified keyframes within the video shot sequence with predetermined FA characteristics to identify a pattern of appearance within the video shot sequence;
(c) constructing a finite automaton (FA) model describing the appearance of the video shot sequence to thereby generate a constructed FA model;
(d) coupling neighboring video shots or similar shots with said identified keyframes when said neighboring video shots are apparently related to a story represented by said identified keyframes;
(e) extracting said keywords from said text information and storing said keywords at locations associated with each node of said constructed FA model;
(f) analyzing and segmenting the audio signal of the multimedia signal into identified speaker segments, music segments, laughter segments, and silent segments;
(g) attaching said identified speaker segments, music segments, laughter segments, and silent segments to said constructed FA model;
(h) when said constructed FA model matches a previously defined FA model, storing the identity of said constructed FA model as said final FA model along with said keywords; and
(i) when said constructed FA model does not match a previously defined FA model, generating a new FA model corresponding to said constructed FA model, storing said new FA model, and storing the identity of said new FA model as said final FA model along with said keywords.
2. The multimedia signal parsing method as recited in claim 1, wherein said step (d) comprises:
(d) coupling neighboring video shots or similar shots with said identified keyframes when said neighboring video shots are apparently related to a story represented by said identified keyframes by:
(d)(i) retrieving said text information from the multimedia signal; and
(d)(ii) performing discourse analysis of the retrieved text information so as to generate indicia used in coupling said neighboring video shots.
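As a concrete, purely illustrative reading of substeps (d)(i) and (d)(ii) above, shallow discourse cues in the retrieved text can serve as the coupling indicia. The cue lists, function names, and coupling rule below are assumptions for illustration, not the patent's disclosed method:

```python
# Speculative sketch of substeps (d)(i)-(d)(ii); cue lists are assumptions.

CONNECTIVES = {"meanwhile", "however", "also", "then"}
ANAPHORA = {"he", "she", "they", "it", "this", "that"}

def discourse_indicia(next_text):
    # (d)(ii): shallow discourse analysis of the retrieved text
    words = [w.strip(",.").lower() for w in next_text.split()]
    return {
        "opens_with_connective": bool(words) and words[0] in CONNECTIVES,
        "anaphoric_reference": any(w in ANAPHORA for w in words[:5]),
    }

def couple_shots(prev_text, next_text):
    # (d): couple neighboring shots when any discourse indicium fires,
    # i.e. the next shot's text apparently continues the previous story
    return any(discourse_indicia(next_text).values())

print(couple_shots("The mayor announced a new budget.",
                   "He also promised tax cuts."))      # → True
print(couple_shots("The mayor announced a new budget.",
                   "Stocks rallied on Wall Street."))  # → False
```

A real system would of course use far richer discourse analysis (coreference resolution, lexical chains); the point is only that text-derived indicia, not visual similarity alone, drive the coupling decision.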
3. The multimedia signal parsing method as recited in claim 1, wherein said method further comprises:
(j) when it is determined that said video shot sequence does not fit the constructed FA model, realigning said video shot sequence, wherein said step (j) is performed prior to performing said step (f).
4. The multimedia signal parsing method as recited in claim 1, further comprising steps for:
(k) determining whether it is necessary to restructure the constructed FA model to accommodate said identified speaker segments, music segments, and silent segments; and
(l) when restructuring is necessary, restructuring the constructed FA model;
wherein said steps (k) and (l) are performed prior to performing said steps (h) and (i).
5. The multimedia signal parsing method as recited in claim 1, further comprising steps for:
(m) determining whether said keywords generated in step (e) match user-selected keywords; and
(n) when a match is not detected, terminating the multimedia signal parsing method.
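The overall flow of claim 1, steps (a) through (i), can be sketched end to end. Everything below — the middle-frame keyframe heuristic, the node-list FA representation, and exact-sequence matching — is an illustrative assumption rather than the patented implementation:

```python
# Illustrative sketch only: names, data shapes, and heuristics are assumptions.

def identify_keyframes(shots):
    # (a) pick one representative frame per shot (here: the middle frame)
    return [shot[len(shot) // 2] for shot in shots]

def construct_fa_model(keyframes, shot_labels):
    # (b)/(c) describe the pattern of appearance as an ordered node sequence
    return {"nodes": list(shot_labels), "keyframes": keyframes, "keywords": {}}

def attach_keywords(model, text_by_node):
    # (e) store keywords at locations associated with each node
    for node in model["nodes"]:
        model["keywords"][node] = text_by_node.get(node, [])
    return model

def finalize_model(model, known_models):
    # (h)/(i): reuse a previously defined FA model when the node structure
    # matches; otherwise register the constructed model under a new identity.
    for name, known in known_models.items():
        if known["nodes"] == model["nodes"]:
            return name                       # (h) matched: reuse identity
    new_name = f"FA-{len(known_models) + 1}"  # (i) no match: new model
    known_models[new_name] = model
    return new_name

shots = [["f1", "f2", "f3"], ["g1", "g2"]]  # two video shots
model = construct_fa_model(identify_keyframes(shots), ["anchor", "report"])
model = attach_keywords(model, {"anchor": ["news"], "report": ["storm"]})
library = {}
print(finalize_model(model, library))  # registers a new model (step (i))
print(finalize_model(model, library))  # now matches the stored model (step (h))
```

Steps (d), (f), and (g) — shot coupling and audio attachment — are elided here; they are sketched separately alongside the dependent claims.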
6. A combination receiving a multimedia signal including a video shot sequence, an audio signal and text information for parsing the multimedia signal into one of a plurality of story program categories, each of the program categories having an associated finite automaton (FA) model and keywords, at least one of which keywords is associated with a respective node of the FA model, comprising:
first means for analyzing the video portion of the received multimedia signal to identify keyframes therein to thereby generate identified keyframes;
second means for comparing said identified keyframes within the video shot sequence with predetermined FA characteristics to identify a pattern of appearance within the video shot sequence;
third means for constructing a finite automaton (FA) model describing the appearance of the video shot sequence to thereby generate a constructed FA model;
fourth means for coupling neighboring video shots or similar shots with said identified keyframes when said neighboring video shots are apparently related to a story represented by said identified keyframes;
fifth means for extracting said keywords from said text information and storing said keywords at locations associated with each node of said constructed FA model;
sixth means for analyzing and segmenting the audio signal in the multimedia signal into identified speaker segments, music segments, and silent segments;
seventh means for attaching said identified speaker segments, music segments, and silent segments to said constructed FA model;
eighth means for storing the identity of said constructed FA model as said final FA model along with said keywords when said constructed FA model matches a previously defined FA model; and
ninth means for generating a new FA model corresponding to said constructed FA model, for storing said new FA model, and for storing the identity of said new FA model as said final FA model along with said keywords when said constructed FA model does not match a previously defined FA model.
7. The combination as recited in claim 6, wherein said fourth means comprises:
fourth means for coupling neighboring video shots or similar shots with said identified keyframes when said neighboring video shots are apparently related to a story represented by said identified keyframes by employing tenth means for retrieving said text information from the multimedia signal; and
eleventh means for performing discourse analysis of the retrieved text information so as to generate indicia used in coupling said neighboring video shots.
8. The combination as recited in claim 6, further comprising:
twelfth means for, when it is determined that said video shot sequence does not fit the constructed FA model, realigning said video shot sequence, wherein said twelfth means is operatively coupled between said fifth means and said sixth means.
9. The combination as recited in claim 6, further comprising:
fourteenth means for determining whether it is necessary to restructure the constructed FA model to accommodate said identified speaker segments, music segments, and silent segments; and
fifteenth means for, when restructuring is necessary, restructuring the constructed FA model;
wherein said fourteenth and fifteenth means are serially coupled to one another and operatively coupled between said eighth and ninth means.
10. The combination as recited in claim 6, further comprising:
sixteenth means for determining whether said keywords generated by said fifth means match user-selected keywords; and
seventeenth means for, when a match is not detected, terminating operation of the combination.
11. The combination as recited in claim 6, further comprising:
eighteenth means for extracting a plurality of keywords from an input first sentence;
nineteenth means for categorizing said first sentence into one of a plurality of video story categories;
twentieth means for determining whether a current video shot belongs to a previous video story category, a current video story category or a new video story category of said plurality of video story categories responsive to similarity between said first sentence and an immediately preceding sentence; and
twenty-first means for operating said eighteenth through twentieth means seriatim until all video clips and respective sentences are assigned to one of said categories, wherein said eighteenth through twentieth means are serially coupled to both said eighth means and said ninth means, and wherein said eighteenth through twenty-first means are operative when said identified FA model corresponds to a predetermined one of the program categories.
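The sixth- and seventh-means functions of claim 6 — segmenting the audio signal into speaker, music, and silent segments and attaching them to the FA model — might look roughly like the toy sketch below. The energy and zero-crossing thresholds are assumptions for illustration, not the disclosed classifier:

```python
# Toy sketch; energy/zero-crossing thresholds are illustrative assumptions.

def classify_window(samples, silence_thresh=0.05):
    # sixth means: label one audio window as speaker / music / silent
    energy = sum(s * s for s in samples) / len(samples)
    if energy < silence_thresh:
        return "silent"
    # toy proxy: frequent zero-crossings -> speech-like, otherwise music-like
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return "speaker" if crossings > len(samples) // 4 else "music"

def attach_audio(model_nodes, windows):
    # seventh means: attach one segment label per FA node, in order,
    # padding with "silent" when the audio is shorter than the node list
    labels = [classify_window(w) for w in windows]
    labels += ["silent"] * (len(model_nodes) - len(labels))
    return dict(zip(model_nodes, labels))

windows = [[0.0, 0.01, -0.01], [0.9, -0.9, 0.9], [0.5, 0.5, 0.5]]
print(attach_audio(["intro", "anchor", "music-bed"], windows))
```

The attached labels are what the eighth/ninth means would then compare against previously defined FA models (and, per claims 9 and 14-15 means, what may trigger restructuring of the constructed model).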
12. A video story parsing method employed in the operation of a multimedia computer system receiving a multimedia signal including a video shot sequence, an associated audio signal and corresponding text information, to permit a multimedia signal, parsed into a predetermined category having an associated finite automaton (FA) model and keywords, at least one of the keywords being associated with a respective node of the FA model, to be parsed into a number of discrete video stories, the method comprising steps for:
(a) extracting a plurality of keywords from an input first sentence;
(b) categorizing said first sentence into one of a plurality of categories;
(c) determining whether a current video shot belongs to a previous category, a current category or a new category of said plurality of categories responsive to similarity between said first sentence and an immediately preceding sentence; and
(d) repeating steps (a) through (c) until all video clips and respective sentences are assigned to one of said categories.
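Steps (a) through (d) above describe a sentence-driven segmentation loop, which can be sketched as follows. The tokenizer, stopword list, and overlap threshold are assumptions for illustration, not the claimed similarity measure:

```python
# Sketch of the claim-12 loop; tokenizer, stopwords, and threshold are
# illustrative assumptions.

STOPWORDS = {"the", "a", "of", "in", "is"}

def extract_keywords(sentence):
    # (a) crude keyword extraction: lowercase tokens minus stopwords
    return {w for w in sentence.lower().split() if w not in STOPWORDS}

def segment_stories(sentences, threshold=0.2):
    stories, prev_kw = [], set()
    for sent in sentences:
        kw = extract_keywords(sent)                    # step (a)
        overlap = len(kw & prev_kw) / max(len(kw | prev_kw), 1)
        if stories and overlap >= threshold:           # (b)/(c): similarity to
            stories[-1].append(sent)                   # the preceding sentence
        else:
            stories.append([sent])                     # new story category
        prev_kw = kw                                   # (d): repeat until done
    return stories

sents = ["Storm hits the coast",
         "The storm damage is severe",
         "Stocks rally in early trading"]
print(len(segment_stories(sents)))  # → 2 discrete stories
```

Claim 13 refines step (b) with a specific similarity measure Mki; the toy Jaccard-style overlap here merely stands in for it.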
13. The video story parsing method as recited in claim 12, wherein said step (b) comprises:
(b) categorizing said first sentence into one of a plurality of categories by determining a measure Mki of the similarity between the keywords extracted during step (a) and a keyword set for an ith story category Ci according to the expression set:
Mki = [first expression, not reproduced in the source text] if Memi ≠ 0,
Mki = [second expression, not reproduced in the source text] if Memi = 0,
where MK denotes a number of matched words out of a total number Nkeywords of keywords in the respective keyword set for a characteristic sentence in said category Ci, where Memi is indicative of a measure of similarity with respect to the previous sentence sequence within category Ci, and wherein 0 ≤ Mki < 1.
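The claim recites the similarity measure Mki only by its branch conditions and the constraint 0 ≤ Mki < 1; the expressions themselves appeared as figures and are not reproduced in the text. The blend below is therefore a hypothetical reconstruction consistent with the stated quantities, not the patented formula:

```python
# Hypothetical reconstruction only: the patent's actual expressions are not
# reproduced in the claim text. This blend merely respects the stated branches
# (Memi != 0 vs Memi = 0) and the constraint 0 <= Mki < 1.

def similarity(mk, n_keywords, mem_i, alpha=0.5):
    """Assumed Mki: blend the keyword-match ratio MK/Nkeywords with the
    category-memory term Memi when the latter is nonzero."""
    ratio = mk / n_keywords if n_keywords else 0.0
    if mem_i != 0:                 # branch "if Memi != 0"
        m = alpha * ratio + (1 - alpha) * mem_i
    else:                          # branch "if Memi = 0"
        m = ratio
    return min(m, 0.999)           # enforce 0 <= Mki < 1

print(similarity(mk=2, n_keywords=4, mem_i=0.0))  # → 0.5
print(similarity(mk=2, n_keywords=4, mem_i=0.8))
```

The blending weight `alpha` and the clamp at 0.999 are pure placeholders; only the branch structure and the range constraint come from the claim.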
14. A method for operating a multimedia computer system receiving a multimedia signal including a video shot sequence, an associated audio signal and corresponding text information to thereby generate a video story database including a plurality of discrete stories searchable by one of a finite automaton (FA) model having associated keywords, at least one of which keywords is associated with a respective node of the FA model, and user-selected similarity criteria, the method comprising steps for:
(a) analyzing the video portion of the received multimedia signal to identify keyframes therein to thereby generate identified keyframes;
(b) comparing said identified keyframes within the video shot sequence with predetermined FA characteristics to identify a pattern of appearance within the video shot sequence;
(c) constructing a finite automaton (FA) model describing the appearance of the video shot sequence to thereby generate a constructed FA model;
(d) coupling neighboring video shots or similar shots with said identified keyframes when said neighboring video shots are apparently related to a story represented by said identified keyframes;
(e) extracting said keywords from said text information and storing said keywords at locations associated with each node of said constructed FA model;
(f) analyzing and segmenting the audio signal of the multimedia signal into identified speaker segments, music segments, laughter segments, and silent segments;
(g) attaching said identified speaker segments, music segments, laughter segments, and silent segments to said constructed FA model;
(h) when said constructed FA model matches a previously defined FA model, storing the identity of said constructed FA model as said final FA model along with said keywords;
(i) when said constructed FA model does not match a previously defined FA model, generating a new FA model corresponding to said constructed FA model, storing said new FA model, and storing the identity of said new FA model as said final FA model along with said keywords;
(j) when said final FA model corresponds to a predetermined program category, performing video story segmentation according to the substeps of:
(j)(i) extracting a plurality of keywords from an input first sentence;
(j)(ii) categorizing said first sentence into one of a plurality of video story categories;
(j)(iii) determining whether a current video shot belongs to a previous video story category, a current video story category or a new video story category of said plurality of video story categories responsive to similarity between said first sentence and an immediately preceding sentence; and
(j)(iv) repeating steps (j)(i) through (j)(iii) until all video clips and respective sentences are assigned to one of said video story categories.
16. The method as recited in claim 14, wherein said step (d) comprises:
(d) coupling neighboring video shots or similar shots with said identified keyframes when said neighboring video shots are apparently related to a story represented by said identified keyframes by:
(d)(i) retrieving said text information from the multimedia signal; and
(d)(ii) performing discourse analysis of the retrieved text information so as to generate indicia used in coupling said neighboring video shots.
17. The method as recited in claim 14, wherein said method further comprises:
(k) when it is determined that said video shot sequence does not fit the constructed FA model, realigning said video shot sequence, wherein said step (k) is performed prior to performing said step (f).
18. The multimedia signal parsing method as recited in claim 14, further comprising steps for:
(l) determining whether it is necessary to restructure the constructed FA model to accommodate said identified speaker segments, music segments, and silent segments; and
(m) when restructuring is necessary, restructuring the constructed FA model;
wherein said steps (l) and (m) are performed prior to performing said steps (h) and (i).
19. The method as recited in claim 14, further comprising steps for:
(n) determining whether said keywords generated in step (e) match user-selected keywords; and
(o) when a match is not detected, terminating the multimedia signal parsing method.
15. The method as recited in claim 14, wherein said substep (j)(ii) further comprises:
(j)(ii) categorizing said first sentence into one of a plurality of sentence categories by determining a measure Mki of the similarity between the keywords extracted during step (j)(i) and a keyword set for an ith video story category Ci according to the expression set:
Mki = [first expression, not reproduced in the source text] if Memi ≠ 0,
Mki = [second expression, not reproduced in the source text] if Memi = 0,
where MK denotes a number of matched words out of a total number Nkeywords of keywords in the respective keyword set for a characteristic sentence in said category Ci, where Memi is indicative of a measure of similarity with respect to the previous sentence sequence within category Ci, and wherein 0 ≤ Mki < 1.