Automated Video-To-Text System

US 20070273696A1
Filed: 04/03/2007
Published: 11/29/2007
Est. Priority Date: 04/19/2006
Status: Active Grant

Abstract

A method for converting video to text is disclosed that automatically generates text descriptions of the content of a video. The present invention first segments an input video sequence according to predefined semantic classes using a Mixture-of-Experts blob segmentation algorithm. The resulting segmentation is coerced into a semantic concept graph based on domain knowledge and a semantic concept hierarchy. Then, the initial semantic concept graph is summarized and pruned. Finally, according to the summarized semantic concept graph and its changes over time, text and/or speech descriptions are automatically generated using one of three description schemes: key-frame, key-object, and key-change descriptions.
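
As a rough illustration of the summarization step, the sketch below collapses sibling class labels into their shared parent in a concept hierarchy. The hierarchy contents, the `PARENT` table, and the `summarize_once` function are hypothetical names chosen for this example; the abstract does not specify them, and the rule "merge when at least two distinct labels share a parent" is an assumption.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: child label -> parent label.
PARENT = {"car": "vehicle", "truck": "vehicle", "oak": "tree", "pine": "tree"}


def summarize_once(labels, parent=PARENT):
    """Replace labels by their parent when at least two distinct labels share it."""
    siblings = defaultdict(set)
    for label in labels:
        if label in parent:
            siblings[parent[label]].add(label)

    summary = []
    for label in labels:
        p = parent.get(label)
        merged = p if p is not None and len(siblings[p]) >= 2 else label
        if merged not in summary:  # keep the first occurrence, preserve order
            summary.append(merged)
    return summary


print(summarize_once(["car", "truck", "road", "oak"]))
# prints ['vehicle', 'road', 'oak']
```

Here "oak" is left alone because it has no sibling in the scene, so replacing it with "tree" would lose detail without shortening the description.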

24 Claims

1. A method for converting video to text, comprising the steps of:
- receiving at least one frame of video;
  
  partitioning the at least one frame of a video into a plurality of blobs;
  
  providing a semantic class label for each blob;
  
  constructing a graph from a plurality of the semantic class labels representing blobs at the vertices and a plurality of edges representing the spatial interactions between blobs; and
  
  traversing the graph to generate text associated with the video.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20)
- - 2. The method of claim 1, further comprising the step of processing the text description through a text-to-speech synthesizer to generate a speech version of the at least one frame of video.
  - 3. The method of claim 1, wherein said step of partitioning the at least one frame of video into a plurality of blobs further comprises the steps of:
    - (a) applying a trainable sequential maximal a priori estimator (TSMAP) to the at least one frame of video to generate a supervised segmentation map which contains semantic class labels of all of the pixels of the original image;
      
      (b) applying the mean shift segmentation algorithm to the at least one frame of video to generate an unsupervised segmentation map of regions that partition an image into non-overlapping regions;
      
      (c) detecting and segmenting moving objects to produce a moving object detection mask, the mask representing moving and stationary blobs; and
      
      (d) combining the maps generated in steps (a)-(c) to generate a plurality of semantic class labels representing the blobs contained in the at least one frame of video.
  - 4. The method of claim 3, wherein said step of combining the maps generated in steps (a)-(c) comprises the steps of:
    - (e) computing a majority semantic class label from the supervised segmentation map for each blob in the unsupervised segmentation map;
      
      (f) assigning the majority semantic class label computed in step (e) to all pixels of a blob in the unsupervised segmentation map;
      
      (g) computing a majority semantic class label from the moving object detection mask for each blob in the unsupervised segmentation map;
      
      (h) assigning the majority semantic class label computed in step (g) to all pixels of a blob in the unsupervised segmentation map; and
      
      (i) combining the majority semantic class labels assigned in steps (f) and (h) into a plurality of semantic class labels representing the blobs contained in the at least one frame of video using reasoning based on the semantic meanings of the class labels.
  - 5. The method of claim 4, wherein said step of combining the majority semantic class labels of step (i) comprises the steps of:
    - (j) modifying the semantic class label from step (f) using the behavior, the behavior being one of moving and stationary, computed from step (h) when the semantic class labels from steps (f) and (h) are consistent;
      
      otherwise (k) choosing the semantic class label of step (e) when the confidence of the semantic class label computed is high and the confidence of the moving object detection step (g) is low;
      
      otherwise (l) conducting a search for the most likely semantic class label that is consistent with the results of the moving object detection step (g).
  - 6. The method of claim 1, further comprising the steps of:
    - assigning each of the semantic class labels to a corresponding node of the graph;
      
      computing a count of the 4-connections between nodes; and
      
      forming an edge between two nodes when the count is above a predetermined number of pixels.
  - 7. The method of claim 6, wherein the predetermined amount is about 10 pixels.
  - 8. The method of claim 1, wherein the step of constructing a graph further comprises the steps:
    - placing the semantic class labels into a class-dependent tree; and
      
      summarizing at least two semantic class labels into a parent node.
  - 9. The method of claim 6, further comprising the steps of:
    - defining a summarization loss for each summarization of a semantic class label to its parent;
      
      computing a summarization cost for every summarization based both on the reduction of information given by the class dependent tree and a description cost of a scene description in number of words;
      
      computing a final cost value by combining the summarization cost and the description cost using a weight that is a function of a verbose level; and
      
      summarizing the at least two semantic class labels to a parent node when the verbose level is low, the weight for the scene description cost is high, and the weight for summarization loss is low.
  - 10. The method of claim 6, further comprising at least one of the following steps of:
    - deleting a node that is most likely to be a misclassification by using knowledge of a ground sampling distance and an expected size and shape of a blob;
      
      eliminating shadows by interconnecting all of the shadow's neighbors and deleting the node representing the shadow;
      
      deleting nodes that have no connections;
      
      merging connected road types into a single field node;
      
      replacing a fully-connected sub-graph of a plurality of tree nodes with a forest node;
      
      reclassifying a vehicle as a moving vehicle if its mean speed exceeds a predetermined threshold;
      
      replacing a group of three or more moving-vehicles connected by “following” edges by a convoy; and
      
      replacing a fully-connected sub-graph of a plurality of “X” nodes by a “group-of-X” node.
  - 11. The method of claim 1, wherein the step of traversing the graph further includes the step of employing one of a key frame description procedure, a key object description procedure, and an event/change description procedure.
  - 12. The method of claim 11, wherein employing a key frame description procedure comprises the steps of:
    - (a) initializing each node of the graph with a measure of importance, the measure of importance being based on the semantic class type of the node totaled with the importance of the children of the node and that of its nearest neighbors;
      
      (b) choosing a seed node, the seed node being the node with the highest cumulative importance;
      
      (c) describing the seed node;
      
      (d) describing the seed node's neighbors based on one of a depth first and breadth first fashion as dictated by the connectivity of the seed node, the neighbor nodes being described in a depth first fashion when the seed node has fewer than five neighbors itself, and the neighbor nodes being described in a breadth first fashion when the seed node has at least five neighbors; and
      
      (e) once every node that can be reached from the seed node has been described, choosing a new seed node and repeating steps (a)-(e) until all nodes have been visited.
  - 13. The method of claim 11, wherein employing a key object description procedure comprises the steps of:
    - (a) initializing each node with a measure of importance, the measure of importance being based on a domain specific ontology and on an operator behavior model;
      
      (b) selecting the most important nodes as key nodes;
      
      (c) selecting one of the key nodes as a seed node;
      
      (d) describing the seed node;
      
      (e) describing the seed node's neighbors based on one of a depth first and breadth first fashion as dictated by the connectivity of the seed node, the neighbor nodes being described in a depth first fashion when the seed node has fewer than five neighbors itself, and the neighbor nodes being described in a breadth first fashion when the seed node has at least five neighbors; and
      
      (f) once every node that can be reached from the seed node has been described, choosing a new seed node and repeating steps (c)-(e) until all nodes have been visited.
  - 14. The method of claim 11, wherein employing an event/change description procedure comprises the steps of:
    - (a) initializing each node and edge representing an event of the graph with a measure of importance, an event being the appearance or disappearance of a node or edge, a high level of importance being an event occurring a predetermined number of times above a predetermined threshold defined by an operator of a video equipment; and
      
      (b) reading out important nodes and edges.
  - 15. The method of claim 1, further including the step of detecting an event.
  - 16. The method of claim 15, wherein the step of detecting an event further comprises the steps of:
    - detecting node changes including creation of a new node, deletion of an existing node, and changes in properties of an existing node;
      
      verifying that the node change is valid; and
      
      creating an alert if the change is valid based on the importance of the node, wherein a node is considered important when it is tracked by an operator of video equipment a predetermined number of times within a predetermined time interval.
  - 17. The method of claim 16, wherein the step of detecting an event further comprises the steps of:
    - detecting a link change, including creation of a new link, deletion of an existing link, and changes in properties of an existing link;
      
      verifying that the link change is valid; and
      
      creating an alert if the change is valid based on the importance of the link, wherein a link is considered important when it is tracked by an operator of video equipment a predetermined number of times within a predetermined time interval.
  - 18. The method of claim 16, wherein the step of detecting an event further comprises the steps of:
    - detecting a predefined sub-graph of the graph for known events and threats.
  - 20. The apparatus of claim 18, further comprising a text-to-speech synthesizer for generating a speech version of said at least one frame of video.
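
The edge-forming rule of claims 6 and 7 and the key frame traversal of claim 12 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the per-class `importance` table, and the sentence templates are hypothetical, the input is assumed to be a 2-D grid of blob identifiers, and the children term of the importance measure is omitted because the sketch uses a flat graph.

```python
from collections import Counter, deque


def build_blob_graph(blob_ids, labels, min_contacts=10):
    """Link two blobs when their count of 4-connected pixel contacts exceeds min_contacts.

    blob_ids: 2-D list of blob identifiers, one per pixel (assumed input format).
    labels:   dict mapping blob identifier -> semantic class label.
    """
    contacts = Counter()
    rows, cols = len(blob_ids), len(blob_ids[0])
    for r in range(rows):
        for c in range(cols):
            a = blob_ids[r][c]
            # Look only right and down so each 4-connected pixel pair is counted once.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols and blob_ids[rr][cc] != a:
                    contacts[frozenset((a, blob_ids[rr][cc]))] += 1

    graph = {blob: set() for blob in labels}
    for pair, count in contacts.items():
        if count > min_contacts:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def describe_key_frame(graph, labels, importance):
    """Describe every node, starting from the most important unvisited seed."""
    # Cumulative importance: the node's own class weight plus its neighbors' weights.
    score = {
        n: importance[labels[n]] + sum(importance[labels[m]] for m in graph[n])
        for n in graph
    }
    visited, sentences = set(), []
    while len(visited) < len(graph):
        seed = max((n for n in graph if n not in visited), key=lambda n: score[n])
        depth_first = len(graph[seed]) < 5  # fewer than five neighbors: depth first
        frontier = deque([(seed, None)])
        while frontier:
            node, parent = frontier.pop() if depth_first else frontier.popleft()
            if node in visited:
                continue
            visited.add(node)
            if parent is None:
                sentences.append(f"There is a {labels[node]}.")
            else:
                sentences.append(f"The {labels[node]} is next to the {labels[parent]}.")
            for nxt in sorted(graph[node]):
                if nxt not in visited:
                    frontier.append((nxt, node))
    return sentences
```

A single `deque` serves both traversal orders: popping from the right gives depth first behavior, popping from the left gives breadth first behavior. Isolated nodes are still described, because the outer loop keeps choosing new seeds until every node has been visited.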

19. An apparatus for converting a video to a text description, comprising:
- a receiver for receiving at least one frame of video;
  
  a segmenter module for partitioning said at least one frame of video into a plurality of blobs and for providing a semantic class label for each of said blobs;
  
  a tracking module for providing a global identifier for each of said blobs;
  
  a summarization module for constructing a graph from a plurality of semantic class labels representing said blobs at the vertices and a plurality of edges representing the spatial interactions between said blobs; and
  
  a description generation module for traversing the graph to generate text associated with the video.
- View Dependent Claims (21, 22)
- - 21. The apparatus of claim 19, further comprising a domain knowledgebase/ontology for capturing constraints, context, and common sense knowledge in an application domain.
  - 22. The apparatus of claim 21, further comprising a reasoning engine for enforcing contextual constraints and for cross-validating the results computed from said mixture-of-experts segmenter module, said dense blob tracking module, said blob modeling summarization module, and said blob modeling summarization module in the context of said application domain.

23. A computer-readable medium carrying one or more sequences of instructions for converting video to text, wherein execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of:
- receiving at least one frame of video;
  
  partitioning the at least one frame of a video into a plurality of blobs;
  
  providing a semantic class label for each blob;
  
  constructing a graph from a plurality of the semantic class labels representing blobs at the vertices and a plurality of edges representing the spatial interactions between blobs; and
  
  traversing the graph to generate text associated with the video.
- View Dependent Claims (24)
- - 24. The computer-readable medium of claim 23, further comprising the step of processing the text description through a text-to-speech synthesizer to generate a speech version of the at least one frame of video.
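
The node and link change detection recited in claims 16 and 17 amounts to comparing the graphs of two consecutive frames. The sketch below shows one possible way to do that. Its data layout (nodes as a dict of property dicts, links as a set of sorted id pairs), the function names `diff_graphs` and `alerts`, and the `min_tracks` parameter are all assumptions made for illustration. The validity check and the time interval mentioned in the claims are left out.

```python
def diff_graphs(prev_nodes, prev_edges, curr_nodes, curr_edges):
    """List node and link changes between two frames' graphs.

    *_nodes: dict mapping node id -> dict of properties (assumed layout).
    *_edges: set of (a, b) tuples with a < b (assumed layout).
    """
    events = []
    for n in sorted(curr_nodes.keys() - prev_nodes.keys()):
        events.append(("node_created", n))
    for n in sorted(prev_nodes.keys() - curr_nodes.keys()):
        events.append(("node_deleted", n))
    for n in sorted(prev_nodes.keys() & curr_nodes.keys()):
        if prev_nodes[n] != curr_nodes[n]:
            events.append(("node_changed", n))
    for e in sorted(curr_edges - prev_edges):
        events.append(("link_created", e))
    for e in sorted(prev_edges - curr_edges):
        events.append(("link_deleted", e))
    return events


def alerts(events, track_counts, min_tracks=3):
    """Keep events whose node or link the operator tracked at least min_tracks times."""
    return [ev for ev in events if track_counts.get(ev[1], 0) >= min_tracks]
```

For example, if a vehicle node changes from stationary to moving between frames, `diff_graphs` reports a `node_changed` event for it, and `alerts` passes that event through only when the operator has tracked the node often enough.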

Current Assignee
SRI International, Inc.
Original Assignee
Sarnoff Corporation (SRI International, Inc.)
Inventors
Cheng, Hui; Butler, Darren

Granted Patent

US 7,835,578 B2
Field of Search
US Class Current

345/467
CPC Class Codes

G06T 2207/10016   Video; Image sequence

G06T 2207/20072   Graph-based image processing

G06T 7/00   Image analysis

G06V 10/426   Graphical representations

G06V 20/52   Surveillance or monitoring ...