Temporally consistent caption detection on videos using a 3D spatiotemporal method
Abstract
A caption detection system wherein all detected caption boxes over time for one caption area are identical, thereby reducing temporal instability and inconsistency. This is achieved by grouping candidate pixels in the 3D spatiotemporal space and generating a 3D bounding box for one caption area. 2D bounding boxes are obtained by slicing the 3D bounding boxes, thereby reducing temporal instability as all 2D bounding boxes corresponding to a caption area are sliced from one 3D bounding box and are therefore identical over time.
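As a rough illustration of the slicing idea described in the abstract (not the patented implementation), the sketch below groups hypothetical candidate pixels in (x, y, t) space into one 3D bounding box and slices it into per-frame 2D boxes that are identical by construction. The single min/max grouping and all function names are assumptions for illustration only.

```python
# Illustrative sketch: one 3D bounding box over (x, y, t) candidate
# pixels, sliced into identical 2D boxes per frame. The simple min/max
# grouping is an assumption, not the patented grouping method.

def bounding_box_3d(pixels):
    """pixels: iterable of (x, y, t) candidate caption pixels."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    ts = [p[2] for p in pixels]
    return (min(xs), min(ys), min(ts), max(xs), max(ys), max(ts))

def slice_2d_boxes(box3d):
    """Slice a 3D box into one identical 2D box per frame index t."""
    x0, y0, t0, x1, y1, t1 = box3d
    return {t: (x0, y0, x1, y1) for t in range(t0, t1 + 1)}

# Candidate pixels for one caption spanning frames 3..5 (made-up data).
pixels = [(10, 40, 3), (90, 52, 3), (12, 41, 4), (88, 50, 5)]
boxes = slice_2d_boxes(bounding_box_3d(pixels))
# Every frame gets the same 2D box, so there is no frame-to-frame jitter.
assert boxes[3] == boxes[4] == boxes[5] == (10, 40, 90, 52)
```

Because every per-frame box is cut from the same 3D box, the detected caption box cannot jitter between frames, which is the temporal-consistency property the abstract claims.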
14 Citations
19 Claims
1. An apparatus comprising:

an input for receiving a video signal comprising a first image and a second image, wherein said second image is received after said first image;

a processor operative to determine a first probable location of a first caption within said first image, analyze said first image to identify a first region of said first image comprising said probable location of said first caption, determine a second probable location of a second caption within said second image, analyze said second image to identify a second region of said second image comprising said probable location of said second caption, determine a spatial overlap between said first region and said second region, and generate data representing said spatial overlap; and

an output for coupling said data to a video processor.

Dependent claims: 2, 3, 4, 5, 6
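One plausible reading of claim 1's "determine a spatial overlap" step, sketched here purely for illustration, is the intersection of two axis-aligned caption bounding boxes. The region coordinates and the function name are assumptions, not terms from the patent.

```python
# Hypothetical sketch of the spatial-overlap step: intersect two
# axis-aligned caption regions given as (x0, y0, x1, y1) boxes.

def spatial_overlap(a, b):
    """Return the overlap box of regions a and b, or None if disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)

first_region = (10, 200, 300, 240)   # caption region in the first image
second_region = (12, 198, 305, 238)  # caption region in the second image
assert spatial_overlap(first_region, second_region) == (12, 200, 300, 238)
```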
7. The apparatus of claim 1 wherein said processor generates a plurality of spatial overlap representations, wherein each of said plurality of spatial overlap representations is compared to a different threshold, and the combination of said comparisons is used to indicate a high probability of a time continuous caption being located in said spatial overlap.
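Claim 7 combines comparisons of several overlap representations against different thresholds. A minimal sketch of that idea, with metrics and threshold values that are entirely assumed (the claim does not specify them), might require the overlap to cover most of both regions:

```python
# Illustrative reading of claim 7: compare several overlap
# representations against per-representation thresholds and combine
# the results. Metrics and threshold values are assumptions.

def overlap_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def is_time_continuous(first, second, overlap, thresholds=(0.8, 0.8)):
    """Require the overlap to cover most of BOTH regions."""
    cover_first = overlap_area(overlap) / overlap_area(first)
    cover_second = overlap_area(overlap) / overlap_area(second)
    # Each representation has its own threshold; both must pass.
    return cover_first >= thresholds[0] and cover_second >= thresholds[1]

first = (0, 0, 100, 20)
second = (2, 0, 102, 20)
overlap = (2, 0, 100, 20)
assert is_time_continuous(first, second, overlap)  # 98% coverage of each
```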
8. The apparatus of claim 1 wherein said data is a bounding box representing said spatial overlap.
9. The apparatus of claim 1 wherein each of said first region and said second region is represented as a bounding box and said data represents the spatial overlap of said bounding boxes.
10. A method for processing a video signal comprising the steps of:
receiving a first image in said video signal;
determining a first probable location of a first caption within said first image;
analyzing said first image to identify a first region of said first image comprising said probable location of said first caption;
receiving a second image in said video signal, wherein said second image is received after said first image;
determining a second probable location of a second caption within said second image;
analyzing said second image to identify a second region of said second image comprising said probable location of said second caption;
determining a spatial overlap between said first region and said second region; and
generating data representing said spatial overlap.
11. The method of claim 10 wherein said data is stored in a memory and updated with the determination results of each subsequent image received via said video signal, the data representing a two dimensional analysis of the spatial overlap and a temporal representation of the spatial overlap.
12. The method of claim 11 wherein said data is coupled to a video processor when said temporal representation exceeds a threshold.
13. The method of claim 12 wherein exceeding said threshold indicates a high probability of a time continuous caption being located in said spatial overlap.
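Claims 11 through 13 describe keeping the overlap data in memory, updating it with each received image, and flagging a time continuous caption once a temporal threshold is exceeded. The sketch below is one assumed realization: a tracker that intersects the stored overlap with each new region and counts how many consecutive frames the overlap persists. The class name, the frame-count threshold, and the restart-on-loss rule are all assumptions.

```python
# Assumed sketch of claims 11-13: keep the running spatial overlap in
# memory (the 2D analysis), count the frames it persists (the temporal
# representation), and flag a caption once an arbitrary frame-count
# threshold is exceeded.

class CaptionTracker:
    def __init__(self, min_frames=5):
        self.overlap = None     # running spatial overlap (2D analysis)
        self.frames = 0         # temporal representation
        self.min_frames = min_frames

    def update(self, region):
        """Update with the caption region of the next image."""
        if self.overlap is None:
            self.overlap, self.frames = region, 1
        else:
            x0 = max(self.overlap[0], region[0])
            y0 = max(self.overlap[1], region[1])
            x1 = min(self.overlap[2], region[2])
            y1 = min(self.overlap[3], region[3])
            if x0 >= x1 or y0 >= y1:   # overlap lost: restart the count
                self.overlap, self.frames = region, 1
            else:
                self.overlap, self.frames = (x0, y0, x1, y1), self.frames + 1
        # High probability of a time continuous caption once persistent.
        return self.frames > self.min_frames

tracker = CaptionTracker(min_frames=3)
flags = [tracker.update((10, 200, 300, 240)) for _ in range(5)]
assert flags == [False, False, False, True, True]
```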
14. The method of claim 10 wherein a plurality of spatial overlap representations are generated, wherein each of said plurality of spatial overlap representations is compared to a different threshold, and the combination of said comparisons is used to indicate a high probability of a time continuous caption being located in said spatial overlap.
15. The method of claim 10 wherein said data is a bounding box representing said spatial overlap.
16. The method of claim 10 wherein each of said first region and said second region is represented as a bounding box and said data represents the spatial overlap of said bounding boxes.
17. The method of claim 10 further comprising the step of verifying the content of said spatial overlap using a projection profile of a text block image and a plurality of features extracted based on local minima of said projection profile.
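A projection profile of a text block, as in claim 17, is typically a column-wise (or row-wise) ink sum, whose local minima mark gaps between characters. The toy sketch below illustrates this; the binary test image and the choice of features are fabricated for illustration and are not taken from the patent.

```python
# Hypothetical sketch of claim 17: verify a candidate text block via
# its column-wise projection profile and features derived from local
# minima (inter-character gaps). Feature choices are assumed.

def projection_profile(block):
    """Column-wise ink sum of a binary text-block image (list of rows)."""
    return [sum(col) for col in zip(*block)]

def local_minima(profile):
    return [i for i in range(1, len(profile) - 1)
            if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]]

# Binary block: two "strokes" separated by a low-ink column (made up).
block = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
profile = projection_profile(block)
minima = local_minima(profile)
# A text-like block shows pronounced minima at character gaps.
assert profile == [3, 3, 1, 3, 3] and minima == [2]
```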
18. The method of claim 10 further comprising the step of verifying the content of said spatial overlap using a machine-learning-based classifier to classify a text block image as a text image or a non-text image.
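Claim 18 leaves the classifier unspecified, so any text/non-text model would fit. As a stand-in only, the toy sketch below uses a 1-nearest-neighbour rule over two made-up features (profile variance and minima count); the features, training data, and labels are all fabricated for illustration.

```python
# Toy stand-in for claim 18's classifier: 1-nearest-neighbour over
# two assumed features (profile variance, minima count). Not the
# patented classifier; any text/non-text model would fit the claim.

def classify(features, training):
    """training: list of (features, label); return the nearest label."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda ex: dist(ex[0], features))[1]

training = [
    ((8.0, 4), "text"),       # high profile variance, several gaps
    ((0.5, 0), "non-text"),   # flat profile, no gaps
]
assert classify((7.0, 3), training) == "text"
assert classify((0.4, 1), training) == "non-text"
```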
19. The method of claim 10 wherein said spatial overlap is represented as a plurality of grey scale values, wherein each grey scale value indicates a probability of a pixel within said spatial overlap being a part of a time continuous caption within said spatial overlap.
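One assumed way to produce the grey scale representation of claim 19 is to accumulate per-pixel detection counts over frames and map each count to 0..255, read as the probability that the pixel belongs to a persistent caption. The counts and the linear mapping below are illustrative assumptions.

```python
# Illustrative take on claim 19: map per-pixel detection counts,
# accumulated over frames, to grey scale values in 0..255. The linear
# count-to-grey mapping is an assumption.

def grey_scale_map(counts, total_frames):
    """counts: 2D list of per-pixel detection counts over the frames."""
    return [[round(255 * c / total_frames) for c in row] for row in counts]

counts = [[5, 5, 1],
          [5, 4, 0]]
gmap = grey_scale_map(counts, total_frames=5)
# 255 = detected in every frame (high caption probability), 0 = never.
assert gmap == [[255, 255, 51], [255, 204, 0]]
```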
Specification