Text localization for image and video OCR
First Claim
1. A method of text detection in a video image, comprising:
- at an image processor, receiving a video frame that potentially contains text;
- segmenting the image into regions having similar color;
- identifying high likelihood non-text regions from the regions having similar color and discarding the high likelihood non-text regions;
- merging, within the remaining regions, those regions whose size and color are similar and whose horizontal positions are within a threshold;
- describing the regions using features by carrying out a feature extraction process to extract stroke features, edge features, and fill factor features; and
- passing the remaining regions through a trained binary classifier to obtain the final text regions, which can be binarized and recognized by OCR software.
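As a rough illustration of the merging step in claim 1, the sketch below tests whether two candidate regions are similar enough in size and color, and close enough horizontally, to be merged. The region representation and all threshold values are hypothetical; the claim does not fix them.

```python
import numpy as np

def mergeable(r1, r2, t_size=0.5, t_color=40.0, t_dist=20):
    """Illustrative merge test for two candidate text regions.

    Each region is a dict with keys x, w, h and an average RGB color.
    Thresholds t_size, t_color, t_dist are assumptions for this sketch.
    """
    # Size similarity: ratio of the two heights must be close to 1.
    size_ok = min(r1["h"], r2["h"]) / max(r1["h"], r2["h"]) > t_size
    # Color similarity: Euclidean distance between average RGB colors.
    c1 = np.asarray(r1["color"], dtype=float)
    c2 = np.asarray(r2["color"], dtype=float)
    color_ok = float(np.linalg.norm(c1 - c2)) < t_color
    # Horizontal positions within a threshold: gap between the boxes.
    gap = max(r1["x"], r2["x"]) - min(r1["x"] + r1["w"], r2["x"] + r2["w"])
    dist_ok = gap < t_dist
    return size_ok and color_ok and dist_ok
```

Two adjacent same-colored regions of similar height pass the test; a distant region with a very different average color fails.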
Abstract
In accord with embodiments consistent with the present invention, a first action in recognizing text from images and video is to accurately locate the position of the text. After that, the located and possibly low resolution text can be extracted, enhanced and binarized. Finally, existing OCR technology can be applied to the binarized text for recognition. This abstract is not to be considered limiting, since other embodiments may deviate from the features described in this abstract.
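The abstract describes a three-stage pipeline: locate the text, binarize it, then hand it to existing OCR. A minimal sketch of the middle stage, assuming a simple fixed global threshold (the patent contemplates enhancement before this step, and the threshold value here is illustrative only):

```python
import numpy as np

def binarize(gray, threshold=128):
    """Fixed-threshold binarization of a located text patch.

    Pixels at or above the threshold map to 255 (foreground),
    the rest to 0. The threshold value is an assumption.
    """
    gray = np.asarray(gray)
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)
```

In practice an adaptive method (e.g. Otsu's) would usually replace the fixed threshold for low-contrast video text.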
18 Claims
1. A method of text detection in a video image, comprising:
- at an image processor, receiving a video frame that potentially contains text;
- segmenting the image into regions having similar color;
- identifying high likelihood non-text regions from the regions having similar color and discarding the high likelihood non-text regions;
- merging, within the remaining regions, those regions whose size and color are similar and whose horizontal positions are within a threshold;
- describing the regions using features by carrying out a feature extraction process to extract stroke features, edge features, and fill factor features; and
- passing the remaining regions through a trained binary classifier to obtain the final text regions, which can be binarized and recognized by OCR software. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
15. A text detection process, comprising:
preprocessing an image by segmentation using statistical region merging, removing regions that are definitely not text, and grouping regions based on the criteria of height similarity, color similarity, region distance and horizontal alignment, defined as follows: height similarity is defined in terms of HEIGHT1 and HEIGHT2, the heights of the two regions; color similarity is defined as

D(c1, c2) = √((R1 − R2)² + (G1 − G2)² + (B1 − B2)²) < Tcolor,

where c1 = [R1 G1 B1] and c2 = [R2 G2 B2] are the average colors of the two regions; region distance is defined as Dregion < Tregion, where Dregion is the horizontal distance between the two regions; and horizontal alignment is defined as Dtop < Talign or Dbottom < Talign, where Dtop and Dbottom are the vertical distances between the top boundaries and the bottom boundaries of the two regions;

carrying out a feature extraction process to describe each remaining region, where each feature is represented by a stroke feature, an edge feature and a fill factor feature of the region; and

classifying the feature vector by use of a support vector machine (SVM) classifier engine which outputs whether the region is text or not, where (xi, yi) are the feature vectors and ground-truth labels of the training samples, x is the feature vector of the region to be classified, ai and b are the parameters obtained by solving an optimization problem subject to yᵀa = 0 (0 ≤ ai ≤ C, i = 1, . . . , l), and K is the kernel function, to obtain a classification output in which 1 indicates the presence of text and −1 indicates the absence of text. - View Dependent Claims (16, 17, 18)
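Two pieces of claim 15 translate directly into code: the Euclidean color distance D(c1, c2), and the kernel SVM decision that maps a feature vector to +1 (text) or −1 (non-text). The claim leaves the kernel K unstated; an RBF kernel is assumed below, and the parameter names mirror the claim's ai, b, (xi, yi).

```python
import numpy as np

def color_distance(c1, c2):
    """Euclidean RGB distance D(c1, c2) between two average colors."""
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    return float(np.sqrt(np.sum((c1 - c2) ** 2)))

def svm_decision(x, support_x, support_y, alpha, b, gamma=0.5):
    """Kernel SVM decision: sign(sum_i a_i * y_i * K(x_i, x) + b).

    RBF kernel K(x_i, x) = exp(-gamma * ||x_i - x||^2) is an
    assumption; the claim does not specify K or gamma.
    """
    k = np.exp(-gamma * np.sum((support_x - x) ** 2, axis=1))
    score = float(np.dot(alpha * support_y, k)) + b
    return 1 if score > 0 else -1
```

In a trained model, alpha and b would come from solving the dual optimization problem under the constraints yᵀa = 0 and 0 ≤ ai ≤ C quoted in the claim.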
18. The method according to claim 15, wherein the binarization is carried out using a plurality of binarization methods, with each binarized output being processed by an optical character reader to produce multiple outputs that are combined.
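Claim 18 does not specify how the multiple OCR outputs are combined; a per-character-position majority vote is one simple possibility, sketched below (alignment-free, so it assumes the candidate strings are roughly the same length).

```python
from collections import Counter

def combine_ocr_outputs(candidates):
    """Combine OCR results from several binarizations by majority
    vote at each character position. The combination rule is an
    assumption; the claim only says the outputs are combined.
    """
    if not candidates:
        return ""
    width = max(len(c) for c in candidates)
    out = []
    for i in range(width):
        votes = Counter(c[i] for c in candidates if i < len(c))
        out.append(votes.most_common(1)[0][0])
    return "".join(out)
```

For example, if two of three binarizations read a character correctly, the vote recovers it even when the third misreads it.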
Specification