Text localization for image and video OCR
First Claim
1. A method of text detection in a video image, comprising:
at an image processor, receiving a video frame that potentially contains text;
segmenting the video frame into regions having similar color;
where color similarity is defined as D(c1, c2) = √((R1 − R2)² + (G1 − G2)² + (B1 − B2)²) < Tcolor, where c1 = [R1 G1 B1] and c2 = [R2 G2 B2] are average colors of two regions and Tcolor is a merging threshold;
identifying high likelihood non-text regions from the regions having similar color and discarding the high likelihood non-text regions;
merging regions having similar color and having horizontal positions that are within a threshold;
describing the regions using features by carrying out a feature extraction process to extract stroke features, edge features, and fill factor features; and
passing the remaining regions through a trained binary classifier to obtain final text regions which are binarized for processing by OCR software.
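The color-similarity test recited in claim 1 can be sketched in a few lines. This is an illustrative sketch, not the patented implementation; the threshold value 30.0 is an assumption, since the claim leaves Tcolor unspecified.

```python
import math

# Sketch of the claim 1 color-similarity test: two regions are merge
# candidates when the Euclidean distance between their average RGB colors
# falls below Tcolor. The value 30.0 is an illustrative assumption.
T_COLOR = 30.0

def color_distance(c1, c2):
    """Euclidean distance D(c1, c2) between average colors [R, G, B]."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def similar_color(c1, c2, t_color=T_COLOR):
    """True when D(c1, c2) < Tcolor, i.e. the two regions may be merged."""
    return color_distance(c1, c2) < t_color
```

In practice Tcolor would be tuned on labeled frames; too low a value fragments characters, too high merges text into its background.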
Abstract
In accord with embodiments consistent with the present invention, a first action in recognizing text from image and video is to accurately locate the position of the text. The located, and possibly low resolution, text can then be extracted, enhanced, and binarized. Finally, existing OCR technology can be applied to the binarized text for recognition. This abstract is not to be considered limiting, since other embodiments may deviate from the features described in this abstract.
18 Claims
1. A method of text detection in a video image, comprising:
at an image processor, receiving a video frame that potentially contains text;
segmenting the video frame into regions having similar color;
where color similarity is defined as D(c1, c2) = √((R1 − R2)² + (G1 − G2)² + (B1 − B2)²) < Tcolor, where c1 = [R1 G1 B1] and c2 = [R2 G2 B2] are average colors of two regions and Tcolor is a merging threshold;
identifying high likelihood non-text regions from the regions having similar color and discarding the high likelihood non-text regions;
merging regions having similar color and having horizontal positions that are within a threshold;
describing the regions using features by carrying out a feature extraction process to extract stroke features, edge features, and fill factor features; and
passing the remaining regions through a trained binary classifier to obtain final text regions which are binarized for processing by OCR software.
View Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
13. The method according to claim 1, where the binarization is carried out using a plurality of binarization methods with each binarized output being processed by an optical character reader to produce multiple outputs that are combined.
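Claim 13 combines the OCR outputs of several binarizations of the same region but does not prescribe a combination rule. One simple rule, shown as an assumption here, is majority voting over the whole recognized strings:

```python
from collections import Counter

# Sketch of the claim 13 combination step: the same region is binarized
# several ways, each result is OCR'd, and the per-method strings are merged.
# Majority voting over whole strings is an illustrative assumption; the
# claim does not specify how the multiple outputs are combined.
def combine_ocr_outputs(outputs):
    """Return the most frequent OCR string among the per-binarization outputs."""
    counts = Counter(o for o in outputs if o)  # ignore empty reads
    if not counts:
        return ""
    return counts.most_common(1)[0][0]
```

A finer-grained combiner could vote per character position after aligning the strings, which tolerates methods that each misread a different character.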
14. A text detection process, comprising:
preprocessing an image by segmentation using statistical region merging, removing regions that are definitely not text, and grouping regions based on criteria of height similarity, color similarity, region distance and horizontal alignment defined as follows; height similarity is defined as
View Dependent Claims: 15, 16, 17
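The grouping step of claim 14 tests four criteria on pairs of regions. The claim's defining equations are truncated in this listing, so the concrete definitions below (a ratio test for height similarity, Euclidean RGB distance for color, a gap bounded by the mean height for region distance, vertical overlap for horizontal alignment) and every threshold are assumptions made only to illustrate the structure of the test:

```python
# Sketch of the claim 14 grouping criteria. All definitions and thresholds
# here are illustrative assumptions; the patent's own equations are not
# reproduced in this listing.
def groupable(r1, r2, t_height=0.5, t_color=30.0, t_dist=2.0):
    """r1, r2: dicts with 'x', 'y', 'w', 'h' (bounding box) and 'color' (avg RGB)."""
    # Height similarity: the shorter region is at least t_height of the taller.
    if min(r1["h"], r2["h"]) / max(r1["h"], r2["h"]) < t_height:
        return False
    # Color similarity: Euclidean distance between average colors below t_color.
    d = sum((a - b) ** 2 for a, b in zip(r1["color"], r2["color"])) ** 0.5
    if d >= t_color:
        return False
    # Region distance: horizontal gap no more than t_dist times the mean height.
    gap = max(r1["x"], r2["x"]) - min(r1["x"] + r1["w"], r2["x"] + r2["w"])
    if gap > t_dist * (r1["h"] + r2["h"]) / 2:
        return False
    # Horizontal alignment: the vertical extents of the two boxes must overlap.
    return max(r1["y"], r2["y"]) < min(r1["y"] + r1["h"], r2["y"] + r2["h"])
```

Applying this predicate transitively over all segmented regions chains individual characters into candidate text lines.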
18. A method of text detection in a video image, comprising:
at an image processor, receiving a video frame that potentially contains text;
segmenting the video frame into regions having similar color;
identifying high likelihood non-text regions from the regions having similar color and discarding the high likelihood non-text regions;
merging regions having similar color and having horizontal positions that are within a threshold;
describing the regions using features by carrying out a feature extraction process to extract stroke features, edge features, and fill factor features;
representing the extracted features as feature vectors;
passing the remaining regions through a trained binary classifier to obtain final text regions which are binarized for processing by OCR software; and
where the trained binary classifier classifies each feature by use of a support vector machine (SVM) classifier engine which outputs whether the region is text or not using the following equation;
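The SVM equation referenced at the end of claim 18 is cut off in this listing. The standard kernel-SVM decision function is f(x) = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b); the sketch below assumes that standard form with a linear kernel, and the support vectors, multipliers, and bias in the test are illustrative values, not trained parameters from the patent:

```python
# Sketch of the claim 18 SVM decision step, assuming the standard decision
# function f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b) with a linear
# kernel. All parameters are illustrative, not values from the patent.
def svm_is_text(x, support_vectors, alphas, labels, b):
    """Return True (text) when the SVM decision value is positive."""
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    score = sum(a * y * dot(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return score > 0
```

In a real detector the feature vector x would hold the stroke, edge, and fill factor features recited earlier in the claim, and the support vectors would come from training on labeled text and non-text regions.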
Specification