LINE SEGMENTATION METHOD APPLICABLE TO DOCUMENT IMAGES CONTAINING HANDWRITING AND PRINTED TEXT CHARACTERS OR SKEWED TEXT LINES

US 20150063699A1
Filed: 08/30/2013
Published: 03/05/2015
Est. Priority Date: 08/30/2013
Status: Active Grant

Abstract

A text line segmentation method for a document image containing printed text and handwriting, or a document image containing skewed lines of printed text. Connected components (CCs) are obtained for the document, and their bounding boxes and centroids are calculated. The CCs are categorized into three categories based on bounding box sizes: small objects, regular text objects, and large objects involving handwriting. The centroids of the regular text objects are used in a cluster analysis to find the vertical centers of the N text lines. Each CC is then classified into one of the N lines based on the vertical distance between its centroid and the vertical centers of the text lines, and is copied into the corresponding object board. Extra spaces are removed from the object boards to obtain the line segments. A large object involving handwriting is classified into exactly one of the lines and is absent from the others.
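
The first half of the abstract (connected components, bounding boxes, centroids, and the three size categories) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of 8-connectivity, the pixel-mean centroid, and the parameters `small_max` and `large_min` are all assumptions, since the text above gives no connectivity rule or threshold values.

```python
from collections import deque

def connected_components(img):
    """Return a list of components, each a list of (y, x) foreground pixels.

    img is a list of rows of 0/1 values. 8-connectivity is an assumption.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, pixels = deque([(y, x)]), []
                while queue:  # breadth-first flood fill from the seed pixel
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(pixels)
    return comps

def describe(pixels):
    """Bounding box (inclusive), pixel-mean centroid, and box height."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return {
        "bbox": (min(ys), min(xs), max(ys), max(xs)),  # top, left, bottom, right
        "centroid": (sum(ys) / len(ys), sum(xs) / len(xs)),  # (y, x)
        "height": max(ys) - min(ys) + 1,
    }

def categorize(height, small_max, large_min):
    """Compare the box height to two hypothetical thresholds."""
    if height < small_max:
        return "small"
    if height > large_min:
        return "large"
    return "regular"
```

In practice the two thresholds would presumably be derived from the typical character height in the image; the sketch leaves them as plain parameters.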

14 Citations

12 Claims

1. A method for segmenting a binary document image containing multiple printed lines of text to obtain segmented lines of printed text, comprising:
- (a) performing a connected component analysis on the document image to generate a plurality of connected components;
  
  (b) computing a bounding box and centroid for each of the plurality of connected components;
  
  (c) based on heights of the bounding boxes of the connected components, categorizing the plurality of connected components into three categories including small objects, regular text objects, and large objects;
  
  (d) performing cluster analysis on vertical positions of the centroids of the connected components in the category of regular text objects, using a number (N) of text lines in the document image as a number of cluster centers for the cluster analysis, to calculate N cluster centers which represent central vertical positions of the N text lines;
  
  (e) classifying each connected component obtained in step (a) as belonging to a text line based on vertical distances between the centroid of the connected component and the central vertical positions of the text lines calculated in step (d), and copying the connected component into one of N object boards designated for that text line, wherein each object board is a template having a size identical to a size of the document image, each object board being designated for one of the N lines of text of the document image; and
  
  (f) removing extra spaces in each of the N object boards to obtain N text line segments.
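
Steps (d) through (f) can be sketched as below, under stated assumptions. The clustering here is plain Lloyd's iteration on scalar values with a deterministic start (evenly spaced order statistics), chosen so the example is reproducible; it is not the k-means++ seeding named in claim 6. Reading "removing extra spaces" as cropping empty rows and columns is also an assumption, and all function names are hypothetical.

```python
def kmeans_1d(values, n, iters=50):
    """Cluster scalars into n groups; returns the n centers in ascending order.

    Pass only the centroid y-values of the regular text objects, as in step (d).
    """
    s = sorted(values)
    # Deterministic start: one order statistic from the middle of each n-th slice.
    centers = [s[(2 * i + 1) * len(s) // (2 * n)] for i in range(n)]
    for _ in range(iters):
        groups = [[] for _ in range(n)]
        for v in values:
            k = min(range(n), key=lambda i: abs(v - centers[i]))
            groups[k].append(v)
        # An empty group keeps its previous center.
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return sorted(centers)

def nearest_line(cy, centers):
    """Index of the line center with the smallest vertical distance to cy."""
    return min(range(len(centers)), key=lambda i: abs(cy - centers[i]))

def build_boards(components, centers, height, width):
    """Copy every component (small, regular, or large) onto exactly one board.

    Each board is a blank image of the same size as the document image.
    """
    boards = [[[0] * width for _ in range(height)] for _ in centers]
    for pixels in components:
        cy = sum(y for y, _ in pixels) / len(pixels)
        board = boards[nearest_line(cy, centers)]
        for y, x in pixels:
            board[y][x] = 1
    return boards

def crop(board):
    """Trim empty rows and columns around the ink; empty boards give []."""
    rows = [y for y, r in enumerate(board) if any(r)]
    if not rows:
        return []
    cols = [x for x in range(len(board[0])) if any(r[x] for r in board)]
    return [r[cols[0]:cols[-1] + 1] for r in board[rows[0]:rows[-1] + 1]]
```

Because each component is written to a single board, a tall handwritten stroke that spans several printed lines ends up in one line segment only, which matches the behavior the abstract describes for large objects.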
  - 2. The method of claim 1, further comprising, before step (d), obtaining the number N of text lines in the document image, including:
    - (g1) calculating a horizontal projection profile of the document image;
      
      (g2) detecting a number of valleys in the horizontal projection profile; and
      
      (g3) calculating the number N of text lines in the document image as the number of valleys plus one.
  - 3. The method of claim 2, further comprising, after step (g1) and before step (g2), smoothing the horizontal projection profile using sliding window average.
  - 4. The method of claim 1, further comprising, after step (c) and before step (d):
    - determining whether a difference between the largest and smallest vertical positions of the centroids of the connected components in the category of regular text objects exceeds a threshold value.
  - 5. The method of claim 1, wherein step (c) includes comparing the bounding box height of each connected component to two threshold values.
  - 6. The method of claim 1, wherein step (d) employs a k-means++ algorithm.
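
The line-count estimate of claims 2 and 3 can be illustrated with a short sketch. The claims do not define what counts as a valley, so the rule used here (a maximal run of profile values at or below `level`, with higher values on both sides) is an assumption, as are the function names and the default `window` and `level` values.

```python
def horizontal_profile(img):
    """Number of foreground pixels in each row of a 0/1 image."""
    return [sum(row) for row in img]

def smooth(profile, window=3):
    """Sliding window average; the window shrinks at the two ends."""
    half = window // 2
    out = []
    for i in range(len(profile)):
        lo, hi = max(0, i - half), min(len(profile), i + half + 1)
        out.append(sum(profile[lo:hi]) / (hi - lo))
    return out

def count_valleys(profile, level):
    """Count interior runs of values <= level (hypothetical valley rule).

    Runs touching the first or last row are treated as page margins
    and are not counted.
    """
    valleys, i, n = 0, 0, len(profile)
    while i < n:
        if profile[i] <= level:
            j = i
            while j < n and profile[j] <= level:
                j += 1
            if i > 0 and j < n:
                valleys += 1
            i = j
        else:
            i += 1
    return valleys

def estimate_line_count(img, window=3, level=0.5):
    """N = number of valleys + 1, as in step (g3)."""
    return count_valleys(smooth(horizontal_profile(img), window), level) + 1
```

For an image with three bands of ink separated by two blank gaps, the sketch finds two valleys and returns N = 3.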

7. A computer program product comprising a computer usable non-transitory medium having a computer readable program code embedded therein for controlling a data processing apparatus, the computer readable program code configured to cause the data processing apparatus to execute a process for segmenting a binary document image containing multiple printed lines of text to obtain segmented lines of printed text, comprising:
- (a) performing a connected component analysis on the document image to generate a plurality of connected components;
  
  (b) computing a bounding box and centroid for each of the plurality of connected components;
  
  (c) based on heights of the bounding boxes of the connected components, categorizing the plurality of connected components into three categories including small objects, regular text objects, and large objects;
  
  (d) performing cluster analysis on vertical positions of the centroids of the connected components in the category of regular text objects, using a number (N) of text lines in the document image as a number of cluster centers for the cluster analysis, to calculate N cluster centers which represent central vertical positions of the N text lines;
  
  (e) classifying each connected component obtained in step (a) as belonging to a text line based on vertical distances between the centroid of the connected component and the central vertical positions of the text lines calculated in step (d), and copying the connected component into one of N object boards designated for that text line, wherein each object board is a template having a size identical to a size of the document image, each object board being designated for one of the N lines of text of the document image; and
  
  (f) removing extra spaces in each of the N object boards to obtain N text line segments.
  - 8. The computer program product of claim 7, wherein the process further comprises, before step (d), obtaining the number N of text lines in the document image, including:
    - (g1) calculating a horizontal projection profile of the document image;
      
      (g2) detecting a number of valleys in the horizontal projection profile; and
      
      (g3) calculating the number N of text lines in the document image as the number of valleys plus one.
  - 9. The computer program product of claim 8, wherein the process further comprises, after step (g1) and before step (g2), smoothing the horizontal projection profile using sliding window average.
  - 10. The computer program product of claim 7, wherein the process further comprises, after step (c) and before step (d):
    - determining whether a difference between the largest and smallest vertical positions of the centroids of the connected components in the category of regular text objects exceeds a threshold value.
  - 11. The computer program product of claim 7, wherein step (c) includes comparing the bounding box height of each connected component to two threshold values.
  - 12. The computer program product of claim 7, wherein step (d) employs a k-means++ algorithm.
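
The determination recited in claims 4 and 10 amounts to a single comparison, sketched below. The claims do not say what follows from the result or how the threshold is chosen, so the function name, the `threshold` parameter, and the handling of an empty input are all assumptions.

```python
def centroid_spread_exceeds(centroid_ys, threshold):
    """True if max(y) - min(y) over the regular text centroids exceeds threshold.

    Returns False for an empty list (an assumption; the claims do not cover it).
    """
    if not centroid_ys:
        return False
    return (max(centroid_ys) - min(centroid_ys)) > threshold
```

One plausible use, not stated in the claims, is to decide whether the image holds more than one text line before running the cluster analysis of step (d).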

Current Assignee: Konica Minolta Laboratory U.S.A., Inc. (Konica Minolta Inc.)
Original Assignee: Konica Minolta Laboratory U.S.A., Inc. (Konica Minolta Inc.)
Inventors: Wu, Chaohong
Granted Patent: US 9,104,940 B2
US Class Current: 382/176
CPC Class Codes:
G06V 30/10   Character recognition
G06V 30/1478   of characters or characters...
G06V 30/15   Cutting or merging image el...
G06V 30/153   using recognition of charac...
