LINE SEGMENTATION METHOD APPLICABLE TO DOCUMENT IMAGES CONTAINING HANDWRITING AND PRINTED TEXT CHARACTERS OR SKEWED TEXT LINES
First Claim
1. A method for segmenting a binary document image containing multiple printed lines of text to obtain segmented lines of printed text, comprising:
(a) performing a connected component analysis on the document image to generate a plurality of connected components;
(b) computing a bounding box and centroid for each of the plurality of connected components;
(c) based on heights of the bounding boxes of the connected components, categorizing the plurality of connected components into three categories including small objects, regular text objects, and large objects;
(d) performing cluster analysis on vertical positions of the centroids of the connected components in the category of regular text objects, using a number (N) of text lines in the document image as a number of cluster centers for the cluster analysis, to calculate N cluster centers which represent central vertical positions of the N text lines;
(e) classifying each connected component obtained in step (a) as belonging to a text line based on vertical distances between the centroid of the connected component and the central vertical positions of the text lines calculated in step (d), and copying the connected component into one of N object boards designated for that text line, wherein each object board is a template having a size identical to a size of the document image, each object board being designated for one of the N lines of text of the document image; and
(f) removing extra spaces in each of the N object boards to obtain N text line segments.
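As an illustration of steps (a) through (c), the following minimal Python sketch labels connected foreground pixels, computes a bounding box and centroid for each component, and sorts the components into the three height-based categories. The claim does not specify the pixel connectivity or how the height thresholds are chosen; the 8-connectivity, the median-height reference, the `small_ratio` and `large_ratio` parameters, and all function names are assumptions made for illustration only.

```python
from collections import deque
from statistics import median


def connected_components(img):
    """Step (a): group 8-connected foreground (1) pixels of a binary image.

    `img` is a list of rows of 0/1 values. Returns one pixel list per component.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, pixels = deque([(y, x)]), []
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(pixels)
    return comps


def describe(pixels):
    """Step (b): bounding box (top, left, bottom, right; inclusive) and centroid (y, x)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return {
        "pixels": pixels,
        "bbox": (min(ys), min(xs), max(ys), max(xs)),
        "centroid": (sum(ys) / len(ys), sum(xs) / len(xs)),
    }


def categorize(components, small_ratio=0.5, large_ratio=2.0):
    """Step (c): label each component small / regular / large by bounding-box height.

    Assumption: heights are compared against the median component height;
    the ratios are hypothetical defaults, not values from the claim.
    """
    heights = [c["bbox"][2] - c["bbox"][0] + 1 for c in components]
    ref = median(heights)
    for c, hgt in zip(components, heights):
        if hgt < small_ratio * ref:
            c["category"] = "small"
        elif hgt > large_ratio * ref:
            c["category"] = "large"
        else:
            c["category"] = "regular"
    return components
```

In this sketch, punctuation marks and noise specks would tend to fall into the small category, while a handwritten stroke spanning several printed lines would tend to fall into the large category, so that only regular text objects feed the clustering in step (d).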
Abstract
A text line segmentation method for a document image containing printed text and handwriting, or a document image containing skewed lines of printed text. Connected components (CCs) are obtained for the document image, and their bounding boxes and centroids are calculated. The CCs are categorized into three categories based on bounding box sizes: small objects, regular text objects, and large objects involving handwriting. The centroids of the regular text objects are used in a cluster analysis to find the vertical centers of the N text lines. Each CC is then classified into one of the N lines based on the vertical distance between its centroid and the vertical centers of the text lines, and copied into a corresponding object board. Extra spaces are removed from the object boards to obtain the line segments. A large object involving handwriting is classified into exactly one of the lines and is therefore absent from the other lines.
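The clustering, classification, and object-board steps summarized above might be sketched as follows. The abstract does not name a particular clustering algorithm, so plain 1-D k-means is assumed here, and "removing extra spaces" is interpreted as cropping the blank margins around the content of each board. The component dictionary layout (`pixels`, `centroid`, `category`) and all function names are hypothetical.

```python
def kmeans_1d(values, n, iterations=50):
    """Plain 1-D k-means (Lloyd's algorithm); returns n centers in ascending order."""
    vals = sorted(values)
    # Deterministic start: spread the initial centers evenly over the sorted values.
    centers = [vals[(2 * i + 1) * len(vals) // (2 * n)] for i in range(n)]
    for _ in range(iterations):
        groups = [[] for _ in range(n)]
        for v in vals:
            k = min(range(n), key=lambda i: abs(v - centers[i]))
            groups[k].append(v)
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return sorted(centers)


def crop(board):
    """Remove blank margins from a board (assumed meaning of 'extra spaces')."""
    rows = [y for y, row in enumerate(board) if any(row)]
    if not rows:
        return []
    cols = [x for x in range(len(board[0])) if any(row[x] for row in board)]
    return [row[cols[0]:cols[-1] + 1] for row in board[rows[0]:rows[-1] + 1]]


def segment_lines(components, height, width, n_lines):
    """Cluster regular-object centroids, assign every component, crop each board."""
    # Only regular text objects contribute to the line centers.
    regular_y = [c["centroid"][0] for c in components if c["category"] == "regular"]
    centers = kmeans_1d(regular_y, n_lines)

    # One object board per text line, each the same size as the document image.
    boards = [[[0] * width for _ in range(height)] for _ in range(n_lines)]

    # Every component (small, regular, or large) goes to the nearest line center.
    for c in components:
        cy = c["centroid"][0]
        line = min(range(n_lines), key=lambda i: abs(cy - centers[i]))
        for y, x in c["pixels"]:
            boards[line][y][x] = 1

    return centers, [crop(b) for b in boards]
```

Because each component is copied whole into exactly one board, a large handwritten object that overlaps several printed lines ends up in a single line segment rather than being cut across lines.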
14 Citations
12 Claims
7. A computer program product comprising a computer usable non-transitory medium having a computer readable program code embedded therein for controlling a data processing apparatus, the computer readable program code configured to cause the data processing apparatus to execute a process for segmenting a binary document image containing multiple printed lines of text to obtain segmented lines of printed text, comprising:
(a) performing a connected component analysis on the document image to generate a plurality of connected components;
(b) computing a bounding box and centroid for each of the plurality of connected components;
(c) based on heights of the bounding boxes of the connected components, categorizing the plurality of connected components into three categories including small objects, regular text objects, and large objects;
(d) performing cluster analysis on vertical positions of the centroids of the connected components in the category of regular text objects, using a number (N) of text lines in the document image as a number of cluster centers for the cluster analysis, to calculate N cluster centers which represent central vertical positions of the N text lines;
(e) classifying each connected component obtained in step (a) as belonging to a text line based on vertical distances between the centroid of the connected component and the central vertical positions of the text lines calculated in step (d), and copying the connected component into one of N object boards designated for that text line, wherein each object board is a template having a size identical to a size of the document image, each object board being designated for one of the N lines of text of the document image; and
(f) removing extra spaces in each of the N object boards to obtain N text line segments.
Specification