System and method for segmenting text lines in documents
First Claim
1. A method of classifying marking types on images of a document, the method comprising:
- supplying the document containing the images to a segmenter;
- segmenting the images received by the segmenter including identifying neatly written or printed text by grouping selected feature points along predetermined orientations, the feature points including local extrema of bounding contours of connected components, and subtracting enclosing boundary boxes of text lines from remaining document material to fragment connected components that are part of the text lines and part of extraneous markings;
- supplying the fragments to a classifier, the classifier providing a category score to each fragment, wherein the classifier is trained from groundtruth images whose pixels are labeled according to known marking types; and
- assigning a same label to all pixels in a fragment when the fragment is classified by the classifier.
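The last two steps of the claim can be illustrated with a minimal sketch. It assumes each fragment is a list of (row, col) pixels and that the label chosen is the category with the highest score; the claim itself only says that the classifier provides a category score and that all pixels in a classified fragment receive the same label. The names `label_fragments`, `machine_print`, and `handwriting` are hypothetical.

```python
def label_fragments(fragments, scores):
    """Give every pixel of a fragment the fragment's top-scoring category.

    fragments: dict mapping fragment id -> list of (row, col) pixels
    scores:    dict mapping fragment id -> {category name: score}
    Picking the highest score is an assumption made for this sketch.
    """
    pixel_labels = {}
    for frag_id, pixels in fragments.items():
        frag_scores = scores[frag_id]
        category = max(frag_scores, key=frag_scores.get)
        for pixel in pixels:
            pixel_labels[pixel] = category  # same label for the whole fragment
    return pixel_labels


# Two hypothetical fragments with hypothetical category scores.
fragments = {0: [(0, 0), (0, 1)], 1: [(5, 5)]}
scores = {
    0: {"machine_print": 0.9, "handwriting": 0.1},
    1: {"machine_print": 0.2, "handwriting": 0.8},
}
labels = label_fragments(fragments, scores)
```

Labeling at the fragment level rather than per pixel means one classifier decision covers the whole fragment, which is why the quality of the segmentation step matters.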
1 Assignment
0 Petitions
Abstract
Methods and systems of the present embodiment segment connected components of markings found in document images. Segmenting includes detecting aligned text. From this detected material, an aligned text mask is generated and used in processing the images. The processing breaks connected components in the document images into smaller pieces, or fragments, by detecting and segregating the connected components, and fragments thereof, that are likely to belong to aligned text.
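As a rough illustration of the aligned text mask, the sketch below rasterizes text line bounding boxes into a binary mask and uses it to split a binary image into a text layer and a remainder layer. The box format (top, left, bottom, right, inclusive), the function names, and the sample image are assumptions for this example and are not taken from the patent.

```python
def text_mask(height, width, boxes):
    """Rasterize inclusive (top, left, bottom, right) boxes into a boolean mask."""
    mask = [[False] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        for r in range(max(top, 0), min(bottom, height - 1) + 1):
            for c in range(max(left, 0), min(right, width - 1) + 1):
                mask[r][c] = True
    return mask


def split_by_mask(image, mask):
    """Return (pixels under the mask, pixels outside the mask) as two images."""
    text = [[px and m for px, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]
    rest = [[px and not m for px, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]
    return text, rest


# A vertical stroke crossing a horizontal run of "text" on row 2.
rows = [
    "...#..",
    "...#..",
    "######",
    "...#..",
    "...#..",
]
image = [[ch == "#" for ch in row] for row in rows]
mask = text_mask(len(image), len(image[0]), [(1, 0, 3, 5)])
text_layer, rest_layer = split_by_mask(image, mask)
text_count = sum(sum(row) for row in text_layer)
rest_count = sum(sum(row) for row in rest_layer)
```

Here the single cross-shaped component is broken apart: the part under the mask stays with the text layer, and the stroke ends above and below the box become separate pieces in the remainder layer.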
56 Citations
16 Claims
1. A method of classifying marking types on images of a document, the method comprising:
- supplying the document containing the images to a segmenter;
- segmenting the images received by the segmenter including identifying neatly written or printed text by grouping selected feature points along predetermined orientations, the feature points including local extrema of bounding contours of connected components, and subtracting enclosing boundary boxes of text lines from remaining document material to fragment connected components that are part of the text lines and part of extraneous markings;
- supplying the fragments to a classifier, the classifier providing a category score to each fragment, wherein the classifier is trained from groundtruth images whose pixels are labeled according to known marking types; and
- assigning a same label to all pixels in a fragment when the fragment is classified by the classifier.
Dependent claims: 2, 3, 9, 10, 11, 12, 13
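The feature points named in the claim, local extrema of the bounding contours of a connected component, might be found along the following lines. This sketch assumes a column-wise profile (topmost and bottommost foreground row per column) as a stand-in for a traced contour, and it reports only the first column of a flat run. Both choices, and all function names, are assumptions for illustration.

```python
def column_profile(pixels, pick):
    """Per column, keep one row chosen by `pick` (min = topmost, max = bottommost)."""
    profile = {}
    for r, c in pixels:
        profile[c] = r if c not in profile else pick(profile[c], r)
    cols = sorted(profile)
    return cols, [profile[c] for c in cols]


def local_minima(values):
    """Indices of local minima; a flat run contributes only its first index."""
    found = []
    n = len(values)
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[j + 1] == values[i]:
            j += 1  # extend over the flat run i..j
        left_ok = i == 0 or values[i - 1] > values[i]
        right_ok = j == n - 1 or values[j + 1] > values[i]
        if left_ok and right_ok:
            found.append(i)
        i = j + 1
    return found


def contour_extrema(pixels):
    """Return (upper extrema, lower extrema) as lists of (row, col) points."""
    cols, tops = column_profile(pixels, min)
    _, bottoms = column_profile(pixels, max)
    upper = [(tops[i], cols[i]) for i in local_minima(tops)]
    # Row numbers grow downward, so lower extrema are maxima of the bottom profile.
    lower = [(bottoms[i], cols[i]) for i in local_minima([-b for b in bottoms])]
    return upper, lower


# A hypothetical component given as filled column spans: col -> (top row, bottom row).
spans = {0: (0, 5), 1: (3, 4), 2: (2, 4), 3: (3, 5), 4: (3, 5)}
pixels = [(r, c) for c, (t, b) in spans.items() for r in range(t, b + 1)]
upper, lower = contour_extrema(pixels)
```

For printed text, many lower extrema sit on the baseline and many upper extrema sit near the x-height, which is what makes them useful for grouping along a line orientation.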
4. A method of classifying marking types on images of a document, the method comprising:
- supplying the document containing the images to a segmenter;
- segmenting the images received by the segmenter including identifying neatly written or printed text by grouping selected feature points along predetermined orientations, the feature points including local extrema of bounding contours of connected components, and subtracting enclosing boundary boxes of text lines from remaining document material to fragment connected components that are part of the text lines and part of extraneous markings;
- supplying the fragments to a classifier, the classifier providing a category score to each fragment, wherein the classifier is trained from groundtruth images whose pixels are labeled according to known marking types; and
- assigning a same label to all pixels in a fragment when the fragment is classified by the classifier;
- wherein the segmenting includes processing the images to find text lines, the processing comprising:
  - detecting upper and lower extrema of the connected components;
  - identifying upper and lower contour extrema of the detected upper and lower extrema of the connected components;
  - grouping the identified upper and lower contour extrema;
  - identifying upper contour point groups and lower contour point groups;
  - fitting the grouped upper and lower point groups;
  - filtering out the fitted and grouped upper and lower point groups that are outside a predetermined alignment threshold;
  - forming upper and lower alignment segments for the upper and lower point groups that remain after the filtering operation;
  - matching as pairs the upper and lower segments that remain after the filtering operation; and
  - forming text line bounding boxes based on the pairs of matched upper and lower segments that remain after the filtering operation, the bounding boxes identifying connected components believed to be aligned text.
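The fitting, filtering, pairing, and bounding box steps of the text line processing could be sketched as follows. The sketch assumes the point groups have already been formed, uses an ordinary least-squares line fit, and treats the "predetermined alignment threshold" as a bound on fit residual and slope; the threshold values, the overlap-based pairing rule, and all names are hypothetical choices, not details from the claim.

```python
import math


def fit_line(points):
    """Least-squares fit row = slope * col + intercept over (row, col) points."""
    n = len(points)
    mean_c = sum(c for _, c in points) / n
    mean_r = sum(r for r, _ in points) / n
    var_c = sum((c - mean_c) ** 2 for _, c in points)
    cov = sum((c - mean_c) * (r - mean_r) for r, c in points)
    slope = cov / var_c if var_c else 0.0
    intercept = mean_r - slope * mean_c
    residual = max(abs(r - (slope * c + intercept)) for r, c in points)
    return slope, intercept, residual


def alignment_segments(groups, max_residual, max_slope):
    """Fit each group; keep (left, right, slope, intercept) for well-aligned ones."""
    segments = []
    for pts in groups:
        if len(pts) < 2:
            continue
        slope, intercept, residual = fit_line(pts)
        if residual <= max_residual and abs(slope) <= max_slope:
            cols = [c for _, c in pts]
            segments.append((min(cols), max(cols), slope, intercept))
    return segments


def text_line_boxes(upper_groups, lower_groups, max_residual=1.0, max_slope=0.1):
    """Pair overlapping upper/lower segments into (top, left, bottom, right) boxes."""
    uppers = alignment_segments(upper_groups, max_residual, max_slope)
    lowers = alignment_segments(lower_groups, max_residual, max_slope)
    boxes = []
    for ul, ur, us, ui in uppers:
        for ll, lr, ls, li in lowers:
            left, right = max(ul, ll), min(ur, lr)
            if left >= right:
                continue  # no horizontal overlap, so not the same text line
            mid = (left + right) / 2
            if us * mid + ui >= ls * mid + li:
                continue  # upper segment must lie above the lower segment
            top = min(us * ul + ui, us * ur + ui)
            bottom = max(ls * ll + li, ls * lr + li)
            boxes.append((math.floor(top), min(ul, ll), math.ceil(bottom), max(ur, lr)))
    return boxes


# Hypothetical groups: one aligned upper group, one noisy one, one aligned lower group.
aligned_upper = [(10, 0), (10, 5), (10, 10), (10, 15)]
noisy_upper = [(3, 0), (9, 4), (1, 8)]
aligned_lower = [(20, 1), (20, 6), (20, 14)]
boxes = text_line_boxes([aligned_upper, noisy_upper], [aligned_lower])
```

The noisy group is rejected at the filtering stage, so only the aligned upper and lower segments are paired into a single text line box.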
5. A system of classifying marking types on images of a document, the system comprising:
- a segmenter operated on a processor and configured to receive the document containing the images, the segmenter segmenting the images into fragments of foreground pixel structures that are identified as being likely to be of the same marking type by finding connected components, and dividing at least some of the connected components to obtain image fragments, the connected components being isolated, continuous regions of foreground pixels, the segmenter segmenting the images by:
  - identifying neatly written or printed text by grouping selected feature points along predetermined orientations, the feature points being local extrema of bounding contours of the connected components; and
  - subtracting enclosing boundary boxes of text lines from remaining document material to fragment connected components that are partly part of the text lines and partly part of extraneous markings; and
- a classifier operated on a processor and configured to receive the fragments, the classifier providing a category score to each received fragment, wherein the classifier is trained from ground truth images whose pixels are labeled according to known marking types, the classifier assigning a same label to all pixels in a fragment when the fragment is classified by the classifier.
Dependent claims: 6, 7, 8, 14, 15, 16
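The claim defines connected components as isolated, continuous regions of foreground pixels. A minimal way to find them is a breadth-first flood fill, sketched below. The use of 8-connectivity, the `#` foreground marker, and the function name are assumptions for this example; the claim does not specify a connectivity rule or an image representation.

```python
from collections import deque


def connected_components(image):
    """Return 8-connected foreground regions as sets of (row, col).

    image: list of equal-length strings where '#' marks a foreground pixel.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    seen = set()
    components = []
    for r0 in range(height):
        for c0 in range(width):
            if image[r0][c0] != "#" or (r0, c0) in seen:
                continue
            # Start a new region and flood outward from this seed pixel.
            component = set()
            queue = deque([(r0, c0)])
            seen.add((r0, c0))
            while queue:
                r, c = queue.popleft()
                component.add((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and image[nr][nc] == "#"
                                and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
            components.append(component)
    return components


# A hypothetical page with three separate marks.
page = [
    "##...#",
    ".#...#",
    "......",
    "..##..",
]
components = connected_components(page)
sizes = [len(component) for component in components]
```

Components are returned in raster-scan order of their first pixel, so the three marks above come back as regions of 3, 2, and 2 pixels.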
Specification