System and method for segmenting text lines in documents

US 8,768,057 B2
Filed: 11/15/2012
Issued: 07/01/2014
Est. Priority Date: 07/10/2009
Status: Expired


Abstract

Methods and systems of the present embodiment provide segmenting of connected components of markings found in document images. Segmenting includes detecting aligned text. From this detected material an aligned text mask is generated and used in processing of the images. The processing includes breaking connected components in the document images into smaller pieces or fragments by detecting and segregating the connected components and fragments thereof likely to belong to aligned text.
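The fragmenting step described in the abstract can be illustrated with a minimal sketch. It assumes foreground pixels are given as a set of `(row, col)` tuples and that the aligned text mask is a list of axis-aligned boxes. The function names, the 8-connectivity, and the box format are illustrative assumptions, not details taken from the patent.

```python
from collections import deque


def connected_components(pixels):
    """Group foreground pixels (a set of (row, col)) into 8-connected components."""
    remaining = set(pixels)
    components = []
    while remaining:
        seed = remaining.pop()
        comp = {seed}
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbor = (r + dr, c + dc)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        comp.add(neighbor)
                        queue.append(neighbor)
        components.append(comp)
    return components


def fragment_by_text_mask(pixels, text_boxes):
    """Split components into pieces inside and outside the text-line boxes.

    text_boxes is a list of (top, left, bottom, right) with inclusive bounds
    (an assumed format for the aligned text mask).
    """
    def in_mask(pixel):
        r, c = pixel
        return any(top <= r <= bottom and left <= c <= right
                   for top, left, bottom, right in text_boxes)

    fragments = []
    for comp in connected_components(pixels):
        inside = {p for p in comp if in_mask(p)}
        outside = comp - inside
        # Each side may itself fall apart into several pieces.
        for part in (inside, outside):
            if part:
                fragments.extend(connected_components(part))
    return fragments


# A horizontal "text" stroke crossed by a vertical stroke that extends below it.
stroke = {(2, c) for c in range(6)} | {(r, 3) for r in range(3, 7)}
fragments = fragment_by_text_mask(stroke, text_boxes=[(1, 0, 3, 5)])
```

Here the two strokes form a single connected component, and the text-line box separates it into one fragment inside the box and one outside it.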

16 Claims

1. A method of classifying marking types on images of a document, the method comprising:
- supplying the document containing the images to a segmenter;
  
  segmenting the images received by the segmenter including identifying neatly written or printed text by grouping selected feature points along predetermined orientations, the feature points including local extrema of bounding contours of connected components, and subtracting enclosing boundary boxes of text lines from remaining document material to fragment connected components that are part of the text lines and part of extraneous markings;
  
  supplying the fragments to a classifier, the classifier providing a category score to each fragment, wherein the classifier is trained from groundtruth images whose pixels are labeled according to known marking types; and
  
  assigning a same label to all pixels in a fragment when the fragment is classified by the classifier.
- Dependent claims (2, 3, 9, 10, 11, 12, 13)
  - 2. The method according to claim 1 wherein the segmenting of the image lines is directed to segregation of the text lines believed to be aligned text, from other aspects of the images.
  - 3. The method according to claim 1, wherein the document images are electronic images stored in an electronic memory and are segmented by a processor associated with the electronic memory.
  - 9. The method according to claim 1, wherein the connected components are each an isolated, continuous region of foreground pixels, and wherein at least one of the fragments is a subset of one of the connected components.
  - 10. The method according to claim 1, wherein the segmenting includes:
    - grouping the selected feature points into strips;
      
      fitting lines to the strips; and
      
      forming the enclosing bounding boxes from pairs of fitted lines.
  - 11. The method according to claim 1, further including:
    - training the classifier from the groundtruth images using a machine learning algorithm.
  - 12. The method according to claim 1, further including:
    - providing the category score to each of the fragments by;
      
      determining feature measurements for the fragment;
      
      determining the category score for the fragment by applying the classifier to the determined feature measurements.
  - 13. The method according to claim 1, wherein the segmenting includes:
    - determining the enclosing boundary boxes for individual text lines based on the grouping.
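The strip grouping and line fitting recited in claim 10 might look roughly like the sketch below. It assumes the feature points are already available as `(x, y)` image coordinates, that the predetermined orientation is horizontal, and that strips are fixed-height bins. All of these, along with the function names and the `strip_height` parameter, are assumptions made for illustration; the claims do not specify them.

```python
def group_into_strips(points, strip_height):
    """Bin (x, y) feature points into horizontal strips of the given height."""
    strips = {}
    for x, y in points:
        strips.setdefault(int(y // strip_height), []).append((x, y))
    return strips


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0  # all points share one x: treat as flat
    return slope, mean_y - slope * mean_x


# Lower contour extrema of characters sitting on two baselines (hypothetical data).
points = [(0, 10), (5, 10), (10, 11), (15, 10), (0, 30), (5, 31), (10, 30)]
strips = group_into_strips(points, strip_height=8)
lines = {key: fit_line(pts) for key, pts in strips.items()}
```

Fixed bins are the simplest possible grouping; a text line that straddles a bin boundary would be split, so a practical grouping would need something more tolerant.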

4. A method of classifying marking types on images of a document, the method comprising:
- supplying the document containing the images to a segmenter;
  
  segmenting the images received by the segmenter including identifying neatly written or printed text by grouping selected feature points along predetermined orientations, the feature points including local extrema of bounding contours of connected components, and subtracting enclosing boundary boxes of text lines from remaining document material to fragment connected components that are part of the text lines and part of extraneous markings;
  
  supplying the fragments to a classifier, the classifier providing a category score to each fragment, wherein the classifier is trained from groundtruth images whose pixels are labeled according to known marking types; and
  
  assigning a same label to all pixels in a fragment when the fragment is classified by the classifier;
  
  wherein the segmenting includes processing the images to find text lines, the processing comprising;
  
  detecting upper and lower extrema of the connected components;
  
  identifying upper and lower contour extrema of the detected upper and lower extrema of the connected components;
  
  grouping the identified upper and lower contour extrema;
  
  identifying upper contour point groups and lower contour point groups;
  
  fitting the grouped upper and lower point groups;
  
  filtering out the fitted and grouped upper and lower point groups that are outside a predetermined alignment threshold;
  
  forming upper and lower alignment segments for the upper and lower point groups that remain after the filtering operation;
  
  matching as pairs the upper and lower segments that remain after the filtering operation; and
  
  forming text line bounding boxes based on the pairs of matched upper and lower segments that remain after the filtering operation, the bounding boxes identifying connected components believed to be aligned text.
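The last four steps of claim 4 (filtering, forming segments, matching pairs, forming bounding boxes) can be sketched as follows. The claim does not say how alignment is measured or how pairs are chosen, so this sketch assumes a fit residual as the alignment measure and pairs each upper segment with the nearest lower segment that overlaps it horizontally. `Segment`, `residual`, `min_overlap`, and `max_height` are hypothetical names introduced here. Image coordinates are assumed, with y increasing downward.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A fitted alignment segment y = slope * x + intercept over [x0, x1]."""
    x0: float
    x1: float
    slope: float
    intercept: float
    residual: float  # assumed fit-error measure for the underlying point group

    def y_at(self, x):
        return self.slope * x + self.intercept


def filter_aligned(segments, max_residual):
    """Keep only segments whose point group fits its line closely enough."""
    return [s for s in segments if s.residual <= max_residual]


def match_pairs(uppers, lowers, min_overlap, max_height):
    """Pair each upper segment with the closest overlapping lower segment."""
    pairs = []
    for upper in uppers:
        best = None
        for lower in lowers:
            left = max(upper.x0, lower.x0)
            right = min(upper.x1, lower.x1)
            if right - left < min_overlap:
                continue
            mid = (left + right) / 2
            height = lower.y_at(mid) - upper.y_at(mid)
            if 0 < height <= max_height and (best is None or height < best[0]):
                best = (height, lower)
        if best is not None:
            pairs.append((upper, best[1]))
    return pairs


def bounding_box(upper, lower):
    """Return (top, left, bottom, right) enclosing a matched segment pair."""
    x0 = min(upper.x0, lower.x0)
    x1 = max(upper.x1, lower.x1)
    top = min(upper.y_at(x0), upper.y_at(x1))
    bottom = max(lower.y_at(x0), lower.y_at(x1))
    return (top, x0, bottom, x1)


uppers = filter_aligned(
    [Segment(0, 100, 0.0, 10, 0.5), Segment(0, 100, 0.0, 40, 5.0)], max_residual=1.0)
lowers = filter_aligned(
    [Segment(0, 100, 0.0, 20, 0.4), Segment(0, 100, 0.0, 52, 0.3)], max_residual=1.0)
pairs = match_pairs(uppers, lowers, min_overlap=20, max_height=30)
boxes = [bounding_box(u, l) for u, l in pairs]
```

In this example the poorly fitted upper segment is filtered out, and the lower segment at y = 52 is too far from the remaining upper segment to count as the same text line, so one bounding box results.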

5. A system of classifying marking types on images of a document, the system comprising:
- a segmenter operated on a processor and configured to receive the document containing the images, the segmenter segmenting the images into fragments of foreground pixel structures that are identified as being likely to be of the same marking type by finding connected components, and dividing at least some of the connected components to obtain image fragments, the connected components being isolated, continuous regions of foreground pixels, the segmenter segmenting the images by;
  
  identifying neatly written or printed text by grouping selected feature points along predetermined orientations, the feature points being local extrema of bounding contours of the connected components; and
  
  subtracting enclosing boundary boxes of text lines from remaining document material to fragment connected components that are partly part of the text lines and partly part of extraneous markings; and
  
  a classifier operated on a processor and configured to receive the fragments, the classifier providing a category score to each received fragment, wherein the classifier is trained from ground truth images whose pixels are labeled according to known marking types, the classifier assigning a same label to all pixels in a fragment when the fragment is classified by the classifier.
- Dependent claims (6, 7, 8, 14, 15, 16)
  - 6. The system according to claim 5 wherein the segmenter is further configured to find text lines including:
    - detecting upper and lower extrema of the connected components;
      
      identifying the upper and lower contour extrema of the detected upper and lower extrema of the connected components;
      
      grouping the identified upper and lower contour extrema;
      
      identifying upper contour point groups and lower contour point groups;
      
      fitting the grouped upper and lower point groups;
      
      filtering out the fitted and grouped upper and lower point groups that are outside a predetermined alignment threshold;
      
      forming upper and lower alignment segments for the upper and lower point groups that remain after the filtering operation;
      
      matching as pairs the upper and lower segments that remain after the filtering operation; and
      
      forming text line bounding boxes based on the pairs of matched upper and lower segments that remain after the filtering operation, the bounding boxes identifying connected components believed to be aligned text.
  - 7. The method according to claim 5, wherein the document images are electronic images stored in an electronic memory and are segmented by a processor associated with the electronic memory.
  - 8. The system according to claim 5 further including a scanner to receive a hardcopy document containing images, the scanner converting the hardcopy document into an electronic document, the electronic document being the document supplied to the segmenter.
  - 14. The system according to claim 5, wherein the segmenting includes:
    - grouping the selected feature points into strips;
      
      fitting lines to the strips; and
      
      forming the enclosing bounding boxes from pairs of fitted lines.
  - 15. The system according to claim 5, wherein the segmenter segments the images by further:
    - determining the enclosing boundary boxes for individual text lines based on the grouping.
  - 16. The system according to claim 5, wherein the classifier provides the category score to each received fragment by:
    - determining feature measurements for the fragment; and
      
      determining the category score for the fragment by applying the classifier to the determined feature measurements.
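The classifier side of the claims (feature measurements, a category score per fragment, and one label for all pixels of a fragment) can be illustrated with a deliberately simple stand-in. The claims do not name the features or the learning algorithm, so this sketch assumes two shape features and a nearest-centroid scorer. It also trains on fragments that already carry a label, which simplifies the pixel-labeled groundtruth images the claims describe. The function names and the category names are hypothetical.

```python
def features(fragment):
    """Measure a fragment: (bounding-box aspect ratio, fill density)."""
    rows = [r for r, _ in fragment]
    cols = [c for _, c in fragment]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return (width / height, len(fragment) / (width * height))


def train_centroids(labeled_fragments):
    """Average the feature vectors of the training fragments per category."""
    sums = {}
    for fragment, label in labeled_fragments:
        feats = features(fragment)
        total, count = sums.get(label, ((0.0,) * len(feats), 0))
        sums[label] = (tuple(a + b for a, b in zip(total, feats)), count + 1)
    return {label: tuple(a / count for a in total)
            for label, (total, count) in sums.items()}


def category_scores(fragment, centroids):
    """Score each category as negative squared distance to its centroid."""
    feats = features(fragment)
    return {label: -sum((a - b) ** 2 for a, b in zip(feats, centroid))
            for label, centroid in centroids.items()}


def label_pixels(fragments, centroids):
    """Assign every pixel of a fragment the fragment's best-scoring category."""
    labels = {}
    for fragment in fragments:
        scores = category_scores(fragment, centroids)
        best = max(scores, key=scores.get)
        for pixel in fragment:
            labels[pixel] = best
    return labels


# Hypothetical training data: one wide stroke and one tall stroke.
training = [({(0, c) for c in range(10)}, "text"),
            ({(r, 0) for r in range(10)}, "mark")]
centroids = train_centroids(training)
new_fragment = {(5, c) for c in range(20, 28)}
labels = label_pixels([new_fragment], centroids)
```

Because labeling happens per fragment rather than per pixel, the quality of the result depends on the segmenter producing fragments that contain only one marking type.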


Current Assignee: III Holdings 6, LLC
Original Assignee: Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Inventors: Saund, Eric
Primary Examiner(s): Wu, Jingge
Application Number: US13/677,473
Publication Number: US 20130114890A1
Time in Patent Office: 593 Days
Field of Search: None
US Class Current: 382/175
CPC Class Codes:
G06V 30/10   Character recognition
G06V 30/155   Removing patterns interferi...
G06V 30/412   Layout analysis of document...
G06V 30/413   Classification of content, ...
