Document page analyzer and method
First Claim
1. A method for analyzing a digital image of a document page, comprising the steps of:
detecting and removing one or more lines from the digital image, if any are present;
segmenting the line-removed image into one or more coarse blocks;
performing run-length smoothing on the coarse blocks;
performing connected component analysis on each of the smoothed coarse blocks to produce at least one connected component;
determining a bounding box for each of the connected components;
merging any overlapping bounding boxes to produce one or more finer blocks;
finding the block features of the digital image;
finding the page features of the digital image;
creating a geometric structure tree for the digital image;
said step of creating further comprising:
a) grouping the coarse blocks produced by said segmenting step into horizontal bands;
b) first determining column boundaries of each of the horizontal bands;
c) assigning finer blocks produced by the merging step to the horizontal bands;
d) second determining a geometric structure tree for all of the finer blocks in each of the horizontal bands;
e) merging the geometric structure trees produced by the second determining step; and
logically transforming the geometric structure to produce a rearranged page image arranged in a proper reading order.
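The smoothing, connected-component, bounding-box, and merging steps recited above can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a binary image stored as a list of rows (1 = ink, 0 = background), smooths horizontally only, uses 4-connectivity, and the function names and the `threshold` parameter are hypothetical.

```python
from collections import deque


def smooth_rows(image, threshold):
    """Horizontal run-length smoothing (assumed variant): fill background
    runs of at most `threshold` pixels that lie between two ink pixels."""
    out = [row[:] for row in image]
    for row in out:
        last_ink = None
        for x, pixel in enumerate(row):
            if pixel == 1:
                if last_ink is not None and 0 < x - last_ink - 1 <= threshold:
                    for k in range(last_ink + 1, x):
                        row[k] = 1
                last_ink = x
    return out


def component_boxes(image):
    """Label 4-connected components by breadth-first search and return one
    inclusive bounding box (x0, y0, x1, y1) per component."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def overlaps(a, b):
    """True when two inclusive boxes share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_boxes(boxes):
    """Union overlapping boxes repeatedly until no two boxes overlap."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes
```

A typical call chain would be `merge_boxes(component_boxes(smooth_rows(block, threshold)))` applied to each coarse block in turn. The quadratic merge loop is chosen for clarity; a real page with many components would likely call for a spatial index.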
Abstract
An apparatus and method are provided for determining the geometric and logical structure of a document page from its image. The document image is partitioned into regions (both text and non-text), which are then organized into related "article" groups in the correct reading order. The invention uses image-based features, text-based features, and assumptions based on knowledge of the expected layout to find the correct reading order of the text blocks on a document page. It can handle complex layouts that have multiple column configurations on a page and inset material (such as figures and inset text blocks). The apparatus comprises two main components: a geometric page segmentor and a logical page organizer. The geometric page segmentor partitions a binary image of a document page into fundamental units of text or non-text and produces a list of rectangular blocks, their locations on the page in points (1/72 inch), and the locations of horizontal rule lines on the page. The logical page organizer receives a list of text region locations, rule line locations, the associated ASCII text (as produced by OCR) for the text blocks, and a list of text attributes (such as face style and point size). The logical page organizer appropriately groups the components (both text and non-text) that make up a document page, sequences them in the correct reading order, and establishes the dominance pattern (e.g., identifying the lead article on a newspaper page).
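Two ideas in the abstract lend themselves to a short illustration: reporting block locations in points (1/72 inch) and sequencing blocks in reading order. The sketch below is a deliberately naive stand-in, not the logical page organizer described here. It ignores text features, rule lines, and inset material, and simply reads columns left to right and blocks top to bottom within each column. The `Block` type, the function names, and the `column_gap` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Rectangular block with coordinates in points (1/72 inch)."""
    x0: float
    y0: float
    x1: float
    y1: float


def pixels_to_points(box, dpi):
    """Convert a pixel-space box (x0, y0, x1, y1) to points at a given scan
    resolution: one inch is `dpi` pixels and also 72 points."""
    return Block(*(v * 72.0 / dpi for v in box))


def reading_order(blocks, column_gap=10.0):
    """Naive ordering (an assumption, not the patented method): cluster
    blocks into columns by left edge, then read each column top to bottom."""
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        # Join the current column if the left edge is close to the previous
        # block's left edge; otherwise start a new column to the right.
        if columns and block.x0 - columns[-1][-1].x0 <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    return [b for column in columns for b in sorted(column, key=lambda b: b.y0)]
```

For example, a box scanned at 300 dpi with pixel corners (300, 600) and (900, 1200) maps to points (72, 144) and (216, 288). Layouts with headlines spanning several columns would defeat this simple ordering, which is the kind of case the band and column analysis in the claim is meant to address.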
5 Claims
Specification