Method for inset detection in document layout analysis
Abstract
The present invention is a method for detecting insets in the structure of a document page so as to further complement the document layout and textual information provided in an optical character recognition system. A system employing the present method preferably includes a document layout analysis system wherein the inset detection methodology is used to extend the capability of an associated character recognition package to more accurately recreate the document being processed.
25 Claims
-
1. A document layout analysis method for determining document structure data from input data including the content and characteristics of regions of a portion of at least one page forming the document, the method comprising the steps of:
-
segmenting the regions within the page to identify regions characterized as text and graphics;
analyzing text regions to identify and characterize certain text regions as insets, comprising the steps of:
a) finding a pair of horizontal rulings in general vertical alignment with one another; and
b) identifying text present between said horizontal rulings;
producing an output of the recomposed text regions of the image in reading order, wherein the reading order is a function of the column boundaries; and
performing optical character recognition on the text regions.
determining whether the text region consists essentially of an inset-like font;
finding frame insets;
finding credit insets;
finding center insets;
finding column insets; and
finding stray (non-column) insets.
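Steps a) and b) of claim 1 (finding a pair of horizontal rulings in general vertical alignment and identifying the text between them) can be illustrated with a minimal Python sketch. The `Box` type, the `min_overlap` threshold, and the reading of "general vertical alignment" as overlapping horizontal extents are assumptions made for illustration; the claims do not specify them. Coordinates are assumed to grow downward, so `top < bottom`.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Axis-aligned bounding box; y grows downward (assumption).
    left: float
    top: float
    right: float
    bottom: float


def horizontally_overlapping(a: Box, b: Box, min_overlap: float = 0.8) -> bool:
    """True when the x-extents overlap by at least min_overlap of the shorter box.

    min_overlap is a hypothetical threshold, not a value from the claims.
    """
    overlap = min(a.right, b.right) - max(a.left, b.left)
    shorter = min(a.right - a.left, b.right - b.left)
    return shorter > 0 and overlap / shorter >= min_overlap


def text_between_rulings(rulings: list, texts: list) -> list:
    """Return text boxes lying between a pair of aligned horizontal rulings."""
    found = []
    ordered = sorted(rulings, key=lambda r: r.top)
    for i, upper in enumerate(ordered):
        for lower in ordered[i + 1:]:
            if not horizontally_overlapping(upper, lower):
                continue
            for t in texts:
                inside = (
                    t.top >= upper.bottom
                    and t.bottom <= lower.top
                    and t.left >= min(upper.left, lower.left)
                    and t.right <= max(upper.right, lower.right)
                )
                if inside and t not in found:
                    found.append(t)
    return found
```

A text box is reported only when it sits entirely below the upper ruling, above the lower ruling, and within their combined horizontal extent.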
-
-
3. The method of claim 1, wherein the step of finding frame insets further comprises the steps of:
-
for each horizontal ruling, finding a right connecting lower vertical ruling, then finding a connecting bottom-left horizontal ruling, and if found, finding a connecting left top vertical ruling; and
identifying as a frame inset any text completely surrounded on the top, bottom, left and right by rulings.
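The frame tracing recited in claim 3 (top ruling, then right ruling, then bottom ruling, then left ruling) might be sketched as follows. The `Ruling` type, the join tolerance `TOL`, and the tuple layout of text boxes are hypothetical; the claim only states the order in which connecting rulings are sought and that enclosed text is a frame inset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ruling:
    # Line segment with x0 <= x1 and y0 <= y1; y grows downward (assumption).
    x0: float
    y0: float
    x1: float
    y1: float


TOL = 3.0  # hypothetical tolerance for treating two endpoints as connected


def near(a: float, b: float) -> bool:
    return abs(a - b) <= TOL


def find_frames(horizontals, verticals):
    """Trace top -> right -> bottom -> left; return (left, top, right, bottom) frames."""
    frames = []
    for top in horizontals:
        for right in verticals:
            # The right side must start at the right end of the top ruling.
            if not (near(right.x0, top.x1) and near(right.y0, top.y0)):
                continue
            for bottom in horizontals:
                if bottom is top:
                    continue
                # The bottom ruling must end where the right side ends.
                if not (near(bottom.x1, right.x0) and near(bottom.y0, right.y1)):
                    continue
                for left in verticals:
                    closes = (
                        near(left.x0, bottom.x0)
                        and near(left.y1, bottom.y0)
                        and near(left.x0, top.x0)
                        and near(left.y0, top.y0)
                    )
                    if closes:
                        frames.append((top.x0, top.y0, top.x1, bottom.y0))
    return frames


def frame_insets(frames, text_boxes):
    """Return text boxes (left, top, right, bottom) fully enclosed by some frame."""
    return [
        t for t in text_boxes
        if any(f[0] <= t[0] and f[1] <= t[1] and t[2] <= f[2] and t[3] <= f[3]
               for f in frames)
    ]
```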
-
-
4. The method of claim 2, wherein the step of finding credit insets further comprises the step of executing at least one of a group of credit inset determining steps consisting of:
-
determining whether a text region is within a predefined spacing (D) of the bottom of the page;
determining whether the text region is within a predefined spacing (E) of the left of the page;
determining whether the text region has less than a predefined number of lines of text;
determining whether the text region has a ruling above that overlaps the horizontal-direction coordinates of a bounding box surrounding the text region by at least a predefined percent of the horizontal width of the text region;
determining whether any associated rulings surrounding the text region are within a predefined spacing (H) from the top of the text region and have a width less than predefined width (I); and
determining whether the height of the text region is less than a predefined height (J).
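The credit inset determining steps of claim 4 can be expressed as a set of independent boolean checks. The sketch below is illustrative only: the parameter letters D, E, H, I and J come from the claim, but every numeric default, the `max_lines` and `min_overlap_pct` names, and the `(left, right, y)` layout of the ruling are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Region:
    # Bounding box of a text region; y grows downward (assumption).
    left: float
    top: float
    right: float
    bottom: float
    n_lines: int


@dataclass
class Params:
    # All values are hypothetical placeholders, not taken from the patent.
    D: float = 72.0            # spacing from the bottom of the page
    E: float = 72.0            # spacing from the left of the page
    max_lines: int = 3         # "predefined number of lines"
    min_overlap_pct: float = 50.0
    H: float = 12.0            # spacing between ruling and top of region
    I: float = 200.0           # maximum ruling width
    J: float = 40.0            # maximum region height


def credit_checks(region, page_height, ruling, p):
    """Evaluate each credit inset test; ruling is (left, right, y) or None."""
    width = region.right - region.left
    checks = {
        "near_bottom": page_height - region.bottom <= p.D,
        "near_left": region.left <= p.E,
        "few_lines": region.n_lines < p.max_lines,
        "short": region.bottom - region.top < p.J,
        "ruling_overlaps": False,
        "ruling_close_and_narrow": False,
    }
    if ruling is not None:
        r_left, r_right, r_y = ruling
        overlap = max(0.0, min(region.right, r_right) - max(region.left, r_left))
        checks["ruling_overlaps"] = (
            r_y <= region.top
            and width > 0
            and 100.0 * overlap / width >= p.min_overlap_pct
        )
        checks["ruling_close_and_narrow"] = (
            abs(region.top - r_y) <= p.H and (r_right - r_left) < p.I
        )
    return checks
```

Because the claim requires "at least one" of the steps, a caller could accept a region when any check passes, or combine the results in some other way.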
-
-
5. The method of claim 2, wherein the step of finding center insets further comprises at least one of the steps from the group including:
-
determining whether the text region has columnar text adjacent to at least one side thereof;
determining whether the text region spans at least two columns;
determining whether a width of the text region is greater than a predefined width (I);
identifying a left-hand column where a left edge of the text region is located, then determining a first portion of the left-hand column's width that the text region comprises and determining that the text region is a center inset whenever the first portion is greater than a predetermined value (A) and less than a second predetermined value (B);
identifying a right-hand column where a right edge of the text region is located, then determining a second portion of the right-hand column's width that the text region comprises and determining that the text region is a center inset whenever the second portion is greater than the predetermined value (A) and less than the predetermined value (B); and
determining that a column portion difference value, representing the difference between the first portion of the left-hand column's width and the second portion of the right-hand column's width, is less than a predetermined value (C).
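The column-portion tests of claim 5 (values A, B and C) may be easier to follow in code. This sketch assumes that columns and regions are given as `(left, right)` horizontal extents, that "portion" means the fraction of a column's width covered by the region, and that the three tests are combined with a logical AND; the claim itself only requires at least one step from the group. The default values for A, B and C are hypothetical.

```python
def covered_fraction(col, region):
    """Fraction of the column's width (left, right) covered by the region."""
    c_left, c_right = col
    r_left, r_right = region
    overlap = max(0.0, min(c_right, r_right) - max(c_left, r_left))
    return overlap / (c_right - c_left)


def center_inset_portions(columns, region, A=0.2, B=0.8, C=0.15):
    """Apply the A/B portion tests and the C difference test to a region."""
    r_left, r_right = region
    # Column containing the region's left edge, and column containing its right edge.
    left_col = next((c for c in columns if c[0] <= r_left <= c[1]), None)
    right_col = next((c for c in columns if c[0] <= r_right <= c[1]), None)
    if left_col is None or right_col is None or left_col == right_col:
        return False  # the region does not straddle two columns
    first = covered_fraction(left_col, region)
    second = covered_fraction(right_col, region)
    return A < first < B and A < second < B and abs(first - second) < C
```

For two columns spanning 0 to 200 and 220 to 420, a region from 100 to 320 covers half of each column and passes all three tests with these assumed thresholds.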
-
-
6. The method of claim 2, wherein the step of finding center insets further comprises at least one of the steps from the group including:
-
determining if the width of the text region is greater than the predefined width (I);
determining whether the text region has a number of text lines at least equal to a predetermined value (F);
determining whether there is another text region above and vertically aligned with said text region, and having a width greater than the predefined width (I);
determining whether there is another text region below and vertically aligned with said text region, and having a width greater than the predefined width (I);
determining whether there is a first column having a left edge, and whether a left edge of said text region is located within a predetermined distance (D) of the left edge of the first column;
determining whether there is a second column having a right edge, and whether a right edge of said text region is located within a predetermined distance (D) of the right edge of the second column;
determining whether there is columnar text above and below the text region, wherein the columnar text regions above and below the text region are each lined up with the text region, and where each columnar text region includes at least a predetermined number of lines (G), and each columnar text region has a width that is less than a predetermined percentage (H) of the width of the text region, and wherein each columnar text region is closer than a predetermined vertical distance (I) from the text region; and
determining that a text region immediately above the text region is likely to continue on the same page.
-
-
7. The method of claim 6, wherein the step of determining if a text region is likely to continue on the same page further comprises the steps of:
-
determining the width of the last line of the text region;
determining the width of the average line of that text region; and
determining that it is unlikely to continue whenever the width of the last line is less than a predetermined percentage (K) of the width of the average line of that text region.
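The continuation test of claim 7 reduces to a single comparison. In the sketch below, the function name and the default for K are assumptions; line widths are supplied as a list in reading order.

```python
def unlikely_to_continue(line_widths, K=70.0):
    """True when the last line is narrower than K percent of the average line width.

    K is a hypothetical default; the claim only calls it a predetermined percentage.
    """
    if not line_widths:
        raise ValueError("text region has no lines")
    average = sum(line_widths) / len(line_widths)
    return line_widths[-1] < (K / 100.0) * average
```

A short final line suggests that the paragraph ended inside the region, so the text is treated as unlikely to continue elsewhere on the page.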
-
-
8. The method of claim 1, wherein the step of finding column insets further comprises at least one of the steps of the group comprising:
-
determining whether the text region has at least a predetermined number of columns (R);
setting a predetermined percentage (L) equal to a predetermined value (M) if the text region within a section has an inset-like font, otherwise setting the predetermined percentage (L) equal to a predetermined value (N), where M and N are not equal, and then determining whether the text region has a width less than the predetermined percentage (L) of the width of an average width of other columns in the section;
setting a predetermined percentage (O) equal to a predetermined value (P) if the text region within a section has an inset-like font, otherwise setting the predetermined percentage (O) equal to a predetermined value (Q), where P and Q are not equal, and then determining whether the text region has a number of lines less than value (O) percent of an average number of lines of the other columns in the section.
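The font-dependent thresholds of claim 8 can be sketched as below. The letters L through R follow the claim; the numeric defaults, the `(width, n_lines)` tuple layout, and the reading of the column-count test as applying to the enclosing section are assumptions.

```python
from statistics import mean


def column_inset_checks(col, others, inset_font,
                        R=2, M=90.0, N=70.0, P=60.0, Q=40.0):
    """Evaluate the column inset tests for one column.

    col and each entry of others are (width, n_lines) tuples; others holds the
    remaining columns of the same section. All defaults are hypothetical.
    """
    # The percentage thresholds depend on whether the region has an inset-like font.
    L = M if inset_font else N
    O = P if inset_font else Q
    avg_width = mean([w for w, _ in others])
    avg_lines = mean([n for _, n in others])
    return {
        "enough_columns": len(others) + 1 >= R,
        "narrow": col[0] < L / 100.0 * avg_width,
        "short": col[1] < O / 100.0 * avg_lines,
    }
```

With these assumed values, a region set in an inset-like font is accepted at a larger width and line count than one set in body text, which is one plausible reason for requiring that M and N, and P and Q, differ.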
-
-
9. The method of claim 8 wherein R is an integer greater than 1.
-
10. The method of claim 1, wherein the step of finding stray insets further comprises at least one of the steps selected from the group consisting of:
-
determining a distance from a left edge of a leftmost column in a text region to a right edge of a rightmost column in the text region, wherein the text region is a part of a document having columnar regions, and identifying as a stray inset the text region having a width less than a predetermined percentage (S) of the distance;
determining whether the width of the text region is narrower than a predefined width (I);
determining whether a portion of the text region lies outside of a columnar space defined by the outermost edges of outermost columns within a section of the document;
whenever a portion of the text region lies within the columnar space, determining if the text region has less than a predetermined number of lines (T), and if so, characterizing the text region as a stray inset; and
whenever a portion of the text region lies within the columnar space, determining that the width of the text region is less than a predetermined fraction of the width of the columnar space.
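Several of the stray inset tests of claim 10, together with the 100 percent condition of claim 11, can be sketched in one function. The tuple layouts and the defaults for S and T are assumptions for illustration.

```python
def stray_inset_checks(region, columns, S=50.0, T=5):
    """Evaluate stray inset tests.

    region is (left, right, n_lines); columns is a list of (left, right)
    extents for the section. S and T defaults are hypothetical.
    """
    # The columnar space runs from the leftmost column edge to the rightmost one.
    space_left = min(c[0] for c in columns)
    space_right = max(c[1] for c in columns)
    span = space_right - space_left

    r_left, r_right, n_lines = region
    width = r_right - r_left
    inside = max(0.0, min(r_right, space_right) - max(r_left, space_left))
    outside_fraction = 1.0 - inside / width

    return {
        "narrow_vs_span": width < S / 100.0 * span,
        "partly_outside": outside_fraction > 0,
        "fully_outside": outside_fraction == 1.0,  # the claim 11 condition
        "few_lines_inside": inside > 0 and n_lines < T,
    }
```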
-
-
11. The method of claim 10, wherein the step of determining whether a portion of the text region lies outside of a columnar space defined by an outer edge of an outermost column requires that the portion of the text region is 100 percent.
-
12. The method of claim 11, wherein the step of determining whether a portion of the text region lies outside of a columnar space defined by an outer edge of an outermost column further includes the step of determining that an innermost edge of the text region is spaced apart horizontally from the outermost edge of the outermost column.
-
13. The method of claim 2, wherein the step of determining whether the text region consists essentially of an inset-like font further comprises the steps of:
-
determining the most common font attributes used in the text region; and
determining if the most common font has at least one attribute selected from the group of attributes consisting of bold, italic, reverse video, and having a font height at least equal to a predetermined percentage (Z) of the height of an average font on the page.
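The inset-like font test of claim 13 might look like the following sketch. It assumes that font attributes are available per word as a hashable `Font` record and that "most common" is decided by a simple count; both the record layout and the default for Z are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Font(NamedTuple):
    bold: bool
    italic: bool
    reverse_video: bool
    height: float


def is_inset_like(word_fonts, page_avg_height, Z=120.0):
    """True when the region's most common font is bold, italic, reverse video, or tall.

    Z is the percentage of the page's average font height; 120 is an assumed value.
    """
    most_common, _count = Counter(word_fonts).most_common(1)[0]
    tall = most_common.height >= Z / 100.0 * page_avg_height
    return (most_common.bold or most_common.italic
            or most_common.reverse_video or tall)
```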
-
-
14. The method of claim 4 wherein the step of analyzing text regions to identify and characterize certain text regions as insets further includes the step of applying a different weighting factor to each of a plurality of credit inset determining steps within the group.
-
15. The method of claim 5 wherein the step of analyzing text regions to identify and characterize certain text regions as center insets further includes the step of applying a different weighting factor to each of a plurality of inset identification steps within the group.
-
16. The method of claim 6 wherein the step of analyzing text regions to identify and characterize certain text regions as center insets further includes the step of applying a different weighting factor to each of a plurality of inset identification steps within the group.
-
17. The method of claim 8 wherein the step of analyzing text regions to identify and characterize certain text regions as column insets further includes the step of applying a different weighting factor to each of a plurality of inset identification steps within the group.
-
18. The method of claim 10 wherein the step of analyzing text regions to identify and characterize certain text regions as stray insets further includes the step of applying a different weighting factor to each of a plurality of inset identification steps within the group.
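Claims 14 through 18 recite applying a different weighting factor to each determining step but do not say how the weighted results are combined. One simple possibility, shown here purely as an assumption, is a weighted sum compared against a threshold; the check names, weights and threshold are all hypothetical.

```python
def weighted_score(checks, weights):
    """Sum the weights of the checks that passed."""
    return sum(weights[name] for name, passed in checks.items() if passed)


def is_inset(checks, weights, threshold):
    """Classify a region as an inset when its weighted score reaches the threshold."""
    return weighted_score(checks, weights) >= threshold
```

For example, with weights of 2.0, 1.0 and 0.5 for three checks of which the first two pass, the score is 3.0, so the region is accepted at a threshold of 2.5 and rejected at 3.5.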
-
19. The method of claim 4 wherein at least one of a plurality of predetermined values is a programmable variable that is set at a value as a function of the type of document to be processed.
-
20. The method of claim 19, wherein the value of the programmable variable and the associated type of document to be processed is determined in a training step that includes analyzing a plurality of similarly structured test documents.
-
21. The method of claim 11, wherein the step of identifying those remaining text regions that are frame and credit insets comprises the steps of:
-
analyzing rulings within the page, further including the steps of analyzing full frames to identify characteristics of any full frames, and locating pictures and text in the full frames and identifying locations thereof;
after locating the pictures and text in the full frames, locating any frame insets associated with the full frame based upon the characteristics of the full frame; and
then locating any credit insets associated with the pictures.
-
-
23. The method of claim 1, further comprising the steps of:
-
identifying, within the text regions, those text regions representing headers, footers and captions, and for the remaining text regions not identified as representing headers, footers and captions, recomposing the text regions of the document;
combining the text regions into columns; and
determining boundaries of at least one column on the page.
-
-
24. The method of claim 6, wherein the step of determining if a text region is likely to continue on the same page further comprises the steps of:
-
determining the width of the last line of the text region;
determining the width of a representative line of that text region; and
determining that it is unlikely to continue whenever the width of the last line is less than a predetermined percentage (K) of the width of the average line of that text region.
-
-
25. The method of claim 24, wherein the representative line is characterized as having a line width equal to the average line width of lines in the text region.
-
22. A document layout analysis method for determining document structure data from input data including the content and characteristics of regions of at least one page forming the document, the method comprising the steps of:
-
receiving page data;
segmenting the regions within the page to identify regions characterized as text, graphics, and rulings;
within the text regions, identifying those text regions representing headers, footers and captions, and for the remaining text regions, identifying those remaining text regions that are frame and credit insets, recomposing the text regions of the document, and identifying, within the recomposed text, any center, column and stray insets; and
recalculating the sections of the document so as to produce output data indicative of the reading order of text regions therein.
-
Specification