Method and apparatus for complex column segmentation by major white region pattern matching
Abstract
A system for logically segmenting document elements from a document includes an input port for inputting a signal representing the document image, a computer having a document structural model, a document white region extraction system that extracts major white regions separating document elements in the input document image, a string translation device that generates a matching one-dimensional data string corresponding to the extracted major white regions in the document image, a comparison device that selects the optimum path through a finite state machine representing acceptable column layouts for the source document, and a columnar layout identification device that identifies the column layout defined by the optimum path. The identified columns of document elements may then be processed to logically tag or extract document elements. The method for logically segmenting document element columns includes providing at least one structural model of a corresponding source document, each structural model including at least one finite state machine defining relationships between document elements of the source document. The method further includes identifying major white regions in the input document image that segment and define the document elements of the document image, assembling a one-dimensional data string corresponding to the major white regions, generating at least one optimum path that matches the data string, and identifying the column layout of the input document image based on the optimum path.
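As a rough illustration of the string translation step, the sketch below turns a set of extracted white regions into a one-dimensional string of region types. The `WhiteRegion` class, the `"H"`/`"V"` type labels, and the top-to-bottom, left-to-right ordering are assumptions made for this example; the abstract does not specify an encoding or an ordering rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhiteRegion:
    """A rectangular background region (hypothetical representation)."""
    kind: str    # "H" = horizontal gap, "V" = vertical gap (assumed labels)
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def to_data_string(regions):
    """Order regions top-to-bottom, then left-to-right (an assumed ordering),
    and emit one (type, location) symbol per region."""
    ordered = sorted(regions, key=lambda r: (r.y, r.x))
    return [(r.kind, (r.x, r.y)) for r in ordered]


def type_string(regions):
    """Collapse the ordered symbols to their type letters only."""
    return "".join(kind for kind, _ in to_data_string(regions))


# A page with a top margin, a gutter between two columns, and a bottom margin.
page = [
    WhiteRegion("V", 300, 100, 20, 700),
    WhiteRegion("H", 0, 0, 600, 40),
    WhiteRegion("H", 0, 900, 600, 40),
]
print(type_string(page))  # HVH
```

Keeping the location alongside each type letter lets a later stage map a symbol in the string back to a region in the image.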
2 Claims
1. A method for logically identifying document elements in a complex column document image, comprising the steps of:
identifying major background regions in the document image;
generating an ordered data string corresponding to a type and a location of the major background regions in the document image;
comparing the ordered data string with a finite state machine to determine an optimal path from at least one candidate path for the ordered data string that best aligns with the finite state machine; and
identifying a columnar layout based on the identified optimal path, wherein the step of identifying the columnar layout based on the identified optimal path comprises:
selecting a current candidate path from the at least one candidate path;
identifying editing costs in the current candidate path;
correcting the current candidate path, wherein correcting the current candidate path comprises:
deleting any insertions;
inserting any deletions; and
correcting any substitutions, wherein the step of inserting any deletions comprises:
identifying matched major background regions in the document image;
locating at least one missing major background region;
selecting adjacent matched major background regions for each of the at least one missing major background region;
determining a type of the at least one missing major background region based on the finite state machine;
searching the document image for the at least one missing major background region using reduced threshold values in the background regions identifying step;
adding the at least one missing major background region to the current candidate path if the at least one missing major background region is returned by the identifying major background regions step;
selecting a next candidate path if the at least one missing major background region is not found and making a next best optimal path the current optimal path and repeating the identifying matched major background regions step through the selecting a next candidate path step until each of the at least one candidate path has been evaluated; and
setting the optimal path to the current candidate path if the at least one missing major background region is returned;
identifying major background regions based on the optimal path in the document image;
replacing four major background region margins around the major background regions based on the optimal path; and
selecting closed loops of the major background regions to identify at least one column of document elements in the columnar layout of the document image.
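The comparing step and its editing costs can be pictured as a cheapest-path search over pairs of string position and machine state. The sketch below is one possible reading, not the claimed implementation: the function name `align`, the transition-table format, the unit costs, and the use of Dijkstra's algorithm are all assumptions introduced for illustration. An "insertion" is a symbol in the string that the machine does not expect, a "deletion" is a symbol the machine expects but the string lacks, and a "substitution" is a symbol of the wrong type.

```python
import heapq


def align(symbols, transitions, start, accepting,
          ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, ops) for the cheapest way to drive the machine from
    `start` to an accepting state while consuming `symbols`, or None.
    Each op is (kind, string_index, expected_or_seen_symbol)."""
    n = len(symbols)
    heap = [(0, 0, start, ())]       # (cost so far, position, state, ops)
    settled = set()
    while heap:
        cost, i, state, ops = heapq.heappop(heap)
        if (i, state) in settled:
            continue
        settled.add((i, state))
        if i == n and state in accepting:
            return cost, list(ops)
        if i < n:
            # Extra symbol in the string: skip it and stay in the same state.
            step = ("insertion", i, symbols[i])
            heapq.heappush(heap, (cost + ins_cost, i + 1, state, ops + (step,)))
        for expected, nxt in transitions.get(state, {}).items():
            # Missing symbol: the machine advances without consuming input.
            step = ("deletion", i, expected)
            heapq.heappush(heap, (cost + del_cost, i, nxt, ops + (step,)))
            if i < n:
                if symbols[i] == expected:
                    step, extra = ("match", i, expected), 0
                else:
                    step, extra = ("substitution", i, expected), sub_cost
                heapq.heappush(heap, (cost + extra, i + 1, nxt, ops + (step,)))
    return None


# Hypothetical machine for a two-column page: margin, gutter, margin.
TWO_COLUMN = {0: {"H": 1}, 1: {"V": 2}, 2: {"H": 3}}

# The gutter "V" is missing between the two margins, so the cost is 1.
cost, ops = align("HH", TWO_COLUMN, start=0, accepting={3})
print(cost, ops)
```

A "deletion" reported at a given index marks where a missing region would have to be found, for example by re-running region extraction with reduced thresholds between the neighbouring matched regions.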
2. A method for logically identifying document elements in a complex column document image, comprising the steps of:
identifying major background regions in the document image;
generating an ordered data string corresponding to a type and a location of the major background regions in the document image;
comparing the ordered data string with a finite state machine to determine an optimal path from at least one candidate path for the ordered data string that best aligns with the finite state machine; and
identifying a columnar layout based on the identified optimal path, wherein the step of identifying the columnar layout based on the identified optimal path comprises:
selecting a current candidate path from the at least one candidate path;
identifying editing costs in the current candidate path;
correcting the current candidate path, wherein correcting the current candidate path comprises:
deleting any insertions;
inserting any deletions; and
correcting any substitutions, wherein the step of deleting any insertions comprises:
identifying matched major background regions in the document image;
locating at least one extraneous major background region;
selecting adjacent matched major background regions for each of the at least one extraneous major background region;
determining a type of the at least one extraneous major background region based on the finite state machine;
deleting the at least one extraneous major background region from the current candidate path if the at least one extraneous major background region is located;
selecting a next current candidate path if the at least one extraneous major background region is not located and repeating the matched major background regions identifying step through the selecting a next current candidate path step until each of the at least one candidate path has been evaluated; and
setting the optimal path to the current candidate path if each of the at least one extraneous major background regions in the current candidate path has been deleted;
identifying major background regions based on the optimal path in the document image;
replacing four major background region margins around the major background regions based on the optimal path; and
selecting closed loops of the major background regions to identify at least one column of document elements in the columnar layout of the document image.
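The "deleting any insertions" step of claim 2 might look like the following in code. This is a minimal sketch under stated assumptions: the candidate path is a plain list of region symbols, and the alignment result is a list of `(kind, index, symbol)` tuples in which `kind` is `"match"` or `"insertion"`. Both formats and the function name `delete_insertions` are hypothetical and are not taken from the claims.

```python
def delete_insertions(path, ops):
    """Drop the regions flagged as insertions from a candidate path.

    Returns (corrected_path, neighbours), where `neighbours` maps each
    extraneous index to the indices of the nearest matched regions
    before and after it (None when there is no such neighbour)."""
    extraneous = {i for kind, i, _ in ops if kind == "insertion"}
    matched = {i for kind, i, _ in ops if kind == "match"}

    neighbours = {}
    for i in sorted(extraneous):
        before = max((m for m in matched if m < i), default=None)
        after = min((m for m in matched if m > i), default=None)
        neighbours[i] = (before, after)

    corrected = [r for i, r in enumerate(path) if i not in extraneous]
    return corrected, neighbours


# A two-column page where a spurious second gutter was detected.
path = ["H", "V", "V", "H"]
ops = [("match", 0, "H"), ("match", 1, "V"),
       ("insertion", 2, "V"), ("match", 3, "H")]

corrected, neighbours = delete_insertions(path, ops)
print(corrected)   # ['H', 'V', 'H']
print(neighbours)  # {2: (1, 3)}
```

Recording the adjacent matched regions mirrors the claim's "selecting adjacent matched major background regions" step and gives a bounded area in which the extraneous region is expected to lie.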
Specification