Method and system for identifying lines of text in a document
Abstract
An optical recognition method and system utilizes a scanner to digitize a complex document comprising regions of text and graphics. A recognition engine, illustratively in the form of a general purpose computer, separates regions of text and graphics and identifies lines of text. This is accomplished by dividing the digitized image into columns. The columns are then separated into small rectangular units by obtaining a horizontal histogram or shadow thereof. The units are arranged into a two dimensional linking for ease of processing and access. Units which are too short are eliminated as noise and units which are too tall are eliminated or otherwise identified as comprising graphics. The remaining units are organized into lines of text by connecting units horizontally whose shadows on the vertical axis overlap.
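The column-and-histogram step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the image is assumed to be a list of rows of 0/1 pixels (1 = ink), and the names `Unit`, `split_into_units`, `col_width`, and `blank_threshold` are hypothetical, chosen only for this example.

```python
from typing import List, NamedTuple


class Unit(NamedTuple):
    """One non-blank vertical run inside a fixed-width column (hypothetical type)."""
    col: int     # index of the column the unit belongs to
    top: int     # first non-blank row, inclusive
    bottom: int  # last non-blank row, inclusive

    @property
    def height(self) -> int:
        return self.bottom - self.top + 1


def split_into_units(image: List[List[int]], col_width: int,
                     blank_threshold: int = 0) -> List[Unit]:
    """Divide a binary image into uniform-width columns, then cut each
    column at its blank/non-blank transitions.

    Assumption: a row of a column counts as blank when its ink count is
    at or below blank_threshold.
    """
    units: List[Unit] = []
    if not image:
        return units
    width = len(image[0])
    for col, x0 in enumerate(range(0, width, col_width)):
        # Horizontal histogram ("shadow") of this column only:
        # ink count per row, ignoring pixels outside the column.
        hist = [sum(row[x0:x0 + col_width]) for row in image]
        start = None
        for y, count in enumerate(hist):
            nonblank = count > blank_threshold
            if nonblank and start is None:
                start = y                      # blank -> non-blank transition
            elif not nonblank and start is not None:
                units.append(Unit(col, start, y - 1))  # non-blank -> blank
                start = None
        if start is not None:                  # run reaches the bottom edge
            units.append(Unit(col, start, len(hist) - 1))
    return units
```

Each column is analyzed using only its own pixels, so a tall graphic in one column does not affect how text in a neighboring column is cut.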
2 Claims
1. A document processing method which is capable of processing complex documents comprising regions of text and graphics comprising the steps of
scanning a document utilizing a scanner to form a digital image of the document,
transmitting the digital image of the document to a recognition engine,
electronically processing said digital image at said recognition engine by:
vertically dividing said digital image into columns of uniform width,
identifying vertical transitions between substantially blank and substantially non-blank portions within each uniform width column based solely on the corresponding portion of said digital image within each respective column, and
horizontally dividing each uniform width column at said vertical transitions between substantially blank and substantially non-blank portions of each column to form uniform width units which units each contains one substantially non-blank portion of each column, said units being vertically separated from each other by said substantially blank portions of said columns,
identifying units containing portions of text by eliminating units which have a height in pixels below a minimum threshold or a height above a maximum threshold,
utilizing said recognition engine to link units from adjacent columns which overlap vertically to identify lines of text, and
utilizing said recognition engine to arrange said units in a two dimensional array and storing in a memory associated with said recognition engine the height of each unit and information indicating the horizontal extend of each unit,
wherein said step of eliminating units whose height is above a maximum threshold comprises utilizing said recognition engine to determine the average height of the units and multiplying by a tolerance factor to obtain said maximum threshold.
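The height filtering and linking steps recited in claim 1 can be sketched as follows. This is a rough illustration, not the claimed implementation. Several points are assumptions: the average height is taken over all units passed in, the default values of `min_height` and `tolerance` are arbitrary, linking is done with a small union-find, and the names `Unit`, `filter_text_units`, and `link_lines` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Unit(NamedTuple):
    """A non-blank run within one uniform-width column (hypothetical type)."""
    col: int     # column index
    top: int     # first row, inclusive
    bottom: int  # last row, inclusive


def filter_text_units(units: List[Unit], min_height: int = 3,
                      tolerance: float = 1.5) -> List[Unit]:
    """Drop units that are too short (noise) or too tall (likely graphics).

    The maximum threshold is the average unit height multiplied by a
    tolerance factor, as the claim recites; the defaults are assumptions.
    """
    if not units:
        return []
    heights = [u.bottom - u.top + 1 for u in units]
    max_height = tolerance * sum(heights) / len(heights)
    return [u for u, h in zip(units, heights) if min_height <= h <= max_height]


def link_lines(units: List[Unit]) -> List[List[Unit]]:
    """Group units into text lines by linking units in adjacent columns
    whose vertical extents overlap."""
    parent = list(range(len(units)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, a in enumerate(units):
        for j, b in enumerate(units):
            overlaps = a.top <= b.bottom and b.top <= a.bottom
            if b.col == a.col + 1 and overlaps:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Unit]] = {}
    for i, u in enumerate(units):
        groups.setdefault(find(i), []).append(u)
    # Order lines top to bottom, and units left to right within a line.
    ordered = sorted(groups.values(), key=lambda g: min(u.top for u in g))
    return [sorted(g) for g in ordered]
```

For example, `link_lines(filter_text_units(units))` returns one list of units per detected text line. The pairwise loop is quadratic in the number of units, which keeps the sketch short; a practical version would index units by column first.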
Specification