Method and system for identifying lines of text in a document

US 5,307,422 A
Filed: 06/25/1991
Issued: 04/26/1994
Est. Priority Date: 06/25/1991
Status: Expired due to Term

First Claim

1. A document processing method which is capable of processing complex documents comprising regions of text and graphics, comprising the steps of:

- scanning a document utilizing a scanner to form a digital image of the document,
- transmitting the digital image of the document to a recognition engine,
- electronically processing said digital image at said recognition engine by:
  - vertically dividing said digital image into columns of uniform width,
  - identifying vertical transitions between substantially blank and substantially non-blank portions within each uniform width column based solely on the corresponding portion of said digital image within each respective column, and
  - horizontally dividing each uniform width column at said vertical transitions between substantially blank and substantially non-blank portions of each column to form uniform width units which units each contains one substantially non-blank portion of each column, said units being vertically separated from each other by said substantially blank portions of said columns,
  - identifying units containing portions of text by eliminating units which have a height in pixels below a minimum threshold or a height above a maximum threshold,
  - utilizing said recognition engine to link units from adjacent columns which overlap vertically to identify lines of text, and
  - utilizing said recognition engine to arrange said units in a two dimensional array and storing in a memory associated with said recognition engine the height of each unit and information indicating the horizontal extent of each unit,

wherein said step of eliminating units whose height is above a maximum threshold comprises utilizing said recognition engine to determine the average height of the units and multiplying by a tolerance factor to obtain said maximum threshold.
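The column-splitting and height-filtering steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Unit`, `split_into_units`, and `keep_text_units`, the binary list-of-lists image format, and the default `min_height` and `tolerance` values are all assumptions made for illustration. A row is treated as non-blank if any pixel in the column strip is set, which is a simplification of "substantially non-blank".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    col: int     # index of the uniform-width column
    top: int     # first non-blank row (inclusive)
    bottom: int  # last non-blank row (inclusive)

    @property
    def height(self) -> int:
        return self.bottom - self.top + 1


def split_into_units(image: List[List[int]], col_width: int) -> List[Unit]:
    """Cut each column strip at its blank/non-blank row transitions.

    `image` is a binary raster (1 = ink). If the image width is not a
    multiple of `col_width`, the last strip is narrower (an assumption
    of this sketch, not something the claim addresses).
    """
    units: List[Unit] = []
    n_rows = len(image)
    n_cols = len(image[0]) if image else 0
    for c, x0 in enumerate(range(0, n_cols, col_width)):
        top = None
        for y in range(n_rows):
            # Only pixels inside this strip decide whether the row is blank.
            inked = any(image[y][x0:x0 + col_width])
            if inked and top is None:
                top = y                      # blank -> non-blank transition
            elif not inked and top is not None:
                units.append(Unit(c, top, y - 1))  # non-blank -> blank
                top = None
        if top is not None:                  # unit runs to the bottom edge
            units.append(Unit(c, top, n_rows - 1))
    return units


def keep_text_units(units: List[Unit], min_height: int = 2,
                    tolerance: float = 1.5) -> List[Unit]:
    """Drop units below min_height or above average height * tolerance."""
    if not units:
        return []
    max_height = tolerance * sum(u.height for u in units) / len(units)
    return [u for u in units if min_height <= u.height <= max_height]
```

Here the maximum threshold is derived from the units themselves (average height times a tolerance factor), as the final clause of the claim describes, while the minimum threshold is a fixed hypothetical value.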

Abstract

An optical recognition method and system utilizes a scanner to digitize a complex document comprising regions of text and graphics. A recognition engine, illustratively in the form of a general purpose computer, separates regions of text and graphics and identifies lines of text. This is accomplished by dividing the digitized image into columns. The columns are then separated into small rectangular units by obtaining a horizontal histogram or shadow thereof. The units are arranged into a two dimensional linking for ease of processing and access. Units which are too short are eliminated as noise and units which are too tall are eliminated or otherwise identified as comprising graphics. The remaining units are organized into lines of text by connecting units horizontally whose shadows on the vertical axis overlap.
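The final step of the abstract, connecting units horizontally when their shadows on the vertical axis overlap, might look like the following Python sketch. The tuple layout `(column, top, bottom)` and the function names `overlaps` and `link_lines` are hypothetical, and the greedy left-to-right chaining is one simple way to realize the linking, not necessarily the one used in the patent.

```python
from typing import List, Tuple

# (column index, top row, bottom row); rows are inclusive.
Unit = Tuple[int, int, int]


def overlaps(a: Unit, b: Unit) -> bool:
    """True if the vertical extents of two units share at least one row."""
    return a[1] <= b[2] and b[1] <= a[2]


def link_lines(units: List[Unit]) -> List[List[Unit]]:
    """Chain units from adjacent columns whose vertical extents overlap."""
    lines: List[List[Unit]] = []
    # Sorting by column first lets each line grow from left to right.
    for unit in sorted(units):
        for line in lines:
            last = line[-1]
            if last[0] == unit[0] - 1 and overlaps(last, unit):
                line.append(unit)
                break
        else:
            # No line ends in the previous column at this height: new line.
            lines.append([unit])
    return lines
```

Each returned list is one candidate line of text, ordered by column. A unit that overlaps nothing in the previous column starts a line of its own.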

2 Claims

2. The method of claim 1 wherein said units whose height is above maximum threshold correspond to graphics regions of said digital image.
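Claim 2 treats over-tall units as graphics rather than simply discarding them. A small Python sketch of that labeling follows; the function name `classify_heights`, the label strings, and the default threshold values are assumptions for illustration only.

```python
from typing import List


def classify_heights(heights: List[int], min_height: int = 2,
                     tolerance: float = 1.5) -> List[str]:
    """Label each unit height as 'noise', 'text', or 'graphics'."""
    if not heights:
        return []
    # Maximum threshold: average unit height times a tolerance factor.
    max_height = tolerance * sum(heights) / len(heights)
    labels = []
    for h in heights:
        if h < min_height:
            labels.append("noise")      # too short to be text
        elif h > max_height:
            labels.append("graphics")   # too tall to be a line of text
        else:
            labels.append("text")
    return labels
```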

Current Assignee: Transpacific IP Group Limited
Original Assignee: Industrial Technology Research Institute
Inventors: Wang, Jar-Long
Primary Examiner(s): Razavi, Michael T.
Assistant Examiner(s): Fox, David
Application Number: US07/720,236
Time in Patent Office: 1,036 Days
Field of Search: 382/9, 382/18, 382/48, 382/61, 382/54, 358/462, 358/464
US Class Current: 382/177
CPC Class Codes: G06V 30/414 Extracting the geometrical ...
