Automatic separation of text from background in scanned images of complex documents
First Claim
1. A computer implemented process for separating text information from background information in a scanned electronic image of a document, said computer implemented process comprising the steps of:
(a) electronically scanning the document to convert the document into said scanned electronic image of the document;
(b) examining said scanned electronic image and dividing said scanned electronic image into a plurality of blocks;
(c) constructing a histogram of gray scale values of pixels within one of said blocks;
(d) dividing said histogram into three regions comprising a first region, a middle region and a last region;
(e) determining a number of peaks of said histogram in each of said three regions;
(f) if said histogram contains no peak in said middle region, setting a threshold gray scale level between a gray scale level of a peak having a highest gray scale level in said first region and a gray scale level of a peak having a lowest gray scale level in said last region;
(g) separating said text information from said background information by reexamining said block using said threshold gray scale level set in step (f); and
(h) repeating steps (c) through (g) for each of said plurality of blocks.
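Read as an algorithm, steps (b) through (h) could be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the claim does not fix the block size, the region boundaries, what counts as a peak, or where between the two peaks the threshold falls. The 8-bit gray scale, the boundaries at levels 85 and 171, the local-maximum peak test, the midpoint threshold, the assumption that text is darker than background, and all function names are assumptions made for the example.

```python
from typing import List, Optional

Image = List[List[int]]  # rows of gray levels; 0 = black, 255 = white (assumed)


def histogram(block: Image, levels: int = 256) -> List[int]:
    """Step (c): count the pixels in the block at each gray level."""
    hist = [0] * levels
    for row in block:
        for value in row:
            hist[value] += 1
    return hist


def find_peaks(hist: List[int]) -> List[int]:
    """Gray levels that are local maxima (assumed definition of a peak)."""
    peaks = []
    for g, count in enumerate(hist):
        left = hist[g - 1] if g > 0 else 0
        right = hist[g + 1] if g + 1 < len(hist) else 0
        # Strict on the left, non-strict on the right: one peak per plateau.
        if count > left and count >= right:
            peaks.append(g)
    return peaks


def block_threshold(hist: List[int], first_end: int = 85,
                    last_start: int = 171) -> Optional[int]:
    """Steps (d)-(f): return a threshold only when the middle region has no peak."""
    peaks = find_peaks(hist)
    first = [g for g in peaks if g < first_end]
    middle = [g for g in peaks if first_end <= g < last_start]
    last = [g for g in peaks if g >= last_start]
    if middle or not first or not last:
        return None  # cases that claim 1 does not address
    # Midpoint between the highest first-region peak and the lowest
    # last-region peak (the claim only says "between"; midpoint is assumed).
    return (max(first) + min(last)) // 2


def separate_text(image: Image, block_size: int = 8) -> Image:
    """Steps (b), (g), (h): binary image, 1 = text, 0 = background."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            block = [row[left:left + block_size]
                     for row in image[top:top + block_size]]
            threshold = block_threshold(histogram(block))
            if threshold is None:
                continue  # assumed fallback: leave the block as background
            for y, row in enumerate(block):
                for x, value in enumerate(row):
                    if value <= threshold:  # dark pixels treated as text
                        out[top + y][left + x] = 1
    return out
```

For instance, calling `separate_text` on a small image with a dark vertical stroke on a light background marks only the stroke pixels as text, because each block's histogram has one peak in the first region and one in the last.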
Abstract
A system that converts a scanned image of a complex document into an image in which the text is preserved and separated from the background. The system first subdivides the scanned image into blocks and then examines each block pixel by pixel to construct a histogram of the gray scale values of the pixels. The histogram is partitioned into first, middle and last regions. If one or more peaks occur in the first and last regions, and a single histogram peak occurs within the middle region, the pixels are reexamined to determine how frequently pixels having the gray scale level of the middle peak occur near pixels having the gray scale level of a first region peak. If this frequency is high, the middle peak is assumed to be background information. After determining the threshold, the system rescans the block, applying the threshold to separate the text from the background information within the block.
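The middle-peak test described in the abstract can be illustrated with a short sketch. It is one possible reading, not the method of the specification: the abstract does not say what "nearby" means, how the frequency is normalized, or how high is "high". Here the frequency is taken as the fraction of first-peak pixels that have a 4-connected neighbor at the middle-peak level; the gray level tolerance of 8, the cutoff of 0.5, and the function names are all assumptions.

```python
from typing import List

Image = List[List[int]]  # rows of gray levels, 0 = black, 255 = white (assumed)


def adjacency_frequency(block: Image, first_peak: int, middle_peak: int,
                        tolerance: int = 8) -> float:
    """Fraction of first-peak pixels touching a middle-peak pixel.

    "Touching" means 4-connected adjacency, and a pixel belongs to a peak
    if its level is within `tolerance` of it (both are assumptions).
    """
    def near(value: int, level: int) -> bool:
        return abs(value - level) <= tolerance

    height, width = len(block), len(block[0])
    dark = 0      # pixels at the first-region peak level
    touching = 0  # of those, pixels with a middle-level neighbor
    for y in range(height):
        for x in range(width):
            if not near(block[y][x], first_peak):
                continue
            dark += 1
            neighbors = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if any(0 <= ny < height and 0 <= nx < width
                   and near(block[ny][nx], middle_peak)
                   for ny, nx in neighbors):
                touching += 1
    return touching / dark if dark else 0.0


def middle_peak_is_background(block: Image, first_peak: int, middle_peak: int,
                              min_fraction: float = 0.5) -> bool:
    """Treat the middle peak as background when the frequency is "high".

    The 0.5 cutoff is a placeholder; the abstract gives no number.
    """
    return adjacency_frequency(block, first_peak, middle_peak) >= min_fraction
```

In a block where a dark character sits directly on a mid-gray patch, the frequency is close to 1 and the middle peak is classified as background; where the dark pixels are surrounded by light pixels instead, the frequency is close to 0.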
15 Claims
6. A computer implemented process for separating text information from background information in a scanned electronic image of a document, said computer implemented process comprising the steps of:
(a) electronically scanning the document to convert the document into said scanned electronic image of the document;
(b) examining said scanned electronic image and dividing said scanned electronic image into a plurality of blocks;
(c) constructing a histogram of gray scale values of pixels within one of said blocks;
(d) dividing said histogram into three regions, comprising a first region, a middle region and a last region;
(e) locating all peaks in said histogram;
(f) removing all except a highest peak in said first region;
(g) removing all except a lowest peak in said last region;
(h) determining a number of peaks remaining in said histogram;
(i) if said histogram contains only two peaks, setting a threshold gray scale level between a gray scale level of a first peak and a gray scale level of a second peak;
(j) separating said text information from said background information by reexamining said block using said threshold gray scale level set in step (i); and
(k) repeating steps (c) through (j) for each of said plurality of blocks.
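The peak-pruning steps (e) through (i) of claim 6 might look like the following sketch. It is a hedged illustration rather than the claimed implementation: "highest peak" and "lowest peak" are read here as the peaks with the highest and lowest gray scale level, by analogy with claim 1, and the region boundaries (85 and 171 on an assumed 8-bit scale), the local-maximum peak test, the midpoint threshold, and the function names are assumptions.

```python
from typing import List, Optional


def local_maxima(hist: List[int]) -> List[int]:
    """Step (e): gray levels where the histogram has a local maximum (assumed peak test)."""
    peaks = []
    for g, count in enumerate(hist):
        left = hist[g - 1] if g > 0 else 0
        right = hist[g + 1] if g + 1 < len(hist) else 0
        if count > left and count >= right:
            peaks.append(g)
    return peaks


def prune_peaks(peaks: List[int], first_end: int = 85,
                last_start: int = 171) -> List[int]:
    """Steps (f)-(g): keep one peak per outer region, all middle peaks."""
    first = [g for g in peaks if g < first_end]
    middle = [g for g in peaks if first_end <= g < last_start]
    last = [g for g in peaks if g >= last_start]
    kept = []
    if first:
        kept.append(max(first))  # peak with the highest gray level in the first region
    kept.extend(middle)
    if last:
        kept.append(min(last))   # peak with the lowest gray level in the last region
    return kept


def two_peak_threshold(hist: List[int]) -> Optional[int]:
    """Steps (h)-(i): threshold only when exactly two peaks remain."""
    remaining = prune_peaks(local_maxima(hist))
    if len(remaining) != 2:
        return None  # other peak counts are outside what claim 6 recites
    # The claim only says "between"; the midpoint is an assumption.
    return (remaining[0] + remaining[1]) // 2
```

With peaks at levels 10, 40, 200 and 240, pruning keeps 40 and 200, and the sketch places the threshold at 120. Adding a peak in the middle region leaves three peaks, so no threshold is set by this path.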
11. A computer implemented process for separating text information from background information in a scanned electronic image of a document, said computer implemented process comprising the steps of:
(a) electronically scanning the document to convert the document into said scanned electronic image of the document;
(b) examining said scanned electronic image and constructing a histogram of gray scale values of pixels within said scanned electronic image;
(c) dividing said histogram into three regions comprising a first region, a middle region and a last region;
(d) determining a number of peaks of said histogram in each of said three regions;
(e) if said histogram contains no peak in said middle region, setting a threshold gray scale level between a gray scale level of a peak having a highest gray scale level in said first region and a gray scale level of a peak having a lowest gray scale level in said last region;
(f) separating said text information from said background information by reexamining said scanned electronic image using said threshold gray scale level set in step (e).
Specification