Methods and apparatus for selecting semantically significant images in a document image without decoding image content
First Claim
1. A method for electronically processing at least one document stored as an electronic document image containing undecoded text to identify a selected portion thereof, said method comprising the steps of:
- segmenting said at least one document image into words, each word having an undecoded textual content;
- classifying the textual content of at least some of said words relative to other said words, without decoding the words, based on an evaluation of predetermined morphological characteristics of said words; and
- selecting words for further processing according to the classification of said words obtained in said classifying step.
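The three steps of the claim (segment, classify, select) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the image is one text line given as equal-length strings with `#` for ink, word breaks are runs of at least `min_gap` blank columns, and the only morphological characteristic used is word width compared against the median width. The function names `segment_words`, `classify_by_width`, and `select_words` are hypothetical.

```python
from statistics import median


def segment_words(rows, min_gap=2):
    """Split one text line into (start, end) column spans, end exclusive.

    rows: list of equal-length strings, '#' marks an ink pixel.
    A run of at least min_gap blank columns is treated as a word break
    (an assumed rule, chosen only to keep the sketch small).
    """
    width = len(rows[0])
    ink = [any(r[c] == "#" for r in rows) for c in range(width)]
    spans, start, blank = [], None, 0
    for c, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = c
            blank = 0
        elif start is not None:
            blank += 1
            if blank >= min_gap:
                # The last ink column was c - blank.
                spans.append((start, c - blank + 1))
                start, blank = None, 0
    if start is not None:
        spans.append((start, width - blank))
    return spans


def classify_by_width(spans):
    """Label each word relative to the others, without reading its content."""
    widths = [end - begin for begin, end in spans]
    cutoff = median(widths)
    return ["long" if w > cutoff else "short" for w in widths]


def select_words(spans, labels, wanted="long"):
    """Keep only the words whose class matches the wanted label."""
    return [span for span, label in zip(spans, labels) if label == wanted]


line = ["##.#..###",
        "#..#..#.#"]
spans = segment_words(line)
labels = classify_by_width(spans)
selected = select_words(spans, labels)
```

No character is recognized at any point: the words are compared only by their shape, which is the sense of "without decoding" in the claim.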
Abstract
A method and apparatus for processing a document image, using a programmed general or special purpose computer, includes forming the image into image units, and at least one image unit classifier of at least one of the image units is determined, without decoding the content of the at least one of the image units. The classifier of the at least one of the image units is then compared with a classifier of another image unit. The classifier may be image unit length, width, location in the document, font, typeface, cross-section, the number of ascenders, the number of descenders, the average pixel density, the length of the top line contour, the length of the base contour, the location of image units with respect to neighboring image units, vertical position, horizontal inter-image unit spacing, and so forth. The classifier comparison can be a comparison with classifiers of image units of words in a reference table, or with classifiers of other image units in the document. Equivalent classes of image units can be generated, from which word frequency and significance can be determined. The image units can be determined by creating bounding boxes about identifiable segments or extractable units of the image, and can contain a word, a phrase, a letter, a number, a character, a glyph or the like.
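As a rough illustration of how equivalence classes and word frequency might be derived from shape classifiers alone, the sketch below groups image units whose classifiers match. It is a sketch under stated assumptions, not the patented procedure: the `ImageUnit` type, the choice of three classifiers (width, ascender count, descender count), and the `width_bin` tolerance are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageUnit:
    """Shape measurements of one image unit (hypothetical subset)."""
    width: int        # bounding-box width in pixels
    ascenders: int    # number of ascenders
    descenders: int   # number of descenders


def signature(unit, width_bin=4):
    """Coarse shape signature; widths are binned to tolerate small noise."""
    return (unit.width // width_bin, unit.ascenders, unit.descenders)


def equivalence_classes(units, width_bin=4):
    """Group unit indices by matching signature, largest class first."""
    classes = defaultdict(list)
    for index, unit in enumerate(units):
        classes[signature(unit, width_bin)].append(index)
    return sorted(classes.values(), key=len, reverse=True)


units = [
    ImageUnit(40, 2, 1),
    ImageUnit(41, 2, 1),
    ImageUnit(20, 0, 0),
    ImageUnit(42, 2, 1),
    ImageUnit(21, 0, 0),
]
classes = equivalence_classes(units)
```

The size of each class stands in for the frequency of the corresponding word. Fixed-width binning is a crude tolerance: two nearly identical widths that fall on either side of a bin edge end up in different classes, so a real comparison would likely use a distance threshold instead.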
9 Claims
1. A method for electronically processing at least one document stored as an electronic document image containing undecoded text to identify a selected portion thereof, said method comprising the steps of:
- segmenting said at least one document image into words, each word having an undecoded textual content;
- classifying the textual content of at least some of said words relative to other said words, without decoding the words, based on an evaluation of predetermined morphological characteristics of said words; and
- selecting words for further processing according to the classification of said words obtained in said classifying step.
Dependent claims: 2, 3, 4, 5, 6, 7, 9
8. A method for electronically processing at least one document stored as an electronic document image containing undecoded information to identify a selected portion thereof, said method comprising the steps of:
- segmenting said at least one document image into image units;
- classifying the image units relative to other said image units, without decoding the image units being classified or referring to decoded image data, based on an evaluation of predetermined morphological image characteristics of said image units being classified;
- selecting image units for further processing according to the classification of said image units obtained in said classifying step, wherein:
- prior to performing said classifying step, said image units are processed for discriminating which of said image units are useful for evaluation of the subject matter contained in said document image; and
- said classifying step is performed only with the image units not discriminated by said process for discriminating; and
- said process for discriminating is performed based on an evaluation of predetermined image characteristics of said image units, without decoding the image units or referring to decoded image data.
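The ordering that claim 8 adds (discriminate first, then classify only what remains, then select) might be sketched as follows. The specific criteria are assumptions for illustration only: image units are `(width, height)` tuples, units narrower than a hypothetical `min_width` are set aside as unlikely to carry subject matter, classification groups exact shape matches, and selection keeps groups seen at least `min_count` times. None of these thresholds or function names comes from the claim.

```python
def discriminate(units, min_width=12):
    """Return indices of units kept for classification.

    units: list of (width, height) tuples measured from bounding boxes.
    Narrow units are set aside using shape alone (an assumed criterion).
    """
    return [i for i, (width, _height) in enumerate(units) if width >= min_width]


def classify(units, kept):
    """Group the kept units by identical shape, ignoring all other units."""
    groups = {}
    for i in kept:
        groups.setdefault(units[i], []).append(i)
    return groups


def select(groups, min_count=2):
    """Pick the groups that recur at least min_count times."""
    return [indices for indices in groups.values() if len(indices) >= min_count]


units = [(30, 10), (6, 10), (30, 10), (18, 12), (8, 9)]
kept = discriminate(units)
groups = classify(units, kept)
chosen = select(groups)
```

Running the filter before classification means the narrow units never enter the comparison at all, which mirrors the claim's requirement that classification be performed only with the units that were not set aside.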
Specification