Methods and apparatus for selecting semantically significant images in a document image without decoding image content

US 5,390,259 A
Filed: 11/19/1991
Issued: 02/14/1995
Est. Priority Date: 11/19/1991
Status: Expired due to Term

Abstract

A method and apparatus for processing a document image, using a programmed general or special purpose computer, includes forming the image into image units, and at least one image unit classifier of at least one of the image units is determined, without decoding the content of the at least one of the image units. The classifier of the at least one of the image units is then compared with a classifier of another image unit. The classifier may be image unit length, width, location in the document, font, typeface, cross-section, the number of ascenders, the number of descenders, the average pixel density, the length of the top line contour, the length of the base contour, the location of image units with respect to neighboring image units, vertical position, horizontal inter-image unit spacing, and so forth. The classifier comparison can be a comparison with classifiers of image units of words in a reference table, or with classifiers of other image units in the document. Equivalent classes of image units can be generated, from which word frequency and significance can be determined. The image units can be determined by creating bounding boxes about identifiable segments or extractable units of the image, and can contain a word, a phrase, a letter, a number, a character, a glyph or the like.
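The classifiers listed in the abstract are measured on the image itself, not on decoded characters. The following is a minimal sketch of that idea, not the patented implementation: it assumes an image unit is given as a small binary bitmap (rows of 0/1 pixels, 1 meaning ink), computes three of the named classifiers (width, height, average pixel density) from the unit's bounding box, and compares two units. The names `Classifier`, `classify_unit`, `similar`, and the tolerance values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = ink (assumed input format)


@dataclass(frozen=True)
class Classifier:
    width: int      # bounding-box width in pixels
    height: int     # bounding-box height in pixels
    density: float  # fraction of ink pixels inside the bounding box


def bounding_box(bitmap: Bitmap) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, of the ink pixels."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    if not rows:
        raise ValueError("bitmap contains no ink")
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    return rows[0], cols[0], rows[-1], cols[-1]


def classify_unit(bitmap: Bitmap) -> Classifier:
    """Measure shape classifiers of one image unit without decoding it."""
    top, left, bottom, right = bounding_box(bitmap)
    width = right - left + 1
    height = bottom - top + 1
    ink = sum(
        bitmap[r][c]
        for r in range(top, bottom + 1)
        for c in range(left, right + 1)
    )
    return Classifier(width, height, ink / (width * height))


def similar(a: Classifier, b: Classifier,
            size_tol: int = 1, density_tol: float = 0.1) -> bool:
    """Compare two classifiers; the tolerances are illustrative assumptions."""
    return (
        abs(a.width - b.width) <= size_tol
        and abs(a.height - b.height) <= size_tol
        and abs(a.density - b.density) <= density_tol
    )


unit_a = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
unit_b = [[1, 1], [1, 0]]
unit_c = [[1, 1, 1, 1, 1]]
same_shape = similar(classify_unit(unit_a), classify_unit(unit_b))
different_shape = similar(classify_unit(unit_a), classify_unit(unit_c))
```

Here `unit_a` and `unit_b` contain the same ink pattern at different offsets, so their classifiers match, while `unit_c` differs in every measured dimension. A comparison against a reference table, as the abstract also mentions, would use the same `similar` check against stored classifiers.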

9 Claims

1. A method for electronically processing at least one document stored as an electronic document image containing undecoded text to identify a selected portion thereof, said method comprising the steps of:
- segmenting said at least one document image into words, each word having an undecoded textual content;
  
  classifying the textual content of at least some of said words relative to other said words, without decoding the words, based on an evaluation of predetermined morphological characteristics of said words; and
  
  selecting words for further processing according to the classification of said words obtained in said classifying step.
- Dependent claims (2, 3, 4, 5, 6, 7, 9):
  - 2. The method of claim 1 wherein said evaluation of predetermined morphological characteristics includes a determination of whether said words being classified are located within selected regions within the document image.
  - 3. The method of claim 1 wherein said classifying step is also based on a determination of the relative frequencies with which words having similar predetermined morphological characteristics are present among the words being classified.
  - 4. The method of claim 1 wherein said predetermined morphological characteristics include at least one of a dimension, font, typeface, number of ascender elements, number of descender elements, pixel density, pixel cross-sectional characteristics, the location of words with respect to neighboring words, vertical position, horizontal inter-word spacing, and contour characteristic of said words.
  - 5. The method of claim 1 wherein:
    - prior to performing said classifying step, said words are processed for discriminating which of said words do not contain sufficient textual content for useful evaluation of the subject matter of the text contained in said document image;
      
      said classifying step is performed only with the words not discriminated by said process for discriminating; and
      
      said process for discriminating is performed based on an evaluation of the predetermined morphological characteristics of said words, without decoding the words or referring to decoded word image data.
  - 6. The method of claim 1 wherein a document corpus containing a plurality of documents is processed, and said segmenting, classifying and selecting steps are performed with respect to the document image for each document in the document corpus.
  - 7. The method of claim 6 further comprising the step of classifying the documents in the document corpus according to the classification of the words obtained in said classifying step.
  - 9. The method of claim 1, wherein said words comprise at least one of numbers, alphanumerical sequences, symbols and graphic representations.
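The segmenting step of claim 1 can be illustrated with a column-projection sketch. This is one common way to split a text line into words using horizontal inter-word spacing (a characteristic named in claim 4); the claims do not specify this procedure. It assumes the input is a single text line as a binary bitmap, and the function name `segment_words` and the `min_gap` threshold are hypothetical.

```python
from typing import List, Tuple


def segment_words(line: List[List[int]], min_gap: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start_col, end_col) spans of words in one text line.

    A run of at least `min_gap` blank columns is treated as an inter-word
    space; narrower gaps are assumed to be inter-character spacing.
    """
    n_cols = len(line[0]) if line else 0
    # Column projection: True where any row has ink in that column.
    inked = [any(row[c] for row in line) for c in range(n_cols)]

    spans: List[Tuple[int, int]] = []
    start = None     # first inked column of the word being built
    last_ink = None  # most recent inked column seen
    for c, has_ink in enumerate(inked):
        if not has_ink:
            continue
        if start is None:
            start = c
        elif c - last_ink - 1 >= min_gap:
            # The blank run before this column is wide enough to end the word.
            spans.append((start, last_ink))
            start = c
        last_ink = c
    if start is not None:
        spans.append((start, last_ink))
    return spans


# One-row toy line: ink at columns 0, 1, 3, then 7-9, then 14.
toy_line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]]
word_spans = segment_words(toy_line, min_gap=3)
```

Each returned span bounds one word image, which can then be measured and classified without decoding its characters. In practice `min_gap` would have to be tuned to the scan resolution and typeface.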

8. A method for electronically processing at least one document stored as an electronic document image containing undecoded information to identify a selected portion thereof, said method comprising the steps of:
- segmenting said at least one document image into image units;
  
  classifying the image units relative to other said image units, without decoding the image units being classified or referring to decoded image data, based on an evaluation of predetermined morphological image characteristics of said image units being classified;
  
  selecting image units for further processing according to the classification of said image units obtained in said classifying step, wherein:
  
  prior to performing said classifying step, said image units are processed for discriminating which of said image units are useful for evaluation of the subject matter contained in said document image; and
  
  said classifying step is performed only with the image units not discriminated by said process for discriminating; and
  
  said process for discriminating is performed based on an evaluation of predetermined image characteristics of said image units, without decoding the image units or referring to decoded image data.
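The order of operations in claim 8 (discriminate, then classify the remaining units relative to one another, then select) can be sketched as below. This is an illustration under stated assumptions, not the claimed method itself: the `Unit` record and its already-measured fields are hypothetical, the rule that narrow units carry too little content is an assumed discrimination criterion, and binning widths to form equivalence classes is a simplification.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Unit(NamedTuple):
    unit_id: int
    width: int       # bounding-box width in pixels
    ascenders: int   # number of ascender strokes
    descenders: int  # number of descender strokes


def discriminate(units: List[Unit], min_width: int = 20) -> List[Unit]:
    """Keep only units judged useful; the width rule is an assumption."""
    return [u for u in units if u.width >= min_width]


def equivalence_classes(units: List[Unit],
                        width_bin: int = 4) -> Dict[Tuple[int, int, int], List[int]]:
    """Group units whose measured characteristics fall in the same bucket."""
    classes: Dict[Tuple[int, int, int], List[int]] = defaultdict(list)
    for u in units:
        key = (u.width // width_bin, u.ascenders, u.descenders)
        classes[key].append(u.unit_id)
    return dict(classes)


def select(classes: Dict[Tuple[int, int, int], List[int]],
           min_count: int = 2) -> List[List[int]]:
    """Return classes with at least `min_count` members, most frequent first."""
    ranked = sorted(classes.values(), key=len, reverse=True)
    return [ids for ids in ranked if len(ids) >= min_count]


units = [
    Unit(0, 10, 1, 0),
    Unit(1, 41, 2, 1),
    Unit(2, 42, 2, 1),
    Unit(3, 60, 3, 0),
    Unit(4, 43, 2, 1),
    Unit(5, 12, 0, 0),
]
kept = discriminate(units)
selected = select(equivalence_classes(kept))
```

Units 0 and 5 are dropped before classification, units 1, 2, and 4 land in one equivalence class, and that class is the only one frequent enough to be selected. Fixed-width bins can separate units that straddle a bin edge, so a tolerance-based comparison may be preferable for real scans.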

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Withgott, M. Margaret; Cutting, Douglass R.; Kaplan, Ronald M.; Rao, Ramana B.; Cass, Todd A.; Bloomberg, Dan S.; Huttenlocher, Daniel P.; Bagley, Steven C.; Halvorsen, Per-Kristian
Primary Examiner(s): Boudreau, Leo H.
Application Number: US07/794,191
Time in Patent Office: 1,183 Days
Field of Search: 382/9, 382/18, 382/25, 382/27, 382/54, 382/55, 382/40
US Class Current: 382/173
CPC Class Codes:
G06F 40/103 Formatting, i.e. changing o...
G06V 30/40 Document-oriented image-bas...
