Method and apparatus for determining the frequency of words in a document without document image decoding

US 5,325,444 A
Filed: 10/29/1993
Issued: 06/28/1994
Est. Priority Date: 11/19/1991
Status: Expired due to Term

Abstract

A method and apparatus for determining word frequency from a document without first converting the document to character codes. The method includes morphological image processing to determine word unit characteristics for placement into equivalence classes utilizing non-content based information. Word shape representations are preferably determined and compared to define equivalent word units.

20 Claims

1. A method for determining a frequency of occurrence of word units in an electronic document image having words represented as an undecoded content, comprising the steps of:
- segmenting the document image into word units without decoding the document image content, each word unit corresponding to a word in said document image;
  
  deriving a word shape representation of selected word units in the document image without detecting or identifying any characters making up the word corresponding to the selected word units;
  
  identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units having similar word shape representations; and
  
  quantifying the word units in each equivalence class.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The method of claim 1 wherein said step of identifying equivalence classes of word units comprises correlating said word shape representations of said word units using a decision network.
  - 3. The method of claim 1 wherein said step of identifying equivalence classes comprises comparing word shape representations of said word units.
  - 4. The method of claim 3 wherein said word shape representations are derived by deriving at least one, one-dimensional signal characterizing the shape of each word unit.
  - 5. The method of claim 3 wherein said word shape representations are derived by deriving an image function defining a boundary enclosing the word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying the character or characters making up the word unit.
  - 6. The method of claim 1 wherein said step of quantifying the word units in each equivalence class comprises linking word units together.
  - 7. The method of claim 6 wherein said step of linking word units together comprises determining an equivalence class label for each word unit, and mapping each word unit to the determined equivalence class label.
  - 8. The method of claim 1, further comprising the step of optically scanning a document to form said document image prior to segmenting the document image.
  - 9. The method of claim 1 wherein said steps of segmenting the document image into word units, deriving a word shape representation of the word units, identifying equivalence classes of the word units, clustering the word units, and quantifying the word units are performed by operating a programmed digital computer.
  - 10. The method of claim 1, further comprising producing an output based on the identified equivalence classes.
  - 11. The method of claim 10, wherein said output is also produced based on the quantification of each equivalence class.
  - 12. The method of claim 11, wherein said output is a list of the words, represented by the word units, in order of frequency of appearance of said words in said document image.
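
The clustering, labeling, and quantifying steps recited in claims 1, 3, 6, 7 and 12 can be illustrated with a minimal Python sketch. It assumes each word unit has already been reduced to a fixed-length one-dimensional shape signal. The distance measure, the `threshold` value, the greedy assignment strategy, and the names `shape_distance`, `cluster_word_shapes` and `rank_classes` are all illustrative assumptions, not taken from the patent.

```python
from collections import Counter


def shape_distance(a, b):
    """Mean absolute difference between two shape signals (assumed metric)."""
    if len(a) != len(b):
        return float("inf")  # treat different-length signals as dissimilar
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cluster_word_shapes(shapes, threshold=0.5):
    """Map every word unit to an equivalence-class label.

    A word unit joins the first existing class whose representative shape is
    within `threshold`; otherwise it starts a new class.
    """
    representatives = []  # one representative shape per class
    labels = []           # labels[i] is the class label of word unit i
    for shape in shapes:
        for label, rep in enumerate(representatives):
            if shape_distance(shape, rep) <= threshold:
                labels.append(label)
                break
        else:
            representatives.append(shape)
            labels.append(len(representatives) - 1)
    return labels


def rank_classes(labels):
    """Count word units per class and list classes by descending frequency."""
    return Counter(labels).most_common()


# Hypothetical shape signals for five word units: units 0, 2 and 4 look alike.
shapes = [
    [1.0, 3.0, 1.0, 1.0],
    [2.0, 2.0, 4.0, 2.0],
    [1.1, 2.9, 1.0, 1.0],
    [2.0, 2.1, 4.0, 1.9],
    [1.0, 3.0, 0.9, 1.1],
]
labels = cluster_word_shapes(shapes)
ranking = rank_classes(labels)
```

Here `labels` plays the role of the equivalence class label mapped to each word unit, and `ranking` is the frequency-ordered list of classes. No character is identified at any point: only the shape signals are compared.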

13. In a method for electronically processing an electronic document comprising text images, the steps of:
- identifying word units in said text images without decoding the text images, each word unit corresponding to a word in said document image;
  
  deriving a word shape representation of said word units without detecting or identifying any characters making up the words corresponding to the word units;
  
  clustering word units having similar word shape representations into equivalence classes; and
  
  quantifying the number of word units in each equivalence class.
- Dependent claims (14)
  - 14. The method of claim 13, further comprising:
    - outputting a list of the words, represented by the word units, in order of frequency of appearance of said words in said electronic document based on said clustering and quantifying steps.
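
Identifying word units without decoding the text can be sketched, for a single text line, by looking for runs of blank pixel columns. The example below assumes a binarized line image given as rows of 0/1 values and a `min_gap` width that separates word gaps from inter-character gaps. Both are illustrative assumptions, as is the function name `segment_word_units`; the segmentation procedure actually described in the patent may differ.

```python
def segment_word_units(line_image, min_gap=2):
    """Return (start, end) column ranges of word units in a binary text line.

    `line_image` is a list of rows, each a list of 0/1 pixels (1 = ink).
    Runs of at least `min_gap` empty columns are treated as word boundaries;
    shorter gaps are assumed to be inter-character spacing. `end` is exclusive.
    """
    width = len(line_image[0])
    # A column counts as inked if any row has a foreground pixel in it.
    inked = [any(row[x] for row in line_image) for x in range(width)]

    units = []
    start = None     # first column of the current word unit
    last_ink = None  # most recent inked column
    for x, has_ink in enumerate(inked):
        if not has_ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The blank run before this column is wide enough to split words.
            units.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        units.append((start, last_ink + 1))
    return units


# Toy line: a 1-column gap inside the first unit, then a 3-column word gap.
line = [
    [1, 0, 1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 1],
]
units = segment_word_units(line)
```

Each returned column range can then be cropped out of the line image and passed on to shape derivation, clustering, and counting.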

15. An apparatus for processing a digital image of text on a document to determine word frequency in the text, comprising:
- means for segmenting the digital image into word units without decoding the digital image of text, each word unit corresponding to a word in said digital image;
  
  means for deriving a word shape representation of selected ones of said word units without detecting or identifying any characters making up the words corresponding to the selected word units;
  
  means for comparing the word shape representations of each of said selected word units to identify equivalent word units; and
  
  an output device for producing an output responsive to the relative frequencies of occurrence of the selected word units identified as being equivalent.
- Dependent claims (16, 17, 18)
  - 16. The apparatus of claim 15 wherein said word shape representation deriving means comprises means for deriving at least one, one-dimensional signal characterizing a shape of said word units.
  - 17. The apparatus of claim 15 wherein said word shape representation deriving means comprises means for deriving an image function defining a boundary enclosing each word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying the character or characters making up the word units.
  - 18. The apparatus of claim 15, wherein said output device outputs a list of the words represented by said word units in order of frequency of appearance of said words in said text.
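
One way to picture the "one-dimensional signal characterizing a shape" of claim 16 is as the upper and lower contour of the ink inside a word unit's bounding box, each expressed as a function of the column position alone. The Python sketch below is an assumption-laden illustration: the choice of contours, the binary-image input format, and the name `contour_signals` are not taken from the patent.

```python
def contour_signals(word_image):
    """Derive two 1-D signals (upper and lower contour) from a binary word image.

    `word_image` is a list of rows of 0/1 pixels (1 = ink), cropped to the
    word unit's bounding box, with row 0 at the top. For each column, the
    upper signal is the height of the top of the ink above the bottom edge,
    and the lower signal is the height of the bottom of the ink above the
    bottom edge. Columns without ink give 0 for both.
    """
    height = len(word_image)
    width = len(word_image[0])
    upper, lower = [], []
    for x in range(width):
        ink_rows = [y for y in range(height) if word_image[y][x]]
        if ink_rows:
            upper.append(height - ink_rows[0])       # top of topmost ink pixel
            lower.append(height - 1 - ink_rows[-1])  # bottom of lowest ink pixel
        else:
            upper.append(0)
            lower.append(0)
    return upper, lower


# Toy 4x3 word image.
word = [
    [0, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]
upper, lower = contour_signals(word)
```

Both signals depend on a single independent variable (the column index), so two word units can be compared by comparing their signals, without identifying any character.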

19. A method for determining a frequency of occurrence of word units in an electronic document image having words represented as an undecoded content, comprising the steps of:
- segmenting the document image into word units without decoding the document image content, each word unit corresponding to a word in said document image;
  
  determining at least one significant morphological image characteristic of selected word units in the document image without detecting or identifying any characters making up the word corresponding to the selected word units;
  
  identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units having similar morphological image characteristics; and
  
  quantifying the word units in each equivalence class.
- Dependent claims (20)
  - 20. The method according to claim 19, wherein said step of determining at least one significant morphological characteristic of said word units includes determining at least one of a dimension, font, typeface, number of ascender elements, number of descender elements, pixel density, pixel cross-sectional characteristic and contour characteristic of said selected word units.
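
A few of the characteristics listed in claim 20 (dimension and pixel density) are simple to compute directly from pixels. The Python sketch below is a hedged illustration only: the feature set, the tolerance values, and the names `morphological_features` and `similar` are assumptions, and characteristics such as font or ascender counts are left out because they need additional layout information.

```python
def morphological_features(word_image):
    """Compute simple image characteristics of a binary word unit (1 = ink)."""
    height = len(word_image)
    width = len(word_image[0])
    ink = sum(sum(row) for row in word_image)
    return {
        "width": width,
        "height": height,
        "pixel_density": ink / (width * height),
    }


def similar(f1, f2, size_tol=1, density_tol=0.15):
    """Decide whether two feature sets could share an equivalence class.

    The tolerances are assumed values chosen for this toy example.
    """
    return (
        abs(f1["width"] - f2["width"]) <= size_tol
        and abs(f1["height"] - f2["height"]) <= size_tol
        and abs(f1["pixel_density"] - f2["pixel_density"]) <= density_tol
    )


# Three hypothetical word units: a and b have the same size, c is much wider.
a = morphological_features([[1, 1, 0, 1], [1, 0, 0, 1]])
b = morphological_features([[1, 1, 0, 1], [1, 0, 1, 1]])
c = morphological_features([[1, 1, 1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0, 0, 1]])
```

With these tolerances, `similar(a, b)` holds while `similar(a, c)` does not, even though `a` and `c` have the same pixel density. Combining several characteristics therefore separates word units that a single characteristic would merge.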

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Withgott, M. Margaret; Kaplan, Ronald M.; Cass, Todd A.; Huttenlocher, Daniel P.; Halvorsen, Per-Kristian
Primary Examiner(s): Moore, David K.
Assistant Examiner(s): Johns, Andrew W.
Application Number: US08/144,620
Time in Patent Office: 242 Days
Field of Search: 382/9, 382/18, 382/36, 382/40, 382/55, 382/25
US Class Current: 382/177
CPC Class Codes:
G06V 30/10 Character recognition
G06V 30/262 using context analysis, e.g...
