Method and apparatus for determining the frequency of words in a document without document image decoding
Abstract
A method and apparatus for determining word frequency from a document without first converting the document to character codes. The method includes morphological image processing to determine word unit characteristics for placement into equivalence classes utilizing non-content-based information. Word shape representations are preferably determined and compared to define equivalent word units.
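The segmentation idea in the abstract (finding word units from image structure alone, with no character recognition) can be illustrated with a minimal sketch. It is not the patented procedure: it assumes a single text line given as rows of characters where `#` marks ink, and it replaces full 2-D morphological processing with a simple 1-D gap-filling step on the column profile, in the spirit of a morphological closing. The names `column_has_ink`, `close_gaps`, `segment_word_units`, and the `max_gap` parameter are hypothetical.

```python
def column_has_ink(image):
    """Return one boolean per column: True if any row has ink ('#') there."""
    width = max(len(row) for row in image)
    return [any(c < len(row) and row[c] == "#" for row in image)
            for c in range(width)]


def close_gaps(profile, max_gap):
    """Fill interior background runs of length <= max_gap (1-D gap closing)."""
    closed = list(profile)
    n = len(profile)
    c = 0
    while c < n:
        if profile[c]:
            c += 1
            continue
        start = c
        while c < n and not profile[c]:
            c += 1
        # Only fill gaps that have ink on both sides and are narrow enough.
        if start > 0 and c < n and (c - start) <= max_gap:
            for k in range(start, c):
                closed[k] = True
    return closed


def segment_word_units(image, max_gap=1):
    """Return (start, end) column spans of word units; end is exclusive."""
    profile = close_gaps(column_has_ink(image), max_gap)
    boxes, start = [], None
    for c, ink in enumerate(profile + [False]):  # sentinel closes the last run
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            boxes.append((start, c))
            start = None
    return boxes


# Assumed toy input: two glyphs separated by a 2-column gap, then a
# 3-column gap before a third glyph.
LINE = [
    "# #  ##   # #",
    "###  ##   ###",
    "# #  ##   # #",
]

print(segment_word_units(LINE, max_gap=2))
print(segment_word_units(LINE, max_gap=1))
```

With `max_gap=2` the narrow inter-character gap is bridged and the wider gap is kept as a word boundary; with `max_gap=1` every glyph is reported separately. In a real document the threshold would have to be derived from the image itself, which this sketch does not attempt.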
20 Claims
1. A method for determining a frequency of occurrence of word units in an electronic document image having words represented as an undecoded content, comprising the steps of:
segmenting the document image into word units without decoding the document image content, each word unit corresponding to a word in said document image;
deriving a word shape representation of selected word units in the document image without detecting or identifying any characters making up the word corresponding to the selected word units;
identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units having similar word shape representations; and
quantifying the word units in each equivalence class.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.

13. In a method for electronically processing an electronic document comprising text images, the steps of:
identifying word units in said text images without decoding the text images, each word unit corresponding to a word in said document image;
deriving a word shape representation of said word units without detecting or identifying any characters making up the words corresponding to the word units;
clustering word units having similar word shape representations into equivalence classes; and
quantifying the number of word units in each equivalence class.
Dependent claims: 14.

15. An apparatus for processing a digital image of text on a document to determine word frequency in the text, comprising:
means for segmenting the digital image into word units without decoding the digital image of text, each word unit corresponding to a word in said digital image;
means for deriving a word shape representation of selected ones of said word units without detecting or identifying any characters making up the words corresponding to the selected word units;
means for comparing the word shape representations of each of said selected word units to identify equivalent word units; and
an output device for producing an output responsive to the relative frequencies of occurrence of the selected word units identified as being equivalent.
Dependent claims: 16, 17, 18.

19. A method for determining a frequency of occurrence of word units in an electronic document image having words represented as an undecoded content, comprising the steps of:
segmenting the document image into word units without decoding the document image content, each word unit corresponding to a word in said document image;
determining at least one significant morphological image characteristic of selected word units in the document image without detecting or identifying any characters making up the word corresponding to the selected word units;
identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units having similar morphological image characteristics; and
quantifying the word units in each equivalence class.
Dependent claims: 20.

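The deriving, clustering, and quantifying steps recited in the independent claims can be sketched as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: the claims do not fix a particular shape representation or clustering rule, so the sketch assumes a per-column upper and lower ink contour as the shape representation and a greedy comparison against each class's first member, with a hypothetical tolerance `tol`. All function names are hypothetical.

```python
def shape_signature(unit):
    """Per-column (top ink row, bottom ink row); (-1, -1) for an empty column."""
    height = len(unit)
    width = max(len(row) for row in unit)
    sig = []
    for c in range(width):
        rows = [r for r in range(height)
                if c < len(unit[r]) and unit[r][c] == "#"]
        sig.append((rows[0], rows[-1]) if rows else (-1, -1))
    return sig


def signature_distance(a, b):
    """Mean absolute contour difference; infinite if widths differ or are zero."""
    if len(a) != len(b) or not a:
        return float("inf")
    total = sum(abs(ta - tb) + abs(ba - bb)
                for (ta, ba), (tb, bb) in zip(a, b))
    return total / len(a)


def cluster_and_count(units, tol=0.5):
    """Greedy clustering: join the first class whose representative is within tol."""
    classes = []  # each entry: (representative signature, list of member indices)
    for i, unit in enumerate(units):
        sig = shape_signature(unit)
        for rep, members in classes:
            if signature_distance(sig, rep) <= tol:
                members.append(i)
                break
        else:
            classes.append((sig, [i]))  # no close class found: start a new one
    return [members for _, members in classes]


# Assumed toy word units: shape A, shape B, and a slightly degraded copy of A.
A = ["# #", "###", "# #"]
B = ["###", "# #", "###"]
A_NOISY = ["# #", "###", "#  "]
UNITS = [A, B, A_NOISY, B, A]

for members in cluster_and_count(UNITS):
    print("class members:", members, "frequency:", len(members))
```

No character is ever identified: units are grouped purely by contour similarity, and the size of each group is the frequency of that word shape. A greedy, order-dependent rule is used only for brevity; other clustering strategies would fit the same outline.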
Specification