Method and apparatus for determining the frequency of phrases in a document without document image decoding
First Claim
1. A method for determining a frequency of occurrence of significant word sequences in an undecoded electronic document text image, comprising the steps of:
segmenting the document image into word units;
determining at least one significant morphological image characteristic of selected word units in the document image;
identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label;
equating the equivalence class labels to said selected word units arranged in the order in which the selected word units appear in the document image to form a master-sequence of equivalence class labels, said master-sequence including the equivalence class labels of the selected word units in the document image arranged in the order in which the selected word units appear in the document image, said master-sequence being comprised of sub-sequences;
evaluating said equivalence class label sub-sequences to determine the frequency of each equivalence class label sub-sequence, and
outputting to an optical or electrical output device a list of significant phrases corresponding to the equivalence class label sub-sequences without having determined their content beyond the at least one significant morphological image characteristic.
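The clustering and master-sequence steps of this claim can be illustrated with a minimal sketch. It is not the patent's implementation: the feature tuples, the Euclidean distance measure, the greedy first-match clustering, the tolerance `tol`, and the function name `assign_equivalence_classes` are all assumptions chosen for illustration. The only point it makes is that word units can be grouped by shape measurements alone, with no character recognition, and that the resulting labels, kept in reading order, form the master-sequence.

```python
from typing import List, Sequence, Tuple

# Hypothetical shape measurements for one word unit, e.g. (width, height).
Feature = Tuple[float, ...]


def assign_equivalence_classes(features: Sequence[Feature],
                               tol: float = 0.1) -> List[int]:
    """Return one class label per word unit, in reading order.

    Greedy clustering (an assumed strategy): a word unit joins the first
    class whose representative lies within `tol` in Euclidean distance;
    otherwise it starts a new class and becomes its representative.
    """
    representatives: List[Feature] = []
    labels: List[int] = []
    for feature in features:
        for label, rep in enumerate(representatives):
            dist = sum((a - b) ** 2 for a, b in zip(feature, rep)) ** 0.5
            if dist <= tol:
                labels.append(label)
                break
        else:
            # No existing class is close enough: open a new one.
            representatives.append(feature)
            labels.append(len(representatives) - 1)
    return labels


# Made-up measurements for five word units in reading order.
features = [(3.0, 1.0), (5.2, 1.4), (3.02, 1.01), (5.21, 1.38), (7.0, 1.0)]
master_sequence = assign_equivalence_classes(features)
print(master_sequence)  # [0, 1, 0, 1, 2]
```

The labels carry no linguistic meaning. Two word units share a label only because their measured shapes are similar, which matches the claim's requirement that content not be determined beyond the morphological image characteristic.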
Abstract
Methods and apparatus for determining phrase frequency in an undecoded document text image without first converting the document to character codes. The method includes segmenting of the document image into word units without document image decoding, and morphological image processing to determine word unit characteristics for placement into equivalence classes utilizing non-content based information. All of the possible sequences of selected word units in reading order in the document constituting phrases are mapped into a list of corresponding sequences of the associated equivalence class labels for each selected image unit in the phrase, and the corresponding equivalence class sequences are analyzed to determine the frequency of the phrases.
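The final step described in the abstract, mapping word sequences to label sequences and counting them, can be sketched as follows. This is a hedged illustration under stated assumptions: the master sequence is taken to be a list of integer labels, and the length bounds `min_len` and `max_len`, the threshold `min_count`, and both function names are hypothetical. The abstract speaks of all possible sequences; the sketch caps the length only to keep the example small.

```python
from collections import Counter
from typing import List, Sequence, Tuple

LabelSeq = Tuple[int, ...]


def count_label_subsequences(master: Sequence[int],
                             min_len: int = 2,
                             max_len: int = 3) -> Counter:
    """Count every contiguous label sub-sequence of length min_len..max_len."""
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        for start in range(len(master) - n + 1):
            counts[tuple(master[start:start + n])] += 1
    return counts


def frequent_phrases(master: Sequence[int],
                     min_count: int = 2) -> List[Tuple[LabelSeq, int]]:
    """Return sub-sequences seen at least min_count times.

    Ordered by descending count, then longer sequences first, then by label.
    """
    counts = count_label_subsequences(master)
    hits = [(seq, c) for seq, c in counts.items() if c >= min_count]
    return sorted(hits, key=lambda item: (-item[1], -len(item[0]), item[0]))


# A made-up master sequence of equivalence class labels in reading order.
master = [0, 1, 2, 0, 1, 3, 0, 1, 2]
print(frequent_phrases(master))
# [((0, 1), 3), ((0, 1, 2), 2), ((1, 2), 2)]
```

Each reported tuple identifies a recurring phrase by the shape classes of its words. Locating the corresponding image regions for output would require keeping each word unit's position alongside its label, which the sketch omits.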
19 Claims
1. A method for determining a frequency of occurrence of significant word sequences in an undecoded electronic document text image, comprising the steps of:
segmenting the document image into word units;
determining at least one significant morphological image characteristic of selected word units in the document image;
identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label;
equating the equivalence class labels to said selected word units arranged in the order in which the selected word units appear in the document image to form a master-sequence of equivalence class labels, said master-sequence including the equivalence class labels of the selected word units in the document image arranged in the order in which the selected word units appear in the document image, said master-sequence being comprised of sub-sequences;
evaluating said equivalence class label sub-sequences to determine the frequency of each equivalence class label sub-sequence, and
outputting to an optical or electrical output device a list of significant phrases corresponding to the equivalence class label sub-sequences without having determined their content beyond the at least one significant morphological image characteristic.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. An apparatus for processing a digital image of text on a document to determine the frequency of word phrases in the text, comprising:
means for segmenting the digital image into word units;
means for determining at least one morphological characteristic of selected ones of said word units;
means for identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label;
means for equating the equivalence class labels to said selected word units arranged in the order in which the selected word units appear in the document image to form a master-sequence of equivalence class labels, said master-sequence including the equivalence class labels of the selected word units in the document image arranged in the order in which the selected word units appear in the document image, said master-sequence being comprised of sub-sequences; and
means for classifying said sub-sequences of equivalence class labels to determine the frequency of each equivalence class label sub-sequence; and
an output device for producing an output responsive to the relative frequencies of occurrence of the selected equivalence class label sub-sequences which correspond to phrases, wherein informational content of the selected equivalence class label sub-sequences has not been determined beyond the at least one morphological image characteristic.
Dependent claims: 16, 17
18. A method for determining a frequency of occurrence of significant word sequences in an undecoded electronic document text image, comprising the steps of:
segmenting the document image into word units;
determining at least one significant morphological image characteristic of selected word units in the document image;
identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label, said identifying step including comparing word unit shape representations of said selected word units, said word unit shape representations being determined by deriving an image function defining a boundary enclosing the selected word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying a character or characters making up the word unit;
determining the sequences of equivalence class labels corresponding to all sequences of said selected word units arranged in the order in which the selected word units appear in the document image; and
evaluating said equivalence class label sequences to determine the frequency of each equivalence class label sequence.
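One way to read the word-shape step in this claim is as a contour function of the horizontal position alone. The sketch below is an interpretation offered for illustration, not the method the patent specifies: it assumes a word unit is a small binary bitmap (rows of 0/1, with 1 meaning ink) cropped to its bounding box, takes the upper edge of the ink in each column, and "augments" the function by assigning columns with no ink a fill value so that every column has a defined value. The names `upper_contour` and `contour_distance`, the fill value, and the mean-absolute-difference comparison are all assumptions.

```python
from typing import List, Sequence

Bitmap = Sequence[Sequence[int]]  # rows of 0/1 pixels; 1 means ink


def upper_contour(bitmap: Bitmap) -> List[int]:
    """Row index of the topmost ink pixel in each column.

    Columns with no ink (for example gaps between characters) receive the
    bitmap height as a fill value, so the contour is defined for every x.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    contour: List[int] = []
    for x in range(width):
        top = next((y for y in range(height) if bitmap[y][x]), height)
        contour.append(top)
    return contour


def contour_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute difference between two equal-length contours."""
    if len(a) != len(b):
        raise ValueError("contours must have the same length")
    if not a:
        return 0.0
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


word = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
]
print(upper_contour(word))  # [1, 0, 3, 1]
```

A small `contour_distance` between two word units would then serve as the similarity test for placing them in the same equivalence class. Real word images differ in width, so a practical comparison would first need to normalize or align the contours, which this sketch leaves out.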
19. An apparatus for processing a digital image of text on a document to determine the frequency of word phrases in the text, comprising:
means for segmenting the digital image into word units;
means for determining at least one morphological characteristic of selected ones of said word units, said means for determining including means for deriving an image function defining a boundary enclosing the word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying the character or characters making up the word unit;
means for identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label;
means for determining the sequences of equivalence class labels corresponding to all sequences of said selected word units arranged in the order in which the selected word units appear in the document image; and
means for classifying said sequences of equivalence class labels to determine the frequency of each equivalence class label sequence; and
an output device for producing an output responsive to the relative frequencies of occurrence of the selected equivalence class label sequences.
Specification