Method and apparatus for determining the frequency of phrases in a document without document image decoding

US 5,369,714 A
Filed: 11/19/1991
Issued: 11/29/1994
Est. Priority Date: 11/19/1991
Status: Expired due to Term

Abstract

Methods and apparatus for determining phrase frequency in an undecoded document text image without first converting the document to character codes. The method includes segmenting the document image into word units without document image decoding, and morphological image processing to determine word unit characteristics, which are used to place the word units into equivalence classes using non-content-based information. All possible sequences of selected word units in reading order that constitute phrases are mapped to corresponding sequences of the equivalence class labels associated with each selected image unit in the phrase, and these equivalence class label sequences are analyzed to determine the frequency of the phrases.
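
The frequency analysis described in the abstract can be illustrated with a minimal sketch. It assumes the master sequence of equivalence class labels is already available as a list of integers, and it treats every contiguous run of 2 to 4 labels as a candidate phrase. The function name, the length bounds, and the sample labels are hypothetical and do not come from the patent.

```python
from collections import Counter


def count_label_subsequences(master_sequence, min_len=2, max_len=4):
    """Count every contiguous sub-sequence of equivalence class labels.

    min_len and max_len are illustrative bounds on phrase length (assumed).
    """
    counts = Counter()
    n = len(master_sequence)
    for length in range(min_len, max_len + 1):
        for start in range(n - length + 1):
            window = tuple(master_sequence[start:start + length])
            counts[window] += 1
    return counts


# Hypothetical master sequence: each integer labels one equivalence class,
# listed in the order the word units appear in the document image.
master = [3, 7, 1, 3, 7, 2, 3, 7, 1]
counts = count_label_subsequences(master)

# Label sequences that occur more than once are candidate recurring phrases.
repeated = {seq: c for seq, c in counts.items() if c > 1}
```

No word is ever decoded here: a recurring phrase is recognized only as a recurring run of class labels.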

19 Claims

1. A method for determining a frequency of occurrence of significant word sequences in an undecoded electronic document text image, comprising the steps of:
- segmenting the document image into word units;
  
  determining at least one significant morphological image characteristic of selected word units in the document image;
  
  identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label;
  
  equating the equivalence class labels to said selected word units arranged in the order in which the selected word units appear in the document image to form a master-sequence of equivalence class labels, said master-sequence including the equivalence class labels of the selected word units in the document image arranged in the order in which the selected word units appear in the document image, said master-sequence being comprised of sub-sequences;
  
  evaluating said equivalence class label sub-sequences to determine the frequency of each equivalence class label sub-sequence; and
  
  outputting to an optical or electrical output device a list of significant phrases corresponding to the equivalence class label sub-sequences without having determined their content beyond the at least one significant morphological image characteristic.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
  - 2. The method of claim 1, wherein said step of identifying equivalence classes of selected word units comprises correlating word unit morphological image characteristics using a decision network.
  - 3. The method of claim 1, wherein said step of identifying equivalence classes comprises comparing word unit shape representations of said selected word units.
  - 4. The method of claim 3 wherein said word unit shape representations are determined by deriving at least one, one-dimensional signal characterizing the shape of the word unit.
  - 5. The method of claim 3 wherein said word unit shape representations are determined by deriving an image function defining a boundary enclosing the selected word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying a character or characters making up the word unit.
  - 6. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a dimension of said selected image units.
  - 7. The method of claim 3 wherein said comparison of said word shape representations compares only length and height dimensions of said word shape representations.
  - 8. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a font of said selected word units.
  - 9. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a typeface of said selected word units.
  - 10. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a number of ascender elements of said selected word units.
  - 11. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a number of descender elements of said selected word units.
  - 12. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a pixel density of said selected word units.
  - 13. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a pixel cross-sectional characteristic of said selected word units.
  - 14. The method of claim 1, wherein said step of determining at least one significant morphological image characteristic of said word units comprises determining a contour characteristic of said selected word units.
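
The clustering and labeling steps of claim 1, combined with the dimension and pixel density characteristics named in claims 6, 7, and 12, might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: word units are assumed to be small binary bitmaps (lists of rows of 0/1), the greedy first-match clustering rule is a stand-in for whatever clustering the specification uses, and the tolerance values are hypothetical.

```python
def features(bitmap):
    """Return (width, height, pixel density) of a binary word-unit bitmap."""
    height = len(bitmap)
    width = len(bitmap[0])
    ink = sum(sum(row) for row in bitmap)
    return (width, height, ink / (width * height))


def assign_labels(word_units, size_tol=1, density_tol=0.05):
    """Greedy clustering (an assumed, simplified rule).

    A word unit joins the first equivalence class whose representative
    features are within tolerance; otherwise it starts a new class.
    Returns the master sequence of labels in reading order.
    """
    representatives = []  # one feature tuple per equivalence class
    master_sequence = []
    for bitmap in word_units:
        f = features(bitmap)
        for label, rep in enumerate(representatives):
            if (abs(f[0] - rep[0]) <= size_tol
                    and abs(f[1] - rep[1]) <= size_tol
                    and abs(f[2] - rep[2]) <= density_tol):
                master_sequence.append(label)
                break
        else:
            representatives.append(f)
            master_sequence.append(len(representatives) - 1)
    return master_sequence


# Hypothetical word-unit bitmaps, listed in reading order.
a = [[1, 1, 1, 1], [1, 0, 0, 1]]  # 4 wide, 2 high, density 0.75
b = [[1, 1], [1, 1]]              # 2 wide, 2 high, density 1.0
c = [[1, 1, 1, 1], [1, 0, 1, 1]]  # 4 wide, 2 high, density 0.875
labels = assign_labels([a, b, a, b, c])
```

Features this coarse can place different words in the same class, which is presumably why the claims also list richer characteristics such as contours and ascender or descender counts.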

15. An apparatus for processing a digital image of text on a document to determine the frequency of word phrases in the text, comprising:
- means for segmenting the digital image into word units;
  
  means for determining at least one morphological characteristic of selected ones of said word units;
  
  means for identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label;
  
  means for equating the equivalence class labels to said selected word units arranged in the order in which the selected word units appear in the document image to form a master-sequence of equivalence class labels, said master-sequence including the equivalence class labels of the selected word units in the document image arranged in the order in which the selected word units appear in the document image, said master-sequence being comprised of sub-sequences;
  
  means for classifying said sub-sequences of equivalence class labels to determine the frequency of each equivalence class label sub-sequence; and
  
  an output device for producing an output responsive to the relative frequencies of occurrence of the selected equivalence class label sub-sequences which correspond to phrases, wherein informational content of the selected equivalence class label sub-sequences has not been determined beyond the at least one morphological image characteristic.
- Dependent claims (16, 17)
  - 16. The apparatus of claim 15 wherein said morphological image characteristic determining means comprises means for deriving at least one, one-dimensional signal characterizing the shape of the word unit.
  - 17. The apparatus of claim 15, wherein said morphological image characteristic determining means comprises means for deriving an image function defining a boundary enclosing the word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying the character or characters making up the word unit.
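
Claim 16 recites a one-dimensional signal characterizing the shape of a word unit without saying which signal. As one possible reading, the sketch below derives an upper contour: for each pixel column, the row of the topmost ink pixel. The choice of contour, the function names, and the distance measure are assumptions made for illustration only.

```python
def upper_contour(bitmap):
    """For each column, return the row index of the topmost ink pixel.

    Columns with no ink get the bitmap height, so the signal is defined
    over every column of the word unit's bounding box.
    """
    height = len(bitmap)
    width = len(bitmap[0])
    signal = []
    for x in range(width):
        top = height
        for y in range(height):
            if bitmap[y][x]:
                top = y
                break
        signal.append(top)
    return signal


def signal_distance(a, b):
    """Mean absolute difference between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Two hypothetical word-unit bitmaps that differ by one ascender pixel.
word = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
]
other = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
]
sig_word = upper_contour(word)
sig_other = upper_contour(other)
distance = signal_distance(sig_word, sig_other)
```

Two word units could then be placed in the same equivalence class when this distance falls below a chosen threshold. Comparing signals of different lengths would need resampling or alignment, which this sketch omits.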

18. A method for determining a frequency of occurrence of significant word sequences in an undecoded electronic document text image, comprising the steps of:
- segmenting the document image into word units;
  
  determining at least one significant morphological image characteristic of selected word units in the document image;
  
  identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label, said identifying step including comparing word unit shape representations of said selected word units, said word unit shape representations being determined by deriving an image function defining a boundary enclosing the selected word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying a character or characters making up the word unit;
  
  determining the sequences of equivalence class labels corresponding to all sequences of said selected word units arranged in the order in which the selected word units appear in the document image; and
  
  evaluating said equivalence class label sequences to determine the frequency of each equivalence class label sequence.
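
The segmenting step that opens claim 18 is not detailed in the claim text. A common simplification, assumed here rather than taken from the patent, is to project a single deskewed text line onto the horizontal axis and split it wherever a run of blank columns is wide enough to be an inter-word gap. The function name and the `min_gap` threshold are hypothetical.

```python
def segment_line_into_word_units(line_bitmap, min_gap=2):
    """Split one text line into word units by blank-column gaps.

    line_bitmap is a list of rows of 0/1 pixels. Returns a list of
    (start, end) column ranges, with end exclusive. Gaps narrower than
    min_gap columns are treated as spacing between characters.
    """
    width = len(line_bitmap[0])
    has_ink = [any(row[x] for row in line_bitmap) for x in range(width)]

    units = []
    start = None  # first column of the word unit being scanned
    gap = 0       # blank columns seen since the last ink column
    for x, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The last ink column was x - gap; end is exclusive.
                units.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the final unit, trimming any trailing blank columns.
        units.append((start, width - gap))
    return units


# Hypothetical one-row text line: a one-column gap stays inside a word,
# a two-column gap separates word units.
line = [[1, 1, 0, 1, 0, 0, 1, 1, 1, 0]]
units = segment_line_into_word_units(line)
```

Each column range can be cropped out of the line to give a word-unit bitmap for the later characteristic and clustering steps.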

19. An apparatus for processing a digital image of text on a document to determine the frequency of word phrases in the text, comprising:
- means for segmenting the digital image into word units;
  
  means for determining at least one morphological characteristic of selected ones of said word units, said means for determining including means for deriving an image function defining a boundary enclosing the word unit, and augmenting the image function so that an edge function representing edges of a character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying the character or characters making up the word unit;
  
  means for identifying equivalence classes of the selected word units in the document image by clustering the ones of the selected word units with similar morphological image characteristics, each equivalence class being assigned a label;
  
  means for determining the sequences of equivalence class labels corresponding to all sequences of said selected word units arranged in the order in which the selected word units appear in the document image;
  
  means for classifying said sequences of equivalence class labels to determine the frequency of each equivalence class label sequence; and
  
  an output device for producing an output responsive to the relative frequencies of occurrence of the selected equivalence class label sequences.

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Withgott, M. Margaret; Rao, Ramana R.
Primary Examiner(s): Boudreau, Leo H.
Assistant Examiner(s): Prikockis, Larry J.
Application Number: US07/794,555
Time in Patent Office: 1,106 Days
Field of Search: 382/9, 382/40, 382/36, 382/18, 382/25, 381/43
US Class Current: 382/177
CPC Class Codes:
G06V 30/10 Character recognition
G06V 30/262 using context analysis, e.g...
