Method and apparatus for character recognition accommodating diacritical marks

US 4,611,346 A
Filed: 09/29/1983
Issued: 09/09/1986
Est. Priority Date: 09/29/1983
Status: Expired due to Fees

Abstract

A method and apparatus of processing data is disclosed for recognizing unknown characters of a known character set, some of the characters having diacritical marks. The method includes the steps of storing the image data of an unknown character which may contain a diacritical mark. From the stored image data a predetermined localized area of data is extracted that corresponds to the expected location of the diacritical mark. The extracted diacritical mark image data and at least a portion of the stored image data of the unknown character are examined to recognize the character and any diacritical mark associated therewith. Also disclosed are video preprocessing techniques for segmenting the characters using profiles thereof, inclusive-bit-coding to separate characters based upon differences in size, justification of the extracted diacritical mark image data, unique encoding of the recognition results, and post-processing verification for characters including diacritical marks.
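
The extraction step described in the abstract can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a character is stored as rows of 0/1 pixels and that the mark is expected in a fixed band of rows at the top of the character cell. The names `extract_diacritic_zone`, `has_ink`, and `zone_rows` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, row 0 at the top (assumed layout)


def extract_diacritic_zone(char: Bitmap, zone_rows: int) -> Tuple[Bitmap, Bitmap]:
    """Copy the top `zone_rows` rows as the expected diacritic area.

    The stored character image is returned unchanged, so the examining
    step can still use the non-extracted data.
    """
    zone = [row[:] for row in char[:zone_rows]]
    return zone, char


def has_ink(region: Bitmap) -> bool:
    """True if any pixel in the region is set."""
    return any(any(row) for row in region)


# Illustrative character: two dots above a small body.
sample = [
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
zone, stored = extract_diacritic_zone(sample, zone_rows=2)
```

Because the zone is a copy, later processing of the extracted data does not alter the stored image of the whole character.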

174 Citations

18 Claims

1. A method of processing data for recognizing unknown characters of a known character set, some of the characters having diacritical marks associated therewith, said method comprising the steps of:
- storing the image data representing an entire unknown character, including any overlapping or non-overlapping diacritical marks associated therewith;
  
  segmenting the stored image data to represent individual unknown characters including any diacritical mark associated therewith;
  
  extracting from the stored image data that portion of the image data representing a predetermined localized area of the unknown character corresponding to the expected location of a diacritical mark;
  
  classifying the segmented image data to provisionally distinguish larger characters which may include a diacritical mark from smaller characters which may not include a diacritical mark;
  
  examining the extracted diacritical mark image data and at least a portion of the non-extracted stored image data of the unknown character with the provisional distinction between the larger and smaller characters to recognize the unknown character and any diacritical mark associated therewith.
- Dependent claims (2, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The method of claim 1 wherein said step of extracting from the stored image data the image data contained in a predetermined localized area includes justifying the extracted data and separately storing the extracted and justified data.
  - 4. The method of claim 1 wherein the step of segmenting the stored image data includes generating from the stored image data a profile representing the unknown characters and any diacritical marks associated therewith, and separating the profile into segments, each segment representing an unknown character including any diacritical mark associated therewith.
  - 5. The method of claim 4 wherein the step of separating the profile into segments representing the unknown characters including any diacritical marks associated therewith includes:
    - (a) testing the profile for natural segmentation points at predetermined initial intervals and establishing as separations the natural segmentation points that coincide with the predetermined intervals;
      
      (b) expanding controlled amounts those predetermined initial intervals not coinciding with natural segmentation points, testing the profile for natural segmentation points within the expanded intervals, and establishing as separations the natural segmentation points that coincide with the expanded intervals;
      
      (c) contracting controlled amounts those predetermined initial intervals and expanded intervals not coinciding with natural segmentation points, testing the profile for natural segmentation points within the contracted intervals, and establishing as separations the natural segmentation points that coincide with the contracted intervals; and
      
      (d) fixing as a separation for profiles not previously divided at a natural segmentation point the predetermined initial interval, whereby the individual portions of the profile define the size of the separated individual segments of the unknown characters.
  - 6. The method of claim 4 wherein the step of generating a profile comprises generating a profile parallel to the reading line, and the step of separating the profile comprises separating the profile into segments with each segment representing the relative width of an unknown character.
  - 7. The method of claim 6 wherein the step of generating a profile parallel to the reading line comprises horizontally scanning horizontally read lines of characters to generate lines of data and logically combining the lines of data with an OR function so that the logically combined data corresponds to character widths and horizontal separations between characters.
  - 8. The method of claim 1 wherein said step of classifying the segmented image data comprises generating an inclusive-bit-encoded word representative of the size of the character, and testing a given bit in said word to distinguish larger characters from smaller characters.
  - 9. The method of claim 1 wherein the step of examining the extracted diacritical mark image data comprises testing the unnormalized diacritical mark image data to recognize the diacritical mark.
  - 10. The method of claim 1 wherein the step of examining at least a portion of the non-extracted stored image data of the unknown character comprises testing the normalized image data to recognize the base character.
  - 11. The method of claim 1 further including the final step of verifying that any recognized diacritical mark is associated with a character that may properly include a diacritical mark.
  - 12. The method of claim 1 wherein the step of examining the extracted diacritical mark image data comprises testing the unnormalized diacritical mark image data to recognize the diacritical mark and the step of examining at least a portion of the non-extracted stored image data of the unknown character comprises testing the normalized image data to recognize the base character.
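
The profile generation of claim 7 and the interval-based separation of claim 5 can be sketched in Python as follows. This is a simplified reading, not the patented implementation: it assumes a "natural segmentation point" is a blank column in the OR-combined profile, treats the end of the line as blank, and handles one character at a time rather than in the passes the claim describes. The names `horizontal_profile`, `is_gap`, `next_cut`, `segment`, `pitch`, and `max_adjust` are all hypothetical.

```python
from typing import List, Tuple


def horizontal_profile(rows: List[List[int]]) -> List[int]:
    """OR all scan rows together: 1 where any row has ink in that column."""
    profile = [0] * len(rows[0])
    for row in rows:
        for x, bit in enumerate(row):
            profile[x] |= bit
    return profile


def is_gap(profile: List[int], pos: int) -> bool:
    """A blank column, or any position past the end of the line."""
    return pos >= len(profile) or profile[pos] == 0


def next_cut(profile: List[int], start: int, pitch: int, max_adjust: int) -> int:
    nominal = start + pitch
    # (a) a gap exactly at the predetermined interval
    if is_gap(profile, nominal):
        return min(nominal, len(profile))
    # (b) expand the interval by controlled amounts
    for d in range(1, max_adjust + 1):
        if is_gap(profile, nominal + d):
            return min(nominal + d, len(profile))
    # (c) contract the interval by controlled amounts
    for d in range(1, max_adjust + 1):
        pos = nominal - d
        if pos > start and is_gap(profile, pos):
            return pos
    # (d) no natural point found: fall back to the initial interval
    return nominal


def segment(profile: List[int], pitch: int, max_adjust: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive."""
    segments = []
    start = 0
    while start < len(profile):
        cut = next_cut(profile, start, pitch, max_adjust)
        segments.append((start, cut))
        start = cut
    return segments


rows = [
    [1, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 1],
]
profile = horizontal_profile(rows)   # [1, 1, 1, 0, 1, 1, 1, 1]
print(segment(profile, pitch=4))     # [(0, 3), (3, 8)]
```

With a pitch of 4 the first cut is found by contracting to the blank column 3, and the second by expanding to the end of the line. A profile with no blank columns falls through to case (d) and is cut at the fixed pitch.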

3. The method of claim 2 wherein the step of justifying the extracted diacritical mark image data comprises justifying said data in a direction away from the unknown character.
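
One way to picture justification "away from the unknown character" is the Python sketch below. It assumes the mark sits above the base character, so "away" means upward, and that the extracted area is a list of 0/1 pixel rows. The function name `justify_up` and this data layout are illustrative assumptions, not details taken from the patent.

```python
from typing import List

Bitmap = List[List[int]]  # rows of 0/1 pixels, row 0 at the top (assumed layout)


def justify_up(zone: Bitmap) -> Bitmap:
    """Shift the zone so its first inked row lands on row 0.

    Rows vacated at the bottom are filled with blanks, so the zone keeps
    its original height. A blank zone is returned as an unchanged copy.
    """
    first = next((i for i, row in enumerate(zone) if any(row)), None)
    if first is None:
        return [row[:] for row in zone]
    width = len(zone[0])
    shifted = [row[:] for row in zone[first:]]
    shifted += [[0] * width for _ in range(first)]
    return shifted


zone = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
]
print(justify_up(zone))  # [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
```

After justification the mark occupies the same rows regardless of how far above the base character it was printed, which presumably simplifies testing the mark data without normalizing it.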

13. A method of processing data for recognizing unknown characters of a known character set, the unknown characters being represented by scan data representing an entire unknown character including any diacritical marks associated therewith, said method comprising the steps of:
- storing the image data representing an entire unknown character, including any overlapping or non-overlapping diacritical marks associated therewith;
  
  generating from the stored image data profiles parallel to the reading line representing the unknown characters and any diacritical marks associated therewith, and separating the profiles into segments, each segment representing the relative width of an unknown character including any diacritical mark associated therewith;
  
  extracting from the stored unknown character image data that portion of the image data contained in a predetermined localized area of the unknown character corresponding to the expected location of a diacritical mark;
  
  classifying the segmented profiles and associated character data by generating an inclusive-bit-encoded word representative of the size of the character and testing a given bit in said word to provisionally distinguish larger characters which may include a diacritical mark from smaller characters which may not include a diacritical mark;
  
  justifying the extracted diacritical mark image data;
  
  examining the justified diacritical mark image data and at least a portion of the non-extracted stored image data of the unknown character with the provisional distinction between the larger and smaller characters to recognize the unknown character and any diacritical mark associated therewith.
- Dependent claims (14, 15, 16, 17)
- - 14. The method of claim 13 wherein the step of separating the profile into segments representing the unknown characters including any diacritical marks associated therewith includes:
    - (a) testing the profile for natural segmentation points at predetermined initial intervals and establishing as separations the natural segmentation points that coincide with the predetermined intervals;
      
      (b) expanding controlled amounts those predetermined initial intervals not coinciding with natural segmentation points, testing the profile for natural segmentation points within the expanded intervals, and establishing as separations the natural segmentation points that coincide with the expanded intervals;
      
      (c) contracting controlled amounts those predetermined initial intervals and expanded intervals not coinciding with natural segmentation points, testing the profile for natural segmentation points within the contracted intervals, and establishing as separations the natural segmentation points that coincide with the contracted intervals; and
      
      (d) fixing as a separation for profiles not previously divided at a natural segmentation point the predetermined initial interval, whereby the individual portions of the profile define the size of the separated individual segments of the unknown characters.
  - 15. The method of claim 13 wherein the step of examining the extracted diacritical mark image data comprises testing the unnormalized diacritical mark image data to recognize the diacritical mark.
  - 16. The method of claim 13 wherein the step of examining at least a portion of the non-extracted stored image data of the unknown character comprises testing the normalized image data to recognize the base character.
  - 17. The method of claim 13 further including the final step of verifying that any recognized diacritical mark is associated with a character that may properly include a diacritical mark.
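
The verification step of claims 11 and 17 can be sketched as a table lookup in Python. The table `ALLOWED_MARKS`, the mark names, and the function `is_valid_combination` are hypothetical. The claims say only that a recognized mark must belong to a character that may properly include one, and do not say how that is checked or what happens on failure.

```python
from typing import Dict, Optional, Set

# Illustrative table only: which base characters may carry which marks.
ALLOWED_MARKS: Dict[str, Set[str]] = {
    "a": {"acute", "grave", "umlaut"},
    "o": {"acute", "umlaut"},
    "u": {"umlaut"},
    "n": {"tilde"},
}


def is_valid_combination(base: str, mark: Optional[str]) -> bool:
    """True if no mark was recognized, or the base may carry that mark."""
    if mark is None:
        return True
    return mark in ALLOWED_MARKS.get(base, set())


print(is_valid_combination("a", "umlaut"))  # True
print(is_valid_combination("t", "umlaut"))  # False
```

A caller could use a `False` result to reject the character or to discard the mark. That policy is left open here because the claims do not state it.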

18. Apparatus for recognizing unknown characters of a known character set, some of the characters having diacritical marks associated therewith, said apparatus comprising:
- means for storing the image data representing an entire unknown character including any overlapping or non-overlapping diacritical mark which may be associated therewith;
  
  means associated with said means for storing for segmenting the stored image data to represent individual unknown characters including any diacritical mark associated therewith;
  
  means associated with said means for storing the image of an entire unknown character for extracting from the stored image data that portion of the image data representing a predetermined localized area of the unknown character corresponding to the expected location of a diacritical mark;
  
  means associated with said means for segmenting for classifying the segmented image data to provisionally distinguish larger characters which may include a diacritical mark from smaller characters which may not include a diacritical mark;
  
  means associated with said means for extracting for examining the extracted diacritical mark image data and at least a portion of the non-extracted stored image data of the unknown character with the provisional distinction between the larger and smaller characters to recognize the unknown character and any diacritical mark associated therewith.
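
Claim 18 recites the classifying means only functionally. Claims 8 and 13 describe the corresponding step as generating an inclusive-bit-encoded word for the character size and testing a given bit. The Python sketch below shows one possible reading of that encoding, and it rests on an assumption the listing does not confirm: that bit k is set whenever the character height reaches threshold k, so each larger size class includes all the bits of the smaller ones. The thresholds and the names `SIZE_THRESHOLDS`, `inclusive_size_word`, and `may_have_diacritic` are hypothetical.

```python
# Hypothetical height thresholds in pixel rows, smallest first.
SIZE_THRESHOLDS = (4, 8, 12, 16)


def inclusive_size_word(height: int) -> int:
    """Set bit k when height >= SIZE_THRESHOLDS[k] (assumed encoding)."""
    word = 0
    for bit, threshold in enumerate(SIZE_THRESHOLDS):
        if height >= threshold:
            word |= 1 << bit
    return word


def may_have_diacritic(word: int, test_bit: int = 2) -> bool:
    """Provisional large/small split from a single bit test."""
    return bool(word & (1 << test_bit))


small = inclusive_size_word(10)   # 0b0011
large = inclusive_size_word(14)   # 0b0111
print(may_have_diacritic(small), may_have_diacritic(large))  # False True
```

Under this reading a single bit test is enough, because a set bit implies that every lower bit is also set. The result is only provisional, as the claims state, and the examining step makes the final decision.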

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bednar, Gregory M.; Narasimha, Manthri S.; Fryer, George B.
Primary Examiner(s): Boudreau, Leo H.
Application Number: US06/537,279
Time in Patent Office: 1,076 Days
Field of Search: 382/9, 382/16, 382/19, 382/23, 382/37
US Class Current: 382/174
CPC Class Codes:
G06F 18/00   Pattern recognition
G06V 10/10   Image acquisition document ...
G06V 10/70   using pattern recognition o...
G06V 30/10   Character recognition
G06V 30/148   Segmentation of character r...
G06V 30/15   Cutting or merging image el...
