Method for optical recognition of a multi-language set of letters with diacritics
First Claim
1. A method of identifying multi-language characters in an optical recognition system, comprising:
- digitizing a document having characters printed thereon,
- targeting each of the characters for recognition analysis,
- separating characters having a single component from characters having more than a single component,
- recognizing characters having a single component,
- segmenting characters having more than a single component into constituent components, said constituent components including a base component and at least one diacritic component forming a diacritic,
- recognizing the base component,
- recognizing the diacritic through analysis of the at least one diacritic component,
- determining through a match analysis whether the diacritic can be used in combination with the base component, and
- recognizing the combination of the diacritic and base component in response to a match.
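The separating and segmenting steps of the claim can be illustrated with a small sketch. It is not the patented implementation: it assumes a glyph is already digitized as a binary bitmap, uses a plain 8-connected flood fill to find components, and treats the largest component as the base, which is a hypothetical heuristic chosen only for illustration. The names `connected_components`, `split_base_and_diacritic`, and `GLYPH` are invented here.

```python
from collections import deque

def connected_components(bitmap):
    """Return a list of components, each a list of (row, col) foreground pixels.

    Pixels are grouped by 8-connectivity using a breadth-first flood fill.
    """
    if not bitmap:
        return []
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not bitmap[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def split_base_and_diacritic(bitmap):
    """Split a glyph into (base_component, diacritic_components).

    Assumption (not from the patent): the component with the most pixels
    is the base; every other component belongs to the diacritic.
    """
    comps = connected_components(bitmap)
    if len(comps) <= 1:
        # Single-component character: nothing to segment.
        return (comps[0] if comps else []), []
    base = max(comps, key=len)
    diacritics = [comp for comp in comps if comp is not base]
    return base, diacritics

# A toy "u with two dots" glyph; "X" marks a foreground pixel.
GLYPH = [
    "X...X",
    ".....",
    "X...X",
    "X...X",
    "XXXXX",
]
bitmap = [[ch == "X" for ch in row] for row in GLYPH]
base, marks = split_base_and_diacritic(bitmap)
print(len(base), len(marks))
```

In this toy glyph the blank second row keeps the two dots apart from the body, so the flood fill finds three components: the body becomes the base and the two dots are handed on as diacritic components.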
Abstract
The present invention is a method for recognizing non-English alpha characters that contain diacritics. An image analysis separates the character into its constituent components. The one or more diacritic components are then distinguished and isolated from the base portion of the character. Optical recognition is performed separately on the base portion. The diacritic is recognized through special image analysis and pattern recognition algorithms. The image analysis extracts geometric information from the one or more diacritic components. The extracted information is used as input for the pattern recognition algorithms. The output is a code that corresponds to a particular diacritic. The recognized base portion and diacritic are combined and a check is performed for acceptable combinations in a chosen language. By separately recognizing the base portion and diacritic, the character sets used by the recognizer can be narrowed, resulting in greater recognition accuracy.
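The diacritic recognition and combination check described in the abstract might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the two geometric features (component count and bounding-box aspect ratio), the rule-based classifier standing in for the pattern recognition algorithms, the use of Unicode combining marks as the output code, and the tiny German table `ALLOWED`. The abstract does not specify any of these.

```python
import unicodedata

def bounding_box(pixels):
    """Return (top, left, bottom, right) for a list of (row, col) pixels."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)

def geometric_features(diacritic_components):
    """Extract toy geometric features from the diacritic components."""
    all_pixels = [p for comp in diacritic_components for p in comp]
    top, left, bottom, right = bounding_box(all_pixels)
    height = bottom - top + 1
    width = right - left + 1
    return {"count": len(diacritic_components), "aspect": width / height}

def classify_diacritic(features):
    """Map features to a combining-mark code (hypothetical rules)."""
    if features["count"] == 2:
        return "\u0308"  # COMBINING DIAERESIS
    if features["count"] == 1 and features["aspect"] >= 2.0:
        return "\u0304"  # COMBINING MACRON
    return None          # unknown diacritic

# Hypothetical table of acceptable composed characters per language.
ALLOWED = {"de": set("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc")}  # ä ö ü Ä Ö Ü

def combine(base_char, mark, language):
    """Compose base + diacritic and accept it only if the language allows it."""
    if mark is None:
        return None
    composed = unicodedata.normalize("NFC", base_char + mark)
    return composed if composed in ALLOWED.get(language, set()) else None

# Two single-pixel dots sitting side by side above a base character.
features = geometric_features([[(0, 0)], [(0, 4)]])
mark = classify_diacritic(features)
print(combine("u", mark, "de"))  # accepted combination
print(combine("k", mark, "de"))  # rejected: not an allowed combination
```

Because the base recognizer only ever sees the base shape and the diacritic classifier only sees the marks, each works over a much smaller set of classes than a single recognizer covering every accented character, which is the narrowing the abstract refers to.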
18 Claims
18. A method of identifying multi-language characters in an optical recognition system, comprising:
- digitizing a document having characters printed thereon,
- targeting each of the characters for recognition analysis,
- separating characters having a single component from characters having more than a single component,
- recognizing characters having a single component,
- segmenting characters having more than a single component into constituent components, said constituent components including a base component and at least one diacritic component forming a diacritic,
- recognizing the base component,
- recognizing the diacritic through analysis of the at least one diacritic component,
- determining through a match analysis whether the diacritic can be used in combination with the base component,
- recognizing the combination of the diacritic and base component in response to a match,
- determining during the match analysis whether the base component is one of a plurality of commonly misrecognized base components,
- determining if the commonly misrecognized base component can be matched with the recognized diacritic,
- determining the base component that is commonly misrecognized as the commonly misrecognized base component, and
- matching the base component to the diacritic when the commonly misrecognized base component does not match with the diacritic.
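One possible reading of the additional steps in claim 18 is a fallback during the match analysis: when the recognized base does not combine with the diacritic, look up the base it is commonly confused with and try that instead. The sketch below follows that reading only. The confusion table `CONFUSABLE_BASES`, the allowed set `ALLOWED_DE`, and the function `resolve_with_fallback` are all hypothetical, and the claim language may support other interpretations.

```python
import unicodedata

# Hypothetical table: recognized base -> base it is commonly mistaken for.
CONFUSABLE_BASES = {"0": "O", "1": "I"}

# Hypothetical set of acceptable composed characters (German umlauts).
ALLOWED_DE = set("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc")  # ä ö ü Ä Ö Ü

def resolve_with_fallback(base_char, mark, allowed):
    """Match base + diacritic, retrying with a commonly confused base."""
    composed = unicodedata.normalize("NFC", base_char + mark)
    if composed in allowed:
        return composed
    # The recognized base failed the match; see whether it is a known
    # misrecognition of another base that does accept this diacritic.
    alternative = CONFUSABLE_BASES.get(base_char)
    if alternative is not None:
        retry = unicodedata.normalize("NFC", alternative + mark)
        if retry in allowed:
            return retry
    return None

DIAERESIS = "\u0308"
print(resolve_with_fallback("0", DIAERESIS, ALLOWED_DE))  # zero reinterpreted as "O"
print(resolve_with_fallback("u", DIAERESIS, ALLOWED_DE))  # direct match
print(resolve_with_fallback("x", DIAERESIS, ALLOWED_DE))  # no acceptable match
```

The diacritic acts as extra evidence here: a digit zero cannot carry a diaeresis in the allowed set, so the presence of the mark tips the decision toward the letter "O".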
Specification