Automatic language identification by stroke geometry analysis
First Claim
1. A computer automated method for identifying an unknown language used to create a document, including the steps of:
defining a set of training documents in a variety of known languages and formed from a variety of text styles;
forming black and white pixel images of text material defining said training documents and said document in said unknown language;
locating a plurality of seed black pixels from a region growing algorithm;
progressively locating black pixels having a selected relationship with said seed pixels to define a plurality of line stroke segments that connect to form a line stroke;
identifying black pixels to define a head and a tail black pixel for each said line stroke;
extracting point features from said line stroke segments, where the point features include a vertical position and slope of individual line stroke segments, and locally-averaged radius of curvature that are effective to characterize each of said languages;
forming feature profiles from said point features for an unknown language and each of said known languages; and
comparing said feature profile from said unknown language with each of said feature profiles from said known languages to identify one of said known languages that best represents said unknown language.
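The seed-pixel and stroke-tracing steps of the claim can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for black and 0 for white, and it grows each region over 8-connected neighbors. The function names `grow_strokes` and `head_and_tail` are hypothetical, and the choice of the leftmost pixel as head and the rightmost as tail is an assumption made for illustration; the claim itself does not say how the "selected relationship" or the head and tail pixels are defined.

```python
from collections import deque


def grow_strokes(image):
    """Group black pixels (value 1) into 8-connected regions.

    Each unvisited black pixel found in raster order serves as a seed;
    the region is grown by repeatedly adding adjacent black pixels.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    strokes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])  # (row, col) of the seed pixel
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            strokes.append(region)
    return strokes


def head_and_tail(region):
    """Assumed convention: head is the leftmost pixel, tail the rightmost.

    Ties are broken by row. Pixels are (row, col) tuples.
    """
    head = min(region, key=lambda p: (p[1], p[0]))
    tail = max(region, key=lambda p: (p[1], p[0]))
    return head, tail
```

In this sketch every connected group of black pixels is treated as one stroke; splitting a region into individual line stroke segments, as the claim describes, would need an additional tracing step that is not shown here.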
Abstract
A computer-implemented process identifies an unknown language used to create a document. A set of training documents is defined in a variety of known languages and formed from a variety of text styles. Black and white electronic pixel images are formed of text material forming the training documents and the document in the unknown language. A plurality of line strokes are defined from the black pixels and point features are extracted from the strokes that are effective to characterize each of the languages. Point features from the unknown language are compared with point features from the known languages to identify one of the known languages that best represents the unknown language.
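The profile-and-compare step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented method: it uses only segment slope as a point feature (the claim also lists vertical position and locally-averaged radius of curvature), represents a feature profile as a normalized histogram, and picks the known language with the smallest L1 distance. The names `slope_angles`, `feature_profile`, and `best_match`, the bin count, and the distance measure are all hypothetical choices.

```python
import math


def slope_angles(points):
    """Orientation in degrees, folded into [0, 180), of each segment
    between consecutive (x, y) points along a stroke."""
    angles = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angles.append(math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0)
    return angles


def feature_profile(angles, bins=6):
    """Normalized histogram of slope angles (assumed profile form)."""
    counts = [0] * bins
    width = 180.0 / bins
    for a in angles:
        # Clamp so a value at the upper edge falls in the last bin.
        counts[min(int(a // width), bins - 1)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def best_match(unknown, known):
    """Return the name of the known profile closest to `unknown`
    under L1 distance (an assumed similarity measure)."""
    def l1(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))
    return min(known, key=lambda name: l1(unknown, known[name]))
```

In practice the profiles for each known language would be accumulated over all strokes in the training documents, and the other point features would be binned in the same way, either in separate histograms or in a joint one.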
2 Claims
Specification