Method of automatic language identification for multi-lingual text recognition
First Claim
1. A method for automatically determining one or more languages associated with text in a bit-mapped image, comprising the steps of:
- segmenting the image into a plurality of word-token images;
- recognizing separate characters in said word-token images;
- joining the separate characters into groups presumably comprising words;
- forming at least one hypothesis about correspondence of a character group, presumably comprising a word, to a certain language; and
- accepting the hypothesis about correspondence of the character group, presumably comprising a word, to a certain language;
wherein said step of forming a hypothesis about correspondence of the character group, presumably comprising a word, to a certain language further comprises at least the following steps: defining a set of selected language models, and estimating correspondence of the word with lingual and non-lingual models.
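The hypothesis-forming and hypothesis-accepting steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy dictionaries, the digit-based non-lingual pattern, the binary scores, and the names `form_hypotheses` and `accept_hypothesis` are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass

# Hypothetical toy dictionaries standing in for the "lingual" models.
LANGUAGE_DICTIONARIES = {
    "english": {"the", "method", "image", "word"},
    "german": {"das", "verfahren", "bild", "wort"},
}

# Hypothetical "non-lingual" model: digits with common separators
# (numbers, dates, times), which belong to no particular language.
NON_LINGUAL_PATTERN = re.compile(r"[\d.,:/-]+")


@dataclass(frozen=True)
class Hypothesis:
    group: str    # character group presumably comprising a word
    model: str    # language name, or "non-lingual"
    score: float  # estimated correspondence, 0.0 to 1.0


def form_hypotheses(group, selected_languages):
    """Score one character group against each selected language model
    and against the non-lingual model."""
    token = group.lower()
    hypotheses = []
    for language in selected_languages:
        # Assumed scoring: 1.0 on a dictionary hit, 0.0 otherwise.
        score = 1.0 if token in LANGUAGE_DICTIONARIES[language] else 0.0
        hypotheses.append(Hypothesis(group, language, score))
    non_lingual = 1.0 if NON_LINGUAL_PATTERN.fullmatch(group) else 0.0
    hypotheses.append(Hypothesis(group, "non-lingual", non_lingual))
    return hypotheses


def accept_hypothesis(hypotheses, threshold=0.5):
    """Accept the best-scoring hypothesis, or none if all score too low."""
    best = max(hypotheses, key=lambda h: h.score)
    return best if best.score >= threshold else None
```

Under these assumptions, `accept_hypothesis(form_hypotheses("Verfahren", ["english", "german"]))` selects the German model, while a group such as `12/05` is accepted under the non-lingual model. A real system would presumably use graded scores rather than binary dictionary hits.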
Abstract
The disclosed invention uses a complex estimation-based approach to identify the languages of portions of a multi-lingual text recognized from a bit-mapped image. In addition to traditional steps such as document segmentation, the method comprises new ones, such as generating and testing hypotheses about the characters in the word tokens.
The method further includes defining a set of selected language models, estimating each word via the language models, defining a set of dictionaries for language selection, estimating the correspondence of the word with the chosen languages, and calculating a complex estimation for the word that takes into account most or all of the above-mentioned estimations.
The complex estimation may also include factors for the mutual correspondence of characters and/or words within the line and/or the text, the mutual geometric correspondence of characters within the word and/or the line, the linguistic correspondence of the word with its neighbors, and an estimate of how accurately the word-token image is reconstructed in the presence of distortion.
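One way to picture a complex estimation that uses "most or all" of the component estimations is a weighted average over whichever components are available. The sketch below is an assumption, not the formula from the specification: the function name `complex_estimation`, the component names, and the weights are hypothetical.

```python
def complex_estimation(estimates, weights):
    """Combine the available component estimates into one score.

    estimates: component name -> score in [0, 1]; components may be missing.
    weights:   component name -> non-negative weight (hypothetical values).
    """
    # Use only components that were actually estimated and have a weight.
    used = {name: value for name, value in estimates.items() if name in weights}
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        raise ValueError("no weighted component estimates available")
    # Renormalize so that a missing component does not lower the score.
    return sum(weights[name] * value for name, value in used.items()) / total_weight


# Hypothetical weights for three of the factors named in the abstract.
EXAMPLE_WEIGHTS = {"dictionary": 2.0, "geometry": 1.0, "neighbors": 1.0}
```

With these weights, a word that has a dictionary estimate of 1.0 and a geometry estimate of 0.5, but no neighbor estimate, is scored over the two available components only.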
203 Citations
13 Claims
Specification