Automatic language identification system for multilingual optical character recognition
Abstract
The disclosed invention utilizes a dictionary-based approach to identify languages within different zones in a multi-lingual document. As a first step, a document image is segmented into various zones, regions and word tokens, using suitable geometric properties. Within each zone, the word tokens are compared to dictionaries associated with various candidate languages, and the language that exhibits the highest confidence factor is initially identified as the language of the zone. Subsequently, each zone is further split into regions. The language for each region is then identified, using the confidence factors for the words of that region. For any language determination having a low confidence value, the previously determined language of the zone is employed to assist the identification process.
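The zone-level step described in the abstract can be illustrated with a minimal sketch. The toy dictionaries, the binary scoring rule (1.0 if the dictionary contains the word, otherwise 0.0), and the function names `word_confidences` and `zone_language` are assumptions made for illustration; they are not taken from the patent.

```python
# Hypothetical toy dictionaries; a real system would load full word lists.
DICTIONARIES = {
    "en": {"the", "cat", "is", "on", "table"},
    "fr": {"le", "chat", "est", "sur", "la", "table"},
}


def word_confidences(word, dictionaries=DICTIONARIES):
    """Return one confidence factor per language for a single word hypothesis.

    Assumption: binary scoring, 1.0 when the dictionary contains the word.
    """
    token = word.lower()
    return {
        lang: (1.0 if token in words else 0.0)
        for lang, words in dictionaries.items()
    }


def zone_language(words, dictionaries=DICTIONARIES):
    """Pick the language with the highest summed confidence over a zone."""
    totals = {lang: 0.0 for lang in dictionaries}
    for word in words:
        for lang, score in word_confidences(word, dictionaries).items():
            totals[lang] += score
    # Ties resolve to the first language in dictionary order.
    best = max(totals, key=totals.get)
    return best, totals
```

For example, `zone_language(["Le", "chat", "est", "sur", "la", "table"])` selects `"fr"`, because every word appears in the French toy dictionary while only "table" appears in the English one.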
16 Claims
1. A method for automatically determining one or more languages associated with text in a document, comprising the steps of:
segmenting the document into a plurality of word tokens;
forming at least one hypothesis of the characters in said word tokens;
defining a dictionary for each one of plural languages;
determining confidence factors with respect to said plural languages for said word hypotheses, which factors are based on whether the dictionary for a given language indicates whether a word hypothesis is found in that language;
defining a plurality of regions in the document, each of which contains at least one word;
determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and
clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.

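The last two steps of claim 1 (a per-region language confidence factor, then clustering of high-confidence regions into subzones) might be sketched as follows. Averaging the word factors, the 0.5 threshold, and grouping only consecutive regions in reading order are illustrative assumptions; the claim itself does not fix any of them, and the function names are hypothetical.

```python
from itertools import groupby


def region_confidence(word_scores):
    """Combine per-word confidence factors into one factor per language.

    word_scores: list of dicts mapping language -> confidence, one per word.
    Assumption: the region factor is the mean of the word factors.
    """
    langs = set().union(*word_scores)
    count = len(word_scores)
    return {
        lang: sum(scores.get(lang, 0.0) for scores in word_scores) / count
        for lang in langs
    }


def cluster_regions(region_scores, threshold=0.5):
    """Group consecutive regions that share a high-confidence language.

    Returns a list of (language or None, [region indices]) subzones.
    Regions whose best factor is below the threshold are labelled None.
    """
    labels = []
    for scores in region_scores:
        lang = max(scores, key=scores.get) if scores else None
        if lang is not None and scores[lang] < threshold:
            lang = None
        labels.append(lang)
    subzones = []
    for lang, group in groupby(enumerate(labels), key=lambda item: item[1]):
        subzones.append((lang, [index for index, _ in group]))
    return subzones
```

With three regions scored `{"en": 0.9, "fr": 0.1}`, `{"en": 0.8, "fr": 0.2}` and `{"en": 0.1, "fr": 0.7}`, the first two regions form an English subzone and the third a French one.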
9. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of:
defining at least one zone in the document which contains a plurality of words;
defining a dictionary for each one of plural languages;
for each word in the zone, determining a confidence factor with respect to each of said plural languages, which factor is based on whether the respective dictionaries contain the word;
identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone;
selecting a local region in the zone which contains at least one word;
identifying a region language for the local region, based upon the confidence factor associated with the words in the region;
determining whether the region language is the same as the zone language; and
segregating the local region from other regions in the zone if its region language is not the same as the zone language.

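The final comparison and segregation steps of claim 9 reduce to a simple partition once each region already carries a language label. The sketch below assumes the labels are plain strings and that "segregating" means collecting the indices of mismatching regions; `segregate_regions` is a hypothetical name.

```python
def segregate_regions(region_languages, zone_language):
    """Split region indices by whether they match the zone language.

    Returns (kept, segregated): indices of regions whose language equals
    the zone language, and indices of regions that differ from it.
    """
    kept, segregated = [], []
    for index, lang in enumerate(region_languages):
        if lang == zone_language:
            kept.append(index)
        else:
            segregated.append(index)
    return kept, segregated
```

For a zone identified as English with regions labelled `["en", "en", "fr", "en"]`, only the third region (index 2) is segregated.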
10. A method for automatically determining one or more languages associated with text in a document, comprising the steps of:
segmenting the document into a plurality of zones containing regions of word tokens;
forming at least one hypothesis of the characters in said word tokens;
defining a dictionary for each one of plural languages;
for each hypothesized word, determining which ones of said dictionaries contain the hypothesis for the word and determining a confidence value for each language;
identifying a zone language for each zone, based upon the confidence values associated with the words in the zone;
identifying a region language for each region, based upon the confidence values associated with the words in the region;
designating the zone language as the region language if the confidence values associated with the words in the region are not sufficiently high; and
clustering regions in a zone which have the same region language to form a subzone that is identified with a particular language.
Dependent claims: 11, 12, 13, 14.

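Claim 10 adds a fallback: a region inherits the zone language when its own confidence values are not sufficiently high. A minimal sketch of that rule follows; the 0.6 cutoff and the name `region_language` are assumptions, since the claim leaves "sufficiently high" undefined.

```python
def region_language(region_scores, zone_lang, min_confidence=0.6):
    """Choose a region's language, falling back to the zone language.

    region_scores: dict mapping language -> confidence value for the region.
    zone_lang: language previously identified for the enclosing zone.
    min_confidence: hypothetical cutoff for a "sufficiently high" value.
    """
    if not region_scores:
        return zone_lang
    best = max(region_scores, key=region_scores.get)
    if region_scores[best] < min_confidence:
        # Low-confidence region: reuse the zone-level decision.
        return zone_lang
    return best
```

A region scored `{"en": 0.3, "fr": 0.4}` inside an English zone is labelled `"en"`, whereas a region scored `{"en": 0.1, "fr": 0.9}` keeps its own result, `"fr"`.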
15. A method for automatically determining one or more languages associated with text in a document, comprising the steps of:
segmenting the document into a plurality of word tokens;
forming at least one hypothesis of the characters in said word tokens;
for each word hypothesis, determining a confidence factor that indicates whether the word is contained in each of said plural languages;
defining a plurality of regions in the document, each of which contains at least one word;
determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and
clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.

16. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of:
defining at least one zone in the document which contains a plurality of words;
for each word in the zone, determining a confidence factor that indicates whether the word is contained in each of said plural languages;
identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone;
selecting a local region in the zone which contains at least one word;
identifying a region language for the local region, based upon the confidence factor associated with the words in the region;
determining whether the region language is the same as the zone language; and
segregating the local region from other regions in the zone if its region language is not the same as the zone language.

Specification