Method and apparatus for automatic language determination of European script documents
First Claim
1. An automatic language determining apparatus for determining a language of a text portion of a document having a known script-type, comprising:
input means for inputting a digital data signal representative of the text portion of the document, the text portion being in an unknown language;
word token generating means for converting the digital data signal to a plurality of word tokens, each word token comprising at least one of a limited number of abstract-coded character classes, each abstract-coded character class representing a group of characters of the known script-type;
feature determining means for determining at least one word token occurrence value of word tokens occurring within the plurality of word tokens and corresponding to at least one predetermined word token; and
language determining means for determining the language of the text portion of the document based on the at least one word token occurrence value.
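The word token generation and occurrence counting recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the input is already plain characters, used here as a stand-in for the digital data signal of the text image. The class map (`A` for capitals, digits and ascenders, `g` for descenders, `i`, `j`, and `x` for everything else) is a hypothetical choice made only for illustration, since the claim requires no more than "a limited number" of classes. The function names `char_class`, `word_tokens` and `occurrence_values` are likewise illustrative.

```python
from collections import Counter

# Hypothetical class assignment, chosen for illustration only.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def char_class(ch):
    """Map one character to an abstract-coded character class."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch in ("i", "j"):
        return ch
    if ch.isalpha():
        return "x"  # simplification: all other letters, including accented ones
    return ""       # punctuation is dropped in this sketch


def word_tokens(text):
    """Convert whitespace-separated words into word tokens of class codes."""
    tokens = []
    for word in text.split():
        token = "".join(char_class(ch) for ch in word)
        if token:
            tokens.append(token)
    return tokens


def occurrence_values(tokens, predetermined):
    """Count how often each predetermined word token occurs."""
    counts = Counter(tokens)
    return {token: counts[token] for token in predetermined}


tokens = word_tokens("The dog")
print(tokens)  # ['AAx', 'Axg']
print(occurrence_values(tokens, ["AAx", "xA"]))
```

Many different words collapse to the same token under such a mapping, which is why the claim speaks of occurrence values for predetermined tokens rather than for recognized words.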
Abstract
An automatic language-determining apparatus automatically determines the particular European language of the text image of a document when the gross script-type is known to be, or is determined to be, a European script-type. A word token generating means generates word tokens from the text image. A feature determining means determines the frequency of appearance of word tokens of the text portion which correspond to predetermined word tokens. A language determining means converts the determined frequency of appearance rates to a point in a new coordinate space, then determines which predetermined region of the new coordinate space the point is closest to, to determine the language of the text portion.
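The frequency of appearance described in the abstract can be sketched as follows. This is an illustrative assumption, not the method of the specification: the rate is taken here to be the count of each predetermined word token divided by the total number of word tokens in the text portion, and the function name `appearance_rates` and the sample tokens are hypothetical.

```python
from collections import Counter


def appearance_rates(tokens, predetermined):
    """Return the assumed frequency-of-appearance rate per predetermined token.

    Assumption: rate = occurrences of the token / total word tokens.
    """
    total = len(tokens)
    if total == 0:
        # No tokens in the text portion: every rate is zero.
        return {token: 0.0 for token in predetermined}
    counts = Counter(tokens)
    return {token: counts[token] / total for token in predetermined}


# Hypothetical word tokens from a short text portion.
sample = ["AAx", "xA", "AAx", "xx"]
print(appearance_rates(sample, ["AAx", "xA"]))  # {'AAx': 0.5, 'xA': 0.25}
```

Normalizing by the total token count makes rates from text portions of different lengths comparable.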
8 Claims
3. An automatic language determining apparatus for determining a language of a text portion of a document having a known script-type, comprising:
input means for inputting a digital data signal representative of the text portion of the document, the text portion being in an unknown language;
word token generating means for converting the digital data signal to a plurality of word tokens, each word token comprising at least one of a limited number of abstract-coded character classes, each abstract-coded character class representing a group of characters of the known script-type of the document;
feature determining means for determining at least one word token occurrence value of word tokens occurring within the plurality of word tokens and corresponding to at least one predetermined word token; and
language determining means for determining the language of the document based on the at least one word token occurrence value, said language determining means comprising:
means for determining frequency-of-occurrence-rates for word tokens within the text portion for each at least one predetermined word token from the word token occurrence value;
means for converting the determined frequency rates to a point in a coordinate space; and
means for determining a closest one of a plurality of predetermined regions within the coordinate space to the point, each predetermined region having a corresponding language, the language corresponding to the closest region being determined as the language of the text portion.
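The last two means of this claim, converting the rates to a point and finding the closest predetermined region, can be sketched in Python under stated assumptions. The claim does not fix the conversion or the distance measure. This sketch assumes a linear projection, represents each region by a single centroid, and uses Euclidean distance. The projection matrix, the centroids and the language labels below are invented values for illustration only.

```python
import math


def to_point(rates, projection):
    """Project a vector of frequency rates to a point (assumed linear map)."""
    return tuple(
        sum(weight * rate for weight, rate in zip(row, rates))
        for row in projection
    )


def closest_language(point, regions):
    """Return the language whose region centroid is nearest to the point."""
    return min(regions, key=lambda language: math.dist(point, regions[language]))


# Hypothetical values: three token rates projected into a 2-D coordinate space.
projection = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
regions = {
    "English": (0.20, 0.10),
    "French": (0.05, 0.30),
    "German": (0.00, 0.00),
}

point = to_point([0.2, 0.1, 0.05], projection)
print(closest_language(point, regions))  # English
```

A region with extent, rather than a single centroid, would need a distance-to-region function in place of `math.dist`; the centroid form is used here only to keep the example short.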
6. A method for automatically determining a language of a European script-type document, comprising the steps of:
converting characters of a text portion of the document to word tokens of an abstract character code to form a converted text portion;
determining for each of at least one predetermined word token, a number of occurrences of each predetermined word token within the converted text portion;
determining a frequency of occurrence rate for each at least one predetermined word tokens within the converted text portion;
converting the frequency of occurrence rates to a point in a coordinate space; and
determining the language of the text portion based on the location of the text point in the coordinate space.
Specification