Determining language for character sequence
Abstract
A method for selecting the language for a character sequence fed into a data processing device, wherein decision trees are trained for different characters on the basis of lexicons of predetermined languages. The decision trees describe language probabilities on the basis of characters in the environments of the characters. The decision trees for at least some of the characters of the character sequence fed into the data processing device are traversed, thus obtaining a probability of at least one language for each character. The language for the character sequence is selected on the basis of the probabilities obtained.
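The traversal and selection described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` layout, the `#` padding symbol, the toy trees and their probabilities, and the choice to sum per-character probabilities are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Internal node (a question about a neighbour) or leaf (probabilities)."""
    offset: int = 0                 # which neighbour the question inspects
    char: str = ""                  # character the neighbour is compared to
    yes: Optional["Node"] = None    # branch taken when the neighbour matches
    no: Optional["Node"] = None     # branch taken otherwise
    probs: Optional[Dict[str, float]] = None  # set only on leaves


def leaf(**probs: float) -> Node:
    return Node(probs=dict(probs))


def traverse(tree: Node, word: str, i: int) -> Dict[str, float]:
    """Walk the tree of word[i] until a leaf is reached."""
    node = tree
    while node.probs is None:
        j = i + node.offset
        # Hypothetical convention: '#' stands for "outside the word".
        neighbour = word[j] if 0 <= j < len(word) else "#"
        node = node.yes if neighbour == node.char else node.no
    return node.probs


def select_language(trees: Dict[str, Node], word: str) -> Optional[str]:
    """Sum leaf probabilities over all characters that have a tree (assumed rule)."""
    totals: Dict[str, float] = {}
    for i, ch in enumerate(word):
        tree = trees.get(ch)
        if tree is None:            # "at least some of the characters"
            continue
        for lang, p in traverse(tree, word, i).items():
            totals[lang] = totals.get(lang, 0.0) + p
    return max(totals, key=totals.get) if totals else None


# Toy character-specific trees with made-up probabilities.
TREES = {
    "s": Node(offset=1, char="c", yes=leaf(de=0.9, en=0.1), no=leaf(en=0.6, de=0.4)),
    "c": Node(offset=1, char="h", yes=leaf(de=0.7, en=0.3), no=leaf(en=0.8, de=0.2)),
}

print(select_language(TREES, "schule"))  # de
print(select_language(TREES, "cat"))     # en
```

Characters without a tree are simply skipped here, which is one way to read "at least some of the characters"; other combination rules, such as multiplying probabilities, would fit the same structure.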
17 Claims
1. A method for selecting the language for a character sequence fed into a data processing device, the method comprising:
- storing in the data processing device character-specific decision trees to describe probabilities for at least two different languages on the basis of characters in the environments of the characters,
- traversing the decision trees for at least some of the characters of the character sequence fed into the data processing device, thus obtaining a probability of at least one language for each character, and
- selecting the language for the character sequence on the basis of said language probabilities.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method for training a decision tree in a data processing device, the method comprising:
- selecting lexicons to be used of at least two languages which comprise character sequences and language tags associated therewith,
- combining the lexicons into one training lexicon, and
- forming decision trees for different characters to be used for selecting the language for a character sequence by taking the following steps:
  - forming questions concerning characters in the environment of the selected character on the basis of the training lexicon,
  - comparing said questions with each other,
  - adding nodes to the decision tree of the character, the nodes comprising questions, in order to maximize the information gain,
  - growing the decision tree using the nodes until a predetermined ending criterion is met, and
  - adding leaves to the decision tree, the leaves comprising a probability of at least one language.
Dependent claims: 11, 13, 14, 15
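The training steps of claim 10 can be illustrated with a small sketch. It assumes Shannon entropy as the basis for information gain, yes/no questions of the form "is the character at a given offset equal to c?", a `#` padding symbol, and a minimum-gain threshold as the ending criterion. The claim fixes none of these details, and all function names and the toy lexicon are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of language tags."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def context_char(word, i, offset):
    j = i + offset
    return word[j] if 0 <= j < len(word) else "#"  # '#' = outside the word


def samples_for(lexicon, target):
    """One (word, position, language) sample per occurrence of `target`."""
    return [(w, i, lang) for w, lang in lexicon
            for i, ch in enumerate(w) if ch == target]


def split(samples, offset, char):
    yes = [s for s in samples if context_char(s[0], s[1], offset) == char]
    no = [s for s in samples if context_char(s[0], s[1], offset) != char]
    return yes, no


def information_gain(samples, offset, char):
    yes, no = split(samples, offset, char)
    if not yes or not no:
        return 0.0
    n = len(samples)
    tags = lambda part: [lang for _, _, lang in part]
    return (entropy(tags(samples))
            - len(yes) / n * entropy(tags(yes))
            - len(no) / n * entropy(tags(no)))


def best_question(samples, offsets=(-1, 1)):
    """Form candidate questions from the data and compare them by gain."""
    candidates = {(off, context_char(w, i, off))
                  for w, i, _ in samples for off in offsets}
    return max(sorted(candidates), key=lambda q: information_gain(samples, *q))


def grow(samples, min_gain=1e-9):
    """Return (question, yes_subtree, no_subtree) or a leaf dict of probabilities."""
    question = best_question(samples)
    if information_gain(samples, *question) <= min_gain:  # assumed ending criterion
        counts = Counter(lang for _, _, lang in samples)
        return {lang: n / len(samples) for lang, n in counts.items()}
    yes, no = split(samples, *question)
    return (question, grow(yes), grow(no))


# Toy training lexicon: two tagged lexicons already combined into one.
LEXICON = [("schule", "de"), ("schon", "de"), ("cat", "en"), ("cold", "en")]
tree_c = grow(samples_for(LEXICON, "c"))
print(tree_c)
```

In this toy lexicon, asking whether the character after `c` is `h` separates the two languages perfectly, so that question's gain equals the full 1 bit of entropy in the samples; several other questions tie with it, and the sketch breaks ties by sort order.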
12. A data processing device comprising a language selector block for determining the language from a character sequence fed into the device and memory for storing character-specific decision trees determined on the basis of lexicons of at least two languages, the decision trees describing language probabilities on the basis of characters in the environment of the characters, wherein
- said language selector block is arranged to retrieve the character-specific decision trees from the memory in response to the character sequence fed into the data processing device,
- said language selector block is arranged to traverse character-specific decision trees for at least some of the characters of the character sequence until a language can be assigned to each character, and
- said language selector block is arranged to select the language for the character sequence on the basis of the languages assigned to the characters.
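Claim 12 selects the language from the languages assigned to individual characters rather than directly from probabilities. One plausible reading, assumed here and not stated in the claim, is a majority vote over per-character assignments. The sketch below contrasts that with summing probabilities; the function names and the leaf probabilities are made up for illustration.

```python
from collections import Counter


def assign_language(probs):
    """Assign one language to a character: the most probable one at its leaf."""
    return max(probs, key=probs.get)


def select_by_vote(per_char_probs):
    """Majority vote over the languages assigned to the characters."""
    votes = Counter(assign_language(p) for p in per_char_probs)
    return votes.most_common(1)[0][0] if votes else None


def select_by_sum(per_char_probs):
    """Alternative rule: add up the probabilities across characters."""
    totals = Counter()
    for probs in per_char_probs:
        totals.update(probs)  # Counter.update adds the float values per key
    return max(totals, key=totals.get) if totals else None


# Hypothetical leaf probabilities for three characters of one word.
leaves = [
    {"en": 0.55, "de": 0.45},
    {"en": 0.55, "de": 0.45},
    {"de": 0.95, "en": 0.05},
]
print(select_by_vote(leaves))  # en
print(select_by_sum(leaves))   # de
```

The two rules can disagree, as they do here: two weak votes for one language outnumber a single strong indication of the other, while summing lets the strong indication dominate.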
16. A computer program product for controlling a data processing device processing a character sequence, said computer program product comprising program code causing the data processing device to
- retrieve character-specific decision trees stored in the memory of the data processing device that describe probabilities for at least two languages on the basis of characters in the environments of the characters,
- traverse the decision trees for at least some of the characters of the character sequence fed into the data processing device, thus obtaining a probability of at least one language for each character, and
- select the language for the character sequence on the basis of said language probabilities.
17. A computer program product for controlling a data processing device, said computer program product comprising program code causing the data processing device to
- select lexicons to be used of at least two languages, the lexicons comprising character sequences and language tags associated therewith,
- combine the lexicons into one training lexicon, and
- form decision trees for different characters to be used for selecting the language for a character sequence by taking the following steps:
  - producing questions concerning characters in the environment of the selected character on the basis of the training lexicon,
  - comparing said questions with each other,
  - adding nodes to the decision tree of the character, the nodes comprising questions, in order to maximize the information gain, and
  - growing the decision tree using the nodes until a predetermined ending criterion is met.
Specification