Cross lingual text classification apparatus and method
First Claim
1. A text classification apparatus, comprising:
a text input device for receiving an entered text;
a storage device for storing a concept thesaurus file for use in classifying an entered text to be classified, a cross lingual word sense-based knowledge file corresponding to a plurality of languages including a first and a second language, and a word-based classification knowledge file;
a processing unit for executing a classification of the entered text to be classified to assign a category to the entered text; and
an output device for outputting the classification result, wherein:
said text input device receives first entered text to be classified in the first language,
said processing unit is configured to:
extract a word from said first entered text to be classified;
convert the extracted word into a word sense using said concept thesaurus file;
compare the word sense resulting from the conversion with information on each category included in said cross lingual word sense-based classification knowledge file to calculate a first score for each category;
compare the extracted word with word classification information included in said word-based classification knowledge file to calculate a second score for each category; and
integrate said first and second scores for each category to determine a category for the first text to be classified in the first language for assigning a category to the first entered text, and
said word-based classification knowledge file is generated by learning a word-based classification knowledge using words included in a labeled text in the first language, wherein:
said text to be classified which has been assigned the category by said text classification apparatus is used for learning the word-based classification knowledge as a labeled text in the first language used in the generation of said word-based classification knowledge file.
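For illustration only, the extract, convert, score, and integrate steps recited in the claim can be sketched as follows. The claim does not specify data structures, a scoring function, or how the two scores are integrated. The dictionary-based thesaurus, the summed-weight scores, and the weighted sum controlled by `alpha` are therefore assumptions made for this sketch, and every name and value in it is hypothetical.

```python
from collections import defaultdict

# Hypothetical concept thesaurus: word -> language-independent word sense ID.
THESAURUS = {"bank": "S_FINANCE_INST", "loan": "S_LENDING", "goal": "S_SCORE_POINT"}

# Hypothetical word sense-based knowledge: sense -> {category: weight}.
SENSE_KNOWLEDGE = {
    "S_FINANCE_INST": {"economy": 1.0},
    "S_LENDING": {"economy": 0.8},
    "S_SCORE_POINT": {"sports": 1.0},
}

# Hypothetical word-based knowledge: word -> {category: weight}.
WORD_KNOWLEDGE = {"bank": {"economy": 0.5}, "striker": {"sports": 0.9}}


def score(keys, knowledge):
    """Sum per-category weights over all keys found in the knowledge table."""
    totals = defaultdict(float)
    for key in keys:
        for category, weight in knowledge.get(key, {}).items():
            totals[category] += weight
    return totals


def classify(text, alpha=0.5):
    """Return the best category for the text, or None if nothing matches."""
    words = text.lower().split()  # naive whitespace word extraction (assumption)
    senses = [THESAURUS[w] for w in words if w in THESAURUS]
    first = score(senses, SENSE_KNOWLEDGE)  # first score: word sense level
    second = score(words, WORD_KNOWLEDGE)   # second score: word level
    categories = set(first) | set(second)
    # Assumed integration rule: weighted sum of the two scores.
    combined = {c: alpha * first[c] + (1 - alpha) * second[c] for c in categories}
    return max(combined, key=combined.get) if combined else None
```

Because the word senses are language-independent in this sketch, the sense-level table could be shared across languages while the word-level table stays specific to the first language, which is one plausible reading of why the claim keeps the two scores separate before integrating them.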
Abstract
A text classification apparatus directed to a plurality of languages includes: a unit for extracting information for converting a word from non-classified (unlabeled) texts in a plurality of languages into a word sense; a unit for learning classification knowledge at the word sense level after converting a word extracted from a labeled text into a word sense; a unit for learning classification knowledge at the word level from the labeled text; a unit for learning classification knowledge at the word level from the classification knowledge at the word sense level and information on a relation between words extracted from the unlabeled text; and a unit for combining the respective classification knowledge to assign a category.
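As a rough sketch of the fourth unit in the abstract, word-level knowledge can be derived from sense-level knowledge plus a relation between words. The abstract does not say what the relation is or how it is used. This example assumes the relation is same-document co-occurrence and that a word borrows the category weights of the senses of its co-occurring words, scaled by co-occurrence count. All function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(documents):
    """Count how often each unordered word pair appears in the same document."""
    counts = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(doc.lower().split())), 2):
            counts[(a, b)] += 1
    return counts


def derive_word_knowledge(documents, thesaurus, sense_knowledge):
    """Give each word category weights borrowed from the words it co-occurs with."""
    word_knowledge = defaultdict(lambda: defaultdict(float))
    for (a, b), n in cooccurrence(documents).items():
        # Propagate in both directions: a borrows from b, and b borrows from a.
        for target, neighbour in ((a, b), (b, a)):
            sense = thesaurus.get(neighbour)  # None if the neighbour has no sense
            for category, weight in sense_knowledge.get(sense, {}).items():
                word_knowledge[target][category] += n * weight
    return {w: dict(cats) for w, cats in word_knowledge.items()}
```

With this rule, a word that is missing from the thesaurus can still receive category weights, provided it appears alongside words that the thesaurus does cover.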
12 Claims
1. A text classification apparatus, comprising:
a text input device for receiving an entered text;
a storage device for storing a concept thesaurus file for use in classifying an entered text to be classified, a cross lingual word sense-based knowledge file corresponding to a plurality of languages including a first and a second language, and a word-based classification knowledge file;
a processing unit for executing a classification of the entered text to be classified to assign a category to the entered text; and
an output device for outputting the classification result, wherein:
said text input device receives first entered text to be classified in the first language,
said processing unit is configured to:
extract a word from said first entered text to be classified;
convert the extracted word into a word sense using said concept thesaurus file;
compare the word sense resulting from the conversion with information on each category included in said cross lingual word sense-based classification knowledge file to calculate a first score for each category;
compare the extracted word with word classification information included in said word-based classification knowledge file to calculate a second score for each category; and
integrate said first and second scores for each category to determine a category for the first text to be classified in the first language for assigning a category to the first entered text, and
said word-based classification knowledge file is generated by learning a word-based classification knowledge using words included in a labeled text in the first language, wherein:
said text to be classified which has been assigned the category by said text classification apparatus is used for learning the word-based classification knowledge as a labeled text in the first language used in the generation of said word-based classification knowledge file.
Dependent claims: 2, 3, 4
5. A text classification apparatus comprising:
a text input device for receiving an entered text;
a storage device for storing a concept thesaurus file for use in classifying an entered text to be classified, a cross lingual word sense knowledge file corresponding to a plurality of languages including a first and a second language, and a word-based classification knowledge file;
a processing unit for executing a classification of the entered text to be classified to assign a category to the text; and
an output device for outputting the classification result, wherein:
said text input device receives an entered text to be classified in the first language,
said processing unit is configured to:
extract a word from said first text to be classified;
convert the extracted word into a word sense using said concept thesaurus file;
compare the word sense resulting from the conversion with information on each category included in said cross lingual word sense-based classification knowledge file to calculate a first score for each category;
compare the extracted word with word-based classification information included in said word-based classification knowledge file to calculate a second score for each category; and
integrate said first and second scores for each category to determine a category for the first text to be classified in the first language for assigning a category to the first text, and
said word-based classification knowledge file is generated by extracting information indicative of a relation between a plurality of words from a labeled text in the first language, and extracting a word-based classification knowledge of each category using the extracted information on the relation between words, and word classification information on each category included in said cross lingual word sense-based classification knowledge file.
Dependent claims: 6, 7, 8
9. A method of assigning a category to a text to be classified, said text being entered into a text classification apparatus having a storage device for storing a concept thesaurus file for use in classifying an entered text to be classified, a cross lingual word sense knowledge file corresponding to a plurality of languages including a first and a second language, and a word-based classification knowledge file, and a processing unit for executing a classification of the entered text to be classified to assign a category to the text, said method comprising the steps of:
receiving a text in the first language entered for classification;
extracting a word from the text to be classified in the first language;
converting the extracted word into a word sense using said concept thesaurus file;
comparing the word sense resulting from the conversion with information on each category included in said cross lingual word sense-based classification knowledge file to calculate a first score for each category;
comparing the extracted word with word-based classification information included in said word-based classification knowledge file to calculate a second score for each category; and
integrating said first and second scores for each category to determine a category for the text to be classified in the first language for assigning a category to the text to be classified, and
said word-based classification knowledge file being generated by learning a word-based classification knowledge using words included in the labeled text in the first language, wherein:
said text to be classified which has been assigned the category by said text classification apparatus is used for learning the word-based classification knowledge as the labeled text in the first language used in the production of said word-based classification knowledge file.
Dependent claims: 10, 11, 12
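The final "wherein" clause of claims 1 and 9 describes a feedback loop: a text that the apparatus has just classified is reused as labeled text for learning the word-based classification knowledge. A minimal sketch of that loop follows. The claims do not state the learning rule, so this example assumes relative word frequency per category. For brevity it also classifies with the word-based score alone rather than the integrated score. All names are hypothetical.

```python
from collections import defaultdict


def learn_word_knowledge(labeled_texts):
    """Relative frequency of each word per category (an assumed learning rule)."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, category in labeled_texts:
        for word in text.lower().split():
            counts[word][category] += 1
    return {
        w: {c: n / sum(cats.values()) for c, n in cats.items()}
        for w, cats in counts.items()
    }


def classify(text, word_knowledge):
    """Pick the category with the highest summed word weight, or None."""
    totals = defaultdict(float)
    for word in text.lower().split():
        for category, weight in word_knowledge.get(word, {}).items():
            totals[category] += weight
    return max(totals, key=totals.get) if totals else None


def self_train(labeled_texts, incoming_texts):
    """Classify each incoming text, then reuse it as labeled text for relearning."""
    labeled = list(labeled_texts)
    knowledge = learn_word_knowledge(labeled)
    for text in incoming_texts:
        category = classify(text, knowledge)
        if category is not None:
            labeled.append((text, category))           # machine-labeled text
            knowledge = learn_word_knowledge(labeled)  # regenerate the knowledge
    return knowledge
```

In this sketch a word that was never in the seed data, such as one that co-occurs with a known word in a newly classified text, gains a category weight after the relearning step.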
Specification