TEXT CATEGORIZATION BASED ON CO-CLASSIFICATION LEARNING FROM MULTILINGUAL CORPORA
Abstract
The present document describes a method and a system for generating classifiers from multilingual corpora including subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention utilize this similarity in order to enhance the accuracy of the classification in one language based on the classification results in the other language, and vice versa. A system in accordance with the present embodiments implements a method which comprises generating a first classifier from a first subset of the corpora in a first language; generating a second classifier from a second subset of the corpora in a second language; and re-training each of the classifiers on its respective subset based on the classification results of the other classifier, until a training cost between the classification results produced by subsequent iterations reaches a local minimum.
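The abstract refers to a training cost between the classification results without defining it in this section. As a rough illustration only, the sketch below assumes binary classifiers that output a probability per document and measures their disagreement on translation pairs with a symmetric Kullback-Leibler divergence. The function names and the choice of divergence are assumptions made for this example, not details taken from the text above.

```python
import math

def _kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for safety."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def disagreement_cost(probs_lang1, probs_lang2):
    """Mean symmetric KL divergence over paired (translated) documents.

    probs_lang1[i] and probs_lang2[i] are assumed to be the two classifiers'
    probabilities for the two language versions of the same document.
    """
    if not probs_lang1 or len(probs_lang1) != len(probs_lang2):
        raise ValueError("predictions must be non-empty and paired")
    total = 0.0
    for p, q in zip(probs_lang1, probs_lang2):
        total += 0.5 * (_kl_bernoulli(p, q) + _kl_bernoulli(q, p))
    return total / len(probs_lang1)

# Identical predictions give zero cost; diverging predictions give a positive cost.
agree = disagreement_cost([0.9, 0.2, 0.7], [0.9, 0.2, 0.7])
differ = disagreement_cost([0.9, 0.2, 0.7], [0.4, 0.6, 0.7])
```

A cost of this kind is zero only when both classifiers assign the same probabilities to every translation pair, which matches the abstract's premise that translations should receive substantially the same classification.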
51 Citations
22 Claims
1. A method for enhancing a performance of a first classifier used for classifying a first subset of documents written in a first language, the method comprising:
a) providing a second subset of documents written in a second language different than the first language, said second subset including substantially the same content as the first subset;
b) running the first classifier over the first subset to generate a first classification;
c) running a second classifier over the second subset to generate a second classification;
d) reducing a training cost between the first and second classifications, said reducing comprises repeating steps b) and c) wherein each classifier updates its own classification in view of the classification generated by the other classifier until the training cost is set to a minimum; and
e) outputting at least one of said first classification and said first classifier.
Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16
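Steps b) through e) of claim 1 can be pictured as an alternating loop. The sketch below is one possible reading, not the claimed implementation: it assumes two small logistic-regression classifiers, a mean squared difference between their outputs as the training cost, and re-fitting each classifier toward labels blended with the other classifier's predictions. The toy feature matrices `X1` and `X2`, the `mix` weight, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, X):
    """Probability of the positive class for each row of X."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for x in X]

def fit_step(w, X, targets, lr=0.5, epochs=50):
    """Batch gradient descent on cross-entropy toward (possibly soft) targets."""
    w = list(w)
    n = len(X)
    for _ in range(epochs):
        preds = predict(w, X)
        for j in range(len(w)):
            grad = sum((p - t) * x[j] for p, t, x in zip(preds, targets, X)) / n
            w[j] -= lr * grad
    return w

def co_classify(X1, X2, y, mix=0.5, tol=1e-4, max_rounds=50):
    # Initial classifiers, each trained on its own language view.
    w1 = fit_step([0.0] * len(X1[0]), X1, y)
    w2 = fit_step([0.0] * len(X2[0]), X2, y)
    cost = float("inf")
    for _ in range(max_rounds):
        # Steps b) and c): run each classifier over its own subset.
        p1, p2 = predict(w1, X1), predict(w2, X2)
        # Assumed training cost: mean squared disagreement on paired documents.
        new_cost = sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(y)
        if cost - new_cost < tol:  # no meaningful improvement: stop
            break
        cost = new_cost
        # Step d): each classifier updates in view of the other's output.
        t1 = [(1 - mix) * yi + mix * b for yi, b in zip(y, p2)]
        t2 = [(1 - mix) * yi + mix * a for yi, a in zip(y, p1)]
        w1 = fit_step(w1, X1, t1)
        w2 = fit_step(w2, X2, t2)
    # Step e): output the classifiers (and the last accepted cost).
    return w1, w2, cost

# Toy paired data: row i of X1 and X2 describes the same document (bias first).
X1 = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
X2 = [[1.0, 1.5, 0.5], [1.0, 0.5, 1.0], [1.0, -0.5, -1.0], [1.0, -1.5, -0.5]]
y = [1, 1, 0, 0]
w1, w2, cost = co_classify(X1, X2, y)
```

The stopping rule here (halt when the cost stops decreasing by more than `tol`) is a simple stand-in for "until the training cost is set to a minimum"; other convergence tests would fit the claim language equally well.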
15. A method for generating classifiers from multilingual corpora, the method comprising:
extracting textual data from each one of a set of documents which form part of the multilingual corpora, the multilingual corpora comprising a first and a second subset of content-equivalent documents written in one of two respective languages;
transforming the textual data into a respective one of feature vectors x1 and x2, each one of the feature vectors being associated to a document classification y for categorizing different language versions of a same document;
generating a first classifier f1 from the first subset, the first classifier f1 being associated to the feature vector x1;
generating a second classifier f2 from the second subset, the second classifier f2 being associated to the feature vector x2;
re-training the first classifier f1 on the first subset based on classification results obtained from the second classifier f2, to obtain a re-trained first classifier f1;
re-training the second classifier f2 on the second subset based on other classification results obtained from the re-trained first classifier f1′, to obtain a re-trained second classifier f2;
repeating the steps of re-training until a training cost between the re-trained first and second classifiers is minimized, thereby producing final first and second re-trained classifiers; and
outputting the final first and second re-trained classifiers.
Dependent Claims: 17
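The extracting and transforming steps of claim 15 can be illustrated with a plain bag-of-words representation. This is a minimal sketch under stated assumptions: whitespace tokenization, raw term counts, and a two-document English/French corpus invented for the example. The claim itself does not specify how x1 and x2 are computed, and the helper names below are hypothetical.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_feature_vector(text, vocabulary):
    """Term-count vector over a fixed vocabulary; unknown tokens are ignored."""
    counts = Counter(text.lower().split())
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vec[vocabulary[tok]] = n
    return vec

# Hypothetical parallel corpus: (first-language text, second-language text, label y).
corpus = [
    ("the price rises fast", "le prix monte vite", "finance"),
    ("the team wins the match", "le club gagne le match", "sport"),
]

# One vocabulary per language, since the two views do not share features.
vocab_1 = build_vocabulary(t1 for t1, _, _ in corpus)
vocab_2 = build_vocabulary(t2 for _, t2, _ in corpus)

# Each document yields a triple (x1, x2, y): two language views, one shared label.
dataset = [
    (to_feature_vector(t1, vocab_1), to_feature_vector(t2, vocab_2), y)
    for t1, t2, y in corpus
]
```

Keeping a separate vocabulary per language reflects the claim's structure, in which classifier f1 only ever sees x1 and classifier f2 only ever sees x2, while the label y is shared by both versions of a document.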
18. A system for classifying content-equivalent documents written in different languages, said system comprising:
a first classifier for classifying a first set of documents written in a first language to generate a first classification;
a second classifier for classifying a second set of documents written in a second language different from the first language to generate a second classification;
a comparator operatively connected to outputs of said first and second classifiers for detecting a training cost between said first and second classifications; and
an optimizer for adjusting parameters of said first and second classifiers based on the second and first classifications respectively, when the training cost is higher than a minimum;
wherein the optimizer orders the first and second classifiers to re-classify the first and second sets of documents until the training cost reaches the minimum.
Dependent Claims: 19, 20, 21, 22
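The component arrangement of claim 18 (two classifiers, a comparator, an optimizer) can be mocked up with deliberately simple parts. The sketch below is an assumption-laden toy rather than the claimed system: each classifier is a single threshold on a precomputed score, the comparator's training cost is the fraction of document pairs labelled differently, and the optimizer nudges each threshold according to the other classifier's output. All class names, the update rule, and the scores are hypothetical.

```python
class ThresholdClassifier:
    """Toy classifier: labels a document 1 when its score exceeds a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def classify(self, scores):
        return [1 if s > self.threshold else 0 for s in scores]

class Comparator:
    """Assumed training cost: fraction of document pairs labelled differently."""
    def cost(self, labels_a, labels_b):
        return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

class Optimizer:
    """Adjusts both classifiers and orders re-classification until agreement."""
    def __init__(self, comparator, minimum=0.0, step=0.1, max_rounds=100):
        self.comparator = comparator
        self.minimum = minimum
        self.step = step
        self.max_rounds = max_rounds

    def run(self, clf_a, scores_a, clf_b, scores_b):
        rounds = 0
        while True:
            labels_a = clf_a.classify(scores_a)
            labels_b = clf_b.classify(scores_b)
            cost = self.comparator.cost(labels_a, labels_b)
            if cost <= self.minimum or rounds >= self.max_rounds:
                return labels_a, labels_b, cost
            # Each threshold moves according to the other classifier's output:
            # the classifier that labels more documents positive becomes stricter.
            rate_a = sum(labels_a) / len(labels_a)
            rate_b = sum(labels_b) / len(labels_b)
            clf_a.threshold += self.step * (rate_a - rate_b)
            clf_b.threshold += self.step * (rate_b - rate_a)
            rounds += 1

# Hypothetical scores for four translation pairs (index i = same document).
scores_a = [0.9, 0.6, 0.42, 0.1]
scores_b = [0.8, 0.7, 0.3, 0.2]
clf_a, clf_b = ThresholdClassifier(0.3), ThresholdClassifier(0.5)
labels_a, labels_b, final_cost = Optimizer(Comparator()).run(
    clf_a, scores_a, clf_b, scores_b
)
```

In this toy run the two classifiers initially disagree on the third document; the optimizer keeps adjusting and re-classifying until the comparator reports a cost at the configured minimum. The `max_rounds` guard is an addition for safety, since nothing in this simplified update rule guarantees convergence in general.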
Specification