Text categorization based on co-classification learning from multilingual corpora
Abstract
The present document describes a method and a system for generating classifiers from multilingual corpora including subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention use this similarity to enhance the accuracy of the classification in one language based on the classification results in the other language, and vice versa. A system in accordance with the present embodiments implements a method which comprises generating a first classifier from a first subset of the corpora in a first language; generating a second classifier from a second subset of the corpora in a second language; and re-training each of the classifiers on its respective subset based on the classification results of the other classifier, until a training cost between the classification results produced by subsequent iterations reaches a local minimum.
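The premise of the abstract, that translations of one document should receive substantially the same classification, can be illustrated with a disagreement measure between the outputs of the two classifiers. The sketch below is an assumption made for illustration only: this section does not define the training cost, so the function name `disagreement_cost` and the choice of a symmetric Kullback-Leibler divergence over binary class probabilities are hypothetical.

```python
import math


def disagreement_cost(p1, p2, eps=1e-12):
    """Mean symmetric KL divergence between two lists of class-1 probabilities.

    p1[i] and p2[i] are the (hypothetical) outputs of the first and second
    classifier for the two language versions of document i.
    """
    if len(p1) != len(p2) or not p1:
        raise ValueError("p1 and p2 must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(p1, p2):
        # Clamp away from 0 and 1 so the logarithms stay finite.
        a = min(max(a, eps), 1.0 - eps)
        b = min(max(b, eps), 1.0 - eps)
        kl_ab = a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
        kl_ba = b * math.log(b / a) + (1.0 - b) * math.log((1.0 - b) / (1.0 - a))
        total += 0.5 * (kl_ab + kl_ba)
    return total / len(p1)
```

Under this assumed measure the cost is zero when both classifiers assign identical probabilities to every document pair and grows as their outputs diverge, which is the quantity an iterative re-training scheme would drive down.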
26 Claims
1. A method for enhancing a performance of a first classifier implemented on a computing device used for classifying a first subset of documents written in a first language, the method comprising:
a) receiving, at the computing device, a second subset of documents written in a second language different than the first language, said second subset including substantially the same content as the first subset;
b) running the first classifier over the first subset to generate a first classification;
c) running a second classifier implemented on the computing device over the second subset to generate a second classification;
d) reducing a training cost between the first and second classifications, including repeating steps b) and c) wherein each classifier updates its own classification in view of the classification generated by the other classifier until the training cost is set to a minimum;
the reducing comprising applying at least one of a gradient based algorithm for reducing the training cost between classifications, and an analytical algorithm for finding an approximate solution that reduces classification losses to reduce the training cost between classifications; and
e) outputting at least one of said first classification and said first classifier.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method implemented on a computing device for generating classifiers from multilingual corpora, the method comprising:
extracting, using the computing device, textual data from each one of a set of documents which form part of the multilingual corpora, the multilingual corpora comprising a first and a second subset of content-equivalent documents written in one of two respective languages;
transforming the textual data into a respective one of feature vectors x1 and x2, each one of the feature vectors being associated to a document classification y for categorizing different language versions of a same document;
generating, using the computing device, a first classifier f1 from the first subset, the first classifier f1 being associated to the feature vector x1;
generating, using the computing device, a second classifier f2 from the second subset, the second classifier f2 being associated to the feature vector x2;
re-training the first classifier f1 on the first subset based on classification results obtained from the second classifier f2, to obtain a re-trained first classifier f1′;
re-training the second classifier f2 on the second subset based on other classification results obtained from the re-trained first classifier f1′, to obtain a re-trained second classifier f2′;
the re-training comprising applying at least one of a gradient based algorithm for reducing the training cost between classification results, and an analytical algorithm for finding an approximate solution that reduces classification losses to reduce the training cost between classification results;
repeating the steps of re-training until a training cost between the re-trained first and second classifiers is minimized, thereby producing final first and second re-trained classifiers; and
outputting at least one of the final first re-trained classifier and the final second re-trained classifier.
Dependent claims: 15, 16
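The re-training steps of claim 14 can be sketched as an alternating gradient procedure. This is a minimal illustration under stated assumptions, not the patented implementation: the classifiers f1 and f2 are assumed to be logistic models over x1 and x2, the training cost is assumed to be the sum of both log-losses plus a squared disagreement penalty weighted by a hypothetical parameter `lam`, and all function names (`predict`, `training_cost`, `retrain`, `co_train`) and the toy data are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, X):
    """Class-1 probability of a linear classifier w for each feature vector in X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in X]


def log_loss(p, t, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))


def training_cost(w1, w2, X1, X2, y, lam):
    """Assumed cost: both classification losses plus a squared disagreement penalty."""
    p1, p2 = predict(w1, X1), predict(w2, X2)
    return sum(log_loss(a, t) + log_loss(b, t) + lam * (a - b) ** 2
               for a, b, t in zip(p1, p2, y))


def retrain(w, X, y, p_fixed, lam, lr, steps):
    """Gradient descent on one classifier while the other's outputs stay fixed."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, p, t, q in zip(X, predict(w, X), y, p_fixed):
            # Derivative w.r.t. the linear score: (p - t) from the log-loss,
            # 2*lam*(p - q)*p*(1 - p) from the disagreement penalty.
            g = (p - t) + 2.0 * lam * (p - q) * p * (1.0 - p)
            for j, xj in enumerate(x):
                grad[j] += g * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def co_train(X1, X2, y, lam=1.0, lr=0.05, steps=20, tol=1e-6, max_rounds=100):
    w1, w2 = [0.0] * len(X1[0]), [0.0] * len(X2[0])
    prev = training_cost(w1, w2, X1, X2, y, lam)
    for _ in range(max_rounds):
        w1 = retrain(w1, X1, y, predict(w2, X2), lam, lr, steps)  # f2 held fixed
        w2 = retrain(w2, X2, y, predict(w1, X1), lam, lr, steps)  # re-trained f1 held fixed
        cost = training_cost(w1, w2, X1, X2, y, lam)
        if prev - cost < tol:  # stop once the cost no longer decreases
            break
        prev = cost
    return w1, w2


# Toy data: each row is [bias, feature]; row i of X1 and X2 describes the same document.
X1 = [[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]]
X2 = [[1.0, 1.0], [1.0, 2.5], [1.0, -1.5], [1.0, -0.5]]
y = [1, 1, 0, 0]
w1, w2 = co_train(X1, X2, y)
```

Each call to `retrain` updates one classifier against the other's current outputs, so one round corresponds to the two re-training steps of the claim, and the loop repeats them until the assumed cost stops decreasing.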
17. A system for classifying content-equivalent documents written in different languages, said system comprising:
a first classifier for classifying a first set of documents written in a first language to generate a first classification;
a second classifier for classifying a second set of documents written in a second language different than the first language to generate a second classification;
a comparator operatively connected to outputs of said first and second classifiers for detecting a training cost between said first and second classifications; and
an optimizer for adjusting parameters of said first and second classifiers based on the second and first classifications respectively, when the training cost is higher than a minimum, wherein adjusting the parameters includes applying at least one of a gradient based algorithm for reducing the training cost between classifications, and an analytical algorithm for finding an approximate solution that reduces classification losses to reduce the training cost between classifications;
wherein the optimizer orders the first and second classifiers to re-classify the first and second sets of documents until the training cost reaches the minimum.
Dependent claims: 18, 19, 20, 21, 22
23. A method for enhancing a performance of a first classifier implemented on a computing device used for classifying a first subset of documents written in a first language, the method comprising:
a) receiving, at the computing device, a second subset of documents written in a second language different than the first language, said second subset including substantially the same content as the first subset;
b) running the first classifier over the first subset to generate a first classification;
c) running a second classifier over the second subset to generate a second classification;
d) reducing a training cost between the first and second classifications, said reducing comprises repeating steps b) and c) wherein each classifier updates its own classification in view of the classification generated by the other classifier until the training cost is set to a minimum;
the repeating being performed in series wherein one classifier is fixed and the other classifier updates its own classification using the classification of the fixed classifier; and
e) outputting at least one of said first classification and said first classifier.
Dependent claims: 24
25. A system for classifying content-equivalent documents written in different languages, said system comprising:
a first classifier for classifying a first set of documents written in a first language to generate a first classification;
a second classifier for classifying a second set of documents written in a second language different than the first language to generate a second classification;
a comparator operatively connected to outputs of said first and second classifiers for detecting a training cost between said first and second classifications; and
an optimizer for adjusting parameters of said first and second classifiers based on the second and first classifications respectively, when the training cost is higher than a minimum;
wherein the optimizer orders the first and second classifiers to re-classify the first and second sets of documents until the training cost reaches the minimum, wherein the re-classification is performed in series wherein one classifier is fixed and the other classifier updates its own classification using the classification of the fixed classifier.
Dependent claims: 26
Specification