Text categorization based on co-classification learning from multilingual corpora

US 8,438,009 B2
Filed: 10/21/2010
Issued: 05/07/2013
Est. Priority Date: 10/22/2009
Status: Active Grant

Abstract

The present document describes a method and a system for generating classifiers from multilingual corpora including subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention utilize this similarity in order to enhance the accuracy of the classification in one language based on the classification results in the other language, and vice versa. A system in accordance with the present embodiments implements a method which comprises generating a first classifier from a first subset of the corpora in a first language; generating a second classifier from a second subset of the corpora in a second language; and re-training each of the classifiers on its respective subset based on the classification results of the other classifier, until a training cost between the classification results produced by subsequent iterations reaches a local minimum.

26 Claims

1. A method for enhancing a performance of a first classifier implemented on a computing device used for classifying a first subset of documents written in a first language, the method comprising:
- a) receiving, at the computing device, a second subset of documents written in a second language different than the first language, said second subset including substantially the same content as the first subset;
  
  b) running the first classifier over the first subset to generate a first classification;
  
  c) running a second classifier implemented on the computing device over the second subset to generate a second classification;
  
  d) reducing a training cost between the first and second classifications, including repeating steps b) and c) wherein each classifier updates its own classification in view of the classification generated by the other classifier until the training cost is set to a minimum;
  
  the reducing comprising applying at least one of a gradient based algorithm for reducing the training cost between classifications, and an analytical algorithm for finding an approximate solution that reduces classification losses to reduce the training cost between classifications; and
  
  e) outputting at least one of said first classification and said first classifier.
- - 2. The method of claim 1, wherein reducing further comprises updating one classification based on a probability associated with each class in the other classification.
  - 3. The method of claim 2, wherein updating comprises reducing classification errors.
  - 4. The method of claim 2, wherein the training cost includes a mis-classification cost associated with each classifier and a disagreement cost between the two classifiers.
  - 5. The method of claim 2, wherein reducing comprises adjusting parameters of each classifier to reduce the training cost between classifications.
  - 6. The method of claim 1, wherein, each classifier updates its own classification in view of the latest version of updated classification generated by the other classifier.
  - 7. The method of claim 6, wherein repeating is performed at least partially in parallel by the first and second classifiers.
  - 8. The method of claim 6, wherein repeating is performed in series wherein one classifier is fixed and the other classifier updates its own classification using the classification of the fixed classifier.
  - 9. The method of claim 1, wherein providing the second subset comprises machine-translating said first subset into the second language.
  - 10. The method of claim 1, wherein providing the second subset comprises providing a subset which is comparable to the first subset.
  - 11. The method of claim 1, wherein providing the second subset comprises providing a subset which is a parallel translation of the first subset.
  - 12. The method of claim 1, wherein the minimum is determined on the basis of a level of difference between the first and second languages.
  - 13. A computer readable memory having recorded thereon non-transitory statements and instructions for execution by a processor for implementing the method of claim 1.
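
Claims 4 and 5 describe the training cost only in words. As an illustration, and not as the form used in the patent, the cost could be written as two mis-classification terms plus a weighted disagreement term, with a gradient step on each classifier's parameters. The symbols $\ell$ (per-document loss), $\Delta$ (disagreement measure), $\lambda$ (disagreement weight), $\theta_v$ (parameters of classifier $f_v$) and $\eta$ (step size) are assumptions introduced here; $x_1$, $x_2$, $y$, $f_1$ and $f_2$ follow the notation of claim 14.

```latex
C(f_1, f_2) =
  \sum_{i=1}^{n} \ell\bigl(f_1(x_1^{(i)}),\, y^{(i)}\bigr)
  + \sum_{i=1}^{n} \ell\bigl(f_2(x_2^{(i)}),\, y^{(i)}\bigr)
  + \lambda \sum_{i=1}^{n} \Delta\bigl(f_1(x_1^{(i)}),\, f_2(x_2^{(i)})\bigr)

\theta_v \leftarrow \theta_v - \eta \, \nabla_{\theta_v} C(f_1, f_2),
  \qquad v \in \{1, 2\}
```

Under this reading, a larger $\lambda$ pushes the two classifiers toward agreement on content-equivalent documents, while $\lambda = 0$ reduces to training each classifier independently.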

14. A method implemented on a computing device for generating classifiers from multilingual corpora, the method comprising:
- extracting, using the computing device, textual data from each one of a set of documents which form part of the multilingual corpora, the multilingual corpora comprising a first and a second subset of content-equivalent documents written in one of two respective languages;
  
  transforming the textual data into a respective one of feature vectors x1 and x2, each one of the feature vectors being associated to a document classification y for categorizing different language versions of a same document;
  
  generating, using the computing device, a first classifier f1 from the first subset, the first classifier f1 being associated to the feature vector x1;
  
  generating, using the computing device, a second classifier f2 from the second subset, the second classifier f2 being associated to the feature vector x2;
  
  re-training the first classifier f1 on the first subset based on classification results obtained from the second classifier f2, to obtain a re-trained first classifier f1′;
  
  re-training the second classifier f2 on the second subset based on other classification results obtained from the re-trained first classifier f1′, to obtain a retrained second classifier f2′;
  
  the re-training comprising applying at least one of a gradient based algorithm for reducing the training cost between classification results, and an analytical algorithm for finding an approximate solution that reduces classification losses to reduce the training cost between classification results;
  
  repeating the steps of re-training until a training cost between the retrained first and second classifiers is minimized, thereby producing final first and second re-trained classifiers; and
  
  outputting at least one of the final first re-trained classifier and the final second re-trained classifier.
- - 15. A computer readable memory having recorded thereon non-transitory statements and instructions for execution by a processor for implementing the method of claim 14.
  - 16. The method of claim 14, wherein the training cost includes a mis-classification cost associated with each classifier and a disagreement cost between the two classifiers.
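
The extraction and transformation steps of claim 14 can be sketched as follows. This is a minimal illustration that assumes a simple term-count (bag-of-words) representation; the claim does not specify the feature type, and the function names, the regular-expression tokenizer and the two sample document pairs are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token of one language's subset to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_feature_vector(text, vocabulary):
    """Return a term-count vector for one document; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    # Dict iteration follows insertion order, which matches the column indices.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical content-equivalent pairs: (first language, second language, class y).
pairs = [
    ("the match ended in a draw", "le match a fini par un nul", "sports"),
    ("the bank raised its rates", "la banque a relevé ses taux", "finance"),
]

# Each language gets its own vocabulary, so x1 and x2 live in different spaces.
vocab_1 = build_vocabulary([first for first, _, _ in pairs])
vocab_2 = build_vocabulary([second for _, second, _ in pairs])

# One (x1, x2, y) triple per document: two views, one shared classification.
dataset = [
    (to_feature_vector(first, vocab_1), to_feature_vector(second, vocab_2), y)
    for first, second, y in pairs
]
```

Keeping a separate vocabulary per language reflects the claim's pairing of f1 with x1 and f2 with x2; the only thing the two views share is the classification y.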

17. A system for classifying content-equivalent documents written in different languages, said system comprising:
- a first classifier for classifying a first set of documents written in a first language to generate a first classification;
  
  a second classifier for classifying a second set of documents written in a second language different than the first language to generate a second classification;
  
  a comparator operatively connected to outputs of said first and second classifiers for detecting a training cost between said first and second classifications; and
  
  an optimizer for adjusting parameters of said first and second classifiers based on the second and first classifications respectively, when the training cost is higher than a minimum, wherein adjusting the parameters includes applying at least one of a gradient based algorithm for reducing the training cost between classifications, and an analytical algorithm for finding an approximate solution that reduces classification losses to reduce the training cost between classifications;
  
  wherein the optimizer orders the first and second classifiers to re-classify the first and second sets of documents until the training cost reaches the minimum.
- - 18. A system according to claim 17, wherein each classifier updates its own classification based on a probability associated with each class in the other classification.
  - 19. A system according to claim 17, wherein one of the first and second sets is a machine-translation of the other.
  - 20. A system according to claim 19, wherein the system comprises a translator for translating one of the sets to a different language.
  - 21. A system according to claim 17, wherein the minimum is determined on the basis of a level of difference between the first and second languages.
  - 22. The system of claim 17, wherein the training cost includes a mis-classification cost associated with each classifier and a disagreement cost between the two classifiers.
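
The comparator of claim 17, combined with the class probabilities of claim 18, might be sketched as below. The choice of symmetric Kullback-Leibler divergence as the disagreement measure and the `threshold` parameter standing in for "the minimum" are assumptions made for illustration; the claims name neither.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two class-probability vectors of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def disagreement_cost(probs_1, probs_2):
    """Mean symmetric KL divergence over aligned document pairs.

    probs_1[i] and probs_2[i] are the class probabilities that the first and
    second classifiers assign to the two language versions of document i.
    """
    total = 0.0
    for p, q in zip(probs_1, probs_2):
        total += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return total / len(probs_1)


def needs_another_pass(probs_1, probs_2, threshold):
    """True while the disagreement is still above the chosen threshold."""
    return disagreement_cost(probs_1, probs_2) > threshold


# Hypothetical outputs for two document pairs and two classes.
agreeing = [[0.9, 0.1], [0.2, 0.8]]
conflicting = [[0.1, 0.9], [0.8, 0.2]]
same_cost = disagreement_cost(agreeing, agreeing)
conflict_cost = disagreement_cost(agreeing, conflicting)
```

In a fuller system the optimizer would call something like `needs_another_pass` after each re-classification and stop adjusting parameters once it returns `False`.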

23. A method for enhancing a performance of a first classifier implemented on a computing device used for classifying a first subset of documents written in a first language, the method comprising:
- a) receiving, at the computing device, a second subset of documents written in a second language different than the first language, said second subset including substantially the same content as the first subset;
  
  b) running the first classifier over the first subset to generate a first classification;
  
  c) running a second classifier over the second subset to generate a second classification;
  
  d) reducing a training cost between the first and second classifications, said reducing comprises repeating steps b) and c) wherein each classifier updates its own classification in view of the classification generated by the other classifier until the training cost is set to a minimum;
  
  the repeating being performed in series wherein one classifier is fixed and the other classifier updates its own classification using the classification of the fixed classifier; and
  
  e) outputting at least one of said first classification and said first classifier.
- - 24. The method of claim 23, wherein the training cost includes a mis-classification cost associated with each classifier and a disagreement cost between the two classifiers.
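
The serial variant of claim 23, with the two-part cost of claim 24, can be illustrated with a small alternating loop. The sketch below assumes binary logistic-regression classifiers, a log-loss mis-classification term and a squared-difference disagreement term weighted by `lam`; none of these choices, nor the function names and toy data, come from the patent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability of the positive class under a linear model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def update_one_view(weights, xs, ys, other_probs, lam, lr, steps):
    """Gradient steps for one classifier while the other's outputs stay fixed."""
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y, q in zip(xs, ys, other_probs):
            p = predict(weights, x)
            # Derivative of log-loss + lam * (p - q)^2 w.r.t. the linear score.
            dz = (p - y) + 2.0 * lam * (p - q) * p * (1.0 - p)
            for j, xj in enumerate(x):
                grad[j] += dz * xj
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad)]
    return weights


def total_cost(w1, w2, xs1, xs2, ys, lam):
    """Mean mis-classification cost of both classifiers plus disagreement."""
    eps = 1e-12
    cost = 0.0
    for x1, x2, y in zip(xs1, xs2, ys):
        p1, p2 = predict(w1, x1), predict(w2, x2)
        cost -= y * math.log(p1 + eps) + (1 - y) * math.log(1 - p1 + eps)
        cost -= y * math.log(p2 + eps) + (1 - y) * math.log(1 - p2 + eps)
        cost += lam * (p1 - p2) ** 2
    return cost / len(ys)


def co_train_serial(xs1, xs2, ys, lam=0.5, lr=0.5, steps=20,
                    tol=1e-6, max_rounds=100):
    w1 = [0.0] * len(xs1[0])
    w2 = [0.0] * len(xs2[0])
    previous = total_cost(w1, w2, xs1, xs2, ys, lam)
    for _ in range(max_rounds):
        # Classifier 2 is fixed; classifier 1 updates against its outputs.
        w1 = update_one_view(w1, xs1, ys, [predict(w2, x) for x in xs2],
                             lam, lr, steps)
        # The updated classifier 1 is fixed; classifier 2 takes its turn.
        w2 = update_one_view(w2, xs2, ys, [predict(w1, x) for x in xs1],
                             lam, lr, steps)
        current = total_cost(w1, w2, xs1, xs2, ys, lam)
        if abs(previous - current) < tol:
            break  # the cost has stopped changing
        previous = current
    return w1, w2, current


# Toy paired data: the two views have different dimensionality, shared labels.
xs1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
xs2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
ys = [1, 0, 1, 0]
w1, w2, final_cost = co_train_serial(xs1, xs2, ys)
```

The loop stops when the cost changes by less than `tol` between rounds or when `max_rounds` is reached, which is one practical stand-in for "until the training cost is set to a minimum".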

25. A system for classifying content-equivalent documents written in different languages, said system comprising:
- a first classifier for classifying a first set of documents written in a first language to generate a first classification;
  
  a second classifier for classifying a second set of documents written in a second language different than the first language to generate a second classification;
  
  a comparator operatively connected to outputs of said first and second classifiers for detecting a training cost between said first and second classifications; and
  
  an optimizer for adjusting parameters of said first and second classifiers based on the second and first classifications respectively, when the training cost is higher than a minimum;
  
  wherein the optimizer orders the first and second classifiers to re-classify the first and second sets of documents until the training cost reaches the minimum, wherein the re-classification is performed in series wherein one classifier is fixed and the other classifier updates its own classification using the classification of the fixed classifier.
- - 26. The system of claim 25, wherein the training cost includes a mis-classification cost associated with each classifier and a disagreement cost between the two classifiers.

Current Assignee: National Research Council Canada
Original Assignee: National Research Council Canada
Inventors: Goutte, Cyril; Amini, Massih
Primary Examiner(s): ALBERTALLI, BRIAN LOUIS
Application Number: US12/909,389
Publication Number: US 20110098999A1
Time in Patent Office: 929 Days
Field of Search: None
US Class Current: 704/8
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 40/42 Data-driven translation
