TEXT CATEGORIZATION BASED ON CO-CLASSIFICATION LEARNING FROM MULTILINGUAL CORPORA

US 20110098999A1
Filed: 10/21/2010
Published: 04/28/2011
Est. Priority Date: 10/22/2009
Status: Active Grant

Abstract

The present document describes a method and a system for generating classifiers from multilingual corpora including subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention utilize this similarity in order to enhance the accuracy of the classification in one language based on the classification results in the other language, and vice versa. A system in accordance with the present embodiments implements a method which comprises generating a first classifier from a first subset of the corpora in a first language; generating a second classifier from a second subset of the corpora in a second language; and re-training each of the classifiers on its respective subset based on the classification results of the other classifier, until a training cost between the classification results produced by subsequent iterations reaches a local minimum.

51 Citations

22 Claims

1. A method for enhancing a performance of a first classifier used for classifying a first subset of documents written in a first language, the method comprising:
- a) providing a second subset of documents written in a second language different than the first language, said second subset including substantially the same content as the first subset;
  
  b) running the first classifier over the first subset to generate a first classification;
  
  c) running a second classifier over the second subset to generate a second classification;
  
  d) reducing a training cost between the first and second classifications, said reducing comprises repeating steps b) and c) wherein each classifier updates its own classification in view of the classification generated by the other classifier until the training cost is set to a minimum; and
  
  e) outputting at least one of said first classification and said first classifier.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16):
- - 2. The method of claim 1, wherein reducing further comprises updating one classification based on a probability associated with each class in the other classification.
  - 3. The method of claim 2, wherein updating comprises reducing classification errors.
  - 4. The method of claim 2, wherein the training cost includes a mis-classification cost associated with each classifier and a disagreement cost between the two classifiers.
  - 5. The method of claim 2, wherein reducing comprises adjusting parameters of each classifier to reduce the training cost between classifications.
  - 6. The method of claim 1, wherein reducing comprises applying a gradient based algorithm for reducing the training cost between classifications.
  - 7. The method of claim 1, wherein reducing comprises applying an analytical algorithm for finding an approximate solution that reduces classification losses to reduce the training cost between classifications.
  - 8. The method of claim 1, wherein each classifier updates its own classification in view of the latest version of updated classification generated by the other classifier.
  - 9. The method of claim 8, wherein repeating is performed at least partially in parallel by the first and second classifiers.
  - 10. The method of claim 8, wherein repeating is performed in series wherein one classifier is fixed and the other classifier updates its own classification using the classification of the fixed classifier.
  - 11. The method of claim 1, wherein providing the second subset comprises machine-translating said first subset into the second language.
  - 12. The method of claim 1, wherein providing the second subset comprises providing a subset which is comparable to the first subset.
  - 13. The method of claim 1, wherein providing the second subset comprises providing a subset which is a parallel translation of the first subset.
  - 14. The method of claim 1, wherein the minimum is determined on the basis of a level of difference between the first and second languages.
  - 16. A computer readable memory having recorded thereon statements and instructions for execution by a processor for implementing the method of claim 1.
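
The loop recited in claims 1, 4, 6 and 10 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes binary labels, two logistic-regression classifiers, a squared difference between predicted probabilities as the disagreement cost, and plain gradient descent. The names `training_cost`, `update`, `co_classify` and the parameters `lam`, `lr`, `tol` are hypothetical and chosen only for illustration.

```python
import math

def predict(w, x):
    """Probability of the positive class under a linear model with weights w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(p, t, eps=1e-12):
    """Misclassification cost of one prediction p for a binary label t."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))

def training_cost(w1, w2, X1, X2, y, lam):
    """Assumed cost: misclassification of each classifier plus disagreement."""
    total = 0.0
    for x1, x2, t in zip(X1, X2, y):
        p1, p2 = predict(w1, x1), predict(w2, x2)
        total += log_loss(p1, t) + log_loss(p2, t) + lam * (p1 - p2) ** 2
    return total / len(y)

def update(w, X, y, other_probs, lam, lr):
    """One gradient step for one classifier while the other is held fixed."""
    grad = [0.0] * len(w)
    for x, t, q in zip(X, y, other_probs):
        p = predict(w, x)
        # Log-loss term contributes (p - t); the disagreement term
        # lam * (p - q)^2 contributes 2 * lam * (p - q) * p * (1 - p).
        g = (p - t) + 2.0 * lam * (p - q) * p * (1.0 - p)
        for j, xj in enumerate(x):
            grad[j] += g * xj
    return [wj - lr * gj / len(y) for wj, gj in zip(w, grad)]

def co_classify(X1, X2, y, lam=1.0, lr=0.5, tol=1e-6, max_iter=1000):
    """Alternate updates until the training cost stops decreasing."""
    w1 = [0.0] * len(X1[0])
    w2 = [0.0] * len(X2[0])
    prev = training_cost(w1, w2, X1, X2, y, lam)
    cost = prev
    for _ in range(max_iter):
        # Each classifier uses the latest probabilities of the other one.
        w1 = update(w1, X1, y, [predict(w2, x) for x in X2], lam, lr)
        w2 = update(w2, X2, y, [predict(w1, x) for x in X1], lam, lr)
        cost = training_cost(w1, w2, X1, X2, y, lam)
        if abs(prev - cost) < tol:
            break
        prev = cost
    return w1, w2, cost

# Toy data: row i of X1 and row i of X2 stand for two language versions
# of the same document, so they share the label y[i].
X1 = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
X2 = [[1, 2, 0], [1, 0, 2], [1, 1, 0], [1, 0, 1]]
y = [1, 0, 1, 0]
w1, w2, final_cost = co_classify(X1, X2, y)
print(final_cost)
```

In this sketch the weight `lam` controls how strongly the two classifiers are pushed to agree; setting it to zero reduces the procedure to training each classifier independently.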

15. A method for generating classifiers from multilingual corpora, the method comprising:
- extracting textual data from each one of a set of documents which form part of the multilingual corpora, the multilingual corpora comprising a first and a second subset of content-equivalent documents written in one of two respective languages;
  
  transforming the textual data into a respective one of feature vectors x1 and x2, each one of the feature vectors being associated to a document classification y for categorizing different language versions of a same document;
  
  generating a first classifier f1 from the first subset, the first classifier f1 being associated to the feature vector x1;
  
  generating a second classifier f2 from the second subset, the second classifier f2 being associated to the feature vector x2;
  
  re-training the first classifier f1 on the first subset based on classification results obtained from the second classifier f2, to obtain a re-trained first classifier f1′;
  
  re-training the second classifier f2 on the second subset based on other classification results obtained from the re-trained first classifier f1′, to obtain a re-trained second classifier f2′;
  
  repeating the steps of re-training until a training cost between the re-trained first and second classifiers is minimized, thereby producing final first and second re-trained classifiers; and
  
  outputting the final first and second re-trained classifiers.
- Dependent claims (17):
- - 17. A computer readable memory having recorded thereon statements and instructions for execution by a processor for implementing the method of claim 15.
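
The extraction and transformation steps of claim 15 can be pictured with a small sketch. The claim does not prescribe a feature representation; the code below assumes whitespace tokenization and raw term counts, and the helper names `build_vocabulary` and `to_feature_vector` are hypothetical. Each document pair yields one vector x1 in the first language and one vector x2 in the second, both tied to the same label y.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_feature_vector(text, vocabulary):
    """Term-count vector of `text` over the columns of `vocabulary`."""
    counts = Counter(text.lower().split())
    # Dicts preserve insertion order, so columns follow the vocabulary indices.
    return [counts[tok] for tok in vocabulary]

# Content-equivalent documents: entry i of each list is the same document
# in a different language, so both versions share labels[i].
english = ["the match ended in a draw", "the bank raised interest rates"]
french = ["le match a fini sur un nul", "la banque a relevé ses taux"]
labels = ["sports", "finance"]

vocab1 = build_vocabulary(english)
vocab2 = build_vocabulary(french)
X1 = [to_feature_vector(doc, vocab1) for doc in english]  # feature vectors x1
X2 = [to_feature_vector(doc, vocab2) for doc in french]   # feature vectors x2

for x1, x2, label in zip(X1, X2, labels):
    print(label, x1, x2)
```

The two vocabularies are built separately, so x1 and x2 generally have different dimensions; only the label is shared across the two views.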

18. A system for classifying content-equivalent documents written in different languages, said system comprising:
- a first classifier for classifying a first set of documents written in a first language to generate a first classification;
  
  a second classifier for classifying a second set of documents written in a second language different than the first language to generate a second classification;
  
  a comparator operatively connected to outputs of said first and second classifiers for detecting a training cost between said first and second classifications; and
  
  an optimizer for adjusting parameters of said first and second classifiers based on the second and first classifications respectively, when the training cost is higher than a minimum;
  
  wherein the optimizer orders the first and second classifiers to re-classify the first and second sets of documents until the training cost reaches the minimum.
- Dependent claims (19, 20, 21, 22):
- - 19. A system according to claim 18, wherein each classifier updates its own classification based on a probability associated with each class in the other classification.
  - 20. A system according to claim 18, wherein one of the first and second sets is a machine-translation of the other.
  - 21. A system according to claim 20, wherein the system comprises a translator for translating one of the sets to a different language.
  - 22. A system according to claim 18, wherein the minimum is determined on the basis of a level of difference between the first and second languages.
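
The comparator and optimizer arrangement of claim 18 can be sketched structurally as follows. This is an illustration under stated assumptions, not the claimed system: the comparator here measures only disagreement (mean total-variation distance between class-probability distributions), and `ToyClassifier`, whose "parameters" are its stored probabilities, is a hypothetical stand-in for a real trainable classifier.

```python
from dataclasses import dataclass

def comparator(c1, c2):
    """Assumed training cost: mean total-variation distance between the
    per-document class distributions of the two classifications."""
    total = 0.0
    for p, q in zip(c1, c2):
        total += 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return total / len(c1)

class ToyClassifier:
    """Hypothetical classifier whose parameters are its class probabilities."""

    def __init__(self, probs):
        self.probs = [list(p) for p in probs]

    def classify(self):
        return [list(p) for p in self.probs]

    def adjust(self, other_classification, rate=0.25):
        # Move each distribution a fraction `rate` toward the other classifier's.
        self.probs = [
            [a + rate * (b - a) for a, b in zip(p, q)]
            for p, q in zip(self.probs, other_classification)
        ]

@dataclass
class Optimizer:
    minimum: float
    max_rounds: int = 100

    def run(self, first, second):
        c1, c2 = first.classify(), second.classify()
        cost = comparator(c1, c2)
        rounds = 0
        while cost > self.minimum and rounds < self.max_rounds:
            first.adjust(c2)   # first classifier uses the second classification
            second.adjust(c1)  # second classifier uses the first classification
            c1, c2 = first.classify(), second.classify()  # re-classify
            cost = comparator(c1, c2)
            rounds += 1
        return c1, c2, cost, rounds

first = ToyClassifier([[0.9, 0.1], [0.2, 0.8]])
second = ToyClassifier([[0.5, 0.5], [0.4, 0.6]])
c1, c2, cost, rounds = Optimizer(minimum=0.01).run(first, second)
print(rounds, cost)
```

The `max_rounds` guard is an addition of this sketch; it only prevents an endless loop if the chosen minimum is never reached.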

Current Assignee: National Research Council Canada
Original Assignee: National Research Council Canada
Inventors: Amini, Massih; Goutte, Cyril
Granted Patent: US 8,438,009 B2
US Class Current: 704/2
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 40/42 Data-driven translation
