Method and apparatus for training a translation disambiguation classifier

US 7,318,022 B2
Filed: 06/12/2003
Issued: 01/08/2008
Est. Priority Date: 06/12/2003
Status: Expired due to Fees

Abstract

A method of training a classifier includes applying a first classifier to a first set of unlabeled data to form a first set of labeled data. The first classifier is able to assign data to classes in a first set of classes. A second classifier is applied to a second set of unlabeled data to form a second set of labeled data. The second classifier is able to assign data to classes in a second set of classes that is different from the first set of classes. The first and second sets of labeled data are used to retrain the first classifier.

29 Claims

1. A method of training a classifier, the method comprising:
- applying a first classifier to a first set of unlabeled data to form a first set of labeled data, the first classifier capable of assigning data to classes in a first set of classes;
- applying a second classifier to a second set of unlabeled data to form a second set of labeled data, the second classifier capable of assigning data to classes in a second set of classes that is different from the first set of classes; and
- using the first set of labeled data and the second set of labeled data to retrain the first classifier to form a retrained classifier that can be used to assign data to classes, retraining comprising:
  - determining a first probability component from the first set of labeled data;
  - determining a second probability component from the second set of labeled data; and
  - combining the first probability component and the second probability component to form a probability term used to define the first classifier.
- Dependent claims (2 to 13):
  - 2. The method of claim 1 wherein the first set of unlabeled data comprises text in a first language and the second set of unlabeled data comprises text in a second language.
  - 3. The method of claim 2 wherein the first set of classes comprises senses for a word in the first language.
  - 4. The method of claim 3 wherein the second set of classes comprises senses for a word in the second language.
  - 5. The method of claim 1 wherein determining the first probability component comprises executing a first algorithm using the first set of labeled data and wherein determining the second probability component comprises executing a second algorithm that operates differently from the first algorithm using the second set of labeled data.
  - 6. The method of claim 5 wherein the first algorithm comprises a maximum likelihood estimation algorithm.
  - 7. The method of claim 6 wherein the second algorithm comprises an expectation maximization algorithm.
  - 8. The method of claim 1 wherein combining the first probability component and the second probability component comprises linearly combining the first probability component and the second probability component.
  - 9. The method of claim 1 further comprising using the first set of labeled data and the second set of labeled data to retrain the second classifier to form a second retrained classifier.
  - 10. The method of claim 9 wherein using the first set of labeled data to retrain the first classifier comprises executing a first algorithm on the first set of labeled data and wherein using the first set of labeled data to retrain the second classifier comprises executing a second algorithm on the first set of labeled data, where the second algorithm operates differently than the first algorithm.
  - 11. The method of claim 10 wherein the first algorithm comprises a maximum likelihood estimation algorithm.
  - 12. The method of claim 11 wherein the second algorithm comprises an expectation maximization algorithm.
  - 13. The method of claim 9 further comprising:
    - applying the retrained classifier to a set of unlabeled data to form a first additional set of labeled data;
    - applying the second retrained classifier to a set of unlabeled data to form a second additional set of labeled data; and
    - using the first additional set of labeled data and the second additional set of labeled data to retrain the retrained classifier.
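
The retraining step recited in claims 1, 8, and 13 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy English/French data, the `class_map` and `feature_map` dictionaries, and the weight `alpha` are all assumptions made for illustration. In particular, the second probability component is computed here by projecting the second set of labeled data into the first set's terms and taking relative frequencies, a simple stand-in for the expectation maximization algorithm named in claim 7.

```python
import math
from collections import Counter, defaultdict

def relative_freq(labeled):
    """Maximum likelihood estimate of P(feature | class) from (features, class) pairs."""
    counts = defaultdict(Counter)
    for features, cls in labeled:
        counts[cls].update(features)
    return {cls: {f: n / sum(c.values()) for f, n in c.items()}
            for cls, c in counts.items()}

def combine(first, second, alpha):
    """Linear combination (1 - alpha) * first + alpha * second, per class and feature."""
    out = {}
    for cls in set(first) | set(second):
        p1, p2 = first.get(cls, {}), second.get(cls, {})
        out[cls] = {f: (1 - alpha) * p1.get(f, 0.0) + alpha * p2.get(f, 0.0)
                    for f in set(p1) | set(p2)}
    return out

def classify(model, features, floor=1e-6):
    """Pick the class with the highest summed log-probability over the features."""
    def score(cls):
        return sum(math.log(model[cls].get(f, floor)) for f in features)
    return max(sorted(model), key=score)

def project(labeled, class_map, feature_map):
    """Hypothetical stand-in: express second-set labels and features in first-set terms."""
    return [([feature_map[f] for f in feats if f in feature_map], class_map[cls])
            for feats, cls in labeled if cls in class_map]

def retrain_first(first_model, second_model, unlabeled_1, unlabeled_2,
                  class_map, feature_map, alpha=0.4):
    """One retraining round: label both data sets, then combine two components."""
    labeled_1 = [(x, classify(first_model, x)) for x in unlabeled_1]
    labeled_2 = [(x, classify(second_model, x)) for x in unlabeled_2]
    component_1 = relative_freq(labeled_1)
    component_2 = relative_freq(project(labeled_2, class_map, feature_map))
    return combine(component_1, component_2, alpha)

# Toy data, invented for this sketch (not taken from the patent).
first_model = {"factory": {"worker": 0.6, "machine": 0.4},
               "flora": {"leaf": 0.5, "water": 0.5}}
second_model = {"usine": {"ouvrier": 1.0},
                "plante": {"feuille": 0.5, "eau": 0.5}}
unlabeled_1 = [["worker", "machine"], ["leaf", "water"], ["worker"]]
unlabeled_2 = [["ouvrier"], ["feuille", "eau"]]
class_map = {"usine": "factory", "plante": "flora"}
feature_map = {"ouvrier": "worker", "feuille": "leaf", "eau": "water"}

retrained = retrain_first(first_model, second_model, unlabeled_1, unlabeled_2,
                          class_map, feature_map)
```

Calling `retrain_first` again with `retrained` in place of `first_model` corresponds loosely to the further round described in claim 13. Retraining the second classifier (claim 9) would need a symmetric function with the mappings reversed, which this sketch omits.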

14. A computer-readable storage medium storing computer-executable instructions for performing steps comprising:
- generating first language labeled data that indicates a sense of at least one word in a first language;
- generating second language labeled data that indicates a sense of at least one word in a second language; and
- using the first language labeled data and the second language labeled data to train a classifier for the first language, where the classifier can be used to identify a sense of a word in the first language, wherein using the first language labeled data and the second language labeled data to train a classifier comprises:
  - determining a first probability component comprising a probability using the first language labeled data;
  - determining a second probability component comprising a probability using the second language labeled data; and
  - combining the first probability component and the second probability component to form a probability term for the classifier.
- Dependent claims (15 to 29):
  - 15. The computer-readable medium of claim 14 wherein generating first language labeled data comprises applying unlabeled data in the first language to a classifier for the first language.
  - 16. The computer-readable storage medium of claim 15 wherein training a classifier for the first language comprises retraining the classifier used to generate the first language labeled data.
  - 17. The computer-readable storage medium of claim 15 wherein generating second language labeled data comprises applying unlabeled data in the second language to a classifier for the second language.
  - 18. The computer-readable storage medium of claim 14 wherein combining the first probability component and the second probability component comprises linearly combining the first probability component and the second probability component.
  - 19. The computer-readable storage medium of claim 14 wherein determining the first probability component comprises using a first algorithm and wherein determining the second probability component comprises using a second algorithm that is different from the first algorithm.
  - 20. The computer-readable storage medium of claim 19 wherein the first algorithm is a maximum likelihood estimation algorithm.
  - 21. The computer-readable storage medium of claim 20 wherein the second algorithm is an expectation maximization algorithm.
  - 22. The computer-readable storage medium of claim 14 further comprising using the classifier trained from the first language labeled data and the second language labeled data to classify unlabeled data in the first language to form additional first language labeled data.
  - 23. The computer-readable storage medium of claim 14 further comprising using the first language labeled data and the second language labeled data to train a classifier for the second language.
  - 24. The computer-readable storage medium of claim 23 wherein using the first language labeled data and the second language labeled data to train a classifier for the second language comprises:
    - determining a first probability component using the first language labeled data;
    - determining a second probability component using the second language labeled data; and
    - combining the first probability component and the second probability component to form a probability term for the classifier for the second language.
  - 25. The computer-readable storage medium of claim 24 wherein combining the first probability component and the second probability component comprises linearly combining the first probability component and the second probability component.
  - 26. The computer-readable storage medium of claim 24 wherein determining the first probability component comprises using a first algorithm and wherein determining the second probability component comprises using a second algorithm that is different from the first algorithm.
  - 27. The computer-readable storage medium of claim 26 wherein the second algorithm is a maximum likelihood estimation algorithm.
  - 28. The computer-readable storage medium of claim 27 wherein the first algorithm is an expectation maximization algorithm.
  - 29. The computer-readable storage medium of claim 23 further comprising using the classifier for the second language to classify unlabeled data in the second language to form additional second language labeled data.
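
Claims 21 and 28 name an expectation maximization algorithm for one of the probability components, but the claims do not spell out the underlying model. The Python sketch below shows one plausible form under stated assumptions: for a single sense, second-language context words are treated as translations of hidden first-language words, the translation probabilities are fixed and uniform, and only the distribution over first-language words is re-estimated. The function name `em_component`, the `translations` dictionary, and the toy counts are hypothetical.

```python
from collections import Counter

def em_component(observed, translations, iterations=20):
    """Estimate P(e) over first-language words from second-language word counts.

    observed: Counter of second-language words seen in contexts labeled with one sense.
    translations: dict mapping each first-language word e to its candidate
        second-language translations (a hypothetical bilingual dictionary).
    """
    # Fixed, uniform translation model P(c | e); a simplifying assumption.
    p_c_given_e = {e: {c: 1.0 / len(cs) for c in cs}
                   for e, cs in translations.items()}
    # Start from a uniform distribution over first-language words.
    p_e = {e: 1.0 / len(translations) for e in translations}
    for _ in range(iterations):
        expected = dict.fromkeys(p_e, 0.0)
        for c, n in observed.items():
            # E-step: posterior over which first-language word produced c.
            joint = {e: p_e[e] * p_c_given_e[e].get(c, 0.0) for e in p_e}
            z = sum(joint.values())
            if z == 0.0:
                continue  # c has no dictionary entry, so it is skipped
            for e, j in joint.items():
                expected[e] += n * j / z
        # M-step: renormalize the expected counts into probabilities.
        norm = sum(expected.values())
        if norm == 0.0:
            break  # nothing matched the dictionary; keep the current estimate
        p_e = {e: v / norm for e, v in expected.items()}
    return p_e

# Toy data, invented for this sketch: "ouvrier" is ambiguous between two entries.
translations = {"worker": ["ouvrier"],
                "labor": ["ouvrier", "travail"],
                "machine": ["machine"]}
observed = Counter({"ouvrier": 2, "travail": 2, "machine": 1})
p_e = em_component(observed, translations)
```

Because "travail" can only come from "labor", the iterations shift probability mass for the ambiguous word "ouvrier" toward "labor" as well. The resulting distribution could serve as the second-language component that is then linearly combined with a maximum likelihood component, as claims 18 and 25 describe.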

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Li, Hang
Primary Examiner(s): Edouard, Patrick N.
Assistant Examiner(s): Godbold, Douglas
Application Number: US10/459,816
Publication Number: US 20040254782A1
Time in Patent Office: 1,671 Days
Field of Search: 704/2, 704/8, 704/9, 704/10
US Class Current: 704/10
CPC Class Codes: G06F 40/30 Semantic analysis
