Semi-supervised data integration model for named entity classification
First Claim
1. A method for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the method comprising:
comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
in view of the positive training seed set, populating a decision tree;
in view of populating the decision tree, creating classification rules for classifying the named entity candidates;
sampling a number of entities from the named entity candidates;
in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and
in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.
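The steps of the claim can be illustrated with a minimal Python sketch of a single refinement round. This is a sketch under stated assumptions, not the patented implementation: entities are assumed to carry boolean features, "commonality" with the auxiliary repository is assumed to mean name membership, a non-empty starting negative seed set is assumed so that the first tree has something to split on, and only candidates not already in a seed set are sampled. The helper names (`build_tree`, `tree_to_rules`, `refine_once`) and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of boolean labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def build_tree(rows, features):
    """rows: non-empty list of (feature_dict, bool_label).
    Returns a bool leaf or a (feature, yes_subtree, no_subtree) tuple."""
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def cost(f):
        # Weighted impurity of the two branches produced by testing feature f.
        yes = [y for x, y in rows if x[f]]
        no = [y for x, y in rows if not x[f]]
        return (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)

    best = min(features, key=cost)
    yes = [(x, y) for x, y in rows if x[best]]
    no = [(x, y) for x, y in rows if not x[best]]
    if not yes or not no:
        return majority
    rest = [f for f in features if f != best]
    return (best, build_tree(yes, rest), build_tree(no, rest))

def tree_to_rules(tree, path=()):
    """One rule per root-to-leaf path that ends in a positive leaf."""
    if not isinstance(tree, tuple):
        return [path] if tree else []
    feature, yes, no = tree
    return (tree_to_rules(yes, path + ((feature, True),))
            + tree_to_rules(no, path + ((feature, False),)))

def is_positive(rules, x):
    """An entity is positive if it satisfies every test of at least one rule."""
    return any(all(bool(x[f]) == v for f, v in rule) for rule in rules)

def fit_rules(candidates, positive, negative, features):
    rows = ([(candidates[n], True) for n in sorted(positive)]
            + [(candidates[n], False) for n in sorted(negative)])
    return tree_to_rules(build_tree(rows, features))

def refine_once(candidates, positive, negative, auxiliary, features, sample_size, rng):
    """One round: fit rules, sample, label, update both seed sets, refit."""
    rules = fit_rules(candidates, positive, negative, features)
    # Assumption: sample only candidates that are not yet in either seed set.
    unlabeled = sorted(set(candidates) - positive - negative)
    sampled = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
    positive, negative = set(positive), set(negative)
    for name in sampled:
        if is_positive(rules, candidates[name]):
            if name in auxiliary:          # positive example confirmed by the auxiliary repository
                positive.add(name)
        elif name not in auxiliary:        # negative example lacking commonality
            negative.add(name)
    return positive, negative, fit_rules(candidates, positive, negative, features)

# Hypothetical toy data: "suffix" = name ends in a company suffix, "cap" = capitalized.
candidates = {
    "Acme Inc": {"suffix": True, "cap": True},
    "Globex Inc": {"suffix": True, "cap": True},
    "Initech Inc": {"suffix": True, "cap": True},
    "the weather": {"suffix": False, "cap": False},
    "yesterday": {"suffix": False, "cap": False},
    "Monday": {"suffix": False, "cap": True},
}
pos, neg, rules = refine_once(
    candidates, {"Acme Inc"}, {"the weather"},
    auxiliary={"Globex Inc", "Initech Inc"},
    features=["suffix", "cap"], sample_size=4, rng=random.Random(0),
)
```

In this toy run all four unlabeled candidates are sampled, so the outcome does not depend on the random seed. Sampled entities that the rules label positive but that are absent from the auxiliary repository are left unlabeled, which follows the claim's wording that only examples with commonality enter the positive seed set.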
Abstract
According to one embodiment, a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data is provided. Training data are compared to named entity candidates taken from the first repository to form a positive training seed set. A decision tree is populated and classification rules are created for classifying the named entity candidates. A number of entities are sampled from the named entity candidates. The sampled entities are labeled as positive examples and/or negative examples. The positive training seed set is updated to include identified commonality between the positive examples and the auxiliary repository. A negative training seed set is updated to include negative examples which lack commonality with the auxiliary repository. In view of both the updated positive and negative training seed sets, the decision tree and the classification rules are updated.
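As an illustration of the first step in the abstract, the sketch below forms a positive training seed set by comparing training names against candidate names. The source does not say how "commonality" is determined; exact matching on a case- and whitespace-normalized name is an assumption made here, and the function names are hypothetical.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())

def form_positive_seed(training_names, candidate_names):
    """Return the candidates whose normalized name also appears in the training data.
    Assumption: commonality means an exact match after normalization."""
    known = {normalize(n) for n in training_names}
    return {c for c in candidate_names if normalize(c) in known}

seed = form_positive_seed(
    ["ACME  Inc", "Globex Inc"],
    ["Acme Inc", "Initech Inc", "globex inc"],
)
```

A real system would likely need a looser comparison (aliases, abbreviations, fuzzy matching); the exact-match rule is only the simplest choice that makes the step concrete.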
33 Citations
20 Claims
1. A method for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the method comprising:
comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
in view of the positive training seed set, populating a decision tree;
in view of populating the decision tree, creating classification rules for classifying the named entity candidates;
sampling a number of entities from the named entity candidates;
in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and
in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer program product for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the computer program product comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code being executable by a computer to perform a method comprising:
comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
in view of the positive training seed set, populating a decision tree;
in view of populating the decision tree, creating classification rules for classifying the named entity candidates;
sampling a number of entities from the named entity candidates;
in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and
in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.
Dependent claims: 9, 10, 11, 12, 13
14. A system for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the system comprising:
memory having computer readable computer instructions; and
a processor for executing the computer readable instructions to perform a method comprising:
comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
in view of the positive training seed set, populating a decision tree;
in view of populating the decision tree, creating classification rules for classifying the named entity candidates;
sampling a number of entities from the named entity candidates;
in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and
in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.
Dependent claims: 15, 16, 17, 18, 19
20. A method for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the method comprising:
comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
in view of the positive training seed set, creating classification rules for classifying the named entity candidates;
sampling a number of entities from the named entity candidates;
in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository;
in view of both the updated positive and negative training seed sets, updating the classification rules;
determining a change in a number of rules between the classification rules and the updated classification rules;
repeating the sampling, the labeling of the sampled entities, the updating of the positive and negative training seed sets, and the updating of the classification rules until the change in the number of rules between iterations is less than a threshold amount; and
applying the updated classification rules to the named entity candidates to produce a set of classified named entities.
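The stopping condition in claim 20 can be sketched as a loop that compares the number of rules before and after each round. This is a minimal sketch, not the claimed implementation: `refine` is a hypothetical callable standing in for one full sample/label/update round, and the `max_rounds` safety cap is an added assumption that does not appear in the claim.

```python
def iterate_until_stable(rules, refine, threshold, max_rounds=50):
    """Repeat refinement rounds until the rule count changes by less than `threshold`.

    rules:      initial list of classification rules
    refine:     hypothetical callable taking the current rules and returning updated rules
    threshold:  stop once |len(updated) - len(previous)| < threshold
    max_rounds: assumed safety cap so the loop always terminates
    """
    for _ in range(max_rounds):
        updated = refine(rules)
        change = abs(len(updated) - len(rules))
        rules = updated
        if change < threshold:
            break
    return rules

def fake_refine(rules):
    """Stand-in round for demonstration: adds 3 rules, then 1, then none."""
    growth = {1: 3, 4: 1}.get(len(rules), 0)
    return rules + [f"rule{len(rules) + i}" for i in range(growth)]

final_rules = iterate_until_stable(["rule0"], fake_refine, threshold=2)
```

With a threshold of 2, the loop continues after the first round (change of 3) and stops after the second (change of 1). The final rules would then be applied to all named entity candidates to produce the classified set, as the last step of the claim describes.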
Specification