Semi-supervised data integration model for named entity classification

US 9,292,797 B2
Filed: 12/14/2012
Issued: 03/22/2016
Est. Priority Date: 12/14/2012
Status: Active Grant

Abstract

According to one embodiment, a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data is provided. Training data are compared to named entity candidates taken from the first repository to form a positive training seed set. A decision tree is populated and classification rules are created for classifying the named entity candidates. A number of entities are sampled from the named entity candidates. The sampled entities are labeled as positive examples and/or negative examples. The positive training seed set is updated to include identified commonality between the positive examples and the auxiliary repository. A negative training seed set is updated to include negative examples which lack commonality with the auxiliary repository. In view of both the updated positive and negative training seed sets, the decision tree and the classification rules are updated.
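The seed-set bookkeeping described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: "commonality" is approximated here by case-insensitive name equality, and the function names `form_positive_seeds` and `update_seed_sets`, the example entities, and the labels are all hypothetical.

```python
def form_positive_seeds(training_names, candidates):
    """Positive seeds: candidates whose names also appear in the training data.

    Assumption: commonality means case-insensitive name equality.
    """
    known = {name.lower() for name in training_names}
    return {c for c in candidates if c.lower() in known}


def update_seed_sets(labeled, auxiliary, positive_seeds, negative_seeds):
    """Update both seed sets from rule-labeled samples.

    A positive example joins the positive seeds only if the auxiliary
    repository also contains it; a negative example joins the negative
    seeds only if the auxiliary repository does not contain it.
    """
    aux = {name.lower() for name in auxiliary}
    for entity, is_positive in labeled.items():
        in_aux = entity.lower() in aux
        if is_positive and in_aux:
            positive_seeds.add(entity)
        elif not is_positive and not in_aux:
            negative_seeds.add(entity)
    return positive_seeds, negative_seeds


# Hypothetical data: candidates from the first repository and one training name.
candidates = ["Aspirin", "Ibuprofen", "Boston", "Chicago", "Codeine"]
positive = form_positive_seeds(["aspirin"], candidates)

# Hypothetical labels that the current classification rules assigned to a sample.
labeled = {"Ibuprofen": True, "Boston": False, "Chicago": True, "Codeine": False}
auxiliary = ["ibuprofen", "codeine"]
positive, negative = update_seed_sets(labeled, auxiliary, positive, set())
```

In this sketch, "Chicago" is labeled positive but is absent from the auxiliary repository, and "Codeine" is labeled negative but is present in it, so neither one enters a seed set. Only labels that the auxiliary data corroborates are kept for retraining.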

33 Citations

20 Claims

1. A method for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the method comprising:
- comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
  
  in view of the positive training seed set, populating a decision tree;
  
  in view of populating the decision tree, creating classification rules for classifying the named entity candidates;
  
  sampling a number of entities from the named entity candidates;
  
  in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
  
  in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
  
  in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and
  
  in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.
- - 2. The method of claim 1, comprising:
    - repeating the sampling, the labeling of the sampled entities, the updating of the positive and negative training seed sets, and the updating of the decision tree and the classification rules until a threshold condition is met, the threshold condition comprising one of:
      
      a maximum number of iterations and a change in a number of rules in the classification rules between iterations.
  - 3. The method of claim 1, comprising:
    - performing the method for each of a plurality of named entity types to determine the classification rules for each of the named entity types, wherein the training data comprise a plurality of data sources comprising only positive examples associated with each of the plurality of named entity types.
  - 4. The method of claim 1, comprising:
    - removing aliases from the first repository to determine the named entity candidates;
      
      eliminating common stop words and non-content-bearing words from candidate entity content of the named entity candidates;
      
      populating a feature dictionary in view of high frequency words in the candidate entity content of the named entity candidates; and
      
      representing the candidate entity content as a vector space model by applying weights to each word of the feature dictionary in the candidate entity content.
  - 5. The method of claim 1, comprising:
    - preprocessing the auxiliary repository to remove false positive examples.
  - 6. The method of claim 1, comprising:
    - applying a plurality of resolution rules to identify both exact matches and similar matches.
  - 7. The method of claim 1, wherein the decision tree comprises a plurality of tree nodes, the method comprising:
    - determining whether to grow the decision tree by splitting one or more of the tree nodes into child nodes; and
      
      determining whether to prune the decision tree to remove child nodes that lack a meaningful distinction between them.
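The preprocessing recited in claim 4 (stop-word removal, a feature dictionary built from high-frequency words, and a vector space representation) could look roughly like the following Python sketch. The stop-word list, the use of raw term frequency as the weight, and the names `tokenize`, `build_feature_dictionary`, and `to_vector` are assumptions made for illustration; the claim does not specify a weighting scheme.

```python
from collections import Counter

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "for", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]


def build_feature_dictionary(documents, size):
    """Keep the `size` most frequent content words across all candidate documents."""
    counts = Counter(w for doc in documents for w in tokenize(doc))
    return [word for word, _ in counts.most_common(size)]


def to_vector(text, dictionary):
    """Represent a document as term-frequency weights over the feature dictionary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in dictionary]


# Hypothetical candidate entity content.
documents = [
    "aspirin is a drug for pain",
    "ibuprofen is a drug",
    "boston is a city in the usa",
]
dictionary = build_feature_dictionary(documents, size=3)
vector = to_vector("a drug of the drug", dictionary)
```

A TF-IDF weight would be a natural alternative to raw counts; raw counts are used here only to keep the sketch short.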

8. A computer program product for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the computer program product comprising:
- a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code being executable by a computer to perform a method comprising:
  
  comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
  
  in view of the positive training seed set, populating a decision tree;
  
  in view of populating the decision tree, creating classification rules for classifying the named entity candidates;
  
  sampling a number of entities from the named entity candidates;
  
  in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
  
  in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
  
  in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and
  
  in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.
- - 9. The computer program product of claim 8, comprising:
    - repeating the sampling, the labeling of the sampled entities, the updating of the positive and negative training seed sets, and the updating of the decision tree and the classification rules until a threshold condition is met, the threshold condition comprising one of:
      
      a maximum number of iterations and a change in a number of rules in the classification rules between iterations.
  - 10. The computer program product of claim 8, comprising:
    - performing the method for each of a plurality of named entity types to determine the classification rules for each of the named entity types, wherein the training data comprise a plurality of data sources comprising only positive examples associated with each of the plurality of named entity types.
  - 11. The computer program product of claim 8, comprising:
    - removing aliases from the first repository to determine the named entity candidates;
      
      eliminating common stop words and non-content-bearing words from candidate entity content of the named entity candidates;
      
      populating a feature dictionary in view of high frequency words in the candidate entity content of the named entity candidates;
      
      representing the candidate entity content as a vector space model by applying weights to each word of the feature dictionary in the candidate entity content; and
      
      preprocessing the auxiliary repository to remove false positive examples.
  - 12. The computer program product of claim 8, comprising:
    - applying a plurality of resolution rules to identify both exact matches and similar matches.
  - 13. The computer program product of claim 8, wherein the decision tree comprises a plurality of tree nodes, and the method further comprising:
    - determining whether to grow the decision tree by splitting one or more of the tree nodes into child nodes; and
      
      determining whether to prune the decision tree to remove child nodes that lack a meaningful distinction between them.
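The resolution rules of claims 6 and 12, which identify both exact and similar matches, might be approximated as in the sketch below. The normalization steps, the 0.85 similarity threshold, and the function names are assumptions; the claims do not state which rules or similarity measure are used. The standard-library `difflib.SequenceMatcher` stands in for whatever measure an actual implementation would apply.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Assumed normalization rule: lowercase, treat hyphens as spaces, collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def resolve(candidate, reference_names, threshold=0.85):
    """Return ("exact" | "similar" | None, matched reference name or None)."""
    cand = normalize(candidate)
    best_name, best_score = None, 0.0
    for ref in reference_names:
        norm = normalize(ref)
        if norm == cand:
            return "exact", ref  # rule 1: identical after normalization
        score = SequenceMatcher(None, cand, norm).ratio()
        if score > best_score:
            best_name, best_score = ref, score
    if best_score >= threshold:
        return "similar", best_name  # rule 2: close string match
    return None, None


# Hypothetical names for illustration.
exact = resolve("Acetyl-salicylic  acid", ["acetyl salicylic acid"])
similar = resolve("ibuprofin", ["ibuprofen", "boston"])
missing = resolve("chicago", ["ibuprofen", "boston"])
```

A lower threshold admits more similar matches at the cost of more false positives, which is one reason the auxiliary repository may need the false-positive preprocessing recited in claims 5, 11, and 17.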

14. A system for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the system comprising:
- memory having computer readable computer instructions; and
  
  a processor for executing the computer readable instructions to perform a method comprising:
  
  comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
  
  in view of the positive training seed set, populating a decision tree;
  
  in view of populating the decision tree, creating classification rules for classifying the named entity candidates;
  
  sampling a number of entities from the named entity candidates;
  
  in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
  
  in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
  
  in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository; and
  
  in view of both the updated positive and negative training seed sets, updating the decision tree and the classification rules.
- - 15. The system of claim 14, comprising:
    - repeating the sampling, the labeling of the sampled entities, the updating of the positive and negative training seed sets, and the updating of the decision tree and the classification rules until a threshold condition is met, the threshold condition comprising one of:
      
      a maximum number of iterations and a change in a number of rules in the classification rules between iterations.
  - 16. The system of claim 14, comprising:
    - performing the method for each of a plurality of named entity types to determine the classification rules for each of the named entity types, wherein the training data comprise a plurality of data sources comprising only positive examples associated with each of the plurality of named entity types.
  - 17. The system of claim 14, comprising:
    - removing aliases from the first repository to determine the named entity candidates;
      
      eliminating common stop words and non-content-bearing words from candidate entity content of the named entity candidates;
      
      populating a feature dictionary in view of high frequency words in the candidate entity content of the named entity candidates;
      
      representing the candidate entity content as a vector space model by applying weights to each word of the feature dictionary in the candidate entity content; and
      
      preprocessing the auxiliary repository to remove false positive examples.
  - 18. The system of claim 14, comprising:
    - applying a plurality of resolution rules to identify both exact matches and similar matches.
  - 19. The system of claim 14, wherein the decision tree comprises a plurality of tree nodes, and the method further comprising:
    - determining whether to grow the decision tree by splitting one or more of the tree nodes into child nodes; and
      
      determining whether to prune the decision tree to remove child nodes that lack a meaningful distinction between them.
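The grow and prune decisions recited in claims 7, 13, and 19 can be illustrated with a small Python sketch. The criteria used here are assumptions, since the claims do not define them: a node is split only if it holds mixed labels and at least `min_samples` examples, and sibling leaves are treated as lacking a meaningful distinction when they all predict the same majority label. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    positives: int
    negatives: int
    children: list = field(default_factory=list)

    def majority(self):
        """Predicted label of this node: True when positives are at least as many as negatives."""
        return self.positives >= self.negatives


def should_split(node, min_samples=4):
    """Grow rule (assumed): split only mixed nodes that hold enough examples."""
    mixed = node.positives > 0 and node.negatives > 0
    return mixed and node.positives + node.negatives >= min_samples


def prune(node):
    """Prune rule (assumed): bottom-up, drop sibling leaves that all predict the same label."""
    for child in node.children:
        prune(child)
    all_leaves = bool(node.children) and all(not c.children for c in node.children)
    if all_leaves and len({c.majority() for c in node.children}) == 1:
        node.children = []
    return node


# Hypothetical tree: the left subtree's two leaves both predict "positive".
tree = Node(6, 4, [
    Node(5, 1, [Node(3, 0), Node(2, 1)]),
    Node(1, 3),
])
prune(tree)
```

After pruning, the left child becomes a leaf because its children agreed, while the root keeps both children because they predict different labels.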

20. A method for providing a semi-supervised data integration model for named entity classification from a first repository of entity information in view of an auxiliary repository of classification assistance data, the method comprising:
- comparing training data to named entity candidates taken from the first repository, thereby forming a positive training seed set in view of identified commonality between the training data and the named entity candidates;
  
  in view of the positive training seed set, creating classification rules for classifying the named entity candidates;
  
  sampling a number of entities from the named entity candidates;
  
  in view of the classification rules, labeling the sampled entities as positive examples and/or negative examples;
  
  in view of the positive examples and the auxiliary repository, updating the positive training seed set to include identified commonality between the positive examples and the auxiliary repository;
  
  in view of the negative examples and the auxiliary repository, updating a negative training seed set to include negative examples which lack commonality with the auxiliary repository;
  
  in view of both the updated positive and negative training seed sets, updating the classification rules;
  
  determining a change in a number of rules between the classification rules and the updated classification rules;
  
  repeating the sampling, the labeling of the sampled entities, the updating of the positive and negative training seed sets, and the updating of the classification rules until the change in the number of rules between iterations is less than a threshold amount; and
  
  applying the updated classification rules to the named entity candidates to produce a set of classified named entities.
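The iteration in claim 20, which repeats until the number of rules changes by less than a threshold, can be sketched end to end in Python. Several parts are stand-ins and not taken from the patent: the rule learner simply collects words seen in positive seeds but not in negative seeds, commonality with the auxiliary repository is plain name membership, and `max_iter` is an added safety cap. All names and data are hypothetical.

```python
import random


def learn_rules(content, positives, negatives):
    """Stand-in learner: a rule is any word seen in positive seeds but not in negative seeds."""
    pos_words = {w for name in positives for w in content[name].split()}
    neg_words = {w for name in negatives for w in content[name].split()}
    return pos_words - neg_words


def classify(text, rules):
    """An entity is positive if its content contains at least one rule word."""
    return any(word in rules for word in text.split())


def bootstrap(content, seeds, auxiliary, sample_size, min_change=1, max_iter=20, seed=0):
    rng = random.Random(seed)
    positives, negatives = set(seeds), set()
    rules = learn_rules(content, positives, negatives)
    for _ in range(max_iter):  # safety cap (assumption, not part of claim 20)
        for name in rng.sample(sorted(content), min(sample_size, len(content))):
            is_positive = classify(content[name], rules)
            if is_positive and name in auxiliary:
                positives.add(name)
            elif not is_positive and name not in auxiliary:
                negatives.add(name)
        updated = learn_rules(content, positives, negatives)
        change = abs(len(updated) - len(rules))
        rules = updated
        if change < min_change:  # stop when the rule count has stabilized
            break
    classified = {name for name, text in content.items() if classify(text, rules)}
    return classified, rules


# Hypothetical candidate entities and their content.
content = {
    "aspirin": "drug pain relief",
    "ibuprofen": "drug inflammation relief",
    "codeine": "opioid pain cough",
    "boston": "city massachusetts",
    "chicago": "city illinois",
}
auxiliary = {"aspirin", "ibuprofen", "codeine"}
classified, rules = bootstrap(content, {"aspirin"}, auxiliary, sample_size=len(content))
```

Sampling every candidate in each round, as in this call, makes the run independent of the random seed; with a smaller `sample_size` the loop may stop early when a sample adds no new seeds, which is consistent with the stopping condition as claimed.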

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: He, Qi; Spangler, W. Scott
Primary Examiner(s): Chen, Alan
Assistant Examiner(s): Smith, Paulinho E
Application Number: US13/714,667
Publication Number: US 20140172754A1
Time in Patent Office: 1,194 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06N 20/00 Machine learning
G06N 5/02 Knowledge representation; S...
