Decision tree refinement
Abstract
A model refinement system refines initial split rules that define an initial decision tree to generate final split rules. The model refinement system refines the initial split rules by removing clauses that are satisfied by match scores that are less than a threshold match score, which yields initial trimmed rules. Using the initial trimmed rules, the model refinement system classifies an initial training set and filters it to remove negative training pairs that are classified as duplicate pairs, resulting in a filtered training set. An intermediate decision tree defined by intermediate split rules is generated based on the filtered training set. Final split rules are generated based on the intermediate split rules, and input pairs of data records are classified as duplicate pairs based on attribute values of the input pairs and the final split rules.
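The trimming and filtering steps summarized above can be sketched in Python. This is only an illustration under stated assumptions, not the patented implementation: it assumes each split rule is a conjunction of clauses of the form `score >= threshold` or `score < threshold`, and it reads "clauses that are satisfied by match scores that are less than a threshold match score" as the `<` clauses. The names `Clause`, `trim_rule`, `is_duplicate`, and `filter_negatives`, and the sample attributes and scores, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

Scores = Dict[str, float]  # match score per attribute for one pair of records


@dataclass(frozen=True)
class Clause:
    attribute: str    # attribute whose match score is tested
    op: str           # ">=" or "<" (assumed clause forms)
    threshold: float  # match score at which the training set was split

    def satisfied_by(self, scores: Scores) -> bool:
        score = scores[self.attribute]
        if self.op == ">=":
            return score >= self.threshold
        return score < self.threshold


Rule = List[Clause]  # a split rule read as a conjunction of clauses


def trim_rule(rule: Rule) -> Rule:
    """Drop clauses that are satisfied only by scores below the threshold."""
    return [c for c in rule if c.op != "<"]


def is_duplicate(scores: Scores, rules: List[Rule]) -> bool:
    """True if the pair satisfies every clause of at least one non-empty rule."""
    return any(len(r) > 0 and all(c.satisfied_by(scores) for c in r) for r in rules)


def filter_negatives(negatives: List[Scores], trimmed: List[Rule]) -> List[Scores]:
    """Keep only the negative pairs that the trimmed rules do not call duplicates."""
    return [p for p in negatives if not is_duplicate(p, trimmed)]


# Hypothetical data: one initial rule and two negative training pairs.
initial_rules = [[Clause("name", ">=", 0.8), Clause("phone", "<", 0.3)]]
trimmed_rules = [trim_rule(r) for r in initial_rules]
negative_pairs = [{"name": 0.9, "phone": 0.7}, {"name": 0.2, "phone": 0.1}]
filtered = filter_negatives(negative_pairs, trimmed_rules)
print(filtered)  # [{'name': 0.2, 'phone': 0.1}]
```

In this toy data the first negative pair fails the untrimmed rule because its phone score is too high, but it satisfies the trimmed rule, so it is treated as a likely mislabeled duplicate and dropped before the intermediate tree is trained.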
29 Claims
1. A method, comprising:
identifying, by at least one data processing device, split-rules and an initial training set of data records used to generate the split rules, the initial training set of data records including negative training pairs that each include at least two data records that have not been identified as duplicate data records, each training pair having match scores specifying a measure of similarity for attributes of training pairs;
removing, by at least one data processing device, at least one clause from the split rules to generate initial trimmed rules, the removing being based at least in part on a threshold match score specifying a match score at which the initial training set is segmented;
classifying, by at least one data processing device, the negative training pairs in the initial training set based on the match scores for the negative training pairs and the initial trimmed rules;
removing, by at least one data processing device and based on the classification, negative training pairs that are classified as duplicate pairs from the initial training set to create a filtered training set;
generating, by at least one data processing device, an intermediate decision tree with the filtered training set, the intermediate decision tree defining intermediate split-rules; and
generating, by at least one data processing device, final split rules based on the intermediate split rules, the final split rules including at least one final split rule that differs from each of the intermediate split rules.
Dependent claims: 2-14.
15. A system, comprising:
a datastore storing split-rules and an initial training set of data records used to generate the split rules, the initial training set of data records including negative training pairs that each include at least two data records that have not been identified as duplicate data records, each training pair having match scores specifying a measure of similarity for attributes of training pairs; and
at least one processor of a model refinement system coupled to the datastore, the at least one processor configured to:
remove at least one clause from the split rules to generate initial trimmed rules, the removing being based at least in part on a threshold match score specifying a match score at which the initial training set is segmented;
classify the negative training pairs in the initial training set based on the match scores for the negative training pairs and the initial trimmed rules;
remove, based on the classification, negative training pairs that are classified as duplicate pairs from the initial training set to create a filtered training set;
generate an intermediate decision tree with the filtered training set, the intermediate decision tree defining intermediate split-rules; and
generate final split rules based on the intermediate split rules, the final split rules including at least one final split rule that differs from each of the intermediate split rules.
Dependent claims: 16-28.
29. A non-transitory computer readable medium encoded with a computer program comprising instructions that when executed cause a computer to perform operations:
identifying split-rules and an initial training set of data records used to generate the split rules, the initial training set of data records including negative training pairs that each include at least two data records that have not been identified as duplicate data records, each training pair having match scores specifying a measure of similarity for attributes of training pairs;
removing at least one clause from the split rules to generate initial trimmed rules, the removing being based at least in part on a threshold match score specifying a match score at which the initial training set is segmented;
classifying the negative training pairs in the initial training set based on the match scores for the negative training pairs and the initial trimmed rules;
removing, based on the classification, negative training pairs that are classified as duplicate pairs from the initial training set to create a filtered training set;
generating an intermediate decision tree with the filtered training set, the intermediate decision tree defining intermediate split-rules; and
generating final split rules based on the intermediate split rules, the final split rules including at least one final split rule that differs from each of the intermediate split rules.
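Each independent claim recites a decision tree that "defines" split rules. One common way to read that relationship, shown here as a hedged Python sketch rather than as the method the patent prescribes, is to treat every root-to-leaf path ending in a duplicate leaf as one rule whose clauses are the threshold tests along the path. The `Node` structure, the `extract_split_rules` helper, the leaf labels, and the example tree are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ClauseT = Tuple[str, str, float]  # (attribute, operator, threshold)


@dataclass
class Node:
    label: Optional[str] = None           # set on leaves only
    attribute: Optional[str] = None       # set on internal nodes only
    threshold: float = 0.0                # match score at which the node splits
    below: Optional["Node"] = None        # branch taken when score < threshold
    at_or_above: Optional["Node"] = None  # branch taken when score >= threshold


def extract_split_rules(node: Node, path: Tuple[ClauseT, ...] = ()) -> List[List[ClauseT]]:
    """Collect one rule (a list of clauses) per path that ends in a duplicate leaf."""
    if node.label is not None:
        return [list(path)] if node.label == "duplicate" else []
    low = extract_split_rules(node.below, path + ((node.attribute, "<", node.threshold),))
    high = extract_split_rules(node.at_or_above, path + ((node.attribute, ">=", node.threshold),))
    return low + high


# Hypothetical intermediate tree over "name" and "email" match scores.
tree = Node(
    attribute="name",
    threshold=0.8,
    below=Node(
        attribute="email",
        threshold=0.9,
        below=Node(label="non-duplicate"),
        at_or_above=Node(label="duplicate"),
    ),
    at_or_above=Node(label="duplicate"),
)

intermediate_rules = extract_split_rules(tree)
for rule in intermediate_rules:
    print(rule)
```

The claims require only that at least one final split rule differ from every intermediate split rule; this section does not say how the final rules are derived, so the sketch stops at the intermediate rules.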
Specification