Decision tree refinement

US 8,417,654 B1
Filed: 07/18/2012
Issued: 04/09/2013
Est. Priority Date: 09/22/2009
Status: Active Grant

Abstract

A model refinement system refines initial split-rules that define an initial decision tree to generate final split-rules. The model refinement system refines the initial split-rules by removing clauses that are satisfied by match scores that are less than a threshold match score, which generates initial trimmed rules. Using the initial trimmed rules, the model refinement system classifies an initial training set and filters the initial training set to remove negative training pairs that are classified as duplicate pairs, resulting in a filtered training set. An intermediate decision tree defined by intermediate split-rules is generated based on the filtered training set. Final split-rules are generated based on the intermediate split-rules, and input pairs of data records are classified as duplicate pairs based on attribute values of the input pairs and the final split-rules.
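
The trimming and filtering steps summarized in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the patent's implementation: the `Clause` type, the encoding of a split-rule as a list of clauses, the reading of "satisfied by match scores that are less than a threshold" as clauses of the form `score < t` with `t` at or below the threshold match score, the choice to discard rules left with no clauses, and the convention that a pair is classified as a duplicate when it satisfies every clause of at least one trimmed rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """Hypothetical clause: compares one attribute's match score to a bound."""
    attribute: str
    op: str          # "<" or ">="
    threshold: float

    def satisfied_by(self, scores):
        value = scores[self.attribute]
        if self.op == "<":
            return value < self.threshold
        return value >= self.threshold


def trim_rules(rules, threshold_match_score):
    """Drop clauses that only low match scores can satisfy (assumed reading)."""
    trimmed = []
    for rule in rules:
        kept = [c for c in rule
                if not (c.op == "<" and c.threshold <= threshold_match_score)]
        if kept:  # assumption: a rule with no clauses left is discarded
            trimmed.append(kept)
    return trimmed


def is_duplicate(scores, rules):
    """A pair is a duplicate if it satisfies all clauses of any rule."""
    return any(all(c.satisfied_by(scores) for c in rule) for rule in rules)


def filter_training_set(pairs, rules):
    """Remove negative pairs that the trimmed rules classify as duplicates."""
    return [(scores, is_positive) for scores, is_positive in pairs
            if is_positive or not is_duplicate(scores, rules)]


# Illustrative data: (match scores per attribute, is_positive label).
rules = [
    [Clause("name", ">=", 0.8), Clause("address", "<", 0.3)],
    [Clause("email", "<", 0.2)],
]
pairs = [
    ({"name": 0.9, "address": 0.7, "email": 0.1}, False),
    ({"name": 0.4, "address": 0.2, "email": 0.9}, False),
    ({"name": 0.95, "address": 0.9, "email": 0.9}, True),
]

trimmed = trim_rules(rules, threshold_match_score=0.5)
filtered = filter_training_set(pairs, trimmed)
```

In this toy data the first negative pair satisfies the one surviving trimmed rule, so it is treated as a likely mislabeled duplicate and dropped before the intermediate decision tree would be trained.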

29 Claims

1. A method, comprising:
- identifying, by at least one data processing device, split-rules and an initial training set of data records used to generate the split rules, the initial training set of data records including negative training pairs that each include at least two data records that have not been identified as duplicate data records, each training pair having match scores specifying a measure of similarity for attributes of training pairs;
  
  removing, by at least one data processing device, at least one clause from the split rules to generate initial trimmed rules, the removing being based at least in part on a threshold match score specifying a match score at which the initial training set is segmented;
  
  classifying, by at least one data processing device, the negative training pairs in the initial training set based on the match scores for the negative training pairs and the initial trimmed rules;
  
  removing, by at least one data processing device and based on the classification, negative training pairs that are classified as duplicate pairs from the initial training set to create a filtered training set;
  
  generating, by at least one data processing device, an intermediate decision tree with the filtered training set, the intermediate decision tree defining intermediate split-rules; and
  
  generating, by at least one data processing device, final split rules based on the intermediate split rules, the final split rules including at least one final split rule that differs from each of the intermediate split rules.
- - 2. The method of claim 1, wherein generating the final split rules comprises adjusting at least a portion of the intermediate split rules to generate the final split-rules.
  - 3. The method of claim 1, further comprising classifying input pairs of data records as duplicate pairs based on attribute values of the input pairs and the final split-rules.
  - 4. The method of claim 3, wherein classifying a pair of data records as duplicate pairs comprises classifying two user accounts as duplicate user accounts based on the attribute values of the two user accounts and the final split-rules.
  - 5. The method of claim 4, further comprising suspending duplicate user accounts based on the classification.
  - 6. The method of claim 1, further comprising:
    - computing rule weights for the initial trimmed rules based on training pairs in the initial training set that satisfy the initial trimmed rules;
      
      computing classification scores for the training pairs in the initial training set, the classification scores for the training pairs being based on the rule weights of the initial trimmed rules that the training records satisfy; and
      
      classifying positive training pairs in the initial training set based on the classification scores.
  - 7. The method of claim 6, wherein computing rule weights comprises computing, for each initial trimmed rule, a ratio of a number of positive training pairs that satisfy the initial trimmed rule and a number of negative training pairs that satisfy the initial trimmed rule.
  - 8. The method of claim 6, wherein computing classification scores comprises computing, for one or more of the training pairs, a result of a function of the rule weights for initial trimmed rules that are satisfied by the training pair.
  - 9. The method of claim 8, wherein classifying the negative training pairs comprises:
    - classifying negative training pairs having a classification score that meets a threshold classification score as duplicate pairs; and
      
      classifying negative training pairs having a classification score that fails to meet the threshold classification score as non-duplicate pairs.
  - 10. The method of claim 1, further comprising:
    - determining quality scores for the intermediate split-rules based on precision measures and coverage measures of the intermediate split-rules;
      
      selecting intermediate split-rules for adjustment based on the quality scores; and
      
      adjusting the selected intermediate split-rules to generate final split-rules.
  - 11. The method of claim 10, wherein selecting intermediate split-rules comprises selecting intermediate split-rules having less than a threshold number of clauses and having a split-rule quality measure that exceeds a high quality threshold.
  - 12. The method of claim 11, wherein adjusting the selected intermediate split-rules comprises selecting, for each selected intermediate split-rule, an additional clause to include in the selected intermediate split-rule, the additional clause specifying an additional match score for an attribute.
  - 13. The method of claim 12, wherein selecting an additional clause comprises selecting an additional clause for an attribute having a highest attribute weight based on an error rate associated with the attribute and a coverage measure for the attribute.
  - 14. The method of claim 10, wherein adjusting the selected intermediate split-rules comprises selecting additional clauses that maximize a result of an adjusted rule weight function for the selected intermediate split-rules.
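
The rule weighting and scoring of claims 6 through 9 can be sketched in Python as follows. This is a rough illustration under stated assumptions: rules are modeled as plain predicates over a dictionary of match scores, the ratio of claim 7 is smoothed by adding a hypothetical `smoothing` constant to both counts to avoid division by zero, and the "function of the rule weights" in claim 8 is taken to be a simple sum. The claims do not specify any of these choices.

```python
def rule_weight(rule, pairs, smoothing=1.0):
    """Ratio of positive to negative pairs satisfying the rule (smoothed)."""
    positives = sum(1 for scores, is_dup in pairs if is_dup and rule(scores))
    negatives = sum(1 for scores, is_dup in pairs if not is_dup and rule(scores))
    # Assumption: additive smoothing keeps the ratio finite when negatives == 0.
    return (positives + smoothing) / (negatives + smoothing)


def classification_score(scores, rules, weights):
    """Assumed scoring function: sum of weights of the rules the pair satisfies."""
    return sum(w for rule, w in zip(rules, weights) if rule(scores))


def classify_as_duplicate(scores, rules, weights, threshold):
    """A pair whose score meets the threshold is classified as a duplicate."""
    return classification_score(scores, rules, weights) >= threshold


# Illustrative trimmed rules and training pairs: (match scores, is_positive).
rules = [
    lambda s: s["name"] >= 0.8,
    lambda s: s["email"] >= 0.9,
]
pairs = [
    ({"name": 0.9, "email": 0.95}, True),
    ({"name": 0.85, "email": 0.2}, True),
    ({"name": 0.9, "email": 0.1}, False),
    ({"name": 0.1, "email": 0.3}, False),
]

weights = [rule_weight(rule, pairs) for rule in rules]
labels = [classify_as_duplicate(s, rules, weights, threshold=2.0)
          for s, _ in pairs]
```

With these values the name rule is satisfied by two positives and one negative and the email rule by one positive and no negatives, so only the first pair, which satisfies both rules, reaches the threshold.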

15. A system, comprising:
- a datastore storing split-rules and an initial training set of data records used to generate the split rules, the initial training set of data records including negative training pairs that each include at least two data records that have not been identified as duplicate data records, each training pair having match scores specifying a measure of similarity for attributes of training pairs; and
  
  at least one processor of a model refinement system coupled to the datastore, the at least one processor configured to:
  
  remove at least one clause from the split rules to generate initial trimmed rules, the removing being based at least in part on a threshold match score specifying a match score at which the initial training set is segmented;
  
  classify the negative training pairs in the initial training set based on the match scores for the negative training pairs and the initial trimmed rules;
  
  remove, based on the classification, negative training pairs that are classified as duplicate pairs from the initial training set to create a filtered training set;
  
  generate an intermediate decision tree with the filtered training set, the intermediate decision tree defining intermediate split-rules; and
  
  generate final split rules based on the intermediate split rules, the final split rules including at least one final split rule that differs from each of the intermediate split rules.
- - 16. The system of claim 15, wherein the model refinement system is further configured to adjust at least a portion of the intermediate split rules to generate the final split-rules.
  - 17. The system of claim 15, wherein the model refinement system is further configured to classify input pairs of data records as duplicate pairs based on attribute values of the input pairs and the final split-rules.
  - 18. The system of claim 17, wherein the model refinement system is further configured to classify two user accounts as duplicate user accounts based on the attribute values of the two user accounts and the final split-rules.
  - 19. The system of claim 18, wherein the model refinement system is further configured to suspend duplicate user accounts based on the classification.
  - 20. The system of claim 15, wherein the model refinement system is further configured to:
    - compute rule weights for the initial trimmed rules based on training pairs in the initial training set that satisfy the initial trimmed rules;
      
      compute classification scores for the training pairs in the initial training set, the classification scores for the training pairs being based on the rule weights of the initial trimmed rules that the training records satisfy; and
      
      classify positive training pairs in the initial training set based on the classification scores.
  - 21. The system of claim 20, wherein the model refinement system is further configured to compute, for each initial trimmed rule, a ratio of a number of positive training pairs that satisfy the initial trimmed rule and a number of negative training pairs that satisfy the initial trimmed rule and compute the rule weights based, at least in part, on the ratio.
  - 22. The system of claim 20, wherein the model refinement system is further configured to compute, for one or more of the training pairs, a result of a function of the rule weights for initial trimmed rules that are satisfied by the training pair and compute the classification score based, at least in part, on the result.
  - 23. The system of claim 22, wherein the model refinement system is further configured to:
    - classify negative training pairs having a classification score that meets a threshold classification score as duplicate pairs; and
      
      classify negative training pairs having a classification score that fails to meet the threshold classification score as non-duplicate pairs.
  - 24. The system of claim 15, wherein the model refinement system is further configured to:
    - determine quality scores for the intermediate split-rules based on precision measures and coverage measures of the intermediate split-rules;
      
      select intermediate split-rules for adjustment based on the quality scores; and
      
      adjust the selected intermediate split-rules to generate final split-rules.
  - 25. The system of claim 24, wherein the model refinement system is further configured to select intermediate split-rules by selecting those intermediate split rules having less than a threshold number of clauses and having a split-rule quality measure that exceeds a high quality threshold.
  - 26. The system of claim 25, wherein the model refinement system is further configured to select, for each selected intermediate split-rule, an additional clause to include in the selected intermediate split-rule, the additional clause specifying an additional match score for an attribute and adjust the selected intermediate split-rules based, at least in part, on the additional clause.
  - 27. The system of claim 26, wherein the model refinement system is further configured to select the additional clause for an attribute having a highest attribute weight based on an error rate associated with the attribute and a coverage measure for the attribute.
  - 28. The system of claim 24, wherein the model refinement system is further configured to select additional clauses that maximize a result of an adjusted rule weight function for the selected intermediate split-rules and adjust the selected intermediate split-rules, based at least in part on the selected additional clauses.
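
The rule adjustment described in claims 24 through 27 (and their method counterparts, claims 10 through 13) can be sketched in Python as follows. The formulas are assumptions chosen for illustration, since the claims only say that quality depends on precision and coverage and that attribute weight depends on error rate and coverage: quality is taken as the harmonic mean of precision and coverage, attribute weight as `coverage * (1 - error_rate)`, and a rule as a list of hypothetical `(attribute, min_score)` clauses. The `attribute_stats` values are made up.

```python
def satisfies(scores, rule):
    """A pair satisfies a rule when every clause's minimum score is met."""
    return all(scores[attr] >= min_score for attr, min_score in rule)


def quality(rule, pairs):
    """Assumed quality score: harmonic mean of precision and coverage."""
    matched = [is_dup for scores, is_dup in pairs if satisfies(scores, rule)]
    total_positives = sum(1 for _, is_dup in pairs if is_dup)
    true_positives = sum(matched)
    if true_positives == 0 or total_positives == 0:
        return 0.0
    precision = true_positives / len(matched)
    coverage = true_positives / total_positives
    return 2 * precision * coverage / (precision + coverage)


def select_for_adjustment(rules, pairs, max_clauses, high_quality):
    """Pick short rules whose quality exceeds the high-quality threshold."""
    return [r for r in rules
            if len(r) < max_clauses and quality(r, pairs) > high_quality]


def add_clause(rule, attribute_stats, min_score):
    """Append a clause on the unused attribute with the highest weight."""
    used = {attr for attr, _ in rule}
    candidates = [a for a in attribute_stats if a not in used]
    if not candidates:
        return list(rule)
    # Assumed attribute weight: coverage discounted by error rate.
    best = max(candidates, key=lambda a: attribute_stats[a]["coverage"]
               * (1 - attribute_stats[a]["error_rate"]))
    return list(rule) + [(best, min_score)]


pairs = [
    ({"name": 0.9, "email": 0.9, "phone": 0.8}, True),
    ({"name": 0.9, "email": 0.8, "phone": 0.9}, True),
    ({"name": 0.9, "email": 0.1, "phone": 0.2}, False),
    ({"name": 0.2, "email": 0.1, "phone": 0.1}, False),
]
intermediate_rules = [[("name", 0.8)], [("name", 0.8), ("email", 0.7)]]
attribute_stats = {
    "name": {"coverage": 0.9, "error_rate": 0.2},
    "email": {"coverage": 0.9, "error_rate": 0.1},
    "phone": {"coverage": 0.6, "error_rate": 0.05},
}

selected = select_for_adjustment(intermediate_rules, pairs,
                                 max_clauses=2, high_quality=0.7)
final_rules = [add_clause(r, attribute_stats, min_score=0.7) for r in selected]
```

Here only the single-clause name rule is short enough and of high enough quality to be selected, and it is tightened with an email clause because email has the highest assumed attribute weight among the attributes the rule does not already use.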

29. A non-transitory computer readable medium encoded with a computer program comprising instructions that when executed cause a computer to perform operations:
- identifying split-rules and an initial training set of data records used to generate the split rules, the initial training set of data records including negative training pairs that each include at least two data records that have not been identified as duplicate data records, each training pair having match scores specifying a measure of similarity for attributes of training pairs;
  
  removing at least one clause from the split rules to generate initial trimmed rules, the removing being based at least in part on a threshold match score specifying a match score at which the initial training set is segmented;
  
  classifying the negative training pairs in the initial training set based on the match scores for the negative training pairs and the initial trimmed rules;
  
  removing, based on the classification, negative training pairs that are classified as duplicate pairs from the initial training set to create a filtered training set;
  
  generating an intermediate decision tree with the filtered training set, the intermediate decision tree defining intermediate split-rules; and
  
  generating final split rules based on the intermediate split rules, the final split rules including at least one final split rule that differs from each of the intermediate split rules.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Cao, Zhen; Verma, Naval
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Bharadwaj, Kalpana
Application Number: US13/551,779
Time in Patent Office: 265 Days
Field of Search: None
US Class Current: 706/14
CPC Class Codes: G06Q 30/02 Marketing; Price estimation...