Systems and methods for partitioning sets of features for a bayesian classifier

US 9,349,101 B2
Filed: 08/29/2014
Issued: 05/24/2016
Est. Priority Date: 08/29/2014
Status: Active Grant

First Claim

Patent Images

1. A method of building a partition of features in an input set, in which feature subsets listed in a partition list have probabilistic interdependence among features in the feature subset, the method including:

accessing an input set including at least one input tuple comprising feature-values assigned to features;

identifying input subtuples comprising unique feature subsets in the input tuple;

accessing a tuple instance count data structure stored in memory that provides counts of tuples in a data set;

computing class entropy scores for the input subtuples that have at least a threshold support count of instances in the tuple instance count data structure;

adding feature subsets corresponding to the input subtuples to a partition list, including;

ordering at least some of the input subtuples by non-decreasing class entropy score;

traversing the ordered input subtuples, including;

adding the feature subset of a current ordered input subtuple to the partition list, andpruning from subsequent consideration other input subtuples that include any features in the current ordered input subtuple; and

reaching completion when all of the features of the input tuple have been added to the partition list; and

storing the partition list in a memory, whereby it becomes available to use with a classifier.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

The technology disclosed relates to methods for partitioning sets of features for a Bayesian classifier, finding a data partition that makes the classification process faster and more accurate, while discovering and taking into account feature dependence among sets of features in the data set. It relates to computing class entropy scores for a class label across all tuples that share the feature-subset and arranging the tuples in order of non-decreasing entropy scores for the class label, and constructing a data partition that offers the highest improvement in predictive accuracy for the data set. Also disclosed is a method for partitioning a complete set of records of features in a batch computation, computing increasing predictive power; and also relates to starting with singleton partitions, and using an iterative process to construct a data partition that offers the highest improvement in predictive accuracy for the data set.

161 Citations

15 Claims

1. A method of building a partition of features in an input set, in which feature subsets listed in a partition list have probabilistic interdependence among features in the feature subset, the method including:
- accessing an input set including at least one input tuple comprising feature-values assigned to features;
  
  identifying input subtuples comprising unique feature subsets in the input tuple;
  
  accessing a tuple instance count data structure stored in memory that provides counts of tuples in a data set;
  
  computing class entropy scores for the input subtuples that have at least a threshold support count of instances in the tuple instance count data structure;
  
  adding feature subsets corresponding to the input subtuples to a partition list, including;
  
  ordering at least some of the input subtuples by non-decreasing class entropy score;
  
  traversing the ordered input subtuples, including;
  
  adding the feature subset of a current ordered input subtuple to the partition list, andpruning from subsequent consideration other input subtuples that include any features in the current ordered input subtuple; and
  
  reaching completion when all of the features of the input tuple have been added to the partition list; and
  
  storing the partition list in a memory, whereby it becomes available to use with a classifier.
- View Dependent Claims (2, 3, 4)
- - 2. The method of claim 1, further including the threshold support count being in a range of 2 to 10, inclusive, thereby requiring at least 2 to 10 instances of feature-values in the input subtuples to be present in the tuple instance count data structure.
  - 3. The method of claim 1, wherein:
    - the tuple instance count data structure includes tuple instances, class values and counts; and
      
      the input tuple includes a class label that accepts an assigned class value.
  - 4. The method of claim 2, wherein the tuple instance count data structure is subject to a minimum threshold support count and returns non-zero counts for tuple instances that have a count in the data set that meets the minimum support threshold.

5. A method of building a partition of features in a data set into feature subsets with probabilistic interdependence among features in each feature subset, the method including:
- accessing a tuple instance count data structure stored in memory;
  
  accessing a class value to use;
  
  computing predictive power for the class value of tuple instances using counts in the tuple instance count data structure for supported tuples among the tuple instances that have at least a threshold support of count instances;
  
  adding feature subsets corresponding to the supported tuples to a partition list, including;
  
  ordering at least some of the tuple instances by non-increasing predictive power;
  
  traversing the ordered tuple instances, including;
  
  adding the feature subset of a current ordered input subtuple to the partition list, andpruning from subsequent consideration other input subtuples that include any features in the current ordered input subtuple; and
  
  reaching completion when all supported features of the data set have been added to the partition list; and
  
  storing the partition list in a memory, whereby it becomes available to use with a classifier.
- View Dependent Claims (6, 7)
- - 6. The method of claim 5, further including the supported tuples satisfying a threshold support count being in a range of 2 to 10, inclusive, thereby requiring at least 2 to 10 instances of feature-values in the supported tuples to be present in the tuple instance count data structure.
  - 7. The method of claim 5, wherein:
    - the tuple instance count data structure includes tuple instances, class values and counts.

8. A method of building a partition of features in a data set into feature subsets with probabilistic interdependence among features in each feature subset, the method including:
- accessing a tuple instance count data structure stored in memory;
  
  accessing an input set including at least one input tuple comprising feature-values assigned to features;
  
  beginning with singleton input features present in the input set as pending feature subsets in a partition list, repeatedly merging pairs of pending feature subsets in the partition list, including;
  
  computing reduction in class entropy scores that would result from merging the pairs of pending feature subsets using the tuple instance count data structure, limiting consideration of mergers resulting in merged feature subsets to the mergers that have at least a threshold support count of instances in the tuple instance count data structure;
  
  selecting a selected pair of pending feature subsets that yields a reduction in class entropy score resulting from merger, wherein the reduction in class entropy score meets a predetermined class entropy reduction threshold;
  
  merging the selected pair of feature subsets into a merged pending feature subset in the partition list; and
  
  reaching completion when available mergers of pending feature subsets would not produce a decrease in entropy that meets the predetermined class entropy reduction threshold; and
  
  storing the partition list in the memory, whereby it becomes available to use with a classifier.
- View Dependent Claims (9, 10, 11)
- - 9. The method of claim 8, further including selecting the selected pair of pending feature subsets that yields a greatest reduction in class entropy among pairs of pending feature subsets available for merger.
  - 10. The method of claim 8, further including the threshold support count for the merged feature subsets being in a range of 2 to 10, inclusive, thereby requiring at least 2 to 10 instances of feature-values in the merged feature subsets to be present in the tuple instance count data structure.
  - 11. The method of claim 8, wherein:
    - the tuple instance count data structure includes tuple instances, class values and counts; and
      
      the input tuple includes a class label that accepts an assigned class value.

12. A method of building a partition of features in a data set into feature subsets with probabilistic interdependence among features in each feature subset, the method including:
- accessing a tuple instance count data structure stored in memory;
  
  accessing a class value to use;
  
  beginning with singleton feature subsets as pending feature subsets in a partition list, repeatedly merging pairs of the pending feature subsets in the partition list, including;
  
  computing increase in class predictive power for the class value that would result from merging the pairs of pending feature subsets using the tuple instance count data structure, limiting consideration of mergers to the mergers resulting in merged feature subsets that have at least a threshold support count of instances of the class value for the pairs of pending feature subsets;
  
  selecting a selected pair of feature subsets which yields an increase in predictive power for the class value that meets a predetermined predictive power increase threshold;
  
  merging the selected pair of feature subsets into a merged single feature subset in the partition list; and
  
  reaching completion when available mergers of pending feature subsets would not produce an increase in predictive power that meets the predetermined predictive power increase threshold; and
  
  storing the partition list in the memory, whereby it becomes available to use with a classifier.
- View Dependent Claims (13, 14, 15)
- - 13. The method of claim 12, further including selecting the selected pair of feature subsets that yields a greatest increase in predictive power for the class value among pairs of feature subsets available for merger.
  - 14. The method of claim 12, further including the threshold support count for the merged feature subsets being in a range of 2 to 10, inclusive, thereby requiring at least 2 to 10 instances of feature-values in the merged feature subsets to be present in the tuple instance count data structure.
  - 15. The method of claim 12, wherein:
    - the tuple instance count data structure includes tuple instances, class values and counts.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Salesforce.com, Inc.
Original Assignee
Salesforce.com, Inc.
Inventors
Fuchs, Matthew D., Jagota, Arun
Primary Examiner(s)
Hill, Stanley K.
Assistant Examiner(s)
Misir, Dave

Application Number

US14/473,717
Publication Number

US 20160063389A1
Time in Patent Office

634 Days
Field of Search

706/12
US Class Current

1/1
CPC Class Codes

G06F 16/211   Schema design and management

G06N 20/00   Machine learning

G06N 5/02   Knowledge representation; S...

G06N 7/01   Probabilistic graphical mod...

Systems and methods for partitioning sets of features for a bayesian classifier

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

161 Citations

15 Claims

Specification

Solutions

Use Cases

Quick Links

Systems and methods for partitioning sets of features for a bayesian classifier

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

161 Citations

15 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links