Feature generation and model selection for generalized linear models
Abstract
Systems, methods, and other embodiments associated with feature generation and model selection for generalized linear models (GLMs) are described. In one embodiment, a method includes ordering candidate features in a dataset being considered by a streamwise feature selection process according to an inclusion score that reflects the likelihood that a given candidate feature will be included in the GLM. The ordered candidate features are provided to the streamwise feature selection process for acceptance testing. In one embodiment, the method also includes selecting a penalty criterion for use in the acceptance testing based on characteristics of the dataset.
38 Claims
1. A computer-implemented method comprising:
- identifying a dataset that stores values for a target attribute and input attributes, where the input attributes are under consideration for inclusion in a generalized linear model that predicts a value of the target attribute based on a selection of features, where each feature comprises a combination of one or more of the input attributes;
- identifying candidate features, where a candidate feature comprises a combination of one or more of the input attributes;
- computing respective inclusion scores for respective candidate features, based, at least in part, on a likelihood that the candidate feature will be selected for inclusion in the generalized linear model;
- ordering the candidate features according to inclusion score;
- constructing a set of one or more branches of candidate features, where each branch includes candidate features ordered according to inclusion score from highest inclusion score to lowest inclusion score, where the one or more branches do not include candidate features having an inclusion score below a predetermined minimum score; and
- providing a branch of candidate features to a streamwise feature selection process configured to construct the generalized linear model by considering candidate features in the branch, in turn, starting with the candidate feature with the highest inclusion score, and including selected candidate features in the generalized linear model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
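The branch-construction steps of claim 1 can be illustrated with a minimal sketch. The claim does not say how the inclusion score is computed, so this example assumes the absolute Pearson correlation between a candidate feature and the target; the function names `pearson` and `build_branch` and the dictionary layout are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def build_branch(candidates, target, min_score):
    """Score, filter, and order candidate features.

    candidates: dict mapping a feature name to its column of values.
    Assumption: inclusion score = |correlation with the target|.
    Returns (name, score) pairs, highest score first, with every
    candidate scoring below min_score left out of the branch.
    """
    scored = [(abs(pearson(col, target)), name) for name, col in candidates.items()]
    kept = [(score, name) for score, name in scored if score >= min_score]
    kept.sort(key=lambda t: (-t[0], t[1]))  # descending score, name breaks ties
    return [(name, score) for score, name in kept]


if __name__ == "__main__":
    target = [1, 2, 3, 4, 5]
    candidates = {
        "a": [2, 4, 6, 8, 10],
        "b": [5, 3, 4, 1, 2],
        "c": [1, -1, 1, -1, 1],
    }
    for name, score in build_branch(candidates, target, min_score=0.1):
        print(name, round(score, 3))
```

In this toy data, candidate `c` is uncorrelated with the target, so it falls below the minimum score and never enters the branch.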
20. A computing system, comprising:
- a branch construction logic configured to:
  - identify a dataset that stores values for a target attribute and input attributes;
  - identify candidate features, where a candidate feature comprises a combination of one or more input attributes;
  - compute respective inclusion scores for respective candidate features in the branch, based, at least in part, on a likelihood that the candidate feature will be selected for inclusion in a generalized linear model that combines selected candidate features to predict a value of the target attribute;
  - order the candidate features according to inclusion score; and
  - construct a branch of candidate features ordered according to inclusion score from highest inclusion score to lowest inclusion score, where the branch does not include candidate features having an inclusion score below a predetermined minimum score; and
- a streamwise feature selection logic configured to construct the generalized linear model by performing acceptance testing on candidate generalized linear models that include a next candidate feature in the branch, where candidate generalized linear models are compared to a last accepted generalized linear model and a candidate generalized linear model is accepted when one or more acceptance criteria are met.
Dependent claims: 21, 22, 23
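The acceptance-testing loop of claim 20 might look like the sketch below. The claim does not name the acceptance criteria, so this example assumes a Gaussian model with identity link (ordinary least squares) and accepts a candidate model only when its AIC is lower than that of the last accepted model. The helper names `solve`, `fit_rss`, `aic`, and `streamwise_select` are hypothetical, and the solver assumes the feature columns are not perfectly collinear.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_rss(columns, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    pred = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))


def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; rss is floored to avoid log(0)."""
    return n * math.log(max(rss, 1e-12) / n) + 2 * k


def streamwise_select(branch, data, y):
    """Consider each feature in branch order; keep it only if AIC improves.

    Assumption: 'acceptance criteria' = lower AIC than the last accepted model.
    """
    accepted = []
    best = aic(fit_rss([], y), len(y), 1)  # intercept-only starting model
    for name in branch:
        trial = accepted + [name]
        score = aic(fit_rss([data[f] for f in trial], y), len(y), len(trial) + 1)
        if score < best:  # candidate model beats the last accepted model
            accepted, best = trial, score
    return accepted


if __name__ == "__main__":
    y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
    data = {"x1": [1, 2, 3, 4, 5, 6], "z": [1, 1, -1, -1, 1, 1]}
    print(streamwise_select(["x1", "z"], data, y))
```

Each feature is tested once, in branch order, against the last accepted model. Swapping `aic` for another penalty function would change the acceptance rule without touching the loop.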
24. A non-transitory computer-readable medium storing computer-executable instructions that when executed by a computer cause the computer to at least:
- identify a dataset that stores values for one or more input attributes and a target attribute;
- until one or more model termination criteria are met, construct a set of branches, where each branch comprises candidate features ordered according to respective inclusion score, where a candidate feature comprises a combination of one or more input attributes, further where respective inclusion scores estimate a likelihood that respective candidate features will be selected for inclusion in a generalized linear model that comprises a combination of features that predict the target attribute;
- until one or more branch termination criteria are met for each branch in the set of branches;
- construct a candidate generalized linear model that includes a next candidate feature in the branch;
- perform acceptance testing on the candidate generalized linear model such that when the candidate generalized linear model meets acceptance criteria, the candidate generalized linear model is accepted;
- when one or more reorder criteria are met, compute updated inclusion scores for remaining candidate features in the branch based, at least in part, on a correlation between respective candidate features and a residual error of the last accepted generalized linear model;
- re-order the remaining candidate features in the branch according to updated inclusion scores; and
- when branch termination criteria are met for the branch, access a next branch in the set of branches; and
- provide the last accepted generalized linear model as an output.
Dependent claims: 25, 26, 27, 28, 29
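The re-ordering step of claim 24 scores the remaining candidates against the residual error of the last accepted model. The sketch below assumes the updated inclusion score is the absolute Pearson correlation with the residual, which is one way to read "based, at least in part, on a correlation"; `pearson` and `reorder_by_residual` are hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def reorder_by_residual(remaining, data, residual):
    """Re-score and re-order the candidates not yet considered in a branch.

    remaining: feature names still waiting in the branch.
    residual:  target minus the prediction of the last accepted model.
    Assumption: updated inclusion score = |correlation with the residual|.
    Returns (name, updated_score) pairs, highest score first.
    """
    scored = [(abs(pearson(data[name], residual)), name) for name in remaining]
    scored.sort(key=lambda t: (-t[0], t[1]))  # descending score, name breaks ties
    return [(name, score) for score, name in scored]


if __name__ == "__main__":
    residual = [1.0, -1.0, 1.0, -1.0]
    data = {"u": [1, 2, 3, 4], "v": [2, -2, 2, -2]}
    print(reorder_by_residual(["u", "v"], data, residual))
```

Here `v` tracks the residual exactly, so it moves ahead of `u` even though `u` came first in the original ordering.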
30. A non-transitory computer-readable medium storing computer-executable instructions that when executed by a computer cause the computer to at least:
- identify a dataset that stores values for a target attribute and input attributes, where the input attributes are under consideration for inclusion in a generalized linear model that predicts a value of the target attribute based on a selection of features, where each feature comprises a combination of one or more of the input attributes;
- identify candidate features, where a candidate feature comprises a combination of one or more of the input attributes;
- compute respective inclusion scores for respective candidate features, based, at least in part, on a likelihood that the candidate feature will be selected for inclusion in the generalized linear model;
- order the candidate features according to inclusion score;
- construct a set of one or more branches of candidate features, where each branch includes candidate features ordered according to inclusion score from highest inclusion score to lowest inclusion score, where the one or more branches do not include candidate features having an inclusion score below a predetermined minimum score; and
- provide a branch of candidate features to a streamwise feature selection process configured to construct the generalized linear model by considering candidate features in the branch, in turn, starting with the candidate feature with the highest inclusion score, and including selected candidate features in the generalized linear model.
Dependent claims: 31, 32
33. A computer-implemented method, comprising:
- identifying a dataset that stores values for one or more input attributes and a target attribute;
- until one or more model termination criteria are met, constructing a set of branches, where each branch comprises candidate features ordered according to respective inclusion score, where a candidate feature comprises a combination of one or more input attributes, further where respective inclusion scores estimate a likelihood that respective candidate features will be selected for inclusion in a generalized linear model that comprises a combination of features that predict the target attribute;
- until one or more branch termination criteria are met for each branch in the set of branches;
- constructing a candidate generalized linear model that includes a next candidate feature in the branch;
- performing acceptance testing on the candidate generalized linear model such that when the candidate generalized linear model meets acceptance criteria, the candidate generalized linear model is accepted;
- when one or more reorder criteria are met, computing updated inclusion scores for remaining candidate features in the branch based, at least in part, on a correlation between respective candidate features and a residual error of the last accepted generalized linear model;
- re-ordering the remaining candidate features in the branch according to updated inclusion scores; and
- when branch termination criteria are met for the branch, accessing a next branch in the set of branches; and
- providing the last accepted generalized linear model as an output.
Dependent claims: 34, 35, 36, 37, 38
Specification