Cascade boosting of predictive models
Abstract
A method of boosting predictive models, called cascade boosting, that resolves the interpretability problem of previous boosting methods while mitigating the fragmentation problem when applied to decision trees. Cascade boosting always applies a single weak model to any given data point; its improvement over the common method of boosting lies in organizing the weak models in a decision list rather than a weighted average. Cascade boosting is simplest when applied to segmented predictive models but may also be applied to predictive models that do not explicitly segment the space of possible data points. The predictive model resulting from cascade boosting has fewer rules, or tree leaves, enabling a modeler to better understand the correlations among the data.
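For illustration only, and not part of the patent text: a minimal Python sketch of how a terminated decision list produced by cascade boosting could be applied to a single data point. Each item pairs a segment membership test with the subordinate model retained for that segment, and the list is terminated by a final model. The names (predict_with_cascade, Segment, Submodel) are illustrative assumptions, not terms from the claims.

    from typing import Any, Callable, List, Tuple

    Segment = Callable[[Any], bool]      # membership test for one retained segment
    Submodel = Callable[[Any], float]    # subordinate model retained with that segment

    def predict_with_cascade(items: List[Tuple[Segment, Submodel]],
                             terminal_model: Submodel, point: Any) -> float:
        """Walk the decision list; the first matching segment's subordinate model
        predicts, otherwise the terminating model predicts."""
        for in_segment, submodel in items:
            if in_segment(point):
                return submodel(point)
        return terminal_model(point)

    # Toy usage: one retained segment (x < 0) with its own model, plus a default model.
    items = [(lambda x: x < 0, lambda x: -1.0)]
    print(predict_with_cascade(items, lambda x: 1.0, -3))  # -1.0
    print(predict_with_cascade(items, lambda x: 1.0, 2))   #  1.0

Because exactly one item ever answers for a given point, the resulting model can be read rule by rule, which is the interpretability benefit the abstract describes.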
15 Claims
1. A computer implemented method of boosting of predictive models that apply subordinate models to data points in mutually exclusive segments, called cascade boosting, for resolving an interpretability problem of previous boosting methods, while mitigating a fragmentation problem when applied to decision trees, said method comprising the steps:
(a) building an initial predictive model, which initially is a current model, that applies at least one subordinate model to a plurality of data points in mutually exclusive segments, the initial predictive model being built from an initial population of training data points;
(b) observing performance of the current model, which is initially the initial predictive model, when applied to each mutually exclusive segment of a current population of observing data points, which is initially either the initial population of training data points or a separate initial population of data points reserved for observing performance;
(c) applying statistical learning theory to derive a reliable estimate bounding future performance of the current model on each mutually exclusive segment, the reliable estimate being derived for each mutually exclusive segment from the observed performance together with a number of observing data points falling within a mutually exclusive segment;
(d) sorting mutually exclusive segments by estimates;
(e) selecting and retaining a fraction of the mutually exclusive segments, and also retaining each subordinate model associated with the mutually exclusive segment, the selection resulting in retention of segments with better estimates;
(f) forming a subpopulation of training points by sampling from the current population of training points so as to exclude, either with certainty or with high probability, each point falling within the selected and retained segments;
(g) forming a subpopulation of observing data points by sampling from the current population of observing data points so as to exclude, either with certainty or with high probability, each point falling within the selected and retained segments;
(h) building another predictive model which becomes the current model, the current model applying at least one subordinate model to a plurality of data points in mutually exclusive segments, and being built from the subpopulation of training points formed in step (f);
(i) repeating steps (b) to (h) a desired number of times, with the subpopulation of training points as the current population of training points and with the subpopulation of observing data points as the current population of observing data points;
(j) arranging selected and retained mutually exclusive segments having better estimates in an open-ended decision list, where each item in the open-ended decision list specifies a test for membership in one of the selected and retained segments having better estimates and specifies that, if a point passes the membership test, then a prediction for the point is to be obtained by using the prediction of the subordinate model corresponding to the retained mutually exclusive segment; and
(k) terminating the open-ended decision list with one of the models built in step (a) or step (h) and resulting in a terminated decision list, thereby solving the problem of model interpretability while mitigating the fragmentation problem.
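For illustration only, and not part of the claims: a minimal sketch of the loop of steps (a) through (k) of claim 1, under assumptions the claim does not fix. Here build_segmented_model stands in for the model building of steps (a) and (h), performance is measured as an error rate, and the reliable estimate of step (c) is taken to be a one-sided Hoeffding bound; all names are illustrative.

    import math
    from typing import Any, Callable, List, Sequence, Tuple

    Point = Tuple[Any, float]                  # (features, target)
    Segment = Callable[[Any], bool]
    Submodel = Callable[[Any], float]
    SegmentedModel = List[Tuple[Segment, Submodel]]

    def error_upper_bound(observed_error: float, n: int, delta: float = 0.05) -> float:
        """Step (c): bound future error from observed error and segment size
        (a one-sided Hoeffding bound is assumed here)."""
        return 1.0 if n == 0 else observed_error + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

    def cascade_boost(train: Sequence[Point], observe: Sequence[Point],
                      build_segmented_model: Callable[[Sequence[Point]], SegmentedModel],
                      rounds: int, keep_fraction: float = 0.5):
        items: SegmentedModel = []                       # step (j): open-ended decision list
        terminal: SegmentedModel = []
        for _ in range(rounds):                          # step (i)
            model = build_segmented_model(train)         # steps (a) and (h)
            terminal = model                             # candidate terminator for step (k)
            scored = []
            for in_seg, sub in model:                    # steps (b) and (c)
                pts = [(x, y) for x, y in observe if in_seg(x)]
                err = sum(1 for x, y in pts if sub(x) != y) / len(pts) if pts else 1.0
                scored.append((error_upper_bound(err, len(pts)), in_seg, sub))
            scored.sort(key=lambda t: t[0])              # step (d): best (lowest) bound first
            retained = scored[:max(1, int(keep_fraction * len(scored)))]  # step (e)
            items.extend((seg, sub) for _, seg, sub in retained)
            kept_segments = [seg for _, seg, _ in retained]
            def covered(x, segs=kept_segments):
                return any(s(x) for s in segs)
            train = [(x, y) for x, y in train if not covered(x)]      # step (f)
            observe = [(x, y) for x, y in observe if not covered(x)]  # step (g)
            if not train or not observe:
                break
        return items, terminal                           # steps (j) and (k)

The sketch excludes covered points with certainty; steps (f) and (g) equally allow excluding them only with high probability, for example by subsampling.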
2. A computer implemented method as recited in claim 1, wherein the step (d) of sorting further comprises the step of comparing the observed performance of the current model when applied to the entire current population of observing points, as determined in step (b), with the sorted estimates as derived in step (d), and continuing with step (j) if an estimate as derived in step (d) is not substantially better than the observed performance.
3. A computer implemented method as recited in claim 2, wherein a counter K is initialized to a value of 1 in step (a) and incremented to K+1 in step (h) when a new predictive model is built so that the current model is called a K-th current model for each value of counter K, the arranging step (j) further comprising the steps:
selecting a value J in a range of 1 to the value of counter K;
discarding segments and subordinate models that were selected and retained in step (e) for the K-th current model, for all repetitions where K is greater than J;
discarding the K-th current model that was built in step (a) or (h), for all repetitions where K is not equal to J, wherein the terminated decision list as generated in step (k) is terminated with the K-th current model, where K is equal to J, as selected in the selecting step, thereby choosing a number of repetitions from an adaptively determined range of possibilities.
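For illustration only: a minimal sketch of claim 3's truncation, assuming the boosting loop kept its per-round outputs separately, one list of retained (segment, subordinate model) pairs and one full model per repetition K = 1..K_max. The container names are illustrative.

    from typing import Any, Callable, List, Sequence, Tuple

    Segment = Callable[[Any], bool]
    Submodel = Callable[[Any], float]
    RoundItems = List[Tuple[Segment, Submodel]]

    def truncate_cascade(per_round_retained: Sequence[RoundItems],
                         per_round_models: Sequence[RoundItems],
                         J: int) -> Tuple[RoundItems, RoundItems]:
        """Keep the segments retained in rounds K <= J, discard those retained where
        K > J, and terminate the decision list with the J-th model; every other
        round's model is discarded (rounds are 1-indexed, as in the claim)."""
        if not 1 <= J <= len(per_round_models):
            raise ValueError("J must lie between 1 and the value of counter K")
        items: RoundItems = []
        for K in range(1, J + 1):
            items.extend(per_round_retained[K - 1])
        return items, per_round_models[J - 1]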
4. A computer implemented method as recited in claim 3, wherein the number of repetitions J is chosen from an adaptively determined range of possibilities by optimizing observed performance when applied to an initial population of observing points and the step of selecting a value J in the arranging step (j) further comprises the steps:
combining, for each possible choice of values for J in a range of 1 to K, the observed performance of the K-th current model when applied to every mutually exclusive segment of a K-th population of observing points selected and retained in step (e), where the value of counter K is less than J, with the observed performance of the K-th current model when applied to an entire K-th population of observing points, where the value of counter K is equal to J, thereby determining the observed performance when applied to the initial population of observing points that would result from choosing a value for J; and
choosing a value for J to optimize observed performance as determined in the combining step, and choosing the number of repetitions J from an adaptively determined range of possibilities by optimizing observed performance when applied to the initial population of observing points.
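For illustration only: a minimal sketch of claim 4's choice of J by observed performance, assuming simple per-round counters were recorded during boosting. For each candidate J, rounds K < J contribute the points falling in their retained segments and round K = J contributes its entire observing population, which together reconstruct performance on the initial observing population; the RoundStats layout is an illustrative assumption.

    from typing import List, NamedTuple

    class RoundStats(NamedTuple):
        retained_points: int   # observing points in the segments retained in step (e)
        retained_errors: int   # errors of that round's model on those points
        total_points: int      # size of that round's whole observing population
        total_errors: int      # errors of that round's model on that whole population

    def choose_J_by_observed_performance(stats: List[RoundStats]) -> int:
        """Combining step plus choosing step: return the J in 1..K_max whose truncated
        cascade has the lowest combined error rate on the initial observing population."""
        best_J, best_rate = 1, float("inf")
        for J in range(1, len(stats) + 1):
            errors = sum(s.retained_errors for s in stats[:J - 1]) + stats[J - 1].total_errors
            points = sum(s.retained_points for s in stats[:J - 1]) + stats[J - 1].total_points
            rate = errors / points if points else 1.0
            if rate < best_rate:
                best_J, best_rate = J, rate
        return best_J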
5. A computer implemented method as recited in claim 3, wherein the number of repetitions J is chosen from an adaptively determined range of possibilities by optimizing a reliable statistical estimate bounding future performance, and the step of selecting a value J in the arranging step (j) further comprises the steps:
combining, for each possible choice for J in a range from 1 to the value of counter K as selected in step (j), the observed performance of the K-th current model when applied to every mutually exclusive segment of an entire K-th population of observing points that was selected and retained in step (e), where the value of counter K is less than J, with the observed performance of the K-th current model when applied to the entire K-th population of observing points, where the value of counter K is equal to J, thereby determining the observed performance when applied to the initial population of observing points that would result from choosing a value for J;
forming an increasing sequence of sets of possible choices for the value J in a range from 1 to the value of counter K and ending the sequence of sets with a set comprising all possible choices;
selecting, for each set of possible choices formed in the forming step, the value of J that optimizes observed performance for the set, as determined in the combining step;
deriving a reliable estimate bounding future performance with the value of J selected in the selecting step by applying statistical learning theory, for each set of possibilities formed in the forming step, the estimate being derived from parameters, wherein the parameters are selected from a group consisting of the observed performance with J, the number of possibilities in the set, and the number of observing points; and
selecting the value of J to optimize estimated performance as determined in the deriving step.
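For illustration only: a minimal sketch of claim 5's bound-based choice of J. The increasing sequence of candidate sets is taken here to be the nested sets {1}, {1, 2}, ..., {1..K_max}, and the derived bound is a one-sided Hoeffding bound with a union-bound penalty for the size of the candidate set; the claim requires only that a statistical learning theory estimate be derived from the observed performance, the number of possibilities in the set, and the number of observing points, so both choices are assumptions. observed_error_for_J stands in for the combining step.

    import math
    from typing import Callable

    def choose_J_by_bound(observed_error_for_J: Callable[[int], float],
                          K_max: int, n_observe: int, delta: float = 0.05) -> int:
        """Forming step: nested candidate sets.  Selecting step: best observed J in each
        set.  Deriving step: bound future error from observed error, set size, and the
        number of observing points.  Final selection optimizes the derived bound."""
        best = None
        for i in range(1, K_max + 1):
            choice_set = range(1, i + 1)
            J = min(choice_set, key=observed_error_for_J)
            bound = observed_error_for_J(J) + math.sqrt(
                math.log(len(choice_set) / delta) / (2.0 * n_observe))
            if best is None or bound < best[0]:
                best = (bound, J)
        return best[1]

Optimizing the bound rather than the raw observed error penalizes later, larger candidate sets, which guards against overfitting the choice of J to the observing points.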
6. A computer implemented method of boosting of predictive models that apply subordinate models to data points in possibly intersecting segments and arbitrate among the predictions of applicable subordinate models whenever a point falls within two or more segments, called cascade boosting, for resolving an interpretability problem of previous boosting methods, said method comprising the steps:
(a) building an initial predictive model, which initially is a current model, that applies at least one subordinate model to a plurality of data points in possibly intersecting segments and arbitrates among the predictions of the applicable subordinate models whenever a point falls within two or more segments, the initial predictive model being built from an initial population of training data points;
(b) observing performance of the current model, which is initially the initial predictive model, when applied to each segment of a current population of observing data points, which is initially either the initial population of training data points or a separate initial population of data points reserved for observing performance;
(c) applying statistical learning theory to derive a reliable estimate bounding future performance of the current model on each segment, the estimate being derived for each segment from the observed performance together with a number of the observing data points falling within the segment;
(d) sorting the segments by the estimates;
(e) selecting and retaining a fraction of the segments, and also retaining each subordinate model associated with the segment, the selection resulting in retention of segments with better estimates, while separately selecting and retaining each additional segment intersecting the segments with better estimates, and also retaining each subordinate model associated with each additional segment;
(f) forming a subpopulation of training points by sampling from the current population of training points so as to exclude, either with certainty or with high probability, each point falling within selected and retained segments having better estimates and the additional segments;
(g) forming a subpopulation of observing data points by sampling from the current population of observing data points so as to exclude, either with certainty or with high probability, each point falling within the selected and retained segments having better estimates and the additional segments;
(h) building another predictive model which becomes the current model, the current model applying at least one subordinate model to a plurality of data points in possibly intersecting segments and arbitrating among predictions of applicable subordinate models whenever a point falls within two or more segments, and being built from the subpopulation of training points;
(i) repeating steps (b) to (h) a desired number of times, with the subpopulation of training points as the current population of training points and with the subpopulation of observing data points as the current population of observing data points;
(j) arranging the selected and retained segments having better estimates in an open-ended decision list, where each item in the open-ended decision list specifies a test for membership in one of the selected and retained segments having better estimates and specifies that, if a point passes a membership test, then a prediction for the point is to be obtained by arbitrating among predictions of all pertinent models, the pertinent models being subordinate models corresponding to selected and retained segments within which the point falls, the selected and retained segments having been selected and retained in execution of step (e) that selected and retained the segment in the membership test as one of the segments having better estimates;
(k) terminating the open-ended decision list with one of the models built in step (a) or step (h) and resulting in a terminated decision list, thereby solving the problem of model interpretability.
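For illustration only: a minimal sketch of how a terminated decision list built from possibly intersecting segments (claim 6, steps (j) and (k)) could produce a prediction. Each item carries the test for one segment with a better estimate together with every (segment, subordinate model) pair retained in the same execution of step (e), and arbitration is taken here to be a plain average of the applicable predictions; the claim leaves the arbitration rule open, so the average is an assumption, as are all names.

    from typing import Any, Callable, List, Sequence, Tuple

    Segment = Callable[[Any], bool]
    Submodel = Callable[[Any], float]
    # Each item: the better segment's test, plus all (segment, submodel) pairs retained
    # in the same round, including the better segment's own pair.
    Item = Tuple[Segment, List[Tuple[Segment, Submodel]]]

    def predict_with_arbitration(items: Sequence[Item],
                                 terminal_model: Callable[[Any], float],
                                 point: Any) -> float:
        for better_segment, round_pairs in items:
            if better_segment(point):                    # membership test of step (j)
                pertinent = [sub(point) for seg, sub in round_pairs if seg(point)]
                return sum(pertinent) / len(pertinent)   # arbitration (assumed: average)
        return terminal_model(point)                     # step (k): terminated list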
7. A computer implemented method as recited in claim 6, wherein the step (d) of sorting further comprises the step of comparing the observed performance of the current model when applied to the entire current population of observing points, as determined in step (b), with the sorted estimates as derived in step (d), and continuing with step (j) if an estimate as derived in step (d) is not substantially better than the observed performance.
8. A computer implemented method as recited in claim 7, wherein a counter K is initialized to a value of 1 in step (a) and incremented to K+1 in step (h) when a new predictive model is built so that the current model is called a K-th current model for each value of counter K, the arranging step (j) further comprising the steps:
selecting a value J in a range of 1 to the value of counter K;
discarding segments and subordinate models that were selected and retained in step (e) for the K-th current model, for all repetitions where K is greater than J;
discarding the K-th current model that was built in step (a) or (h), for all repetitions where K is not equal to J, wherein the terminated decision list as generated in step (k) is terminated with the K-th current model, where K is equal to J, as selected in the selecting step, thereby choosing a number of repetitions from an adaptively determined range of possibilities.
9. A computer implemented method as recited in claim 8, wherein the number of repetitions J is chosen from an adaptively determined range of possibilities by optimizing observed performance when applied to an initial population of observing points and the step of selecting a value J in the arranging step (j) further comprises the steps:
combining, for each possible choice of values for J in a range of 1 to K, the observed performance of the K-th current model when applied to every segment of a K-th population of observing points selected and retained in step (e) as a segment with a better estimate, where the value of counter K is less than J, with the observed performance of the K-th current model when applied to an entire K-th population of observing points, where the value of counter K is equal to J, thereby determining the observed performance when applied to the initial population of observing points that would result from choosing a value for J; and
choosing a value for J to optimize observed performance as determined in the combining step, and choosing the number of repetitions J from an adaptively determined range of possibilities by optimizing observed performance when applied to the initial population of observing points.
10. A computer implemented method as recited in claim 8, wherein the number of repetitions J is chosen from an adaptively determined range of possibilities by optimizing a reliable statistical estimate bounding future performance, and the step of selecting a value J in the arranging step (j) further comprises the steps:
combining, for each possible choice for J in a range from 1 to the value of counter K as selected in step (j), the observed performance of the K-th current model when applied to every mutually exclusive segment of an entire K-th population of observing points that was selected and retained in step (e), where the value of counter K is less than J, with the observed performance of the K-th current model when applied to the entire K-th population of observing points, where the value of counter K is equal to J, thereby determining the observed performance when applied to the initial population of observing points that would result from choosing a value for J;
forming an increasing sequence of sets of possible choices for the value J in a range from 1 to the value of counter K and ending the sequence of sets with a set comprising all possible choices;
selecting, for each set of possible choices formed in the forming step, the value of J that optimizes observed performance for the set, as determined in the combining step;
deriving a reliable estimate bounding future performance with the value of J selected in the selecting step by applying statistical learning theory, for each set of possibilities formed in the forming step, the estimate being derived from parameters, wherein the parameters are selected from a group consisting of the observed performance with J, the number of possibilities in the set, and the number of observing points; and
selecting the value of J to optimize estimated performance as determined in the deriving step.
11. A computer implemented method of boosting of black box predictive models, called cascade boosting, for resolving an interpretability problem of previous boosting methods, said method comprising the steps:
(a) building an initial black box predictive model, the initial black box predictive model being built from an initial population of training data points;
(b) observing performance of a current model, which is initially the initial black box predictive model, when applied to each point in a current population of observing data points, which is initially either the initial population of training data points or a separate initial population of data points reserved for observing performance;
(c) classifying each point in the current population of observing data points as being in one of two classes, a first class for points where the current model performs well and a second class for points where the current model does not perform well;
(d) building a current auxiliary model, the auxiliary model being a segmented predictive model for classification into either the first or second class as specified in step (c), and built from the current population of observing data points, which may utilize either mutually exclusive segments, or intersecting segments;
(e) applying statistical learning theory to derive a reliable estimate bounding future performance of the current model on each segment, the estimate being derived for each segment from the observed performance together with a number of the observing data points falling within the segment;
(f) sorting the segments by the estimates;
(g) selecting and retaining a fraction of the segments, and also retaining each auxiliary model associated with selected segments, the selection resulting in retention of segments with better estimates;
(h) forming a subpopulation of training points by sampling from the current population of training points so as to exclude, either with certainty or with high probability, each point falling within selected and retained segments;
(i) forming a subpopulation of observing data points by sampling from the current population of observing data points so as to exclude, either with certainty or with high probability, each point falling within the selected and retained segments;
(j) building another black box predictive model which becomes the current model, the current model being built from the subpopulation of training points;
(k) repeating steps (b) to (j) a desired number of times, with the subpopulation of training points as the current population of training points and with the subpopulation of observing data points as the current population of observing data points;
(l) arranging the selected and retained segments having better estimates in an open-ended decision list, where each item in the open-ended decision list specifies a test for membership in one of the selected and retained segments having better estimates and specifies that, if a point passes the membership test, then the prediction for the point is to be obtained by using the prediction of the black box predictive model retained in the same execution of step (g) that selected and retained the segment in the membership test as one of the segments having better estimates;
(m) terminating the open-ended decision list with one of the models built in step (a) or step (j) and resulting in a terminated decision list, thereby solving the problem of model interpretability.
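For illustration only: a minimal sketch of one repetition of the black box variant, steps (a) through (j) of claim 11, under assumptions the claim does not fix. build_black_box and build_auxiliary_segments are caller-supplied stand-ins for the black box model and the auxiliary segmented classifier, the well/poorly-predicted split of step (c) uses exact agreement between prediction and target, and the estimate of step (e) is again a one-sided Hoeffding bound; all names are illustrative.

    import math
    from typing import Any, Callable, List, Sequence, Tuple

    Point = Tuple[Any, float]
    Model = Callable[[Any], float]
    Segment = Callable[[Any], bool]

    def black_box_round(train: Sequence[Point], observe: Sequence[Point],
                        build_black_box: Callable[[Sequence[Point]], Model],
                        build_auxiliary_segments: Callable[[Sequence[Tuple[Any, bool]]], List[Segment]],
                        keep_fraction: float = 0.5, delta: float = 0.05):
        model = build_black_box(train)                            # steps (a) and (j)
        labelled = [(x, model(x) == y) for x, y in observe]       # steps (b) and (c)
        segments = build_auxiliary_segments(labelled)             # step (d): auxiliary model
        scored = []
        for seg in segments:                                      # steps (e) and (f)
            inside = [ok for x, ok in labelled if seg(x)]
            err = 1.0 - sum(inside) / len(inside) if inside else 1.0
            bound = err + math.sqrt(math.log(1.0 / delta) / (2.0 * max(len(inside), 1)))
            scored.append((bound, seg))
        scored.sort(key=lambda t: t[0])
        retained = [seg for _, seg in scored[:max(1, int(keep_fraction * len(scored)))]]  # step (g)
        train = [(x, y) for x, y in train if not any(s(x) for s in retained)]             # step (h)
        observe = [(x, y) for x, y in observe if not any(s(x) for s in retained)]         # step (i)
        # In the decision list of step (l), each retained segment's item predicts with this
        # round's black box model; the last round's model terminates the list (step (m)).
        return [(seg, model) for seg in retained], model, train, observe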
12. A computer implemented method as recited in claim 11, wherein the step (f) of sorting further comprises the step of comparing the observed performance of the current model when applied to the entire current population of observing points, as determined in step (b), with the sorted estimates as derived in step (f), and continuing with step (l) if an estimate as derived in step (f) is not substantially better than the observed performance.
13. A computer implemented method as recited in claim 12, wherein a counter K is initialized to a value of 1 in step (a) and incremented to K+1 in step (j) when a new predictive model is built so that the current model is called a K-th current model for each value of counter K, the arranging step (l) further comprising the steps:
selecting a value J in a range of 1 to the value of counter K;
discarding segments that were selected and retained in step (g) for the K-th current model, for all repetitions where K is greater than J;
discarding the K-th current model that was built in step (a) or (j), for all repetitions where K is not equal to J, wherein the terminated decision list as generated in step (m) is terminated with the K-th current model, where K is equal to J, as selected in the selecting step, thereby choosing a number of repetitions from an adaptively determined range of possibilities.
14. A computer implemented method as recited in claim 12, wherein the number of repetitions J is chosen from an adaptively determined range of possibilities by optimizing observed performance when applied to an initial population of observing points and the step of selecting a value J in the arranging step (l) further comprises the steps:
combining, for each possible choice of values for J in a range of 1 to K, the observed performance of a K-th model when applied to every segment of a K-th population of observing points selected and retained in step (g) as a segment with a better estimate, where the value of counter K is less than J, with the observed performance of the K-th model when applied to an entire K-th population of observing points, where the value of counter K is equal to J, thereby determining the observed performance when applied to the initial population of observing points that would result from choosing a value for J; and
choosing a value for J to optimize observed performance as determined in the combining step, and choosing the number of repetitions J from an adaptively determined range of possibilities by optimizing observed performance when applied to the initial population of observing points.
15. A computer implemented method as recited in claim 12, wherein the number of repetitions J is chosen from an adaptively determined range of possibilities by optimizing a reliable statistical estimate bounding future performance, and the step of selecting a value J in the arranging step (l) further comprises the steps:
combining, for each possible choice for J in a range from 1 to the value of counter K as selected in step (l), the observed performance of the K-th current model when applied to every mutually exclusive segment of the entire K-th population of observing points that was selected and retained in step (g), where the value of counter K is less than J, with the observed performance of a K-th current model when applied to an entire K-th population of observing points, where the value of counter K is equal to J, thereby determining the observed performance when applied to the initial population of observing points that would result from choosing a value for J;
forming an increasing sequence of sets of possible choices for the value J in a range from 1 to the value of counter K and ending the sequence of sets with a set comprising all possible choices;
selecting, for each set of possible choices formed in the forming step, the value of J that optimizes observed performance for the set, as determined in the combining step;
deriving a reliable estimate bounding future performance with the value of J selected in the selecting step by applying statistical learning theory, for each set of possibilities formed in the forming step, the estimate being derived from parameters, wherein the parameters are selected from a group consisting of the observed performance with J, a number of possibilities in the set, and the number of observing points; and
selecting the value of J to optimize estimated performance as determined in the deriving step.
Specification