Computer system and process for training of analytical models using large data sets
Abstract
A database often contains sparse, i.e., under-represented, conditions which might not be represented in a training data set for training an analytical model if the training data set is created by stratified sampling. Sparse conditions may be represented in a training set by using a data set which includes essentially all of the data in a database, without stratified sampling. A series of samples, or “windows,” is used to select portions of the large data set for phases of training. In general, the first window of data should be a reasonably broad sample of the data. After the model is initially trained using a first window of data, subsequent windows are used to retrain the model. For some model types, the model is modified in order to provide it with some retention of training obtained using previous windows of data. Neural networks and Kohonen networks may be used without modification. Models such as probabilistic neural networks, generalized regression neural networks, Gaussian radial basis functions, and decision trees, including K-D trees and neural trees, are modified to provide them with properties of memory to retain the effects of training with previous training data sets. Such a modification may be provided using clustering. Parallel training models which partition the training data set into disjoint subsets are modified so that the partitioner is trained only on the first window of data, whereas subsequent windows are used to train the models to which the partitioner applies the data in parallel.
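The last point of the abstract, a partitioner trained only on the first window while later windows train the models behind it, can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for brevity: the partitioner is a single threshold at the mean input of the first window, the two sub-models are running means of the target, and the sub-models are updated sequentially rather than in parallel.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float]  # (input x, target y)


class RunningMean:
    """Toy sub-model: predicts the mean target of every row routed to it."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def train(self, y: float) -> None:
        self.n += 1
        self.mean += (y - self.mean) / self.n


class PartitionedModel:
    """Hypothetical two-way partitioned model (assumption, not the patent's design)."""

    def __init__(self) -> None:
        self.split: Optional[float] = None  # the partitioner; fitted once
        self.models = (RunningMean(), RunningMean())

    def _route(self, x: float) -> int:
        return 0 if x <= self.split else 1

    def train(self, window: List[Row]) -> None:
        if self.split is None:
            # The partitioner is fitted on the first window only.
            self.split = sum(x for x, _ in window) / len(window)
        # Every window, including the first, trains the sub-models.
        for x, y in window:
            self.models[self._route(x)].train(y)

    def predict(self, x: float) -> float:
        return self.models[self._route(x)].mean


model = PartitionedModel()
model.train([(0.0, 1.0), (10.0, 5.0)])   # first window fixes the split
model.train([(100.0, 7.0), (2.0, 3.0)])  # later window leaves the split alone
```

Because the split never moves after the first window, a first window that is not a reasonably broad sample would misroute later data, which is consistent with the abstract's advice about the first window.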
16 Claims
1. A computer-implemented method for training an analytical model using a training data set having a size greater than a size of memory available in a computer for a training data set for training the analytical model, comprising:
logically dividing the training data set into a plurality of subsets, each having a size at most equal the size of memory available in the computer for a training data set;
selecting a first subset of the data set;
storing the first subset in memory available in the computer for a training data set;
training the analytical model using the first subset as the training data set;
selecting at least one additional subset of the data set and storing the at least one additional subset in the memory available in the computer for a training data set;
retraining the trained analytical model using the at least one additional subset as the training data set, such that the retrained analytical model represents training performed using the first subset and the at least one additional subset.
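The steps of claim 1 can be sketched as a loop over memory-sized windows. This is only an illustration under stated assumptions: the "analytical model" is a toy nearest-centroid classifier, the memory limit is expressed as a row count (`max_rows`), and all names (`windows`, `CentroidModel`, `train_in_windows`) are hypothetical. The model keeps per-class sums and counts, so retraining on a later window yields the same centroids as training on all rows at once.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Row = Tuple[Sequence[float], str]  # (feature vector, class label)


def windows(rows: Iterable[Row], max_rows: int) -> Iterator[List[Row]]:
    """Logically divide a row stream into subsets of at most max_rows rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, max_rows))  # only this subset is held in memory
        if not chunk:
            return
        yield chunk


class CentroidModel:
    """Toy classifier whose state (sums and counts) survives retraining."""

    def __init__(self) -> None:
        self.sums: Dict[str, List[float]] = {}
        self.counts: Dict[str, int] = {}

    def train(self, window: List[Row]) -> None:
        # Used for both initial training and retraining: state accumulates.
        for features, label in window:
            total = self.sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                total[i] += value
            self.counts[label] = self.counts.get(label, 0) + 1

    def centroid(self, label: str) -> List[float]:
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def predict(self, features: Sequence[float]) -> str:
        def squared_distance(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(features, self.centroid(label)))

        return min(self.counts, key=squared_distance)


def train_in_windows(rows: Iterable[Row], max_rows: int) -> CentroidModel:
    model = CentroidModel()
    for window in windows(rows, max_rows):  # first subset, then additional ones
        model.train(window)
    return model


data: List[Row] = [
    ([0.0, 0.0], "a"), ([2.0, 0.0], "a"),
    ([10.0, 10.0], "b"), ([12.0, 10.0], "b"),
    ([1.0, 3.0], "a"),
]
model = train_in_windows(data, max_rows=2)
```

A model without such accumulated state would need one of the modifications described in the abstract to retain the effect of earlier windows.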
training the radial model;
updating a decision tree with adjusted radius values;
upon retraining, forcing training data from a subset of the training data set into leaves of the decision tree;
adjusting clusters defining the leaves of the decision tree;
optimizing the radial model defined by clusters from the decision tree; and
adjusting the decision tree by adjusting the clusters defining the leaves of the decision tree according to an optimized radial function.
7. The computer-implemented method of claim 1, wherein the analytical model is a decision tree model adjusted to have properties of memory to retain effects of training with previous subsets of the training data set when training with a new subset of the training data set through a computer-implemented method comprising:
training the initial decision tree;
defining clusters for each leaf of the decision tree;
upon retraining, forcing training data from a subset of the training data set into leaves of the decision tree; and
adjusting clusters of each leaf of the decision tree.
8. The computer-implemented method of claim 7, wherein adjusting the clusters of each leaf of the decision tree comprises creating new cluster definitions.
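One way to picture claims 7 and 8 is a tree whose splits are fixed after the first window and whose leaves each carry a set of clusters. The sketch below is an assumption-laden illustration, not the patented method: the tree is a hypothetical one-split stump over a single numeric feature, clusters are running means, and the `radius` rule for deciding between adjusting an existing cluster and creating a new cluster definition is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    center: float
    count: int = 1

    def absorb(self, x: float) -> None:
        """Move the center to the running mean of all absorbed points."""
        self.count += 1
        self.center += (x - self.center) / self.count


@dataclass
class Leaf:
    clusters: List[Cluster] = field(default_factory=list)

    def update(self, x: float, radius: float) -> None:
        nearest = min(self.clusters, key=lambda c: abs(c.center - x), default=None)
        if nearest is not None and abs(nearest.center - x) <= radius:
            nearest.absorb(x)  # adjust an existing cluster
        else:
            self.clusters.append(Cluster(center=x))  # new cluster definition


@dataclass
class Stump:
    """Hypothetical one-split tree; the split is fixed after initial training."""
    threshold: float
    left: Leaf = field(default_factory=Leaf)
    right: Leaf = field(default_factory=Leaf)

    def leaf_for(self, x: float) -> Leaf:
        return self.left if x <= self.threshold else self.right


def retrain(tree: Stump, window: List[float], radius: float) -> None:
    """Force each point into a leaf, then adjust that leaf's clusters."""
    for x in window:
        tree.leaf_for(x).update(x, radius)


def train_initial(window: List[float], radius: float) -> Stump:
    """Train the initial tree and define clusters for each leaf."""
    tree = Stump(threshold=sum(window) / len(window))
    retrain(tree, window, radius)
    return tree


tree = train_initial([1.0, 3.0, 9.0, 11.0], radius=2.5)
retrain(tree, [2.0, 20.0], radius=2.5)  # 20.0 is far from any cluster
```

In this sketch the second window never changes the split; it only moves cluster centers or adds clusters, so the leaves keep a summary of earlier windows without those windows being held in memory.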
9. A computer system for training an analytical model using a data set having a size greater than a size of memory available in a computer for a training data set for training the analytical model, comprising:
means for logically dividing the training data set into a plurality of subsets, each having a size at most equal the size of the memory available in the computer for a training data set;
means for selecting a first subset of the data set;
means for storing the first subset in the memory available in the computer for a training data set;
means for training the analytical model using the first subset as the training data set;
means for selecting at least one additional subset of the data set and storing the at least one additional subset in the memory available in the computer for a training data set;
means for retraining the trained analytical model using the at least one additional subset as the training data set, such that the retrained analytical model represents training performed using the first subset and the at least one additional subset.
means for training the radial model;
means for updating a decision tree with adjusted radius values;
means for forcing training data of the subset of the training data set into leaves of the decision tree upon retraining;
means for adjusting clusters defining the leaves of the decision tree;
means for optimizing the radial model defined by clusters from the decision tree; and
means for adjusting the decision tree by adjusting the clusters defining the leaves of the decision tree according to an optimized radial function.
15. The computer system of claim 9, wherein the analytical model is a decision tree model adjusted to have properties of memory to retain effects of training with previous subsets of the training data set when training with a new subset of the training data set, wherein the computer system comprises:
means for training an initial decision tree;
means for defining clusters for each leaf of the decision tree;
means for forcing training data of the subset of the training data set into leaves of the decision tree upon retraining; and
means for adjusting clusters of each leaf of the decision tree.
16. The computer system of claim 15, wherein the means for adjusting the clusters of each leaf of the decision tree comprises means for creating new cluster definitions.
Specification