Computer system and process for training of analytical models using large data sets

US 6,347,310 B1
Filed: 05/11/1998
Issued: 02/12/2002
Est. Priority Date: 05/11/1998
Status: Expired due to Term

Abstract

A database often contains sparse, i.e., under-represented, conditions which might not be represented in a training data set for training an analytical model if the training data set is created by stratified sampling. Sparse conditions may be represented in a training set by using a data set which includes essentially all of the data in a database, without stratified sampling. A series of samples, or “windows,” are used to select portions of the large data set for phases of training. In general, the first window of data should be a reasonably broad sample of the data. After the model is initially trained using a first window of data, subsequent windows are used to retrain the model. For some model types, the model is modified in order to provide it with some retention of training obtained using previous windows of data. Neural networks and Kohonen networks may be used without modification. Models such as probabilistic neural networks, generalized regression neural networks, Gaussian radial basis functions, and decision trees, including K-D trees and neural trees, are modified to provide them with properties of memory to retain the effects of training with previous training data sets. Such a modification may be provided using clustering. Parallel training models which partition the training data set into disjoint subsets are modified so that the partitioner is trained only on the first window of data, whereas subsequent windows are used to train the models to which the partitioner applies the data in parallel.
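
The windowed training described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `LinearModel` class, the `(x, y)` row layout, the learning rate, and the `tolerance` value are all assumptions chosen for illustration. A gradient-trained linear model stands in for the model types that, per the abstract, may be retrained without modification. The loop stops either when no windows remain or when the weights stop changing, which mirrors the terminating conditions named in claims 3 and 4.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[float, float]  # (input x, target y); a hypothetical record layout


def windows(rows: Iterable[Row], window_size: int) -> Iterator[List[Row]]:
    """Yield successive windows of at most window_size rows from a stream."""
    window: List[Row] = []
    for row in rows:
        window.append(row)
        if len(window) == window_size:
            yield window
            window = []
    if window:  # final, possibly shorter, window
        yield window


class LinearModel:
    """Illustrative stand-in model; its weights persist between windows."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def train(self, window: List[Row], epochs: int = 100) -> None:
        # Gradient steps start from the current weights, so training on
        # earlier windows still shapes the result after a later window.
        for _ in range(epochs):
            for x, y in window:
                error = (self.w * x + self.b) - y
                self.w -= self.learning_rate * error * x
                self.b -= self.learning_rate * error


def train_in_windows(
    rows: Iterable[Row], window_size: int, tolerance: float = 1e-9
) -> Tuple[LinearModel, int]:
    """Train on the first window, then retrain on each later window."""
    model = LinearModel()
    windows_used = 0
    for window in windows(rows, window_size):
        before = (model.w, model.b)
        model.train(window)
        windows_used += 1
        change = max(abs(model.w - before[0]), abs(model.b - before[1]))
        if windows_used > 1 and change < tolerance:
            break  # assumed convergence test: weights no longer move
    return model, windows_used  # loop also ends when windows run out


# Synthetic data following y = 2x + 1; only one window is held at a time.
data = [((i % 10) / 10.0, 2.0 * ((i % 10) / 10.0) + 1.0) for i in range(100)]
model, used = train_in_windows(iter(data), window_size=20)
```

On this noise-free data the weights `model.w` and `model.b` should approach 2 and 1. Only one window is materialized at a time, so peak memory is bounded by `window_size` rather than by the size of the full data set.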

16 Claims

1. A computer-implemented method for training an analytical model using a training data set having a size greater than a size of memory available in a computer for a training data set for training the analytical model, comprising:
- logically dividing the training data set into a plurality of subsets, each having a size at most equal the size of memory available in the computer for a training data set;
  
  selecting a first subset of the data set;
  
  storing the first subset in memory available in the computer for a training data set;
  
  training the analytical model using the first subset as the training data set;
  
  selecting at least one additional subset of the data set and storing the at least one additional subset in the memory available in the computer for a training data set;
  
  retraining the trained analytical model using the at least one additional subset as the training data set, such that the retrained analytical model represents training performed using the first subset and the at least one additional subset.
- - 2. The computer-implemented method of claim 1, wherein acts of selecting at least one additional subset and storing the at least one additional subset in memory available for a training data set, and retraining the trained analytical model using the at least one additional subset as the training data set, are performed until a terminating condition is reached.
  - 3. The computer-implemented method of claim 2, wherein the terminating condition is a lack of additional subsets of the training data set.
  - 4. The computer-implemented method of claim 2, wherein the terminating condition is reached when a desired convergence of the model is achieved.
  - 5. The computer-implemented method of claim 1, wherein the training data set contains data points which represent sparse conditions, and wherein the at least one additional subset of the training data set is comprised of the data points representing the sparse conditions of the training data set.
  - 6. The computer-implemented method of claim 1, wherein the analytical model is a radial model adjusted to have properties of memory to retain effects of training with previous subsets of the training data set when training with a new subset of the training data set through a computer-implemented method comprising:
7. The computer-implemented method of claim 1, wherein the analytical model is a decision tree model adjusted to have properties of memory to retain effects of training with previous subsets of the training data set when training with a new subset of the training data set through a computer-implemented method comprising:
- training the initial decision tree;
  
  defining clusters for each leaf of the decision tree;
  
  upon retraining, forcing training data from a subset of the training data set into leaves of the decision tree; and
  
  adjusting clusters of each leaf of the decision tree.
8. The computer-implemented method of claim 7, wherein adjusting the clusters of each leaf of the decision tree comprises creating new cluster definitions.
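
The decision tree steps of claims 7 and 8 can be sketched roughly as follows. This is one possible reading, not the patent's own algorithm: the one-split tree, the mean-based threshold, the one-dimensional centroid clusters, and the `radius` parameter are all assumptions made for illustration. The tree structure is fixed after the first window; later windows are forced into the existing leaves, where they either adjust a nearby cluster or create a new cluster definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Cluster:
    """Running summary of points seen in a leaf (hypothetical structure)."""
    center: float
    count: int = 1

    def absorb(self, x: float) -> None:
        # Incremental mean update: the centroid reflects all absorbed points.
        self.count += 1
        self.center += (x - self.center) / self.count


@dataclass
class Leaf:
    clusters: List[Cluster] = field(default_factory=list)

    def add(self, x: float, radius: float) -> None:
        nearest = min(
            self.clusters, key=lambda c: abs(c.center - x), default=None
        )
        if nearest is not None and abs(nearest.center - x) <= radius:
            nearest.absorb(x)  # adjust an existing cluster
        else:
            self.clusters.append(Cluster(center=x))  # new cluster definition


class StumpWithClusters:
    """One-split tree whose leaves remember past data through clusters."""

    def __init__(self, radius: float) -> None:
        self.radius = radius
        self.threshold: Optional[float] = None
        self.left = Leaf()
        self.right = Leaf()

    def train(self, window: Sequence[float]) -> None:
        # Initial training: a crude stand-in for real tree induction that
        # splits at the mean of the first window, then defines leaf clusters.
        self.threshold = sum(window) / len(window)
        self._force_into_leaves(window)

    def retrain(self, window: Sequence[float]) -> None:
        # The split is left unchanged; only the leaf clusters are adjusted.
        self._force_into_leaves(window)

    def _force_into_leaves(self, window: Sequence[float]) -> None:
        assert self.threshold is not None, "call train() before retrain()"
        for x in window:
            leaf = self.left if x <= self.threshold else self.right
            leaf.add(x, self.radius)


tree = StumpWithClusters(radius=1.0)
tree.train([0.0, 0.5, 10.0, 10.5])  # first subset fixes the split
tree.retrain([0.3, 20.0])           # later subset only updates clusters
```

In this example the value 0.3 is absorbed into the existing left-leaf cluster, while 20.0 lies farther than `radius` from every cluster in the right leaf and therefore creates a new one.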

9. A computer system, for training an analytical model using a data set having a size greater than a size of memory available in a computer for a training data set for training the analytical model comprising:
- means for logically dividing the training data set into a plurality of subsets, each having a size at most equal the size of the memory available in the computer for a training data set;
  
  means for selecting a first subset of the data set;
  
  means for storing the first subset in the memory available in the computer for a training data set;
  
  means for training the analytical model using the first subset as the training data set;
  
  means for selecting at least one additional subset of the data set and storing the at least one additional subset in the memory available in the computer for a training data set;
  
  means for retraining the trained analytical model using the at least one additional subset as the training data set, such that the retrained analytical model represents training performed using the first subset and the at least one additional subset.
- - 10. The computer system of claim 9, further comprising means for repeating operation of the means for selecting at least one additional subset, and storing the at least one additional subset in memory available for a training data set, and the means for retraining the trained analytical model using the at least one additional subset as the training data set, until a terminating condition is reached.
  - 11. The computer system of claim 10, wherein the terminating condition is a lack of additional subsets of the training data set.
  - 12. The computer system of claim 10, wherein the terminating condition is reached when a desired convergence of the model is achieved.
  - 13. The computer system of claim 9, wherein the training data set contains data points which represent sparse conditions, and wherein the at least one additional subset of the training data set is comprised of the data points representing the sparse conditions of the training data set.
  - 14. The computer system of claim 9, wherein the analytical model is a radial model adjusted to have the properties of memory to retain effects of training with previous subsets of the training data set when training with a new subset of the training data set, wherein the system comprises:
15. The computer system of claim 9, wherein the analytical model is a decision tree model adjusted to have properties of memory to retain effects of training with previous subsets of the training data set when training with a new subset of the training data set, wherein the computer system comprises:
- means for training an initial decision tree;
  
  means for defining clusters for each leaf of the decision tree;
  
  means for forcing training data of the subset of the training data set into leaves of the decision tree upon retraining; and
  
  means for adjusting clusters of each leaf of the decision tree.
16. The computer system of claim 15, wherein the means for adjusting the clusters of each leaf of the decision tree comprises means for creating new cluster definitions.
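
Claims 5 and 13 call for an additional subset made up of the data points that represent sparse conditions. The source does not say how such points are identified, so the sketch below makes its own assumptions: a condition counts as sparse when its share of all rows is at most `max_share`, and the data can be scanned twice through a hypothetical `rows_factory` callable that returns a fresh iterator each time (for example by reopening a file). The names `sparse_subset`, `condition_of`, and `max_rows` are likewise illustrative.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def sparse_subset(
    rows_factory: Callable[[], Iterable[T]],
    condition_of: Callable[[T], Hashable],
    max_share: float,
    max_rows: int,
) -> List[T]:
    """Collect up to max_rows rows whose condition is under-represented."""
    # Pass 1: count how often each condition occurs. Only the counts are
    # kept in memory, not the rows themselves.
    counts = Counter(condition_of(row) for row in rows_factory())
    total = sum(counts.values())
    rare = {c for c, n in counts.items() if n / total <= max_share}

    # Pass 2: gather rows with a rare condition, capped at max_rows so the
    # subset fits in the memory set aside for training data.
    subset: List[T] = []
    for row in rows_factory():
        if condition_of(row) in rare:
            subset.append(row)
            if len(subset) == max_rows:
                break
    return subset


# Hypothetical data: 98 "ok" rows and 2 "fault" rows.
rows = [("ok", i) for i in range(98)] + [("fault", 1), ("fault", 2)]
window = sparse_subset(lambda: iter(rows), lambda r: r[0], 0.05, 10)
```

Here `window` contains only the two "fault" rows, which could then be passed to a retraining step as an additional subset.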

Current Assignee: International Business Machines Corporation
Original Assignee: Ascential Systems, Inc. (International Business Machines Corporation)
Inventors: Passera, Anthony
Primary Examiner(s): Davis, George B.
Application Number: US09/075,730
Time in Patent Office: 1,373 Days
Field of Search: 706/25, 706/16, 706/20
US Class Current: 706/25
CPC Class Codes:
G06F 18/2414   Smoothing the distance, e.g...
G06F 18/24323   Tree-organised classifiers
G06N 20/00   Machine learning
G06N 3/08   Learning methods