Method and apparatus for predictive modeling & analysis for knowledge discovery

US 20080133434A1
Filed: 11/12/2004
Published: 06/05/2008
Est. Priority Date: 11/12/2004
Status: Abandoned Application

Abstract

A device and method designed to carry out the computation of a wide range of topological indices of molecular structure to produce molecular descriptors representing important elements of molecular structure information, including but not limited to: the molecular connectivity chi indices, ^mχ_t and ^mχ_t^v; kappa shape indices, ^mκ and ^mκ_α; electrotopological state indices, S_i; hydrogen electrotopological state indices, HES_i; atom type and bond type electrotopological state indices; new group type and bond type electrotopological state indices; topological equivalence indices and total topological index; several information indices, including the Shannon and the Bonchev-Trinajstić information indices; counts of graph paths, atoms, atom types, and bond types; and others.


20 Claims

1. A method and apparatus for predictive modeling & analysis for knowledge discovery, comprising:

  selecting a specific target for which predictive modeling and analysis is to be performed;

  importing the dataset into learning and testing datasets;

  further dividing the learning dataset into training and validation datasets;

  normalizing and cleaning the dataset;

  systematic dimensionality reduction of features from the learning dataset in order to improve the performance of creating models without sacrificing speed;

  configuring the apparatus for either a single-class or multi-class classification modeling or a regression modeling or optionally both;

  optionally selecting an appropriate linear or non-linear kernel for modeling;

  selecting an auto-tuning parameter for automatically optimizing and selecting the best model with the highest accuracy for correct predictions of activity, including selecting a linear or non-linear kernel that yields the best model with the highest accuracy;

  creating models using support vector machines and other algorithms such as Naïve Bayes, Random Forest and Ridge Regression with the learning dataset, and auto-selecting the best model with the best accuracy for correct predictions of activity;

  testing the test dataset against the auto-selected best model to determine over-fitting;

  discovering dominant features and characteristics in the learning dataset for the given target and the selected model;

  performing cluster analysis on the learning dataset to discover different classes and series of similar data-points, and discovering dominant features and characteristics of each cluster;

  further systematic dimensionality reduction of features from the learning dataset in order to further improve accuracy based on the selected auto-tuning parameter;

  iteratively re-creating models using support vector machines or other algorithms including Naïve Bayes, Random Forest and Ridge Regression with the learning dataset with reduced features, and then auto-selecting the best model with the best accuracy for correct predictions of activity;

  discovering noise in the training dataset by performing a Noise Discovery Cross Validation Algorithm;

  predicting activity and level of activity of data-points with unknown ground truth using the selected best model;

  discovering dominant features and characteristics of the data-points in the prediction dataset for the given target;

  performing similarity discovery to discover whether the prediction dataset and training dataset come from a similar distribution and series;

  packaging and exporting models to be integrated and used with other third-party applications;

  recreating the best model by training only on the support vectors in case the algorithm used for training is Support Vector Machines;

  allowing users to add additional data to the original training dataset for retraining and generating local models that are more specific to the user's problem domain;

  ability to perform incremental learning by adding new training data to improve the model without having to re-run and re-generate the model.
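
As an illustration only, the data-splitting, model auto-selection and over-fitting check recited in claim 1 could be sketched as follows. This is a minimal standard-library Python sketch, not the applicants' implementation: the function names `split_dataset`, `accuracy` and `select_best_model`, the split fractions, and the two fixed threshold rules standing in for trained models are all assumptions made for the example.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.25, seed=0):
    """Shuffle rows, hold out a test set, then split the remaining
    learning set into training and validation sets.

    The fractions are illustrative assumptions; claim 1 does not fix them.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, learning = shuffled[:n_test], shuffled[n_test:]
    n_validation = round(len(learning) * validation_fraction)
    validation, training = learning[:n_validation], learning[n_validation:]
    return training, validation, test


def accuracy(model, rows):
    """Fraction of (feature, label) rows the model predicts correctly."""
    return sum(1 for x, y in rows if model(x) == y) / len(rows)


def select_best_model(models, validation):
    """Return the name of the candidate with the highest validation accuracy.

    Ties go to the candidate listed first in the dictionary.
    """
    return max(models, key=lambda name: accuracy(models[name], validation))


# Toy data: one numeric feature, label 1 when the feature exceeds 5.
rows = [(x, int(x > 5)) for x in range(20)]
training, validation, test = split_dataset(rows)

# Hypothetical candidates: fixed threshold rules standing in for trained models.
models = {
    "threshold_5": lambda x: int(x > 5),
    "threshold_10": lambda x: int(x > 10),
}

best = select_best_model(models, validation)
# A large gap between validation and test accuracy would suggest over-fitting.
gap = accuracy(models[best], validation) - accuracy(models[best], test)
print(best, gap)
```

In a real system the candidates would be models fitted on the training set (for example support vector machines with different kernels), and the validation set would drive the auto-tuning; the test set is touched only once, after selection.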
2. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein, when Quantitative Structure-Property Relationship (QSPR) analysis is to be performed, molecular descriptors, structural keys, signatures, or molecular fingerprints (or, more simply, fingerprints) are generated from molecular structures represented in molecular structure file formats including SMILES (the Simplified Molecular Input Line Entry System proposed by Dave Weininger [Weininger, 1988]), SDF, MOL or MOL2;

3. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein, for a dual-class classification problem with a very large unbalanced dataset, in which a small fraction of data-points belong to the positive class and the majority belong to the negative class, the dataset can be further reduced by including a smaller quantity of negative-class data-points, where the quantity of negative-class data-points is five times the total number of positive-class data-points;

4. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein data normalization can be achieved either by a 0-1 scaling or by a unit scaling;

5. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein data cleaning can be achieved by eliminating features that have the same value across the dataset;

6. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 5, wherein data cleaning can be achieved by providing adequate values for missing feature values in the dataset;

7. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein discovering dominant features and characteristics with non-linear relationships in the learning dataset for the given target can be achieved for a non-linear kernel using a Non-linear Feature Selection for Support Vector Machine algorithm;

8. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein discovering dominant features and characteristics in the learning dataset for the given target can be further enhanced to discover correlation between dominant features and features correlated to the dominant features in the learning dataset by using a correlation coefficient algorithm based on Fisher Score, Unbalanced Univariate Correlation and Multivariate Unbalanced Correlation;

9. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein feature dimensionality of the modeling dataset can be reduced by backward and/or forward elimination algorithms;

10. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein models are created and auto-selected using a grid search algorithm;

11. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein models are created and auto-selected using a pattern search (also known as auto train) algorithm;

12. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein models are created and auto-selected using svmPath, which computes the entire solution path for the two-class SVM model; the solution is calculated for every value of the cost parameter C, at essentially the same computing cost as a single SVM solution;

13. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein created models for classification can be assessed and compared based on Error Rate, Accuracy, Precision, Recall, Enrichment Curve, F-Measure, ROC graph, Balanced Error Rate, 1% of Actives, Balanced Standard Error, Balanced Accuracy and Model Complexity;

14. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein created models for regression can be assessed and compared based on RMS, R², Mean Relative Error and Mean Absolute Error;

15. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein k-fold cross validation can be used to further split the learning dataset into k folds for building models based on multiple folds, which improves accuracy by reducing over-fitting; the algorithm's kernel parameters are automatically tuned to minimize the validation error during k-fold cross-validation of the training data, thus selecting the best model with the highest accuracy;

16. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 15, which can be further improved wherein the number of folds is equal to the number of data-points, often referred to as "Leave-One-Out cross validation";

17. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein the accuracy of the models can be further improved by combining multiple weaker models to build a more accurate model using techniques called boosting and bagging;

18. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein a method called transductive inference can be used when testing is performed on data-points that are expected to come from a different distribution than the distribution of the data-points used in the learning dataset;

19. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 1, wherein dominant features, with non-linear relationships, of a prediction dataset with unknown ground truth can be discovered by applying a Non-linear Feature Selection for Non-Support Vector algorithm;

20. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 20, wherein the Non-linear Feature Selection for Non-Support Vector algorithm can be applied to each cluster for discovering dominant features and characteristics of each cluster.
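
The preprocessing described in claims 3, 4 and 5 above can be illustrated with a short standard-library Python sketch. It is a hedged example under stated assumptions rather than the claimed apparatus: the function names are hypothetical, "same value" in claim 5 is read as "constant across all rows", random sampling is assumed for choosing which negatives to keep, and only the 0-1 scaling of claim 4 is shown because the claims do not define "unit scaling".

```python
import random


def undersample_negatives(rows, labels, ratio=5, seed=0):
    """Keep every positive row and at most `ratio` negatives per positive.

    Claim 3 states the 5:1 ratio; choosing the negatives by uniform
    random sampling is an assumption of this sketch.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y != 1]
    n_keep = min(len(negatives), ratio * len(positives))
    kept = random.Random(seed).sample(negatives, n_keep)
    index = sorted(positives + kept)
    return [rows[i] for i in index], [labels[i] for i in index]


def drop_constant_features(rows):
    """Remove columns whose value is identical in every row.

    Returns the reduced rows and the indices of the columns that were kept.
    """
    n_columns = len(rows[0])
    keep = [j for j in range(n_columns) if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows], keep


def min_max_scale(rows):
    """Rescale each column linearly to the range [0, 1].

    A constant column (zero range) is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy data: column 0 and column 2 are constant, column 1 varies.
rows = [[1.0, 2.0, 5.0], [1.0, 4.0, 5.0], [1.0, 6.0, 5.0]]
cleaned, kept_columns = drop_constant_features(rows)
scaled = min_max_scale(cleaned)
print(kept_columns, scaled)
```

Cleaning before scaling keeps the zero-range guard in `min_max_scale` from ever being exercised on real features; the guard remains only as a safety net for callers that skip the cleaning step.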


Current Assignee
Adnan Asar, Ravi Mallela, Sinclair Hamilton Hitchings, Victor N. Pavlov
Original Assignee
Adnan Asar, Ravi Mallela, Sinclair Hamilton Hitchings, Victor N. Pavlov
Inventors
Asar, Adnan; Mallela, Ravi; Pavlov, Victor N.; Hitchings, Sinclair Hamilton

Application Number

US10/987,784
Publication Number

US 20080133434A1
US Class Current

706/12
CPC Class Codes

G06N 20/00 Machine learning

G06N 20/10 using kernel methods, e.g. ...
