Method and apparatus for predictive modeling & analysis for knowledge discovery
First Claim
1. A method and apparatus for predictive modeling & analysis for knowledge discovery comprising:
selecting a specific target for which predictive modeling and analysis is to be performed;
importing the dataset into learning and testing data sets;
further dividing the learning dataset into training and validation datasets;
normalizing and cleaning the dataset;
systematic dimensionality reduction of features from the learning dataset in order to improve the performance of creating models without sacrificing speed;
configuring the apparatus for either a single-class or multi-class classification modeling or a regression modeling or optionally both;
optionally selecting an appropriate linear or non-linear kernel for modeling;
selecting an auto-tuning parameter for automatically optimizing and selecting the best model with the highest accuracy for correct predictions of activity including selecting a linear or non-linear kernel that yields the best model with the highest accuracy;
creating models using support vector machines and other algorithms such as Naive Bayes, Random Forest, Ridge Regression with the learning dataset and auto-selecting the best model with the best accuracy for correct predictions of activity;
testing the test dataset against the auto-selected best model to determine over-fitting;
discovering dominant features and characteristics as in the learning dataset for the given target and the selected model;
performing cluster analysis on the learning dataset to discover different classes and series of similar data-points and discovering dominant features and characteristics of each cluster;
further systematic dimensionality reduction of features from the learning dataset in order to further improve accuracy based on the selected auto-tuning parameter;
iteratively re-creating models using support vector machines or other algorithms including Naïve Bayes, Random Forest and Ridge Regression with the learning dataset with reduced features and then auto-selecting the best model with the best accuracy for correct predictions of activity;
discovering noise in the training dataset by performing a Noise Discovery Cross Validation Algorithm;
predicting activity and level of activity of data-points with unknown ground truth using the selected best model;
discovering dominant features and characteristics of the data-points in the prediction dataset for the given target;
performing similarity discovery to discover if the prediction dataset and training dataset come from similar distribution and series;
packaging and exporting models to be integrated and used with other third party applications;
recreating the best model by only training on the support vectors in case the algorithm used for training is Support Vector Machines;
allowing users to add additional data to the original training dataset for retraining and generating local models that are more specific to the user's problem domain;
performing incremental learning by adding new training data to improve the model without having to re-run and re-generate the model.
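The data-splitting, normalizing, model auto-selection and over-fitting test steps of the claim can be illustrated with a minimal Python sketch. It is not the claimed apparatus. To stay dependency-free it substitutes two stand-in classifiers (a nearest-centroid model and a majority-class baseline) for the support vector machine, Naive Bayes, Random Forest and Ridge Regression models named in the claim, it omits dimensionality reduction, and it runs on synthetic data. All function and class names are hypothetical.

```python
import random
from statistics import mean, pstdev


def split(rows, fraction, rng):
    """Shuffle a copy of rows and cut it into two parts at the given fraction."""
    rows = list(rows)
    rng.shuffle(rows)
    cut = int(len(rows) * fraction)
    return rows[:cut], rows[cut:]


def fit_scaler(rows):
    """Per-feature (mean, std) computed from the training rows only."""
    columns = zip(*(x for x, _ in rows))
    return [(mean(c), pstdev(c) or 1.0) for c in columns]


def normalize(rows, scaler):
    """Apply z-score normalization using a previously fitted scaler."""
    return [([(v - m) / s for v, (m, s) in zip(x, scaler)], y) for x, y in rows]


class NearestCentroid:
    """Stand-in model: predict the class whose mean vector is closest."""

    def fit(self, rows):
        by_class = {}
        for x, y in rows:
            by_class.setdefault(y, []).append(x)
        self.centroids = {y: [mean(c) for c in zip(*xs)] for y, xs in by_class.items()}
        return self

    def predict(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


class MajorityClass:
    """Baseline stand-in: always predict the most frequent training label."""

    def fit(self, rows):
        labels = [y for _, y in rows]
        self.label = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return self.label


def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


def auto_select(candidates, train, valid):
    """Fit every candidate on train and keep the one with the best validation accuracy."""
    fitted = {name: cls().fit(train) for name, cls in candidates.items()}
    scores = {name: accuracy(model, valid) for name, model in fitted.items()}
    best = max(scores, key=scores.get)
    return best, fitted[best], scores


def run_demo(seed=0):
    rng = random.Random(seed)
    # Synthetic two-class data (an assumption made only for this illustration).
    data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(100)]
    data += [([rng.gauss(3, 1), rng.gauss(3, 1)], 1) for _ in range(100)]
    learning, test = split(data, 0.8, rng)       # learning vs. testing datasets
    train, valid = split(learning, 0.75, rng)    # training vs. validation datasets
    scaler = fit_scaler(train)
    train, valid, test = (normalize(part, scaler) for part in (train, valid, test))
    candidates = {"nearest_centroid": NearestCentroid, "majority_class": MajorityClass}
    best_name, best_model, scores = auto_select(candidates, train, valid)
    # A test accuracy far below the validation accuracy would suggest over-fitting.
    return best_name, scores[best_name], accuracy(best_model, test)
```

Calling `run_demo()` returns the name of the selected model together with its validation and test accuracy. Comparing the last two values is one simple way to read the "testing ... to determine over-fitting" step.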
Abstract
A device and method designed to carry out the computation of a wide range of topological indices of molecular structure to produce molecular descriptors, representing important elements of the molecular structure information including but not limited to molecular structure variables such as: the molecular connectivity chi indices, mχt and mχtv; kappa shape indices, mκ and mκα; electrotopological state indices, Si; hydrogen electrotopological state indices, HESi; atom type and bond type electrotopological state indices; new group type and bond type electrotopological state indices; topological equivalence indices and total topological index; several information indices, including the Shannon and the Bonchev-Trinajstić information indices; counts of graph paths, atoms, atom types, bond types; and others.
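As an illustration of what a topological index is, the following Python sketch computes the zero-order and first-order simple molecular connectivity chi indices of a hydrogen-suppressed molecular graph, using the standard textbook definitions (sum over atoms of δ^(-1/2), and sum over bonds of (δi·δj)^(-1/2), where δ is the number of non-hydrogen neighbours). It is a sketch under those assumptions only: it does not cover the valence, higher-order, kappa or electrotopological state indices listed in the abstract, and the function names and the bond-list input format are hypothetical.

```python
from math import sqrt


def vertex_degrees(n_atoms, bonds):
    """Count the non-hydrogen neighbours (simple delta value) of every atom."""
    degrees = [0] * n_atoms
    for i, j in bonds:
        degrees[i] += 1
        degrees[j] += 1
    return degrees


def chi0(n_atoms, bonds):
    """Zero-order simple chi index: sum over atoms of delta ** -0.5."""
    return sum(1.0 / sqrt(d) for d in vertex_degrees(n_atoms, bonds))


def chi1(n_atoms, bonds):
    """First-order simple chi index: sum over bonds of (delta_i * delta_j) ** -0.5."""
    degrees = vertex_degrees(n_atoms, bonds)
    return sum(1.0 / sqrt(degrees[i] * degrees[j]) for i, j in bonds)


# Hydrogen-suppressed carbon skeletons, atoms numbered from 0.
n_butane = [(0, 1), (1, 2), (2, 3)]     # C-C-C-C chain
isobutane = [(0, 1), (0, 2), (0, 3)]    # central carbon bonded to three carbons

print(round(chi1(4, n_butane), 4))   # 1.9142
print(round(chi1(4, isobutane), 4))  # 1.7321
```

The two isomers have the same atom count but different first-order values, which shows how such an index encodes branching.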
20 Claims
20. A method and apparatus for predictive modeling & analysis for knowledge discovery according to claim 20, wherein Non-linear Feature Selection for Non-Support Vector algorithm can be applied to each cluster for discovering dominant features and characteristics of each cluster.
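The idea of clustering the data and then finding the dominant features of each cluster can be sketched in Python as follows. The claim does not define the "Non-linear Feature Selection for Non-Support Vector algorithm", so this sketch swaps in a much simpler technique: k-means clustering followed by ranking features by how far each cluster mean lies from the overall mean, in units of the overall standard deviation. The data is synthetic and every name is hypothetical.

```python
import random
from statistics import mean, pstdev


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [mean(col) for col in zip(*members)]
    return labels


def dominant_features(points, labels, top=1):
    """Rank features per cluster by |cluster mean - overall mean| / overall std."""
    overall = [(mean(col), pstdev(col) or 1.0) for col in zip(*points)]
    result = {}
    for cluster in sorted(set(labels)):
        members = [p for p, label in zip(points, labels) if label == cluster]
        scores = [abs(mean(col) - m) / s for col, (m, s) in zip(zip(*members), overall)]
        ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
        result[cluster] = ranked[:top]
    return result


def make_demo_points(seed=0):
    """Two synthetic groups that differ only in feature 0 (an illustrative assumption)."""
    rng = random.Random(seed)
    group_a = [[rng.gauss(5, 0.5), rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(30)]
    group_b = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(30)]
    return group_a + group_b


points = make_demo_points()
labels = kmeans(points, k=2)
top_features = dominant_features(points, labels)
print(top_features)
```

Because the two synthetic groups differ only in feature 0, that feature should be reported as dominant for both clusters. A linear score like this one would miss features that separate clusters only through non-linear interactions, which is presumably the case the claimed algorithm targets.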
Specification