Method and system for analyzing data and creating predictive models
Abstract
A method and system are provided for automatically analyzing data, cleansing and normalizing the data, identifying categorical variables within the data set, eliminating co-linearities among the variables, and automatically building a statistical model.
56 Claims
1. In a computer-based system, a method of building a statistical model, comprising:
automatically identifying and flagging categorical variables in a data set containing both categorical and continuous variables;
automatically identifying categorical variables that are correlated with one or more continuous variables and eliminating categorical variables that are correlated with at least one continuous variable from a training data matrix used to build a statistical model, wherein the training data matrix comprises a subset of the original data set; and
building the statistical model based on the training data matrix.
Dependent claims: 2-15.
16. In a computer-based system, a method of building a statistical model, comprising:
automatically identifying and flagging categorical variables in a data set containing both categorical and continuous variables, wherein this step comprises:
determining if a variable contains integer observation values;
if the variable contains integer values, determining the number of unique integer values contained in the variable;
determining if the number of unique values exceeds a predetermined threshold value; and
if the number of unique values does not exceed the threshold value, flagging the variable as a categorical variable;
automatically identifying categorical variables that are correlated with one or more continuous variables and eliminating categorical variables that are correlated with at least one continuous variable from a training data matrix used to build a statistical model, wherein the training data matrix comprises a subset of the original data set; and
building the statistical model based on the training data matrix.
Dependent claims: 17-21.
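The flagging step recited in claim 16 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the function names, the dictionary-of-columns layout, and the default threshold of 10 are hypothetical, since the claim says only that the threshold is "predetermined".

```python
from typing import Dict, List, Sequence


def is_integer_valued(values: Sequence[float]) -> bool:
    """Return True when every observation is a whole number (e.g. 3 or 3.0)."""
    return all(float(v).is_integer() for v in values)


def flag_categorical(data: Dict[str, Sequence[float]],
                     max_unique: int = 10) -> List[str]:
    """Flag variables whose values are integers with few distinct levels.

    `max_unique` stands in for the claim's "predetermined threshold value";
    the default of 10 is an assumption made for this sketch.
    """
    flagged = []
    for name, values in data.items():
        # Step 1: only integer-valued variables are candidates.
        if not is_integer_valued(values):
            continue
        # Steps 2-4: count unique values and compare against the threshold.
        if len(set(values)) <= max_unique:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = {
        "region": [1, 2, 3, 1, 2, 3],               # few integer levels
        "income": [10.5, 20.25, 31.0, 8.75, 12.5, 40.1],  # non-integer values
        "record_id": list(range(100)),               # integer, but too many levels
    }
    print(flag_categorical(sample, max_unique=5))
```

With a threshold of 5, only `region` is flagged: `income` fails the integer test, and `record_id` has more unique values than the threshold allows.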
22. In a computer-based system, a method of building a statistical model, comprising:
automatically identifying and flagging categorical variables in a data set containing both categorical and continuous variables;
binning at least one continuous variable so as to convert the continuous variable into a pseudo-categorical variable;
calculating a Cramer's V value between at least one categorical variable and the pseudo-categorical variable to obtain an estimated measure of co-linearity between the categorical variable and the continuous variable;
based on the calculated Cramer's V value, eliminating a corresponding categorical variable that is correlated with at least one continuous variable from a training data matrix used to build a statistical model, wherein the training data matrix comprises a subset of the original data set; and
building the statistical model based on the training data matrix.
Dependent claims: 23-28.
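The binning and Cramer's V steps recited in claim 22 can be sketched as follows. This is a hedged illustration rather than the patented implementation: the equal-width binning scheme, the number of bins, the elimination threshold of 0.8, and all function names are assumptions, because the claim does not fix any of them. Cramer's V is computed with the standard formula V = sqrt(chi2 / (n * min(r - 1, c - 1))), where chi2 is the Pearson chi-squared statistic of the r-by-c contingency table and n is the number of observations.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Convert a continuous variable into a pseudo-categorical one.

    Equal-width binning is an assumption; the claim does not name a scheme.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def cramers_v(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Cramer's V between two categorical sequences of equal length."""
    n = len(x)
    rows, cols, joint = Counter(x), Counter(y), Counter(zip(x, y))
    k = min(len(rows), len(cols)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


def drop_collinear_categoricals(categoricals: Dict[str, Sequence[Hashable]],
                                continuous: Dict[str, Sequence[float]],
                                n_bins: int = 4,
                                threshold: float = 0.8) -> Dict[str, Sequence[Hashable]]:
    """Keep only categorical variables whose V with every binned continuous
    variable stays below `threshold` (a hypothetical cut-off)."""
    binned = [equal_width_bins(v, n_bins) for v in continuous.values()]
    return {name: vals for name, vals in categoricals.items()
            if all(cramers_v(vals, b) < threshold for b in binned)}


if __name__ == "__main__":
    continuous = {"spend": [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0]}
    categoricals = {
        "tier": [0, 0, 0, 0, 1, 1, 1, 1],  # mirrors low/high spend
        "flag": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to spend
    }
    kept = drop_collinear_categoricals(categoricals, continuous, n_bins=2)
    print(list(kept))
```

In this toy data set, `tier` tracks the two spend bins exactly and is eliminated, while `flag` is independent of them and is retained for the training data matrix.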
29. A computer-readable medium containing code executable by a computer that when executed performs a process of automatically building a statistical model, said process comprising:
automatically identifying and flagging categorical variables in a data set containing both categorical and continuous variables;
automatically identifying categorical variables that are correlated with one or more continuous variables and eliminating categorical variables that are correlated with at least one continuous variable from a training data matrix used to build a statistical model, wherein the training data matrix comprises a subset of the original data set; and
building the statistical model based on the training data matrix.
Dependent claims: 30-43.
44. A computer-readable medium containing code executable by a computer that when executed performs a process of automatically building a statistical model, the process comprising:
automatically identifying and flagging categorical variables in a data set containing both categorical and continuous variables, wherein this step comprises:
determining if a variable contains integer observation values;
if the variable contains integer values, determining the number of unique integer values contained in the variable;
determining if the number of unique values exceeds a predetermined threshold value; and
if the number of unique values does not exceed the threshold value, flagging the variable as a categorical variable;
automatically identifying categorical variables that are correlated with one or more continuous variables and eliminating categorical variables that are correlated with at least one continuous variable from a training data matrix used to build a statistical model, wherein the training data matrix comprises a subset of the original data set; and
building the statistical model based on the training data matrix.
Dependent claims: 45-49.
50. A computer-readable medium containing code executable by a computer that when executed performs a process of automatically building a statistical model, the process comprising:
automatically identifying and flagging categorical variables in a data set containing both categorical and continuous variables;
binning at least one continuous variable so as to convert the continuous variable into a pseudo-categorical variable;
calculating a Cramer's V value between at least one categorical variable and the pseudo-categorical variable to obtain an estimated measure of co-linearity between the categorical variable and the continuous variable;
based on the calculated Cramer's V value, eliminating a corresponding categorical variable that is correlated with at least one continuous variable from a training data matrix used to build a statistical model, wherein the training data matrix comprises a subset of the original data set; and
building the statistical model based on the training data matrix.
Dependent claims: 51-56.
Specification