Method and computer program product for using data mining tools to automatically compare an investigated unit and a benchmark unit
Abstract
Sources of operational problems in business transactions often show themselves in relatively small pockets of data, which are called trouble hot spots. Identifying these hot spots from internal company transaction data is generally a fundamental step in the problem's resolution, but this analysis process is greatly complicated by huge numbers of transactions and large numbers of transaction variables to analyze. A suite of practical modifications is provided to data mining techniques and logistic regressions to tailor them for finding trouble hot spots. This approach thus allows the use of efficient automated data mining tools to quickly screen large numbers of candidate variables for their ability to characterize hot spots. One application is the screening of variables which distinguish a suspected hot spot from a reference set.
16 Claims
1. A processor implemented method of comparing an investigated entity to a reference entity, the method comprising:
augmenting via a processor a plurality of data points that correspond to variables of an investigated entity or a reference entity by creating a target variable whose value is indicative of whether the respective data point is associated with the investigated entity or the reference entity;
structuring the plurality of data points according to at least one of a plurality of preprocessing modalities, wherein the plurality of preprocessing modalities include trimming outlier data points and transforming variables to near symmetry, standardizing variables, and screening variables, wherein screening variables comprises performing decision tree analysis on the plurality of data points to identify variables having an effect on the target variable;
performing via the processor logistic regression upon the augmented data points with the target variable used as a dependent variable in performing the logistic regression;
receiving via the processor from the logistic regression a plurality of standardized values of regression coefficients for the variables;
ranking each of the variables corresponding to the plurality of augmented data points in order of one of a plurality of test statistic types;
identifying via the processor significant variables whose standardized values exceed a specified threshold and are thereby considered significant;
generating via the processor at least one interaction variable between a first significant variable and a second significant variable of the identified significant variables; and
identifying via the processor based on the at least one interaction variable second significant variable values for which interaction variable values exceed the specified threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.
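The augmenting, regression, ranking, and thresholding steps recited above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that the "standardized values of regression coefficients" can be approximated by Wald z-statistics (coefficient divided by its standard error), that a threshold of 2.0 is reasonable, and it uses synthetic data with hypothetical variable names (`delay_days`, `order_size`). The logistic regression is fitted with a plain Newton-Raphson loop so that the example needs only the Python standard library.

```python
import math
import random

def augment(investigated, reference):
    """Stack both groups and create a 0/1 target: 1 = investigated, 0 = reference."""
    rows = [list(r) for r in investigated] + [list(r) for r in reference]
    target = [1] * len(investigated) + [0] * len(reference)
    return rows, target

def standardize(rows):
    """Z-score every column so the fitted coefficients share a common scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit. Returns coefficients and standard errors, intercept first."""
    xs = [[1.0] + r for r in x]
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(iters):
        p = [1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, r)))) for r in xs]
        grad = [sum((yi - pi) * r[j] for r, yi, pi in zip(xs, y, p)) for j in range(k)]
        hess = [[sum(pi * (1 - pi) * r[i] * r[j] for r, pi in zip(xs, p))
                 for j in range(k)] for i in range(k)]
        cov = invert(hess)  # inverse information matrix at the current estimate
        beta = [b + sum(cov[i][j] * grad[j] for j in range(k))
                for i, b in enumerate(beta)]
    # cov comes from the last iteration, i.e. just before a final, negligible step
    se = [math.sqrt(cov[i][i]) for i in range(k)]
    return beta, se

def rank_variables(names, investigated, reference, threshold=2.0):
    """Rank variables by |Wald z| and flag those above the (assumed) threshold."""
    rows, target = augment(investigated, reference)
    beta, se = fit_logistic(standardize(rows), target)
    z = {n: b / s for n, b, s in zip(names, beta[1:], se[1:])}
    ranked = sorted(z.items(), key=lambda kv: abs(kv[1]), reverse=True)
    significant = [n for n, v in ranked if abs(v) > threshold]
    return ranked, significant

# Synthetic data: only the first variable differs between the two groups.
random.seed(0)
investigated = [(random.gauss(1.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(200)]
reference = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(200)]
ranked, significant = rank_variables(["delay_days", "order_size"],
                                     investigated, reference)
print(ranked)
print(significant)
```

In this sketch a positive z-value means that larger values of the variable are associated with the investigated entity. A production system would more likely call an established statistics package than hand-roll the fit.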
15. A non-transitory computer readable medium, comprising:
processor readable instructions stored in the computer readable medium, wherein the processor readable instructions are issuable by a processor to:
augment a plurality of data points that correspond to variables of an investigated entity or a reference entity by creating a target variable whose value is indicative of whether the respective data point is associated with the investigated entity or the reference entity;
structure the plurality of data points according to at least one of a plurality of preprocessing modalities, wherein the plurality of preprocessing modalities include trimming outlier data points and transforming variables to near symmetry, standardizing variables, and screening variables, wherein screening variables comprises performing decision tree analysis on the plurality of data points to identify variables having an effect on the target variable;
perform logistic regression upon the augmented data points with the target variable used as a dependent variable in performing the logistic regression;
receive from the logistic regression a plurality of standardized values of regression coefficients for the variables;
rank each of the variables corresponding to the plurality of augmented data points in order of one of a plurality of test statistic types;
identify significant variables whose standardized values exceed a specified threshold and are thereby considered significant;
generate at least one interaction variable between a first significant variable and a second significant variable of the identified significant variables; and
identify based on the at least one interaction variable second significant variable values for which interaction variable values exceed the specified threshold.
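The preprocessing modalities named in the structuring step (trimming outliers, transforming toward symmetry, standardizing, and decision tree screening) might look roughly as follows. This is a sketch under stated assumptions, not the claimed procedure: the 1%/99% trimming quantiles, the `log1p` transform, the Gini-based depth-1 tree (a single split per variable), and the `min_gain` cutoff of 0.05 are all illustrative choices, and the data and names are made up.

```python
import math

def trim_outliers(rows, col, lo=0.01, hi=0.99):
    """Drop rows whose value in column `col` lies outside the [lo, hi] quantiles."""
    vals = sorted(r[col] for r in rows)
    lo_v = vals[int(lo * (len(vals) - 1))]
    hi_v = vals[int(hi * (len(vals) - 1))]
    return [r for r in rows if lo_v <= r[col] <= hi_v]

def skewness(vals):
    """Population skewness: mean cubed z-score."""
    n = len(vals)
    m = sum(vals) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in vals) / n)
    return sum(((v - m) / sd) ** 3 for v in vals) / n

def to_near_symmetry(vals):
    """Apply log1p to a non-negative variable when that reduces |skewness|."""
    if min(vals) < 0:
        return list(vals)
    logged = [math.log1p(v) for v in vals]
    return logged if abs(skewness(logged)) < abs(skewness(vals)) else list(vals)

def standardize(vals):
    """Z-score a single variable."""
    m = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - 1))
    return [(v - m) / sd for v in vals]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_gain(vals, target):
    """Impurity reduction of the best single-threshold split (a depth-1 tree)."""
    pairs = sorted(zip(vals, target))
    n = len(pairs)
    parent = gini([t for _, t in pairs])
    best = 0.0
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, gain)
    return best

def screen(columns, target, min_gain=0.05):
    """Keep variables whose best split reduces impurity by at least min_gain."""
    return [name for name, vals in columns.items()
            if best_split_gain(vals, target) >= min_gain]

# Illustrative data only.
amounts = [1, 2, 2, 3, 3, 4, 5, 8, 20, 400]
trimmed = trim_outliers([(a,) for a in amounts], col=0)
sym = to_near_symmetry(amounts)
zs = standardize(sym)

target = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
columns = {"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
           "b": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]}
kept = screen(columns, target)
print(len(trimmed), kept)
```

Here variable `a` separates the two groups cleanly and survives screening, while `b` barely changes the impurity and is dropped. A full decision tree would consider deeper splits and several variables jointly; the single-split version is used only to keep the example short.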
16. A system, comprising:
a processor;
a memory in communication with the processor and containing program instructions;
an input and output device in communication with the processor and memory comprising a graphical interface;
wherein the processor executes program instructions contained in the memory and the program instructions comprise:
augment a plurality of data points that correspond to variables of an investigated entity or a reference entity by creating a target variable whose value is indicative of whether the respective data point is associated with the investigated entity or the reference entity;
structure the plurality of data points according to at least one of a plurality of preprocessing modalities, wherein the plurality of preprocessing modalities include trimming outlier data points and transforming variables to near symmetry, standardizing variables, and screening variables, wherein screening variables comprises performing decision tree analysis on the plurality of data points to identify variables having an effect on the target variable;
perform logistic regression upon the augmented data points with the target variable used as a dependent variable in performing the logistic regression;
receive from the logistic regression a plurality of standardized values of regression coefficients for the variables;
rank each of the variables corresponding to the plurality of augmented data points in order of one of a plurality of test statistic types;
identify significant variables whose standardized values exceed a specified threshold and are thereby considered significant;
generate at least one interaction variable between a first significant variable and a second significant variable of the identified significant variables; and
identify based on the at least one interaction variable second significant variable values for which interaction variable values exceed the specified threshold.
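The final two steps, generating an interaction variable and identifying the values of the second variable for which it exceeds the threshold, admit more than one reading. The sketch below shows one possible construction, offered as an assumption rather than as the claimed method: the second significant variable is treated as categorical, one interaction column is built per level (the first variable's value where the level matches, zero elsewhere), and each column is scored. For brevity the score is a two-sample z-statistic between the investigated and reference groups, which is a stand-in for refitting the logistic regression. The variable names, data, and threshold of 2.0 are hypothetical.

```python
import math

def interaction_columns(first, second):
    """One column per level of `second`: equals `first` where the level matches, else 0."""
    levels = sorted(set(second))
    return {lvl: [f if s == lvl else 0.0 for f, s in zip(first, second)]
            for lvl in levels}

def two_sample_z(values, target):
    """z-statistic for the difference in means between target == 1 and target == 0.
    Stand-in score; the claims obtain standardized values from logistic regression."""
    inv = [v for v, t in zip(values, target) if t == 1]
    ref = [v for v, t in zip(values, target) if t == 0]

    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    se = math.sqrt(var(inv) / len(inv) + var(ref) / len(ref))
    return (mean(inv) - mean(ref)) / se

def hot_levels(first, second, target, threshold=2.0):
    """Levels of `second` whose interaction column scores above the threshold."""
    cols = interaction_columns(first, second)
    scores = {lvl: two_sample_z(col, target) for lvl, col in cols.items()}
    return [lvl for lvl, z in scores.items() if abs(z) > threshold], scores

# Illustrative data: delays are elevated only for investigated rows in "east".
delay = [5, 6, 7, 8] * 3 + [1, 2, 1, 2] * 3 + [1, 2, 1, 2] * 3 + [1, 2, 2, 1] * 3
region = ["east"] * 12 + ["west"] * 12 + ["east"] * 12 + ["west"] * 12
target = [1] * 24 + [0] * 24  # 1 = investigated entity, 0 = reference entity

hot, scores = hot_levels(delay, region, target)
print(hot)
print(scores)
```

Under this reading, the returned levels localize the hot spot described in the abstract: they name the values of the second variable at which the first variable most sharply distinguishes the investigated entity from the reference entity.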
Specification