Method and computer program product for using data mining tools to automatically compare an investigated unit and a benchmark unit
First Claim
1. A method, comprising:
receiving a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
determining a plurality of logistic regression coefficients based on the diagnostic data set, each logistic regression coefficient corresponding to at least one diagnostic variable of the plurality of diagnostic variables, wherein determining a plurality of logistic regression coefficients further comprises:
separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein correspond to the investigated unit or to the benchmark unit respectively;
deriving first and second mean value sets from the first and second diagnostic data sets respectively;
determining a covariance matrix between the first and second diagnostic data sets;
determining the plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
selecting a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of logistic regression coefficients in reference to significance criteria, wherein the selecting a subset of most significant diagnostic variables further comprises deriving a plurality of test statistic values comprising t-values and p-values; and
generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
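The coefficient step recited above (mean difference multiplied by an inverse covariance matrix) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the "covariance matrix between the first and second diagnostic data sets" is a pooled within-group covariance, and the function names (`column_means`, `pooled_covariance`, `solve`, `approximate_coefficients`) are hypothetical. It solves a linear system rather than forming the inverse explicitly, which gives the same product.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def column_means(rows: Sequence[Sequence[float]]) -> Vector:
    """Mean of each diagnostic variable (column) over the observations (rows)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(a: Sequence[Sequence[float]],
                      b: Sequence[Sequence[float]]) -> Matrix:
    """Pooled within-group covariance (an assumed reading of the claim)."""
    mean_a, mean_b = column_means(a), column_means(b)
    p = len(mean_a)
    cov = [[0.0] * p for _ in range(p)]
    for rows, mean in ((a, mean_a), (b, mean_b)):
        for row in rows:
            d = [x - m for x, m in zip(row, mean)]
            for i in range(p):
                for j in range(p):
                    cov[i][j] += d[i] * d[j]
    dof = len(a) + len(b) - 2
    return [[v / dof for v in r] for r in cov]


def solve(matrix: Matrix, rhs: Vector) -> Vector:
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def approximate_coefficients(investigated: Sequence[Sequence[float]],
                             benchmark: Sequence[Sequence[float]]) -> Vector:
    """Coefficients = inverse(covariance) times (investigated mean - benchmark mean)."""
    diff = [x - y for x, y in zip(column_means(investigated),
                                  column_means(benchmark))]
    return solve(pooled_covariance(investigated, benchmark), diff)
```

For a single diagnostic variable with investigated values 1 and 3 and benchmark values 0 and 2, the mean difference is 1 and the pooled variance is 2, so the sketch returns a coefficient of 0.5.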
Abstract
Sources of operational problems in business transactions often show themselves in relatively small pockets of data, which are called trouble hot spots. Identifying these hot spots from internal company transaction data is generally a fundamental step in the problem's resolution, but this analysis process is greatly complicated by huge numbers of transactions and large numbers of transaction variables to analyze. A suite of practical modifications is provided to data mining techniques and logistic regressions to tailor them for finding trouble hot spots. This approach thus allows the use of efficient automated data mining tools to quickly screen large numbers of candidate variables for their ability to characterize hot spots. One application is the screening of variables which distinguish a suspected hot spot from a reference set.
17 Claims
1. A method, comprising:
receiving a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
determining a plurality of logistic regression coefficients based on the diagnostic data set, each logistic regression coefficient corresponding to at least one diagnostic variable of the plurality of diagnostic variables, wherein determining a plurality of logistic regression coefficients further comprises:
separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein correspond to the investigated unit or to the benchmark unit respectively;
deriving first and second mean value sets from the first and second diagnostic data sets respectively;
determining a covariance matrix between the first and second diagnostic data sets;
determining the plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
selecting a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of logistic regression coefficients in reference to significance criteria, wherein the selecting a subset of most significant diagnostic variables further comprises deriving a plurality of test statistic values comprising t-values and p-values; and
generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
16. A method, comprising:
receiving a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein corresponds to the investigated unit or to the benchmark unit respectively;
deriving first and second mean value sets from the first and second diagnostic data sets respectively;
determining a covariance between the first and second diagnostic data sets;
determining a plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
deriving a plurality of t-values corresponding to each of the plurality of logistic regression coefficients;
selecting a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of t-values in reference to significance criteria, wherein the selecting a subset of most significant diagnostic variables further comprises deriving a plurality of test statistic values comprising the t-values and p-values; and
generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
17. A computer program product, stored on a computer readable storage medium, comprising:
processor readable instructions stored in the computer program product, wherein the processor readable instructions are issuable by a processor to:
receive a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
determine a plurality of logistic regression coefficients based on the diagnostic data set, each logistic regression coefficient corresponding to at least one diagnostic variable of the plurality of diagnostic variables, wherein determining a plurality of logistic regression coefficients further comprises:
separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein correspond to the investigated unit or to the benchmark unit respectively;
deriving first and second mean value sets from the first and second diagnostic data sets respectively;
determining a covariance matrix between the first and second diagnostic data sets;
determining the plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
select a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of logistic regression coefficients in reference to significance criteria, wherein the selecting a subset of most significant diagnostic variables further comprises deriving a plurality of test statistic values comprising t-values and p-values; and
generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
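The selection and interaction steps common to the claims above can be illustrated with a short sketch. It rests on assumptions the claims do not state: a standard error for each coefficient is supplied by the caller, p-values are two-sided under a normal approximation, and the significance criterion is a simple threshold `alpha`. All function and variable names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def t_and_p_values(coefs: Sequence[float],
                   std_errors: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (t, p) per coefficient; p is two-sided, normal approximation."""
    results = []
    for b, se in zip(coefs, std_errors):
        t = b / se
        # Two-sided tail probability of a standard normal beyond |t|.
        p = math.erfc(abs(t) / math.sqrt(2.0))
        results.append((t, p))
    return results


def select_significant(names: Sequence[str],
                       coefs: Sequence[float],
                       std_errors: Sequence[float],
                       alpha: float = 0.05) -> List[str]:
    """Keep the variables whose p-value falls below the assumed threshold."""
    stats = t_and_p_values(coefs, std_errors)
    return [name for name, (_, p) in zip(names, stats) if p < alpha]


def interaction_variables(rows: Sequence[Dict[str, float]],
                          selected: Sequence[str]) -> List[Dict[str, float]]:
    """Add one cross-product column for each pair of selected variables."""
    out = []
    for row in rows:
        new_row = dict(row)
        for a, b in combinations(selected, 2):
            new_row[f"{a}*{b}"] = row[a] * row[b]
        out.append(new_row)
    return out


# Illustrative inputs only; the variable names and numbers are made up.
names = ["delay", "volume", "region"]
selected = select_significant(names, [2.0, 0.1, -1.5], [0.5, 0.2, 0.3])
augmented = interaction_variables(
    [{"delay": 2.0, "volume": 1.0, "region": 3.0}], selected
)
```

With these made-up inputs the t-values are 4, 0.5 and -5, so `delay` and `region` pass the 0.05 threshold and the augmented row gains a `delay*region` column equal to 6.0.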
Specification