METHOD AND COMPUTER PROGRAM PRODUCT FOR USING DATA MINING TOOLS TO AUTOMATICALLY COMPARE AN INVESTIGATED UNIT AND A BENCHMARK UNIT
First Claim
1. A method of comparing an investigated entity to a reference entity, the method comprising:
augmenting a plurality of data points that correspond to a variable or characteristic of the investigated entity or the reference entity by creating a target variable whose value is indicative of whether the respective data point is associated with the investigated entity or the reference entity;
performing logistic regression upon the augmented data points with the target variable used as a dependent variable in performing the logistic regression;
receiving from the logistic regression a plurality of standardized values of regression coefficients for the submitted variables; and
identifying variables whose standardized values exceed a specified threshold and are thereby considered significant.
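The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`augment`, `fit_logistic`, `significant_variables`), the synthetic data, and the threshold of 4.0 are all hypothetical. "Standardized value" is assumed here to mean a coefficient divided by its standard error (a Wald z-statistic). The regression is fitted with a plain Newton-Raphson loop using NumPy only.

```python
import numpy as np

def augment(investigated, reference):
    """Stack both entities and create the target variable
    (1 = investigated entity, 0 = reference entity)."""
    X = np.vstack([investigated, reference])
    target = np.concatenate([np.ones(len(investigated)),
                             np.zeros(len(reference))])
    return X, target

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.
    Returns coefficients and standard errors, intercept first."""
    design = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        weights = p * (1.0 - p)
        information = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(information, design.T @ (y - p))
    # Recompute the information matrix at the final estimate.
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    weights = p * (1.0 - p)
    information = design.T @ (design * weights[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(information)))
    return beta, std_err

def significant_variables(X, target, threshold):
    """Return indices of variables whose |coefficient / std. error|
    exceeds the threshold, together with all standardized values."""
    beta, std_err = fit_logistic(X, target)
    z = beta[1:] / std_err[1:]  # drop the intercept
    flagged = [j for j in range(X.shape[1]) if abs(z[j]) > threshold]
    return flagged, z

# Synthetic data (assumption): three variables, only variable 0 differs.
rng = np.random.default_rng(0)
investigated = rng.normal(size=(500, 3))
investigated[:, 0] += 1.0
reference = rng.normal(size=(500, 3))

X, target = augment(investigated, reference)
flagged, z = significant_variables(X, target, threshold=4.0)
print("flagged variables:", flagged)
print("standardized values:", np.round(z, 2))
```

In this synthetic example only the first variable was shifted between the two entities, so it is the one expected to exceed the threshold.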
Abstract
Sources of operational problems in business transactions often show themselves in relatively small pockets of data, which are called trouble hot spots. Identifying these hot spots from internal company transaction data is generally a fundamental step in the problem's resolution, but this analysis process is greatly complicated by huge numbers of transactions and large numbers of transaction variables to analyze. A suite of practical modifications to data mining techniques and logistic regressions is provided to tailor them for finding trouble hot spots. This approach allows the use of efficient automated data mining tools to quickly screen large numbers of candidate variables for their ability to characterize hot spots. One application is the screening of variables that distinguish a suspected hot spot from a reference set.
21 Claims
2. A method, comprising:
receiving a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
determining a plurality of logistic regression coefficients based on the diagnostic data set, each logistic regression coefficient corresponding to at least one diagnostic variable of the plurality of diagnostic variables; and
selecting a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of logistic regression coefficients in reference to significance criteria.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
20. A method, comprising:
receiving a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein corresponds to the investigated unit or to the benchmark unit respectively;
deriving first and second mean value sets from the first and second diagnostic data sets respectively;
determining a covariance between the first and second diagnostic data sets;
determining a plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
deriving a plurality of t-values corresponding to each of the plurality of logistic regression coefficients;
selecting a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of t-values in reference to significance criteria; and
generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
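The closed-form procedure recited in claim 20 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patentee's code. The covariance is taken to be the pooled within-group covariance matrix. The t-values use the approximate standard error sqrt((1/n1 + 1/n2) · [Σ⁻¹]ⱼⱼ), which treats the covariance as known. The claim specifies neither choice. The names `discriminant_screen` and `interaction_variables`, the synthetic data, and the threshold of 4.0 are hypothetical.

```python
import numpy as np
from itertools import combinations

def discriminant_screen(data, is_investigated, threshold):
    """Approximate logistic regression coefficients as
    inverse(covariance) @ (mean difference) and screen them by t-value."""
    first = data[is_investigated]     # investigated unit
    second = data[~is_investigated]   # benchmark unit
    n1, n2 = len(first), len(second)

    mean_diff = first.mean(axis=0) - second.mean(axis=0)

    # Pooled within-group covariance matrix (an assumption of this sketch).
    pooled = ((n1 - 1) * np.cov(first, rowvar=False)
              + (n2 - 1) * np.cov(second, rowvar=False)) / (n1 + n2 - 2)
    inv_cov = np.linalg.inv(pooled)

    coef = inv_cov @ mean_diff

    # Approximate standard errors, treating the covariance as known.
    std_err = np.sqrt((1.0 / n1 + 1.0 / n2) * np.diag(inv_cov))
    t_values = coef / std_err

    selected = [j for j in range(data.shape[1])
                if abs(t_values[j]) > threshold]
    return coef, t_values, selected

def interaction_variables(data, selected):
    """Build one cross-product column per pair of selected variables."""
    return {(i, j): data[:, i] * data[:, j]
            for i, j in combinations(selected, 2)}

# Synthetic data (assumption): four variables, variables 0 and 2 differ.
rng = np.random.default_rng(1)
data = rng.normal(size=(800, 4))
is_investigated = np.arange(800) < 400
data[is_investigated, 0] += 1.0
data[is_investigated, 2] -= 0.8

coef, t_values, selected = discriminant_screen(data, is_investigated, 4.0)
interactions = interaction_variables(data, selected)
print("selected variables:", selected)
print("interaction pairs:", sorted(interactions))
```

Because no iterative fitting is involved, this form needs only one pass over the data to accumulate the means and the covariance, which suits a quick screen of many candidate variables.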
21. A computer program product, comprising:
processor readable instructions stored in the computer program product, wherein the processor readable instructions are issuable by a processor to:
receive a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
determine a plurality of logistic regression coefficients based on the diagnostic data set, each logistic regression coefficient corresponding to at least one diagnostic variable of the plurality of diagnostic variables; and
select a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of logistic regression coefficients in reference to significance criteria.
Specification