Method and computer program product for using data mining tools to automatically compare an investigated unit and a benchmark unit

US 7,970,785 B2
Filed: 10/15/2008
Issued: 06/28/2011
Est. Priority Date: 12/05/2005
Status: Active Grant

Abstract

Sources of operational problems in business transactions often show themselves in relatively small pockets of data, which are called trouble hot spots. Identifying these hot spots from internal company transaction data is generally a fundamental step in the problem's resolution, but this analysis process is greatly complicated by huge numbers of transactions and large numbers of transaction variables to analyze. A suite of practical modifications are provided to data mining techniques and logistic regressions to tailor them for finding trouble hot spots. This approach thus allows the use of efficient automated data mining tools to quickly screen large numbers of candidate variables for their ability to characterize hot spots. One application is the screening of variables which distinguish a suspected hot spot from a reference set.

17 Claims

1. A method, comprising:
- receiving a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
  
  determining a plurality of logistic regression coefficients based on the diagnostic data set, each logistic regression coefficient corresponding to at least one diagnostic variable of the plurality of diagnostic variables, wherein determining a plurality of logistic regression coefficients further comprises:
  
  separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein correspond to the investigated unit or to the benchmark unit respectively;
  
  deriving first and second mean value sets from the first and second diagnostic data sets respectively;
  
  determining a covariance matrix between the first and second diagnostic data sets;
  
  determining the plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
  
  selecting a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of logistic regression coefficients in reference to significance criteria, wherein the selecting a subset of most significant diagnostic variables further comprises deriving a plurality of test statistic values comprising t-values and p-values; and
  
  generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
- Dependent claims (2 to 15):
  - 2. The method of claim 1, further comprising:
    - generating at least one interaction variable between elements of the subset of most significant diagnostic variables.
  - 3. The method of claim 2, wherein the interaction variable is generated as a cross product of elements of the subset of most significant diagnostic variables.
  - 4. The method of claim 1, further comprising:
    - extracting the plurality of diagnostic variables from a superset of diagnostic variables based on a decision tree analysis.
  - 5. The method of claim 4, wherein the decision tree analysis is only performed when the number of elements of the superset of diagnostic variables exceeds a threshold.
  - 6. The method of claim 1, wherein the selecting a subset of most significant diagnostic variables further comprises:
    - deriving a plurality of test statistic values corresponding to each of the plurality of logistic regression coefficients; and
      
      wherein the selecting a subset of most significant diagnostic variables is based on the plurality of test statistic values in reference to the significance criteria.
  - 7. The method of claim 6, wherein the significance criteria is based on an inspection of relative magnitudes of the plurality of test statistic values.
  - 8. The method of claim 6, wherein the significance criteria is based on comparison of each of the plurality of test statistic values to a threshold level.
  - 9. The method of claim 6, wherein the significance criteria is based on comparison of each t-value with corresponding values in a look-up table.
  - 10. The method of claim 6, wherein a particular diagnostic variable is selected as a most significant diagnostic variable if a corresponding p-value is less than 5%.
  - 11. The method of claim 1, further comprising:
    - identifying the investigated unit as an aberrant unit based on observed values for the most significant diagnostic variables.
  - 12. The method of claim 11, wherein identifying the investigated unit is based on a probability determined from the observed values for the most significant diagnostic variables.
  - 13. The method of claim 1, further comprising:
    - trimming outliers in the covariance matrix; and
      
      transforming the covariance matrix to near symmetry.
  - 14. The method of claim 1, wherein the determining a plurality of logistic regression coefficients further comprises:
    - creating a target variable whose value is one if a data point is associated with the investigated unit and zero otherwise; and
      
      submitting the diagnostic data set to a logistic regression component of a data mining module, with the target variable designated as a dependent variable.
  - 15. The method of claim 1, wherein the plurality of diagnostic variables include at least a repair time, a repair type, a repair location, and a time of day.
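
The coefficient step recited in claim 1 (separate the data, take group means, form a covariance matrix, multiply its inverse by the mean difference) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not say how the covariance matrix is estimated, so the pooled within-group estimate used here is one plausible reading, and all function names are hypothetical. The sketch solves the linear system instead of forming the inverse explicitly, which gives the same product.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def column_means(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean of each diagnostic variable (column) over the observations (rows)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(a: Sequence[Sequence[float]],
                      b: Sequence[Sequence[float]]) -> Matrix:
    """Pooled within-group covariance of two groups of observations.

    Assumption: this is only one reading of the claimed covariance matrix.
    """
    mean_a, mean_b = column_means(a), column_means(b)
    p = len(mean_a)
    scatter = [[0.0] * p for _ in range(p)]
    for rows, mean in ((a, mean_a), (b, mean_b)):
        for row in rows:
            dev = [x - m for x, m in zip(row, mean)]
            for i in range(p):
                for j in range(p):
                    scatter[i][j] += dev[i] * dev[j]
    dof = len(a) + len(b) - 2  # degrees of freedom for two groups
    return [[s / dof for s in line] for line in scatter]


def solve(matrix: Matrix, rhs: Sequence[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def discriminant_coefficients(investigated: Sequence[Sequence[float]],
                              benchmark: Sequence[Sequence[float]]) -> List[float]:
    """Coefficients = inverse(covariance) times (investigated mean - benchmark mean)."""
    diff = [m1 - m0 for m1, m0 in zip(column_means(investigated),
                                      column_means(benchmark))]
    return solve(pooled_covariance(investigated, benchmark), diff)
```

Each input is a list of observations, one row per transaction and one numeric column per diagnostic variable; the function returns one coefficient per variable. Categorical variables such as a repair type would need numeric encoding first, which this sketch does not cover.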

16. A method, comprising:
- receiving a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
  
  separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein corresponds to the investigated unit or to the benchmark unit respectively;
  
  deriving first and second mean value sets from the first and second diagnostic data sets respectively;
  
  determining a covariance between the first and second diagnostic data sets;
  
  determining a plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
  
  deriving a plurality of t-values corresponding to each of the plurality of logistic regression coefficients;
  
  selecting a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of t-values in reference to significance criteria, wherein the selecting a subset of most significant diagnostic variables further comprises deriving a plurality of test statistic values comprising the t-values and p-values; and
  
  generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
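
The selection and interaction steps of claim 16 can be illustrated as follows. This is a hedged sketch only: the claims do not say how standard errors are obtained, so they are taken here as given inputs, and the two-sided p-value uses a large-sample normal approximation, which is an assumption and not something the claims specify. The 5% cutoff follows claim 10; the function names and column-naming scheme are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def t_and_p_values(coefs: Sequence[float],
                   std_errors: Sequence[float]) -> List[Tuple[float, float]]:
    """Return a (t_value, p_value) pair for each coefficient."""
    stats = []
    for coef, se in zip(coefs, std_errors):
        t = coef / se
        # Two-sided p-value under a standard normal approximation (assumption).
        p = math.erfc(abs(t) / math.sqrt(2.0))
        stats.append((t, p))
    return stats


def select_significant(names: Sequence[str],
                       coefs: Sequence[float],
                       std_errors: Sequence[float],
                       alpha: float = 0.05) -> List[str]:
    """Keep the variables whose p-value falls below alpha."""
    stats = t_and_p_values(coefs, std_errors)
    return [name for name, (_, p) in zip(names, stats) if p < alpha]


def add_interactions(row: Dict[str, float],
                     selected: Sequence[str]) -> Dict[str, float]:
    """Copy a row and add one cross-product column per pair of selected variables."""
    out = dict(row)
    for a, b in combinations(selected, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out
```

With variables named after claim 15 (for example `repair_time` and `time_of_day`), `select_significant` returns the names that pass the cutoff, and `add_interactions` then extends each observation with their pairwise products so the augmented data can be refit.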

17. A computer program product, stored on a computer readable storage medium, comprising:
- processor readable instructions stored in the computer program product, wherein the processor readable instructions are issuable by a processor to:
  
  receive a diagnostic data set comprising a plurality of observational values for a plurality of diagnostic variables corresponding to an investigated unit and a benchmark unit;
  
  determine a plurality of logistic regression coefficients based on the diagnostic data set, each logistic regression coefficient corresponding to at least one diagnostic variable of the plurality of diagnostic variables, wherein determining a plurality of logistic regression coefficients further comprises:
  
  separating the diagnostic data set into first and second diagnostic data sets depending on whether the plurality of observational values contained therein correspond to the investigated unit or to the benchmark unit respectively;
  
  deriving first and second mean value sets from the first and second diagnostic data sets respectively;
  
  determining a covariance matrix between the first and second diagnostic data sets;
  
  determining the plurality of logistic regression coefficients based on a product of an inverse of the covariance matrix and a difference between the first and second mean value sets;
  
  select a subset of most significant diagnostic variables from the plurality of diagnostic variables based on the plurality of logistic regression coefficients in reference to significance criteria, wherein the selecting a subset of most significant diagnostic variables further comprises deriving a plurality of test statistic values comprising t-values and p-values; and
  
  generating at least one interaction variable as a cross product between elements of the subset of most significant diagnostic variables.
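
The coefficient step shared by claims 1, 16, and 17 coincides with a standard result from linear discriminant analysis. As a sketch, assume (the claims do not state this) that the diagnostic variables $x$ are multivariate normal within each group, with mean $\mu_1$ for the investigated unit, mean $\mu_0$ for the benchmark unit, a common covariance $\Sigma$, and prior probabilities $\pi_1$ and $\pi_0$. These symbols are introduced here for illustration only.

```latex
\begin{aligned}
\log\frac{P(y=1\mid x)}{P(y=0\mid x)}
  &= \log\frac{\pi_1}{\pi_0}
     - \tfrac{1}{2}(x-\mu_1)^{\top}\Sigma^{-1}(x-\mu_1)
     + \tfrac{1}{2}(x-\mu_0)^{\top}\Sigma^{-1}(x-\mu_0) \\
  &= \beta_0 + \beta^{\top}x, \\
\beta &= \Sigma^{-1}(\mu_1-\mu_0), \qquad
\beta_0 = \log\frac{\pi_1}{\pi_0}
          - \tfrac{1}{2}(\mu_1+\mu_0)^{\top}\Sigma^{-1}(\mu_1-\mu_0).
\end{aligned}
```

Under these assumptions the log-odds is linear in $x$, which is the logistic regression form, and its slope vector is the product of the inverse covariance matrix and the difference of the two mean vectors. The corresponding probability that an observation belongs to the investigated unit is $1/(1+e^{-(\beta_0+\beta^{\top}x)})$.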

Current Assignee: Verizon Patent and Licensing Incorporated (Verizon Communications Inc.)
Original Assignee: Verizon Services Corporation (Verizon Communications Inc.)
Inventors: Drew, James Howard
Primary Examiner(s): Abel-Jalil, Neveen
Assistant Examiner(s): Veillard, Jacques
Application Number: US12/251,750
Publication Number: US 20090112917A1
Time in Patent Office: 986 Days
Field of Search: None
US Class Current: 707/776
CPC Class Codes:
G06Q 10/0639   Performance analysis of emp...
G06Q 90/00   Systems or methods speciall...
Y10S 707/99936   Pattern matching access
