Method and computer program product for using data mining tools to automatically compare an investigated unit and a benchmark unit

US 8,306,997 B2
Filed: 05/27/2011
Issued: 11/06/2012
Est. Priority Date: 12/05/2005
Status: Expired due to Fees

Abstract

Sources of operational problems in business transactions often show themselves in relatively small pockets of data, which are called trouble hot spots. Identifying these hot spots from internal company transaction data is generally a fundamental step in the problem's resolution, but this analysis process is greatly complicated by huge numbers of transactions and large numbers of transaction variables to analyze. A suite of practical modifications is provided to data mining techniques and logistic regressions to tailor them for finding trouble hot spots. This approach thus allows the use of efficient automated data mining tools to quickly screen large numbers of candidate variables for their ability to characterize hot spots. One application is the screening of variables which distinguish a suspected hot spot from a reference set.

16 Claims

1. A processor implemented method of comparing an investigated entity to a reference entity, the method comprising:
- augmenting via a processor a plurality of data points that correspond to variables of an investigated entity or a reference entity by creating a target variable whose value is indicative of whether the respective data point is associated with the investigated entity or the reference entity;
  
  structuring the plurality of data points according to at least one of a plurality of preprocessing modalities, wherein the plurality of preprocessing modalities include trimming outlier data points and transforming variables to near symmetry, standardizing variables, and screening variables, wherein screening variables comprises performing decision tree analysis on the plurality of data points to identify variables having an effect on the target variable;
  
  performing via the processor logistic regression upon the augmented data points with the target variable used as a dependent variable in performing the logistic regression;
  
  receiving via the processor from the logistic regression a plurality of standardized values of regression coefficients for the variables;
  
  ranking each of the variables corresponding to the plurality of augmented data points in order of one of a plurality of test statistic types;
  
  identifying via the processor significant variables whose standardized values exceed a specified threshold and are thereby considered significant;
  
  generating via the processor at least one interaction variable between a first significant variable and a second significant variable of the identified significant variables; and
  
  identifying via the processor based on the at least one interaction variable second significant variable values for which interaction variable values exceed the specified threshold.
  - 2. The method of claim 1, wherein a target variable value equals 1 if the respective data point is associated with the investigated entity, or equals 0 if the respective data point is associated with the reference entity.
  - 3. The method of claim 1, wherein the specified threshold comprises a test statistic type, wherein said test statistic type is a t-value.
  - 4. The method of claim 1, wherein the specified threshold comprises a test statistic type, wherein said test statistic type is a p-value.
  - 5. The method of claim 1, wherein identifying variables whose standardized values exceed a specified threshold comprises comparing t-values associated with variables correspondent to the plurality of augmented data points to a t-value significance threshold.
  - 6. The method of claim 5, wherein the t-value significance threshold equals 1.96 or −1.96.
  - 7. The method of claim 1, wherein identifying variables whose standardized values exceed a specified threshold comprises comparing p-values associated with variables correspondent to the plurality of augmented data points to a p-value significance threshold.
  - 8. The method of claim 7, wherein the p-value significance threshold is 5%.
  - 9. The method of claim 1, wherein said at least one interaction variable is generated from the cross product of a first significant variable and a second significant variable.
  - 10. The method of claim 1, wherein said at least one interaction variable provides information about the relationship of the two significant standardized variables.
  - 11. The method of claim 1, wherein generating the at least one interaction variable between the first significant variable and the second significant variable is equivalent to nesting the effects of the first significant variable in the second significant variable.
  - 12. The method of claim 1, further comprising:
    - generating at least one second interaction variable between the first significant variable and a third significant variable of the identified significant variables; and
      
      identifying based on the at least one second interaction variable third significant variable values for which second interaction variable values exceed the specified threshold.
  - 13. The method of claim 12, wherein generating the at least one second interaction variable between the first significant variable and the third significant variable is equivalent to nesting the effects of the first significant variable in the third significant variable.
  - 14. The method of claim 1, further comprising:
    - determining from the identified significant variables a significant variable with the largest single standardized value difference between the investigated entity and the reference entity; and
      
      assigning the significant variable with largest single standardized value difference between the investigated entity and the reference entity as the first significant variable.
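
The augmenting, regression, ranking, and thresholding steps of claims 1 through 8 can be illustrated with a minimal sketch. It rests on several assumptions that are not in the claims: the "standardized values of regression coefficients" are read here as each coefficient divided by its standard error (the t-value of claim 5), the regression is fitted with a plain Newton iteration, and the variable names `handle_time` and `shift` and all data values are hypothetical.

```python
import math

def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix given as a list of rows."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def standardize(values):
    """Rescale a variable to mean 0 and sample standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def logistic_z_values(columns, target, iterations=25):
    """Fit a logistic regression by Newton's method.

    Returns {variable name: coefficient / standard error}. The covariance
    matrix comes from the last Newton step, which is adequate once converged.
    """
    names = list(columns)
    rows = [[1.0] + [columns[name][i] for name in names] for i in range(len(target))]
    k = len(names) + 1
    beta = [0.0] * k
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row)))) for row in rows]
        grad = [sum(row[j] * (y - p) for row, y, p in zip(rows, target, probs)) for j in range(k)]
        hess = [[sum(row[r] * row[c] * p * (1.0 - p) for row, p in zip(rows, probs))
                 for c in range(k)] for r in range(k)]
        cov = invert(hess)
        beta = [b + sum(cov[j][c] * grad[c] for c in range(k)) for j, b in enumerate(beta)]
    return {name: beta[j + 1] / math.sqrt(cov[j + 1][j + 1]) for j, name in enumerate(names)}

# Hypothetical data: handle_time differs between the two entities, shift does not.
investigated = {
    "handle_time": [i % 10 + 4 for i in range(40)],
    "shift": [1 if i < 20 else -1 for i in range(40)],
}
reference = {
    "handle_time": [i % 10 + 1 for i in range(40)],
    "shift": [1 if i < 20 else -1 for i in range(40)],
}

# Augment: target is 1 for the investigated entity and 0 for the reference entity (claim 2).
target = [1] * 40 + [0] * 40
columns = {name: standardize(investigated[name] + reference[name]) for name in investigated}

z_values = logistic_z_values(columns, target)
ranked = sorted(z_values, key=lambda name: abs(z_values[name]), reverse=True)
significant = [name for name in ranked if abs(z_values[name]) > 1.96]  # claim 6 threshold
print(ranked, significant)
```

With this data only `handle_time` clears the 1.96 threshold, because `shift` is distributed identically in both entities. A production analysis would more likely rely on an established statistics package than on a hand-written solver.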

15. A non-transitory computer readable medium, comprising:
- processor readable instructions stored in the computer readable medium, wherein the processor readable instructions are issuable by a processor to:
  
  augment a plurality of data points that correspond to variables of an investigated entity or a reference entity by creating a target variable whose value is indicative of whether the respective data point is associated with the investigated entity or the reference entity;
  
  structure the plurality of data points according to at least one of a plurality of preprocessing modalities, wherein the plurality of preprocessing modalities include trimming outlier data points and transforming variables to near symmetry, standardizing variables, and screening variables, wherein screening variables comprises performing decision tree analysis on the plurality of data points to identify variables having an effect on the target variable;
  
  perform logistic regression upon the augmented data points with the target variable used as a dependent variable in performing the logistic regression;
  
  receive from the logistic regression a plurality of standardized values of regression coefficients for the variables;
  
  rank each of the variables corresponding to the plurality of augmented data points in order of one of a plurality of test statistic types;
  
  identify significant variables whose standardized values exceed a specified threshold and are thereby considered significant;
  
  generate at least one interaction variable between a first significant variable and a second significant variable of the identified significant variables; and
  
  identify based on the at least one interaction variable second significant variable values for which interaction variable values exceed the specified threshold.
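
Three of the preprocessing modalities named in the structuring step (trimming outliers, transforming to near symmetry, and standardizing) can be sketched as follows. The claims do not say how each is carried out, so the choices here are assumptions made for illustration: trimming by order statistics with 5% and 95% cut-offs, a `log1p` transform for a non-negative right-skewed variable, and the sample data. Decision-tree screening is not shown.

```python
import math

def trim_outliers(values, lower=0.05, upper=0.95):
    """Keep values lying between the chosen lower and upper order statistics."""
    ordered = sorted(values)
    low = ordered[int(lower * (len(ordered) - 1))]
    high = ordered[int(upper * (len(ordered) - 1))]
    return [v for v in values if low <= v <= high]

def skewness(values):
    """Moment-based skewness; 0 indicates a symmetric distribution."""
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

def to_near_symmetry(values):
    """Log-transform a non-negative, right-skewed variable (assumed transform)."""
    return [math.log1p(v) for v in values]

def standardize(values):
    """Rescale a variable to mean 0 and sample standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical right-skewed variable with one extreme value.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256, 100000]
trimmed = trim_outliers(raw)
symmetric = to_near_symmetry(trimmed)
standardized = standardize(symmetric)
print(len(trimmed), round(skewness(trimmed), 2), round(skewness(symmetric), 2))
```

The skewness drops sharply after the transform, which is the sense in which the variable is brought to near symmetry before standardizing.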

16. A system, comprising:
- a processor;
  
  a memory in communication with the processor and containing program instructions;
  
  an input and output device in communication with the processor and memory comprising a graphical interface;
  
  wherein the processor executes program instructions contained in the memory and the program instructions comprise:
  
  augment a plurality of data points that correspond to variables of an investigated entity or a reference entity by creating a target variable whose value is indicative of whether the respective data point is associated with the investigated entity or the reference entity;
  
  structure the plurality of data points according to at least one of a plurality of preprocessing modalities, wherein the plurality of preprocessing modalities include trimming outlier data points and transforming variables to near symmetry, standardizing variables, and screening variables, wherein screening variables comprises performing decision tree analysis on the plurality of data points to identify variables having an effect on the target variable;
  
  perform logistic regression upon the augmented data points with the target variable used as a dependent variable in performing the logistic regression;
  
  receive from the logistic regression a plurality of standardized values of regression coefficients for the variables;
  
  rank each of the variables corresponding to the plurality of augmented data points in order of one of a plurality of test statistic types;
  
  identify significant variables whose standardized values exceed a specified threshold and are thereby considered significant;
  
  generate at least one interaction variable between a first significant variable and a second significant variable of the identified significant variables; and
  
  identify based on the at least one interaction variable second significant variable values for which interaction variable values exceed the specified threshold.
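
The two interaction steps that close each independent claim can be sketched under one possible reading: the interaction variable is the cross product of the first significant variable with an indicator for each value of the second (claim 9), which amounts to nesting the first variable's effect inside the second (claim 11). The claims test the interaction through the regression itself; this sketch swaps in a Welch two-sample t-statistic per value to stay short. The names `wait_time` and `region` and all data are hypothetical.

```python
import math

def cross_product(first, second):
    """Interaction variable as the element-wise product of two variables."""
    return [a * b for a, b in zip(first, second)]

def welch_t(xs, ys):
    """Welch two-sample t-statistic; returns 0.0 when both samples are constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (len(xs) - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)
    spread = math.sqrt(var_x / len(xs) + var_y / len(ys))
    return (mean_x - mean_y) / spread if spread else 0.0

def flag_levels(target, first, second, threshold=1.96):
    """Return the values of `second` within which `first` separates the entities."""
    flagged = {}
    for level in sorted(set(second)):
        indicator = [1.0 if s == level else 0.0 for s in second]
        interaction = cross_product(first, indicator)
        # Only rows at this level carry information about the nested effect.
        inside = [(t, v) for t, v, d in zip(target, interaction, indicator) if d]
        investigated = [v for t, v in inside if t == 1]
        reference = [v for t, v in inside if t == 0]
        stat = welch_t(investigated, reference)
        if abs(stat) > threshold:
            flagged[level] = stat
    return flagged

# Hypothetical data: the investigated entity is slower only in the east region.
target = [1] * 10 + [0] * 10
wait_time = [8, 9, 10, 11, 12, 5, 6, 7, 8, 9] + [4, 5, 6, 7, 8, 5, 6, 7, 8, 9]
region = (["east"] * 5 + ["west"] * 5) * 2

flagged = flag_levels(target, wait_time, region)
print(flagged)  # {'east': 4.0}
```

Here `east` is the value of the second variable for which the interaction exceeds the threshold, which is the kind of localized "hot spot" the abstract describes.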

Current Assignee: Verizon Patent and Licensing Incorporated (Verizon Communications Inc.)
Original Assignee: Verizon Services Corporation (Verizon Communications Inc.)
Inventors: Drew, James Howard
Primary Examiner(s): Jalil, Neveen Abel
Assistant Examiner(s): Veillard, Jacques
Application Number: US13/117,229
Publication Number: US 20110231444A1
Time in Patent Office: 529 Days
Field of Search: None
US Class Current: 707/776
CPC Class Codes:
G06Q 10/0639   Performance analysis of emp...
G06Q 90/00   Systems or methods speciall...
Y10S 707/99936   Pattern matching access
