Method, system, and computer program product for outlier detection
Abstract
A random sample of a data population is taken, and the sampled data is used to build a predictive model using a cubic or multiquadric radial basis function. “Scores” (i.e., predictions) are then generated for each data point in the entire data population. This process is repeated on additional random sample subsets of the same data population. After a predetermined number of random sample subsets have been modeled and scores for all data points in the population have been generated for each of the models, the average score and the variance of the scores are calculated for each data point. The data points are then rank-ordered by their variance, allowing those data points having a high variance to be identified as outliers.
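The procedure summarized in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: it assumes the data set is split into input columns `X` and a single predicted column `y`, it uses a cubic radial basis function with a linear polynomial term (the multiquadric variant is not shown), and it takes the population variance of the predictions across models. The function names, the 50% sampling fraction, and the default of 50 iterations are all hypothetical choices made for the example.

```python
import numpy as np


def fit_cubic_rbf(X, y):
    """Fit a cubic RBF interpolant, phi(r) = r**3, plus a linear polynomial term."""
    n, d = X.shape
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    P = np.hstack([np.ones((n, 1)), X])  # constant and linear terms
    A = np.block([[r ** 3, P], [P.T, np.zeros((d + 1, d + 1))]])
    rhs = np.concatenate([y, np.zeros(d + 1)])
    # Least squares instead of a direct solve, to tolerate degenerate samples.
    coef, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return X, coef[:n], coef[n:]


def predict_cubic_rbf(model, X):
    """Score every row of X with a model returned by fit_cubic_rbf."""
    centers, w, c = model
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (r ** 3) @ w + c[0] + X @ c[1:]


def score_variances(X, y, n_models=50, sample_frac=0.5, seed=0):
    """Build n_models models on random subsets; return per-point mean and variance."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = min(n, max(d + 2, int(sample_frac * n)))  # subset size (assumed rule)
    preds = np.empty((n_models, n))
    for m in range(n_models):
        idx = rng.choice(n, size=k, replace=False)  # identify a random subset
        model = fit_cubic_rbf(X[idx], y[idx])       # build a model on it
        preds[m] = predict_cubic_rbf(model, X)      # score the whole population
    return preds.mean(axis=0), preds.var(axis=0)


def detect_outliers(X, y, threshold, **kwargs):
    """Rank points by prediction variance and flag those above the threshold."""
    mean, var = score_variances(X, y, **kwargs)
    ranking = np.argsort(var)[::-1]  # highest variance first
    outliers = ranking[var[ranking] > threshold]
    return ranking, outliers, mean, var


if __name__ == "__main__":
    # Synthetic example: a smooth curve with one corrupted observation.
    x = np.linspace(0.0, 4.0, 40)
    y = np.sin(x)
    y[17] += 5.0
    ranking, outliers, mean, var = detect_outliers(x[:, None], y, threshold=1.0)
    print("highest-variance indices:", ranking[:5])
    print("flagged as outliers:", outliers)
```

Because an interpolating model passes through every sampled value, a corrupted point is predicted very differently depending on whether it was drawn into the subset, which is what drives its variance up. Points adjacent to it can also show elevated variance in this sketch, so the threshold has to be chosen with that in mind.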
15 Claims
1. A method of detecting outliers within a multidimensional data set, comprising the steps of:
identifying a subset of said multidimensional data set;
building a predictive model based on said identified subset;
generating predicted values for each data point in said multidimensional data set based on said predictive model;
repeating said identifying, building, and generating steps a predetermined number of iterations to generate a set of predicted values for each data point in said multidimensional data set based on each built predictive model;
calculating an average predicted value for each data point in said multidimensional data set based on the generated predicted values;
calculating the variance for each data point in said multidimensional data set;
ranking the data points in said multidimensional data set based on the calculated variances; and
identifying as outliers any data points in said multidimensional data set having variances that exceed a predetermined outlier threshold.
Dependent claims: 2, 3, 4, 5
6. A system of detecting outliers within a multidimensional data set, comprising:
means for identifying a subset of said multidimensional data set;
means for building a predictive model based on said identified subset;
means for generating predicted values for each data point in said multidimensional data set based on said predictive model;
means for repeating said identifying, building, and generating steps a predetermined number of iterations to generate a set of predicted values for each data point in said multidimensional data set based on each built predictive model;
means for calculating an average predicted value for each data point in said multidimensional data set based on the generated predicted values;
means for calculating the variance for each data point in said multidimensional data set;
means for ranking the data points in said multidimensional data set based on the calculated variances; and
means for identifying as outliers any data points in said multidimensional data set having variances that exceed a predetermined outlier threshold.
Dependent claims: 7, 8, 9, 10
11. A computer program product recorded on computer readable medium for detecting outliers within a multidimensional data set, comprising:
computer readable means for identifying a subset of said multidimensional data set;
computer readable means for building a predictive model based on said identified subset;
computer readable means for generating predicted values for each data point in said multidimensional data set based on said predictive model;
computer readable means for repeating said identifying, building, and generating steps a predetermined number of iterations to generate a set of predicted values for each data point in said multidimensional data set based on each built predictive model;
computer readable means for calculating an average predicted value for each data point in said multidimensional data set based on the generated predicted values;
computer readable means for calculating the variance for each data point in said multidimensional data set;
computer readable means for ranking the data points in said multidimensional data set based on the calculated variances; and
computer readable means for identifying as outliers any data points in said multidimensional data set having variances that exceed a predetermined outlier threshold.
Dependent claims: 12, 13, 14, 15
Specification