METHOD FOR IDENTIFYING OUTLIERS IN LARGE DATA SETS
First Claim
1. A method for identifying a predetermined number of data points of interest in a data set, comprising the steps of:
partitioning a plurality of data points in a data set into a plurality of partitions;
computing lower and upper bounds for each one of the plurality of partitions;
identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions may include at least one of a predetermined number of data points of interest, wherein the predetermined number of data points of interest are included within the plurality of data points in the data set; and
identifying the predetermined number of data points of interest from the plurality of candidate partitions.
Abstract
A new method for identifying a predetermined number of data points of interest in a large data set. The data points of interest are ranked in relation to the distance to their neighboring points. The method employs partition-based detection algorithms to partition the data points and then compute upper and lower bounds for each partition. These bounds are then used to eliminate those partitions that cannot contain any of the predetermined number of data points of interest. The data points of interest are then computed from the remaining partitions that were not eliminated. The present method eliminates a significant number of data points from consideration as the points of interest, thereby resulting in substantial savings in computational expense compared to conventional methods employed to identify such points.
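For comparison, a brute-force baseline of the kind the abstract contrasts against might look like the sketch below. It is illustrative only and rests on assumptions not stated above: a point's score is taken to be the Euclidean distance to its k-th nearest neighbor (the abstract says only that points are ranked by distance to neighboring points), and the function names and the parameters n and k are hypothetical.

```python
import heapq
import math

def kth_neighbor_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (assumed score)."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_outliers_bruteforce(points, n, k):
    """Indices of the n points with the largest score, highest first.

    Every point is compared against every other point, so the cost grows
    quadratically with the size of the data set.
    """
    scored = [(kth_neighbor_distance(points, i, k), i)
              for i in range(len(points))]
    return [i for _, i in heapq.nlargest(n, scored)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers_bruteforce(pts, n=1, k=1))  # → [4]
```

The quadratic cost of scoring every point is what a pruning step based on per-partition bounds is meant to avoid.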
19 Claims
1. A method for identifying a predetermined number of data points of interest in a data set, comprising the steps of:
partitioning a plurality of data points in a data set into a plurality of partitions;
computing lower and upper bounds for each one of the plurality of partitions;
identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions may include at least one of a predetermined number of data points of interest, wherein the predetermined number of data points of interest are included within the plurality of data points in the data set; and
identifying the predetermined number of data points of interest from the plurality of candidate partitions.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19
17. A method for computing the top n outliers in a data set, comprising the steps of:
partitioning a plurality of data points in a data set into a plurality of partitions;
computing lower and upper bounds for each one of the plurality of partitions;
identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions may include at least one of n number of outliers of interest, wherein the outliers are included within the plurality of data points in the data set; and
identifying the outliers from the plurality of candidate partitions.
Dependent claims: 18
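The four claimed steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the partitioning is a fixed-size grid (the claims do not say how partitions are formed), each partition is summarized by a bounding box, the score is the Euclidean distance to the k-th nearest neighbor, and the final step rescans the whole data set for each candidate for simplicity. All function and parameter names (cell, k, bounds, candidate_points) are hypothetical.

```python
import heapq
import math
from collections import defaultdict

def norm(deltas):
    """Euclidean length of a vector of per-dimension differences."""
    return math.sqrt(sum(d * d for d in deltas))

def kth_neighbor_distance(points, i, k):
    """Exact score: distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(norm([a - b for a, b in zip(points[i], q)])
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def partition_points(points, cell):
    """Step 1 (assumed scheme): bucket points by grid cell, keep bounding boxes."""
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(math.floor(c / cell) for c in p)].append(i)
    dims = range(len(points[0]))
    return [{"idx": idx,
             "lo": [min(points[i][d] for i in idx) for d in dims],
             "hi": [max(points[i][d] for i in idx) for d in dims]}
            for idx in cells.values()]

def min_dist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    return norm([max(0.0, a["lo"][d] - b["hi"][d], b["lo"][d] - a["hi"][d])
                 for d in range(len(a["lo"]))])

def max_dist(a, b):
    """Largest possible distance between a point in box a and a point in box b."""
    return norm([max(a["hi"][d] - b["lo"][d], b["hi"][d] - a["lo"][d])
                 for d in range(len(a["lo"]))])

def bounds(part, parts, k):
    """Step 2: lower and upper bounds on the score of every point in part."""
    def kth(dist_fn):
        total = 0
        # Walk partitions from nearest to farthest until k neighbors are
        # covered; the point itself is excluded from its own partition count.
        for d, count in sorted((dist_fn(part, q), len(q["idx"]) - (q is part))
                               for q in parts):
            total += count
            if total >= k:
                return d
        raise ValueError("data set must contain more than k points")
    return kth(min_dist), kth(max_dist)

def candidate_points(points, n, k, cell=1.0):
    """Step 3: keep only points in partitions that may hold a top-n outlier."""
    parts = partition_points(points, cell)
    for part in parts:
        part["lower"], part["upper"] = bounds(part, parts, k)
    # At least n points score >= threshold, so a partition whose upper
    # bound falls below the threshold cannot contain a top-n outlier.
    seen, threshold = 0, 0.0
    for part in sorted(parts, key=lambda q: q["lower"], reverse=True):
        seen += len(part["idx"])
        if seen >= n:
            threshold = part["lower"]
            break
    return [i for q in parts if q["upper"] >= threshold for i in q["idx"]]

def top_n_outliers(points, n, k, cell=1.0):
    """Step 4: exact scores for candidates only, then take the n largest."""
    scored = [(kth_neighbor_distance(points, i, k), i)
              for i in candidate_points(points, n, k, cell)]
    return [i for _, i in heapq.nlargest(n, scored)]
```

On data with one dense cluster and a few isolated points, the dense partitions have small upper bounds and are discarded in step 3, so exact scores are computed for only a handful of points. The grid cell size is a tuning choice in this sketch; a coarser grid gives fewer partitions but looser bounds.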
Specification