Method for identifying outliers in large data sets
First Claim
1. A computer implemented method embedded in a recordable media for identifying a predetermined number of outliers of interest in a data set, comprising the steps of:
- partitioning a plurality of data points in the data set into a plurality of partitions;
computing lower and upper bounds for each one of the plurality of partitions to identify those partitions that cannot possibly contain the predetermined number of outliers of interest;
identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions possibly contains at least one of a predetermined number of outliers of interest, wherein the predetermined number of outliers of interest are included within the plurality of data points in the data set; and
identifying the predetermined number of outliers of interest from the plurality of candidate partitions wherein the data set has N data points and the predetermined number of outliers of interest is n which each have k neighboring data points, a data point p is one of the predetermined number of outliers of interest n if no more than n−
1 other points in the data set reside at greater distances from the k neighboring data point than data point p.
1 Assignment
0 Petitions
Accused Products
Abstract
A new method for identifying a predetermined number of data points of interest in a large data set. The data points of interest are ranked in relation to the distance to their neighboring points. The method employs partition-based detection algorithms to partition the data points and then compute upper and lower bounds for each partition. These bounds are then used to eliminate those partitions that do contain the predetermined number of data points of interest. The data points of interest are then computed from the remaining partitions that were not eliminated. The present method eliminates a significant number of data points from consideration as the points of interest, thereby resulting in substantial savings in computational expense compared to conventional methods employed to identify such points.
-
Citations
20 Claims
-
1. A computer implemented method embedded in a recordable media for identifying a predetermined number of outliers of interest in a data set, comprising the steps of:
-
partitioning a plurality of data points in the data set into a plurality of partitions;
computing lower and upper bounds for each one of the plurality of partitions to identify those partitions that cannot possibly contain the predetermined number of outliers of interest;
identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions possibly contains at least one of a predetermined number of outliers of interest, wherein the predetermined number of outliers of interest are included within the plurality of data points in the data set; and
identifying the predetermined number of outliers of interest from the plurality of candidate partitions wherein the data set has N data points and the predetermined number of outliers of interest is n which each have k neighboring data points, a data point p is one of the predetermined number of outliers of interest n if no more than n−
1 other points in the data set reside at greater distances from the k neighboring data point than data point p.- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
for each one of the plurality of partitions, calculating a distance of at least one neighboring data point from the plurality of data points in the partition, the lower bound being the smallest distance from the at least one neighboring data point to a first one of the plurality of data points in the partition and the upper bound being the largest distance from the at least one neighboring data point to a second one of the plurality of data points in the partition.
-
-
4. The method according to claim 3, wherein:
-
for the predetermined number of outliers of interest, a number of partitions having the largest lower bound values are selected such that the number of data points residing in such partitions is at least equal to the predetermined number of outliers of interest;
the identifying of the plurality of candidate partitions includes identifying which of the candidate partitions are comprised of those partitions having upper bound values that are greater than or equal to the smallest lower bound value of the number of partitions; and
the non-candidate partitions are comprised of those partitions having upper bound values that are less than the smallest lower bound value of the number of partitions, the non-candidate partitions being eliminated from consideration because they do not contain the at least one of the predetermined number of outliers of interest.
-
-
5. The method according to claim 1, wherein the candidate partitions are smaller than a main memory, and further comprising the step of:
-
storing all of the data points in the candidate partitions in a main memory spatial index; and
identifying the predetermined number of outliers of interest using an index-based algorithm which probes the main memory spatial index.
-
-
6. The method according to claim 1, wherein the candidate partitions are larger than a main memory, then the partitions are processed in batches such that the overlap between each one of the partitions in a batch is as large as possible so that as many points as possible are processed in each batch.
-
7. The method according to claim 6, wherein each one of the batches is comprised of a subset of the plurality of candidate partitions, the subset being smaller than the main memory.
-
8. The method according to claim 6, wherein the predetermined number of outliers of interest are selected from the batch processed candidate partitions, the predetermined number of outliers of interest being those data points residing the farthest from their at least one neighboring data point.
-
9. The method according to claim 8, wherein the step of identifying the predetermined number of outliers of interest uses an index-based algorithm.
-
10. The method according to claim 8, wherein the step of identifying the predetermined number of outliers of interest uses a block nested-loop algorithm.
-
11. The method according to claim 1, wherein:
-
the partitioning step includes calculating a minimum bounding rectangle MBR for each one of the plurality of data points in the data set; and
the computing step of the lower and upper bounds being computed for each one of the data points in the MBR.
-
-
12. The method according to claim 11, wherein the minimum distance between a point p and an MBR R is denoted by MINDIST (p, R) defined as MINDIST (p, R)=Σ
-
δ
i=Ix2i, whereinand wherein every point in MBR R is at a distance of at least MINDIST (p, R) from point p;
the point p in δ
-dimensional space is denoted by [p1, p2, . . . , pδ
]; and
the MBR R is a δ
-dimensional rectangle R denoted by the two endpoints of its major diagonal;
r=[r1, r2, . . . , rδ
] and r′
=[r′
1, r′
2, . . . , r′
δ
] such that ri≦
r′
i for 1≦
i≦
n.
-
δ
-
13. The method according to claim 11, wherein the maximum distance between a point p and an MBR R is denoted by MAXDIST (p, R) defined as MAXDIST (p, R)=Σ
-
δ
i=1x2i, whereinand wherein no point in MBR R is at a distance that exceeds MAXDIST (p, R) from the point p;
the point p in δ
-dimensional space is denoted by [p1, p2, . . . , pδ
]; and
the MBR R is a δ
-dimensional rectangle R denoted by the two endpoints of its major diagonal;
r=[r1, r2, . . . , rδ
] and r′
=[r′
1, r′
2, . . . , r′
δ
] such that ri≦
r′
i for 1≦
i≦
n.
-
δ
-
14. The method according to claim 11, wherein the minimum distance between two MBRs R and S is denoted by MINDIST(R, S) defined as MINDIST (R, S)=Σ
-
δ
i=1x2i, whereinand wherein every point in MBR R is at a distance of at least MINDIST(R, S) from any point in MBR S;
a point p in δ
-dimensional space is denoted by [p1, p2, . . . , pδ
];
the MBR R is a δ
-dimensional rectangle R denoted by the two endpoints of its major diagonal;
r=[r1, r2, . . . , rδ
] and r′
=[r′
1, r′
2, . . . , r′
δ
] such that ri≦
r′
i for 1≦
i≦
n; and
the MBR S is a δ
-dimensional rectangle S denoted by the two endpoints of its major diagonal;
s=[s1, s2, . . . , sδ
] and s′
=[s′
1, s′
2, . . . , s′
δ
].
-
δ
-
15. The method according to claim 14, wherein the maximum distance between two MBRs R and S is denoted by MAXDIST(R, S) defined as MAXDIST (R, S)=Σ
-
δ
i−
1x2i, where xi=max {|s′
i−
ri|, |r′
i−
si|}.
-
δ
-
16. A computer implemented method embedded in a recordable media for determining the top n outliers in a data set, comprising the steps of:
-
partitioning a plurality of data points in the data set into a plurality of partitions;
computing lower and upper bounds for each one of the plurality of partitions to identify those partitions that cannot possibly contain a predetermined number of outliers of interest;
identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions possibly includes at least one of n number of outliers of interest, wherein the outliers are included within the plurality of data points in the data set; and
identifying the outliers from the plurality of candidate partitions wherein given the data set has N data points and the n number of outliers of interest which each have k neighbors, a data point p is one of the n outliers of interest if no more than n−
1 other points in the data set have a higher value for Dl(p) than data point p wherein Dk(p) represents the distance of the point p to its kth nearest neighbor.- View Dependent Claims (17, 18, 19, 20)
the partitioning step includes calculating a minimum bounding rectangle MBR for each one of the plurality of data points in the data set; and
the computing step of the lower and upper bounds being computed for each one of the data points in the MBR.
-
-
19. The method according to claim 18, wherein:
-
the minimum distance between a point p and an MBR R is denoted by MINDIST (p, R) defined as MINDIST (p, R)=Σ
δ
i−
1x2i, wherein
and wherein every point in MBR R is at a distance of at least MINDIST (p, R) from point p;
the point p in δ
-dimensional space is denoted by [p1, p2, . . . , pδ
]; and
the MBR R is a δ
-dimensional rectangle R denoted by the two endpoints of its major diagonal;
r=[r1, r2, . . . , rδ
] and r′
=[r′
1, r′
2, . . . , r′
7] such that ri≦
r′
i for 1≦
i≦
n; and
the maximum distance between the point p and the MBR R is denoted by MAXDIST (p, R) defined as MAXDIST (p, R)=Σ
δ
i−
1x2i, wherein
and wherein no point in MBR R is at a distance that exceeds MAXDIST (p, R) from the point p.
-
-
20. The method according to claim 19, wherein:
-
the minimum distance between two MBRs R and S is denoted by MINDIST(R, S) defined as MINDIST (R, S)=Σ
δ
i−
1x2i, wherein
and wherein every point in MBR R is at a distance of at least MINDIST(R, S) from any point in MBR S;
a point p in δ
-dimensional space is denoted by [p1, p2, . . . , pδ
];
the MBR R is a δ
-dimensional rectangle R denoted by the two endpoints of its major diagonal;
r=[r1, r2, . . . , rδ
] and r′
−
[r′
1, r′
2, . . . , r′
δ
] such that ri≦
r′
i for 1≦
i≦
n; and
the MBR S is a δ
-dimensional rectangle S denoted by the two endpoints of its major diagonal;
s=[s1, s2, . . . , sδ
] and s′
=[s′
1, s′
2, . . . , s′
δ
]; and
the maximum distance between two MBRs R and S is denoted by MAXDIST(R, S) defined as MAXDIST (R, S)=Σ
δ
i−
1x2i, where xi=max {|s′
i−
ri|, |r′
1−
s1|}.
-
Specification