METHOD FOR IDENTIFYING OUTLIERS IN LARGE DATA SETS

US 20030061249A1
Filed: 11/18/1999
Published: 03/27/2003
Est. Priority Date: 11/18/1999
Status: Active Grant

Abstract

A new method for identifying a predetermined number of data points of interest in a large data set. The data points of interest are ranked by the distance to their neighboring points. The method employs partition-based detection algorithms to partition the data points and then computes upper and lower bounds for each partition. These bounds are used to eliminate those partitions that cannot contain any of the predetermined number of data points of interest. The data points of interest are then computed from the remaining partitions. Because the method eliminates a significant number of data points from consideration, it yields substantial savings in computational expense compared to conventional methods for identifying such points.
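The pruning step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each partition already carries a point count and lower and upper bounds on the neighbor distance of its points, and it follows the selection rule stated in claim 5. The names `Partition` and `candidate_partitions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str     # hypothetical label, for illustration only
    size: int     # number of data points in the partition
    lower: float  # lower bound on the neighbor distance of any point inside
    upper: float  # upper bound on the neighbor distance of any point inside


def candidate_partitions(partitions, n):
    """Return (threshold, candidates) for finding n points of interest.

    Partitions are taken in decreasing order of lower bound until they
    hold at least n points. The smallest lower bound among those is the
    threshold: at least n points are known to lie at or above it, so any
    partition whose upper bound is below it can be eliminated.
    """
    ranked = sorted(partitions, key=lambda q: q.lower, reverse=True)
    covered = 0
    threshold = 0.0
    for q in ranked:
        covered += q.size
        threshold = q.lower
        if covered >= n:
            break
    candidates = [q for q in partitions if q.upper >= threshold]
    return threshold, candidates


# Assumed toy bounds, not taken from the patent.
parts = [
    Partition("A", size=3, lower=9.0, upper=12.0),
    Partition("B", size=4, lower=5.0, upper=7.0),
    Partition("C", size=50, lower=1.0, upper=4.0),
    Partition("D", size=40, lower=0.5, upper=6.0),
]
threshold, kept = candidate_partitions(parts, n=5)
print(threshold, [q.name for q in kept])
```

In this toy input, partition C holds half of the points yet is discarded without examining any of them, which is the source of the savings the abstract claims.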

19 Claims

1. A method for identifying a predetermined number of data points of interest in a data set, comprising the steps of:
- partitioning a plurality of data points in a data set into a plurality of partitions;
  
  computing lower and upper bounds for each one of the plurality of partitions;
  
  identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions may include at least one of a predetermined number of data points of interest, wherein the predetermined number of data points of interest are included within the plurality of data points in the data set; and
  
  identifying the predetermined number of data points of interest from the plurality of candidate partitions.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19)
  - 2. The method according to claim 1, wherein the plurality of data points in the data set are partitioned using a clustering algorithm.
  - 3. The method according to claim 1, wherein given a data set having N data points and a predetermined number of data points of interest n which each have k neighboring data points, a data point p is one of the predetermined number of data points of interest n if no more than n−1 other points in the data set reside at greater distances from the k neighboring data point than data point p.
  - 4. The method according to claim 1, wherein for each one of the plurality of partitions the lower and upper bounds are computed by calculating a distance of at least one neighboring data point from the plurality of data points in the partition, the lower bound being the smallest distance from the at least one neighboring data point to a first one of the plurality of data points in the partition and the upper bound being the largest distance from the at least one neighboring data point to a second one of the plurality of data points in the partition.
  - 5. The method according to claim 4, wherein for the predetermined number of data points of interest, a number of partitions having the largest lower bound values are selected such that the number of data points residing in such partitions is at least equal to the predetermined number of data points of interest, wherein the candidate partitions are comprised of those partitions having upper bound values that are greater than or equal to the smallest lower bound value of the number of partitions and the non-candidate partitions are comprised of those partitions having upper bound values that are less than the smallest lower bound value of the number of partitions, the non-candidate partitions being eliminated from consideration because they do not contain the at least one of the predetermined number of data points of interest.
  - 6. The method according to claim 1, wherein if the candidate partitions are smaller than a main memory, then all of the data points in the candidate partitions are stored in a main memory spatial index and the predetermined number of points of interest are identified using an index-based algorithm which probes the main memory spatial index.
  - 7. The method according to claim 1, wherein if the candidate partitions are larger than a main memory, then the partitions are processed in batches such that the overlap between each one of the partitions in a batch is as large as possible so that as many points as possible are processed in each batch.
  - 8. The method according to claim 7, wherein each one of the batches is comprised of a subset of the plurality of candidate partitions, the subset being smaller than the main memory.
  - 9. The method according to claim 7, wherein the predetermined number of data points of interest are selected from the batch processed candidate partitions, the predetermined number of data points of interest being those data points residing the farthest from their at least one neighboring data point.
  - 10. The method according to claim 9, wherein the algorithm computeOutliers uses an index-based algorithm.
  - 11. The method according to claim 9, wherein the algorithm computeOutliers uses a block nested-loop algorithm.
  - 12. The method according to claim 1, wherein an MBR is calculated for each one of the plurality of data points in the data set, lower and upper bounds being computed for each one of the data points in the MBR.
  - 13. The method according to claim 12, wherein the minimum distance between a point p and an MBR R is denoted by MINDIST(p, R), defined as $\mathrm{MINDIST}(p, R) = \sum_{i=1}^{\delta} x_i^2$, wherein [...]
  - 14. The method according to claim 12, wherein the maximum distance between a point p and an MBR R is denoted by MAXDIST(p, R), defined as $\mathrm{MAXDIST}(p, R) = \sum_{i=1}^{\delta} x_i^2$, wherein [...]
  - 15. The method according to claim 12, wherein the minimum distance between two MBRs R and S is denoted by MINDIST(R, S), defined as $\mathrm{MINDIST}(R, S) = \sum_{i=1}^{\delta} x_i^2$, wherein [...]
  - 16. The method according to claim 12, wherein the maximum distance between two MBRs R and S is denoted by MAXDIST(R, S), defined as $\mathrm{MAXDIST}(R, S) = \sum_{i=1}^{\delta} x_i^2$, where $x_i = \max\{|s'_i - r_i|, |r'_i - s_i|\}$.
  - 19. The method according to claim 3, wherein a data point having a larger value for D^k(p) resides in a more sparsely populated neighborhood of points and is thus more likely to be one of the predetermined number of points of interest n than a data point residing in a more densely populated neighborhood having a smaller value for D^k(p).
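The four MBR (minimum bounding rectangle) distances named in claims 13 to 16 can be sketched as below. Only the per-dimension term of claim 16 is stated in the text of this record; the terms used for the other three functions are the usual geometric definitions and are an assumption here, as is the representation of an MBR by its low corner (r_i) and high corner (r'_i). All function names are hypothetical. Following the claims, each function returns a sum of squares, that is, a squared Euclidean distance.

```python
def mindist_point_mbr(p, r_lo, r_hi):
    """Squared distance from point p to the nearest point of MBR R (assumed form)."""
    total = 0.0
    for p_i, lo_i, hi_i in zip(p, r_lo, r_hi):
        if p_i < lo_i:
            x = lo_i - p_i      # p lies below the rectangle in this dimension
        elif p_i > hi_i:
            x = p_i - hi_i      # p lies above the rectangle in this dimension
        else:
            x = 0.0             # p is within the rectangle's extent
        total += x * x
    return total


def maxdist_point_mbr(p, r_lo, r_hi):
    """Squared distance from point p to the farthest corner of MBR R (assumed form)."""
    return sum(
        max(abs(p_i - lo_i), abs(hi_i - p_i)) ** 2
        for p_i, lo_i, hi_i in zip(p, r_lo, r_hi)
    )


def mindist_mbr_mbr(r_lo, r_hi, s_lo, s_hi):
    """Squared gap between MBRs R and S; zero where they overlap (assumed form)."""
    return sum(
        max(rl - sh, sl - rh, 0.0) ** 2
        for rl, rh, sl, sh in zip(r_lo, r_hi, s_lo, s_hi)
    )


def maxdist_mbr_mbr(r_lo, r_hi, s_lo, s_hi):
    """Squared maximum distance between MBRs R and S, using the x_i of claim 16."""
    return sum(
        max(abs(sh - rl), abs(rh - sl)) ** 2
        for rl, rh, sl, sh in zip(r_lo, r_hi, s_lo, s_hi)
    )


# Toy rectangles in two dimensions (assumed values).
R = ((1.0, 1.0), (3.0, 2.0))
S = ((5.0, 0.0), (6.0, 4.0))
print(mindist_point_mbr((0.0, 0.0), *R), maxdist_point_mbr((0.0, 0.0), *R))
print(mindist_mbr_mbr(*R, *S), maxdist_mbr_mbr(*R, *S))
```

Bounds of this kind let a partition's lower and upper bounds be computed from rectangles alone, without visiting the individual points inside them.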

17. A method for computing the top n outliers in a data set, comprising the steps of:
- partitioning a plurality of data points in a data set into a plurality of partitions;
  
  computing lower and upper bounds for each one of the plurality of partitions;
  
  identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions may include at least one of n number of outliers of interest, wherein the outliers are included within the plurality of data points in the data set; and
  
  identifying the outliers from the plurality of candidate partitions.
- Dependent claims (18)
  - 18. The method according to claim 17, wherein given a data set having N data points and a predetermined number of data points of interest n which each have k neighbors, a data point p is one of the n outliers of interest if no more than n−1 other points in the data set have a higher value for D^k(p) than data point p.
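The ranking criterion of claim 18 can be illustrated with a brute-force baseline. This sketch assumes that D^k(p) denotes the Euclidean distance from p to its k-th nearest neighbor; it compares every pair of points and so has none of the partition-based savings the method claims. The function names are hypothetical.

```python
import math


def kth_nn_distance(points, idx, k):
    """Distance from points[idx] to its k-th nearest other point (assumed D^k)."""
    dists = sorted(
        math.dist(points[idx], q) for j, q in enumerate(points) if j != idx
    )
    return dists[k - 1]  # requires 1 <= k <= len(points) - 1


def top_n_outliers(points, k, n):
    """Return the n points with the largest k-th nearest neighbor distance."""
    scored = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    # Stable sort: points with equal scores keep their input order.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [points[i] for _, i in scored[:n]]


# Four clustered points plus two isolated ones (assumed toy data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (5, 5)]
print(top_n_outliers(pts, k=1, n=2))
```

A point in a sparse neighborhood has a large D^k value and therefore ranks high, which is the intuition stated in claim 19.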

Current Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Inventors
RASTOGI, RAJEEV; RAMASWAMY, SRIDHAR; SHIM, KYUSEOK

Granted Patent

US 6,643,629 B2
US Class Current

708/136
CPC Class Codes

G06F 16/30   of unstructured textual dat...

G06F 2216/03   Data mining

Y10S 706/925   Business
