Method for identifying outliers in large data sets

US 6,643,629 B2
Filed: 11/18/1999
Issued: 11/04/2003
Est. Priority Date: 11/18/1999
Status: Expired due to Fees

First Claim

Patent Images

1. A computer implemented method embedded in a recordable media for identifying a predetermined number of outliers of interest in a data set, comprising the steps of:

partitioning a plurality of data points in the data set into a plurality of partitions;

computing lower and upper bounds for each one of the plurality of partitions to identify those partitions that cannot possibly contain the predetermined number of outliers of interest;

identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions possibly contains at least one of a predetermined number of outliers of interest, wherein the predetermined number of outliers of interest are included within the plurality of data points in the data set; and

identifying the predetermined number of outliers of interest from the plurality of candidate partitions wherein the data set has N data points and the predetermined number of outliers of interest is n which each have k neighboring data points, a data point p is one of the predetermined number of outliers of interest n if no more than n−

1 other points in the data set reside at greater distances from the k neighboring data point than data point p.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A new method for identifying a predetermined number of data points of interest in a large data set. The data points of interest are ranked in relation to the distance to their neighboring points. The method employs partition-based detection algorithms to partition the data points and then compute upper and lower bounds for each partition. These bounds are then used to eliminate those partitions that do contain the predetermined number of data points of interest. The data points of interest are then computed from the remaining partitions that were not eliminated. The present method eliminates a significant number of data points from consideration as the points of interest, thereby resulting in substantial savings in computational expense compared to conventional methods employed to identify such points.

Citations

20 Claims

1. A computer implemented method embedded in a recordable media for identifying a predetermined number of outliers of interest in a data set, comprising the steps of:
- partitioning a plurality of data points in the data set into a plurality of partitions;
  
  computing lower and upper bounds for each one of the plurality of partitions to identify those partitions that cannot possibly contain the predetermined number of outliers of interest;
  
  identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions possibly contains at least one of a predetermined number of outliers of interest, wherein the predetermined number of outliers of interest are included within the plurality of data points in the data set; and
  
  identifying the predetermined number of outliers of interest from the plurality of candidate partitions wherein the data set has N data points and the predetermined number of outliers of interest is n which each have k neighboring data points, a data point p is one of the predetermined number of outliers of interest n if no more than n−
  
  1 other points in the data set reside at greater distances from the k neighboring data point than data point p.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- - 2. The method according to claim 1, wherein the partitioning step of the plurality of data points in the data set are partitioned using a clustering algorithm.
  - 3. The method according to claim 1, wherein the computing step includes:
4. The method according to claim 3, wherein:
- for the predetermined number of outliers of interest, a number of partitions having the largest lower bound values are selected such that the number of data points residing in such partitions is at least equal to the predetermined number of outliers of interest;
  
  the identifying of the plurality of candidate partitions includes identifying which of the candidate partitions are comprised of those partitions having upper bound values that are greater than or equal to the smallest lower bound value of the number of partitions; and
  
  the non-candidate partitions are comprised of those partitions having upper bound values that are less than the smallest lower bound value of the number of partitions, the non-candidate partitions being eliminated from consideration because they do not contain the at least one of the predetermined number of outliers of interest.
5. The method according to claim 1, wherein the candidate partitions are smaller than a main memory, and further comprising the step of:
- storing all of the data points in the candidate partitions in a main memory spatial index; and
  
  identifying the predetermined number of outliers of interest using an index-based algorithm which probes the main memory spatial index.
6. The method according to claim 1, wherein the candidate partitions are larger than a main memory, then the partitions are processed in batches such that the overlap between each one of the partitions in a batch is as large as possible so that as many points as possible are processed in each batch.
7. The method according to claim 6, wherein each one of the batches is comprised of a subset of the plurality of candidate partitions, the subset being smaller than the main memory.
8. The method according to claim 6, wherein the predetermined number of outliers of interest are selected from the batch processed candidate partitions, the predetermined number of outliers of interest being those data points residing the farthest from their at least one neighboring data point.
9. The method according to claim 8, wherein the step of identifying the predetermined number of outliers of interest uses an index-based algorithm.
10. The method according to claim 8, wherein the step of identifying the predetermined number of outliers of interest uses a block nested-loop algorithm.
11. The method according to claim 1, wherein:
- the partitioning step includes calculating a minimum bounding rectangle MBR for each one of the plurality of data points in the data set; and
  
  the computing step of the lower and upper bounds being computed for each one of the data points in the MBR.
12. The method according to claim 11, wherein the minimum distance between a point p and an MBR R is denoted by MINDIST (p, R) defined as MINDIST (p, R)=Σ
- ^δ_i=Ix²_i, wherein $x_{i} = {\begin{matrix} r_{i} - p_{i} & if p_{i} < r_{i} \\ p_{i} - r_{i} & if r_{i}^{'} < p_{i} \\ 0 & otherwise, \end{matrix}$ and wherein every point in MBR R is at a distance of at least MINDIST (p, R) from point p;
  
  the point p in δ
  
  -dimensional space is denoted by [p₁, p₂, . . . , p_δ]; and
  
  the MBR R is a δ
  
  -dimensional rectangle R denoted by the two endpoints of its major diagonal;
  
  r=[r₁, r₂, . . . , r_δ] and r′
  
  =[r′
  
  ₁, r′
  
  ₂, . . . , r′
  
  _δ] such that r_i≦
  
  r′
  
  _ifor 1≦
  
  i≦
  
  n.
13. The method according to claim 11, wherein the maximum distance between a point p and an MBR R is denoted by MAXDIST (p, R) defined as MAXDIST (p, R)=Σ
- ^δ_i=1x²_i, wherein $x_{i} = {\begin{matrix} r_{i}^{'} - p_{i} & if p_{i} < r_{i} + r_{i}^{'} / 2 \\ p_{i} - r_{i} & otherwise, \end{matrix}$ and wherein no point in MBR R is at a distance that exceeds MAXDIST (p, R) from the point p;
  
  the point p in δ
  
  -dimensional space is denoted by [p₁, p₂, . . . , p_δ]; and
  
  the MBR R is a δ
  
  -dimensional rectangle R denoted by the two endpoints of its major diagonal;
  
  r=[r₁, r₂, . . . , r_δ] and r′
  
  =[r′
  
  ₁, r′
  
  ₂, . . . , r′
  
  _δ] such that r_i≦
  
  r′
  
  _ifor 1≦
  
  i≦
  
  n.
14. The method according to claim 11, wherein the minimum distance between two MBRs R and S is denoted by MINDIST(R, S) defined as MINDIST (R, S)=Σ
- ^δ_i=1x²_i, wherein $x_{i} = {\begin{matrix} r_{i} - s_{I}^{'} & if s_{i}^{'} < r_{i} \\ p_{i} - r_{i}^{'} & if r_{i}^{'} < p_{i} \\ 0 & otherwise, \end{matrix}$ and wherein every point in MBR R is at a distance of at least MINDIST(R, S) from any point in MBR S;
  
  a point p in δ
  
  -dimensional space is denoted by [p₁, p₂, . . . , p_δ];
  
  the MBR R is a δ
  
  -dimensional rectangle R denoted by the two endpoints of its major diagonal;
  
  r=[r₁, r₂, . . . , r_δ] and r′
  
  =[r′
  
  ₁, r′
  
  ₂, . . . , r′
  
  _δ] such that r_i≦
  
  r′
  
  _ifor 1≦
  
  i≦
  
  n; and
  
  the MBR S is a δ
  
  -dimensional rectangle S denoted by the two endpoints of its major diagonal;
  
  s=[s₁, s₂, . . . , s_δ] and s′
  
  =[s′
  
  ₁, s′
  
  ₂, . . . , s′
  
  _δ].
15. The method according to claim 14, wherein the maximum distance between two MBRs R and S is denoted by MAXDIST(R, S) defined as MAXDIST (R, S)=Σ
- ^δ_i−
  
  1x²_i, where x_i=max {|s′
  
  _i−
  
  r_i|, |r′
  
  _i−
  
  s_i|}.

16. A computer implemented method embedded in a recordable media for determining the top n outliers in a data set, comprising the steps of:
- partitioning a plurality of data points in the data set into a plurality of partitions;
  
  computing lower and upper bounds for each one of the plurality of partitions to identify those partitions that cannot possibly contain a predetermined number of outliers of interest;
  
  identifying a plurality of candidate partitions from the plurality of partitions, wherein each one of the plurality of candidate partitions possibly includes at least one of n number of outliers of interest, wherein the outliers are included within the plurality of data points in the data set; and
  
  identifying the outliers from the plurality of candidate partitions wherein given the data set has N data points and the n number of outliers of interest which each have k neighbors, a data point p is one of the n outliers of interest if no more than n−
  
  1 other points in the data set have a higher value for D^l(p) than data point p wherein D^k(p) represents the distance of the point p to its k^thnearest neighbor.
- View Dependent Claims (17, 18, 19, 20)
- - 17. The method according to claim 16, wherein a data point having a larger value for D^k(p) resides in a more sparsely populated neighborhood of points and is thus more likely to be one of the n outliers of interest n than a data point residing in a more densely populated neighborhood having a smaller value for D^k(p).
  - 18. The method according to claim 16, wherein:
19. The method according to claim 18, wherein:
- the minimum distance between a point p and an MBR R is denoted by MINDIST (p, R) defined as MINDIST (p, R)=Σ
  
  ^δ_i−
  
  1x²_i, wherein $x_{i} = {\begin{matrix} r_{i} - p_{i} & if p_{i} < r_{i} \\ p_{i} - r_{i} & if r_{i}^{'} < p_{i} \\ 0 & otherwise, \end{matrix}$
  
  and wherein every point in MBR R is at a distance of at least MINDIST (p, R) from point p;
  
  the point p in δ
  
  -dimensional space is denoted by [p₁, p₂, . . . , p_δ]; and
  
  the MBR R is a δ
  
  -dimensional rectangle R denoted by the two endpoints of its major diagonal;
  
  r=[r₁, r₂, . . . , r_δ] and r′
  
  =[r′
  
  ₁, r′
  
  ₂, . . . , r′
  
  ₇] such that r_i≦
  
  r′
  
  _ifor 1≦
  
  i≦
  
  n; and
  
  the maximum distance between the point p and the MBR R is denoted by MAXDIST (p, R) defined as MAXDIST (p, R)=Σ
  
  ^δ_i−
  
  1x²_i, wherein $x_{i} = {\begin{matrix} r_{i}^{'} - p_{i} & if p_{i} < r_{i} + r_{i}^{'} / 2 \\ p_{i} - r_{i} & otherwise, \end{matrix}$
  
  and wherein no point in MBR R is at a distance that exceeds MAXDIST (p, R) from the point p.
20. The method according to claim 19, wherein:
- the minimum distance between two MBRs R and S is denoted by MINDIST(R, S) defined as MINDIST (R, S)=Σ
  
  ^δ_i−
  
  1x²_i, wherein $x_{i} = {\begin{matrix} r_{i} - s_{I}^{'} & if s_{i}^{'} < r_{i} \\ p_{i} - r_{i}^{'} & if r_{i}^{'} < p_{i} \\ 0 & otherwise, \end{matrix}$
  
  and wherein every point in MBR R is at a distance of at least MINDIST(R, S) from any point in MBR S;
  
  a point p in δ
  
  -dimensional space is denoted by [p₁, p₂, . . . , p_δ];
  
  the MBR R is a δ
  
  -dimensional rectangle R denoted by the two endpoints of its major diagonal;
  
  r=[r₁, r₂, . . . , r_δ] and r′
  
  −
  
  [r′
  
  ₁, r′
  
  ₂, . . . , r′
  
  _δ] such that r_i≦
  
  r′
  
  _ifor 1≦
  
  i≦
  
  n; and
  
  the MBR S is a δ
  
  -dimensional rectangle S denoted by the two endpoints of its major diagonal;
  
  s=[s₁, s₂, . . . , s_δ] and s′
  
  =[s′
  
  ₁, s′
  
  ₂, . . . , s′
  
  _δ]; and
  
  the maximum distance between two MBRs R and S is denoted by MAXDIST(R, S) defined as MAXDIST (R, S)=Σ
  
  ^δ_i−
  
  1x²_i, where x_i=max {|s′
  
  _i−
  
  r_i|, |r′
  
  ₁−
  
  s₁|}.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Inventors
Ramaswamy, Sridhar, Rastogi, Rajeev, Shim, Kyuseok
Primary Examiner(s)
Voeltz, Emanuel Todd
Assistant Examiner(s)
Hirl, Joseph P.

Application Number

US09/442,912
Publication Number

US 20030061249A1
Time in Patent Office

1,447 Days
Field of Search

706/45, 706/12, 706/925, 707/100
US Class Current

706/45
CPC Class Codes

G06F 16/30   of unstructured textual dat...

G06F 2216/03   Data mining

Y10S 706/925   Business

Method for identifying outliers in large data sets

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

20 Claims

Specification

Solutions

Use Cases

Quick Links

Method for identifying outliers in large data sets

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

20 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links