Methods and apparatus for outlier detection for high dimensional data sets

US 7,395,250 B1
Filed: 10/11/2000
Issued: 07/01/2008
Est. Priority Date: 10/11/2000
Status: Expired due to Fees

Abstract

Methods and apparatus are provided for outlier detection in databases by determining sparse low dimensional projections. These sparse projections are used for the purpose of determining which points are outliers. The methodologies of the invention are very relevant in providing a novel definition of exceptions or outliers for the high dimensional domain of data.

27 Claims

1. A method of optimizing data mining in a computer, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
- determining one or more subsets of dimensions and corresponding ranges in the data set which are sparse in density using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
  
  determining one or more data points in the data set which contain these subsets of dimensions and corresponding ranges, the one or more data points being identified as the one or more outliers;
  
  wherein the sets of dimensions and corresponding ranges in which the data is sparse in density is quantified by a sparsity coefficient measure.
- - 2. The method of claim 1, wherein a range is defined as a set of contiguous values on a given dimension.
  - 3. The method of claim 1, wherein the sparsity coefficient measure S(D) is defined as $S(D) = \frac{n(D) - N \cdot f^{k}}{\sqrt{N \cdot f^{k} \cdot (1 - f^{k})}}$, where k represents the number of dimensions in the data set, f represents the fraction of data points in each range, N is the total number of data points in the data set, and n(D) is the number of data points in a set of dimensions D.
  - 4. The method of claim 1, wherein a given sparsity coefficient measure is inversely proportional to the number of data points in a given set of dimensions and corresponding ranges.
  - 5. The method of claim 1, wherein a set of dimensions is determined using an algorithm which uses the processes of solution recombination, selection and mutation over a population of multiple solutions.
  - 6. The method of claim 5, wherein the process of solution recombination comprises combining characteristics of two solutions in order to create two new solutions.
  - 7. The method of claim 5, wherein the process of mutation comprises changing a particular characteristic of a solution in order to result in a new solution.
  - 8. The method of claim 5, wherein the process of selection comprises biasing the population in order to favor solutions which are more optimum.
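
The sparsity coefficient of claim 3 can be illustrated with a short Python sketch. This is not part of the claims. It assumes the denominator is the binomial standard deviation sqrt(N * f^k * (1 - f^k)), and it reads k as the number of dimensions in the set D being scored. The function name `sparsity_coefficient` and the sample numbers are hypothetical.

```python
import math


def sparsity_coefficient(n_d: int, n_total: int, f: float, k: int) -> float:
    """Return S(D) = (n(D) - N*f^k) / sqrt(N*f^k*(1 - f^k)).

    n_d     -- n(D), points observed in the chosen ranges of D
    n_total -- N, total number of points in the data set
    f       -- fraction of points in each range (assumed equal per range)
    k       -- number of dimensions in D (an interpretive assumption)
    """
    expected = n_total * f ** k                          # expected count if dimensions were independent
    std_dev = math.sqrt(n_total * f ** k * (1.0 - f ** k))
    return (n_d - expected) / std_dev


# Hypothetical numbers: N = 10,000 points, 10 ranges per dimension (f = 0.1),
# and a 2-dimensional combination of ranges that holds only 40 points.
s = sparsity_coefficient(n_d=40, n_total=10_000, f=0.1, k=2)
print(round(s, 2))  # -6.03
```

Under these assumptions about 100 points would be expected in the combination, so a count of 40 gives a strongly negative coefficient. Fewer points give a lower value, which is the sense in which a sparse combination is flagged.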

9. A method of optimizing data mining in a computer, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
- identifying and mining one or more sub-patterns in the data set which have abnormally low presence not due to randomness using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
  
  identifying one or more records which have the one or more sub-patterns present in them as the one or more outliers;
  
  wherein the abnormally low presence is quantified by a sparsity coefficient measure.
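
The record-identification step of claim 9 can be sketched in Python as follows. This is illustrative only. It assumes each record has already been discretized into one range index per dimension and that a sub-pattern is a mapping from dimension index to range index. The name `matching_records` and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence


def matching_records(records: Sequence[Sequence[int]],
                     pattern: Dict[int, int]) -> List[int]:
    """Return indices of records that fall in the pattern's range on every
    dimension the pattern names; other dimensions are ignored."""
    return [
        i for i, rec in enumerate(records)
        if all(rec[dim] == rng for dim, rng in pattern.items())
    ]


# Hypothetical discretized data: 5 records, 3 dimensions, range indices 0..2.
records = [
    (0, 1, 2),
    (0, 1, 0),
    (1, 1, 2),
    (2, 0, 1),
    (0, 2, 2),
]

# Hypothetical sparse sub-pattern: range 2 on dimension 0 and range 0 on dimension 1.
pattern = {0: 2, 1: 0}

outliers = matching_records(records, pattern)
print(outliers)  # [3]
```

Only the record at index 3 lies in both named ranges, so it is the one reported. Deciding whether the pattern is sparse enough to be used at all is left to the sparsity coefficient measure.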

10. Apparatus for optimizing data mining to detect one or more outliers within a high dimensional data set, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, comprising:
- a computer having a memory and a data storage device coupled thereto, wherein the data storage device stores the data set; and
  
  one or more computer programs, performed by the computer, for;
  
  (i) determining one or more subsets of dimensions and corresponding ranges in the data set which are sparse in density using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
  
  (ii) determining one or more data points in the data set which contain these subsets of dimensions and corresponding ranges, the one or more data points being identified as representing the one or more outliers;
  
  wherein the sets of dimensions and corresponding ranges in which the data is sparse in density is quantified by a sparsity coefficient measure.
- - 11. The apparatus of claim 10, wherein a range is defined as a set of contiguous values on a given dimension.
  - 12. The apparatus of claim 10, wherein the sparsity coefficient measure S(D) is defined as $S(D) = \frac{n(D) - N \cdot f^{k}}{\sqrt{N \cdot f^{k} \cdot (1 - f^{k})}}$, where k represents the number of dimensions in the data set, f represents the fraction of data points in each range, N is the total number of data points in the data set, and n(D) is the number of data points in a set of dimensions D.
  - 13. The apparatus of claim 10, wherein a given sparsity coefficient measure is inversely proportional to the number of data points in a given set of dimensions and corresponding ranges.
  - 14. The apparatus of claim 10, wherein a set of dimensions is determined using an algorithm which uses the processes of solution recombination, selection and mutation over a population of multiple solutions.
  - 15. The apparatus of claim 14, wherein the process of solution recombination comprises combining characteristics of two solutions in order to create two new solutions.
  - 16. The apparatus of claim 14, wherein the process of mutation comprises changing a particular characteristic of a solution in order to result in a new solution.
  - 17. The apparatus of claim 14, wherein the process of selection comprises biasing the population in order to favor solutions which are more optimum.
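
The three processes named in claims 14 to 17 can be sketched generically in Python. This is a minimal illustration, not the procedure of the specification. It assumes a solution is a list with one entry per dimension, where `None` means the dimension is unused and an integer is a range index. It uses plain one-point crossover and rank-weighted sampling as stand-ins. All names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Solution = List[Optional[int]]  # None = dimension unused; int = range index


def recombine(a: Solution, b: Solution,
              rng: random.Random) -> Tuple[Solution, Solution]:
    """Combine two solutions into two new ones (one-point crossover).
    Requires solutions of length >= 2."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(s: Solution, num_ranges: int, rng: random.Random) -> Solution:
    """Change one characteristic: move one used dimension to another range.
    Requires num_ranges >= 2."""
    child = list(s)
    used = [i for i, v in enumerate(child) if v is not None]
    if used:
        i = rng.choice(used)
        child[i] = rng.choice([r for r in range(num_ranges) if r != child[i]])
    return child


def select(population: List[Solution],
           fitness: Callable[[Solution], float],
           rng: random.Random) -> List[Solution]:
    """Bias the population toward better solutions. Lower fitness (for
    example a more negative sparsity coefficient) gets a higher weight."""
    ranked = sorted(population, key=fitness)
    weights = [len(ranked) - i for i in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=len(ranked))


rng = random.Random(0)
parent_a: Solution = [0, None, 1, None]
parent_b: Solution = [None, 2, None, 3]
child_a, child_b = recombine(parent_a, parent_b, rng)
mutant = mutate(parent_a, num_ranges=5, rng=rng)
```

Note that one-point crossover can change how many dimensions a child uses. An implementation that must hold that number fixed would need a crossover that preserves it.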

18. Apparatus for optimizing data mining to detect one or more outliers within a high dimensional data set, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, comprising:
- a computer having a memory and a data storage device coupled thereto, wherein the data storage device stores the data set; and
  
  one or more computer programs, performed by the computer for;
  
  (i) identifying and mining one or more sub-patterns in the data set which have abnormally low presence not due to randomness using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
  
  (ii) identifying one or more records which have the one or more sub-patterns present in them as the one or more outliers;
  
  wherein the abnormally low presence is quantified by a sparsity coefficient measure.

19. An article of manufacture comprising a program storage medium readable by a computer and embodying one or more instructions executable by the computer to perform method steps for optimizing data mining, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
- determining one or more subsets of dimensions and corresponding ranges in the data set which are sparse in density using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
  
  determining one or more data points in the data set which contain these subsets of dimensions and corresponding ranges, the one or more data points being identified as the one or more outliers;
  
  wherein the sets of dimensions and corresponding ranges in which the data is sparse in density is quantified by a sparsity coefficient measure.
- - 20. The article of claim 19, wherein a range is defined as a set of contiguous values on a given dimension.
  - 21. The article of claim 19, wherein the sparsity coefficient measure S(D) is defined as $S(D) = \frac{n(D) - N \cdot f^{k}}{\sqrt{N \cdot f^{k} \cdot (1 - f^{k})}}$, where k represents the number of dimensions in the data set, f represents the fraction of data points in each range, N is the total number of data points in the data set, and n(D) is the number of data points in a set of dimensions D.
  - 22. The article of claim 19, wherein a given sparsity coefficient measure is inversely proportional to the number of data points in a given set of dimensions and corresponding ranges.
  - 23. The article of claim 19, wherein a set of dimensions is determined using an algorithm which uses the processes of solution recombination, selection and mutation over a population of multiple solutions.
  - 24. The article of claim 23, wherein the process of solution recombination comprises combining characteristics of two solutions in order to create two new solutions.
  - 25. The article of claim 23, wherein the process of mutation comprises changing a particular characteristic of a solution in order to result in a new solution.
  - 26. The article of claim 23, wherein the process of selection comprises biasing the population in order to favor solutions which are more optimum.

27. An article of manufacture comprising a program storage medium readable by a computer and embodying one or more instructions executable by the computer to perform method steps for optimizing data mining, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
- identifying and mining one or more sub-patterns in the data set which have abnormally low presence not due to randomness using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
  
  identifying one or more records which have the one or more sub-patterns present in them as the one or more outliers;
  
  wherein the abnormally low presence is quantified by a sparsity coefficient measure.

Current Assignee: Trend Micro Inc.
Original Assignee: International Business Machines Corporation
Inventors: Aggarwal, Charu C.; Yu, Philip Shi-lung
Primary Examiner(s): STARKS, WILBERT L
Application Number: US09/686,115
Time in Patent Office: 2,820 Days
Field of Search: 706/20, 706/15, 706/45, 706/48, 707/100, 707/101, 707/6, 707/7, 702/194, 708/203, 356/326, 250/255, 731/16
US Class Current: 706/20
CPC Class Codes: G06F 18/2433 Single-class perspective, e...
