Methods and apparatus for outlier detection for high dimensional data sets
Abstract
Methods and apparatus are provided for outlier detection in databases by determining sparse low dimensional projections. These sparse projections are then used to determine which points are outliers. The methodologies of the invention provide a novel definition of exceptions or outliers for the high dimensional domain of data.
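The abstract does not say how the sparsity of a projection is measured; the claims name a "sparsity coefficient measure" without defining it. The following is a minimal sketch of one way such a measure could be computed, under assumptions that are not taken from the text above: each dimension is split into `phi` equi-depth ranges, and the coefficient is the standardized difference between the observed and expected number of records in a cube of k ranges. The function names `equi_depth_ranges` and `sparsity_coefficient` are hypothetical.

```python
import math

def equi_depth_ranges(column, phi):
    """Map each value to a range index 0..phi-1 so every range holds about N/phi records."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    ranges = [0] * len(column)
    for rank, i in enumerate(order):
        ranges[i] = rank * phi // len(column)
    return ranges

def sparsity_coefficient(grid, cube, phi):
    """grid: per-record tuples of range indices; cube: {dimension: range index}, k >= 1.

    Assumed definition: (count - N*f^k) / sqrt(N*f^k*(1 - f^k)) with f = 1/phi.
    """
    n, f_k = len(grid), (1.0 / phi) ** len(cube)
    count = sum(all(rec[d] == r for d, r in cube.items()) for rec in grid)
    expected = n * f_k
    return (count - expected) / math.sqrt(expected * (1.0 - f_k))

# Toy data: two perfectly correlated attributes, 100 records, phi = 5 ranges each.
values = list(range(100))
r = equi_depth_ranges(values, 5)
grid = list(zip(r, r))
print(sparsity_coefficient(grid, {0: 0, 1: 4}, 5))  # empty cube: about -2.04
print(sparsity_coefficient(grid, {0: 0, 1: 0}, 5))  # crowded cube: about +8.16
```

Under this assumed definition, strongly negative values mark combinations of ranges that hold far fewer records than independence between the dimensions would predict.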
27 Claims
1. A method of optimizing data mining in a computer, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
determining one or more subsets of dimensions and corresponding ranges in the data set which are sparse in density using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
determining one or more data points in the data set which contain these subsets of dimensions and corresponding ranges, the one or more data points being identified as the one or more outliers;
wherein the sets of dimensions and corresponding ranges in which the data is sparse in density is quantified by a sparsity coefficient measure.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.

9. A method of optimizing data mining in a computer, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
identifying and mining one or more sub-patterns in the data set which have abnormally low presence not due to randomness using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
identifying one or more records which have the one or more sub-patterns present in them as the one or more outliers;
wherein the abnormally low presence is quantified by a sparsity coefficient measure.

10. Apparatus for optimizing data mining to detect one or more outliers within a high dimensional data set, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, comprising:
a computer having a memory and a data storage device coupled thereto, wherein the data storage device stores the data set; and
one or more computer programs, performed by the computer, for:
(i) determining one or more subsets of dimensions and corresponding ranges in the data set which are sparse in density using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
(ii) determining one or more data points in the data set which contain these subsets of dimensions and corresponding ranges, the one or more data points being identified as representing the one or more outliers;
wherein the sets of dimensions and corresponding ranges in which the data is sparse in density is quantified by a sparsity coefficient measure.
Dependent claims: 11, 12, 13, 14, 15, 16, 17.

18. Apparatus for optimizing data mining to detect one or more outliers within a high dimensional data set, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, comprising:
a computer having a memory and a data storage device coupled thereto, wherein the data storage device stores the data set; and
one or more computer programs, performed by the computer, for:
(i) identifying and mining one or more sub-patterns in the data set which have abnormally low presence not due to randomness using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
(ii) identifying one or more records which have the one or more sub-patterns present in them as the one or more outliers;
wherein the abnormally low presence is quantified by a sparsity coefficient measure.

19. An article of manufacture comprising a program storage medium readable by a computer and embodying one or more instructions executable by the computer to perform method steps for optimizing data mining, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
determining one or more subsets of dimensions and corresponding ranges in the data set which are sparse in density using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
determining one or more data points in the data set which contain these subsets of dimensions and corresponding ranges, the one or more data points being identified as the one or more outliers;
wherein the sets of dimensions and corresponding ranges in which the data is sparse in density is quantified by a sparsity coefficient measure.
Dependent claims: 20, 21, 22, 23, 24, 25, 26.

27. An article of manufacture comprising a program storage medium readable by a computer and embodying one or more instructions executable by the computer to perform method steps for optimizing data mining, the data mining being performed by the computer to detect one or more outliers within a high dimensional data set stored on a data storage device coupled to the computer, the data set representing a population of persons and the one or more outliers representing one or more persons within the population of persons, the method comprising the steps of:
identifying and mining one or more sub-patterns in the data set which have abnormally low presence not due to randomness using an algorithm comprising at least one of the processes of solution recombination, selection and mutation over a population of multiple solutions; and
identifying one or more records which have the one or more sub-patterns present in them as the one or more outliers;
wherein the abnormally low presence is quantified by a sparsity coefficient measure.

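The independent claims share two steps: a search over a population of candidate solutions using recombination, selection and mutation, followed by flagging the records that fall inside the sparse cubes found. The sketch below illustrates that general shape only and is not the claimed implementation. Everything specific in it is an assumption: the encoding of a solution as a tuple with `None` for "don't care" positions, the uniform crossover with repair, truncation selection over distinct candidates, the mutation rate, the assumed form of the sparsity coefficient, the choice to rank empty cubes last, and all function and parameter names.

```python
import math
import random

def sparsity(grid, cube, phi):
    """Return (coefficient, count); cube is a tuple with None meaning "don't care"."""
    fixed = [(d, r) for d, r in enumerate(cube) if r is not None]
    n, f_k = len(grid), (1.0 / phi) ** len(fixed)
    count = sum(all(rec[d] == r for d, r in fixed) for rec in grid)
    return (count - n * f_k) / math.sqrt(n * f_k * (1.0 - f_k)), count

def fitness(grid, cube, phi):
    # Sketch-level choice: an empty cube has no records to flag, so rank it last.
    coeff, count = sparsity(grid, cube, phi)
    return coeff if count > 0 else math.inf

def random_cube(n_dims, k, phi, rng):
    cube = [None] * n_dims
    for d in rng.sample(range(n_dims), k):
        cube[d] = rng.randrange(phi)
    return tuple(cube)

def recombine(a, b, k, phi, rng):
    """Uniform crossover, then repair so the child fixes exactly k dimensions."""
    child = [rng.choice(pair) for pair in zip(a, b)]
    fixed = [d for d, r in enumerate(child) if r is not None]
    free = [d for d, r in enumerate(child) if r is None]
    rng.shuffle(fixed)
    rng.shuffle(free)
    while len(fixed) > k:
        child[fixed.pop()] = None
    while len(fixed) < k:
        d = free.pop()
        child[d] = rng.randrange(phi)
        fixed.append(d)
    return tuple(child)

def mutate(cube, phi, rng, rate=0.2):
    """With probability `rate`, re-draw one range or move one fixed dimension (k >= 1)."""
    cube = list(cube)
    if rng.random() < rate:
        fixed = [d for d, r in enumerate(cube) if r is not None]
        free = [d for d, r in enumerate(cube) if r is None]
        d = rng.choice(fixed)
        if free and rng.random() < 0.5:
            cube[d] = None
            d = rng.choice(free)
        cube[d] = rng.randrange(phi)
    return tuple(cube)

def find_sparse_cubes(grid, k, phi, m=5, pop_size=50, generations=50, seed=0):
    rng = random.Random(seed)
    n_dims, scores = len(grid[0]), {}

    def key(c):  # memoized fitness, lower is sparser
        if c not in scores:
            scores[c] = fitness(grid, c, phi)
        return scores[c]

    pop = [random_cube(n_dims, k, phi, rng) for _ in range(pop_size)]
    best = sorted(dict.fromkeys(pop), key=key)[:m]
    for _ in range(generations):
        # Selection: keep the sparsest half of the distinct candidates as parents.
        parents = sorted(dict.fromkeys(pop), key=key)[: max(2, pop_size // 2)]
        pop = [mutate(recombine(rng.choice(parents), rng.choice(parents), k, phi, rng),
                      phi, rng) for _ in range(pop_size)]
        best = sorted(dict.fromkeys(best + pop), key=key)[:m]
    return [c for c in best if key(c) < 0]

def covered_records(grid, cubes):
    """Indices of records that fall inside at least one of the given cubes."""
    return [i for i, rec in enumerate(grid)
            if any(all(r is None or rec[d] == r for d, r in enumerate(c)) for c in cubes)]

# Toy grid of range indices (phi = 4): dimensions 0 and 1 always agree, except in record 0.
grid = [((i // 16) % 4, (i // 16) % 4, i % 4, (i // 4) % 4) for i in range(256)]
grid[0] = (0, 3, 0, 0)
cubes = find_sparse_cubes(grid, k=2, phi=4, m=1)
print(cubes, covered_records(grid, cubes))
```

On this toy grid the sparsest non-empty two-dimensional cube is `(0, 3, None, None)`, which contains only record 0. Whether a particular run reaches it depends on the seed, population size and number of generations, since the search is stochastic.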
Specification