High dimensional data mining and visualization via gaussianization
First Claim
1. A method for mining high dimensional data, comprising the steps of:
- linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data;
marginally Gaussianizing each of the coordinates, said Gaussianizing being characterized by univariate Gaussian means, priors, and variances;
iteratively repeating said transforming and Gaussianizing steps until the coordinates converge to a standard Gaussian distribution;
arranging the coordinates of all iterations hierarchically to facilitate data mining; and
mining the arranged coordinates.
2 Assignments
0 Petitions
Accused Products
Abstract
A method is provided for providing high dimensional data. The high dimensional data is linearly transformed into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data. Each of the coordinates are marginally Gaussianized, the Gaussianization being characterized by univariate Gaussian means, priors, and variances. The transforming and Gaussianizing steps are iteratively repeated until the coordinates converge to a standard Gaussian distribution. The coordinates of all iterations are arranged hierarchically to facilitate data mining. The arranged coordinates are then mined. According to an embodiment of the invention, the transform step includes applying an iterative maximum likelihood expectation maximization (EM) method to the high dimensional data.
36 Citations
26 Claims
-
1. A method for mining high dimensional data, comprising the steps of:
-
linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data;
marginally Gaussianizing each of the coordinates, said Gaussianizing being characterized by univariate Gaussian means, priors, and variances;
iteratively repeating said transforming and Gaussianizing steps until the coordinates converge to a standard Gaussian distribution;
arranging the coordinates of all iterations hierarchically to facilitate data mining; and
mining the arranged coordinates. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
computing an auxiliary function Q of the EM method based upon the log likelihood of the high dimensional data;
updating the univariate Gaussian priors, to maximize the auxiliary function Q;
respectively updating the univariate Gaussian variances, the linear transform row by row, and the univariate Gaussian means, to maximize the auxiliary function Q;
repeating said second updating step, until the auxiliary function Q converges to a local maximum; and
repeating said computing step and said second updating step, until the log likelihood of the high dimensional data converges to a local maximum.
-
-
5. The method according to claim 4, wherein the linear transform is fixed, when the univariate Gaussian variances are updated.
-
6. The method according to claim 4, wherein the univariate Gaussian variances are fixed, when the linear transform is updated.
-
7. The method according to claim 4, wherein the linear transform is fixed, when the univariate Gaussian means are updated.
-
8. The method according to claim 1, wherein said arranging step hierarchically arranges the coordinates of all the iterations in a tree structure.
-
9. A method for visualizing high dimensional data, comprising the steps of:
-
linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data;
marginally Gaussianizing each of the coordinates, said Gaussianizing being characterized by univariate Gaussian means, priors, and variances;
iteratively repeating said transforming and Gaussianizing steps until the coordinates converge to a standard Gaussian distribution;
arranging the coordinates of all iterations hierarchically into high dimensional data sets to facilitate data visualization; and
visualizing the high dimensional data sets. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
the expectation step comprising the step of: computing an auxiliary function Q of the EM method, based upon the log likelihood of the high dimensional data;
the maximization step comprising the steps of;
updating the univariate Gaussian priors, to maximize the auxiliary function Q; and
respectively updating the univariate Gaussian variances, the linear transform row by row, and the univariate Gaussian means, to maximize the auxiliary function Q;
wherein said second updating step is repeated, until the auxiliary function Q converges to a local maximum, and wherein said computing step and said second updating step are repeated, until the log likelihood of the high dimensional data converges to a local maximum.
-
-
13. The method according to claim 12, wherein the linear transform is fixed, when the univariate Gaussian variances are updated.
-
14. The method according to claim 13, wherein the univariate Gaussian variances are fixed, when the linear transform is updated.
-
15. The method according to claim 13, wherein the linear transform is fixed, when the univariate Gaussian means are updated.
-
16. The method according to claim 9, wherein said arranging step hierarchically arranges the coordinates of all the iterations in a tree structure.
-
17. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for arranging high dimensional data for data mining, said method steps comprising:
-
linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data;
marginally Gaussianizing each of the coordinates, said Gaussianizing being characterized by univariate Gaussian means, priors, and variances;
iteratively repeating said transforming and Gaussianizing steps until the coordinates converge to a standard Gaussian distribution; and
arranging the coordinates of all iterations hierarchically to facilitate data mining. - View Dependent Claims (18, 19, 20, 21)
computing an auxiliary function Q of the EM method based upon the log likelihood of the high dimensional data;
updating the univariate Gaussian priors, to maximize the auxiliary function Q;
respectively updating the univariate Gaussian variances, the linear transform row by row, and the univariate Gaussian means, to maximize the auxiliary function Q;
repeating said second updating step, until the auxiliary function Q converges to a local maximum; and
repeating said computing step and said second updating step, until the log likelihood of the high dimensional data converges to a local maximum.
-
-
21. The program storage device according to claim 17, wherein said arranging step hierarchically arranges the coordinates of all the iterations in a tree structure.
-
22. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for arranging high dimensional data for visualization, said method steps comprising:
-
linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data;
marginally Gaussianizing each of the coordinates, said Gaussianizing being characterized by univariate Gaussian means, priors, and variances;
iteratively repeating said transforming and Gaussianizing steps until the coordinates converge to a standard Gaussian distribution; and
arranging the coordinates of all iterations hierarchically into high dimensional data sets to facilitate data visualization. - View Dependent Claims (23, 24, 25, 26)
the expectation step comprising the step of: computing an auxiliary function Q of the EM method, based upon the log likelihood of the high dimensional data;
the maximization step comprising the steps of;
updating the univariate Gaussian priors, to maximize the auxiliary function Q; and
respectively updating the univariate Gaussian variances, the linear transform row by row, and the univariate Gaussian means, to maximize the auxiliary function Q;
wherein said second updating step is repeated, until the auxiliary function Q converges to a local maximum, and wherein said computing step and said second updating step are repeated, until the log likelihood of the high dimensional data converges to a local maximum.
-
-
26. The program storage device according to claim 22, wherein said arranging step hierarchically arranges the coordinates of all the iterations in a tree structure.
Specification