SQL-based analytic algorithm for cluster analysis
First Claim
Patent Images
1. A computer-implemented system for performing data mining applications, comprising:
- (a) a computer system having one or more data storage devices connected thereto;
(b) a relational database management system, executed by the computer system for managing a relational database stored on the data storage devices; and
(c) an analytic algorithm for cluster analysis performed by the computer system, wherein the analytic algorithm for cluster analysis includes SQL statements and programmatic iteration for finding one or more groupings in data retrieved from the relational database management system and for identifying homogenous ones of the groupings as clusters, and the analytic algorithm for cluster analysis creates at least one analytic model within an analytic logical data model from data residing in the relational database.
2 Assignments
0 Petitions
Accused Products
Abstract
A method, apparatus, and article of manufacture for performing data mining applications in a relational database management system. An analytic algorithm for cluster analysis is performed by the computer. The analytic algorithm for cluster analysis includes SQL statements and programmatic iteration for finding one or more groupings in the data retrieved from the relational database management system and for identifying homogenous ones of the groupings as clusters. The analytic algorithm for cluster analysis creates at least one analytic model within an analytic logical data model from data residing in the relational database.
-
Citations
69 Claims
-
1. A computer-implemented system for performing data mining applications, comprising:
-
(a) a computer system having one or more data storage devices connected thereto;
(b) a relational database management system, executed by the computer system for managing a relational database stored on the data storage devices; and
(c) an analytic algorithm for cluster analysis performed by the computer system, wherein the analytic algorithm for cluster analysis includes SQL statements and programmatic iteration for finding one or more groupings in data retrieved from the relational database management system and for identifying homogenous ones of the groupings as clusters, and the analytic algorithm for cluster analysis creates at least one analytic model within an analytic logical data model from data residing in the relational database. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
a list of attributes to be analyzed for clusters, a type of duster analysis, a number of dusters to be searched for within the data, an iteration threshold, or a maximum number of iterations.
-
-
3. The system of claim 1, the analytic algorithm for cluster analysis generates output comprising cluster means, variances, or prior probabilities.
-
4. The system of claim 3, wherein the prior probabilities comprise a measure of success of cluster identification as an average of all within-cluster variances.
-
5. The system of claim 1, wherein the analytic algorithm for cluster analysis randomly associates each row of a specified table to one or more clusters, and performs a specified number of iterations on the clusters, where each iteration performs an expectation step, a maximization step, and evaluation step.
-
6. The system of claim 5, wherein the expectation step comprises:
calculating means, variances and frequencies for the rows assigned to each cluster, and constructing a covariance inverse matrix using the calculated variances.
-
7. The system of claim 6, wherein the constructing step assumes the covariances arc zero.
-
8. The system of claim 6, wherein the covariances are based on a Standardized Euclidean distance.
-
9. The system of claim 8, wherein the Standardized Euclidean distance improves the cluster analysis'"'"' performance, since the number of calculations required are proportional to a number of columns rather than to a square of the number of columns.
-
10. The system of claim 8, wherein the Standardized Euclidean distance comprises a Mahalanobis Distance (MD).
-
11. The system of claim 10, wherein the constructing step comprises:
calculating each row'"'"'s distance to each cluster using the Mahalanobis Distance.
-
12. The system of claim 11, wherein the calculating step uses means and variances form the expectation step.
-
13. The system of claim 11, wherein the constructing step comprises:
under a K-Means model, re-assigning rows to clusters by associating each row to a closest cluster centroid using a lowest Mahalanobis Distance.
-
14. The system of claim 11, wherein the constructing step comprises:
under a Gaussian Mixture mode, re-assigning rows to clusters with probabilistic weighting after units of distance have been transformed to units of standard deviation of a standard normal distribution by a Gaussian distance function.
-
15. The system of claim 6, wherein the constructing step comprises:
displaying intermediate results from the calculating and constructing steps, and passing the intermediate results onto a next iteration.
-
16. Thc system of claim 15, wherein the intermediate results comprise cluster means, variances and average within-cluster variances.
-
17. The system of claim 6, further comprising, after a specified number of iterations have been performed, displaying final results.
-
18. The system of claim 5, wherein the evaluation step comprises identifying any resulting claims.
-
19. The system of claim 18, wherein the identifying step depends on observations of convergence.
-
20. The system of claim 18, wherein the identifying step depends on a pattern of declining average within-cluster variances.
-
21. The system of claim 18, wherein the identifying step depends on an accurate production of cluster centroids.
-
22. The system of claim 1, wherein the computer system is a massively parallel processing (MPP) computer system, and the analytic algorithm for cluster analysis is performed concurrently in parallel by the computer system.
-
23. The computer-implemented system of claim 1, wherein the analytic algorithm for cluster analysis is implemented as a combination of SQL statements performed by the relational database management system and programmatic iteration performed by an application program.
-
24. A method for performing data mining applications, comprising:
-
(a) managing a relational database stored on one or more data storage devices connected to a computer system; and
(b) performing an analytic algorithm for cluster analysis in the computer system, wherein the analytic algorithm for cluster analysis includes SQL statements and programmatic iteration for finding one or more groupings in data retrieved from the relational database management system and for identifying homogenous ones of the groupings as clusters, and the analytic algorithm for cluster analysis creates at least one analytic model within an analytic logical data model from data residing in the relational database. - View Dependent Claims (25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46)
a list of attributes to be analyzed for clusters, a type of cluster analysis, a number of clusters to be searched for within the data, an iteration threshold, or a maximum number of iterations.
-
-
26. The method of claim 24, the analytic algorithm for cluster analysis generates output comprising cluster means, variances, or prior probabilities.
-
27. Tec method of claim 26, wherein the prior probabilities comprise a measure of success of cluster identification as an average of all within-cluster variances.
-
28. The method of claim 24, wherein the analytic algorithm for cluster analysis comprises:
-
randomly associating each row of a specified table to one or more clusters, and performing a specified number of iterations on the clusters, where each iteration performs an expectation stop, a maximization step, and an evaluation step.
-
-
29. The method of claim 28, wherein the expectation step comprises:
calculating means, variances and frequencies for the rows assigned to each cluster, and constructing a covariance inverse matrix using the calculated variances.
-
30. The method of claim 29, wherein the constructing step assumes the covariances are zero.
-
31. The method of claim 29, wherein the covariances are based on a Standardized Euclidean distance.
-
32. The method of claim 31, wherein the Standardized Euclidean distance improves the cluster analysis'"'"' performance, since the number of calculations required are proportional to a number of columns rather than to a square of the number of columns.
-
33. The method of claim 31, wherein the Standardized Euclidean distance comprises a Mahalanobis Distance (MD).
-
34. The method of claim 33, wherein the constructing step comprises:
calculating each row'"'"'s distance to each cluster using the Mahalanobis Distance.
-
35. The method of claim 34, wherein the calculating step uses means and variances from the expectation step.
-
36. The method of claim 34, wherein the constructing step comprises:
under a K-Means model, re-assigning rows to clusters by associating each tow to a closest cluster centroid using a lowest Mahalanobis Distance.
-
37. The method of claim 34, wherein the constructing step comprises:
under a Gaussian Mixture model, re-assigning rows to clusters with a probabilistic weighting after units of distance have been transformed to units of standard deviation of a standard normal distribution by a Gaussian distance function.
-
38. The method of claim 29, wherein the constructing step comprises:
displaying intermediate results from the calculating and constructing steps, and passing the intermediate results onto a next iteration.
-
39. The method of claim 38, wherein the intermediate results comprise cluster means, variances and average within-cluster variances.
-
40. The method of claim 29, further comprising, after a specified number of iterations have been performed, displaying final results.
-
41. The method of claim 28, wherein the evaluation step comprises identifying any resulting clusters.
-
42. The method of claim 41, wherein the identifying step depends on observations of convergence.
-
43. The method of claim 41, wherein the identifying step depends on a pattern of declining average within-cluster variances.
-
45. The method of claim 24, wherein the computer system is a massively parallel processing (MPP) computer system, and the analytic algorithm for cluster analysis is performed concurrently in parallel by the computer system.
-
46. The method of claim 24, wherein the analytic algorithm for cluster analysis is implemented as a combination of SQL statements performed by the relational database management system and programmatic iteration performed by an application program.
-
44. The method of clam 41, wherein the identifying step depends on an accurate production of cluster centroids.
-
47. An article of manufacture comprising logic embodying a method for performing data mining applications, the logic comprising
(a) managing a relational database stored on one or more data storage devices connected to a computer; - and
(b) performing an analytic algorithm for cluster analysis in the computer, wherein the analytic algorithm for cluster analysis includes SQL statements and programmatic iteration for finding one or more groupings in data retrieved from tire relational database management system and for identifying homogenous ones of the groupings as clusters, and the analytic algorithm for cluster analysis creates at least one analytic model within an analytic logical data model from data residing in the relational database. - View Dependent Claims (48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69)
a list of attributes to be analyzed for clusters, a type of cluster analysis, a number of clusters to be shed for within the data, an iteration threshold, or a maximum number of iterations.
- and
-
49. The article of manufacture of claim 47, the analytic algorithm for cluster analysis generates output comprising cluster means, variances, or prior probabilities.
-
50. The article of manufacture of claim 49, wherein the prior probabilities comprise a measure of success of cluster identification as an average of all within-cluster variances.
-
51. The article of manufacture of claim 47, wherein the analytic algorithm for cluster analysis comprises:
-
randomly associating each row of a specified table to one or more clusters, and performing a specified number of iterations on the clusters, where each iteration performs an expectation step, a maximization step, and an evaluation step.
-
-
52. The article of manufacture of claim 51, wherein the expectation step comprises:
-
calculating means, variances and frequencies for the rows assigned to each cluster, and constructing a covariance inverse matrix using the calculated variances.
-
-
53. The article of manufacture of claim 52, wherein the constructing step assumes the covariances are zero.
-
54. The article of manufacture of claim 52, wherein the covariances are based on a Standardized Euclidean distance.
-
55. The article of manufacture of claim 54, wherein the Standardized Euclidean distance improves the cluster analysis'"'"' performance, since the number of calculations required are proportional to a number of columns rather than to a square of the number of columns.
-
56. The article of manufacture of claim 54, wherein the Standardized Euclidean distance comprises a Mahalanobis Distance (MD).
-
57. The article of manufacture of claim 56, wherein the constructing step comprises:
calculating each row'"'"'s distance to each cluster using the Mahalanobis Distance.
-
58. The article of manufacture of claim 57, wherein the calculating step uses means and variances from the expectation step.
-
59. The article of manufacture of claim 57, wherein the constructing step comprises:
under a K-Means model, re-assigning rows to clusters by associating each row to a closest cluster centroid using a lowest Mahalanobis Distance.
-
60. The article of manufacture of claim 57, wherein the constructing step comprises:
under a Gaussian Mixture model, re-assigning rows to clusters with a probabilistic weighting after units of distance have been transformed to units of standard deviation of a standard normal distribution by a Gaussian distance fiction.
-
61. The article of manufacture of claim 52, wherein the constructing step comprises:
-
displaying intermediate results from the calculating and constructing steps, and passing the intermediate results onto a next iteration.
-
-
62. The method of claim 61, wherein the intermediate results comprise cluster means, variances and average within-cluster variances.
-
63. The article of manufacture of claim 52, further comprising, after a specified number of iterations have been performed, displaying final results.
-
64. The article of manufacture of claim 51, wherein the evaluation step comprises identifying any resulting clusters.
-
65. The method of claim 64, wherein the identifying step depends on observations of convergence.
-
66. The method of claim 64, wherein the identifying step depends on a pattern of declining average within-cluster variances.
-
67. The method of claim 64, wherein the identifying step depends on an accurate production of cluster centroids.
-
68. The article of manufacture of claim 47, wherein the computer system is a massively parallel processing (MPP) computer system, and the analytic algorithm for cluster analysis is performed concurrently in parallel by the computer system.
-
69. The article of manufacture of claim 47, wherein the analytic algorithm for cluster analysis is implemented as a combination of SQL statements performed by the relational database management system and programmatic iteration performed by an application program.
Specification