Method, apparatus and programmed medium for clustering databases with categorical attributes
First Claim
1. A computer based method of clustering related data stored in a computer database, said computer database stored on a computer readable medium and including a set of data points having categorical attributes, the method comprising the steps of:
- a) determining all neighbors for every data point within said computer database;
b) establishing a cluster for every data point in said computer database;
c) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster;
d) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster;
e) merging a pair of clusters having the best goodness measures into a merged cluster;
f) repeating steps c) through e) until a predetermined termination condition is met; and
g) storing clusters which remain after step f) in a computer readable medium.
3 Assignments
0 Petitions
Accused Products
Abstract
The present invention relates to a computer method, apparatus and programmed medium for clustering databases containing data with categorical attributes. The present invention assigns a pair of points to be neighbors if their similarity exceeds a certain threshold. The similarity value for pairs of points can be based on non-metric information. The present invention determines a total number of links between each cluster and every other cluster bases upon the neighbors of the clusters. A goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and the total number of points within each cluster and every other cluster is then calculated. The present invention merges the two clusters with the best goodness measure. Thus, clustering is performed accurately and efficiently by merging data based on the amount of links between the data to be clustered.
-
Citations
57 Claims
-
1. A computer based method of clustering related data stored in a computer database, said computer database stored on a computer readable medium and including a set of data points having categorical attributes, the method comprising the steps of:
-
a) determining all neighbors for every data point within said computer database; b) establishing a cluster for every data point in said computer database; c) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster; d) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster; e) merging a pair of clusters having the best goodness measures into a merged cluster; f) repeating steps c) through e) until a predetermined termination condition is met; and g) storing clusters which remain after step f) in a computer readable medium. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computer based method of clustering related data in a large computer database, said large computer database stored on a computer readable medium and including a set of data points having categorical attributes, the method comprising the steps of:
-
a) selecting a random set of data points from said large computer database; b) determining all neighbors for every data point within said random set of data points; c) establishing a cluster for every data point within said random set of data points; d) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster; e) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster; f) merging a pair of clusters having the best goodness measures into a merged cluster; g) repeating steps d) through f) until a predetermined termination condition is met; and h) storing clusters which remain after step g) in a computer readable medium. - View Dependent Claims (9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
-
-
23. A computer readable storage medium containing a computer readable code for operating a computer to perform a clustering method on a computer database, said computer database including data points having categorical attributes, said clustering method comprises the steps of:
-
a) determining all neighbors for every data point within said computer database; b) establishing a cluster for every data point in said computer database; c) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster; d) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster; e) merging a pair of clusters having the best goodness measures into a merged cluster; f) repeating steps c) through e) until a predetermined termination condition is met; and g) storing clusters which remain after step f) in a computer readable medium. - View Dependent Claims (24, 25, 26, 27, 28, 29)
-
-
30. A computer readable storage medium containing a computer readable code for operating a computer to perform a clustering method on a large database, said large database including a set of data points having categorical attributes, said clustering method comprises the steps of:
-
a) selecting a random set of data points from said large computer database; b) determining all neighbors for every data point within said random set of data points; c) establishing a cluster for every data point within said random set of data points; d) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster; e) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster; f) merging a pair of clusters having the best goodness measures into a merged cluster; g) repeating steps d) through f) until a predetermined termination condition is met; and h) storing clusters which remain after step g) in a computer readable medium. - View Dependent Claims (31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42)
-
-
43. A programmed computer database clustering system comprising:
-
means for determining all neighbors for every data point within a computer database stored on a computer readable medium, said computer database including data points having categorical attributes; means for establishing a cluster for every data point in said computer database; means for determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster; means for calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster; means for merging a pair of clusters having the best goodness measures into a merged cluster; means for storing clusters which remain in a computer readable medium. - View Dependent Claims (44, 45, 46, 47)
-
-
48. A programmed computer database clustering system comprising:
-
means for selecting a random set of data points from a large computer database stored on a computer readable medium, said large computer database including data points having categorical attributes; means for determining all neighbors for every data point within said large computer database; means for establishing a cluster for every data point in said computer database; means for determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster; means for calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster; means for merging a pair of clusters having the best goodness measures into a merged cluster; means for storing clusters which remain in a computer readable medium. - View Dependent Claims (49, 50, 51, 52, 53, 54, 55, 56, 57)
-
Specification