MASSIVE CLUSTERING OF DISCRETE DISTRIBUTIONS
First Claim
1. A method of clustering data points, comprising the steps of:
- a) performing an initial segmentation of data points;
b) performing a series of discrete distribution (D2) clustering operations to determine a set of local centroids within each segment;
c) combining the local centroids determined in step b) into one data set and performing a segmentation of this data set;
d) iteratively repeating steps b) and c) at higher levels in a hierarchy, if necessary, until a single segmentation of the data points is achieved, the number of centroids is reduced to an acceptable level, or another stopping criterion is satisfied.
2 Assignments
0 Petitions
Accused Products
Abstract
The trend of analyzing big data in artificial intelligence requires more scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm has been developed for clustering bags of weighted vectors which are well adopted in many emerging machine learning applications. The high computational complexity of D2-clustering limits its impact in solving massive learning problems. Here we present a parallel D2-clustering algorithm with substantially improved scalability. We develop a hierarchical structure for parallel computing in order to achieve a balance between the individual-node computation and the integration process of the algorithm. The parallel algorithm achieves significant speed-up with minor accuracy loss.
-
Citations
23 Claims
-
1. A method of clustering data points, comprising the steps of:
-
a) performing an initial segmentation of data points; b) performing a series of discrete distribution (D2) clustering operations to determine a set of local centroids within each segment; c) combining the local centroids determined in step b) into one data set and performing a segmentation of this data set; d) iteratively repeating steps b) and c) at higher levels in a hierarchy, if necessary, until a single segmentation of the data points is achieved, the number of centroids is reduced to an acceptable level, or another stopping criterion is satisfied. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
-
-
14. A method of clustering data points, comprising the steps of:
-
a) performing an initial clustering of data points using a constrained discrete distribution (D2) clustering algorithm to divide the data into segments; b) performing a series of constrained D2 clustering operations on one or more of the segments to divide the data into additional segments, if necessary; and c) iteratively repeating step b) at the next level of a hierarchy, if necessary, until the size of each segment is reduced to an acceptable level, the number of segments increased to an acceptable level, or another stopping criterion is satisfied. - View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23)
-
Specification