Massive clustering of discrete distributions
First Claim
1. A method of determining similarities in still or moving images utilizing a hierarchical structure including a base level and higher levels, the method comprising the steps of:
- a) performing an initial segmentation of data objects in the base level into a plurality of segments using a master processor, each object being representative of a different image or video, the data objects included in the base level of the hierarchy being original data objects and a data set of the base level;
b) performing a discrete distribution (D2) clustering on each of the segments on one or more slave processors to determine a group of clusters within the segment, each of the clusters corresponding to a local centroid;
c) combining the local centroids within all of the segments determined in step b) into one global data set of local centroids and performing an initial segmentation of this global data set of local centroids into a plurality of segments using the master processor, the global data set of local centroids representing a higher level of the hierarchy, a size of the data set of the higher level being smaller than a size of data set of the previous level;
d) iteratively repeating steps b) and c) at higher levels in the hierarchy until a single segmentation of the data objects is achieved, the number of centroids is reduced to a predefined number, or a predefined threshold based on distances of the data objects to the centroids is satisfied; and
e) outputting information regarding the way in which the data objects are clustered in terms of similarity.
2 Assignments
0 Petitions
Accused Products
Abstract
The trend of analyzing big data in artificial intelligence requires more scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm has been developed for clustering bags of weighted vectors which are well adopted in many emerging machine learning applications. The high computational complexity of D2-clustering limits its impact in solving massive learning problems. Here we present a parallel D2-clustering algorithm with substantially improved scalability. We develop a hierarchical structure for parallel computing in order to achieve a balance between the individual-node computation and the integration process of the algorithm. The parallel algorithm achieves significant speed-up with minor accuracy loss.
-
Citations
17 Claims
-
1. A method of determining similarities in still or moving images utilizing a hierarchical structure including a base level and higher levels, the method comprising the steps of:
-
a) performing an initial segmentation of data objects in the base level into a plurality of segments using a master processor, each object being representative of a different image or video, the data objects included in the base level of the hierarchy being original data objects and a data set of the base level; b) performing a discrete distribution (D2) clustering on each of the segments on one or more slave processors to determine a group of clusters within the segment, each of the clusters corresponding to a local centroid; c) combining the local centroids within all of the segments determined in step b) into one global data set of local centroids and performing an initial segmentation of this global data set of local centroids into a plurality of segments using the master processor, the global data set of local centroids representing a higher level of the hierarchy, a size of the data set of the higher level being smaller than a size of data set of the previous level; d) iteratively repeating steps b) and c) at higher levels in the hierarchy until a single segmentation of the data objects is achieved, the number of centroids is reduced to a predefined number, or a predefined threshold based on distances of the data objects to the centroids is satisfied; and e) outputting information regarding the way in which the data objects are clustered in terms of similarity. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A method of determining similarities in a biological process or genetic sequence utilizing a hierarchical structure including a base level and higher levels, the method comprising the steps of:
-
a)) performing a segmentation of data objects in the base level into a plurality of segments using a master processor, each data object being representative of a different biological process or genetic sequence, the data objects included in the base level of the hierarchy being original data objects and a data set of the base level; b) performing a discrete distribution (D2) clustering on each of the segments on one or more slave processors to determine a group of dusters within each of the segments, each of the clusters corresponding to a local centroid; c) combining the local centroids within all of the segments determined in step b) into one global data set of local centroids and performing a segmentation of this global data set of local centroids into a plurality of segments using the master processor, the global data set of local centroids representing a higher level of the hierarchy, a size of the data set of the higher level being smaller than a size of the data set of the previous level; d) iteratively repeating steps b) and c) at higher levels in the hierarchy until a single segmentation of the data objects is achieved, the number of centroids is reduced to a predefined number, or a predefined threshold based on distances of the data objects to the centroids is satisfied; and e) outputting information regarding the way in which the data objects are clustered in terms of similarity.
-
Specification