MASSIVE CLUSTERING OF DISCRETE DISTRIBUTIONS

US 20140143251A1
Filed: 11/15/2013
Published: 05/22/2014
Est. Priority Date: 11/19/2012
Status: Active Grant

Abstract

The trend of analyzing big data in artificial intelligence requires more scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm has been developed for clustering bags of weighted vectors which are well adopted in many emerging machine learning applications. The high computational complexity of D2-clustering limits its impact in solving massive learning problems. Here we present a parallel D2-clustering algorithm with substantially improved scalability. We develop a hierarchical structure for parallel computing in order to achieve a balance between the individual-node computation and the integration process of the algorithm. The parallel algorithm achieves significant speed-up with minor accuracy loss.

23 Claims

1. A method of clustering data points, comprising the steps of:
- a) performing an initial segmentation of data points;
  
  b) performing a series of discrete distribution (D2) clustering operations to determine a set of local centroids within each segment;
  
  c) combining the local centroids determined in step b) into one data set and performing a segmentation of this data set;
  
  d) iteratively repeating steps b) and c) at higher levels in a hierarchy, if necessary, until a single segmentation of the data points is achieved, the number of centroids is reduced to an acceptable level, or another stopping criterion is satisfied.
- Dependent claims 2 to 13:
  - 2. The method of claim 1, wherein the segmentation of the data points is based upon adjacency.
  - 3. The method of claim 1, wherein the D2 clustering operations are performed by parallel processors.
  - 4. The method of claim 1, wherein the D2 clustering operations are performed by a single processor in sequence.
  - 5. The method of claim 1, including the steps of:
    - assigning different cluster labels to the data points within each segment; and
      
      optimizing the cluster labels of each point to determine the local centroids.
  - 6. The method of claim 1, including the step of imposing one or more constraints on the centroids passed to each level in the hierarchy.
  - 7. The method of claim 1, wherein the cluster centroids passed to successively higher levels are weighted to maintain equal contributions from each original data point.
  - 8. The method of claim 1, wherein a master processor performs the initial data segmentation step and distributes the data segments to different parallel slave processors to perform the D2-clustering at each level in the hierarchy.
  - 9. The method of claim 8, including the use of synchronized Message Passing Interface (MPI) and MapReduce techniques to perform message passing and data transmission between the master and slave processors.
  - 10. The method of claim 1, wherein the objects to be clustered are mathematically represented by discrete distributions or bags of weighted vectors.
  - 11. The method of claim 1, wherein the data points are associated with images or video.
  - 12. The method of claim 1, wherein the data points are associated with a biological process or genetic sequence.
  - 13. The method of claim 1, wherein the series of discrete distribution (D2) clustering operations are performed by physically separate parallel processors or separate cores of an integrated device.
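
The bottom-up hierarchy of claim 1, together with the weighting of claim 7, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: weighted k-means on plain vectors stands in for a real D2-clustering call, segmentation is simple chunking rather than the adjacency-based segmentation of claim 2, and all names (`weighted_kmeans`, `segment`, `hierarchical_cluster`, `segment_size`) are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Stand-in for one D2-clustering call (ASSUMPTION: plain weighted k-means).
    Returns the local centroids and the total weight assigned to each."""
    rng = random.Random(seed)
    k = min(k, len(points))
    dim = len(points[0])
    centroids = [list(p) for p in rng.sample(points, k)]

    def assign():
        return [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [(p, w) for p, w, lab in zip(points, weights, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                total = sum(w for _, w in members)
                centroids[j] = [sum(p[d] * w for p, w in members) / total for d in range(dim)]
    # Each centroid carries the summed weight of its members, so every
    # original data point contributes equally at higher levels (claim 7).
    mass = [0.0] * k
    for w, lab in zip(weights, assign()):
        mass[lab] += w
    kept = [j for j in range(k) if mass[j] > 0]
    return [centroids[j] for j in kept], [mass[j] for j in kept]

def segment(points, weights, size):
    """Steps a) and c): hypothetical segmentation into consecutive chunks."""
    return [(points[i:i + size], weights[i:i + size]) for i in range(0, len(points), size)]

def hierarchical_cluster(points, k, segment_size):
    if segment_size < 2 * k:
        raise ValueError("segment_size must be at least 2 * k so each level shrinks the data")
    weights = [1.0] * len(points)
    level = 0
    while True:
        segments = segment(points, weights, segment_size)
        merged_points, merged_weights = [], []
        # Step b): the segments are independent, so this loop could run on
        # parallel processors (claim 3) or in sequence as here (claim 4).
        for seg_points, seg_weights in segments:
            cents, mass = weighted_kmeans(seg_points, seg_weights, k, seed=level)
            merged_points += cents
            merged_weights += mass
        points, weights = merged_points, merged_weights
        if len(segments) == 1:  # step d): a single segment has been clustered
            return points, weights
        level += 1

# Usage: three synthetic 2-D blobs, 100 points each.
rng = random.Random(42)
data = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
rng.shuffle(data)
centroids, masses = hierarchical_cluster(data, k=3, segment_size=60)
print(len(centroids), sum(masses))
```

Because each centroid inherits the summed weight of its members, the total weight at the top of the hierarchy equals the number of original data points, which is one way to read the equal-contribution requirement of claim 7.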

14. A method of clustering data points, comprising the steps of:
- a) performing an initial clustering of data points using a constrained discrete distribution (D2) clustering algorithm to divide the data into segments;
  
  b) performing a series of constrained D2 clustering operations on one or more of the segments to divide the data into additional segments, if necessary; and
  
  c) iteratively repeating step b) at the next level of a hierarchy, if necessary, until the size of each segment is reduced to an acceptable level, the number of segments is increased to an acceptable level, or another stopping criterion is satisfied.
- Dependent claims 15 to 23:
  - 15. The method of claim 14, wherein the constrained D2 clustering operations are performed by parallel processors.
  - 16. The method of claim 14, wherein the constrained D2 clustering operations are performed by a single processor in sequence.
  - 17. The method of claim 14, wherein the constraint includes a fixed weighting scheme on the support vectors of the discrete distributions that are the cluster centroids.
  - 18. The method of claim 14, wherein a master processor performs the initial clustering and distributes the data segments to different parallel slave processors to perform the constrained D2 clustering at each level in the hierarchy.
  - 19. The method of claim 18, including the use of synchronized messaging.
  - 20. The method of claim 14, wherein the objects to be clustered are mathematically represented by discrete distributions or bags of weighted vectors.
  - 21. The method of claim 14, wherein the data points are associated with images or video.
  - 22. The method of claim 14, wherein the data points are associated with a biological process or genetic sequence.
  - 23. The method of claim 14, wherein the clustering operations are performed by physically separate parallel processors or separate cores of an integrated device.
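
The top-down procedure of claim 14 can be sketched in a similar way. Again this is only an illustration under assumptions: plain 2-means on vectors stands in for a constrained D2-clustering call, the stopping criterion is a hypothetical maximum segment size (`max_size`), and the function names are invented for this example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split_in_two(points, iters=10, seed=0):
    """Stand-in for one constrained D2-clustering call (ASSUMPTION: plain 2-means).
    Returns the non-empty parts of the split."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, 2)]
    parts = [[], []]
    for _ in range(iters):
        parts = [[], []]
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            parts[nearest].append(p)
        for j in (0, 1):
            if parts[j]:
                centers[j] = [sum(col) / len(parts[j]) for col in zip(*parts[j])]
    return [part for part in parts if part]

def divisive_cluster(points, max_size):
    """Steps a) to c): keep splitting segments until each holds at most max_size points."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    pending, done = [list(points)], []
    level = 0
    while pending:
        next_pending = []
        # The segments at one level are independent and could be handed to
        # parallel processors (claim 15) or processed in sequence (claim 16).
        for seg in pending:
            if len(seg) <= max_size:
                done.append(seg)
                continue
            parts = split_in_two(seg, seed=level)
            if len(parts) < 2:
                done.append(seg)  # cannot be split further (e.g. identical points)
            else:
                next_pending.extend(parts)
        pending = next_pending
        level += 1
    return done

# Usage: 200 synthetic 2-D points split into segments of at most 40 points.
rng = random.Random(7)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
segments = divisive_cluster(data, max_size=40)
print(len(segments), max(len(s) for s in segments))
```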

Current Assignee: Penn State Research Foundation (Pennsylvania State University)
Original Assignee: Penn State Research Foundation (Pennsylvania State University)
Inventors: Wang, James Z.; Li, Jia; Zhang, Yu

Granted Patent: US 9,720,998 B2
US Class Current: 707/737
CPC Class Codes: G06F 16/285 (Clustering or classification)
