Method and system for clustering data in parallel in a distributed-memory multiprocessor system

US 6,269,376 B1
Filed: 10/26/1998
Issued: 07/31/2001
Est. Priority Date: 10/26/1998
Status: Expired due to Term

Abstract

A method, apparatus, article of manufacture, and a memory structure for clustering data points in parallel using a distributed-memory multi-processor system are disclosed. The disclosed system has particularly advantageous application to a rapid and flexible k-means computation for data mining. The method comprises the steps of dividing a set of data points into a plurality of data blocks, initializing a set of k global centroid values in each of the data blocks to k initial global centroid values, performing a plurality of asynchronous processes on the data blocks, each asynchronous process assigning each data point in each data block to the closest global centroid value within each data block, computing a set of k block accumulation values from the data points assigned to the k global centroid values, and recomputing the k global centroid values from the k block accumulation values.
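The steps named in the abstract can be illustrated with a minimal single-process sketch. It is not the patented implementation: the function names are hypothetical, the blocks are processed one after another rather than on separate processors, and keeping a per-centroid point count next to each sum is an assumption made so that the global mean can be recovered from the block accumulations.

```python
# Illustrative simulation only; all names here are hypothetical, not from the patent.

def split_into_blocks(points, num_blocks):
    """Divide the points into num_blocks contiguous blocks of near-equal size."""
    size, extra = divmod(len(points), num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        end = start + size + (1 if b < extra else 0)
        blocks.append(points[start:end])
        start = end
    return blocks


def squared_distance(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def block_accumulate(block, centroids):
    """Assign each point to its closest centroid; return per-centroid sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)  # assumption: counts are kept alongside the sums
    for p in block:
        j = min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def recompute_centroids(block_results, centroids):
    """Combine the block accumulations from every block into new global centroids."""
    k, dim = len(centroids), len(centroids[0])
    new = []
    for j in range(k):
        total = sum(counts[j] for _, counts in block_results)
        if total == 0:
            new.append(list(centroids[j]))  # assumption: an empty cluster keeps its centroid
            continue
        new.append([sum(sums[j][d] for sums, _ in block_results) / total for d in range(dim)])
    return new


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
blocks = split_into_blocks(points, 2)
results = [block_accumulate(b, centroids) for b in blocks]
new_centroids = recompute_centroids(results, centroids)
print(new_centroids)  # [[0.0, 1.0], [10.0, 1.0]]
```

Only the small per-block sums and counts need to be exchanged between blocks, not the data points themselves, which is what makes the scheme suit a distributed-memory machine.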

86 Citations

42 Claims

1. A method of clustering a set of data points into k clusters, comprising:
- (a) dividing the set of data points into P data blocks of substantially equal size, each data block assigned to one of P processors;
  
  (b) selecting k initial global centroids with a first processor and broadcasting the k initial global centroids from the first processor to the remaining P−1 processors;
  
  (c) computing the distance from each data point in each data block to the global centroid values by using the processor associated with the data block;
  
  (d) assigning each data point in each data block to a global centroid value closest to the data point by using the processor associated with the data block;
  
  (e) computing k block accumulation values in each block from the data points assigned thereto; and
  
  (f) recomputing the k global centroid values from the k block accumulation values computed for each data block.
2. The method of claim 1, wherein the number of data points in each block is substantially proportional to the processing speed of the processor assigned to the data block.
3. The method of claim 1, wherein (c)-(e) are performed by the P processors asynchronously.
4. The method of claim 1, wherein the recomputing the k global centroid values in each block from the k block accumulation values from each data block comprises:
5. The method of claim 1, further comprising repeating (c) through (f) until a convergence condition is satisfied.
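Claim 2 sizes each block in proportion to the speed of its processor. One way this could be done is sketched below. The helper names are hypothetical, the speeds are assumed to be positive integer ratings, and the rule of handing leftover points to the fastest processors first is an illustrative choice, not something the claims specify.

```python
def proportional_block_sizes(num_points, speeds):
    """Return block sizes roughly proportional to integer speeds, summing to num_points."""
    total_speed = sum(speeds)
    sizes = [num_points * s // total_speed for s in speeds]
    # Integer division can leave a few points unassigned; give them to the
    # fastest processors first (an assumption made for this sketch).
    leftover = num_points - sum(sizes)
    fastest_first = sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True)
    for i in fastest_first[:leftover]:
        sizes[i] += 1
    return sizes


def split_by_sizes(points, sizes):
    """Cut the point list into consecutive blocks with the given sizes."""
    blocks, start = [], 0
    for size in sizes:
        blocks.append(points[start:start + size])
        start += size
    return blocks


sizes = proportional_block_sizes(10, [1, 1, 2])
blocks = split_by_sizes(list(range(10)), sizes)
print(sizes)  # [2, 2, 6]
```

With equal speeds the same helper reduces to the substantially equal split of claim 1(a).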

6. A method of clustering a set of data points into k clusters, comprising:
- (a) dividing the set of data points into a plurality of data blocks;
  
  (b) initializing a set of k global centroid values in each of the plurality of data blocks to k initial global centroid values;
  
  (c) performing a plurality of asynchronous processes on the data blocks, each asynchronous process assigning each data point in a data block to a global centroid value closest to the data point;
  
  (d) computing a set of k block accumulation values in each block from the data points assigned to the k global centroid values; and
  
  (e) recomputing the k global centroid values from the k block accumulation values from the plurality of data blocks.
7. The method of claim 6, wherein the initial k global centroid values are chosen arbitrarily.
8. The method of claim 6 wherein (d) is performed by the plurality of asynchronous processes.
9. The method of claim 6, further comprising repeating (c) through (e) until a convergence condition is satisfied.
10. The method of claim 9, wherein the convergence condition includes a mean squared error between the global centroid values and the data points, and the repeating of steps (b) through (e) until a convergence condition is satisfied comprises:
11. The method of claim 6, wherein the block accumulation values are computed as a sum of the data points assigned to the global centroid values.
12. The method of claim 11, wherein the global centroid values are recomputed from the block accumulation values.
13. The method of claim 6, wherein the asynchronous processes are performed on a plurality of processors.
14. The method of claim 6, wherein the set of data points is divided into P data blocks, each data block associated with one of P processors, and the assigning each data point in each block to the closest global centroid value and computing the block accumulation values from the data points assigned to the global centroid values are performed for each data block by the processor associated with the data block.
15. The method of claim 14, wherein the recomputing the k global centroid values from the k block accumulation values comprises:
- broadcasting the block accumulation values across a communication network interconnecting the processors; and
  
  computing the k global centroid values from the broadcasted k block accumulation values.
16. The method of claim 6, wherein the recomputing the global centroid values from the block accumulation values is performed by the plurality of asynchronous processes.
17. The method of claim 6, wherein the data blocks each comprise substantially the same number of data points.
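The iteration described in claims 9 through 13 might look like the following sketch. Worker threads from Python's standard `concurrent.futures` module stand in for the processors, so this shows the control flow rather than real distributed-memory execution. The stopping rule (stop when the mean squared error improves by no more than `tol`) is an assumption, because the sub-steps of claim 10 are not reproduced in this listing; `process_block`, `parallel_kmeans`, `tol`, and `max_iter` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_block(block, centroids):
    """Per-block work: assign points, then return (sums, counts, squared_error)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    sq_err = 0.0
    for p in block:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        j = dists.index(min(dists))  # index of the closest centroid
        counts[j] += 1
        sq_err += dists[j]
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts, sq_err


def parallel_kmeans(blocks, centroids, tol=1e-9, max_iter=100):
    """Repeat assign/accumulate/recompute until the MSE stops improving."""
    n = sum(len(b) for b in blocks)  # assumes at least one data point
    k, dim = len(centroids), len(centroids[0])
    prev_mse = float("inf")
    mse = prev_mse
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        for _ in range(max_iter):
            # One task per block; map collects every block's accumulations.
            results = list(pool.map(process_block, blocks, [centroids] * len(blocks)))
            mse = sum(r[2] for r in results) / n
            if prev_mse - mse <= tol:  # assumed convergence condition
                break
            prev_mse = mse
            new_centroids = []
            for j in range(k):
                total = sum(r[1][j] for r in results)
                if total == 0:
                    new_centroids.append(list(centroids[j]))  # keep empty cluster as is
                else:
                    new_centroids.append(
                        [sum(r[0][j][d] for r in results) / total for d in range(dim)]
                    )
            centroids = new_centroids
    return centroids, mse


blocks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
final_centroids, final_mse = parallel_kmeans(blocks, [(0.0, 0.0), (10.0, 0.0)])
print(final_centroids, final_mse)  # [[0.0, 1.0], [10.0, 1.0]] 1.0
```

On a real distributed-memory system the gather-and-combine step would typically be a collective reduction over the interconnect, in the spirit of the broadcast in claim 15.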

18. An apparatus for clustering a set of data points into k clusters, comprising:
- (a) means for dividing the set of data points into a plurality of data blocks;
  
  (b) means for initializing a set of k global centroid values in each of the plurality of data blocks to k initial global centroid values;
  
  (c) a plurality of processors for performing a plurality of asynchronous processes on the data blocks, each asynchronous process assigning each data point in a data block to a global centroid value closest to the data point, and computing a set of k block accumulation values from the data points assigned to the global centroid values; and
  
  (e) means for recomputing the k global centroid values from the k block accumulation values from the plurality of data blocks.
19. The apparatus of claim 18, further comprising means for arbitrarily defining the k global centroid values.
20. The apparatus of claim 18, further comprising means for determining when a convergence condition is satisfied.
21. The apparatus of claim 18, wherein the convergence condition includes a mean squared error between the global centroid values and the data points, and the means for determining when a convergence condition is satisfied comprises:
22. The apparatus of claim 18, wherein the means for computing the block accumulation values comprises means for summing the data points assigned to the global centroid values.
23. The method of claim 22, wherein the means for computing the global centroid values comprises means for averaging the block accumulation values.
24. The apparatus of claim 18, wherein the means for performing a plurality of asynchronous processes on the data blocks comprises a plurality of asynchronously operating processors.
25. The apparatus of claim 18, wherein means for dividing the set of data points into a plurality of data blocks comprises a means for dividing the set of data points into P data blocks, each data block associated with one of P processors, and the means for assigning each data point in each block to the closest global centroid value and recomputing the block accumulation values from the data points assigned to the global centroid values comprises the processor associated with the data block.
26. The apparatus of claim 25, wherein the means for recomputing the k global centroid values from the k block accumulation values comprises:
- means for broadcasting the block accumulation values across a communication network; and
  
  means for computing the k global centroid values from the broadcasted k block accumulation values.
27. The apparatus of claim 18, wherein the means for recomputing the global centroid values from the block accumulation values comprises a plurality of synchronously operating processors.
28. The apparatus of claim 18, wherein the data blocks each comprise substantially the same number of data points.

29. An apparatus for detecting relationships in a set of data points divided into a set of data blocks, comprising a plurality of asynchronous processors, each associated with one of the plurality of data blocks and operating on the data points within the associated data blocks, each processor implementing a plurality of procedures comprising:
- a first procedure for initializing a set of k global centroid values to k initial global centroid values;
  
  a second procedure for assigning each data point in each data block to the closest global centroid value;
  
  a third procedure for computing a set of k block accumulation values from the data points assigned to the k global centroid values; and
  
  a fourth procedure for recomputing the global centroid values from the k block accumulation values from the plurality of data blocks.
30. The apparatus of claim 29, wherein each of the plurality of processors further implements a procedure for computing a mean squared error between the global centroid values and the data points.
31. The program storage medium of claim 30, wherein the data blocks each comprise substantially the same number of data points.

32. A program storage medium, readable by a computer, embodying one or more instructions executable by the computer to perform a method for clustering a set of data points around into k clusters, the method comprising:
- (a) dividing the set of data points into a plurality of data blocks;
  
  (b) initializing a set of k global centroid values in each of the plurality of data blocks to k initial global centroid values;
  
  (c) performing a plurality of asynchronous processes on the data blocks, each asynchronous process assigning each data point in a data block to a global centroid value closest to the data point;
  
  (d) computing a set of k block accumulation values in each block from the data points assigned to the global centroid values; and
  
  (e) recomputing the k global centroid values from the k block accumulation values from the k data blocks.
33. The program storage medium of claim 32, wherein the method further comprises arbitrarily defining the k initial global centroid values.
34. The program storage medium of claim 33, wherein the method further comprises repeating (b) through (e) until a local convergence condition is satisfied.
35. The program storage medium of claim 34, wherein the convergence condition includes a mean squared error between the global centroid values and the data points, and the repeating of (b) through (e) until a convergence condition is satisfied comprises:
36. The program storage medium of claim 32, wherein the block accumulation values are computed as a sum of the data points assigned to the global centroid values.
37. The program storage medium of claim 36, wherein the global centroid values are recomputed from the average of the block accumulation values.
38. The program storage medium of claim 32, wherein the asynchronous processes are performed on a plurality of processors.
39. The program storage medium of claim 38, wherein the set of data points is divided into P data blocks, each data block associated with one of P processors, and the assigning each data point in each block to the closest global centroid value and recomputing the global centroid values from the data points assigned to the global centroid values are performed for each data block by the processor associated with the data block.
40. The program storage medium of claim 39, wherein the recomputing the k global centroid values from the k block accumulation values comprises:
- broadcasting the block accumulation values across a communication network; and
  
  computing the k global centroid values from the broadcasted k block accumulation values.
41. The program storage medium of claim 32, wherein the recomputing the global centroid values from the block accumulation values is performed synchronously by a plurality of processors.

42. A memory for storing data for clustering a set of data points around k global centroid values, comprising:
- a plurality of local memories, each directly accessible by one of a plurality of processors intercoupled by a communication network; and
  
  a data structure stored in each local memory, the data structure including a data block comprising a unique subset of the set of data points, block accumulation values and global centroid values.
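The memory layout of claim 42 can be pictured as one record per processor holding its data block, its block accumulation values, and its copy of the global centroids. The sketch below is only an illustration of that idea: `LocalMemory` and `make_local_memories` are hypothetical names, the strided split is one arbitrary way to produce unique subsets, and the separate count field is an assumption not stated in the claim.

```python
from dataclasses import dataclass, field


@dataclass
class LocalMemory:
    """Hypothetical per-processor record with the three parts named in claim 42."""
    data_block: list          # this processor's unique subset of the data points
    global_centroids: list    # local copy of the k global centroid values
    block_sums: list = field(default_factory=list)    # per-centroid coordinate sums
    block_counts: list = field(default_factory=list)  # per-centroid counts (assumption)

    def reset_accumulators(self):
        """Zero the block accumulation values before a new assignment pass."""
        k, dim = len(self.global_centroids), len(self.global_centroids[0])
        self.block_sums = [[0.0] * dim for _ in range(k)]
        self.block_counts = [0] * k


def make_local_memories(points, num_processors, centroids):
    """Give each processor a disjoint slice of the points and its own centroid copy."""
    memories = []
    for rank in range(num_processors):
        block = points[rank::num_processors]  # strided split keeps the subsets disjoint
        mem = LocalMemory(data_block=block, global_centroids=[list(c) for c in centroids])
        mem.reset_accumulators()
        memories.append(mem)
    return memories


points = [(float(i), 0.0) for i in range(5)]
memories = make_local_memories(points, 2, [(0.0, 0.0), (4.0, 0.0)])
print([len(m.data_block) for m in memories])  # [3, 2]
```

Each record owns its own centroid list, mirroring the claim's point that every local memory is directly accessible by only one processor.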

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Dhillon, Inderjit Singh; Modha, Dharmendra Shantilal
Primary Examiner(s): Black, Thomas
Assistant Examiner(s): Jung, David Yiuk
Application Number: US09/179,027
Time in Patent Office: 1,009 Days
Field of Search: 707/3-6, 707/200-206, 707/101-103, 704/9, 704/2-8, 704/241, 725/116
US Class Current: 707/613
CPC Class Codes:
  G06F 17/18   for evaluating statistical ...
  G06F 18/23213   with fixed number of cluste...
  G06V 10/955   using specific electronic p...
  Y10S 707/922   Communications
  Y10S 707/959   Network
  Y10S 707/99936   Pattern matching access
  Y10S 707/99942   Manipulating data structure...
  Y10S 707/99943   Generating database or data...
