Computer implemented scalable, incremental and parallel clustering based on weighted divide and conquer

US 6,684,177 B2
Filed: 05/10/2001
Issued: 01/27/2004
Est. Priority Date: 05/10/2001
Status: Expired due to Term

Abstract

A technique that uses a weighted divide and conquer approach for clustering a set S of n data points to find k final centers. The technique comprises 1) partitioning the set S into P disjoint pieces S₁, . . . , S_P; 2) for each piece S_i, determining a set D_i of k intermediate centers; 3) assigning each data point in each piece S_i to the nearest one of the k intermediate centers; 4) weighting each of the k intermediate centers in each set D_i by the number of points in the corresponding piece S_i assigned to that center; and 5) clustering the weighted intermediate centers together to find said k final centers, the clustering performed using a specific error metric and a clustering method A.
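
The five steps of the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the abstract leaves "clustering method A" open, so a Lloyd-style weighted k-means is assumed here as a stand-in, and the function names (`weighted_kmeans`, `divide_and_conquer`), the strided partitioning, and the random initialization are all hypothetical choices.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Assumed stand-in for 'clustering method A': weighted Lloyd iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # requires len(points) >= k
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            j = nearest(p, centers)
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # A center that attracted no weight keeps its previous position.
        centers = [
            tuple(s / mass[j] for s in sums[j]) if mass[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers

def divide_and_conquer(points, k, num_pieces):
    # Step 1: partition S into P disjoint pieces (strided split, an assumption).
    pieces = [points[i::num_pieces] for i in range(num_pieces)]
    merged_centers, merged_weights = [], []
    for piece in pieces:
        # Step 2: k intermediate centers for this piece.
        centers = weighted_kmeans(piece, [1.0] * len(piece), k)
        # Steps 3 and 4: assign each point to its nearest center, count per center.
        counts = [0] * k
        for p in piece:
            counts[nearest(p, centers)] += 1
        merged_centers.extend(centers)
        merged_weights.extend(counts)
    # Step 5: cluster the weighted intermediate centers into k final centers.
    final = weighted_kmeans(merged_centers, merged_weights, k)
    return final, merged_centers, merged_weights
```

Calling `divide_and_conquer(points, k, num_pieces)` on a list of coordinate tuples returns the k final centers together with the merged intermediate centers and their weights; the weights always sum to the number of input points, since every point is counted exactly once.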

27 Citations

22 Claims

1. A method for clustering a set S of n data points to find k final centers, comprising:
- partitioning said set S into P disjoint pieces S₁, . . . , S_P;
  
  for each said piece S_i, determining a set D_i of k intermediate centers;
  
  assigning each data point in each piece S_i to the nearest one of said k intermediate centers;
  
  weighting each of said k intermediate centers in each set D_i by the number of points in the corresponding piece S_i assigned to that center; and
  
  clustering said weighted intermediate centers together to find said k final centers, said clustering performed using a specific error metric and a clustering method A; and
  
  wherein if P is not sufficiently large enough such that each piece S_i obeys the constraint |S_i| < M, where M is the size of a physical memory or a portion thereof to be used in processing said each piece, then iteratively performing partitioning, determining, assigning, and weighting until the sets D′ of weighted intermediate centers generated thereby obeys the constraint |D′| < M.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. A method according to claim 1 further comprising:
  - 3. The method according to claim 1 wherein P is sufficiently large enough such that each piece S_i obeys the constraint |S_i| < M, where M is the size of a physical memory or a portion thereof to be used in processing said each piece.
  - 4. The method according to claim 1 wherein said clustering is performed upon iteratively obtained weighted intermediate clusters.
  - 5. The method according to claim 1 wherein said set S is replaced by weighted intermediate centers of the previous iteration when iteratively performing said partitioning, determining, assigning, and weighting.
  - 6. The method according to claim 1 wherein said determining is performed using said specific error metric and said clustering method A.
  - 7. The method according to claim 1 wherein said specific error metric is the minimizing of the sum of the squares of the distances between points and their nearest centers.
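
The iterative reduction described in claims 1, 4 and 5 can be sketched as a loop that keeps replacing the working set with weighted intermediate centers until it holds fewer than M entries. The sketch below is illustrative only and works on 1-D data to stay short. Several details are assumptions not found in the claims: pieces are cut to M - 1 points, M must exceed 2k so that every pass shrinks the set, a weighted Lloyd iteration stands in for "clustering method A", and in later passes a center's weight is the total weight (rather than the raw count) of the entries assigned to it. All function names are hypothetical.

```python
def weighted_kmeans_1d(xs, ws, k, iters=25):
    """Assumed stand-in for 'clustering method A' on 1-D data."""
    order = sorted(set(xs))
    # Deterministic start: k values spread evenly over the sorted distinct inputs.
    centers = [order[(i * (len(order) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        tot = [0.0] * k
        mass = [0.0] * k
        for x, w in zip(xs, ws):
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            tot[j] += w * x
            mass[j] += w
        # A center that attracted no weight keeps its previous position.
        centers = [tot[j] / mass[j] if mass[j] > 0 else centers[j] for j in range(k)]
    return centers

def summarize(xs, ws, k, M):
    """One pass: cut into pieces with |S_i| < M, return weighted centers."""
    size = M - 1
    out_x, out_w = [], []
    for start in range(0, len(xs), size):
        px, pw = xs[start:start + size], ws[start:start + size]
        centers = weighted_kmeans_1d(px, pw, k)
        mass = [0.0] * k
        for x, w in zip(px, pw):
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            mass[j] += w  # weight carried over from earlier passes
        out_x.extend(centers)
        out_w.extend(mass)
    return out_x, out_w

def cluster_with_memory_bound(xs, k, M):
    if M <= 2 * k:
        raise ValueError("this sketch assumes M > 2k so each pass shrinks the set")
    ws = [1.0] * len(xs)
    # Repeat until the set of weighted centers satisfies |D'| < M.
    while len(xs) >= M:
        xs, ws = summarize(xs, ws, k, M)
    return weighted_kmeans_1d(xs, ws, k)
```

With M > 2k, each pass turns at most M - 1 entries into k weighted centers, so the working set shrinks by at least half per pass and the loop terminates.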

8. A method for clustering a set S of n data points to find k final centers, comprising:
- partitioning said set S into P disjoint pieces S₁, . . . , S_P;
  
  for each said piece S_i, determining a set D_i of k intermediate centers;
  
  assigning each data point in each piece S_i to the nearest one of said k intermediate centers;
  
  weighting each of said k intermediate centers in each set D_i by the number of points in the corresponding piece S_i assigned to that center;
  
  merging said weighted centers into a single dataset D′; and
  
  clustering said weighted intermediate centers together to find said k final centers, said clustering performed using a specific error metric and a clustering method A.
- View Dependent Claims (9, 10, 11, 12, 13, 14, 15, 16, 17)
- - 9. The method according to claim 8 wherein P is sufficiently large enough such that each piece S_i obeys the constraint |S_i| < M, where M is the size of a physical memory or a portion thereof to be used in processing said each piece.
  - 10. The method according to claim 8 wherein if P is not sufficiently large enough such that each piece S_i obeys the constraint |S_i| < M, where M is the size of a physical memory or a portion thereof to be used in processing said each piece, then iteratively performing partitioning, determining, assigning, and weighting until the sets D′ of weighted intermediate centers generated thereby obeys the constraint |D′| < M.
  - 11. The method according to claim 10 wherein said clustering is performed upon iteratively obtained weighted intermediate clusters.
  - 12. The method according to claim 10 wherein said set S is replaced by weighted intermediate centers of the previous iteration when iteratively performing said partitioning, determining, assigning, and weighting.
  - 13. The method according to claim 8 wherein said determining is performed using said specific error metric and said clustering method A.
  - 14. The method according to claim 8 wherein said specific error metric is the minimizing of the sum of the squares of the distances between points and their nearest centers.
  - 15. The method according to claim 14 wherein the distance is the Euclidean distance.
  - 16. A method according to claim 8 further comprising:
  - 17. The method according to claim 8 wherein said partitioning, determining, assigning and weighting is performed in parallel for each piece S_i.
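
Claims 14 and 15 name the error metric as the sum of squared Euclidean distances between points and their nearest centers. Written out, it might look as follows. The symbols C, d_j and w_j are introduced here for illustration, and the second formula assumes (the claims do not say so explicitly) that each intermediate center's weight simply multiplies its squared distance in the final clustering step.

```latex
% Per-piece objective: sum of squared Euclidean distances to the nearest center
\[
  \mathrm{cost}(S_i, D_i) = \sum_{x \in S_i} \min_{c \in D_i} \lVert x - c \rVert_2^2
\]
% Final step on the merged set D' of weighted intermediate centers (d_j, w_j),
% with C denoting the set of k final centers
\[
  \mathrm{cost}_w(D', C) = \sum_{(d_j, w_j) \in D'} w_j \min_{c \in C} \lVert d_j - c \rVert_2^2
\]
```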

18. An article comprising a computer readable medium having instructions stored thereon which when executed causes clustering a set S of n data points to find k final centers, said clustering implemented by:
- partitioning said set S into P disjoint pieces S₁, . . . , S_P;
  
  for each said piece S_i, determining a set D_i of k intermediate centers;
  
  assigning each data point in each piece S_i to the nearest one of said k intermediate centers;
  
  weighting each of said k intermediate centers in each set D_i by the number of points in the corresponding piece S_i assigned to that center;
  
  merging said weighted centers into a single dataset D′; and
  
  clustering said weighted intermediate centers together to find said k final centers, said clustering performed using a specific error metric and a clustering method A.
- View Dependent Claims (19, 20)
- - 19. The article according to claim 18 wherein P is sufficiently large enough such that each piece S_i obeys the constraint |S_i| < M, where M is the size of a physical memory or a portion thereof to be used in processing said each piece.
  - 20. The article according to claim 18 wherein if P is not sufficiently large enough such that each piece S_i obeys the constraint |S_i| < M, where M is the size of a physical memory or a portion thereof to be used in processing said each piece, then iteratively performing partitioning, determining, assigning, and weighting until the sets D′ of weighted intermediate centers generated thereby obeys the constraint |D′| < M.

21. An apparatus for clustering a set S of n data points to find k final centers, said apparatus comprising:
- a main memory;
  
  at least one processor coupled to said memory, wherein at least one processor is configured to partition said set S into P disjoint pieces S₁, . . . , S_P such that each piece S_i fits in main memory, said each piece S_i first stored separately in said main memory and then clustered by said processor performing;
  
  for each said piece S_i, determining a set D_i of k intermediate centers;
  
  assigning each data point in each piece S_i to the nearest one of said k intermediate centers;
  
  weighting each of said k intermediate centers in each set D_i by the number of points in the corresponding piece S_i assigned to that center;
  
  merging said weighted centers into a single dataset D′; and
  
  clustering said weighted intermediate centers together to find said k final centers, said clustering performed using a specific error metric and a clustering method A.
- View Dependent Claims (22)
- - 22. The apparatus of claim 21, wherein there are a plurality of processors.
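
Because each piece is summarized independently, the per-piece work can be handed to separate workers, which is the situation claims 17 and 22 describe. The following is a rough sketch on 1-D data using Python's standard `concurrent.futures` module. A thread pool is used only to keep the example self-contained; for CPU-bound work a `ProcessPoolExecutor` would be the more natural choice. The function names, the strided split, and the weighted Lloyd iteration standing in for "clustering method A" are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def kmeans_1d(xs, ws, k, iters=25):
    """Assumed stand-in for 'clustering method A' on 1-D data."""
    order = sorted(set(xs))
    # Deterministic start: k values spread evenly over the sorted distinct inputs.
    centers = [order[(i * (len(order) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        tot = [0.0] * k
        mass = [0.0] * k
        for x, w in zip(xs, ws):
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            tot[j] += w * x
            mass[j] += w
        # A center that attracted no weight keeps its previous position.
        centers = [tot[j] / mass[j] if mass[j] > 0 else centers[j] for j in range(k)]
    return centers

def summarize_piece(args):
    """Determine, assign and weight for one piece; independent of other pieces."""
    piece, k = args
    centers = kmeans_1d(piece, [1.0] * len(piece), k)
    counts = [0] * k
    for x in piece:
        j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
        counts[j] += 1
    return centers, counts

def parallel_cluster(xs, k, num_pieces, max_workers=4):
    # Partition S into P disjoint, non-empty pieces (needs num_pieces <= len(xs)).
    pieces = [xs[i::num_pieces] for i in range(num_pieces)]
    # Each piece is summarized by its own worker.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(summarize_piece, [(p, k) for p in pieces]))
    # Merge the weighted intermediate centers into a single dataset D'.
    merged_x = [c for centers, _ in results for c in centers]
    merged_w = [float(n) for _, counts in results for n in counts]
    # Final clustering runs once, on the merged weighted centers.
    return kmeans_1d(merged_x, merged_w, k), merged_w
```

Only the small set of weighted centers has to be gathered in one place, so the final step stays cheap even when the pieces themselves are large.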

Current Assignee: Micro Focus LLC (Open Text Corporation)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Guha, Sudipto; Mishra, Nina; Motwani, Rajeev; OʼCallaghan, Liadan
Primary Examiner(s): Hoff, Marc S.
Assistant Examiner(s): Barbee, Manuel L.
Application Number: US09/854,212
Publication Number: US 20020183966A1
Time in Patent Office: 992 Days
Field of Search: 702/80, 702/97, 702/158, 702/179-181, 702/187, 707/6, 707/7
US Class Current: 702/179
CPC Class Codes:
G06F 18/23   Clustering techniques
Y10S 707/99936   Pattern matching access
Y10S 707/99937   Sorting
