Method for refining the initial conditions for clustering with applications to small and large database clustering

US 6,115,708 A
Filed: 03/04/1998
Issued: 09/05/2000
Est. Priority Date: 03/04/1998
Status: Expired due to Term

Abstract

As an optimization problem, clustering data (unsupervised learning) is known to be a difficult problem. Most practical approaches use a heuristic, typically gradient-descent, algorithm to search for a solution in the huge space of possible solutions. Such methods are by definition sensitive to starting points. It is well known that clustering algorithms are extremely sensitive to initial conditions. Most methods for guessing an initial solution simply make random guesses. In this paper we present a method that takes an initial condition and efficiently produces a refined starting condition. The method is applicable to a wide class of clustering algorithms for discrete and continuous data. In this paper we demonstrate how this method is applied to the popular K-means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. The technique can be used as an initializer for other clustering solutions. The method is based on an efficient technique for estimating the modes of a distribution and runs in time guaranteed to be less than overall clustering time for large data sets. The method is also scalable and hence can be efficiently used on huge databases to refine starting points for scalable clustering algorithms in data mining applications.
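
For orientation, the optimization problem behind K-means is usually written as minimizing the total squared distance from each record to its nearest centroid. The notation below (records $x_i$, centroids $c_k$, counts $n$ and $K$) is assumed here for illustration and is not taken from the patent text.

```latex
\[
E(c_1,\dots,c_K) \;=\; \sum_{i=1}^{n} \; \min_{1 \le k \le K} \lVert x_i - c_k \rVert^{2}
\]
```

Each K-means iteration assigns records to their nearest centroid and then moves every centroid to the mean of its members, which never increases $E$. The iteration therefore stops at a local minimum, and which local minimum it reaches depends on the starting centroids. That dependence is the sensitivity to initial conditions the abstract describes.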

26 Claims

1. A method for evaluating data in a database that is stored on a storage medium, wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, comprising the steps of:
- a) obtaining a multiple number of data subsets comprising a plurality of records from the data set,
  
  b) performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points;
  
  c) choosing one of the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and
  
  d) using the one chosen candidate clustering starting point as a starting point to perform clustering analysis on the data set to be evaluated.
- - 2. The method of claim 1 wherein the step of performing clustering analysis on each of the subsets is performed from an initial starting point which is refined to determine said candidate clustering starting point.
  - 3. The method of claim 1 wherein the step of choosing one of the candidate clustering starting points comprises the step of performing additional data clustering on the multiple number of candidate clustering starting points.
  - 4. The method of claim 1 wherein the step of choosing one of the candidate clustering starting points comprises the step of performing additional data clustering on the multiple number of candidate clustering starting points and wherein one of said multiple number of candidate clustering starting points is chosen as a refined clustering starting point to perform the data clustering.
  - 5. The method of claim 2 wherein the initial starting point is randomly determined.
  - 6. The method of claim 2 wherein the step of performing clustering analysis on each of the subsets is performed by removing sparsely populated clusters, merging the sparsely populated clusters with clusters that are more densely populated and determining a new initial starting point for subsequent clustering of data.
  - 7. The method of claim 2 wherein if at the end of the clustering of each of the subsets, any of the clusters have zero membership, then the corresponding initial choice for this empty cluster centroid is adjusted to a solution set to a data point in the sample farthest from the initial choice that resulted in the empty cluster.
  - 8. The method of claim 2 wherein if at the end of the clustering of each of the subsets any of the clusters have zero membership a new cluster mean for the empty cluster is chosen from the mean of the entire data sample.
  - 9. The method of claim 2 wherein if at the end of the clustering the data records that make up each of the subsets, any of the clusters have zero membership a new cluster mean for the empty cluster is chosen by picking the mean of the entire data set and perturbing that mean by a small random amount corresponding to a variance in the data mean in each dimension of the data sample.
  - 10. The method of claim 1 wherein the step of performing a clustering analysis on each of the subsets uses a different clustering process than the clustering process that is used to perform clustering analysis on the data set to be evaluated.
  - 23. The method of claim 1 wherein the step of choosing one of the multiple candidate clustering starting points is performed by:
    - a) choosing a candidate clustering starting point;
      
      b) using the chosen candidate clustering starting point to perform an interim clustering analysis on an interim data set;
      
      c) determining the quality of the cluster solution resulting from the interim clustering analysis;
      
      d) repeating steps a) through c) until all candidate clustering starting points have been used in interim clustering analysis; and
      
      e) selecting the candidate starting point which yielded the best quality cluster solution as a refined starting point for performing clustering analysis on the data set to be evaluated.
  - 24. The method of claim 23 wherein the interim data set is a plurality of unchosen candidate starting points.
  - 25. The method of claim 23 wherein the interim data set is a plurality of data records that make up the data subsets.
  - 26. The method of claim 23 wherein the step of determining the quality of the interim clustering analysis is performed by calculating a degree of fit of the chosen candidate starting point with the interim data set.
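
The empty-cluster adjustments described in claims 7 through 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `reseed_empty`, the `strategy` labels, and the `scale` factor applied to the per-dimension standard deviation are all hypothetical, and the sample mean stands in for the "mean of the entire data set" mentioned in claim 9.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def reseed_empty(points, initial, counts, strategy="farthest", scale=0.01, rng=None):
    """Return a copy of `initial` with new choices for clusters that ended empty.

    points   : records in the sample (lists of floats)
    initial  : the initial centroids that were used to cluster the sample
    counts   : membership count of each cluster at the end of clustering
    strategy : "farthest" (claim 7), "mean" (claim 8) or "perturbed" (claim 9)
    scale    : hypothetical factor that keeps the claim 9 perturbation small
    """
    rng = rng or random.Random(0)
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dim)]

    adjusted = [list(c) for c in initial]
    for k, members in enumerate(counts):
        if members > 0:
            continue  # cluster k has members, keep its initial choice
        if strategy == "farthest":
            # Sample point farthest from the initial choice that ended up empty.
            adjusted[k] = list(max(points, key=lambda p: sq_dist(p, initial[k])))
        elif strategy == "mean":
            # Mean of the whole sample.
            adjusted[k] = list(mean)
        elif strategy == "perturbed":
            # Sample mean plus small noise scaled by each dimension's spread.
            adjusted[k] = [m + rng.gauss(0.0, scale * v ** 0.5)
                           for m, v in zip(mean, var)]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return adjusted
```

A caller would cluster the subset again from the adjusted starting point. Clusters that already have members keep their original initial choice.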

11. In a computer data mining system, apparatus for evaluating data in a database comprising:
- a) one or more data storage devices for storing a database of records on a storage medium;
  
  b) a computer having an interface to the storage devices for reading data from the storage medium and bringing the data into a rapid access memory for subsequent evaluation; and
  
  c) said computer comprising a processing unit for evaluating at least some of the data in the database and for clustering the data into multiple numbers of data clusters;
  
  said processing unit programmed to retrieve multiple subsets of data from the database, find multiple candidate clustering starting points from the multiple data subsets retrieved from the database, and choose an optimum solution from the multiple number of candidate clustering starting points to begin subsequent clustering on data in the database.

12. In a computer database system, a method for use in choosing starting conditions in a data clustering procedure comprising the steps of:
- a) choosing a cluster number K for use in categorizing the data in the database into K different data clusters;
  
  b) choosing an initial centroid for each of the K data clusters;
  
  c) sampling a portion of the data in the database from a storage medium and performing a clustering on the data sampled from the database based on the K centroids to form K characterizations of the database;
  
  d) repeating the sampling and clustering steps until a plurality of clustering solutions have been determined; and
  
  e) choosing a best solution from said plurality of clustering solutions to use as a starting point in further clustering of data from the database.

13. Apparatus for evaluating data in a database that is stored on a storage medium, wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, the apparatus comprising:
- a) means for obtaining a multiple number of data subsets comprising a plurality of records from the data set,
  
  b) means for performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points;
  
  c) means for choosing one of the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and
  
  d) means for using the chosen starting point to perform clustering analysis on the data set to be evaluated.
- - 14. The apparatus of claim 13 additionally comprising means to randomly choose an initial starting point for use in clustering the multiple data subsets.
  - 15. The apparatus of claim 13 additionally comprising means to choose an initial starting point for use in performing clustering analysis on the multiple data subsets based on data contained in the database.
  - 16. The apparatus of claim 13 wherein the means for choosing comprises means for clustering the multiple number of candidate clustering starting points using each starting point as an interim starting point for clustering said starting points.
  - 17. The apparatus of claim 16 wherein the means for choosing comprises means for determining a best refined candidate clustering starting point from an intermediate clustering of the data sample solutions based upon a distance from the data points of said multiple solutions to a set of intermediate clustering solutions.

18. A computer-readable medium having computer-executable instructions for performing steps for evaluating a database wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, comprising the steps of:
- a) obtaining a multiple number of data subsets comprising a plurality of records from the data set,
  
  b) performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points;
  
  c) choosing one of the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and
  
  d) using the chosen starting point to perform clustering analysis on the data set to be evaluated.

19. A method for evaluating data in a database that is stored on a storage medium, wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, comprising the steps of:
- a) obtaining a multiple number of data subsets comprising a plurality of records from the data set,
  
  b) performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points;
  
  c) choosing a clustering starting point based on the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and
  
  d) using the chosen starting point to perform clustering analysis on the data set to be evaluated.
- - 20. The method of claim 19 wherein each of the multiple number of candidate clustering starting points is used as an intermediate starting point for clustering the candidate starting points and wherein a resulting solution of said clustering of candidate starting points is used as a refined clustering starting point.
  - 21. The method of claim 20 wherein the refined clustering starting point is chosen by determining a distance from multiple intermediate clustering solutions and a set of data points made up of the candidate clustering starting points.
  - 22. The method of claim 21 wherein the refined clustering starting point is an optimum solution to the clustering of the multiple candidate clustering starting points wherein optimum is based on said distance.
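
One possible reading of claims 19 through 22 is sketched below, assuming K-means as the clustering process and the sum of squared distances as the "distance" of claims 21 and 22. The names `kmeans`, `distortion`, and `refine`, and the parameters `n_subsets` and `subset_size`, are hypothetical choices for this illustration and are not defined in the claims.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, start, iters=50):
    """Plain K-means iterations from `start`; an empty cluster keeps its centroid."""
    centroids = [list(c) for c in start]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break  # assignments are stable, so the centroids stopped moving
        centroids = updated
    return centroids


def distortion(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def refine(data, start, n_subsets=5, subset_size=20, rng=None):
    """Return a refined starting point derived from clustering small subsets."""
    rng = rng or random.Random(0)
    # Steps a) and b): cluster each random subset from the same initial point.
    candidates = [kmeans(rng.sample(data, subset_size), start)
                  for _ in range(n_subsets)]
    # Claim 20: cluster the pooled candidate points, using each candidate
    # in turn as the intermediate starting point.
    pool = [c for cand in candidates for c in cand]
    solutions = [kmeans(pool, cand) for cand in candidates]
    # Claims 21 and 22: keep the solution with the smallest distance to the pool.
    return min(solutions, key=lambda s: distortion(pool, s))
```

The refined point returned by `refine(data, start)` would then seed the full clustering of step d), for example `kmeans(data, refine(data, start))`. Because only small subsets and the small pool of candidates are clustered during refinement, the extra work stays modest compared with clustering the full data set.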

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Fayyad, Usama; Bradley, Paul S.
Primary Examiner(s)
Breene, John E.
Assistant Examiner(s)
Robinson, Greta Lee

Application Number
US09/034,834
Time in Patent Office
916 Days
Field of Search
707/1, 707/2, 707/5, 707/6, 707/104, 707/3, 707/4
US Class Current
1/1
CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 2216/03   Data mining

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access
