Method for refining the initial conditions for clustering with applications to small and large database clustering
First Claim
1. A method for evaluating data in a database that is stored on a storage medium, wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, comprising the steps of:
- a) obtaining a multiple number of data subsets comprising a plurality of records from the data set,b) performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points;
c) choosing one of the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and
d) using the one chosen candidate clustering starting point as a starting point to perform clustering analysis on the data set to be evaluated.
2 Assignments
0 Petitions
Accused Products
Abstract
As an optimization problem, clustering data (unsupervised learning) is known to be a difficult problem. Most practical approaches use a heuristic, typically gradient-descent, algorithm to search for a solution in the huge space of possible solutions. Such methods are by definition sensitive to starting points. It has been well-known that clustering algorithms are extremely sensitive to initial conditions. Most methods for guessing an initial solution simply make random guesses. In this paper we present a method that takes an initial condition and efficiently produces a refined starting condition. The method is applicable to a wide class of clustering algorithms for discrete and continuous data. In this paper we demonstrate how this method is applied to the popular K-means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. The technique can be used as an initializer for other clustering solutions. The method is based on an efficient technique for estimating the modes of a distribution and runs in time guaranteed to be less than overall clustering time for large data sets. The method is also scalable and hence can be efficiently used on huge databases to refine starting points for scalable clustering algorithms in data mining applications.
103 Citations
26 Claims
-
1. A method for evaluating data in a database that is stored on a storage medium, wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, comprising the steps of:
-
a) obtaining a multiple number of data subsets comprising a plurality of records from the data set, b) performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points; c) choosing one of the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and d) using the one chosen candidate clustering starting point as a starting point to perform clustering analysis on the data set to be evaluated. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 23, 24, 25, 26)
-
-
11. In a computer data mining system, apparatus for evaluating data in a database comprising:
-
a) one or more data storage devices for storing a database of records on a storage medium; b) a computer having an interface to the storage devices for reading data from the storage medium and bring the data into a rapid access memory for subsequent evaluation; and c) said computer comprising a processing unit for evaluating at least some of the data in the database and for clustering the data into multiple numbers of data clusters;
said processing unit programmed to retrieve multiple subsets of data from the database, find multiple candidate clustering starting points from the multiple data subsets retrieved from the database and choosing an optimum solution from the multiple number of candidate clustering starting points to begin subsequent clustering on data in the database.
-
-
12. In a computer database system, a method for use in choosing starting conditions in a data clustering procedure comprising the steps of:
-
a) choosing a cluster number K for use in categorizing the data in the database into K different data clusters; b) choosing an initial centroid for each of the K data clusters; c) sampling a portion of the data in the database from a storage medium and performing a clustering on the data sampled from the database based on the K centroids to form K characterizations of the database; d) repeating the sampling and clustering steps until a plurality of clustering solutions have been determined; and e) choosing a best solution from said plurality of clustering solutions to use as a starting point in further clustering of data from the database.
-
-
13. Apparatus for evaluating data in a database that is stored on a storage medium, wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, the apparatus comprising:
-
a) means for obtaining a multiple number of data subsets comprising a plurality of records from the data set, b) means for performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points; c) means for choosing one of the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and d) means for using the chosen starting point to perform clustering analysis on the data set to be evaluated. - View Dependent Claims (14, 15, 16, 17)
-
-
18. A computer-readable medium having computer-executable instructions for performing steps for evaluating a database wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, comprising the steps of:
-
a) obtaining a multiple number of data subsets comprising a plurality of records from the data set, b) performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points; c) choosing one of the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and d) using the chosen starting point to perform clustering analysis on the data set to be evaluated.
-
-
19. A method for evaluating data in a database that is stored on a storage medium, wherein the database has a data set to be evaluated, and wherein the data set is comprised of a plurality of records, comprising the steps of:
-
a) obtaining a multiple number of data subsets comprising a plurality of records from the data set, b) performing clustering analysis on the data records that make up each of the subsets to provide a multiple number of candidate clustering starting points; c) choosing a clustering starting point based on the multiple candidate clustering starting points to be used in clustering the data set to be evaluated; and d) using the chosen starting point to perform clustering analysis on the data set to be evaluated. - View Dependent Claims (20, 21, 22)
-
Specification