SYSTEM AND METHOD FOR ANALYZING RESULT OF CLUSTERING MASSIVE DATA
First Claim
1. A system for analyzing a result of clustering massive data, the system comprising:
a task management apparatus configured to divide a clustered target file into blocks of a pre-designated size, and generate an input split corresponding to a task pair for a reduce task for reducing input data by combining the divided blocks;
at least one distance calculation apparatus configured to receive allocation of the input split, and calculate a distance sum for each record between blocks included in the input split;
at least one index coefficient calculation apparatus configured to calculate a clustering significance verification index coefficient for each record by using the distance sum for each record received from the at least one distance calculation apparatus; and
an analysis apparatus configured to calculate a final significance verification index coefficient of a corresponding cluster, by averaging the clustering significance verification index coefficient for each record.
Abstract
Disclosed are a system and a method for analyzing a result of clustering massive data. Hadoop, an open-source map/reduce framework, is used to calculate a silhouette coefficient, which serves as a significance verification index for evaluating a result of clustering massive data. To implement the system and the method, the clustered data is divided into blocks, and input splits are generated for all of the blocks. The generated input splits are then assigned to multiple computers. Each computer stores in memory only the data of the blocks included in its assigned input split, and calculates a silhouette coefficient for each record. Each computer provides only the calculated silhouette coefficient to an index coefficient calculation apparatus, which in turn calculates a silhouette coefficient for a cluster. Therefore, the result of clustering the massive data can be rapidly and objectively analyzed.
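For reference, the silhouette coefficient named in the abstract is conventionally defined as follows. This is the standard textbook definition, given here as an aid to the reader; the excerpt does not reproduce the formula used in the specification, so the symbols d, C_i, a, b, s and S are assumptions of this note. Singleton clusters are conventionally assigned s(i) = 0.

```latex
% d(i,j): distance between records i and j;  C_i: cluster containing record i
\[
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(i,j),
\qquad
b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i,j)
\]
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
S(C) = \frac{1}{|C|} \sum_{i \in C} s(i)
\]
```

Both a(i) and b(i) are built from per-record distance sums, which is why those sums can be accumulated block by block and combined afterwards.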
13 Citations
10 Claims
2. A task management apparatus for analyzing a result of clustering massive data, the task management apparatus comprising:
a block generator configured to divide a clustered target file registered in a Hadoop Distributed File System (HDFS) into designated-sized blocks;
an input split generator configured to combine the divided blocks, and generate an input split corresponding to a task pair for a reduce task for reducing input data; and
an input split assigner configured to assign the generated input split to at least one distance calculation apparatus recognized in an identical network.
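The block division and input split generation recited in claim 2 might be sketched as below. This is a minimal illustration under stated assumptions, not the claimed implementation: it reads "combining the divided blocks" as pairing every block with itself and with every later block, and the names make_blocks and make_input_splits are hypothetical. No Hadoop API is used.

```python
from itertools import combinations_with_replacement


def make_blocks(records, block_size):
    """Divide the clustered records into consecutive blocks of a fixed size."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def make_input_splits(num_blocks):
    """Return one split per unordered pair of block indices.

    Assumption: a split pairs a block with itself or with a later block, so
    every pair of records is covered by exactly one split.
    """
    return list(combinations_with_replacement(range(num_blocks), 2))


# Hypothetical records: (record_id, cluster_id, point)
records = [(f"r{i}", i % 2, (float(i), 0.0)) for i in range(10)]
blocks = make_blocks(records, block_size=4)
splits = make_input_splits(len(blocks))
print(len(blocks), splits)
```

Under this reading, n blocks yield n(n+1)/2 splits, and each split can be handed to a different distance calculation apparatus.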
5. A distance calculation apparatus for analyzing a result of clustering massive data, the distance calculation apparatus comprising:
a data acquirer configured to receive allocation of an input split corresponding to a task pair for a reduce task for reducing input data, and read all records of blocks included in the input split from a Hadoop Distributed File System (HDFS);
a memory unit configured to store all the acquired records of the blocks;
a calculator configured to calculate a distance sum for each record between the blocks, and store the calculated distance sum for each record in the memory unit; and
a data output unit configured to output the distance sum for each record.
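The per-record distance sum of claim 5 might be computed for a single input split as sketched below. The assumptions are loud ones: records are (record_id, cluster_id, point) tuples, the distance is Euclidean, sums are keyed by (record_id, cluster_id of the other record), and the function name distance_sums_for_split is hypothetical. The claim itself fixes none of these.

```python
import itertools
import math
from collections import defaultdict


def distance_sums_for_split(block_a, block_b=None):
    """Sum, for each record, its distances to the records of each cluster.

    Pass block_b=None when the split pairs a block with itself, so that each
    pair of records inside the block is visited once.
    """
    sums = defaultdict(float)
    if block_b is None:
        pairs = itertools.combinations(block_a, 2)
    else:
        pairs = itertools.product(block_a, block_b)
    for (id_p, cl_p, pt_p), (id_q, cl_q, pt_q) in pairs:
        d = math.dist(pt_p, pt_q)      # Euclidean distance (assumed metric)
        sums[(id_p, cl_q)] += d        # p's distance to a member of q's cluster
        sums[(id_q, cl_p)] += d        # q's distance to a member of p's cluster
    return dict(sums)


# Hypothetical data: two records in cluster 0, one record in cluster 1
block_a = [("r0", 0, (0.0, 0.0)), ("r1", 0, (3.0, 4.0))]
block_b = [("r2", 1, (0.0, 8.0))]
within = distance_sums_for_split(block_a)
across = distance_sums_for_split(block_a, block_b)
print(within, across)
```

Only the records of the two blocks in the split are held in memory at once, and only the sums leave the function, which matches the memory unit and data output unit recited above.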
7. A method of analyzing a result of clustering massive data, the method comprising:
dividing a clustered target file into blocks of a pre-designated size;
generating an input split corresponding to a task pair for a reduce task for reducing input data by combining the divided blocks;
storing all records of the blocks included in the input split into a memory, and outputting a distance sum for each record;
calculating a clustering significance verification index coefficient for each record by using the distance sum for each record; and
defining a clustering significance verification index coefficient by averaging the clustering significance verification index coefficient for each record.
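The last two steps of claim 7 might look like the following sketch, which merges partial distance sums, derives a per-record coefficient, and averages it per cluster. It assumes the coefficient is the standard silhouette coefficient (as the abstract suggests), that at least two clusters exist, and that singleton clusters score 0. The function name silhouette_by_cluster and the input layout are hypothetical.

```python
import math
from collections import defaultdict


def silhouette_by_cluster(partial_sums, cluster_of, cluster_size):
    """Average the per-record silhouette coefficient within each cluster.

    partial_sums: iterable of {(record_id, cluster_id): distance_sum} dicts
    cluster_of:   {record_id: cluster_id}
    cluster_size: {cluster_id: number of records}; needs two or more clusters
    """
    total = defaultdict(float)
    for part in partial_sums:              # merge sums from all input splits
        for key, value in part.items():
            total[key] += value

    per_cluster = defaultdict(list)
    for rid, own in cluster_of.items():
        if cluster_size[own] == 1:
            s = 0.0                        # convention for singleton clusters
        else:
            a = total[(rid, own)] / (cluster_size[own] - 1)
            b = min(total[(rid, c)] / n
                    for c, n in cluster_size.items() if c != own)
            s = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
        per_cluster[own].append(s)
    return {c: sum(v) / len(v) for c, v in per_cluster.items()}


# Hypothetical partial sums for r0=(0,0), r1=(3,4) in cluster 0, r2=(0,8) in cluster 1
partials = [
    {("r0", 0): 5.0, ("r1", 0): 5.0},
    {("r0", 1): 8.0, ("r1", 1): 5.0, ("r2", 0): 13.0},
]
result = silhouette_by_cluster(
    partials,
    cluster_of={"r0": 0, "r1": 0, "r2": 1},
    cluster_size={0: 2, 1: 1},
)
print(result)
```

Because the merge is a plain addition keyed by record and cluster, it fits a reduce task: the order in which partial sums arrive does not affect the result.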
Specification