Method for determining a quality for a data clustering and data processing system

US 20030182082A1
Filed: 12/09/2002
Published: 09/25/2003
Est. Priority Date: 03/16/2002
Status: Active Grant

First Claim

1. A method for determining a quality for a data clustering, said data clustering resulting in a plurality of clusters each cluster having a cluster identifier, the method comprising the steps of:

determining a set of observed values for at least one of the clusters by mapping the cluster identifier of said one of the clusters to a first predefined value and by mapping the cluster identifiers of other clusters to a second predefined value, and

calculating a normalized statistical coefficient based on the set of observed values to determine the quality for said one of the clusters.
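
The mapping step of the claim can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `observed_values` and the choice of 1.0 and 0.0 as the first and second predefined values are assumptions, since the claim only requires two predefined values.

```python
from typing import Hashable, List, Sequence


def observed_values(
    cluster_ids: Sequence[Hashable],
    target: Hashable,
    first_value: float = 1.0,   # assumed "first predefined value"
    second_value: float = 0.0,  # assumed "second predefined value"
) -> List[float]:
    """Map each record's cluster identifier to one of two predefined values.

    Records in the target cluster get first_value; records in every
    other cluster get second_value.
    """
    return [first_value if cid == target else second_value for cid in cluster_ids]


# One cluster identifier per record, as produced by some clustering run.
cluster_ids = ["A", "B", "A", "C", "B"]
print(observed_values(cluster_ids, "A"))  # [1.0, 0.0, 1.0, 0.0, 0.0]
```

Repeating the call with each cluster identifier as `target` gives one set of observed values per cluster.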

1 Assignment

0 Petitions

Abstract

This invention relates to a method for determining a quality for a data clustering, said data clustering resulting in a plurality of clusters each cluster having a cluster identifier, the method comprising the steps of:

determining a set of observed values for at least one of the clusters by mapping the cluster identifier of said one of the clusters to a first predefined value and by mapping the cluster identifiers of other clusters to a second predefined value, and

calculating a normalized statistical coefficient based on the set of observed values to determine the quality for said one of the clusters.
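
The claims name the R squared coefficient as one normalized statistical coefficient. The sketch below uses the standard definition, 1 - SS_res / SS_tot. This excerpt does not state how predicted values are obtained, so the code assumes they come from some model fitted on the record attributes; the function name `r_squared` and the sample numbers are hypothetical.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Standard R squared: 1 - (residual sum of squares / total sum of squares)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equally long")
    mean_observed = sum(observed) / len(observed)
    ss_tot = sum((y - mean_observed) ** 2 for y in observed)
    if ss_tot == 0:
        # Assumption: a cluster holding all or none of the records has no
        # variance in its observed values, so the coefficient is undefined.
        raise ValueError("observed values have zero variance")
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


# Observed values for one cluster (1.0 = in the cluster, 0.0 = elsewhere)
# and hypothetical model predictions for the same records.
observed = [1.0, 0.0, 1.0, 0.0, 0.0]
predicted = [0.9, 0.2, 0.8, 0.1, 0.0]
print(r_squared(observed, predicted))
```

A value near 1 means the records of that cluster are well separated from the rest under the assumed model; a value near 0 means the model does no better than predicting the mean.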

16 Citations

14 Claims

1. A method for determining a quality for a data clustering, said data clustering resulting in a plurality of clusters each cluster having a cluster identifier, the method comprising the steps of:
- determining a set of observed values for at least one of the clusters by mapping the cluster identifier of said one of the clusters to a first predefined value and by mapping the cluster identifiers of other clusters to a second predefined value, and
- calculating a normalized statistical coefficient based on the set of observed values to determine the quality for said one of the clusters.
- Dependent claims (2, 3, 4, 5, 6, 7, 13, 14):
  - 2. The method of claim 1, whereby the normalized statistical coefficient being the R squared coefficient, which is calculated on the basis of the set of observed values for said one of the clusters.
  - 3. The method of claim 2, whereby the set of observed values is determined for each of the clusters and whereby the normalized statistical coefficient is calculated for each of the clusters based on the respective sets of observed values.
  - 4. The method of claim 3, further comprising calculating an overall quality for the data clustering on the basis of the normalized statistical coefficients of the clusters.
  - 5. The method of claim 4, whereby the overall quality is calculated by determining a weighted average of the normalized statistical coefficients of the clusters, whereby the weighting factor for each cluster is the number of records within that cluster.
  - 6. A method of data clustering comprising the steps of:
    - performing a first data clustering by means of a first data clustering method,
    - determining the quality for the first data clustering by means of a method in accordance with any one of the preceding claims 1 to 5,
    - selecting at least one cluster with a relatively low normalized statistical coefficient,
    - performing a second data clustering by means of the first data clustering method or by means of a second data clustering method with respect to the selected cluster, and
    - determining the quality of the second data clustering with respect to the selected cluster.
  - 7. The method of claim 6, whereby the steps of selecting of at least one of the clusters, applying the first or the second data clustering method and determining the quality with respect to the selected cluster are performed iteratively.
  - 13. A computer-readable storage medium tangibly embodying a program of computer instructions for performing a method in accordance with any one of the preceding claims 1 to 5.
  - 14. A computer-readable storage medium tangibly embodying a program of computer instructions for performing a method in accordance with claim 6.
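
Claims 4 to 6 describe an overall quality computed as a record-count-weighted average of the per-cluster coefficients, followed by selection of a cluster with a relatively low coefficient. A minimal sketch, assuming the coefficients are already available per cluster: the function names are hypothetical, and picking the single lowest coefficient is only one possible reading of "relatively low".

```python
from typing import Dict, Hashable


def overall_quality(
    coefficients: Dict[Hashable, float],
    record_counts: Dict[Hashable, int],
) -> float:
    """Weighted average of per-cluster coefficients, weighted by cluster size."""
    total_records = sum(record_counts[c] for c in coefficients)
    if total_records == 0:
        raise ValueError("clusters contain no records")
    weighted_sum = sum(coefficients[c] * record_counts[c] for c in coefficients)
    return weighted_sum / total_records


def lowest_quality_cluster(coefficients: Dict[Hashable, float]) -> Hashable:
    """Return the cluster identifier with the smallest coefficient.

    Assumption: "relatively low" is read here as "the minimum".
    """
    return min(coefficients, key=coefficients.get)


# Hypothetical per-cluster coefficients and record counts.
coefficients = {"A": 0.9, "B": 0.5, "C": 0.2}
record_counts = {"A": 50, "B": 30, "C": 20}
print(overall_quality(coefficients, record_counts))
print(lowest_quality_cluster(coefficients))  # C
```

Weighting by record count keeps a small, poorly separated cluster from dominating the overall figure, while the per-cluster minimum still flags it as a candidate for re-clustering.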

8. A data processing system comprising:
- means (8) for storing a number of records,
- means (9, 10) for performing a data clustering of the records into a plurality of clusters each having a cluster identifier,
- means (11) for determining a set of observed values for each of the clusters by mapping the cluster identifier of a given cluster to a first predefined value and by mapping the cluster identifiers of other clusters to a second predefined value, and
- means (11) for calculating a normalized statistical coefficient based on the set of observed values.
- Dependent claims (9, 10, 11, 12):
  - 9. The data processing system of claim 8, wherein the means for calculating the normalized statistical coefficient are adapted to calculate the R squared coefficient.
  - 10. The data processing system of claim 9, further comprising means (12) for calculating an overall quality on the basis of the normalized statistical coefficients of the clusters.
  - 11. The data processing system of claim 10, wherein the means for calculating the overall quality being adapted to calculate a weighted average of the normalized statistical coefficients of the clusters, the weighting factor for each cluster being the number of records within that cluster.
  - 12. The data processing system of any one of the preceding claims 8 to 11, wherein the means for data clustering being adapted to perform the data clustering in accordance with a first data clustering method and in accordance with a second data clustering method.
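
Claim 12, read together with method claims 6 and 7, suggests a loop: cluster with a first method, score each cluster, re-cluster the weakest cluster with a first or second method, and score again. The sketch below is one way such a loop could look. Everything in it is an assumption for illustration: the name `refine`, the pluggable `first_method`, `second_method` and `quality` callables, the `threshold` stopping rule, and the `max_rounds` cap are not taken from the patent text.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Labels = List[int]
ClusterFn = Callable[[Sequence], Labels]                    # one label per record
QualityFn = Callable[[Sequence, Labels], Dict[int, float]]  # coefficient per cluster


def refine(
    records: Sequence,
    first_method: ClusterFn,
    second_method: ClusterFn,
    quality: QualityFn,
    threshold: float = 0.5,  # assumed cut-off for "relatively low"
    max_rounds: int = 3,     # assumed cap on iterations
) -> Tuple[Labels, Dict[int, float]]:
    """Iteratively re-cluster the lowest-scoring cluster."""
    labels = list(first_method(records))
    scores = quality(records, labels)
    for _ in range(max_rounds):
        if not scores:
            break
        worst = min(scores, key=scores.get)
        if scores[worst] >= threshold:
            break  # every cluster is good enough
        # Re-cluster only the records of the selected cluster.
        members = [i for i, label in enumerate(labels) if label == worst]
        sub_labels = second_method([records[i] for i in members])
        # Give the resulting sub-clusters fresh, unused identifiers.
        next_id = max(labels) + 1
        fresh: Dict[int, int] = {}
        for i, sub in zip(members, sub_labels):
            if sub not in fresh:
                fresh[sub] = next_id
                next_id += 1
            labels[i] = fresh[sub]
        # Determine the quality again after the second clustering.
        scores = quality(records, labels)
    return labels, scores
```

Passing the same callable for `first_method` and `second_method` covers the case in claim 6 where the first clustering method is reused on the selected cluster.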

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Keller, Martin; Raspl, Stefan

Granted Patent

US 6,829,561 B2
US Class Current

702/179
CPC Class Codes

G06F 16/285   Clustering or classification

G06F 18/217   Validation; Performance eva...

G06F 18/23   Clustering techniques
