Method, apparatus and programmed medium for clustering databases with categorical attributes

US 6,049,797 A
Filed: 04/07/1998
Issued: 04/11/2000
Est. Priority Date: 04/07/1998
Status: Expired due to Fees

Abstract

The present invention relates to a computer method, apparatus and programmed medium for clustering databases containing data with categorical attributes. The present invention assigns a pair of points to be neighbors if their similarity exceeds a certain threshold. The similarity value for pairs of points can be based on non-metric information. The present invention determines a total number of links between each cluster and every other cluster based upon the neighbors of the clusters. A goodness measure between each cluster and every other cluster is then calculated based upon the total number of links between each cluster and every other cluster and the total number of points within each cluster and every other cluster. The present invention merges the two clusters with the best goodness measure. Thus, clustering is performed accurately and efficiently by merging data based on the number of links between the data to be clustered.
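
The neighbor and link notions in the abstract can be illustrated with a minimal Python sketch. It assumes each data point is a set of categorical items and uses the intersection-over-union ratio named in the dependent claims as the similarity. The function names (`jaccard`, `find_neighbors`, `count_links`) and the sample records are hypothetical, and a point is not counted as its own neighbor here, which is a simplifying choice of this sketch.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity ratio: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_neighbors(points, theta):
    """Map each point index to the indices whose similarity to it exceeds theta."""
    neighbors = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) > theta:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def count_links(neighbors):
    """links[(i, j)] = number of neighbors that points i and j have in common (i < j)."""
    links = {}
    for i, j in combinations(sorted(neighbors), 2):
        common = len(neighbors[i] & neighbors[j])
        if common:
            links[(i, j)] = common
    return links

# Hypothetical records, each a set of categorical items.
points = [
    frozenset({"a", "b", "c", "d"}),
    frozenset({"a", "b", "c", "e"}),
    frozenset({"a", "b", "d", "e"}),
    frozenset({"x", "y", "z"}),
]
neighbors = find_neighbors(points, theta=0.5)
links = count_links(neighbors)
```

In this example the first three records are pairwise neighbors (each pair shares 3 of 5 items), so every pair among them has exactly one common neighbor, while the fourth record has no neighbors and therefore no links.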

57 Claims

1. A computer based method of clustering related data stored in a computer database, said computer database stored on a computer readable medium and including a set of data points having categorical attributes, the method comprising the steps of:
- a) determining all neighbors for every data point within said computer database;
  
  b) establishing a cluster for every data point in said computer database;
  
  c) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster;
  
  d) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster;
  
  e) merging a pair of clusters having the best goodness measures into a merged cluster;
  
  f) repeating steps c) through e) until a predetermined termination condition is met; and
  
  g) storing clusters which remain after step f) in a computer readable medium.
- - 2. The method of claim 1 wherein the predetermined termination condition is obtaining a desired number of clusters.
  - 3. The method of claim 1 wherein the predetermined termination condition is obtaining remaining clusters that do not have any links between said remaining clusters.
  - 4. The method of claim 1 wherein the step of determining all neighbors for every data point within said computer database is determined by calculating a similarity between said data points.
  - 5. The method of claim 4 wherein calculating the similarity between said data points comprises the steps of:
    - a1) calculating a similarity ratio; and
      
      a2) assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold.
  - 6. The method of claim 5 wherein said similarity threshold is selected from the range of 0.5 to 0.8.
  - 7. The method of claim 5 wherein the calculation of said similarity ratio is performed by dividing an intersection of said data points by a union of said data points.
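
Steps b) through f) of claim 1 can be sketched as a greedy merge loop. The following Python is a minimal illustration, not the patented implementation: it takes precomputed point-to-point link counts as input, and the normaliser in `goodness` (an expected-link estimate of the form n^(1 + 2f(θ)) with f(θ) = (1 − θ)/(1 + θ), for θ < 1) is an assumption of this sketch, since the claims only require that the measure depend on the actual and the estimated number of links. All function and parameter names are hypothetical.

```python
def goodness(link_count, n_i, n_j, theta):
    """Actual links between two clusters divided by an estimate of expected links."""
    # Assumed form of the estimate (requires theta < 1); not fixed by the claims.
    exponent = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)
    expected = (n_i + n_j) ** exponent - n_i ** exponent - n_j ** exponent
    return link_count / expected

def cluster(n_points, point_links, theta, k):
    """Merge clusters greedily until k remain or no links are left between clusters."""
    clusters = {i: {i} for i in range(n_points)}              # one cluster per point
    links = {tuple(sorted(p)): c for p, c in point_links.items()}
    next_id = n_points
    while len(clusters) > k and links:
        def score(pair):
            a, b = pair
            return goodness(links[pair], len(clusters[a]), len(clusters[b]), theta)
        a, b = max(links, key=score)                          # best goodness measure
        merged = clusters.pop(a) | clusters.pop(b)
        # Links from the merged cluster to another cluster are the sum of the
        # links its two constituents had to that cluster.
        new_links = {}
        for (u, v), c in links.items():
            if {u, v} == {a, b}:
                continue
            if u in (a, b):
                key = (v, next_id)
            elif v in (a, b):
                key = (u, next_id)
            else:
                key = (u, v)
            new_links[key] = new_links.get(key, 0) + c
        links = new_links
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

# Hypothetical link counts for four points; point 3 has no links.
result = cluster(4, {(0, 1): 1, (0, 2): 1, (1, 2): 1}, theta=0.5, k=2)
```

The loop covers both termination conditions named in claims 2 and 3: it stops when the desired number of clusters `k` is reached, or earlier when the remaining clusters share no links.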

8. A computer based method of clustering related data in a large computer database, said large computer database stored on a computer readable medium and including a set of data points having categorical attributes, the method comprising the steps of:
- a) selecting a random set of data points from said large computer database;
  
  b) determining all neighbors for every data point within said random set of data points;
  
  c) establishing a cluster for every data point within said random set of data points;
  
  d) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster;
  
  e) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster;
  
  f) merging a pair of clusters having the best goodness measures into a merged cluster;
  
  g) repeating steps d) through f) until a predetermined termination condition is met; and
  
  h) storing clusters which remain after step g) in a computer readable medium.
- - 9. The method of claim 8 wherein said random set of data points has a size determined by Chernoff bounds.
  - 10. The method of claim 8 further comprising the step of:
    - i) assigning a cluster label to said data points not included in said random set of said data points.
  - 11. The method of claim 10 wherein the step of assigning a cluster label to said data points not included in said random set of data points comprises the steps of:
    - i1) selecting a set of labeling points for each of said remaining clusters; and
      
      i2) assigning a cluster label to said data points not included in said random set of data points in which said data points have a maximum amount of neighbors with the labeling points of one of said clusters.
  - 12. The method of claim 8 wherein the predetermined termination condition is obtaining a desired number of clusters.
  - 13. The method of claim 8 wherein the predetermined termination condition is obtaining remaining clusters that do not have any links between said remaining clusters.
  - 14. The method of claim 8 wherein the step of determining all neighbors for every data point within said random set of data points is determined by calculating a similarity between said data points.
  - 15. The method of claim 14 wherein calculating the similarity between said data points comprises the steps of:
    - b1) calculating a similarity ratio; and
      
      b2) assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold.
  - 16. The method of claim 15 wherein said similarity threshold is selected from the range of 0.5 to 0.8.
  - 17. The method of claim 15 wherein the calculation of said similarity ratio is performed by dividing an intersection of said data points by a union of said data points.
  - 18. The method of claim 8 further including the step of:
    - i) eliminating clusters comprised of outliers.
  - 19. The method of claim 18 wherein the step of eliminating clusters comprised of outliers is performed by deleting clusters having a number of links less than a predetermined threshold from said remaining clusters.
  - 20. The method of claim 19 wherein the predetermined threshold is 5 or less.
  - 21. The method of claim 18 further comprising the step of:
    - j) assigning a cluster label to said data points not included in said random set of said data points.
  - 22. The method of claim 21 wherein the step of assigning a cluster label to said data points not included in said random set of data points comprises the steps of:
    - j1) selecting a set of labeling points for each of said remaining clusters; and
      
      j2) assigning a cluster label to said data points not included in said random set of data points in which said data points have a maximum amount of neighbors with the labeling points of one of said clusters.
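
The labeling step of claims 10, 11, 21 and 22 can be sketched as follows. This is a minimal Python illustration under stated assumptions: each point is a set of categorical items, neighbors are decided by an intersection-over-union ratio exceeding a threshold, and the label goes to the cluster whose labeling points contain the largest raw count of neighbors. The names `assign_label` and `labeling_points`, and the sample data, are hypothetical.

```python
def jaccard(a, b):
    """Similarity ratio: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_label(point, labeling_points, theta):
    """Return the label of the cluster whose labeling points include the most
    neighbors of `point`, or None if the point has no neighbors among them."""
    best_label, best_count = None, 0
    for label, reps in labeling_points.items():
        count = sum(1 for r in reps if jaccard(point, r) > theta)
        if count > best_count:
            best_label, best_count = label, count
    return best_label

# Hypothetical labeling points chosen from two clusters found on the random sample.
labeling_points = {
    "A": [frozenset({"a", "b", "c", "d"}), frozenset({"a", "b", "c", "e"})],
    "B": [frozenset({"x", "y", "z"}), frozenset({"x", "y", "w"})],
}
label_1 = assign_label(frozenset({"a", "b", "c", "f"}), labeling_points, theta=0.5)
label_2 = assign_label(frozenset({"x", "y", "z", "w"}), labeling_points, theta=0.5)
label_3 = assign_label(frozenset({"q"}), labeling_points, theta=0.5)
```

Because raw counts are compared, a cluster represented by more labeling points has more chances to collect neighbors; choosing labeling sets of similar size for each cluster is one way to keep the comparison even.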

23. A computer readable storage medium containing a computer readable code for operating a computer to perform a clustering method on a computer database, said computer database including data points having categorical attributes, said clustering method comprises the steps of:
- a) determining all neighbors for every data point within said computer database;
  
  b) establishing a cluster for every data point in said computer database;
  
  c) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster;
  
  d) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster;
  
  e) merging a pair of clusters having the best goodness measures into a merged cluster;
  
  f) repeating steps c) through e) until a predetermined termination condition is met; and
  
  g) storing clusters which remain after step f) in a computer readable medium.
- - 24. The computer readable storage medium of claim 23 wherein the predetermined termination condition of said clustering method is obtaining a desired number of clusters.
  - 25. The computer readable storage medium of claim 23 wherein the predetermined termination condition of said clustering method is obtaining remaining clusters that do not have any links between said remaining clusters.
  - 26. The computer readable storage medium of claim 23 wherein said clustering method determines all neighbors of every data point within said database by calculating a similarity between said data points.
  - 27. The computer readable storage medium of claim 26 wherein said clustering method calculates the similarity between said data points by calculating a similarity ratio and assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold.
  - 28. The computer readable storage medium of claim 27 wherein said similarity threshold is selected from the range of 0.5 to 0.8.
  - 29. The computer readable storage medium of claim 27 wherein said clustering method calculates said similarity ratio by dividing the intersection of said data points by a union of said data points.

30. A computer readable storage medium containing a computer readable code for operating a computer to perform a clustering method on a large database, said large database including a set of data points having categorical attributes, said clustering method comprises the steps of:
- a) selecting a random set of data points from said large computer database;
  
  b) determining all neighbors for every data point within said random set of data points;
  
  c) establishing a cluster for every data point within said random set of data points;
  
  d) determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster;
  
  e) calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster;
  
  f) merging a pair of clusters having the best goodness measures into a merged cluster;
  
  g) repeating steps d) through f) until a predetermined termination condition is met; and
  
  h) storing clusters which remain after step g) in a computer readable medium.
- - 31. The computer readable storage medium of claim 30 wherein said random set of data points has a size determined by Chernoff bounds.
  - 32. The computer readable storage medium of claim 30 wherein said clustering method further comprises the step of:
    - i) assigning a cluster label to said data points not included in said random set of said data points.
  - 33. The computer readable storage medium of claim 32 wherein the step of assigning a cluster label to said data points not included in said random set of data points comprises the steps of:
    - i1) selecting a set of labeling points for each of said remaining clusters; and
      
      i2) assigning a cluster label to said data points not included in said random set of data points in which said data points have a maximum amount of neighbors with the labeling points of one of said clusters.
  - 34. The computer readable storage medium of claim 30 wherein the predetermined termination condition of said clustering method is obtaining a desired number of clusters.
  - 35. The computer readable storage medium of claim 30 wherein the predetermined termination condition of said clustering method is obtaining remaining clusters that do not have any links between said remaining clusters.
  - 36. The computer readable storage medium of claim 30 wherein said clustering method step of determining all neighbors for every data point within said database is determined by calculating a similarity between said data points.
  - 37. The computer readable storage medium of claim 36 wherein said clustering method calculates the similarity between said data points by calculating a similarity ratio and assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold.
  - 38. The computer readable storage medium of claim 37 wherein said similarity threshold is selected from the range of 0.5 to 0.8.
  - 39. The computer readable storage medium of claim 30 wherein said clustering method further includes the step of:
    - i) eliminating clusters comprised of outliers.
  - 40. The computer readable storage medium of claim 39 wherein the clustering method step of eliminating clusters comprised of outliers is performed by deleting clusters having a number of links less than a predetermined threshold from said remaining clusters.
  - 41. The computer readable storage medium of claim 40 wherein the predetermined threshold is 5 or less.
  - 42. The computer readable storage medium of claim 41 wherein said clustering method further comprises the step of:
    - j) assigning a cluster label to said data points not included in said random set of said data points.

43. A programmed computer database clustering system comprising:
- means for determining all neighbors for every data point within a computer database stored on a computer readable medium, said computer database including data points having categorical attributes;
  
  means for establishing a cluster for every data point in said computer database;
  
  means for determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster;
  
  means for calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster;
  
  means for merging a pair of clusters having the best goodness measures into a merged cluster;
  
  means for storing clusters which remain in a computer readable medium.
- - 44. The programmed computer database clustering system of claim 43 wherein said means for determining all neighbors for every data point within a computer database stored on a computer readable medium calculates a similarity between said data points.
  - 45. The programmed computer database clustering system of claim 43 wherein said means for determining all neighbors for every data point within a computer database stored on a computer readable medium comprises:
    - means for calculating a similarity ratio; and
      
      means for assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold.
  - 46. The programmed computer database clustering system of claim 45 wherein said means for assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold uses a similarity threshold selected from the range of 0.5 to 0.8.
  - 47. The programmed computer database clustering system of claim 46 wherein said means for calculating a similarity ratio calculates a similarity ratio by dividing an intersection of said data points by a union of said data points.

48. A programmed computer database clustering system comprising:
- means for selecting a random set of data points from a large computer database stored on a computer readable medium, said large computer database including data points having categorical attributes;
  
  means for determining all neighbors for every data point within said large computer database;
  
  means for establishing a cluster for every data point in said computer database;
  
  means for determining a total number of links between each cluster and every other cluster based on a number of common neighbors between each cluster and every other cluster;
  
  means for calculating a goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and an estimated number of links between each cluster and every other cluster;
  
  means for merging a pair of clusters having the best goodness measures into a merged cluster;
  
  means for storing clusters which remain in a computer readable medium.
- - 49. The programmed computer database clustering system of claim 48 further comprising:
    - means for assigning a cluster label to said data points not included in said random set of said data points.
  - 50. The programmed computer database clustering system of claim 49 wherein said means for assigning a cluster label comprises:
    - means for selecting a set of labeling points for each of said remaining clusters; and
      
      means for assigning a cluster label to said data points not included in said random set of data points in which said data points have a maximum amount of neighbors with the labeling points of one of said clusters.
  - 51. The programmed computer database clustering system of claim 49 wherein said means for determining all neighbors for every data point within a computer database stored on a computer readable medium calculates a similarity between said data points.
  - 52. The programmed computer database clustering system of claim 49 wherein said means for determining all neighbors for every data point within a computer database stored on a computer readable medium comprises:
    - means for calculating a similarity ratio; and
      
      means for assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold.
  - 53. The programmed computer database clustering system of claim 52 wherein said means for assigning said points to be neighbors if said similarity ratio exceeds a similarity threshold uses a similarity threshold selected from the range of 0.5 to 0.8.
  - 54. The programmed computer database clustering system of claim 53 wherein said means for calculating a similarity ratio calculates a similarity ratio by dividing an intersection of said data points by a union of said data points.
  - 55. The programmed computer database clustering system of claim 48 wherein said program further comprises:
    - means for eliminating clusters comprised of outliers.
  - 56. The programmed computer database clustering system of claim 55 wherein said means for eliminating clusters comprised of outliers deletes clusters having a number of links less than a predetermined threshold from said remaining clusters.
  - 57. The programmed computer database clustering system of claim 56 wherein the program further comprises:
    - means for assigning a cluster label to said data points not included in said random set of said data points.

Current Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Inventors
Rastogi, Rajeev; Guha, Sudipto; Shim, Kyuseok
Primary Examiner(s)
Black, Thomas G.
Assistant Examiner(s)
Mizrahi, Diane D.

Application Number

US09/055,940
Time in Patent Office

735 Days
Field of Search

707/2, 707/3, 707/5, 707/6, 707/101, 707/104, 702/179, 704/256, 705/27, 706/12, 706/50, 709/224, 345/433
US Class Current

1/1
CPC Class Codes

G06F 16/285   Clustering or classification

G06F 18/23   Clustering techniques

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access

Y10S 707/99942   Manipulating data structure...

Y10S 707/99945   Object-oriented database st...
