Method and system for partitioning data into subsets of related data

US 6,421,668 B1
Filed: 08/05/1999
Issued: 07/16/2002
Est. Priority Date: 08/05/1999
Status: Expired due to Fees

First Claim

Patent Images

1. A method for partitioning entities into a set of clusters, the method comprising:

repeatedly opening a new cluster;

selecting an entity not assigned to a cluster and assigning the selected entity to a new cluster;

iteratively selecting additional entities not assigned to a cluster, with high affinities toward entities currently assigned to the new cluster, and assigning the selected additional entities to the new cluster and removing from the new cluster entities assigned to the new cluster that no longer have high affinities toward entities currently assigned to the new cluster; and

closing the new cluster and adding the closed new cluster to the set of of clusters;

until no entities remain that are not assigned to a cluster.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and system for applying arbitrary similarity metrics to data entities in order to partition the entities into subsets of related entities. The method and system iteratively construct successive subsets, during construction of each subset adding candidate entities, not yet assigned to a subset, with high affinities toward the subset and removing entities previously assigned to the subset for which the affinities toward the subset have decreased. The method and system efficiently partition data with a high probability of finding an optimal or near-optimal partitioning.

Citations

27 Claims

1. A method for partitioning entities into a set of clusters, the method comprising:
- repeatedly opening a new cluster;
  
  selecting an entity not assigned to a cluster and assigning the selected entity to a new cluster;
  
  iteratively selecting additional entities not assigned to a cluster, with high affinities toward entities currently assigned to the new cluster, and assigning the selected additional entities to the new cluster and removing from the new cluster entities assigned to the new cluster that no longer have high affinities toward entities currently assigned to the new cluster; and
  
  closing the new cluster and adding the closed new cluster to the set of of clusters;
  
  until no entities remain that are not assigned to a cluster.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
- - 2. The method of claim 1 wherein iteratively selecting additional entities and assigning the selected additional entities to the new cluster and removing from the new cluster entities assigned to the new cluster that no longer have high affinities toward entities currently assigned to the new cluster further includes:
3. The method of claim 1 wherein an entity has high affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is at least equal to a threshold value and wherein an entity has low affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is below a threshold value.
4. The method of claim 1 wherein an entity has high affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is greater than a threshold value and wherein an entity has low affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is at most equal to a threshold value.
5. The method of claim 4 where an aggregate affinity of an entity for the entities in the new cluster is an average of the affinities of the entity for each entity currently assigned to the new cluster.
6. The method of claim 4 wherein an aggregate affinity of an entity for the entities in the new cluster is a linear combination of the affinities of the entity for each entity currently assigned to the new cluster.
7. The method of claim 4 wherein a final threshold value is determined by repeatedly clustering the entities using different threshold values.
8. The method of claim 7 wherein a threshold value that produces the most desirable clustering of entities is selected as the final threshold value.
9. The method of claim 8 wherein desirability of clustering is proportional to a score calculated for the clustering, the score equal to the number of large clusters multiplied by the number of entities contained within the large clusters, where large clusters are clusters exceeding a threshold number of members.
10. The method of claim 8 wherein desirability of clustering is proportional to a score calculated for the clustering, the score calculated from differences between an expected clustering and clustering obtained using a threshold value.
11. The method of claim 10 wherein the expected clustering is obtained from pair-wise similarity values for pairs of entities.
12. The method of claim 10 wherein the expected clustering is obtained from extrinsic data, such as known relationships between entities.
13. The method of claim 8 wherein desirability of clustering is inversely proportional to a combination of an implied inter-cluster error and an implied intra-cluster error.
14. The method of claim 1 wherein the affinities for all possible pairs of entities are stored in a similarity matrix.
15. The method of claim 14 wherein each entity is a vector of gene expression measurements for a particular gene and wherein the affinity of a first vector to a second vector is one minus a normalized distance between the endpoints of the first vector and the second vector.
16. The method of claim 1 further including:
- when all entities have been assigned to clusters, for each donor cluster in time-of-creation order, starting from the first cluster of the set of clusters, for each acceptor cluster in time-of-creation order, starting from the cluster in the set of clusters next created after the donor cluster;
  
  identifying any entities in the donor cluster that have a higher affinity for the acceptor cluster than for the donor cluster and moving the identified entities from the donor cluster to the acceptor cluster.
17. The method of claim 16 further including:
- iteratively comparing the affinity of each entity to that entity'"'"'s current cluster and to other clusters;
  
  marking each entity having a higher affinity of a cluster other than the entity'"'"'s current cluster for relocation; and
  
  moving all marked clusters to the cluster for which they have highest affinity until no entities have higher affinities for clusters other than their current clusters.
18. The method of claim 17 further including terminating iteration of the comparing, marking, and moving steps after a fixed number of iterations, even when a number of entities have higher affinities for clusters other than their current clusters.

19. A system for recognizing related subgroups of entities within sets of entities and clustering the related entities into clusters, each cluster containing entities related according to a similarity metric, the system comprising:
- a computer that can execute programs, accept input from a user, and display output to a user;
  
  a computer-readable, initial set of non-clustered entities;
  
  a similarity metric component that provides, in computer-readable form, similarity values that relate each entity with the remaining entities in the initial set of non-clustered entities;
  
  a similarity cutoff value that partitions possible similarity values into low-affinity similarity values and high-affinity similarity values; and
  
  a cluster analysis program that is executed by the computer, that receives, as input, the initial set of non-clustered entities, the similarity metric component, and similarity cutoff value, and that partitions the initial set of non-clustered entities into clusters by iteratively creating successive clusters, during each iteration creating a new cluster and selecting non-clustered entities to associate with the new cluster, until no non-clustered entities remain.
- View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27)
- - 20. The system of claim 19 wherein the cluster analysis program iteratively generates successive clusters, a single cluster generated during each outer-level iteration of the cluster analysis program.
  - 21. The system of claim 20 wherein the cluster analysis program maintains a list of entities not assigned to a cluster and a list of clusters generated by the cluster analysis program.
  - 22. The system of claim 20 wherein, during each outer-level iteration of the cluster analysis program, the cluster analysis program:
23. The system of claim 22 wherein, following assignment of all entities in the initial set of entities to clusters, entities from earlier-constructed clusters may be reassigned to later-constructed clusters for which the entities have a higher affinity.
24. Electronic signals embodied in a carrier wave that encode computer instructions that implement the cluster analysis program of claim 19.
25. A computer-readable medium containing computer instructions of the cluster analysis program of claim 19.
26. A computer-readable medium containing an indication of the clusters generated by the cluster analysis program of claim 19.
27. Electronic signals embodied in a carrier wave that encode an indication of the clusters generated by the cluster analysis program of claim 19.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Agilent Technologies, Inc.
Original Assignee
Agilent Technologies, Inc.
Inventors
Ben-Dor, Amir, Yakhini, Zohar H, Shamir, Ron
Primary Examiner(s)
Homere, Jean R.
Assistant Examiner(s)
Wassum, Luke S.

Application Number

US09/369,210
Time in Patent Office

1,076 Days
Field of Search

706/13, 707/6-7, 707/101, 707/104.1, 702/179-181, 702/127-128, 702/19-21
US Class Current

1/1
CPC Class Codes

G06F 16/35   Clustering; Classification

Y10S 707/99936   Pattern matching access

Y10S 707/99942   Manipulating data structure...

Y10S 707/99945   Object-oriented database st...

Method and system for partitioning data into subsets of related data

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

27 Claims

Specification

Solutions

Use Cases

Quick Links

Method and system for partitioning data into subsets of related data

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

27 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links