Method and system for partitioning data into subsets of related data
First Claim
Patent Images
1. A method for partitioning entities into a set of clusters, the method comprising:
- repeatedly opening a new cluster;
selecting an entity not assigned to a cluster and assigning the selected entity to a new cluster;
iteratively selecting additional entities not assigned to a cluster, with high affinities toward entities currently assigned to the new cluster, and assigning the selected additional entities to the new cluster and removing from the new cluster entities assigned to the new cluster that no longer have high affinities toward entities currently assigned to the new cluster; and
closing the new cluster and adding the closed new cluster to the set of of clusters;
until no entities remain that are not assigned to a cluster.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and system for applying arbitrary similarity metrics to data entities in order to partition the entities into subsets of related entities. The method and system iteratively construct successive subsets, during construction of each subset adding candidate entities, not yet assigned to a subset, with high affinities toward the subset and removing entities previously assigned to the subset for which the affinities toward the subset have decreased. The method and system efficiently partition data with a high probability of finding an optimal or near-optimal partitioning.
-
Citations
27 Claims
-
1. A method for partitioning entities into a set of clusters, the method comprising:
-
repeatedly opening a new cluster;
selecting an entity not assigned to a cluster and assigning the selected entity to a new cluster;
iteratively selecting additional entities not assigned to a cluster, with high affinities toward entities currently assigned to the new cluster, and assigning the selected additional entities to the new cluster and removing from the new cluster entities assigned to the new cluster that no longer have high affinities toward entities currently assigned to the new cluster; and
closing the new cluster and adding the closed new cluster to the set of of clusters;
until no entities remain that are not assigned to a cluster. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
repeatedly attempting to select an additional entity not assigned to a cluster with high affinity for the entities assigned to the new cluster, when an additional entity is selected, adding the additional entity to the new cluster;
when an additional entity is not selected, attempting to select an entity currently assigned to the new cluster that now has low affinity for the entities assigned to the new cluster;
when an entity currently assigned to the new cluster that now has low affinity for the entities assigned to the new cluster a entity is selected, removing the entity from the new cluster;
until no additional entity not assigned to a cluster with high affinity for the entities assigned to the new cluster can be selected and no entity currently assigned to the new cluster that has low affinity for the entities assigned to the new cluster can be selected.
-
-
3. The method of claim 1 wherein an entity has high affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is at least equal to a threshold value and wherein an entity has low affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is below a threshold value.
-
4. The method of claim 1 wherein an entity has high affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is greater than a threshold value and wherein an entity has low affinity for the entities assigned to the new cluster when an aggregate affinity of the entity for the entities currently assigned to the new cluster is at most equal to a threshold value.
-
5. The method of claim 4 where an aggregate affinity of an entity for the entities in the new cluster is an average of the affinities of the entity for each entity currently assigned to the new cluster.
-
6. The method of claim 4 wherein an aggregate affinity of an entity for the entities in the new cluster is a linear combination of the affinities of the entity for each entity currently assigned to the new cluster.
-
7. The method of claim 4 wherein a final threshold value is determined by repeatedly clustering the entities using different threshold values.
-
8. The method of claim 7 wherein a threshold value that produces the most desirable clustering of entities is selected as the final threshold value.
-
9. The method of claim 8 wherein desirability of clustering is proportional to a score calculated for the clustering, the score equal to the number of large clusters multiplied by the number of entities contained within the large clusters, where large clusters are clusters exceeding a threshold number of members.
-
10. The method of claim 8 wherein desirability of clustering is proportional to a score calculated for the clustering, the score calculated from differences between an expected clustering and clustering obtained using a threshold value.
-
11. The method of claim 10 wherein the expected clustering is obtained from pair-wise similarity values for pairs of entities.
-
12. The method of claim 10 wherein the expected clustering is obtained from extrinsic data, such as known relationships between entities.
-
13. The method of claim 8 wherein desirability of clustering is inversely proportional to a combination of an implied inter-cluster error and an implied intra-cluster error.
-
14. The method of claim 1 wherein the affinities for all possible pairs of entities are stored in a similarity matrix.
-
15. The method of claim 14 wherein each entity is a vector of gene expression measurements for a particular gene and wherein the affinity of a first vector to a second vector is one minus a normalized distance between the endpoints of the first vector and the second vector.
-
16. The method of claim 1 further including:
-
when all entities have been assigned to clusters, for each donor cluster in time-of-creation order, starting from the first cluster of the set of clusters, for each acceptor cluster in time-of-creation order, starting from the cluster in the set of clusters next created after the donor cluster;
identifying any entities in the donor cluster that have a higher affinity for the acceptor cluster than for the donor cluster and moving the identified entities from the donor cluster to the acceptor cluster.
-
-
17. The method of claim 16 further including:
-
iteratively comparing the affinity of each entity to that entity'"'"'s current cluster and to other clusters;
marking each entity having a higher affinity of a cluster other than the entity'"'"'s current cluster for relocation; and
moving all marked clusters to the cluster for which they have highest affinity until no entities have higher affinities for clusters other than their current clusters.
-
-
18. The method of claim 17 further including terminating iteration of the comparing, marking, and moving steps after a fixed number of iterations, even when a number of entities have higher affinities for clusters other than their current clusters.
-
19. A system for recognizing related subgroups of entities within sets of entities and clustering the related entities into clusters, each cluster containing entities related according to a similarity metric, the system comprising:
-
a computer that can execute programs, accept input from a user, and display output to a user;
a computer-readable, initial set of non-clustered entities;
a similarity metric component that provides, in computer-readable form, similarity values that relate each entity with the remaining entities in the initial set of non-clustered entities;
a similarity cutoff value that partitions possible similarity values into low-affinity similarity values and high-affinity similarity values; and
a cluster analysis program that is executed by the computer, that receives, as input, the initial set of non-clustered entities, the similarity metric component, and similarity cutoff value, and that partitions the initial set of non-clustered entities into clusters by iteratively creating successive clusters, during each iteration creating a new cluster and selecting non-clustered entities to associate with the new cluster, until no non-clustered entities remain. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27)
opens and initializes a new, current cluster; and
repeatedly identifies a candidate entity not assigned to a cluster having an affinity for the current cluster at least as great as any other entity not assigned to a cluster;
when the affinity of the candidate entity is a high affinity, as determined by the cluster analysis program with reference to the similarity cutoff value, adding the candidate entity to the current cluster and removing the candidate entity from the list of entities not assigned to a cluster; and
when the affinity of the candidate entity is a low affinity, when an alternate candidate entity contained within the current cluster having at least as low an affinity for the current cluster as any other entity contained within the current cluster has a low affinity for the current cluster, removing the alternate candidate entity from the current cluster and reassigning the alternate candidate entity to the list of entities not assigned to a cluster; and
when an alternate candidate entity contained within the current cluster having at least as low an affinity for the current cluster as any other entity contained within the current cluster has a high affinity for the current cluster, closing the current cluster;
until the current cluster is closed.
-
-
23. The system of claim 22 wherein, following assignment of all entities in the initial set of entities to clusters, entities from earlier-constructed clusters may be reassigned to later-constructed clusters for which the entities have a higher affinity.
-
24. Electronic signals embodied in a carrier wave that encode computer instructions that implement the cluster analysis program of claim 19.
-
25. A computer-readable medium containing computer instructions of the cluster analysis program of claim 19.
-
26. A computer-readable medium containing an indication of the clusters generated by the cluster analysis program of claim 19.
-
27. Electronic signals embodied in a carrier wave that encode an indication of the clusters generated by the cluster analysis program of claim 19.
Specification