Unsupervised identification of nonlinear data cluster in multidimensional data
First Claim
1. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without using labels encoded in the data items, the method comprising the operations of:
- representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster centers, the number of cluster centers less than the plurality of data items, each of the input vectors contributing to each cluster center;
linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers;
encoding each input vector as an encoded vector having a coded vector component for each cluster center, each vector component determined as a function of a distance between the input vector and the respective cluster center, and a distance between the respective cluster center and a cluster center nearest the input vector;
repeating the operations of quantizing, linking, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label associated with nearest cluster center.
4 Assignments
0 Petitions
Accused Products
Abstract
A system, method, and software product provide for unsupervised identification of complex, nonlinear subspaces in high dimensional data. The system includes a vector quantization module, a weighted topology representing graph module, and an encoding module. The vector quantization module takes vector data inputs and extracts a group of inputs about a number of cluster centers, using a globally optimized clustering process. The weighted topology representing graph module creates a weighted graph of the vector space, using the cluster centers as nodes, weighting edges between nodes as a function of the density of the vectors between the linked nodes. The encoding module uses the weighted graph to recode the input vectors based on their proximity to the cluster centers and the connectedness of the graph. The recoded vectors are reinput into the vector quantization module, and the process repeated until termination, for example at a limited number of cluster centers. Upon termination, the clusters thus identified may be highly nonlinear in the original data space.
150 Citations
13 Claims
-
1. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without using labels encoded in the data items, the method comprising the operations of:
-
representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster centers, the number of cluster centers less than the plurality of data items, each of the input vectors contributing to each cluster center;
linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers;
encoding each input vector as an encoded vector having a coded vector component for each cluster center, each vector component determined as a function of a distance between the input vector and the respective cluster center, and a distance between the respective cluster center and a cluster center nearest the input vector;
repeating the operations of quantizing, linking, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label associated with nearest cluster center. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
adjusting each cluster center as a function of a distance of the cluster center to each of the input vectors.
-
-
3. The method of claim 1, wherein quantizing the plurality of input vectors further comprises:
-
selecting a number of the input vectors as cluster centers; and
updating each cluster center to a mean of its distance to each input vector.
-
-
4. The method of claim 3, wherein updating each cluster center to a mean of its distance to each input vector further comprises:
-
for each input vector;
determining a distance of the input vector to each cluster center;
ranking the cluster centers by their distance to input vector;
updating a partial sum vector of each cluster center using the input vector and the ranking to the cluster center;
updating an exponentially decayed contribution of the input vector to each cluster center using the ranking of the cluster center; and
updating each cluster center to the mean of the partial sum vectors for the cluster center.
-
-
5. The method of claim 1, wherein linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers, further comprises:
for each edge between two cluster centers, weighting the edge as a function of a number of input vectors for which the two clusters are the closest cluster centers to each of the number of input vectors.
-
6. The method of claim 1, wherein linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers, further comprises:
-
for each edge between two cluster centers;
determining a number of input vectors for which the two cluster centers are the closest cluster centers to each of the input vectors;
establishing a weight of the edge as an inverse of the determined number of input vectors.
-
-
7. The method of claim 1, wherein encoding each input vector as an encoded vector further comprises:
-
encoding each input vector with an output code;
whereOk is the output code for a k-th cluster center;
rk is a distance rank of the k-th cluster center to the input vector being encoded;
α
is a weighting parameter;
pkq is a length of a shortest path in the graph from a cluster center closest to the input vector being encoded to the k-th cluster center; and
τ
is a decay parameter.
-
-
8. The method of claim 1, wherein the distance between the input vector and the respective cluster center vector is a distance rank, and the distance between the respective cluster center and the cluster center nearest the input vector is a distance rank.
-
9. A computer implemented method for identifying clusters of associated data items in a database containing N data items, without using encoded labeling information in the data items, the method comprising the operations of:
-
representing each data item by an input vector V having a plurality of components representing portions of the data of the data item;
selecting a subset of K input vectors as cluster centers C, where K<
N;
updating each cluster center Ck (k=1 to K) as a function of a distance of each vector Vn (n=1 to N) to the cluster center Ck;
linking the cluster centers C with edges to form a graph, each edge between two cluster centers Ci and Cj weighted according to a density of the vectors V between the two cluster centers Ci and Cj;
encoding each input vector V with an output code Ok for each cluster center Ck to form an encoded vector V′
, the output code Ok for each cluster center Ck determined as function of a distance of the cluster center Ck to the input vector V and a shortest path from the cluster center Ck to a cluster center Co nearest the input vector V;
repeating the selecting, linking, and encoding operations using the encoded vectors V′
as the input vectors V until a termination condition is satisfied; and
for each encoded vector V′
, labeling the data item represented by the encoded vector V with a label associated with the nearest cluster center Co.
-
-
10. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the method comprising the operations of:
-
representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster center vectors, the number of cluster center vectors less than the plurality of data items, each of the input vectors contributing to each cluster center vector;
forming a graph by;
associating pairs of cluster center vectors with an edge, such that each cluster center vector is associated with at least one other cluster center vector; and
weighting each edge between a pair of cluster center vectors as a function of a number of input vectors to which the pair of cluster center vectors are the closest cluster center vectors;
encoding each input vector as an encoded vector having a coded vector component for each cluster center vector, each vector component determined as a function of a distance of the input vector to the respective cluster center vector and a shortest path in the graph from the respective cluster center vector to a cluster center vector nearest the input vector;
repeating the operations of quantizing, forming, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label defined for the cluster center vector nearest the encoded vector.
-
-
11. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the method comprising the operations of:
-
representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
quantizing the plurality of input vectors by associating each input vector with one of a number of cluster center vectors, the number of cluster center vectors less than the plurality of data items, each cluster center vector having components determined as a function of a distance of the cluster center vector to each of the plurality of input vectors;
linking pairs of cluster center vectors with edges to form a graph, each edge between a pair of cluster center vectors weighted as a function of a number of input vectors to which the pair of cluster center vectors are the closest cluster center vectors;
encoding each input vector as an encoded vector having a vector component for each cluster center vector, each vector component determined according to a relationship between the input vector and the cluster center vector and a relationship between the cluster center vector and a cluster center vector closest to the input vector;
repeating the operations of quantizing, linking, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label defined for the cluster center vector nearest the encoded vector.
-
-
12. A computer system for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the system comprising:
-
a database containing a plurality of data items, each data item associated with a vector having a plurality of components representing portions of the data of the data item;
a quantizing module coupled to the database to receive the plurality of input vectors and associate each input vector with one of a number of cluster center vectors, the number of cluster center vectors less than the plurality of data items, each cluster center vector having components determined as a function of a distance of the cluster center vector to each of the plurality of input vectors;
a graph module coupled to the quantizing module to form a graph by linking pairs of cluster center vectors with edges, and to weight each edge between a pair of cluster center vectors as a function of a number of input vectors to which the pair of cluster center vectors are the closest cluster center vectors;
an encoding module coupled to the graph module and the database to encode each input vector as an encoded vector having a vector component for each cluster center vector, and to determine each vector component according to a relationship between the input vector and the cluster center vector and a relationship between the cluster center vector and a cluster center vector closest to the input vector; and
an executive module that iteratively inputs the encoded vectors to the quantizing module as the input vectors until a termination condition is satisfied, and that labels the data item associated with the encoded vector with a label defined for the cluster center with which the encoded vector is associated.
-
-
13. A computer system for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the method comprising the operations of:
-
means for representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
means for quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster center, the number of cluster center less than the plurality of data items, each of the input vectors contributing to each cluster center;
means for linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers;
means for encoding each input vector as an encoded vector having a coded vector component for each cluster center vector, each vector component determined as a function of a distance between the input vector and the respective cluster center vector, and a distance between the respective cluster center and a cluster center nearest the input vector;
means for repeating the operations of quantizing, forming, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
means for labeling a data item associated with an encoded vector remaining after the termination condition is satisfied, with a label associated with nearest cluster center to the encoded vector.
-
Specification