Unsupervised identification of nonlinear data cluster in multidimensional data

US 6,226,408 B1
Filed: 01/29/1999
Issued: 05/01/2001
Est. Priority Date: 01/29/1999
Status: Expired due to Term

First Claim

Patent Images

1. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without using labels encoded in the data items, the method comprising the operations of:

representing each data item by an input vector having a plurality of components representing portions of the data of the data item;

quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster centers, the number of cluster centers less than the plurality of data items, each of the input vectors contributing to each cluster center;

linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers;

encoding each input vector as an encoded vector having a coded vector component for each cluster center, each vector component determined as a function of a distance between the input vector and the respective cluster center, and a distance between the respective cluster center and a cluster center nearest the input vector;

repeating the operations of quantizing, linking, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and

for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label associated with nearest cluster center.

View all claims

4 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system, method, and software product provide for unsupervised identification of complex, nonlinear subspaces in high dimensional data. The system includes a vector quantization module, a weighted topology representing graph module, and an encoding module. The vector quantization module takes vector data inputs and extracts a group of inputs about a number of cluster centers, using a globally optimized clustering process. The weighted topology representing graph module creates a weighted graph of the vector space, using the cluster centers as nodes, weighting edges between nodes as a function of the density of the vectors between the linked nodes. The encoding module uses the weighted graph to recode the input vectors based on their proximity to the cluster centers and the connectedness of the graph. The recoded vectors are reinput into the vector quantization module, and the process repeated until termination, for example at a limited number of cluster centers. Upon termination, the clusters thus identified may be highly nonlinear in the original data space.

150 Citations

13 Claims

1. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without using labels encoded in the data items, the method comprising the operations of:
- representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
  
  quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster centers, the number of cluster centers less than the plurality of data items, each of the input vectors contributing to each cluster center;
  
  linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers;
  
  encoding each input vector as an encoded vector having a coded vector component for each cluster center, each vector component determined as a function of a distance between the input vector and the respective cluster center, and a distance between the respective cluster center and a cluster center nearest the input vector;
  
  repeating the operations of quantizing, linking, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
  
  for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label associated with nearest cluster center.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. The method of claim 1, wherein quantizing the plurality of input vectors further comprises:
3. The method of claim 1, wherein quantizing the plurality of input vectors further comprises:
- selecting a number of the input vectors as cluster centers; and
  
  updating each cluster center to a mean of its distance to each input vector.
4. The method of claim 3, wherein updating each cluster center to a mean of its distance to each input vector further comprises:
- for each input vector;
  
  determining a distance of the input vector to each cluster center;
  
  ranking the cluster centers by their distance to input vector;
  
  updating a partial sum vector of each cluster center using the input vector and the ranking to the cluster center;
  
  updating an exponentially decayed contribution of the input vector to each cluster center using the ranking of the cluster center; and
  
  updating each cluster center to the mean of the partial sum vectors for the cluster center.
5. The method of claim 1, wherein linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers, further comprises:
- for each edge between two cluster centers, weighting the edge as a function of a number of input vectors for which the two clusters are the closest cluster centers to each of the number of input vectors.
6. The method of claim 1, wherein linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers, further comprises:
- for each edge between two cluster centers;
  
  determining a number of input vectors for which the two cluster centers are the closest cluster centers to each of the input vectors;
  
  establishing a weight of the edge as an inverse of the determined number of input vectors.
7. The method of claim 1, wherein encoding each input vector as an encoded vector further comprises:
- encoding each input vector with an output code;
  
  $\begin{matrix} O_{k} = e^{- (r_{k} + α \cdot p_{kq}) / τ} & (4) \end{matrix}$
  
  where O_kis the output code for a k-th cluster center;
  
  r_kis a distance rank of the k-th cluster center to the input vector being encoded;
  
  α
  
  is a weighting parameter;
  
  p_kqis a length of a shortest path in the graph from a cluster center closest to the input vector being encoded to the k-th cluster center; and
  
  τ
  
  is a decay parameter.
8. The method of claim 1, wherein the distance between the input vector and the respective cluster center vector is a distance rank, and the distance between the respective cluster center and the cluster center nearest the input vector is a distance rank.

9. A computer implemented method for identifying clusters of associated data items in a database containing N data items, without using encoded labeling information in the data items, the method comprising the operations of:
- representing each data item by an input vector V having a plurality of components representing portions of the data of the data item;
  
  selecting a subset of K input vectors as cluster centers C, where K<
  
  N;
  
  updating each cluster center C_k(k=1 to K) as a function of a distance of each vector V_n(n=1 to N) to the cluster center C_k;
  
  linking the cluster centers C with edges to form a graph, each edge between two cluster centers C_iand C_jweighted according to a density of the vectors V between the two cluster centers C_iand C_j;
  
  encoding each input vector V with an output code O_kfor each cluster center C_kto form an encoded vector V′
  
  , the output code O_kfor each cluster center C_kdetermined as function of a distance of the cluster center C_kto the input vector V and a shortest path from the cluster center C_kto a cluster center C_onearest the input vector V;
  
  repeating the selecting, linking, and encoding operations using the encoded vectors V′
  
  as the input vectors V until a termination condition is satisfied; and
  
  for each encoded vector V′
  
  , labeling the data item represented by the encoded vector V with a label associated with the nearest cluster center C_o.

10. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the method comprising the operations of:
- representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
  
  quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster center vectors, the number of cluster center vectors less than the plurality of data items, each of the input vectors contributing to each cluster center vector;
  
  forming a graph by;
  
  associating pairs of cluster center vectors with an edge, such that each cluster center vector is associated with at least one other cluster center vector; and
  
  weighting each edge between a pair of cluster center vectors as a function of a number of input vectors to which the pair of cluster center vectors are the closest cluster center vectors;
  
  encoding each input vector as an encoded vector having a coded vector component for each cluster center vector, each vector component determined as a function of a distance of the input vector to the respective cluster center vector and a shortest path in the graph from the respective cluster center vector to a cluster center vector nearest the input vector;
  
  repeating the operations of quantizing, forming, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
  
  for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label defined for the cluster center vector nearest the encoded vector.

11. A computer implemented method for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the method comprising the operations of:
- representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
  
  quantizing the plurality of input vectors by associating each input vector with one of a number of cluster center vectors, the number of cluster center vectors less than the plurality of data items, each cluster center vector having components determined as a function of a distance of the cluster center vector to each of the plurality of input vectors;
  
  linking pairs of cluster center vectors with edges to form a graph, each edge between a pair of cluster center vectors weighted as a function of a number of input vectors to which the pair of cluster center vectors are the closest cluster center vectors;
  
  encoding each input vector as an encoded vector having a vector component for each cluster center vector, each vector component determined according to a relationship between the input vector and the cluster center vector and a relationship between the cluster center vector and a cluster center vector closest to the input vector;
  
  repeating the operations of quantizing, linking, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
  
  for each encoded vector remaining after the termination condition is satisfied, labeling the data item associated with the encoded vector with a label defined for the cluster center vector nearest the encoded vector.

12. A computer system for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the system comprising:
- a database containing a plurality of data items, each data item associated with a vector having a plurality of components representing portions of the data of the data item;
  
  a quantizing module coupled to the database to receive the plurality of input vectors and associate each input vector with one of a number of cluster center vectors, the number of cluster center vectors less than the plurality of data items, each cluster center vector having components determined as a function of a distance of the cluster center vector to each of the plurality of input vectors;
  
  a graph module coupled to the quantizing module to form a graph by linking pairs of cluster center vectors with edges, and to weight each edge between a pair of cluster center vectors as a function of a number of input vectors to which the pair of cluster center vectors are the closest cluster center vectors;
  
  an encoding module coupled to the graph module and the database to encode each input vector as an encoded vector having a vector component for each cluster center vector, and to determine each vector component according to a relationship between the input vector and the cluster center vector and a relationship between the cluster center vector and a cluster center vector closest to the input vector; and
  
  an executive module that iteratively inputs the encoded vectors to the quantizing module as the input vectors until a termination condition is satisfied, and that labels the data item associated with the encoded vector with a label defined for the cluster center with which the encoded vector is associated.

13. A computer system for unsupervised identification of associated sets of data items in a collection of data items without use of labeling information encoded in the data items, the method comprising the operations of:
- means for representing each data item by an input vector having a plurality of components representing portions of the data of the data item;
  
  means for quantizing the plurality of input vectors by associating each input vector with a closest one of a number of cluster center, the number of cluster center less than the plurality of data items, each of the input vectors contributing to each cluster center;
  
  means for linking the cluster centers with edges to form a graph, each edge between two cluster centers weighted according to a density of the input vectors between the two cluster centers;
  
  means for encoding each input vector as an encoded vector having a coded vector component for each cluster center vector, each vector component determined as a function of a distance between the input vector and the respective cluster center vector, and a distance between the respective cluster center and a cluster center nearest the input vector;
  
  means for repeating the operations of quantizing, forming, and encoding using the encoded vectors as the input vectors until a termination condition is satisfied; and
  
  means for labeling a data item associated with an encoded vector remaining after the termination condition is satisfied, with a label associated with nearest cluster center to the encoded vector.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Fair Isaac Corporation
Original Assignee
HNC Software Inc. (Fair Isaac Corporation)
Inventors
Sirosh, Joseph
Primary Examiner(s)
Boudreau, Leo H.
Assistant Examiner(s)
DESIRE, GREGORY M

Application Number

US09/240,403
Time in Patent Office

823 Days
Field of Search

382/110, 382/155-157, 382/159-160, 382/190, 382/197-198, 382/209, 382/224-229, 706/14-17, 706/20, 706/25-27, 704/6-7, 704/9-10, 704/259, 707/3-6
US Class Current

382/224
CPC Class Codes

G06F 18/23213 with fixed number of cluste...

G06F 18/2323 based on graph theory, e.g....

Unsupervised identification of nonlinear data cluster in multidimensional data

First Claim

4 Assignments

0 Petitions

Accused Products

Abstract

150 Citations

13 Claims

Specification

Use Cases

Quick Links

Others

Unsupervised identification of nonlinear data cluster in multidimensional data

First Claim

4 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

150 Citations

13 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others