System and method for sequence-based subspace pattern clustering

US 7,565,346 B2
Filed: 05/31/2004
Issued: 07/21/2009
Est. Priority Date: 05/31/2004
Status: Expired due to Fees

First Claim

Patent Images

1. An apparatus for facilitating subspace clustering, said apparatus comprising:

a processor;

an arrangement for accepting input data;

an arrangement for discerning pattern similarity in the input data, wherein said discerning arrangement is configured to;

define a pattern space;

divide the pattern space into grids;

establish a grid of cells corresponding to the input data; and

construct a tree structure which summarizes frequent patterns discerned among the input data, wherein said tree structure is configured to determine at least one of;

a number of occurrences of a given pattern; and

a density of any cell in the grid of cells; and

an arrangement for clustering the input data on the basis of discerned pattern similarity, wherein said clustering arrangement is configured to merge cells of at least a threshold density into clusters;

said arrangement for discerning pattern similarity comprising an arrangement for discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data;

wherein said arrangement for discerning pattern similarity is configured to employ a distance function for determining a sequence-based distance between data objects, the distance function comprising;

given two data objects x and y, a subspace S, and a dimension kε

S, the sequence-based distance between x and y is as follows;

${dist}_{k, S} (x, y) = \max_{i \in ??} \langle (x_{i} - y_{i}) - (x_{k} - y_{k}) \rangle; and$ wherein the clustered input data is stored in a computer memory.

View all claims

3 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis), etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences. Hence, it becomes important to enable pattern-based clustering methods i) to handle large datasets, and ii) to discover pattern similarity embedded in data sequences. There is presented herein a novel method that offers this capability.

Citations

3 Claims

1. An apparatus for facilitating subspace clustering, said apparatus comprising:
- a processor;
  
  an arrangement for accepting input data;
  
  an arrangement for discerning pattern similarity in the input data, wherein said discerning arrangement is configured to;
  
  define a pattern space;
  
  divide the pattern space into grids;
  
  establish a grid of cells corresponding to the input data; and
  
  construct a tree structure which summarizes frequent patterns discerned among the input data, wherein said tree structure is configured to determine at least one of;
  
  a number of occurrences of a given pattern; and
  
  a density of any cell in the grid of cells; and
  
  an arrangement for clustering the input data on the basis of discerned pattern similarity, wherein said clustering arrangement is configured to merge cells of at least a threshold density into clusters;
  
  said arrangement for discerning pattern similarity comprising an arrangement for discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data;
  
  wherein said arrangement for discerning pattern similarity is configured to employ a distance function for determining a sequence-based distance between data objects, the distance function comprising;
  
  given two data objects x and y, a subspace S, and a dimension kε
  
  S, the sequence-based distance between x and y is as follows;
  
  ${dist}_{k, S} (x, y) = \max_{i \in ??} \langle (x_{i} - y_{i}) - (x_{k} - y_{k}) \rangle; and$ wherein the clustered input data is stored in a computer memory.

2. A method of facilitating subspace clustering, said method comprising the steps of:
- (a) accepting input data;
  
  (b) discerning pattern similarity in the input data, said discerning step comprising;
  
  defining a pattern space;
  
  dividing the pattern space into grids;
  
  establishing a grid of cells corresponding to the input data; and
  
  constructing a tree structure which summarizes frequent patterns discerned among the input data, wherein said step of constructing a tree structure comprises determining at least one of;
  
  a number of occurrences of a given pattern; and
  
  a density of any cell in the grid of cells; and
  
  (c) clustering the input data on the basis of discerned pattern similarity, wherein said step of clustering the input data comprises merging cells of at least a threshold density into clusters;
  
  said discerning step comprising discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data;
  
  wherein said discerning pattern similarity in the input data comprises employing a distance function for determining a sequence-based distance between data objects, the distance function comprising;
  
  given two data objects x and y, a subspace S, and a dimension kε
  
  S, the sequence-based distance between x and y is as follows;
  
  ${dist}_{k, S} (x, y) = \max_{i ε S} \langle (x_{i} - y_{i}) - (x_{k} - y_{k}) \rangle; and$ wherein the clustered input data is stored in a computer memory.

3. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for facilitating subspace clustering, said method comprising the steps of:
- (a) accepting input data;
  
  (b) discerning pattern similarity in the input data, said discerning step comprising;
  
  defining a pattern space;
  
  dividing the pattern space into grids;
  
  establishing a grid of cells corresponding to the input data; and
  
  constructing a tree structure which summarizes frequent patterns discerned among the input data, wherein said step of constructing a tree structure comprises determining at least one of;
  
  a number of occurrences of a given pattern; and
  
  a density of any cell in the grid of cells; and
  
  (c) clustering the input data on the basis of discerned pattern similarity, wherein said step of clustering the input data comprises merging cells of at least a threshold density into clusters;
  
  said discerning step comprising discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data;
  
  wherein said discerning pattern similarity in the input data comprises employing a distance function for determining a sequence-based distance between data objects, the distance function comprising;
  
  given two data objects x and y, a subspace S,. and a dimension kε
  
  S, the sequence-based distance between x and y is as follows;
  
  ${dist}_{k, S} (x, y) = \max_{i \in S} \langle (x_{i} - y_{i}) - (x_{k} - y_{k}) \rangle .$

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Yu, Philip S., Fan, Wei, Wang, Haixun
Primary Examiner(s)
Cottingham; John R.
Assistant Examiner(s)
LOVEL, KIMBERLY M

Application Number

US10/858,541
Publication Number

US 20050278324A1
Time in Patent Office

1,877 Days
Field of Search

None
US Class Current

1/1
CPC Class Codes

G06F 18/22 Matching criteria, e.g. pro...

System and method for sequence-based subspace pattern clustering

First Claim

3 Assignments

0 Petitions

Accused Products

Abstract

Citations

3 Claims

Specification

Solutions

Use Cases

Quick Links

System and method for sequence-based subspace pattern clustering

First Claim

3 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

3 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links