System and method for sequence-based subspace pattern clustering
First Claim
1. An apparatus for facilitating subspace clustering, said apparatus comprising:
- a processor;
an arrangement for accepting input data;
an arrangement for discerning pattern similarity in the input data, wherein said discerning arrangement is configured to;
define a pattern space;
divide the pattern space into grids;
establish a grid of cells corresponding to the input data; and
construct a tree structure which summarizes frequent patterns discerned among the input data, wherein said tree structure is configured to determine at least one of;
a number of occurrences of a given pattern; and
a density of any cell in the grid of cells; and
an arrangement for clustering the input data on the basis of discerned pattern similarity, wherein said clustering arrangement is configured to merge cells of at least a threshold density into clusters;
said arrangement for discerning pattern similarity comprising an arrangement for discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data;
wherein said arrangement for discerning pattern similarity is configured to employ a distance function for determining a sequence-based distance between data objects, the distance function comprising;
given two data objects x and y, a subspace S, and a dimension kε
S, the sequence-based distance between x and y is as follows;
wherein the clustered input data is stored in a computer memory.
3 Assignments
0 Petitions
Accused Products
Abstract
Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis), etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences. Hence, it becomes important to enable pattern-based clustering methods i) to handle large datasets, and ii) to discover pattern similarity embedded in data sequences. There is presented herein a novel method that offers this capability.
-
Citations
3 Claims
-
1. An apparatus for facilitating subspace clustering, said apparatus comprising:
-
a processor; an arrangement for accepting input data; an arrangement for discerning pattern similarity in the input data, wherein said discerning arrangement is configured to; define a pattern space; divide the pattern space into grids; establish a grid of cells corresponding to the input data; and construct a tree structure which summarizes frequent patterns discerned among the input data, wherein said tree structure is configured to determine at least one of; a number of occurrences of a given pattern; and a density of any cell in the grid of cells; and an arrangement for clustering the input data on the basis of discerned pattern similarity, wherein said clustering arrangement is configured to merge cells of at least a threshold density into clusters; said arrangement for discerning pattern similarity comprising an arrangement for discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data; wherein said arrangement for discerning pattern similarity is configured to employ a distance function for determining a sequence-based distance between data objects, the distance function comprising; given two data objects x and y, a subspace S, and a dimension kε
S, the sequence-based distance between x and y is as follows;wherein the clustered input data is stored in a computer memory.
-
-
2. A method of facilitating subspace clustering, said method comprising the steps of:
-
(a) accepting input data; (b) discerning pattern similarity in the input data, said discerning step comprising; defining a pattern space; dividing the pattern space into grids; establishing a grid of cells corresponding to the input data; and constructing a tree structure which summarizes frequent patterns discerned among the input data, wherein said step of constructing a tree structure comprises determining at least one of; a number of occurrences of a given pattern; and a density of any cell in the grid of cells; and (c) clustering the input data on the basis of discerned pattern similarity, wherein said step of clustering the input data comprises merging cells of at least a threshold density into clusters; said discerning step comprising discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data; wherein said discerning pattern similarity in the input data comprises employing a distance function for determining a sequence-based distance between data objects, the distance function comprising; given two data objects x and y, a subspace S, and a dimension kε
S, the sequence-based distance between x and y is as follows;wherein the clustered input data is stored in a computer memory.
-
-
3. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for facilitating subspace clustering, said method comprising the steps of:
-
(a) accepting input data; (b) discerning pattern similarity in the input data, said discerning step comprising; defining a pattern space; dividing the pattern space into grids; establishing a grid of cells corresponding to the input data; and constructing a tree structure which summarizes frequent patterns discerned among the input data, wherein said step of constructing a tree structure comprises determining at least one of; a number of occurrences of a given pattern; and a density of any cell in the grid of cells; and (c) clustering the input data on the basis of discerned pattern similarity, wherein said step of clustering the input data comprises merging cells of at least a threshold density into clusters; said discerning step comprising discerning pattern similarity among both tabular data and sequential data contained in the input data, wherein said tabular data is transformed and represented as sequential data; wherein said discerning pattern similarity in the input data comprises employing a distance function for determining a sequence-based distance between data objects, the distance function comprising; given two data objects x and y, a subspace S,. and a dimension kε
S, the sequence-based distance between x and y is as follows;
-
Specification