Computational method for discovering patterns in data sets
First Claim
1. A method in a computer of pattern discovery in a data base, comprising the steps of:
- a) operating a computer to provide a data base comprising M samples, each sample being described in terms of N attributes represented by X={X1, . . . ZN }, each attribute Xi comprising a random variable taking on values from its alphabet α
i ={α
i, . . . , α
min} where mi is the cardinality of an alphabet of the ith attribute, wherein a primary event is a realization (xi) of Xi which takes on a value from α
i ;
b) operating said computer to set an initial pattern candidate set C comprising a plurality of candidates c and an Attributed Hypergraph (AHG) pattern set P (AHG P) to be empty, placing all primary events ri into a primary event set R, and setting order O=1;
c) operating said computer to increment O and generating a candidate set C of order O from previous and primary event set R and if C is empty, go to step (e);
otherwise go to step (d);
(d) operating said computer to count, for each candidate c in C, its occurrences o and to calculate its expected occurrence e given by
space="preserve" listing-type="equation"> e=M o.sub.i /M! ##EQU21## where o.sub.i is the occurrence of the i-th primary event in c and M is the total number of instances in said database, if said expected occurrence is less than a preselected threshold, remove c from C, if the expected occurrence is greater than said preselected threshold, calculate an adjusted residual d of this candidate given by
space="preserve" listing-type="equation"> d=(o-e)/√
V! ##EQU22## where v is a maximum likelihood estimate of the variance of residual (o-e), if |d| is larger than a preselected confidence level, put c into AHG P, and if c is a negative pattern and the occurrence o is less than said preselected threshold, remove c from C, and if order O is equal to a highest order in demand, go to step (e) below, otherwise go to step (b);
(e) operating said computer to output AHG P.
1 Assignment
0 Petitions
Accused Products
Abstract
Automatic discovery of qualitative and quantitative patterns inherent in data sets is accomplished by use of a unified framework which employs adjusted residual analysis in statistics to test the significance of the pattern candidates generated from data sets. This framework consists of a search engine for different order patterns, a mechanism to avoid exhaustive search by eliminating impossible pattern candidates, an attributed hypergraph (AHG) based knowledge representation language and an inference engine which measures the weight of evidence of each pattern for classification and prediction. If a pattern candidate passes the statistical significance test of adjusted residual, it is regarded as a pattern and represented by an attributed hyperedge in AHG. In the task of classification and/or prediction, the weights of evidence are calculated and compared to draw the conclusion.
192 Citations
4 Claims
-
1. A method in a computer of pattern discovery in a data base, comprising the steps of:
-
a) operating a computer to provide a data base comprising M samples, each sample being described in terms of N attributes represented by X={X1, . . . ZN }, each attribute Xi comprising a random variable taking on values from its alphabet α
i ={α
i, . . . , α
min} where mi is the cardinality of an alphabet of the ith attribute, wherein a primary event is a realization (xi) of Xi which takes on a value from α
i ;b) operating said computer to set an initial pattern candidate set C comprising a plurality of candidates c and an Attributed Hypergraph (AHG) pattern set P (AHG P) to be empty, placing all primary events ri into a primary event set R, and setting order O=1; c) operating said computer to increment O and generating a candidate set C of order O from previous and primary event set R and if C is empty, go to step (e);
otherwise go to step (d);(d) operating said computer to count, for each candidate c in C, its occurrences o and to calculate its expected occurrence e given by
space="preserve" listing-type="equation"> e=M o.sub.i /M! ##EQU21## where o.sub.i is the occurrence of the i-th primary event in c and M is the total number of instances in said database, if said expected occurrence is less than a preselected threshold, remove c from C, if the expected occurrence is greater than said preselected threshold, calculate an adjusted residual d of this candidate given by
space="preserve" listing-type="equation"> d=(o-e)/√
V! ##EQU22## where v is a maximum likelihood estimate of the variance of residual (o-e), if |d| is larger than a preselected confidence level, put c into AHG P, and if c is a negative pattern and the occurrence o is less than said preselected threshold, remove c from C, and if order O is equal to a highest order in demand, go to step (e) below, otherwise go to step (b);(e) operating said computer to output AHG P. - View Dependent Claims (2, 3, 4)
-
Specification