Method and apparatus for classification of data by aggregating emerging patterns
First Claim
1. A method of classifying data by aggregating emerging patterns in said data using datasets for a plurality of classes using a computer processor, said method including the steps of:
- for each of said classes, mining an emerging pattern set dependent upon instances of said set and opponent instances dependent upon predetermined growth rate and support thresholds;
calculating aggregate scores of said instances for all of said classes;
determining base scores for each of said classes; and
for each test instance, performing the sub-steps of;
calculating aggregate and normalised scores of test instance for each class; and
assigning a specified class to the test instance for which the test instance has a largest normalized score;
wherein said mining step includes the steps of;
determining borders of large itemsets using a largeborder based technique; and
determining supports and growth rates of emerging patterns for said class.
1 Assignment
0 Petitions
Accused Products
Abstract
Emerging patterns (EPs) are itemsets having supports that change significantly from one dataset to another. A classifier, CAEP, is disclosed using the following main ideas based on EPs: (i) Each EP can sharply differentiate the class membership of a (possibly small) fraction of instances containing the EP, due to the big difference between the EP'"'"'s supports in the opposing classes; the differentiating power of the EP is defined in terms of the EP'"'"'s supports and ratio, on instances containing the EP. (ii) For each instance t, by aggregating (124) the differentiating power of a fixed, automatically selected set of EPs, a score is obtained for each class (126). The scores for all classes are normalized (144) and the largest score determines t'"'"'s class (146). CAEP is suitable for many applications, even those with large volumes of high dimensional data. CAEP does not depend on dimension reduction on data and is usually equally accurate on all classes even if their populations are unbalanced.
-
Citations
32 Claims
-
1. A method of classifying data by aggregating emerging patterns in said data using datasets for a plurality of classes using a computer processor, said method including the steps of:
-
for each of said classes, mining an emerging pattern set dependent upon instances of said set and opponent instances dependent upon predetermined growth rate and support thresholds;
calculating aggregate scores of said instances for all of said classes;
determining base scores for each of said classes; and
for each test instance, performing the sub-steps of;
calculating aggregate and normalised scores of test instance for each class; and
assigning a specified class to the test instance for which the test instance has a largest normalized score;
wherein said mining step includes the steps of;
determining borders of large itemsets using a largeborder based technique; and
determining supports and growth rates of emerging patterns for said class. - View Dependent Claims (2, 3, 4)
-
-
5. An apparatus having a computer processor for classifying data by aggregating emerging patterns in said data using datasets for a plurality of classes, said apparatus including:
-
means for, for each of said classes, mining an emerging pattern set dependent upon instances of said set and opponent instances dependent upon predetermined growth rate and support thresholds;
means for calculating aggregate scores of said instances for all of said classes;
means for determining base scores for each of said classes; and
means for, for each test instance, performing specified operations, said performing means including;
means for calculating aggregate and normalised scores of test instance for each class; and
means for assigning a specified class to the test instance for which the test instance has a largest normalized score;
wherein said mining means includes;
means for determining two large borders of instances of said set and of said opponent set; and
means for finding all emerging pattern borders using multiple border pairs.
-
-
6. A method of classifying data by aggregating emerging patterns in said data using datasets for a plurality of classes using a computer processor, said method including the steps of:
-
for each of said classes, mining an emerging pattern set dependent upon instances of said set and opponent instances dependent upon predetermined growth rate and support thresholds;
calculating aggregate scores of said instances for all of said classes;
determining base scores for each of said classes; and
for each test instance, performing the sub-steps of;
calculating aggregate and normalised scores of test instance for each class; and
assigning a specified class to the test instance for which the test instance has a largest normalized score;
wherein said mining step includes the steps of;
determining two large borders of instances of said set and of said opponent set; and
finding all emerging pattern borders using multiple border pairs.
-
-
7. An apparatus having a computer processor for classifying data by aggregating emerging patterns in said data using datasets for a plurality of classes, said apparatus including:
-
means for, for each of said classes, mining an emerging pattern set dependent upon instances of said set and opponent instances dependent upon predetermined growth rate and support thresholds;
means for calculating aggregate scores of said instances for all of said classes;
means for determining base scores for each of said classes; and
means for, for each test instance, performing specified operations, said performing means including;
means for calculating aggregate and normalised scores of test instance for each class; and
means for assigning a specified class to the test instance for which the test instance has a largest normalized score;
wherein said mining means further includes;
means for determining borders of large itemsets using a large-border based technique; and
means for determining supports and growth rates of emerging patterns for said class. - View Dependent Claims (8, 9, 10)
-
-
11. A system for classifying data using a processor, said system including:
-
means for inputting samples of said data to be classified;
means for mining emerging patterns for all of a number of categories of said data;
means for computing aggregate differentiating scores for all samples of said data and said categories;
means for computing base scores of aggregate differentiating scores for all samples and categories; and
means for assigning a category to each sample, said category assigned to a sample having a normalized score that is maximal for said sample;
wherein said mining means includes;
means for determining two large borders of large itemsets in two categories having predetermined support thresholds;
means for finding emerging pattern borders using MBD-LLBORDER processing;
means for enumerating emerging patterns contained in found emerging pattern borders; and
means for checking through actual supports and growth rates of samples in said two categories.
-
-
12. A system for ranking and classifying data using a processor, said system including:
-
means for inputting samples of said data to be classified;
means for mining emerging patterns for all of a number of categories of said data;
means for computing aggregate differentiating scores for all samples of said data and said categories;
means for computing base scores of aggregate differentiating scores for all samples and categories; and
means for ranking each category against each sample by measuring a normalized score for said sample, where the greater a normalized score, the higher is the rank of said sample, said normalized score formed by dividing said aggregate differentiating score by a corresponding base score;
wherein said mining means includes;
means for determining two large borders of large itemsets in two categories having predetermined support thresholds;
means for finding emerging pattern borders using MBD-LLBORDER processing;
means for enumerating emerging patterns contained in found emerging pattern borders; and
means for checking through actual supports and growth rates of samples in said two categories.
-
-
13. A computer program product having a computer readable medium having a computer program recorded therein for classifying data by aggregating emerging patterns in said data using datasets for a plurality of classes, said computer program product including:
-
computer program source code means for, for each of said classes, mining an emerging pattern set dependent upon instances of said set and opponent instances dependent upon predetermined growth rate and support thresholds;
computer program source code means for calculating aggregate scores of said instances for all of said classes;
computer program source code means for determining base scores for each of said classes; and
computer program source code means for, for each test instance, performing specified operations, said computer program source code performing means including;
computer program source code means for calculating aggregate and normalised scores of test instance for each class; and
computer program source code means for assigning a specified class to the test instance for which the test instance has a largest normalized score;
wherein said computer program source code mining means further includes;
computer program source code means for determining borders of large itemsets using a large-border based technique; and
computer program source code means for determining supports and growth rates of emerging patterns for said class. - View Dependent Claims (14, 15, 16)
-
-
17. A system for extracting emerging patterns from data using a processor, said system including:
-
means for mining emerging patterns for all of a number of categories of said data;
means for computing aggregate differentiating scores for all samples of said data and said categories; and
means for computing base scores for said categories;
wherein said mining means includes;
means for determining two large borders of large itemsets in two categories having predetermined support thresholds;
means for finding emerging pattern borders using MBD-LLBORDER processing;
means for enumerating emerging patterns contained in found emerging pattern borders; and
means for checking through actual supports and growth rates of samples in said two categories.
-
-
18. A computer program product having a computer readable medium having a computer program recorded therein for classifying data by aggregating emerging patterns in said data using datasets for a plurality of classes, said computer program product including:
-
computer program source code means for, for each of said classes, mining an emerging pattern set dependent upon instances of said set and opponent instances dependent upon predetermined growth rate and support thresholds;
computer program source code means for calculating aggregate scores of said instances for all of said classes;
computer program source code means for determining base scores for each of said classes; and
computer program source code means for, for each test instance, performing specified operations, said computer program source code performing means including;
computer program source code means for calculating aggregate and normalised scores of test instance for each class; and
computer program source code means for assigning a specified class to the test instance for which the test instance has a largest normalized score;
wherein said computer program source code mining means includes;
computer program source code means for determining two large borders of instances of said set and of said opponent set; and
computer program source code means for finding all emerging pattern borders using multiple border pairs.
-
-
19. A system for extracting emerging patterns from data using a processor, said system including:
-
means for mining emerging patterns for all of a number of categories of said data;
means for computing aggregate differentiating scores for all samples of said data and said categories; and
means for computing base scores for said categories;
wherein said mining means includes means for manipulating borders, where each border is an ordered pair (L, R), if each of L and R is an anti-chain collection of sets, each element of L is a subset of an element of R, and each element of R is a superset of some element in L, wherein a collection of sets represented by, or a set interval of, such a border are sets Y such that Y is a superset of an element of L and is a subset of an element of R. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28)
means for obtaining samples from different input categories; and
means for adjustably discretizing said obtained samples.
-
-
25. The system according to claim 19, further including means for storing and managing said obtained samples and/or said discretized samples.
-
26. The system according to claim 19, wherein said emerging patterns are derived dependent upon one or more predetermined conditions include:
-
a support level threshold of a pattern in a category;
a growth rate threshold between categories;
a monotonically increasing weighting function for a growth rate;
a score specifying an aggregate differentiating score of a discretized sample and a set of emerging patterns of a category, said score being dependent upon supports and weighted growth rates of emerging patterns in a category; and
a base score on said aggregate differentiating score for each category.
-
-
27. The system according to claim 26, further including means for storing and managing derived emerging patterns and said one or more conditions for deriving said emerging patterns.
-
28. The system according to claim 19, further including means for selecting patterns that cover more training samples and have stronger differentiating power, said pattern selecting means including:
-
means for sorting emerging patterns between two categories into a list in decreasing order of growth rate and support;
means for initializing a set of essential emerging patterns, essE, to contain a first emerging pattern in said list; and
means for, for each next pattern in said list, ordering said set of essential emerging patterns, said ordering means including;
means for, for each J in said set of emerging patterns essE such that I is a sub-pattern or subset of J, replacing J by I if either of the following conditions is true;
growthrate C′
→
C(I) exceeds growthrate C′
→
C(J),suppc(I) greatly exceeds suppc(J) and growthrate C′
→
C(I) exceeds the threshold on growth rate; and
means for adding I to said set of emerging patterns essE if both of the above conditions are false and I is not a super-pattern or superset of any pattern in said set of emerging patterns essE.
-
-
29. A system for ranking and classifying data using a processor, said system including:
-
means for inputting samples of said data to be classified;
means for mining emerging patterns for all of a number of categories of said data;
means for computing aggregate differentiating scores for all samples of said data and said categories;
means for computing base scores of aggregate differentiating scores for all samples and categories; and
means for ranking each category against each sample by measuring a normalized score for said sample, where the greater a normalized score, the higher is the rank of said sample, said normalized score formed by dividing said aggregate differentiating score by a corresponding base score;
wherein said mining means includes means for manipulating borders, where each border is an ordered pair (L, R), if each of L and R is an anti-chain collection of sets, each element of L is a subset of an element of R, and each element of R is a superset of some element in L, wherein a collection of sets represented by, or a set interval of, such a border are sets Y such that Y is a superset of an element of L and is a subset of an element of R. - View Dependent Claims (30)
-
-
31. A system for classifying data using a processor, said system including:
-
means for inputting samples of said data to be classified;
means for mining emerging patterns for all of a number of categories of said data;
means for computing aggregate differentiating scores for all samples of said data and said categories;
means for computing base scores of aggregate differentiating scores for all samples and categories; and
means for assigning a category to each sample, said category assigned to a sample having a normalized score that is maximal for said sample;
wherein said mining means includes means for manipulating borders, where each border is an ordered pair (L, R), if each of L and R is an anti-chain collection of sets, each element of L is a subset of an element of R, and each element of R is a superset of some element in L, wherein a collection of sets represented by, or a set interval of, such a border are sets Y such that Y is a superset of an element of L and is a subset of an element of R. - View Dependent Claims (32)
-
Specification