Integrated database and data-mining system
First Claim
1. A method for mining rules from an integrated database and data-mining system having a table of data transactions and a query engine, the method comprising the steps of:
- a) performing a group-by query on the transaction table to generate a set of frequent 1-itemsets;
b) determining frequent 2-itemsets from the frequent 1-itemsets and the transaction table;
c) generating a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;
d) determining frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query operation;
e) repeating steps (c) and (d) with n=n+1 until the candidate set is empty; and
f) generating rules from the union of the determined frequent itemsets.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and apparatus for mining data relationships from an integrated database and data-mining system are disclosed. A set of frequent 1-itemsets is generated using a group-by query on data transactions. From these frequent 1-itemsets and the transactions, frequent 2-itemsets are determined. A candidate set of (n+2)-itemsets are generated from the frequent 2-itemsets, where n=1. Frequent (n+2)-itemsets are determined from candidate set and the transaction table using a query operation. The candidate set and frequent (n+2)-itemset are generated for (n+1) until the candidate set is empty. Rules are then extracted from the union of the determined frequent itemsets.
-
Citations
33 Claims
-
1. A method for mining rules from an integrated database and data-mining system having a table of data transactions and a query engine, the method comprising the steps of:
-
a) performing a group-by query on the transaction table to generate a set of frequent 1-itemsets;
b) determining frequent 2-itemsets from the frequent 1-itemsets and the transaction table;
c) generating a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;
d) determining frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query operation;
e) repeating steps (c) and (d) with n=n+1 until the candidate set is empty; and
f) generating rules from the union of the determined frequent itemsets. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
each transaction corresponds to a plurality of items; and
the step of performing a group-by query includes the steps of counting the number of transactions that contain each item and selecting the items that have a support above a user-specified threshold in determining the frequent 1-itemsets, the support of an item being the number of transactions that contain the item.
-
-
3. The method as recited in claim 1, wherein the frequent 2-itemsets are generated using a join query.
-
4. The method as recited in claim 3, wherein the step of determining frequent 2-itemsets includes the steps of:
-
joining an 1-itemset with itself and two copies of the transaction table using join predicates;
grouping results of the joining step for a pair of items to count the support of the items in the pair; and
removing all 2-itemsets having a support below a specified threshold.
-
-
5. The method as recited in claim 1, wherein the candidate (n+2)-itemsets are generated using an (n+2)-way join operation.
-
6. The method as recited in claim 1, wherein the frequent (n+2)-itemsets are determined using an (n+3)-way join query, on the candidate itemsets and (n+2) copies of the transaction table.
-
7. The method as recited in claim 1, wherein:
-
the frequent (n+2)-itemsets are determined using cascaded subqueries; and
the step of determining frequent (n+2)-itemsets includes the steps of;
selecting distinct first items in the candidate itemsets using a subquery;
joining the distinct first items with the transaction table;
cascading (n+1) subqueries where each j-th subquery is a three-way join of the result of the last subquery, distinct j items from the candidate itemsets, and the transaction table; and
determining which of the (n+2)-itemsets are frequent using the results of the last subqueries.
-
-
8. The method as recited in claim 1, wherein the step of generating rules includes the steps of:
-
putting all items from the frequent itemsets into a table F;
generating a set of candidate rules from the table F using a table function;
joining the candidate rules with the table F; and
filtering out from the candidate rules those that do not meet a confidence criteria.
-
-
9. The method as recited in claim 1, wherein:
-
the query engine is an object-relational engine; and
the steps of generating 2-itemsets and (n+1)-itemsets include the steps of;
selecting from Vertical, GatherCount, and GatherJoin a method having the best cost for generating the (n+1)-itemsets;
if the Vertical method is selected, then transforming the transaction table into a vertical format; and
generating the frequent (n+1)-itemsets using the selected method.
-
-
10. The method as recited in claim 9 wherein the cost for generating the itemsets is based on statistics about the data transactions and the candidate itemsets.
-
11. The method as recited in claim 1, wherein the query engine is an SQL engine.
-
12. An integrated data-mining system comprising:
-
a database having a table of data transactions;
a query engine coupled to the database;
a query preprocessor for generating queries and input parameters for the query engine;
means for performing a group-by query on the transaction table to generate a set of frequent 1-itemsets;
means for determining frequent 2-itemsets from the frequent 1-itemsets and the transaction table;
means for generating a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;
means for determining frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query having cascaded subqueries;
means for repeatedly generating the candidate set of (n+2)-itemsets and determining frequent (n+2)-itemsets with n=n+1 until the candidate set is empty; and
means for generating rules from the union of the determined frequent itemsets. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
each transaction corresponds to a plurality of items; and
the means for performing a group-by query includes means for counting the number of transactions that contain each item and selecting the items that have a support above a user-specified threshold in determining the frequent 1-itemsets, the support of an item being the number of transactions that contain the item.
-
-
15. The data-mining system as recited in claim 12, wherein the frequent 2-itemsets are generated using a join query.
-
16. The data-mining system as recited in claim 15, wherein the means for determining frequent 2-itemsets includes:
-
means for joining an 1-itemset with itself and two copies of the transaction table using join predicates;
means for grouping results of the joining step for a pair of items to count the support of the items in the pair; and
means for removing all 2-itemsets having a support below a specified threshold.
-
-
17. The data-mining system as recited in claim 12, wherein the candidate (n+2)-itemsets are generated using an (n+2)-way join operation.
-
18. The data-mining system as recited in claim 12, wherein the frequent (n+2)-itemsets are determined using an (n+3)-way join query, on the candidate itemsets and (n+2) copies of the transaction table.
-
19. The data-mining system as recited in claim 12, wherein:
-
the frequent (n+2)-itemsets are determined using cascaded subqueries; and
the means for determining frequent (n+2)-itemsets includes;
means for selecting distinct first items in the candidate itemsets using a subquery;
means for joining the distinct first items with the transaction table;
means for cascading (n+1) subqueries where each j-th subquery is a three-way join of the result of the last subquery, distinct j items from the candidate itemsets, and the transaction table; and
means for determining which of the (n+2)-itemsets are frequent using the results of the last subqueries.
-
-
20. The data-mining system as recited in claim 12, wherein the means for generating rules includes:
-
means for putting all items from the frequent itemsets into a table F;
means for generating a set of candidate rules from the table F using a table function;
means for joining the candidate rules with the table F; and
means for filtering out from the candidate rules those that do not meet a confidence criteria.
-
-
21. The data-mining system as recited in claim 12, wherein:
-
the query engine is an object-relational engine; and
the means for generating 2-itemsets and (n+1)-itemsets include;
means for selecting from Vertical, GatherCount, and GatherJoin a method having the best cost for generating the (n+1)-itemsets;
if the Vertical method is selected, then means for transforming the transaction table into a vertical format; and
means for generating the frequent (n+1)-itemsets using the selected method.
-
-
22. The data-mining system as recited in claim 21, wherein the cost for generating the itemsets is based on statistics about the data transactions and the candidate itemsets.
-
23. The data-mining system as recited in claim 12, wherein the query engine is an SQL engine.
-
13. A computer program product for mining rules from an integrated database and data-mining system having a table of data transactions and a query engine, the product comprising:
-
a) means, recorded on the recording medium, for directing the system to perform a group-by query on the transaction table to generate a set of frequent 1-itemsets;
b) means, recorded on the recording medium, for directing the system to determine frequent 2-itemsets from the frequent 1-itemsets and the transaction table;
c) means, recorded on the recording medium, for directing the system to generate a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;
d) means, recorded on the recording medium, for directing the system to determine frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query operation;
e) means, recorded on the recording medium, for directing the system to repeatedly generate the candidate set of (n+2)-itemsets and determine frequent (n+2)-itemsets with n=n+1 until the candidate set is empty; and
f) means, recorded on the recording medium, for directing the system to generate rules from the union of the determined frequent itemsets. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
each transaction corresponds to a plurality of items; and
the means for directing to perform a group-by query includes means, recorded on the recording medium, for directing the system to count the number of transactions that contain each item and selecting the items that have a support above a user-specified threshold in determining the frequent 1-itemsets, the support of an item being the number of transactions that contain the item.
-
-
25. The product as recited in claim 13, wherein the frequent 2-itemsets are generated using a join query.
-
26. The product as recited in claim 25, wherein the means for directing to determine frequent 2-itemsets includes:
-
means, recorded on the recording medium, for directing the system to join an 1-itemset with itself and two copies of the transaction table using join predicates;
means, recorded on the recording medium, for directing the system to group results of the joining step for a pair of items to count the support of the items in the pair; and
means, recorded on the recording medium, for directing the system to remove all 2-itemsets having a support below a specified threshold.
-
-
27. The product as recited in claim 13, wherein the candidate (n+2)-itemsets are generated using an (n+2)-way join operation.
-
28. The product as recited in claim 13, wherein the frequent (n+2)-itemsets are determined using an (n+3)-way join query, on the candidate itemsets and (n+2) copies of the transaction table.
-
29. The product as recited in claim 13, wherein:
-
the frequent (n+2)-itemsets are determined using cascaded subqueries; and
the means for directing to determine frequent (n+2)-itemsets includes;
means, recorded on the recording medium, for directing the system to select distinct first items in the candidate itemsets using a subquery;
means, recorded on the recording medium, for directing the system to join the distinct first items with the transaction table;
means, recorded on the recording medium, for directing the system to cascade (n+1) subqueries where each j-th subquery is a three-way join of the result of the last subquery, distinct j items from the candidate itemsets, and the transaction table; and
means, recorded on the recording medium, for directing the system to determine which of the (n+2)-itemsets are frequent using the results of the last subqueries.
-
-
30. The product as recited in claim 13, wherein the means for directing to generate rules includes:
-
means, recorded on the recording medium, for directing the system to put all items from the frequent itemsets into a table F;
means, recorded on the recording medium, for directing the system to generate a set of candidate rules from the table F using a table function;
means, recorded on the recording medium, for directing the system to join the candidate rules with the table F; and
means, recorded on the recording medium, for directing the system to filter out from the candidate rules those that do not meet a confidence criteria.
-
-
31. The product as recited in claim 13, wherein:
-
the query engine is an object-relational engine; and
the means for directing the system to generate 2-itemsets and (n+1)-itemsets include;
means, recorded on the recording medium, for directing the system to select from Vertical, GatherCount, and GatherJoin a method having the best cost for generating the (n+1)-itemsets;
if the Vertical method is selected, then means, recorded on the recording medium, for directing the system to transform the transaction table into a vertical format; and
means, recorded on the recording medium, for directing the system to generate the frequent (n+1)-itemsets using the selected method.
-
-
32. The product as recited in claim 31, wherein the cost for generating the itemsets is based on statistics about the data transactions and the candidate itemsets.
-
33. The product as recited in claim 13, wherein the query engine is an SQL engine.
Specification