Integrated database and data-mining system

US 6,324,533 B1
Filed: 05/29/1998
Issued: 11/27/2001
Est. Priority Date: 05/29/1998
Status: Expired due to Fees

First Claim

Patent Images

1. A method for mining rules from an integrated database and data-mining system having a table of data transactions and a query engine, the method comprising the steps of:

a) performing a group-by query on the transaction table to generate a set of frequent 1-itemsets;

b) determining frequent 2-itemsets from the frequent 1-itemsets and the transaction table;

c) generating a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;

d) determining frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query operation;

e) repeating steps (c) and (d) with n=n+1 until the candidate set is empty; and

f) generating rules from the union of the determined frequent itemsets.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and apparatus for mining data relationships from an integrated database and data-mining system are disclosed. A set of frequent 1-itemsets is generated using a group-by query on data transactions. From these frequent 1-itemsets and the transactions, frequent 2-itemsets are determined. A candidate set of (n+2)-itemsets are generated from the frequent 2-itemsets, where n=1. Frequent (n+2)-itemsets are determined from candidate set and the transaction table using a query operation. The candidate set and frequent (n+2)-itemset are generated for (n+1) until the candidate set is empty. Rules are then extracted from the union of the determined frequent itemsets.

Citations

33 Claims

1. A method for mining rules from an integrated database and data-mining system having a table of data transactions and a query engine, the method comprising the steps of:
- a) performing a group-by query on the transaction table to generate a set of frequent 1-itemsets;
  
  b) determining frequent 2-itemsets from the frequent 1-itemsets and the transaction table;
  
  c) generating a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;
  
  d) determining frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query operation;
  
  e) repeating steps (c) and (d) with n=n+1 until the candidate set is empty; and
  
  f) generating rules from the union of the determined frequent itemsets.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The method as recited in claim 1, wherein:
3. The method as recited in claim 1, wherein the frequent 2-itemsets are generated using a join query.
4. The method as recited in claim 3, wherein the step of determining frequent 2-itemsets includes the steps of:
- joining an 1-itemset with itself and two copies of the transaction table using join predicates;
  
  grouping results of the joining step for a pair of items to count the support of the items in the pair; and
  
  removing all 2-itemsets having a support below a specified threshold.
5. The method as recited in claim 1, wherein the candidate (n+2)-itemsets are generated using an (n+2)-way join operation.
6. The method as recited in claim 1, wherein the frequent (n+2)-itemsets are determined using an (n+3)-way join query, on the candidate itemsets and (n+2) copies of the transaction table.
7. The method as recited in claim 1, wherein:
- the frequent (n+2)-itemsets are determined using cascaded subqueries; and
  
  the step of determining frequent (n+2)-itemsets includes the steps of;
  
  selecting distinct first items in the candidate itemsets using a subquery;
  
  joining the distinct first items with the transaction table;
  
  cascading (n+1) subqueries where each j-th subquery is a three-way join of the result of the last subquery, distinct j items from the candidate itemsets, and the transaction table; and
  
  determining which of the (n+2)-itemsets are frequent using the results of the last subqueries.
8. The method as recited in claim 1, wherein the step of generating rules includes the steps of:
- putting all items from the frequent itemsets into a table F;
  
  generating a set of candidate rules from the table F using a table function;
  
  joining the candidate rules with the table F; and
  
  filtering out from the candidate rules those that do not meet a confidence criteria.
9. The method as recited in claim 1, wherein:
- the query engine is an object-relational engine; and
  
  the steps of generating 2-itemsets and (n+1)-itemsets include the steps of;
  
  selecting from Vertical, GatherCount, and GatherJoin a method having the best cost for generating the (n+1)-itemsets;
  
  if the Vertical method is selected, then transforming the transaction table into a vertical format; and
  
  generating the frequent (n+1)-itemsets using the selected method.
10. The method as recited in claim 9 wherein the cost for generating the itemsets is based on statistics about the data transactions and the candidate itemsets.
11. The method as recited in claim 1, wherein the query engine is an SQL engine.

12. An integrated data-mining system comprising:
- a database having a table of data transactions;
  
  a query engine coupled to the database;
  
  a query preprocessor for generating queries and input parameters for the query engine;
  
  means for performing a group-by query on the transaction table to generate a set of frequent 1-itemsets;
  
  means for determining frequent 2-itemsets from the frequent 1-itemsets and the transaction table;
  
  means for generating a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;
  
  means for determining frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query having cascaded subqueries;
  
  means for repeatedly generating the candidate set of (n+2)-itemsets and determining frequent (n+2)-itemsets with n=n+1 until the candidate set is empty; and
  
  means for generating rules from the union of the determined frequent itemsets.
- View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
- - 14. The data-mining system as recited in claim 12, wherein:
15. The data-mining system as recited in claim 12, wherein the frequent 2-itemsets are generated using a join query.
16. The data-mining system as recited in claim 15, wherein the means for determining frequent 2-itemsets includes:
- means for joining an 1-itemset with itself and two copies of the transaction table using join predicates;
  
  means for grouping results of the joining step for a pair of items to count the support of the items in the pair; and
  
  means for removing all 2-itemsets having a support below a specified threshold.
17. The data-mining system as recited in claim 12, wherein the candidate (n+2)-itemsets are generated using an (n+2)-way join operation.
18. The data-mining system as recited in claim 12, wherein the frequent (n+2)-itemsets are determined using an (n+3)-way join query, on the candidate itemsets and (n+2) copies of the transaction table.
19. The data-mining system as recited in claim 12, wherein:
- the frequent (n+2)-itemsets are determined using cascaded subqueries; and
  
  the means for determining frequent (n+2)-itemsets includes;
  
  means for selecting distinct first items in the candidate itemsets using a subquery;
  
  means for joining the distinct first items with the transaction table;
  
  means for cascading (n+1) subqueries where each j-th subquery is a three-way join of the result of the last subquery, distinct j items from the candidate itemsets, and the transaction table; and
  
  means for determining which of the (n+2)-itemsets are frequent using the results of the last subqueries.
20. The data-mining system as recited in claim 12, wherein the means for generating rules includes:
- means for putting all items from the frequent itemsets into a table F;
  
  means for generating a set of candidate rules from the table F using a table function;
  
  means for joining the candidate rules with the table F; and
  
  means for filtering out from the candidate rules those that do not meet a confidence criteria.
21. The data-mining system as recited in claim 12, wherein:
- the query engine is an object-relational engine; and
  
  the means for generating 2-itemsets and (n+1)-itemsets include;
  
  means for selecting from Vertical, GatherCount, and GatherJoin a method having the best cost for generating the (n+1)-itemsets;
  
  if the Vertical method is selected, then means for transforming the transaction table into a vertical format; and
  
  means for generating the frequent (n+1)-itemsets using the selected method.
22. The data-mining system as recited in claim 21, wherein the cost for generating the itemsets is based on statistics about the data transactions and the candidate itemsets.
23. The data-mining system as recited in claim 12, wherein the query engine is an SQL engine.

13. A computer program product for mining rules from an integrated database and data-mining system having a table of data transactions and a query engine, the product comprising:
- a) means, recorded on the recording medium, for directing the system to perform a group-by query on the transaction table to generate a set of frequent 1-itemsets;
  
  b) means, recorded on the recording medium, for directing the system to determine frequent 2-itemsets from the frequent 1-itemsets and the transaction table;
  
  c) means, recorded on the recording medium, for directing the system to generate a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1;
  
  d) means, recorded on the recording medium, for directing the system to determine frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query operation;
  
  e) means, recorded on the recording medium, for directing the system to repeatedly generate the candidate set of (n+2)-itemsets and determine frequent (n+2)-itemsets with n=n+1 until the candidate set is empty; and
  
  f) means, recorded on the recording medium, for directing the system to generate rules from the union of the determined frequent itemsets.
- View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
- - 24. The product as recited in claim 13, wherein:
25. The product as recited in claim 13, wherein the frequent 2-itemsets are generated using a join query.
26. The product as recited in claim 25, wherein the means for directing to determine frequent 2-itemsets includes:
- means, recorded on the recording medium, for directing the system to join an 1-itemset with itself and two copies of the transaction table using join predicates;
  
  means, recorded on the recording medium, for directing the system to group results of the joining step for a pair of items to count the support of the items in the pair; and
  
  means, recorded on the recording medium, for directing the system to remove all 2-itemsets having a support below a specified threshold.
27. The product as recited in claim 13, wherein the candidate (n+2)-itemsets are generated using an (n+2)-way join operation.
28. The product as recited in claim 13, wherein the frequent (n+2)-itemsets are determined using an (n+3)-way join query, on the candidate itemsets and (n+2) copies of the transaction table.
29. The product as recited in claim 13, wherein:
- the frequent (n+2)-itemsets are determined using cascaded subqueries; and
  
  the means for directing to determine frequent (n+2)-itemsets includes;
  
  means, recorded on the recording medium, for directing the system to select distinct first items in the candidate itemsets using a subquery;
  
  means, recorded on the recording medium, for directing the system to join the distinct first items with the transaction table;
  
  means, recorded on the recording medium, for directing the system to cascade (n+1) subqueries where each j-th subquery is a three-way join of the result of the last subquery, distinct j items from the candidate itemsets, and the transaction table; and
  
  means, recorded on the recording medium, for directing the system to determine which of the (n+2)-itemsets are frequent using the results of the last subqueries.
30. The product as recited in claim 13, wherein the means for directing to generate rules includes:
- means, recorded on the recording medium, for directing the system to put all items from the frequent itemsets into a table F;
  
  means, recorded on the recording medium, for directing the system to generate a set of candidate rules from the table F using a table function;
  
  means, recorded on the recording medium, for directing the system to join the candidate rules with the table F; and
  
  means, recorded on the recording medium, for directing the system to filter out from the candidate rules those that do not meet a confidence criteria.
31. The product as recited in claim 13, wherein:
- the query engine is an object-relational engine; and
  
  the means for directing the system to generate 2-itemsets and (n+1)-itemsets include;
  
  means, recorded on the recording medium, for directing the system to select from Vertical, GatherCount, and GatherJoin a method having the best cost for generating the (n+1)-itemsets;
  
  if the Vertical method is selected, then means, recorded on the recording medium, for directing the system to transform the transaction table into a vertical format; and
  
  means, recorded on the recording medium, for directing the system to generate the frequent (n+1)-itemsets using the selected method.
32. The product as recited in claim 31, wherein the cost for generating the itemsets is based on statistics about the data transactions and the candidate itemsets.
33. The product as recited in claim 13, wherein the query engine is an SQL engine.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Thomas, Shiby, Agrawal, Rakesh, Sarawagi, Sunita
Primary Examiner(s)
Black, Thomas G.
Assistant Examiner(s)
LE, UYEN T

Application Number

US09/087,561
Time in Patent Office

1,278 Days
Field of Search

707/1, 707/3, 707/4, 707/6, 705/10, 706/47, 706/46
US Class Current

1/1
CPC Class Codes

G06F 16/2465   Query processing support fo...

G06F 2216/03   Data mining

Y10S 707/99933   Query processing, i.e. sear...

Integrated database and data-mining system

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

33 Claims

Specification

Solutions

Use Cases

Quick Links

Integrated database and data-mining system

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

33 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links