Use of the UNPIVOT relational operator in the efficient gathering of sufficient statistics for data mining
First Claim
1. A method of generating a tabulation of occurrences of one or more class attributes of database records as a combination of other attributes and attribute values in the database records comprising the steps ofa) scanning the database records;
- b) for each scanned database record, reformatting the data within said record from a first record format that includes attribute values for one or more attribute names and values for a selected one or more class attributes into a second record format with records that include combinations of attribute names, attribute values and the values for the selected one or more class attributes; and
c) grouping combinations of data that are scanned from the database to build statistics for the one or more class attributes.
2 Assignments
0 Petitions
Accused Products
Abstract
The invention concerns a method and apparatus for generating a tabulation of counts of occurrences of value combinations of a set of attributes over a relation consisting of a set of database records. The gathered counts (also referred to as sufficient statistics) of attribute occurrences or correlation counts is most preferably used in building a classification or density estimation model from the database records that can be used to predict some attribute values based on other attribute values. A new SQL operator designated the `UNPIVOT` operator operates by scanning the database records and for each record reorganizes that data to form an UNPIVOTED data record that include the combinations of attribute name, attribute value and the values for one or more selected class attributes. The UNPIVOTED table can be used to produce the desired sufficient statistics in one scan of the data using standard database engines. While materialization of UNPIVOTED table would cause a large added scan cost overhead, the UNPIVOT operator allows us to achieve the counts without the added cost by combining the UNPIVOT operator with other SQL `select` and `group by` operators the UNPIVOTED table can be counted without the need for materializing it. The result is a guaranteed one pass algorithm that does not incur the added scan cost factor. The savings in scan cost can extend to several orders of magnitude compared to other methodologies for getting the counts supported by current database engines. The sufficient statistics so gathered can be used to drive a variety of data mining algorithms.
-
Citations
21 Claims
-
1. A method of generating a tabulation of occurrences of one or more class attributes of database records as a combination of other attributes and attribute values in the database records comprising the steps of
a) scanning the database records; -
b) for each scanned database record, reformatting the data within said record from a first record format that includes attribute values for one or more attribute names and values for a selected one or more class attributes into a second record format with records that include combinations of attribute names, attribute values and the values for the selected one or more class attributes; and c) grouping combinations of data that are scanned from the database to build statistics for the one or more class attributes. - View Dependent Claims (2, 3, 4, 7, 8, 11)
-
-
5. A method of generating a tabulation of occurrences of selected one or more class attributes of database records as a combination of other attributes and attribute values in the database records comprising the steps of
a) scanning the database records; -
b) for each scanned database record, reformatting the data within said record into records that include combinations of attribute name, attribute value and the selected one or more class attributes, the SQL language processor processes the UNPIVOT command to reformat the data; and c) grouping combinations of data that are scanned from the database to build statistics for the one or more class attributes, the scanning performed by an SQL language processor that processes an UNPIVOT command the UNPIVOT command is combined with a SELECT and a GROUP BY SQL command to provide a counts table without producing an UNPIVOTED table. - View Dependent Claims (6)
-
-
9. A method of generating a tabulation of occurrences of selected one or more class attributes of database records as a combination of other attributes and attribute values in the database records comprising the steps of
a) scanning the database records, the scanning performed by an SQL language processor that processes an UNPIVOT command; -
b) for each scanned database record, reformatting the data within said record into records that include combinations of attribute name, attribute value and the selected one or more class attributes, the reformatting performed by the SQL language processor that processes the UNPIVOT command; and c) grouping combinations of data that are scanned from the database to build statistics for more than one class attributes and more than one class attribute is evaluated to build a database classifier and wherein a statistical count table is built during the process of building the database classifier by means of multiple invocations of the UNPIVOT operator to data in the database.
-
-
10. A method of generating a tabulation of occurrences of selected one or more class attributes of database records as a combination of other attributes and attribute values in the database records comprising the steps of
a) reorganizing the database records into a specified format by causing an SQL language processor to execute a PIVOT SQL command; -
b) scanning the database records the scanning performed by an SQL language processor that processes an UNPIVOT command; c) for each scanned database record, reformatting the data within said record into records that include combinations of attribute name, attribute value and the selected one or more class attributes, the reformatting performed by the SQL language processor that processes the UNPIVOT command; and d) grouping combinations of data that are scanned from the database to build statistics for one or more class attributes.
-
-
12. A method for use with a database classifier on a computer including a computer storage device for storing the database;
- said database classifier processing statistical data from the records of the database, said method characterized by the steps of;
a) scanning records of a database and reformatting each scanned record into a set of records that are based on the contents of the database record attributes so that each of the database records provides multiple records containing combinations of attribute names with attribute types for that record along with the contents of a classifier variable; b) counting instances of the classifier variable for each of the records in the set of records; c) grouping the results of the counting step into a count table for use by the classifier; and d) scanning the database for other records to iteratively build a complete count table for use by the database classifier. - View Dependent Claims (13, 14)
- said database classifier processing statistical data from the records of the database, said method characterized by the steps of;
-
15. A method for use with a database classifier;
- said database classifier processing statistical data from the records of the database;
said method characterized by the steps of;a) scanning a database record and creating multiple associated records in a temporary store where the contents of each of the associated records contains a combination of a record attribute and an attribute value of the database record and a classifier variable of said database record; b) determining a count of the combinations of record attribute name and attribute values for each value or value range of the classifier variable; and c) scanning other records from the database and determining counts for the other records; and d) preparing a tabulation of the counts that are grouped by attribute name, attribute value and classifier value or value range. - View Dependent Claims (16)
- said database classifier processing statistical data from the records of the database;
-
17. Apparatus for use in creating a database classifier;
- said database classifier processing statistical data from the records of the database;
the apparatus comprising;a) data access means for reading database records from the database; b) memory means for storing temporary data records based on the data in a database record; and c) processor means for creating temporary data records in said memory means where the contents of each of said temporary records contains a combination of an attribute name and an attribute value of the database record for different classifier attribute values of said database;
said processor means including means for determining combinations of attribute name and attribute value for the temporary records; and
preparing a tabulation of the counts that are grouped by attribute name, attribute value and classifier value from the temporary records. - View Dependent Claims (19, 20, 21)
- said database classifier processing statistical data from the records of the database;
-
18. Apparatus for use in creating a database classifier, said classifier processing statistical data from the records of the database;
- the apparatus comprising;
a) data access means for reading database records from the database; b) memory means for temporarily storing data records based on the data in a database record; and c) processor means for creating temporary data records in said memory means where the contents of each of the temporary records contains a combination of an attribute name and an attribute value of the database record for different classifier variable values of said database;
said processor means including access means for accessing multiple records from the database and determining combinations of attribute name and attribute value for the other records; and
preparing a tabulation of the counts that are grouped by attribute name, attribute value and classifier value;
the processor means comprises means for processing SQL commands and wherein an UNPIVOT SQL command reformats the data of a data table which in combination with SQL SELECT and GROUP BY commands creates combinations of attributes and attribute values in a tabulated form in said memory means.
- the apparatus comprising;
Specification