Parallel classification for data mining in a shared-memory multiprocessor system
Abstract
A method and system for generating a decision-tree classifier in parallel in a shared-memory multiprocessor system is disclosed. The processors first cooperatively generate, in the shared memory, an attribute list for each record attribute. Each attribute list is assigned to a processor. The processors independently determine the best splits for their respective assigned lists, and then cooperatively determine a global best split over all attribute lists. The attribute lists are reassigned to the processors and split according to the global best split into lists for the child nodes. The process is repeated for each new child node until every attribute list for the new child nodes includes only tuples of the same record class, or no more than a fixed number of tuples.
132 Citations
51 Claims
1. A method for generating a decision-tree classifier in a shared-memory multiprocessor system from a set of records, the tree having a plurality of nodes, the method comprising the steps of:
(a) generating cooperatively by the processors, in the shared memory, an attribute list for each attribute of the records, the attribute lists corresponding to a current node and including tuples each having information on a record class;
(b) assigning each attribute list of the current node to one of the processors;
(c) each processor accessing the attribute lists assigned to the processor, in the shared memory, to determine a best split for each attribute list;
(d) the processors cooperatively determining, through the shared memory, a global best split for all the attribute lists associated with the current node;
(e) reassigning each attribute list of the current node to one of the processors;
(f) each processor splitting the attribute lists reassigned to the processor according to the global best split into new attribute lists, the new lists corresponding to child nodes of the current node and residing in the shared memory; and
(g) repeating steps (b)-(f) with each newly created child node as the current node, until each attribute list for the newly created child nodes includes only tuples of the same record class.
Dependent claims: 2-17.
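For illustration only (not part of the claims), the loop of steps (a)-(g) can be sketched in Python. All names are hypothetical, a thread pool stands in for the shared-memory processors, and the Gini impurity is an assumed split criterion; the claims do not fix how the "best split" is scored.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def gini(counts):
    # Gini impurity of a class-count map (assumed criterion).
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_split(attr_id, attr_list):
    # Step (c): scan one value-sorted attribute list of
    # (value, class, record_id) tuples for its best binary split.
    total = len(attr_list)
    below, above = Counter(), Counter(cls for _, cls, _ in attr_list)
    best = (float("inf"), None)
    for i in range(total - 1):
        val, cls, _ = attr_list[i]
        below[cls] += 1
        above[cls] -= 1
        if val == attr_list[i + 1][0]:
            continue  # a split can only fall between distinct values
        w = ((i + 1) * gini(below) + (total - i - 1) * gini(above)) / total
        if w < best[0]:
            best = (w, (val + attr_list[i + 1][0]) / 2)
    return best[0], attr_id, best[1]

def build(attr_lists, pool):
    # Step (g): stop when the node holds only tuples of one class.
    classes = {cls for _, cls, _ in attr_lists[0]}
    if len(classes) == 1:
        return ("leaf", classes.pop())
    # Steps (b)-(c): one task per attribute list; the pool's threads all
    # read the shared lists, standing in for processors sharing memory.
    results = pool.map(lambda a: best_split(*a), list(enumerate(attr_lists)))
    # Step (d): the global best split is the minimum over all attributes.
    _, attr, thr = min(results)
    if thr is None:  # no usable split: fall back to a majority leaf
        return ("leaf", Counter(c for _, c, _ in attr_lists[0]).most_common(1)[0][0])
    # Steps (e)-(f): partition every attribute list by the winning split,
    # using record ids to keep the lists consistent across attributes.
    left_ids = {rid for val, _, rid in attr_lists[attr] if val <= thr}
    left = [[t for t in al if t[2] in left_ids] for al in attr_lists]
    right = [[t for t in al if t[2] not in left_ids] for al in attr_lists]
    return ("node", attr, thr, build(left, pool), build(right, pool))

# Step (a) on a toy record set: one sorted list per attribute.
records = [(2.0, 10.0, "A"), (3.0, 9.0, "A"), (7.0, 2.0, "B"), (8.0, 1.0, "B")]
attr_lists = [sorted((r[a], r[2], i) for i, r in enumerate(records)) for a in (0, 1)]
with ThreadPoolExecutor(max_workers=2) as pool:
    tree = build(attr_lists, pool)
```

On this toy data the first attribute separates the classes cleanly at 5.0, so the tree is a single split with two pure leaves.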
11. The method as recited in claim 1, wherein:
the classifier is a multilevel tree structure, each level having a plurality of nodes;
the method further comprises the step of creating, for every K consecutive nodes at the same level of the tree, K different files for storing the attribute lists of the K nodes; and
the determination of the best splits for the respective attribute lists of the K nodes and the building of the data structure for the K nodes are done in parallel by each processor.
12. The method as recited in claim 11, wherein a barrier synchronization between two groups of K nodes is used to prevent any two processors from simultaneously accessing the same file.
13. The method as recited in claim 11, wherein K conditional variables are used to prevent any two processors from simultaneously accessing the same file.
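Claims 12 and 13 name two alternative coordination schemes for the K per-node files: a barrier between consecutive groups of K nodes, and one condition variable per file. A minimal sketch of both using Python's `threading` module; the batch layout, worker count, and list-backed "files" are all illustrative assumptions, not details from the patent.

```python
import threading

K, WORKERS = 2, 4
barrier = threading.Barrier(WORKERS)                # claim 12's alternative
guards = [threading.Condition() for _ in range(K)]  # claim 13's alternative:
                                                    # one guard per node file
files = {n: [] for n in range(K)}  # stand-ins for the K attribute-list files
trace = []

def worker(wid):
    for group in (0, 1):            # two consecutive groups of K nodes
        node = wid % K              # two workers share each node's file
        with guards[node]:          # exclusive access while writing the file
            files[node].append((group, wid))
        barrier.wait()              # nobody proceeds to the next group of K
        trace.append(group)        # nodes until every worker has finished

threads = [threading.Thread(target=worker, args=(w,)) for w in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the join, every group-0 entry in `trace` precedes every group-1 entry: the barrier guarantees that no processor touches the next batch's files while another is still working on the current batch.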
14. The method as recited in claim 1, wherein:
all processors of the system are initially associated with the current node;
the processors to which the attribute lists are assigned and reassigned are selected from the processors associated with the current node; and
when the child nodes are created from the current node, the processors associated with the current node are divided into two subsets, each associated with a child node.
15. The method as recited in claim 14, wherein:
a processor becomes free when all the nodes associated with the processor can no longer be split; and
the free processor then joins the other processors in processing the attribute lists for other nodes.
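Claims 14 and 15 describe how processors follow the tree down: the current node's processors are divided into two subsets, one per child, and processors whose nodes can no longer be split rejoin the others. A toy sketch of the division step; sizing the subsets in proportion to the child tuple counts is an assumed policy, since claim 14 only requires that two subsets be formed.

```python
def divide_processors(procs, left_tuples, right_tuples):
    # Claim 14: split the current node's processors into two subsets,
    # one per child node.  Proportional sizing is an assumption; the
    # function expects at least two processors in `procs`.
    n_left = round(len(procs) * left_tuples / (left_tuples + right_tuples))
    n_left = min(max(n_left, 1), len(procs) - 1)  # keep both subsets non-empty
    return procs[:n_left], procs[n_left:]
```

For example, four processors over a 75/25 split of the tuples would divide as three processors for the left child and one for the right.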
16. The method as recited in claim 1, further comprising the step of pruning the decision-tree classifier to obtain a more compact classifier.
17. The method as recited in claim 16, wherein the step of pruning is based on a Minimum Description Length model of the classifier.
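Claims 16 and 17 prune the grown tree with a Minimum Description Length model: a subtree is collapsed into a leaf when the code length of the leaf (plus its misclassified tuples) does not exceed the code length of the split test plus its children. The cost model below is a deliberately simplified stand-in, not the patent's exact encoding.

```python
import math

def errs(counts):
    # Misclassifications if the node answers with its majority class.
    return sum(counts.values()) - max(counts.values())

def merged(a, b):
    out = dict(a)
    for c, n in b.items():
        out[c] = out.get(c, 0) + n
    return out

def counts_of(tree):
    if tree[0] == "leaf":
        return tree[1]
    return merged(counts_of(tree[2]), counts_of(tree[3]))

def mdl_prune(tree, n_attrs):
    # Bottom-up MDL pruning: return (pruned subtree, its code length).
    # Assumed encoding: 1 bit for the leaf/node marker, log2(n_attrs)
    # bits for the split attribute, 1 "bit" per misclassified tuple.
    if tree[0] == "leaf":
        return tree, 1 + errs(tree[1])
    left, c_left = mdl_prune(tree[2], n_attrs)
    right, c_right = mdl_prune(tree[3], n_attrs)
    split_cost = 1 + math.log2(n_attrs) + c_left + c_right
    counts = counts_of(tree)
    leaf_cost = 1 + errs(counts)
    if leaf_cost <= split_cost:
        return ("leaf", counts), leaf_cost   # cheaper as a leaf: prune
    return ("node", tree[1], left, right), split_cost
```

Under this encoding, a split that barely separates the classes is cheaper to describe as a single leaf and gets pruned, while a split that cleanly separates them survives.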
18. A computer program product for use with a shared-memory multiprocessor system for generating a decision-tree classifier from a set of records, the tree having a plurality of nodes, the product comprising:
a computer-readable medium;
means, provided on the computer-readable medium, for directing the processors to cooperatively generate in the shared memory an attribute list for each attribute of the records, the attribute lists corresponding to a current node and including tuples each having information on a record class;
means, provided on the computer-readable medium, for directing the system to assign each attribute list of the current node to one of the processors;
means, provided on the computer-readable medium, for directing each processor to access the attribute lists assigned to the processor, in the shared memory, and determine a best split for each attribute list;
means, provided on the computer-readable medium, for directing the processors to cooperatively determine, through the shared memory, a global best split for all the attribute lists associated with the current node;
means, provided on the computer-readable medium, for directing the processors to reassign each attribute list of the current node to one of the processors;
means, provided on the computer-readable medium, for directing each processor to split the attribute lists reassigned to the processor according to the global best split into new attribute lists, the new lists corresponding to child nodes of the current node and residing in the shared memory; and
means, provided on the computer-readable medium, for directing the processors to process each newly created child node as the current node, until each attribute list for the newly created child nodes includes only tuples of the same record class.
Dependent claims: 19-34.
28. The computer program product as recited in claim 18, wherein:
the classifier is a multilevel tree structure, each level having a plurality of nodes;
the computer program product further comprises means, provided on the computer-readable medium, for directing the system to create, for every K consecutive nodes at the same level of the tree, K different files for storing the attribute lists of the K nodes; and
the determination of the best splits for the respective attribute lists of the K nodes and the building of the data structure for the K nodes are done in parallel by each processor.
29. The computer program product as recited in claim 28, wherein a barrier synchronization between two groups of K nodes is used to prevent any two processors from simultaneously accessing the same file.
30. The computer program product as recited in claim 28, wherein K conditional variables are used to prevent any two processors from simultaneously accessing the same file.
31. The computer program product as recited in claim 18, wherein:
all processors of the system are initially associated with the current node;
the processors to which the attribute lists are assigned and reassigned are selected from the processors associated with the current node; and
when the child nodes are created from the current node, the processors associated with the current node are divided into two subsets, each associated with a child node.
32. The computer program product as recited in claim 31, wherein:
a processor becomes free when all the nodes associated with the processor can no longer be split; and
the free processor then joins the other processors in processing the attribute lists for other nodes.
33. The computer program product as recited in claim 18, further comprising means, provided on the computer-readable medium, for directing the system to prune the decision-tree classifier to obtain a more compact classifier.
34. The computer program product as recited in claim 33, wherein the pruning is based on a Minimum Description Length model of the classifier.
35. A database system for generating a decision-tree classifier in a shared-memory multiprocessor computer from a set of records, the tree having a plurality of nodes, the system comprising:
means for generating cooperatively by the processors, in the shared memory, an attribute list for each attribute of the records, the attribute lists corresponding to a current node and including tuples each having information on a record class;
means for assigning each attribute list of the current node to one of the processors;
means for accessing the attribute lists in the shared memory, by each processor, to determine a best split for each attribute list;
means for cooperatively determining by the processors, through the shared memory, a global best split for all the attribute lists associated with the current node;
means for reassigning each attribute list of the current node to one of the processors;
means for splitting, by each processor, the reassigned attribute lists according to the global best split into new attribute lists, the new lists corresponding to child nodes of the current node and residing in the shared memory; and
means for processing each newly created child node as the current node, until each attribute list for the newly created child nodes includes only tuples of the same record class.
Dependent claims: 36-51.
45. The system as recited in claim 35, wherein:
the system further comprises means for creating, for every K consecutive nodes at the same level of the tree, K different files for storing the attribute lists of the K nodes; and
the determination of the best splits for the respective attribute lists of the K nodes and the building of the data structure for the K nodes are done in parallel by each processor.
46. The system as recited in claim 45, wherein a barrier synchronization between two groups of K nodes is used to prevent any two processors from simultaneously accessing the same file.
47. The system as recited in claim 45, wherein K conditional variables are used to prevent any two processors from simultaneously accessing the same file.
48. The system as recited in claim 35, wherein:
all processors of the computer are initially associated with the current node;
the processors to which the attribute lists are assigned and reassigned are selected from the processors associated with the current node; and
when the child nodes are created from the current node, the processors associated with the current node are divided into two subsets, each associated with a child node.
49. The system as recited in claim 48, wherein:
a processor becomes free when all the nodes associated with the processor can no longer be split; and
the free processor then joins the other processors in processing the attribute lists for other nodes.
50. The system as recited in claim 35, further comprising means for pruning the decision-tree classifier to obtain a more compact classifier.
51. The system as recited in claim 50, wherein the means for pruning is based on a Minimum Description Length model of the classifier.
Specification