Parallel classification for data mining in a shared-memory multiprocessor system

US 6,230,151 B1
Filed: 04/16/1998
Issued: 05/08/2001
Est. Priority Date: 04/16/1998
Status: Expired due to Fees

First Claim

Patent Images

1. A method for generating a decision-tree classifier in a shared-memory multiprocessor system from a set of records, the tree having a plurality of nodes, the method comprising the steps of:

(a) generating cooperatively by the processors, in the shared memory, an attribute list for each attribute of the records, the attribute lists corresponding a current node and including tuples each having information on a record class;

(b) assigning each attribute list of the current node to one of the processors;

(c) each processor accessing the attribute lists assigned to the processor, in the shared memory, to determine a best split for each attribute list;

(d) the processors cooperatively determining, through the shared memory, a global best split for all the attribute lists associated with the current node;

(e) reassigning each attribute list of the current node to one of the processors;

(f) each processor splitting the attribute lists reassigned to the processor according to the global best split into new attribute lists, the new lists corresponding to child nodes of the current node and residing in the shared memory; and

(g) repeating steps (b)-(f) with each newly created child node as the current node, until each attribute list for the newly created child nodes includes only tuples of the same record class.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and system for generating a decision-tree classifier in parallel in a shared-memory multiprocessor system is disclosed. The processors first generate in the shared memory an attribute list for each record attribute. Each attribute list is assigned to a processor. The processors independently determine the best splits for their respective assigned lists, and cooperatively determine a global best split for all attribute lists. The attribute lists are reassigned to the processors and split according to the global best split into the lists for child nodes. The split attribute lists are again assigned to the processors and the process is repeated for each new child node until each attribute list for the new child nodes includes only tuples of the same record class or a fixed number of tuples.

132 Citations

51 Claims

1. A method for generating a decision-tree classifier in a shared-memory multiprocessor system from a set of records, the tree having a plurality of nodes, the method comprising the steps of:
- (a) generating cooperatively by the processors, in the shared memory, an attribute list for each attribute of the records, the attribute lists corresponding a current node and including tuples each having information on a record class;
  
  (b) assigning each attribute list of the current node to one of the processors;
  
  (c) each processor accessing the attribute lists assigned to the processor, in the shared memory, to determine a best split for each attribute list;
  
  (d) the processors cooperatively determining, through the shared memory, a global best split for all the attribute lists associated with the current node;
  
  (e) reassigning each attribute list of the current node to one of the processors;
  
  (f) each processor splitting the attribute lists reassigned to the processor according to the global best split into new attribute lists, the new lists corresponding to child nodes of the current node and residing in the shared memory; and
  
  (g) repeating steps (b)-(f) with each newly created child node as the current node, until each attribute list for the newly created child nodes includes only tuples of the same record class.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
- - 2. The method as recited in claim 1, wherein the steps (b)-(f) are repeated with each newly created child node as the current node, until each attribute list for the newly created child nodes includes no more than a predetermined number of tuples.
  - 3. The method as recited in claim 1, wherein the classifier is a multilevel tree structure, and each level has a plurality of nodes.
  - 4. The method as recited in claim 1, wherein each processor accesses the assigned attribute lists for all of the nodes at the same level of the tree only once in determining the respective best splits.
  - 5. The method as recited in claim 4, wherein a shared variable is used to dynamically synchronize the assignment of the attribute lists to the processors.
  - 6. The method as recited in claim 5, wherein the assignment of the attribute lists to the processors is the same for all of the nodes at the same level of the tree.
  - 7. The method as recited in claim 1, wherein each processor accesses the reassigned attribute lists for all of the nodes at the same level of the tree only once in splitting the reassigned lists.
  - 8. The method as recited in claim 7, wherein a shared variable is used to dynamically synchronize the reassignment of the attribute lists to processors.
  - 9. The method as recited in claim 8, wherein the reassignment of the attribute lists to processors is the same for all of the nodes at the same level of the tree.
  - 10. The method as recited in claim 1, further comprising the step of building a data structure to determine whether a tuple of an attribute list belongs to a left child attribute list or a right child attribute list.
  - 11. The method as recited in claim 10, wherein:
12. The method as recited in claim 11, wherein a barrier synchronization between two groups of K nodes is used to prevent any two processors from simultaneously accessing a same file.
13. The method as recited in claim 11, wherein K conditional variables are used to prevent any two processors from simultaneously accessing a same file.
14. The method as recited in claim 1, wherein:
- all processors of the system are initially associated with the current node;
  
  the processors to which the attribute lists are assigned and reassigned are selected from the processors associated with the current node; and
  
  when the child nodes are created from the current node, the processors associated with the current node are divided into two subsets, each associated with a child node.
15. The method as recited in claim 14, wherein:
- a processor becomes free when all the nodes associated with the processor can no longer be split; and
  
  the free processor would join the other processors in processing the attributes lists for other nodes.
16. The method as recited in claim 1, further comprising the step of pruning the decision-tree classifier to obtain a more compact classifier.
17. The method as recited in claim 16, wherein the step of pruning is based on a Minimum Description Length model of the classifier.

18. A computer program product for use with a shared-memory multiprocessor system for generating a decision-tree classifier from a set of records, the tree having a plurality of nodes, the product comprising:
- a computer-readable medium;
  
  means, provided on the computer-readable medium, for directing the processors to cooperatively generate in the shared memory an attribute list for each attribute of the records, the attribute lists corresponding a current node and including tuples each having information on a record class;
  
  means, provided on the computer-readable medium, for directing the system to assign each attribute list of the current node to one of the processors;
  
  means, provided on the computer-readable medium, for directing each processor to access the attribute lists assigned to the processor, in the shared memory, and determine a best split for each attribute list;
  
  means, provided on the computer-readable medium, for directing the processors to cooperatively determine, through the shared memory, a global best split for all the attribute lists associated with the current node;
  
  means, provided on the computer-readable medium, for directing the processors to reassign each attribute list of the current node to one of the processors;
  
  means, provided on the computer-readable medium, for directing each processor to split the attribute lists reassigned to the processor according to the global best split into new attribute lists, the new lists corresponding to child nodes of the current node and residing in the shared memory; and
  
  means provided on the computer-readable medium, for directing the processors to process each newly created child node as the current node, until each attribute list for the newly child nodes includes only tuples of the same record class.
- View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34)
- - 19. The computer program product as recited in claim 18, wherein the each newly created child node is processed as the current node until each attribute list for the newly created child nodes includes no more than a predetermined number of tuples.
  - 20. The computer program product as recited in claim 18, wherein the classifier is a multilevel tree structure, and each level has a plurality of nodes.
  - 21. The computer program product as recited in claim 18, wherein each processor accesses the respective assigned attribute lists for all of the nodes at the same level of the tree only once in determining the respective best splits.
  - 22. The computer program product as recited in claim 21, wherein a shared variable is used to dynamically synchronize the assignment of the attribute lists to the processors.
  - 23. The computer program product as recited in claim 22, wherein the assignment of the attribute lists to the processors is the same for all of the nodes at the same level of the tree.
  - 24. The computer program product as recited in claim 18, wherein each processor accesses the reassigned attribute lists for all of the nodes at the same level of the tree only once in splitting the reassigned lists.
  - 25. The computer program product as recited in claim 24, wherein a shared variable is used to dynamically synchronize the reassignment of the attribute lists to processors.
  - 26. The computer program product as recited in claim 25, wherein the reassignment of the attribute lists to processors is the same for all of the nodes at the same level of the tree.
  - 27. The computer program product as recited in claim 18, further comprising means, provided on the computer-readable medium, for directing the system to build a data structure to determine whether a tuple of an attribute list belongs to a left child attribute list or a right child attribute list.
  - 28. The computer program product as recited in claim 27, wherein:
29. The computer program product as recited in claim 28, wherein a barrier synchronization between two groups of K nodes is used to prevent any two processors from simultaneously accessing a same file.
30. The computer program product as recited in claim 28, wherein K conditional variables are used to prevent any two processors from simultaneously accessing a same file.
31. The computer program product as recited in claim 18, wherein:
- all processors of the system are initially associated with the current node;
  
  the processors to which the attribute lists are assigned and reassigned are selected from the processors associated with the current node; and
  
  when the child nodes are created from the current node, the processors associated with the current node are divided into two subsets, each associated with a child node.
32. The computer program product as recited in claim 31, wherein:
- a processor becomes free when all the nodes associated with the processor can no longer be split; and
  
  the free processor would join the other processors in processing the attributes lists for other nodes.
33. The computer program product as recited in claim 18, further comprising means, provided on the computer-readable medium, for directing the system to prune the decision-tree classifier to obtain a more compact classifier.
34. The computer program product as recited in claim 33, wherein the pruning is based on a Minimum Description Length model of the classifier.

35. A database system for generating a decision-tree classifier in a shared-memory multiprocessor computer from a set of records, the tree having a plurality of nodes, the system comprising:
- means for generating cooperatively by the processors, in the shared memory, an attribute list for each attribute of the records, the attribute lists corresponding a current node and including tuples each having information on a record class;
  
  means for assigning each attribute list of the current node to one of the processors;
  
  means for accessing the attribute lists in the shared memory, by each processor, to determine a best split for each attribute list;
  
  means for cooperatively determining by the processors, through the shared memory, a global best split for all the attribute lists associated with the current node;
  
  means for reassigning each attribute list of the current node to one of the processors;
  
  means for splitting, by each processor, the reassigned attribute lists according to the global best split into new attribute lists, the new lists corresponding to child nodes of the current node and residing in the shared memory; and
  
  means for processing each newly created child node as the current node, until each attribute list for the newly created child nodes includes only tuples of the same record class.
- View Dependent Claims (36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51)
- - 36. The system as recited in claim 35, wherein each newly created child node is processed as the current node until each attribute list for the newly created child nodes includes no more than a predetermined number of tuples.
  - 37. The system as recited in claim 35, wherein the classifier is a multilevel tree structure, and each level has a plurality of nodes.
  - 38. The system as recited in claim 35, wherein each processor accesses the assigned attribute lists for all of the nodes at the same level of the tree only once in determining the respective best splits.
  - 39. The system as recited in claim 38, wherein a shared variable is used to dynamically synchronize the assignment of the attribute lists to the processors.
  - 40. The system as recited in claim 39, wherein the assignment of the attribute lists to the processors is the same for all of the nodes at the same level of the tree.
  - 41. The system as recited in claim 35, wherein each processor accesses the reassigned attribute lists for all of the nodes at the same level of the tree only once in splitting the reassigned lists.
  - 42. The system as recited in claim 41, wherein a shared variable is used to dynamically synchronize the reassignment of the attribute lists to processors.
  - 43. The system as recited in claim 42, wherein the reassignment of the attribute lists to processors is the same for all of the nodes at the same level of the tree.
  - 44. The system as recited in claim 35, further comprising means for building a data structure to determine whether a tuple of an attribute list belongs to a left child attribute list or a right child attribute list.
  - 45. The system as recited in claim 44, wherein the classifier is a multilevel tree structure, each level having a plurality of nodes;
46. The system as recited in claim 45, wherein a barrier synchronization between two groups of K nodes is used to prevent any two processors from simultaneously accessing a same file.
47. The system as recited in claim 45, wherein K conditional variables are used to prevent any two processors from simultaneously accessing a same file.
48. The system as recited in claim 35, wherein:
- all processors of the computer are initially associated with the current node;
  
  the processors to which the attribute lists are assigned and reassigned are selected from the processors associated with the current node; and
  
  when the child nodes are created from the current node, the processors associated with the current node are divided into two subsets, each associated with a child node.
49. The system as recited in claim 48, wherein:
- a processor becomes free when all the nodes associated with the processor can no longer be split; and
  
  the free processor would join the other processors in processing the attributes lists for other nodes.
50. The system as recited in claim 35, further comprising means for pruning the decision-tree classifier to obtain a more compact classifier.
51. The system as recited in claim 50 wherein the means for pruning is based on a Minimum Description Length model of the classifier.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Ho, Ching-Tien, Zaki, Mohammed J., Agrawal, Rakesh
Primary Examiner(s)
Lintz, Paul R.
Assistant Examiner(s)
Khatri, Anil

Application Number

US09/061,808
Time in Patent Office

1,118 Days
Field of Search

395/706, 705/6, 706/49, 706/50, 706/12, 707/1, 707/101
US Class Current

706/12
CPC Class Codes

G06F 16/285   Clustering or classification

G06F 2216/03   Data mining

G06F 8/453   Data distribution

G06N 20/00   Machine learning

G06N 5/01   Dynamic search techniques; ...

Y10S 707/99942   Manipulating data structure...

Parallel classification for data mining in a shared-memory multiprocessor system

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

132 Citations

51 Claims

Specification

Solutions

Use Cases

Quick Links

Parallel classification for data mining in a shared-memory multiprocessor system

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

132 Citations

51 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links