Methods and systems for processing large graphs using density-based processes using map-reduce
First Claim
1. A computer-implemented method for processing a graph comprising graph data in a network comprising a plurality of individual processor-based machines, the method comprising:
- partitioning the graph into a plurality of partitions wherein the number of the plurality of partitions is a function of the number of the plurality of machines;
assigning each partition to a respective machine of the plurality of machines by applying a splitting function to the graph data;
applying a density-based filter to each partition to produce a plurality of partitions of a filtered graph, the density-based filter applies one or more density constraints to each partition such that vertices with at least a minimum number of neighboring nodes within a distance defined by the density constraints include nodes with edges pointing to neighboring nodes and vertices that do not have the minimum number of neighboring nodes within the distance defined by the density constraints do not include nodes with edges pointing to neighboring nodes;
applying a partial connectivity detector process that detects connectivity between nodes to each of the partitions of the filtered graph to produce sub-clusters of nodes of the filtered graph based on the edges between the nodes; and
merging the sub-clusters of nodes through a message based merge process.
2 Assignments
0 Petitions
Accused Products
Abstract
Embodiments are directed to a density-based clustering algorithm that decomposes and reformulates the DBSCAN algorithm to facilitate its performance on the Map-Reduce model. The DBSCAN algorithm is reformulated into connectivity problem using a density filter method and a partial connectivity detector. The density-based clustering algorithm uses message passing and edge adding to increase the speed of result merging, it also uses message mining techniques to further decrease the number of iterations to process the input graph. The algorithm is scalable, and can be accelerated by using more machines in a distributed computer network implementing the Map-Reduce program.
126 Citations
18 Claims
-
1. A computer-implemented method for processing a graph comprising graph data in a network comprising a plurality of individual processor-based machines, the method comprising:
-
partitioning the graph into a plurality of partitions wherein the number of the plurality of partitions is a function of the number of the plurality of machines; assigning each partition to a respective machine of the plurality of machines by applying a splitting function to the graph data; applying a density-based filter to each partition to produce a plurality of partitions of a filtered graph, the density-based filter applies one or more density constraints to each partition such that vertices with at least a minimum number of neighboring nodes within a distance defined by the density constraints include nodes with edges pointing to neighboring nodes and vertices that do not have the minimum number of neighboring nodes within the distance defined by the density constraints do not include nodes with edges pointing to neighboring nodes; applying a partial connectivity detector process that detects connectivity between nodes to each of the partitions of the filtered graph to produce sub-clusters of nodes of the filtered graph based on the edges between the nodes; and merging the sub-clusters of nodes through a message based merge process. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. A system for identifying clusters in a large graph using a network of coupled machines, each machine executing a Map-Reduce process, the system comprising:
-
a processor; and one or more stored sequences of instructions which, when executed by the processor, cause the processor to carry out the steps of; a splitting function component partitioning the graph into a plurality of partitions by applying a splitting function to the graph and assigning each partition to a respective machine of the plurality of machines; applying a density-based filter applied to each partition to produce a plurality of partitions of a filtered graph, the density-based filter applies one or more density constraints to each partition such that vertices with at least a minimum number of neighboring nodes within a distance defined by the density constraints include nodes with edges pointing to neighboring nodes and vertices that do not have the minimum number of neighboring nodes within the distance defined by the density constraints do not include nodes with edges pointing to neighboring nodes; applying a partial connectivity detector applied to each of the partitions of the filtered graph to produce sub-clusters of nodes of the filtered graph based on the edges between the nodes; and a merger merging the sub-clusters of nodes through a message based merge process. - View Dependent Claims (13, 14, 15, 16, 17)
-
-
18. A non-volatile, non-transitory machine-readable medium containing one or more sequences of instructions for controlling access to an application program on a computer network, the instructions configured to cause a processor to:
-
partition a graph to be processed by a network of coupled computers into a plurality of partitions wherein the number of the plurality of partitions is a function of the number of the plurality of machines; assign each partition to a respective computer of a plurality of computers comprising the network by applying a splitting function to the graph; apply a density-based filter to each partition to produce a plurality of partitions of a filtered graph, the density-based filter applies one or more density constraints to each partition such that vertices with at least a minimum number of neighboring nodes within a distance defined by the density constraints include nodes with edges pointing to neighboring nodes and vertices that do not have the minimum number of neighboring nodes within the distance defined by the density constraints do not include nodes with edges pointing to neighboring nodes; apply a partial connectivity detector process that detects connectivity between nodes to each of the partitions of the filtered graph to produce sub-clusters of nodes of the filtered graph based on the edges between the nodes; and merge the sub-clusters of nodes through a message based merge process.
-
Specification