Web graph compression through scalable pattern mining
First Claim
1. A machine-implemented method for compressing a graph including a plurality of nodes and a plurality of links between ones of the plurality of nodes, the machine-implemented method comprising:
- clustering similar nodes of the plurality nodes;
scanning a node list, corresponding to the clustered similar nodes of the plurality of nodes to determine frequencies of occurrence of links of the clustered similar nodes;
sorting a list of the links of the clustered similar nodes based on the determined frequencies of occurrence of the links of the clustered similar nodes to produce a sorted link list;
adding each of the links, having a frequency of occurrence greater than a predetermined number, from the sorted link list as an item of a prefix tree, along with an associated node list;
walking the prefix tree to produce a list of candidate virtual nodes;
determining a respective compression performance for each pattern corresponding to a respective candidate virtual node according to a compression performance formula;
sorting the list of candidate virtual nodes based on results of applying the compression performance formula;
selecting one of the list of candidate virtual nodes as a virtual node based on one of the patterns having a highest determined compression performance;
creating the virtual node; and
substituting links to the virtual node for links corresponding to the one of the patterns, whereinthe machine-implemented method is implemented either on one processing device or on a networked plurality of processing devices.
2 Assignments
0 Petitions
Accused Products
Abstract
A method and a processing device are provided for compressing a web graph including multiple nodes and links between the multiple nodes. Nodes of the web graph may be clustered into groups including no more than a predetermined number of nodes. A list of links of the clustered nodes may be created and sorted based on a frequency of occurrence of each of the links. A prefix tree may be created based on the sorted list of links. The prefix tree may be walked to find candidate virtual nodes. The candidate virtual nodes may be analyzed according to a selection criteria and a virtual node may be selected. The prefix tree may be adjusted to account for the selection of the virtual node and the virtual node may be added to the web graph.
10 Citations
11 Claims
-
1. A machine-implemented method for compressing a graph including a plurality of nodes and a plurality of links between ones of the plurality of nodes, the machine-implemented method comprising:
-
clustering similar nodes of the plurality nodes; scanning a node list, corresponding to the clustered similar nodes of the plurality of nodes to determine frequencies of occurrence of links of the clustered similar nodes; sorting a list of the links of the clustered similar nodes based on the determined frequencies of occurrence of the links of the clustered similar nodes to produce a sorted link list; adding each of the links, having a frequency of occurrence greater than a predetermined number, from the sorted link list as an item of a prefix tree, along with an associated node list; walking the prefix tree to produce a list of candidate virtual nodes; determining a respective compression performance for each pattern corresponding to a respective candidate virtual node according to a compression performance formula; sorting the list of candidate virtual nodes based on results of applying the compression performance formula; selecting one of the list of candidate virtual nodes as a virtual node based on one of the patterns having a highest determined compression performance; creating the virtual node; and substituting links to the virtual node for links corresponding to the one of the patterns, wherein the machine-implemented method is implemented either on one processing device or on a networked plurality of processing devices. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A machine-implemented method for compressing a graph including a plurality of nodes and a plurality of links between ones of the plurality of nodes, the machine-implemented method comprising:
-
clustering similar nodes of the plurality nodes; producing a sorted link list of the clustered similar nodes based on frequencies of occurrence of links of the clustered similar nodes; adding each of the links from the sorted link list as an item of a prefix tree, along with an associated node list; producing a list of candidate virtual nodes based on the prefix tree; determining a respective compression performance for each pattern corresponding to a respective candidate virtual node according to a compression performance formula Compression(P)=(P.frequency−
1)(P.size−
1)−
1, where Compression(P) is compression performance for a pattern P, P.frequency is a number of the clustered similar nodes having links corresponding to the pattern P, and P.size is a length of the pattern P in the prefix tree;selecting one of the list of candidate virtual nodes as a virtual node based on one of the patterns having a highest determined compression performance; and substituting links to the virtual node for links corresponding to the one of the patterns, wherein the machine-implemented method is implemented either on one processing device or on a networked plurality of processing devices. - View Dependent Claims (7, 8, 9)
-
-
10. A processing device comprising:
-
at least one processor; and a memory connected to the at least one processor, the memory including instructions for performing a method comprising; clustering groups of similar nodes of a graph; producing a sorted link list of the clustered similar nodes based on frequencies of occurrence of links of the clustered similar nodes; adding each of the links from the sorted link list as an item of a prefix tree, along with an associated node list; producing a list of candidate virtual nodes based on the prefix tree; determining a respective compression performance for each pattern corresponding to a respective candidate virtual node according to a compression performance formula Compression(P)=(P.frequency−
1)(P.size−
1)−
1, where Compression(P) is compression performance for a pattern P, P.frequency is a number of the clustered similar nodes having links corresponding to the pattern P, and P.size is a length of the pattern P in the prefix tree;selecting one of the list of candidate virtual nodes as a virtual node based on one of the patterns having a highest determined compression performance; and substituting links to the virtual node for links corresponding to the one of the patterns. - View Dependent Claims (11)
-
Specification