Web graph compression through scalable pattern mining
First Claim
1. A machine-implemented method for compressing a graph including a plurality of nodes and a plurality of links between ones of the plurality of nodes, the machine-implemented method comprising:
- clustering similar nodes of the plurality nodes;
scanning a node list, corresponding to the clustered similar nodes of the plurality of nodes to determine frequencies of occurrence of links of the clustered similar nodes;
sorting a list of the links of the clustered similar nodes based on the determined frequencies of occurrence of the links of the clustered similar nodes to produce a sorted link list;
adding each of the links, having a frequency of occurrence greater than a predetermined number, from the sorted link list as an item of a prefix tree, along with an associated node list;
walking the prefix tree to produce a list of candidate virtual nodes;
determining a respective compression performance for each pattern corresponding to a respective candidate virtual node according to a compression performance formula Compression(P)=(P.frequency−
1)(P.size−
1)−
1, where Compression(P) is compression performance for a pattern P, P.frequency is a number of the clustered similar nodes having links corresponding to the pattern P, and P.size is a length of the pattern P in the prefix tree;
sorting the list of candidate virtual nodes based on results of applying the compression performance formula;
selecting one of the list of candidate virtual nodes as a virtual node based on one of the patterns having a highest determined compression performance;
creating the virtual node; and
substituting links to the virtual node for links corresponding to the one of the patterns, whereinthe machine-implemented method is implemented either on one processing device or on a networked plurality of processing devices.
2 Assignments
0 Petitions
Accused Products
Abstract
A method and a processing device are provided for compressing a web graph including multiple nodes and links between the multiple nodes. Nodes of the web graph may be clustered into groups including no more than a predetermined number of nodes. A list of links of the clustered nodes may be created and sorted based on a frequency of occurrence of each of the links. A prefix tree may be created based on the sorted list of links. The prefix tree may be walked to find candidate virtual nodes. The candidate virtual nodes may be analyzed according to a selection criteria and a virtual node may be selected. The prefix tree may be adjusted to account for the selection of the virtual node and the virtual node may be added to the web graph.
-
Citations
13 Claims
-
1. A machine-implemented method for compressing a graph including a plurality of nodes and a plurality of links between ones of the plurality of nodes, the machine-implemented method comprising:
-
clustering similar nodes of the plurality nodes; scanning a node list, corresponding to the clustered similar nodes of the plurality of nodes to determine frequencies of occurrence of links of the clustered similar nodes; sorting a list of the links of the clustered similar nodes based on the determined frequencies of occurrence of the links of the clustered similar nodes to produce a sorted link list; adding each of the links, having a frequency of occurrence greater than a predetermined number, from the sorted link list as an item of a prefix tree, along with an associated node list; walking the prefix tree to produce a list of candidate virtual nodes; determining a respective compression performance for each pattern corresponding to a respective candidate virtual node according to a compression performance formula Compression(P)=(P.frequency−
1)(P.size−
1)−
1, where Compression(P) is compression performance for a pattern P, P.frequency is a number of the clustered similar nodes having links corresponding to the pattern P, and P.size is a length of the pattern P in the prefix tree;sorting the list of candidate virtual nodes based on results of applying the compression performance formula; selecting one of the list of candidate virtual nodes as a virtual node based on one of the patterns having a highest determined compression performance; creating the virtual node; and substituting links to the virtual node for links corresponding to the one of the patterns, wherein the machine-implemented method is implemented either on one processing device or on a networked plurality of processing devices. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A processing device comprising:
-
at least one processor; and a memory connected to the at least one processor, the memory further comprising; instructions for determining similar nodes of a graph and clustering the similar nodes, each of the nodes representing a respective network document and each link of a respective node being a link to a second respective node, instructions for finding a plurality of patterns among the clustered similar nodes, the instructions for finding a plurality of patterns among the clustered similar nodes further comprising; instructions for creating a histogram of frequencies of occurrence of links of the clustered similar nodes, and instructions for adding a plurality of items to a prefix tree based on a descending frequency of occurrence order of the links of the clustered similar nodes, each of the plurality of items including a respective link ID and respective node IDs of associated ones of the clustered similar nodes, each link corresponding to the respective link ID having at least a predetermined frequency of occurrence, instructions for walking ones of the plurality of items of the prefix tree from a root item to find a next unprocessed item having a longest path and more than one associated node, instructions for adding a next candidate virtual node to a list of candidate virtual nodes based on the found next unprocessed item having the longest path and the more than one associated node ID, instructions for marking the next found unprocessed item to indicate processing of the item, instructions for walking up the prefix tree to find a new next unprocessed item having a larger number of associated nodes than the next found item, and instructions for adding a new next candidate virtual node to a list of candidate virtual nodes based on the found new next unprocessed item having the larger number of associated nodes; and the memory further comprising; instructions for selecting one of the found plurality of patterns and creating a virtual node, the instructions for selecting one of the found plurality of patterns and creating a virtual node further comprising; instructions for determining a compression performance with respect to each one of the list of candidate virtual nodes according to a compression performance formula Compression(P)=(P.frequency−
1)(P.size−
1)−
1, where Compression(P) is compression performance for a pattern P, P.frequency is a number of the clustered similar nodes having links corresponding to the pattern P, and P.size is a length of the pattern P in the prefix tree,instructions for selecting one of the list of candidate virtual nodes having a highest compression performance, and instructions for creating the virtual node based on the selected one of the list of candidate virtual nodes; and instructions for replacing links corresponding to the selected virtual node with links to the virtual node. - View Dependent Claims (7, 8, 9)
-
-
10. A system comprising:
-
a plurality of processing devices networked together, each of the processing devices comprising; at least one processor; and a memory connected to the at least one processor, the memory of a first one of the processing devices comprising; instructions for determining groups of similar nodes of a graph and clustering the groups of similar nodes, instructions for providing respective groups of the groups of similar nodes to respective second ones of the plurality of processing devices, instructions for receiving respective virtual nodes from each of the second ones of the plurality of processing devices, and instructions for replacing links corresponding to the virtual nodes with links to the virtual nodes; and the memories of each of the second ones of the plurality of processing devices further comprising; instructions for creating a histogram of frequencies of occurrence of links of similar nodes of the respective group of the groups of similar nodes, instructions for adding a plurality of items to a prefix tree based on a descending frequency of occurrence order of the links of the similar nodes of the respective group of the groups of similar nodes, each of the plurality of items including a respective link ID and respective node IDs of corresponding ones of the similar nodes, each link corresponding to the respective link ID having least a predetermined frequency of occurrence, instructions for walking ones of the plurality of items of the prefix tree from a root item to find a next unprocessed item having a longest path and more than one respective node ID, instructions for adding a next candidate virtual node to a list of candidate virtual nodes based on the found next unprocessed item having the longest path and more than one respective node ID, instructions for marking the next found unprocessed item to indicate processing of the item, instructions for walking up the prefix tree to find a new next unprocessed item having a larger number of node IDs than the next found item, instructions for adding a new next candidate virtual node to a list of candidate virtual nodes based on the found new next unprocessed item having a larger number of node IDs than the next found item, instructions for calculating a compression performance of respective ones of the items of the candidate virtual nodes according to;
Compression(P)=(P.frequency−
1)(P.size−
1)−
1, where Compression(P) is compression performance for a pattern P, P.frequency is a number of the similar nodes of the respective group having links corresponding to the pattern P, and P.size is a length of the pattern P in the prefix tree,instructions for selecting one item from the list of candidate virtual nodes having a highest compression performance, and instructions for providing, to the first one of the processing devices as a virtual node, the selected one item from the list of candidate virtual nodes. - View Dependent Claims (11, 12, 13)
-
Specification