GRAPH QUERYING, GRAPH MOTIF MINING AND THE DISCOVERY OF CLUSTERS
First Claim
Patent Images
1. A computer-implemented method for determining a significance of frequent subgraphs in a database graph comprising:
- (a) selecting one or more vertices or edges of a database graph as features;
(b) transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected feature in the database graph;
(c) evaluating the feature vectors; and
(d) determining a statistical significance of the feature vectors based on the evaluating step (c).
2 Assignments
0 Petitions
Accused Products
Abstract
A method for analyzing, querying, and mining graph databases using subgraph and similarity querying. An index structure, known as a closure tree, is defined for topological summarization of a set of graphs. In addition, a significance model is created in which the graphs are transformed into histograms of primitive components. Finally, connected substructures or clusters, comprising paths or trees, are detected in networks found in the graph databases using a random walk technique and a repeated random walk technique.
92 Citations
14 Claims
-
1. A computer-implemented method for determining a significance of frequent subgraphs in a database graph comprising:
-
(a) selecting one or more vertices or edges of a database graph as features; (b) transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected feature in the database graph; (c) evaluating the feature vectors; and (d) determining a statistical significance of the feature vectors based on the evaluating step (c). - View Dependent Claims (2, 3, 4, 5)
-
-
6. A computer-implemented method for finding a significant group of proteins in a genome scale protein interaction network comprising:
-
(a) obtaining a graph G=(V,E) representing a genome scale protein interaction network, wherein V comprises a set of nodes representing proteins in the graph and E comprises a set of weighted undirected edges between pairs of nodes, wherein the edges are weighted by a probability of interaction; (b) beginning on an initial node, moving to a neighboring node based on the weight of connecting edges; (c) moving to a new neighboring node based on the weight of connecting edges at every time tick for a defined period of time; (d) teleporting to the initial node and repeating steps (b) and (c) based on a restart probability α
;(e) determining a significant group of nodes comprising a cluster of proteins based on a proximity of a node to the initial node, wherein the proximity is based on a percentage of time spent on the node during steps (b) and (c); (f) repeating steps (b)-(d), wherein every node in the network is used as the initial node; and (g) inserting the cluster of proteins into a priority queue based on a statistical significance of the proximity of the nodes of each cluster. - View Dependent Claims (7)
-
-
8. A computer-implemented apparatus for determining a significance of frequent subgraphs in a database graph comprising:
-
(a) means for selecting one or more vertices or edges of a database graph as features; (b) means for transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected feature in the database graph; (c) means for evaluating the feature vectors; and (d) means for determining a statistical significance of the feature vectors based on the means for evaluating (c). - View Dependent Claims (9, 10, 11, 12)
-
-
13. A computer-implemented apparatus for finding a significant group of proteins in a genome scale protein interaction network comprising:
-
(a) means for obtaining a graph G=(V,E) representing a genome scale protein interaction network, wherein V comprises a set of nodes representing proteins in the graph and E comprises a set of weighted undirected edges between pairs of nodes, wherein the edges are weighted by a probability of interaction; (b) means for, beginning on an initial node, moving to a neighboring node based on the weight of connecting edges; (c) means for moving to a new neighboring node based on the weight of connecting edges at every time tick for a defined period of time; (d) means for teleporting to the initial node and repeating (b) and (c) based on a restart probability α
;(e) means for determining a significant group of nodes comprising a cluster of proteins based on a proximity of a node to the initial node, wherein the proximity is based on a percentage of time spent on the node during (b) and (c); (f) means for repeating (b)-(d), wherein every node in the network is used as the initial node; and (g) means for inserting the cluster of proteins into a priority queue based on a statistical significance of the proximity of the nodes of each cluster. - View Dependent Claims (14)
-
Specification