Graph querying, graph motif mining and the discovery of clusters

US 20070239694A1
Filed: 02/27/2007
Published: 10/11/2007
Est. Priority Date: 02/27/2006
Status: Active Grant

Abstract

A method for analyzing, querying, and mining graph databases using subgraph and similarity querying. An index structure, known as a closure tree, is defined for topological summarization of a set of graphs. In addition, a significance model is created in which the graphs are transformed into histograms of primitive components. Finally, connected substructures or clusters, comprising paths or trees, are detected in networks found in the graph databases using a random walk technique and a repeated random walk technique.
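
To make the idea of topological summarization more concrete, the following is a minimal Python sketch of a graph closure in the sense of claim 3: corresponding vertices and edges of two graphs are merged by taking the union of their attribute values. The graph representation (label dictionaries keyed by vertex id and by `frozenset` edge), the `DUMMY` label, and the function name `graph_closure` are assumptions made for illustration; the vertex mapping is taken as given rather than computed.

```python
DUMMY = "ε"  # assumed label standing in for a dummy vertex or dummy edge


def graph_closure(v1, e1, v2, e2, mapping):
    """Merge two labeled undirected graphs under a given vertex mapping.

    v1, v2: dict vertex id -> label.
    e1, e2: dict frozenset({u, v}) -> label.
    mapping: dict sending each vertex of graph 1 to a vertex of graph 2,
             or to None when it is paired with a dummy vertex.
    Returns (closure_vertices, closure_edges), whose values are label sets.
    """
    # Pair up corresponding vertices; None marks the dummy side of a pair.
    pairs = [(u, mapping.get(u)) for u in v1]
    matched = {v for _, v in pairs if v is not None}
    pairs += [(None, v) for v in v2 if v not in matched]

    # A closure vertex carries the union of the two vertex labels.
    closure_vertices = {
        (u, v): {v1.get(u, DUMMY), v2.get(v, DUMMY)} for u, v in pairs
    }

    # A closure edge exists if either graph has the edge; a missing edge
    # on one side contributes the dummy label.
    closure_edges = {}
    for i, p in enumerate(pairs):
        for q in pairs[i + 1:]:
            l1 = e1.get(frozenset((p[0], q[0])))
            l2 = e2.get(frozenset((p[1], q[1])))
            if l1 is None and l2 is None:
                continue
            closure_edges[frozenset((p, q))] = {
                DUMMY if l1 is None else l1,
                DUMMY if l2 is None else l2,
            }
    return closure_vertices, closure_edges


# Toy example: a three-vertex graph merged with a two-vertex graph.
v1 = {1: "C", 2: "C", 3: "O"}
e1 = {frozenset({1, 2}): "single", frozenset({2, 3}): "double"}
v2 = {"a": "C", "b": "N"}
e2 = {frozenset({"a", "b"}): "single"}
cv, ce = graph_closure(v1, e1, v2, e2, {1: "a", 2: "b", 3: None})
```

In this toy example the pair `(2, "b")` receives the label set `{"C", "N"}`, and vertex `3`, which has no partner, is paired with a dummy. A closure tree would apply the same merge recursively, so that each internal node summarizes all graphs beneath it.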

195 Citations

14 Claims

1. A computer-implemented method for conducting a database graph query, comprising:
- (a) obtaining a first database graph and a second database graph, wherein the first database graph and second database graph each have two or more vertices and one or more edges;
  
  (b) mapping the first database graph to the second database graph, wherein:
  
  (i) each vertex in the first database graph has a corresponding vertex in the second database graph; and
  
  (ii) each edge in the first database graph has a corresponding edge in the second database graph;
  
  (c) creating a graph closure tree comprised of a union of the first database graph and the second database graph based on the mapping, wherein each node of the graph closure tree comprises a graph closure of the node's children and each child of a leaf node comprises a database graph; and
  
  (d) conducting a graph query based on the graph closure tree.
- - 2. The method of claim 1, wherein the mapping comprises extending the first database graph by creating dummy vertices and dummy edges, wherein every vertex and every edge of the first database graph has a corresponding element in the second database graph.
  - 3. The method of claim 1, wherein the graph closure tree comprises:
    - a set of closure vertices, wherein each vertex comprises a union of attribute values of each vertex of the first database graph and each corresponding vertex of the second database graph; and
      
      a set of closure edges, wherein each edge comprises a union of attribute values of each edge of the first database graph and each corresponding edge of the second database graph.
  - 4. The method of claim 1, wherein the mapping comprises constructing a bipartite graph between the first database graph and the second database graph, wherein:
    - a first partition of the bipartite graph comprises vertices from the first database graph;
      
      a second partition of the bipartite graph comprises vertices from the second database graph;
      
      edges of the bipartite graph are formed by connecting the vertices from the first database graph to the vertices in the second database graph; and
      
      defining the mapping as a maximum similarity for each edge and vertex of the bipartite graph.
  - 5. The method of claim 1, wherein the mapping comprises:
    - (a) computing an initial similarity matrix for the first database graph and the second database graph, wherein each entry of the similarity matrix represents a weight similarity of each vertex of the first database graph to each vertex of the second database graph;
      
      (b) creating a priority queue comprised of vertex pairs based on the weight similarity, wherein each vertex pair comprises a vertex from the first database graph and a most similar vertex from the second database graph based on the weight similarity;
      
      (c) processing the priority queue by:
      
      (i) marking a first vertex pair in the priority queue as matched;
      
      (ii) assigning a higher similarity weight to unmatched vertex pairs that are neighbors to the first vertex pair;
      
      (iii) repeating steps (c)(i) and (c)(ii) for each subsequent vertex pair in the priority queue until all vertices in the first database graph have been marked as matched.
  - 6. The method of claim 1, wherein the database graph query comprises a subgraph query, wherein the subgraph query comprises determining if a subgraph is sub-isomorphic by:
    - (a) for each vertex u of the first database graph G1, defining a level-n adjacent subgraph, wherein the level-n adjacent subgraph contains all vertices reachable from the vertex u within a distance of n;
      
      (b) constructing a bipartite graph B for G1 and the second database graph G2, wherein:
      
      (i) vertex sets of the bipartite graph are vertex sets of G1 and G2;
      
      (ii) for any two vertices u ∈ G1, v ∈ G2, if u is level-n pseudo compatible to v, then (u,v) comprises an edge of B, wherein vertex u is called level-n pseudo compatible to v if a level-n adjacent subgraph of u is level-n sub-isomorphic to that of v, wherein G1 is called level-n sub-isomorphic if every vertex in G1 is matched to a vertex in G2.
  - 7. The method of claim 6, wherein the database graph query comprises:
    - pruning nodes of the graph closure tree based on the subgraph sub-isomorphism; and
      
      verifying each level-n sub-isomorphic subgraph for exact subgraph isomorphism.
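
The queue-driven mapping of claim 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the initial similarity is assumed to be 1.0 for equal vertex labels and 0.0 otherwise, the neighbor boost is a fixed hypothetical `bonus`, and the queue holds every candidate pair rather than only each vertex's most similar partner. The function name and the adjacency-dictionary representation are likewise illustrative.

```python
import heapq
import itertools


def neighbor_biased_mapping(labels1, adj1, labels2, adj2, bonus=1.0):
    """Greedily map vertices of graph 1 to vertices of graph 2.

    labels1, labels2: dict vertex id -> label.
    adj1, adj2: dict vertex id -> set of neighboring vertex ids.
    Returns a dict from graph-1 vertices to graph-2 vertices. Vertices of
    graph 1 left unmatched (graph 2 exhausted) are absent from the result
    and would be paired with dummy vertices.
    """
    # (a) Initial similarity matrix (assumed: label equality).
    weight = {
        (u, v): 1.0 if labels1[u] == labels2[v] else 0.0
        for u in labels1
        for v in labels2
    }

    # (b) Priority queue of vertex pairs. heapq is a min-heap, so weights
    # are negated; the counter breaks ties without comparing vertex ids.
    counter = itertools.count()
    heap = [(-w, next(counter), u, v) for (u, v), w in weight.items()]
    heapq.heapify(heap)

    mapping, used2 = {}, set()
    # (c) Process the queue until every graph-1 vertex is matched.
    while heap and len(mapping) < len(labels1):
        neg_w, _, u, v = heapq.heappop(heap)
        if u in mapping or v in used2 or -neg_w != weight[(u, v)]:
            continue  # already matched, or a stale (outdated) queue entry
        # (c)(i) Mark the pair as matched.
        mapping[u] = v
        used2.add(v)
        # (c)(ii) Raise the weight of unmatched neighbor pairs.
        for nu in adj1[u]:
            if nu in mapping:
                continue
            for nv in adj2[v]:
                if nv in used2:
                    continue
                weight[(nu, nv)] += bonus
                heapq.heappush(
                    heap, (-weight[(nu, nv)], next(counter), nu, nv)
                )
    return mapping


# Toy example: vertex 2 could match "c" or "b" by label alone, but "b" is
# adjacent to the already-matched "a", so the neighbor boost selects it.
labels1 = {1: "A", 2: "B"}
adj1 = {1: {2}, 2: {1}}
labels2 = {"a": "A", "c": "B", "b": "B"}
adj2 = {"a": {"b"}, "b": {"a"}, "c": set()}
result = neighbor_biased_mapping(labels1, adj1, labels2, adj2)
```

Stale queue entries are skipped rather than updated in place, a common idiom with `heapq`, which has no decrease-key operation.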

8. A computer-implemented method for determining a significance of frequent subgraphs in a database graph comprising:
- selecting one or more vertices or edges of a database graph as features;
  
  transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected feature in the database graph;
  
  evaluating the feature vectors; and
  
  determining a statistical significance of the feature vectors based on the evaluating.
- - 9. The method of claim 8, wherein the selecting is based on one or more of the following:
    - a frequency that a vertex or edge occurs in the database graph;
      
      a size of a vertex or edge in the database graph;
      
      a structural overlap between vertices and/or edges in the database graph; and
      
      a co-occurrence of vertices and/or edges in the database graph.
  - 10. The method of claim 8, wherein the evaluating comprises modeling a probability that the selected features occur in a random vector through statistical observations.
  - 11. The method of claim 10, wherein the random vector is constrained by a size of the random vector.
  - 12. The method of claim 8, wherein:
    - the evaluating comprises exploring closed sub-vectors of the feature vectors, wherein said exploring comprises evaluating sets of closed vectors in a defined order and pruning duplicate sets; and
      
      the determining comprises evaluating the statistical significance of each closed sub-vector that is not within a pruned duplicate set.
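
As a rough illustration of claims 8, 10, and 11, the Python sketch below turns a labeled graph into a histogram of edge types and scores a sub-vector against a simple random-vector model. The probabilistic model here is an assumption chosen for brevity (each feature is treated as an independent binomial count in a random vector of fixed size); the claims do not specify this model, and all function names are hypothetical.

```python
from collections import Counter
from math import comb


def edge_type_histogram(vertex_labels, edges):
    """Feature vector: how often each unordered pair of endpoint labels occurs."""
    hist = Counter()
    for u, v in edges:
        a, b = sorted((vertex_labels[u], vertex_labels[v]))
        hist[(a, b)] += 1
    return hist


def binomial_tail(m, p, k):
    """P[X >= k] for X ~ Binomial(m, p)."""
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k, m + 1))


def p_value(sub_vector, priors, size):
    """Probability that a random vector of the given size contains sub_vector.

    Assumes features are independent, with priors[f] the per-slot
    probability of feature f. Smaller values indicate higher significance.
    """
    prob = 1.0
    for feature, count in sub_vector.items():
        prob *= binomial_tail(size, priors[feature], count)
    return prob


# Toy example: a triangle with two carbon atoms and one oxygen atom.
labels = {1: "C", 2: "C", 3: "O"}
hist = edge_type_histogram(labels, [(1, 2), (2, 3), (1, 3)])
score = p_value({("C", "O"): 2}, {("C", "O"): 0.5}, size=2)
```

In practice the priors would be estimated from the database itself, and the sub-vectors to score would come from the closed sub-vector exploration described in claim 12, which this sketch does not attempt.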

13. A computer-implemented method for finding a significant group of proteins in a genome scale protein interaction network comprising:
- (a) obtaining graph G=(V,E) representing a genome scale protein interaction network, wherein V comprises a set of nodes/proteins in the graph and E comprises a set of weighted undirected edges between pairs of nodes/proteins, wherein the edges are weighted by a probability of interaction;
  
  (b) beginning on an initial node, moving to a neighboring node based on the weight of connecting edges;
  
  (c) moving to a new neighboring node based on the weight of connecting edges at every time tick for a defined period of time;
  
  (d) teleporting to the initial node and repeating steps (b) and (c) based on a restart probability α; and
  
  (e) determining a significant group of proteins based on a proximity of a node to the initial node, wherein the proximity is based on a percentage of time spent on the node during steps (b) and (c);
  
  (f) repeating steps (b)-(d) wherein every node in the network is used as the initial node;
  
  (g) inserting a cluster of proteins into a priority queue based on a statistical significance of each cluster;
- - 14. The method of claim 13, wherein a current order of the cluster of proteins in the priority queue is not processed for reordering upon insertion of the cluster of proteins into the priority queue until a confidence level, that the current order in the priority queue will change, is above a defined threshold, wherein said confidence level is based on a probability of reordering that is based on a Gaussian distribution $N\left(\frac{4}{(i + 1) \langle V \rangle}, \sigma\right)$, wherein $\frac{4}{(i + 1) \langle V \rangle}$ comprises an estimated mean of the distribution and $\sigma$ is obtained using an element-wise affinity change from level $i-1$ to level $i$ of the priority queue.
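
Steps (b) through (e) of claim 13 describe a random walk with restart. The Python sketch below computes the long-run fraction of time the walker spends on each node by power iteration instead of simulating individual moves; that substitution, the function names, and the nested-dictionary graph representation are assumptions made for illustration. The statistical ranking of clusters in step (g) is not covered.

```python
def random_walk_with_restart(weights, start, alpha=0.3, iterations=100):
    """Approximate visit frequencies of a walk that restarts at `start`.

    weights: dict node -> dict neighbor -> edge weight (undirected edges
             listed in both directions).
    alpha:   restart probability.
    Returns a dict node -> proximity to `start`; the values sum to 1.
    """
    nodes = list(weights)
    p = {n: 0.0 for n in nodes}
    p[start] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            if total == 0:
                # Isolated node: the walker can only return to the start.
                nxt[start] += (1 - alpha) * p[u]
                continue
            # Move to a neighbor with probability proportional to edge weight.
            for v, w in weights[u].items():
                nxt[v] += (1 - alpha) * p[u] * w / total
        # Teleport back to the initial node with probability alpha.
        nxt[start] += alpha
        p = nxt
    return p


def top_k_cluster(proximity, k):
    """Candidate cluster: the k nodes closest to the start node."""
    return sorted(proximity, key=proximity.get, reverse=True)[:k]


# Toy example: a three-protein path a - b - c with equal edge weights.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0},
}
proximity = random_walk_with_restart(graph, "a")
cluster = top_k_cluster(proximity, 2)
```

Step (f) corresponds to calling `random_walk_with_restart` once per node in the network, using each node in turn as the initial node.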

Current Assignee
Regents of the University of California (University of California)
Original Assignee
Regents of the University of California (University of California)
Inventors
He, Huahai; Singh, Ambuj; Camoglu, Orhan; Can, Tolga

Granted Patent

US 7,933,915 B2
US Class Current

1/1
CPC Class Codes

G06F 16/9024 Graphs; Linked lists G06F16...
