GRAPH QUERYING, GRAPH MOTIF MINING AND THE DISCOVERY OF CLUSTERS

US 20110173189A1
Filed: 03/28/2011
Published: 07/14/2011
Est. Priority Date: 02/27/2006
Status: Active Grant

Abstract

A method for analyzing, querying, and mining graph databases using subgraph and similarity querying. An index structure, known as a closure tree, is defined for topological summarization of a set of graphs. In addition, a significance model is created in which the graphs are transformed into histograms of primitive components. Finally, connected substructures or clusters, comprising paths or trees, are detected in networks found in the graph databases using a random walk technique and a repeated random walk technique.

92 Citations

14 Claims

1. A computer-implemented method for determining a significance of frequent subgraphs in a database graph comprising:

(a) selecting one or more vertices or edges of a database graph as features;

(b) transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected feature in the database graph;

(c) evaluating the feature vectors; and

(d) determining a statistical significance of the feature vectors based on the evaluating step (c).

2. The method of claim 1, wherein the selecting step (a) is based on one or more of the following:

- a frequency that a vertex or edge occurs in the database graph;
- a size of a vertex or edge in the database graph;
- a structural overlap between vertices or edges in the database graph; and
- a co-occurrence of vertices or edges in the database graph.

3. The method of claim 1, wherein the evaluating step (c) comprises modeling a probability that the selected features occur in a random vector through statistical observations.

4. The method of claim 3, wherein the random vector is constrained by a size of the random vector.

5. The method of claim 1, wherein:

- the evaluating step (c) comprises exploring closed sub-vectors of the feature vectors, wherein said exploring step comprises evaluating sets of closed vectors in a defined order and pruning duplicate sets; and
- the determining step (d) comprises evaluating the statistical significance of each closed sub-vector that is not within a pruned duplicate set.
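As a rough illustration of steps (a) through (d) of claim 1, the following minimal sketch counts how often each selected feature occurs in a graph and then estimates how likely a random vector of the same size is to contain the features at least as often. The claims do not specify a probability model, so the independent binomial model, the edge-label features, the prior values, and all function names here are assumptions made for illustration only.

```python
from collections import Counter
from math import comb

def feature_vector(edges, features):
    """Step (b): frequency of each selected feature (here, an edge label)."""
    counts = Counter(edges)
    return [counts[f] for f in features]

def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def p_value(vector, priors, size):
    """Steps (c)-(d): probability that a random vector of `size` draws
    contains every feature at least as often as `vector` does.
    Assumes independent features (an illustrative simplification)."""
    p = 1.0
    for count, prior in zip(vector, priors):
        p *= binomial_tail(size, prior, count)
    return p

# Hypothetical data: edge labels of one small graph and assumed priors.
features = ["C-C", "C-N", "C-O"]
graph_edges = ["C-C", "C-C", "C-N", "C-C"]
priors = [0.5, 0.3, 0.2]  # assumed background probability of each label

vec = feature_vector(graph_edges, features)
p = p_value(vec, priors, size=len(graph_edges))
print(vec, p)
```

A smaller p-value indicates that the observed feature frequencies are less likely to arise by chance under the assumed model, which is one way to read "statistical significance" in step (d).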

6. A computer-implemented method for finding a significant group of proteins in a genome scale protein interaction network comprising:

(a) obtaining a graph G=(V,E) representing a genome scale protein interaction network, wherein V comprises a set of nodes representing proteins in the graph and E comprises a set of weighted undirected edges between pairs of nodes, wherein the edges are weighted by a probability of interaction;

(b) beginning on an initial node, moving to a neighboring node based on the weight of connecting edges;

(c) moving to a new neighboring node based on the weight of connecting edges at every time tick for a defined period of time;

(d) teleporting to the initial node and repeating steps (b) and (c) based on a restart probability α;

(e) determining a significant group of nodes comprising a cluster of proteins based on a proximity of a node to the initial node, wherein the proximity is based on a percentage of time spent on the node during steps (b) and (c);

(f) repeating steps (b)-(d), wherein every node in the network is used as the initial node; and

(g) inserting the cluster of proteins into a priority queue based on a statistical significance of the proximity of the nodes of each cluster.

7. The method of claim 6, wherein a current order of the cluster of proteins in the priority queue is not processed for reordering upon insertion of the cluster of proteins into the priority queue until a confidence level, that the current order in the priority queue will change, is above a defined threshold, wherein said confidence level is based on a probability of reordering that is based on a Gaussian distribution N(4/(i+1)|V|, σ), wherein 4/(i+1)|V| comprises an estimated mean of distribution and σ is obtained using an element-wise affinity change from level i−1 to level i of the priority queue.
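The random walk with restart of claim 6 can be sketched as follows. Instead of simulating individual time ticks, this sketch uses power iteration to approximate the long-run fraction of time the walker spends on each node, which serves as the proximity in step (e). The toy graph, the cluster size k, the value of α, and the use of summed affinity as a stand-in for statistical significance are all assumptions for illustration; the deferred reordering of claim 7 is not modeled.

```python
import heapq

def rwr_affinity(graph, start, alpha=0.3, iters=100):
    """Approximate the fraction of time a walker restarting at `start`
    spends on each node. graph: {node: {neighbor: weight}}, with every
    undirected edge stored in both directions and no isolated nodes."""
    p = {v: 0.0 for v in graph}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for u, nbrs in graph.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                # Steps (b)-(c): move to a neighbor in proportion to edge weight.
                nxt[v] += (1 - alpha) * p[u] * w / total
        nxt[start] += alpha  # Step (d): teleport back with probability alpha.
        p = nxt
    return p

def top_clusters(graph, k=3, alpha=0.3):
    """Steps (e)-(g): one cluster per start node, kept in a priority queue."""
    heap = []
    for start in graph:
        p = rwr_affinity(graph, start, alpha)
        members = sorted(p, key=p.get, reverse=True)[:k]
        score = sum(p[v] for v in members)  # assumed significance proxy
        heapq.heappush(heap, (-score, sorted(members)))
    return heap

# Hypothetical network: two tight triangles joined by one weak edge.
graph = {
    "A": {"B": 0.9, "C": 0.8},
    "B": {"A": 0.9, "C": 0.7},
    "C": {"A": 0.8, "B": 0.7, "D": 0.1},
    "D": {"C": 0.1, "E": 0.9, "F": 0.8},
    "E": {"D": 0.9, "F": 0.9},
    "F": {"D": 0.8, "E": 0.9},
}
affinity = rwr_affinity(graph, "A")
queue = top_clusters(graph)
print(queue[0])
```

Because `heapq` is a min-heap, the score is negated so that the highest-scoring cluster sits at the front of the queue.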

8. A computer-implemented apparatus for determining a significance of frequent subgraphs in a database graph comprising:

(a) means for selecting one or more vertices or edges of a database graph as features;

(b) means for transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected feature in the database graph;

(c) means for evaluating the feature vectors; and

(d) means for determining a statistical significance of the feature vectors based on the means for evaluating (c).

9. The apparatus of claim 8, wherein the means for selecting (a) is based on one or more of the following:

- a frequency that a vertex or edge occurs in the database graph;
- a size of a vertex or edge in the database graph;
- a structural overlap between vertices or edges in the database graph; and
- a co-occurrence of vertices or edges in the database graph.

10. The apparatus of claim 8, wherein the means for evaluating (c) comprises means for modeling a probability that the selected features occur in a random vector through statistical observations.

11. The apparatus of claim 10, wherein the random vector is constrained by a size of the random vector.

12. The apparatus of claim 8, wherein:

- the means for evaluating (c) comprises means for exploring closed sub-vectors of the feature vectors, wherein said means for exploring comprises evaluating sets of closed vectors in a defined order and pruning duplicate sets; and
- the means for determining (d) comprises means for evaluating the statistical significance of each closed sub-vector that is not within a pruned duplicate set.

13. A computer-implemented apparatus for finding a significant group of proteins in a genome scale protein interaction network comprising:

(a) means for obtaining a graph G=(V,E) representing a genome scale protein interaction network, wherein V comprises a set of nodes representing proteins in the graph and E comprises a set of weighted undirected edges between pairs of nodes, wherein the edges are weighted by a probability of interaction;

(b) means for, beginning on an initial node, moving to a neighboring node based on the weight of connecting edges;

(c) means for moving to a new neighboring node based on the weight of connecting edges at every time tick for a defined period of time;

(d) means for teleporting to the initial node and repeating (b) and (c) based on a restart probability α;

(e) means for determining a significant group of nodes comprising a cluster of proteins based on a proximity of a node to the initial node, wherein the proximity is based on a percentage of time spent on the node during (b) and (c);

(f) means for repeating (b)-(d), wherein every node in the network is used as the initial node; and

(g) means for inserting the cluster of proteins into a priority queue based on a statistical significance of the proximity of the nodes of each cluster.

14. The apparatus of claim 13, wherein a current order of the cluster of proteins in the priority queue is not processed for reordering upon insertion of the cluster of proteins into the priority queue until a confidence level, that the current order in the priority queue will change, is above a defined threshold, wherein said confidence level is based on a probability of reordering that is based on a Gaussian distribution N(4/(i+1)|V|, σ), wherein 4/(i+1)|V| comprises an estimated mean of distribution and σ is obtained using an element-wise affinity change from level i−1 to level i of the priority queue.

Current Assignee: Regents of the University of California (University of California)
Original Assignee: Regents of the University of California (University of California)
Inventors: He, Huahai; Ranu, Sayan; Singh, Ambuj Kumar
Granted Patent: US 8,396,884 B2
US Class Current: 707/722
CPC Class Codes: G06F 16/9024 Graphs; Linked lists G06F16...
