Graph querying, graph motif mining and the discovery of clusters

US 8,396,884 B2
Filed: 03/28/2011
Issued: 03/12/2013
Est. Priority Date: 02/27/2006
Status: Active Grant



Abstract

A method for analyzing, querying, and mining graph databases using subgraph and similarity querying. An index structure, known as a closure tree, is defined for topological summarization of a set of graphs. In addition, a significance model is created in which the graphs are transformed into histograms of primitive components. Finally, connected substructures or clusters, comprising paths or trees, are detected in networks found in the graph databases using a random walk technique and a repeated random walk technique.
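
The transformation of a graph into a histogram of primitive components can be pictured with a small sketch. It assumes, purely for illustration, that the primitive components are edge types (the unordered pair of labels at an edge's endpoints); the function name `edge_type_histogram` and the toy graph are hypothetical and do not come from the patent.

```python
from collections import Counter

def edge_type_histogram(vertex_labels, edges):
    """Count how often each edge type (unordered pair of endpoint labels) occurs.

    Assumption: edge types stand in for the "primitive components"
    mentioned in the abstract.
    """
    histogram = Counter()
    for u, v in edges:
        # Sort the two labels so that (C, O) and (O, C) count as the same type.
        edge_type = tuple(sorted((vertex_labels[u], vertex_labels[v])))
        histogram[edge_type] += 1
    return histogram

# Hypothetical toy graph: a four-vertex chain labeled C - C - O - C.
labels = {0: "C", 1: "C", 2: "O", 3: "C"}
edges = [(0, 1), (1, 2), (2, 3)]
hist = edge_type_histogram(labels, edges)
```

Each graph in a database would yield one such histogram, which can then be treated as a feature vector.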

41 Citations

10 Claims

1. A computer-implemented method for determining a significance of frequent subgraphs in a database graph comprising:
- (a) selecting one or more vertices or edges of a database graph as features using, as criteria for the selection, a frequency that a vertex or edge occurs in the database graph, a size of a vertex or edge in the database graph, a structural overlap between vertices or edges in the database graph, or a co-occurrence of vertices or edges in the database graph;
  
  (b) transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected features in the database graph;
  
  (c) evaluating the feature vectors by modeling a probability that the selected features occur in a random one of the feature vectors; and
  
  (d) determining a statistical significance of the feature vectors based on the evaluating step (c), by computing a probability of occurrence of the feature vectors in a random one of the features vector based on the modeled probability, and then obtaining a probability distribution on support of the features vector in a database of random vectors using the probability of occurrence.
  - 2. The method of claim 1, wherein the selecting step (a) is based on one or more of the following:
    - a frequency that a vertex or edge occurs in the database graph;
      
      a size of a vertex or edge in the database graph;
      
      a structural overlap between vertices or edges in the database graph; and
      
      a co-occurrence of vertices or edges in the database graph.
  - 3. The method of claim 1, wherein the evaluating step (c) comprises modeling a probability that the selected features occur in a random vector through statistical observations.
  - 4. The method of claim 3, wherein the random vector is constrained by a size of the random vector.
  - 5. The method of claim 1, wherein:
    - the evaluating step (c) comprises exploring closed sub-vectors of the feature vectors, wherein said exploring step comprises evaluating sets of closed vectors in a defined order and pruning duplicate sets; and
      
      the determining step (d) comprises evaluating the statistical significance of each closed sub-vector that is not within a pruned duplicate set.
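
Steps (c) and (d) of claim 1 can be illustrated with a minimal sketch. It makes several assumptions that the claim text does not state: a vector x "occurs in" a vector y when every component of x is at most the matching component of y; features are independent; and per-feature probabilities are estimated empirically from the database itself. Under those assumptions, the support of x in a database of m random vectors follows a binomial distribution, and the significance is the upper-tail probability of the observed support. All function names are hypothetical.

```python
from math import comb

def is_subvector(x, y):
    """x occurs in y if every component of x is at most that of y (assumed definition)."""
    return all(xi <= yi for xi, yi in zip(x, y))

def occurrence_probability(x, database):
    """Step (c): probability that x occurs in a random vector.

    Assumes independent features; P(y_i >= x_i) is the empirical fraction
    of database vectors whose i-th component is at least x_i.
    """
    m = len(database)
    prob = 1.0
    for i, xi in enumerate(x):
        at_least = sum(1 for y in database if y[i] >= xi)
        prob *= at_least / m
    return prob

def support(x, database):
    """Number of database vectors in which x occurs."""
    return sum(1 for y in database if is_subvector(x, y))

def p_value(x, database):
    """Step (d): P(support >= observed) when support ~ Binomial(m, P(x))."""
    m = len(database)
    p = occurrence_probability(x, database)
    observed = support(x, database)
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k)
               for k in range(observed, m + 1))

# Hypothetical database of four 2-feature vectors.
database = [(2, 1), (2, 1), (0, 0), (0, 0)]
pv = p_value((2, 1), database)
```

A smaller p-value means the observed support would be less likely in a database of random vectors, so the feature vector is more significant.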

6. A computer-implemented apparatus for determining a significance of frequent subgraphs in a database graph comprising:
- at least one or more computer systems configured to perform the steps of;
  
  (a) selecting one or more vertices or edges of a database graph as features using, as criteria for the selection, a frequency that a vertex or edge occurs in the database graph, a size of a vertex or edge in the database graph, a structural overlap between vertices or edges in the database graph, or a co-occurrence of vertices or edges in the database graph;
  
  (b) transforming the selected features into feature vectors, wherein each feature vector comprises a frequency of the selected feature in the database graph;
  
  (c) evaluating the feature vectors by modeling a probability that the selected features occur in a random one of the feature vectors; and
  
  (d) determining a statistical significance of the feature vectors based on the evaluating step (c), by computing a probability of occurrence of the feature vectors in a random one of the features vector based on the modeled probability, and then obtaining a probability distribution on support of the features vector in a database of random vectors using the probability of occurrence.
  - 7. The apparatus of claim 6, wherein the means for selecting (a) is based on one or more of the following:
    - a frequency that a vertex or edge occurs in the database graph;
      
      a size of a vertex or edge in the database graph;
      
      a structural overlap between vertices or edges in the database graph; and
      
      a co-occurrence of vertices or edges in the database graph.
  - 8. The apparatus of claim 6, wherein the means for evaluating (c) comprises means for modeling a probability that the selected features occur in a random vector through statistical observations.
  - 9. The apparatus of claim 8, wherein the random vector is constrained by a size of the random vector.
  - 10. The apparatus of claim 6, wherein:
    - the means for evaluating (c) comprises means for exploring closed sub-vectors of the feature vectors, wherein said means for exploring comprises evaluating sets of closed vectors in a defined order and pruning duplicate sets; and
      
      the means for determining (d) comprises means for evaluating the statistical significance of each closed sub-vector that is not within a pruned duplicate set.
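
The closed sub-vector exploration in claims 5 and 10 can be pictured with a rough sketch. It assumes that a closed sub-vector is the component-wise minimum ("floor") of some non-empty set of database vectors, and it swaps in a simple incremental enumeration in database order that discards duplicates; this is not necessarily the exploration order the patent describes. The names `floor` and `closed_subvectors` are hypothetical.

```python
def floor(x, y):
    """Component-wise minimum of two feature vectors."""
    return tuple(min(a, b) for a, b in zip(x, y))

def closed_subvectors(database):
    """Enumerate floors of all non-empty subsets of the database, without duplicates.

    Vectors are processed in database order (the "defined order" assumed here).
    Each new vector is combined with every closed sub-vector found so far;
    candidates that were already found are pruned.
    """
    closed = {}  # dict keys keep insertion order and give O(1) duplicate checks
    for v in database:
        candidates = [v] + [floor(c, v) for c in closed]
        for c in candidates:
            if c not in closed:  # prune duplicates
                closed[c] = True
    return list(closed)

# Hypothetical database of three 2-feature vectors.
database = [(2, 1), (1, 3), (2, 1)]
result = closed_subvectors(database)
```

Each vector in `result` would then be scored once for statistical significance, so no duplicate is evaluated twice.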


Current Assignee: Regents of the University of California (University of California)
Original Assignee: Regents of the University of California (University of California)
Inventors: Singh, Ambuj Kumar; He, Huahai; Ranu, Sayan
Primary Examiner(s): GIRMA, ANTENEH B
Application Number: US13/073,452
Publication Number: US 20110173189A1
Time in Patent Office: 715 Days
Field of Search: None
US Class Current: 707/760
CPC Class Codes: G06F 16/9024 Graphs; Linked lists G06F16...
