Method and system for processing data queries

US 9,639,575 B2
Filed: 03/30/2012
Issued: 05/02/2017
Est. Priority Date: 03/30/2012
Status: Active Grant

First Claim

Patent Images

1. A method of processing a query on a graph data set stored on a plurality of nodes, the method including the steps of:

dividing the query into one or more atoms which define individual queries on the graph data set;

calculating the execution cost of each atom in the query;

determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms;

determining a query execution plan which is a set of said query paths which can be executed in parallel;

executing said atoms on each of said nodes in accordance with said query execution plan; and

combining the results of each query path to produce a result set that is the answer to said query;

wherein the calculation of th execution cost is carried out by determining for each atom, a weight which for an atom i(j,k,l)is
W(j_j,k_i,l_i)=min{wgt(j_s),wgt(k_p),wgt(l_o)}where wgt(x_pos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the graph data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

The invention relates to a method and system that provide a high performance and extremely scalable triple store within the Resource Description Framework (or alternative data models), with optimized query execution. An embodiment of the invention provides a data storage and analysis system to support scalable monitoring and analysis of business processes along multiple configurable perspectives and levels of granularity. This embodiment analyses data from processes that have been already executed and from ongoing processes, as a continuous flow of information. This embodiment provides defining and monitoring processes based on no initial domain knowledge about the process and such that the process will be built only from the incoming flow of information. Another embodiment of the invention provides a grid infrastructure that allows storage of data across many grid nodes and distribution of the workload, avoiding the bottleneck represented by constantly querying a database.

Citations

24 Claims

1. A method of processing a query on a graph data set stored on a plurality of nodes, the method including the steps of:
- dividing the query into one or more atoms which define individual queries on the graph data set;
  
  calculating the execution cost of each atom in the query;
  
  determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms;
  
  determining a query execution plan which is a set of said query paths which can be executed in parallel;
  
  executing said atoms on each of said nodes in accordance with said query execution plan; and
  
  combining the results of each query path to produce a result set that is the answer to said query;
  
  wherein the calculation of th execution cost is carried out by determining for each atom, a weight which for an atom i(j,k,l)is
  W(j_j,k_i,l_i)=min{wgt(j_s),wgt(k_p),wgt(l_o)}where wgt(x_pos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the graph data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16)
- - 2. A method according to claim 1, further including the step of storing data into the graph data set.
  - 3. A method according to claim 2, wherein the step of storing data includes the step of converting the data in each triple from character strings to numeric representations.
  - 4. A method according to claim 3 wherein the numeric representations are sequence numbers.
  - 5. A method according to claim 4 wherein the step of storing data includes the sub-steps of:
    - parsing data to be stored to determine a triple containing character strings;
      
      comparing each of said character strings against a plurality of stored character strings and their associated sequence numbers;
      
      if a character string matches a stored character string, using the sequence number associated with said stored character string to replace the character string in the triple;
      
      if a character string does not match any stored character string, obtaining a new sequence number, replacing the character string in the triple with said new sequence number and storing said character string associated with said new sequence number; and
      
      storing said triple.
  - 6. A method according to claim 1 wherein, where a query path contains a plurality of atoms, when said step of executing is performed for an atom which is not a first atom in a query path, said atom is only executed on the results of previously executed atoms in the query path.
  - 7. A method according to claim 6 wherein, where the size of results from an atom exceeds a predetermined threshold, a subset of the results from said atom are passed to the subsequent atom or atoms in the query path before the execution of said atom is completed.
  - 8. A method according to claim 1 wherein the calculation of the execution cost for each atom is carried out by assessing the number of instances in said graph data set of one or more constants contained within said atom.
  - 9. A method according to claim 1 wherein a query path may contain atoms which can be executed in parallel to each other as well as atoms which can be executed sequentially.
  - 10. A method according to claim 1 wherein the step of executing includes executing the first atom in a query path on each node only in respect of data stored on said node.
  - 11. A method according to claim 1 wherein the query includes a filter which is a limit on the value of one or more variables in the graph data set, and the step of dividing further includes assigning said filter to each of the atoms which relate to said variables.
  - 12. A method according to claim 11 wherein if said filter is disjunctive, the method includes the further steps of:
    - determining whether it is possible to assign said filter as a whole to an atom which features all of the variables in the filter and if so, assigning it to all such atoms;
      
      if it is not possible to so assign said filter, dividing said filter into a plurality of conjunctive filters and assigning a query path to each of said conjunctive filters.
  - 13. A method according to claim 12 further including the step of, when said filter cannot be assigned to any atom, applying said filter to the results of said step of combining.
  - 15. A method according to claim 1 wherein a query path is determined by:
    - setting the atom with the lowest execution cost as the first atom in the path;
      
      setting as a subsequent atom in the path an atom which is dependent on said firs in that it shares one or more variables with said first atom; and
      
      repeating until all atoms which are dependent on the first atom or any subsequent atoms have been set in the path.
  - 16. A method according to claim 15 wherein, where there are a plurality of atoms which are dependent, the subsequent atoms are set in order of execution cost of those atoms.

14. A method of processing a query on a graph data set stored on a plurality of nodes, the method including the steps of:
- dividing the query into one or more atoms which define individual queries on the graph data set;
  
  calculating the execution cost of each atom in the query;
  
  determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms;
  
  determining a query execution plan which is a set of said query paths which can be executed in parallel;
  
  executing said atoms on each of said nodes in accordance with said query execution plan; and
  
  combining the results of each query path t . duce a result set that is the answer to said query; and
  
  wherein the graph data set stores the data as triples comprising a subject, a predicate and an object, wherein the subject and object are variables and the predicate describes the relationship between the subject and the object; and
  
  wherein the calculation of the execution cost is carried out by determining, for each atom, a weight, which for an atom i(j,k,l) is
  W(j_j,k_i,l_i)=min{wgt(j_s),wgt(k_p),wgt(l_o)}where wgt(x_pos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively.

17. A system for storing business process data, the system having a plurality of nodes which are connected to each other via a network, each node having a data storage and a data processing device, wherein:
- the system stores business process data as a graph data set in a distributed fashion on the data storages of the plurality of nodes; and
  
  the data processing device of each node is arranged to processing a query on the graph data set by;
  
  dividing the query into one or more atoms which define individual queries on the graph data set;
  
  calculating the execution cost of each atom in the query;
  
  determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms;
  
  determining a query execution plan which is a set of said query paths which can be executed in parallel;
  
  executing said atoms on each of said nodes in accordance with said query execution plan;
  
  combining the results of each query path to produce a result set that is the answer to said query;
  
  wherein the calculation of the execution cost is carried out by determining, for each atom, a weight, which for an atom i(j,k,l) is
  W(j_j,k_i,l_i)=min{wgt(j_s),wgt(k_p),wgt(l_o)}where wgt(x_pos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the graph data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively.
- View Dependent Claims (18, 19, 20, 21, 22, 23, 24)
- - 18. A system according to claim 17 wherein the graph data set stores the data as triples comprising a subject, a predicate and an object, wherein the subject and object are variables and the predicate describes the relationship between the subject and the object and at least one of the data processing devices is arranged to converting the data in each triple from character strings to numeric representations prior to storing the data.
  - 19. A system according to claim 18 wherein the data processing device is arranged to monitor business process messages and toparse data to be stored to determine a triple containing character strings;
20. A system according to claim 17 wherein, where a query path contains a plurality of atoms, the data processing device is arranged to execute atoms which are not a first atom in a query path, only on the results of previously executed atoms in the query path.
21. A system according to claim 20 wherein, where the size of results from an atom exceeds a predetermined threshold, the data processing device is arranged to execute a subsequent atom or atoms in the query path on a subset of the results from said atom before the execution of said atom is completed.
22. A system according to claim 17 wherein the data processing device calculates the execution cost for each atom by assessing the number of instances in said graph data set of one or more constants contained within said atom.
23. A system according to claim 17 wherein the data processing device determines a query path by:
- setting the atom with the lowest execution cost as the first atom in the path;
  
  setting as a subsequent atom in the path an atom which is dependent on said first atom in that it shares one or more variables with said first atom; and
  
  repeating until all atoms which are dependent on the first atom or any subsequent atoms have been set in the path.
24. A system according to claim 17 wherein the data processing device on each node is arranged to execute the first atom in a query path on the respective node only in respect of data stored on said node.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
British Telecommunications PLC (BT Group PLC), Emirates Telecommunications Corporation, Khalifa University Of Science, Technology & Research
Original Assignee
British Telecommunications PLC (BT Group PLC), Emirates Telecommunications Corporation, Khalifa University Of Science, Technology & Research
Inventors
Leida, Marcello, Afzal, Ali, Taylor, Paul, Majeed, Basim
Primary Examiner(s)
Vo, Truong

Application Number

US13/435,747
Publication Number

US 20130262443A1
Time in Patent Office

1,859 Days
Field of Search

707769, 707722
US Class Current
CPC Class Codes

G06F 16/2452 Query translation

G06F 16/24534 Query rewriting; Transforma...

Method and system for processing data queries

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

Citations

24 Claims

Specification

Solutions

Use Cases

Quick Links

Method and system for processing data queries

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

24 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links