Method and system for processing data queries
First Claim
1. A method of processing a query on a graph data set stored on a plurality of nodes, the method including the steps of:
- dividing the query into one or more atoms which define individual queries on the graph data set;
calculating the execution cost of each atom in the query;
determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms;
determining a query execution plan which is a set of said query paths which can be executed in parallel;
executing said atoms on each of said nodes in accordance with said query execution plan; and
combining the results of each query path to produce a result set that is the answer to said query;
wherein the calculation of th execution cost is carried out by determining for each atom, a weight which for an atom i(j,k,l)is
W(jj,ki,li)=min{wgt(js),wgt(kp),wgt(lo)}where wgt(xpos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the graph data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively.
2 Assignments
0 Petitions
Accused Products
Abstract
The invention relates to a method and system that provide a high performance and extremely scalable triple store within the Resource Description Framework (or alternative data models), with optimized query execution. An embodiment of the invention provides a data storage and analysis system to support scalable monitoring and analysis of business processes along multiple configurable perspectives and levels of granularity. This embodiment analyses data from processes that have been already executed and from ongoing processes, as a continuous flow of information. This embodiment provides defining and monitoring processes based on no initial domain knowledge about the process and such that the process will be built only from the incoming flow of information. Another embodiment of the invention provides a grid infrastructure that allows storage of data across many grid nodes and distribution of the workload, avoiding the bottleneck represented by constantly querying a database.
-
Citations
24 Claims
-
1. A method of processing a query on a graph data set stored on a plurality of nodes, the method including the steps of:
-
dividing the query into one or more atoms which define individual queries on the graph data set; calculating the execution cost of each atom in the query; determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms; determining a query execution plan which is a set of said query paths which can be executed in parallel; executing said atoms on each of said nodes in accordance with said query execution plan; and combining the results of each query path to produce a result set that is the answer to said query; wherein the calculation of th execution cost is carried out by determining for each atom, a weight which for an atom i(j,k,l)is
W(jj,ki,li)=min{wgt(js),wgt(kp),wgt(lo)}where wgt(xpos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the graph data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16)
-
-
14. A method of processing a query on a graph data set stored on a plurality of nodes, the method including the steps of:
-
dividing the query into one or more atoms which define individual queries on the graph data set; calculating the execution cost of each atom in the query; determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms; determining a query execution plan which is a set of said query paths which can be executed in parallel; executing said atoms on each of said nodes in accordance with said query execution plan; and combining the results of each query path t . duce a result set that is the answer to said query; and wherein the graph data set stores the data as triples comprising a subject, a predicate and an object, wherein the subject and object are variables and the predicate describes the relationship between the subject and the object; and wherein the calculation of the execution cost is carried out by determining, for each atom, a weight, which for an atom i(j,k,l) is
W(jj,ki,li)=min{wgt(js),wgt(kp),wgt(lo)}where wgt(xpos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively.
-
-
17. A system for storing business process data, the system having a plurality of nodes which are connected to each other via a network, each node having a data storage and a data processing device, wherein:
-
the system stores business process data as a graph data set in a distributed fashion on the data storages of the plurality of nodes; and the data processing device of each node is arranged to processing a query on the graph data set by; dividing the query into one or more atoms which define individual queries on the graph data set; calculating the execution cost of each atom in the query; determining one or more query paths which set out an order in which one or more atoms are to be executed using said calculated execution costs and interdependence between said atoms; determining a query execution plan which is a set of said query paths which can be executed in parallel; executing said atoms on each of said nodes in accordance with said query execution plan; combining the results of each query path to produce a result set that is the answer to said query; wherein the calculation of the execution cost is carried out by determining, for each atom, a weight, which for an atom i(j,k,l) is
W(jj,ki,li)=min{wgt(js),wgt(kp),wgt(lo)}where wgt(xpos) is the number of matching triples with value x in the subject, predicate or object position pos if x is a constant, or the total size of the graph data set if x is a variable, and s, p and o are the subject, predicate and object positions respectively. - View Dependent Claims (18, 19, 20, 21, 22, 23, 24)
compare each of said character strings against a plurality of stored character strings and their associated sequence numbers; if a character string matches a stored character string, use the sequence number associated with said stored character string to replace the character string in the triple; if a character string does not match any stored character string, obtain a new sequence number, replacing the character string in the triple with said new sequence number and storing said character string associated with said new sequence number; and store said triple.
-
-
20. A system according to claim 17 wherein, where a query path contains a plurality of atoms, the data processing device is arranged to execute atoms which are not a first atom in a query path, only on the results of previously executed atoms in the query path.
-
21. A system according to claim 20 wherein, where the size of results from an atom exceeds a predetermined threshold, the data processing device is arranged to execute a subsequent atom or atoms in the query path on a subset of the results from said atom before the execution of said atom is completed.
-
22. A system according to claim 17 wherein the data processing device calculates the execution cost for each atom by assessing the number of instances in said graph data set of one or more constants contained within said atom.
-
23. A system according to claim 17 wherein the data processing device determines a query path by:
-
setting the atom with the lowest execution cost as the first atom in the path; setting as a subsequent atom in the path an atom which is dependent on said first atom in that it shares one or more variables with said first atom; and repeating until all atoms which are dependent on the first atom or any subsequent atoms have been set in the path.
-
-
24. A system according to claim 17 wherein the data processing device on each node is arranged to execute the first atom in a query path on the respective node only in respect of data stored on said node.
Specification