Self-described query execution in a massively parallel SQL execution engine
First Claim
1. A method of query execution in a massively parallel processing (MPP) data storage system comprising a master node and a cluster of multiple distributed segments that access data in distributed storage, comprising:
- producing a self-described query plan at the master node that is responsive to a query for accessing data in the distributed storage to satisfy the query, said producing comprising incorporating, into a query plan at the master node, metadata and other information needed by the segments to execute the query plan to create said self-described query plan, wherein said metadata and other information comprise information as to locations of said data in said distributed storage that are accessed by said self-described query plan, and catalog information for functions and operators used in the self-described query plan for processing the data, and wherein said metadata and other information are stored in a store at said master node, wherein in the event that a part of such metadata or a part of such other information needed by the segments to execute the query plan is stored at the cluster of multiple distributed segments, the master node includes an identifier associated with the part of such metadata or the part of such other information that is stored at the cluster of multiple distributed segments and excludes the part of such metadata or the part of such other information that is stored at the cluster of multiple distributed segments from the query plan;
- broadcasting said self-described query plan to said segments for execution; and
- executing the self-described query plan to process said data.
Abstract
A query is executed in a massively parallel processing data storage system comprising a master node communicating with a cluster of multiple segments that access data in distributed storage by producing a self-described query plan at the master node that incorporates changeable metadata and information needed to execute the self-described query plan on the segments, and that incorporates references to obtain static metadata and information for functions and operators of the query plan from metadata stores on the segments. The distributed storage may be the Hadoop distributed file system, and the query plan may be a full function SQL query plan.
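The flow summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `MASTER_CATALOG`, `SEGMENT_RESIDENT`, `SelfDescribedPlan`, `produce_plan`, `Segment`, and `broadcast`, along with the catalog keys and storage paths, are all hypothetical. It assumes the master embeds changeable metadata (such as data locations) directly in the plan and includes only an identifier for static metadata that each segment already holds locally.

```python
from dataclasses import dataclass, field

# Hypothetical master-side metadata store (all entries are illustrative).
MASTER_CATALOG = {
    "table:sales": {"locations": ["hdfs://example/sales/part-0",
                                  "hdfs://example/sales/part-1"]},
    "func:sum": {"signature": "sum(numeric) -> numeric"},
}

# Assumed set of static entries that every segment also stores locally.
SEGMENT_RESIDENT = {"func:sum"}


@dataclass
class SelfDescribedPlan:
    query: str
    embedded: dict = field(default_factory=dict)    # metadata carried in the plan
    references: list = field(default_factory=list)  # identifiers resolved on segments


def produce_plan(query, needed):
    """Build a plan on the master: embed metadata, or reference it by identifier."""
    plan = SelfDescribedPlan(query)
    for name in needed:
        if name in SEGMENT_RESIDENT:
            # Stored on the segments: include only the identifier, exclude the body.
            plan.references.append(name)
        else:
            plan.embedded[name] = MASTER_CATALOG[name]
    return plan


class Segment:
    def __init__(self, segment_id, local_store):
        self.segment_id = segment_id
        self.local_store = local_store

    def execute(self, plan):
        """Resolve references from the local store; no call back to the master."""
        resolved = dict(plan.embedded)
        for name in plan.references:
            resolved[name] = self.local_store[name]
        # A real segment would now run the plan; here we only report what it resolved.
        return self.segment_id, sorted(resolved)


def broadcast(plan, segments):
    """Send the same plan to every segment and collect the results."""
    return [segment.execute(plan) for segment in segments]
```

In this sketch a segment never queries the master for catalog data during execution: everything it needs either travels with the plan or is looked up locally by identifier, which is the property the term "self-described" appears to refer to.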
24 Claims
1. A method of query execution in a massively parallel processing (MPP) data storage system comprising a master node and a cluster of multiple distributed segments that access data in distributed storage, comprising:
- producing a self-described query plan at the master node that is responsive to a query for accessing data in the distributed storage to satisfy the query, said producing comprising incorporating, into a query plan at the master node, metadata and other information needed by the segments to execute the query plan to create said self-described query plan, wherein said metadata and other information comprise information as to locations of said data in said distributed storage that are accessed by said self-described query plan, and catalog information for functions and operators used in the self-described query plan for processing the data, and wherein said metadata and other information are stored in a store at said master node, wherein in the event that a part of such metadata or a part of such other information needed by the segments to execute the query plan is stored at the cluster of multiple distributed segments, the master node includes an identifier associated with the part of such metadata or the part of such other information that is stored at the cluster of multiple distributed segments and excludes the part of such metadata or the part of such other information that is stored at the cluster of multiple distributed segments from the query plan;
- broadcasting said self-described query plan to said segments for execution; and
- executing the self-described query plan to process said data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. Computer readable storage media for storing executable instructions for controlling the operation of one or more computers in a massively parallel processing (MPP) data storage system comprising a master node and a cluster of multiple distributed segments that access data in distributed storage to perform a method of query execution comprising:
- producing a self-described query plan at the master node that is responsive to a query for accessing data in the distributed storage to satisfy the query, said producing comprising incorporation, into a query plan at the master node, metadata and other information needed by the segments to execute the query plan to create said self-described query plan, wherein said metadata and other information comprise information as to locations of said data in said distributed storage that are accessed by said self-described query plan, and catalog information for functions and operators used in the self-described query plan for processing the data, and wherein said metadata and other information are stored in a store at said master node, wherein in the event that a part of such metadata or a part of such other information needed by the segments to execute the query plan is stored at the cluster of multiple distributed segments, the master node includes an identifier associated with the part of such metadata or the part of such other information that is stored at the cluster of multiple distributed segments and excludes the part of such metadata or the part of such other information that is stored at the cluster of multiple distributed segments from the query plan;
- broadcasting said self-described query plan to said segments for execution; and
- executing the self-described query plan to process said data.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24
Specification