Parallel streaming of external data
First Claim
1. A computer-implemented method comprising:
- representing, by a first distributed system, as an external table, data stored in a second distributed system;
receiving, by a first master node of a first distributed system, a query that requests rows of data of the external table representing data stored in the second distributed system,wherein the first distributed system comprises the first master node and multiple segment nodes,wherein the first master node assigns each segment node to operate on a respective portion of data stored in the first distributed system,wherein the external table has an associated protocol that (i) is invoked when a segment node of the first distributed system receives a request from the first master node to access data represented by the external table and that (ii) causes the segment node to communicate with a segment extension service that provides direct streaming and conversion of data stored in the second distributed system,wherein the second distributed system is distinct from the first distributed system and comprises a second master node that coordinates requests among data nodes that each manage a different data fragment of data stored on the second distributed system, andwherein the segments nodes of the first distributed system are each operable to stream data in parallel from the second distributed system;
obtaining, by the first master node, statistics of the data stored in the second distributed system represented by the external table;
evaluating, by the first master node, a plurality of query plans using one or more cost criteria, including computing a cost associated with each of the query plans, wherein the cost criteria includes (i) a scanning cost of reading all the data stored in the second distributed system based on the statistics for broadcast to the segment nodes of the first distributed system, and (ii) a scanning cost of reading, from data nodes of the second distributed system in parallel, portions of the external table requested by segment nodes of the first distributed system;
selecting, by the first master node, a query plan from the plurality of query plans based at least in part on the cost associated with each of the query plans; and
computing, by the first master node, a result for the received query according to the selected query plan.
1 Assignment
0 Petitions
Accused Products
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for streaming external data in parallel from a second distributed system to a first distributed system. One of the methods includes receiving a query that requests a join of first rows of a first table in a first distributed system with second rows of an external table, the external table representing data in a second distributed system. Each of the segment nodes communicates with a respective extension service that obtains fragments from one or more data nodes of the second distributed system according to location information for the respective fragments, and provides to the segment node a stream of data corresponding to second rows of the external table. Each of the segment nodes computes joined rows between the first rows of the first table and the stream of data corresponding to second rows of the external table.
-
Citations
21 Claims
-
1. A computer-implemented method comprising:
-
representing, by a first distributed system, as an external table, data stored in a second distributed system; receiving, by a first master node of a first distributed system, a query that requests rows of data of the external table representing data stored in the second distributed system, wherein the first distributed system comprises the first master node and multiple segment nodes, wherein the first master node assigns each segment node to operate on a respective portion of data stored in the first distributed system, wherein the external table has an associated protocol that (i) is invoked when a segment node of the first distributed system receives a request from the first master node to access data represented by the external table and that (ii) causes the segment node to communicate with a segment extension service that provides direct streaming and conversion of data stored in the second distributed system, wherein the second distributed system is distinct from the first distributed system and comprises a second master node that coordinates requests among data nodes that each manage a different data fragment of data stored on the second distributed system, and wherein the segments nodes of the first distributed system are each operable to stream data in parallel from the second distributed system; obtaining, by the first master node, statistics of the data stored in the second distributed system represented by the external table; evaluating, by the first master node, a plurality of query plans using one or more cost criteria, including computing a cost associated with each of the query plans, wherein the cost criteria includes (i) a scanning cost of reading all the data stored in the second distributed system based on the statistics for broadcast to the segment nodes of the first distributed system, and (ii) a scanning cost of reading, from data nodes of the second distributed system in parallel, portions of the external table requested by segment nodes of the first distributed system; selecting, by the first master node, a query plan from the plurality of query plans based at least in part on the cost associated with each of the query plans; and computing, by the first master node, a result for the received query according to the selected query plan. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising; representing, by a first distributed system, as an external table, data stored in a second distributed system; receiving, by a first master node of a first distributed system, a query that requests rows of data of the external table representing data stored in the second distributed system, wherein the first distributed system comprises the first master node and multiple segment nodes, wherein the first master node assigns each segment node to operate on a respective portion of data stored in the first distributed system, wherein the external table has an associated protocol that (i) is invoked when a segment node of the first distributed system receives a request from the first master node to access data represented by the external table and that (ii) causes the segment node to communicate with a segment extension service that provides direct streaming and conversion of data stored in the second distributed system, wherein the second distributed system is distinct from the first distributed system and comprises a second master node that coordinates requests among data nodes that each manage a different data fragment of data stored on the second distributed system, and wherein the segments nodes of the first distributed system are each operable to stream data in parallel from the second distributed system; obtaining, by the first master node, statistics of the data stored in the second distributed system represented by the external table; evaluating, by the first master node, a plurality of query plans using one or more cost criteria, including computing a cost associated with each of the query plans, wherein the cost criteria includes (i) a scanning cost of reading all the data stored in the second distributed system based on the statistics for broadcast to the segment nodes of the first distributed system, and (ii) a scanning cost of reading, from data nodes of the second distributed system in parallel, portions of the external table requested by segment nodes of the first distributed system; selecting, by the first master node, a query plan from the plurality of query plans based at least in part on the cost associated with each of the query plans; and computing, by the first master node, a result for the received query according to the selected query plan. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
15. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
-
representing, by a first distributed system, as an external table, data stored in a second distributed system; receiving, by a first master node of a first distributed system, a query that requests rows of data of the external table representing data stored in the second distributed system, wherein the first distributed system comprises the first master node and multiple segment nodes, wherein the first master node assigns each segment node to operate on a respective portion of data stored in the first distributed system, wherein the external table has an associated protocol that (i) is invoked when a segment node of the first distributed system receives a request from the first master node to access data represented by the external table and that (ii) causes the segment node to communicate with a segment extension service that provides direct streaming and conversion of data stored in the second distributed system, wherein the second distributed system is distinct from the first distributed system and comprises a second master node that coordinates requests among data nodes that each manage a different data fragment of data stored on the second distributed system, and wherein the segments nodes of the first distributed system are each operable to stream data in parallel from the second distributed system; obtaining, by the first master node, statistics of the data stored in the second distributed system represented by the external table; evaluating, by the first master node, a plurality of query plans using one or more cost criteria, including computing a cost associated with each of the query plans, wherein the cost criteria includes (i) a scanning cost of reading all the data stored in the second distributed system based on the statistics for broadcast to the segment nodes of the first distributed system, and (ii) a scanning cost of reading, from data nodes of the second distributed system in parallel, portions of the external table requested by segment nodes of the first distributed system; selecting, by the first master node, a query plan from the plurality of query plans based at least in part on the cost associated with each of the query plans; and computing, by the first master node, a result for the received query according to the selected query plan. - View Dependent Claims (16, 17, 18, 19, 20, 21)
-
Specification