×

Parallel streaming of external data

  • US 9,679,012 B1
  • Filed: 02/28/2014
  • Issued: 06/13/2017
  • Est. Priority Date: 02/28/2014
  • Status: Active Grant
First Claim
Patent Images

1. A computer-implemented method comprising:

  • representing, by a first distributed system, as an external table, data stored in a second distributed system;

    receiving, by a first master node of a first distributed system, a query that requests rows of data of the external table representing data stored in the second distributed system,wherein the first distributed system comprises the first master node and multiple segment nodes,wherein the first master node assigns each segment node to operate on a respective portion of data stored in the first distributed system,wherein the external table has an associated protocol that (i) is invoked when a segment node of the first distributed system receives a request from the first master node to access data represented by the external table and that (ii) causes the segment node to communicate with a segment extension service that provides direct streaming and conversion of data stored in the second distributed system,wherein the second distributed system is distinct from the first distributed system and comprises a second master node that coordinates requests among data nodes that each manage a different data fragment of data stored on the second distributed system, andwherein the segments nodes of the first distributed system are each operable to stream data in parallel from the second distributed system;

    obtaining, by the first master node, statistics of the data stored in the second distributed system represented by the external table;

    evaluating, by the first master node, a plurality of query plans using one or more cost criteria, including computing a cost associated with each of the query plans, wherein the cost criteria includes (i) a scanning cost of reading all the data stored in the second distributed system based on the statistics for broadcast to the segment nodes of the first distributed system, and (ii) a scanning cost of reading, from data nodes of the second distributed system in parallel, portions of the external table requested by segment nodes of the first distributed system;

    selecting, by the first master node, a query plan from the plurality of query plans based at least in part on the cost associated with each of the query plans; and

    computing, by the first master node, a result for the received query according to the selected query plan.

View all claims
  • 1 Assignment
Timeline View
Assignment View
    ×
    ×