Low latency query engine for Apache Hadoop
First Claim
1. A system for performing queries in a HADOOP™ distributed computing cluster having a plurality of data nodes storing data, the data nodes having processing circuitry, the system comprising:
a plurality of data nodes forming a peer network, a respective data node functioning as a peer in the peer network and being capable of interacting with components of the HADOOP™ cluster, the respective data node operating an instance of a low latency query engine that queries data directly from the respective data node, the query engine is configured to:
receive a query from a client;
obtain, from the components of the HADOOP™ cluster, location information regarding where one or more data blocks relevant to the received query are distributed among nodes in the cluster;
create query fragments based on the obtained location information;
construct a query plan based on the obtained location information;
distribute the query fragments among the nodes in the cluster according to the query plan;
receive intermediate results from the nodes in the cluster that receive the query fragments; and
aggregate the intermediate results for the client.
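
The steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the block-location map, the node names, the in-memory storage and the helper functions are all hypothetical stand-ins, and the "query" is simply a count and sum of values above a threshold.

```python
from collections import defaultdict

# Hypothetical location information: block id -> node holding that block.
BLOCK_LOCATIONS = {"b1": "node-a", "b2": "node-b", "b3": "node-a"}

# Hypothetical block contents per node, so each node reads only its local data.
NODE_STORAGE = {
    "node-a": {"b1": [3, 8, 12], "b3": [7, 20]},
    "node-b": {"b2": [1, 15]},
}


def create_fragments(block_locations):
    """Group the relevant blocks by the node that stores them (one fragment per node)."""
    fragments = defaultdict(list)
    for block_id, node in sorted(block_locations.items()):
        fragments[node].append(block_id)
    return dict(fragments)


def run_fragment(node, block_ids, threshold):
    """Runs on a data node: count and sum the local values above the threshold."""
    count = total = 0
    for block_id in block_ids:
        for value in NODE_STORAGE[node][block_id]:
            if value > threshold:
                count += 1
                total += value
    return {"count": count, "sum": total}


def run_query(threshold):
    """Coordinator: build the plan, distribute fragments, aggregate the results."""
    plan = create_fragments(BLOCK_LOCATIONS)
    intermediate = [
        run_fragment(node, blocks, threshold) for node, blocks in plan.items()
    ]
    return {
        "count": sum(r["count"] for r in intermediate),
        "sum": sum(r["sum"] for r in intermediate),
    }


print(run_query(5))  # {'count': 5, 'sum': 62}
```

Grouping blocks by the node that holds them is what lets each fragment read its data locally; only the small intermediate results travel back to the node that received the query.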
Abstract
A low latency query engine for APACHE HADOOP™ that provides real-time or near real-time, ad hoc query capability, while complementing batch-processing of MapReduce. In one embodiment, the low latency query engine comprises a daemon that is installed on data nodes in a HADOOP™ cluster for handling query requests and all internal requests related to query execution. In a further embodiment, the low latency query engine comprises a daemon for providing name service and metadata distribution. The low latency query engine receives a query request via a client, turns the request into collections of plan fragments, and coordinates parallel and optimized execution of the plan fragments on remote daemons to generate results at a much faster speed than existing batch-oriented processing frameworks.
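
The coordination of plan fragments on remote daemons described in the abstract might look roughly like the following sketch. It assumes a thread pool in place of real network calls, and the daemon names, the fragment contents and the `execute_fragment` helper are hypothetical. Each daemon returns a partial (sum, count) pair, which the coordinator merges into an average.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical plan fragments: one per daemon, each listing the local rows to scan.
PLAN_FRAGMENTS = {
    "daemon-1": [4.0, 6.0],
    "daemon-2": [10.0],
    "daemon-3": [2.0, 8.0, 6.0],
}


def execute_fragment(daemon, rows):
    """Stand-in for a remote daemon: return a partial (sum, count) for its rows."""
    return sum(rows), len(rows)


def coordinate_average(fragments):
    """Run all fragments in parallel, then merge the partial results into an average."""
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        futures = [
            pool.submit(execute_fragment, daemon, rows)
            for daemon, rows in fragments.items()
        ]
        partials = [future.result() for future in futures]
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


print(coordinate_average(PLAN_FRAGMENTS))  # 6.0
```

Returning partial sums and counts rather than per-daemon averages keeps the merge step correct when daemons scan different numbers of rows.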
20 Claims
1. A system for performing queries in a HADOOP™ distributed computing cluster having a plurality of data nodes storing data, the data nodes having processing circuitry, the system comprising:
a plurality of data nodes forming a peer network, a respective data node functioning as a peer in the peer network and being capable of interacting with components of the HADOOP™ cluster, the respective data node operating an instance of a low latency query engine that queries data directly from the respective data node, the query engine is configured to:
receive a query from a client;
obtain, from the components of the HADOOP™ cluster, location information regarding where one or more data blocks relevant to the received query are distributed among nodes in the cluster;
create query fragments based on the obtained location information;
construct a query plan based on the obtained location information;
distribute the query fragments among the nodes in the cluster according to the query plan;
receive intermediate results from the nodes in the cluster that receive the query fragments; and
aggregate the intermediate results for the client.
20. A method for operating a system that performs queries in a HADOOP™ distributed computing cluster having a plurality of data nodes storing data, the plurality of data nodes forming a peer network, a respective data node functioning as a peer in the peer network and being capable of interacting with components of the HADOOP™ cluster, the respective data node operating an instance of a low latency query engine that queries data directly from the respective data node, the query engine is configured to perform the method operations comprising:
receiving a query from a client;
obtaining, from the components of the HADOOP™ cluster, location information regarding where one or more data blocks relevant to the received query are distributed among nodes in the cluster;
creating query fragments based on the obtained location information;
constructing a query plan based on the obtained location information;
distributing the query fragments among the nodes in the cluster according to the query plan;
receiving intermediate results from the nodes in the cluster that receive the query fragments; and
aggregating the intermediate results for the client.
Specification