Low latency query engine for Apache Hadoop
First Claim
1. A system for performing queries on stored data in a HADOOP™
- distributed computing cluster having a plurality of data nodes, each data node being a computing device having processing circuitry and memory circuitry, the system comprising;
a state store that tracks a status of each data node, wherein the state store is separate from the data nodes and is further coupled to a name node that tracks where file data are stored across the cluster; and
a plurality of data nodes forming a peer-to-peer network for the queries, each data node functioning as a peer in the peer-to-peer network and being capable of interacting with components of the HADOOP™
cluster, each peer having an instance of a query engine running in memory, each instance of the query engine having;
a query planner configured to;
receive queries from clients;
obtain, from the state store and the name node, (1) membership information regarding all query engine instances that are running in the cluster, and (2) location information regarding where data blocks relevant to the queries are distributed among the plurality of data nodes;
parse queries from clients to create query fragments based on data obtained from the state store and the name node; and
construct a query plan based on the data obtained from the state store;
a query coordinator configured to distribute the query fragments among the plurality of data nodes according to the query plan; and
a query execution engine configured to execute the query fragments, to obtain intermediate results from other data nodes that receive the query fragments, and to aggregate the intermediate results for the clients.
5 Assignments
0 Petitions
Accused Products
Abstract
A low latency query engine for APACHE HADOOP™ that provides real-time or near real-time, ad hoc query capability, while completing batch-processing of MapReduce. In one embodiment, the low latency query engine comprises a daemon that is installed on data nodes in a HADOOP™ cluster for handling query requests and all internal requests related to query execution. In a further embodiment, the low latency query engine comprises a daemon for providing name service and metadata distribution. The low latency query engine receives a query request via client, turns the request into collections of plan fragments and coordinates parallel and optimized execution of the plan fragments on remote daemons to generate results at a much faster speed than existing batch-oriented processing frameworks.
-
Citations
39 Claims
-
1. A system for performing queries on stored data in a HADOOP™
- distributed computing cluster having a plurality of data nodes, each data node being a computing device having processing circuitry and memory circuitry, the system comprising;
a state store that tracks a status of each data node, wherein the state store is separate from the data nodes and is further coupled to a name node that tracks where file data are stored across the cluster; and a plurality of data nodes forming a peer-to-peer network for the queries, each data node functioning as a peer in the peer-to-peer network and being capable of interacting with components of the HADOOP™
cluster, each peer having an instance of a query engine running in memory, each instance of the query engine having;a query planner configured to; receive queries from clients; obtain, from the state store and the name node, (1) membership information regarding all query engine instances that are running in the cluster, and (2) location information regarding where data blocks relevant to the queries are distributed among the plurality of data nodes; parse queries from clients to create query fragments based on data obtained from the state store and the name node; and construct a query plan based on the data obtained from the state store; a query coordinator configured to distribute the query fragments among the plurality of data nodes according to the query plan; and a query execution engine configured to execute the query fragments, to obtain intermediate results from other data nodes that receive the query fragments, and to aggregate the intermediate results for the clients. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
- distributed computing cluster having a plurality of data nodes, each data node being a computing device having processing circuitry and memory circuitry, the system comprising;
-
23. A method of executing a query in a HADOOP™
- distributed computing cluster having multiple data nodes forming a peer-to-peer network for the query, each data node functioning as a peer in the peer-to-peer network and being capable of interacting with components of HADOOP™
cluster, each peer having an instance of a query engine running in memory, each instance of the query engine is configured to perform;
the method comprising;receiving, by a one data node in the distributed computing cluster, a query; designating the one data node that receives the query as a coordinating data node; obtaining, by the coordinating data node and through a state store and a name node, (1) membership information regarding all query engine instances that are running in the cluster, and (2) location information regarding where data blocks relevant to the query are distributed among the plurality of data nodes, wherein the state store is separate from the data nodes; parsing the query to create fragments of the query based on data obtained from the state store and the name node; constructing a query plan based on the data obtained from the state store; distributing, by the coordinating data node and according to the query plan, the fragments of the query to data nodes in the distributed computing cluster that have data relevant to the query; receiving, from the data nodes having data relevant to the query, intermediate results corresponding to execution of the fragments of the query; and generating a final result based on the intermediate results for a client. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39)
- distributed computing cluster having multiple data nodes forming a peer-to-peer network for the query, each data node functioning as a peer in the peer-to-peer network and being capable of interacting with components of HADOOP™
Specification