Low latency query engine for Apache Hadoop

US 9,990,399 B2
Filed: 05/13/2016
Issued: 06/05/2018
Est. Priority Date: 03/13/2013
Status: Active Grant

Abstract

A low latency query engine for APACHE HADOOP™ that provides real-time or near real-time, ad hoc query capability, while completing batch-processing of MapReduce. In one embodiment, the low latency query engine comprises a daemon that is installed on data nodes in a HADOOP™ cluster for handling query requests and all internal requests related to query execution. In a further embodiment, the low latency query engine comprises a daemon for providing name service and metadata distribution. The low latency query engine receives a query request via client, turns the request into collections of plan fragments and coordinates parallel and optimized execution of the plan fragments on remote daemons to generate results at a much faster speed than existing batch-oriented processing frameworks.
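
The coordination described in the abstract, where one request becomes plan fragments that run in parallel on daemons, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `NODE_DATA`, `run_fragment`, and `run_query` are hypothetical names, the "daemons" are simulated by local threads, and the query is a simple threshold count. It is not the patented implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical in-memory stand-in for the data held by three daemons.
NODE_DATA = {
    "node-a": [3, 8, 1],
    "node-b": [7, 2],
    "node-c": [5],
}

def run_fragment(node, threshold):
    """Simulate the daemon on `node`: count local values above `threshold`."""
    return sum(1 for value in NODE_DATA[node] if value > threshold)

def run_query(threshold):
    """Fan one plan fragment out per node and combine the partial counts."""
    with ThreadPoolExecutor(max_workers=len(NODE_DATA)) as pool:
        futures = [pool.submit(run_fragment, node, threshold) for node in NODE_DATA]
        # Partial results are consumed in completion order, not submission order.
        return sum(future.result() for future in as_completed(futures))

print(run_query(4))  # prints 3
```

Each simulated daemon only touches its own data, so the coordinator's work is limited to fan-out and a final sum.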

20 Claims

1. A system for performing queries in a HADOOP™ distributed computing cluster having a plurality of data nodes storing data, the data nodes having processing circuitry, the system comprising:

  a plurality of data nodes forming a peer network, a respective data node functioning as a peer in the peer network and being capable of interacting with components of the HADOOP™ cluster, the respective data node operating an instance of a low latency query engine that queries data directly from the respective data node, the query engine is configured to:

  receive a query from a client;

  obtain, from the components of the HADOOP™ cluster, location information regarding where one or more data blocks relevant to the received query are distributed among nodes in the cluster;

  create query fragments based on the obtained location information;

  construct a query plan based on the obtained location information;

  distribute the query fragments among the nodes in the cluster according to the query plan;

  receive intermediate results from the nodes in the cluster that receive the query fragments; and

  aggregate the intermediate results for the client.
  - 2. The system of claim 1, wherein the instance of the query engine is further configured to stream the intermediate results to the client as the intermediate results arrive.
  - 3. The system of claim 1, wherein the instance of the query engine is further configured to execute a query fragment against local data where the instance of the query engine is located.
  - 4. The system of claim 3, wherein the query fragment against the local data is executed as the query fragment is received, without entering a queue for batched processing.
  - 5. The system of claim 3, wherein the instance of the query engine is further configured to transmit a result of executing the query fragment against the local data to another data node.
  - 6. The system of claim 3, wherein the instance of the query engine is further configured to utilize, for executing the query fragment, a different process based on a format in which the local data is stored.
  - 7. The system of claim 1, wherein the instance of the query engine is further configured to transmit the aggregated intermediate results to another data node for further aggregation.
  - 8. The system of claim 7, wherein the instance of the query engine is further configured to perform a join operation on the intermediate results before transmitting the aggregated intermediate results to the another data node.
  - 9. The system of claim 1, wherein the query plan includes utilizing one or more of:
    - a scan operator, a hash join operator, a hash aggregation operator, a union operator, a TopN operator, or an exchange operator.
  - 10. The system of claim 1, wherein the respective data node is coupled to specialized scan nodes for executing query fragments that are specific to different storage managers.
  - 11. The system of claim 1, wherein the query plan breaks the query into query fragments along scan lines.
  - 12. The system of claim 1, wherein the instance of the query engine is further configured to utilize instruction sets unique to a local processing circuitry to reduce a number of function calls necessary to execute the query fragments.
  - 13. The system of claim 1, wherein the data is stored in an unstructured format.
  - 14. The system of claim 1, wherein the instance of the query engine is further configured to:
    - extract a schema for data relevant to the query at query execution time; and
      
      convert local data relevant to the query from an unstructured format into a structured format, based on the schema, for a query fragment to be executed against the local data.
  - 15. The system of claim 14, wherein the structured format of the local data is held by the instance of the query engine as in-memory tuples.
  - 16. The system of claim 1, wherein the instance of the query engine store results associated with the query as in-memory tuples.
  - 17. The system of claim 1, wherein the query plan is further constructed based on a workload information of the peer network.
  - 18. The system of claim 1, wherein a query fragment is distributed to a data node among a plurality of data nodes that maintain replicas of a same data block.
  - 19. The system of claim 1, wherein the components of the HADOOP™ cluster maintain membership information regarding all query engine instances that are operated by the data nodes in the cluster.
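
Claims 17 and 18 mention workload information and nodes that hold replicas of the same data block. One way to picture how the two could interact is a greedy placement that sends each fragment to the least-loaded node holding a replica. This is only a sketch under stated assumptions: `place_fragments`, the shape of its inputs, and the greedy tie-breaking rule are all hypothetical and are not taken from the patent.

```python
def place_fragments(block_replicas, node_load):
    """Assign each block to the least-loaded node that holds a replica of it.

    block_replicas: block id -> list of nodes holding a replica (hypothetical input)
    node_load: node -> fragments already running on that node (hypothetical input)
    """
    load = dict(node_load)  # copy so the caller's view is not mutated
    placement = {}
    for block in sorted(block_replicas):
        # Ties are broken by node name so the result is deterministic.
        node = min(block_replicas[block], key=lambda n: (load.get(n, 0), n))
        placement[block] = node
        load[node] = load.get(node, 0) + 1  # account for the new fragment
    return placement

replicas = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-b", "node-c"],
}
print(place_fragments(replicas, {"node-a": 2, "node-b": 0, "node-c": 1}))
```

With these inputs the busy `node-a` receives no new fragments: `blk_1` and `blk_3` go to `node-b`, and `blk_2` goes to `node-c`.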

20. A method for operating a system that performs queries in a HADOOP™ distributed computing cluster having a plurality of data nodes storing data, the plurality of data nodes forming a peer network, a respective data node functioning as a peer in the peer network and being capable of interacting with components of the HADOOP™ cluster, the respective data node operating an instance of a low latency query engine that queries data directly from the respective data node, the query engine is configured to perform the method operations comprising:

  receiving a query from a client;

  obtaining, from the components of the HADOOP™ cluster, location information regarding where one or more data blocks relevant to the received query are distributed among nodes in the cluster;

  creating query fragments based on the obtained location information;

  constructing a query plan based on the obtained location information;

  distributing the query fragments among the nodes in the cluster according to the query plan;

  receiving intermediate results from the nodes in the cluster that receive the query fragments; and

  aggregating the intermediate results for the client.
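
The method steps of claim 20 can be traced in a small, single-process sketch: look up block locations, group blocks into one fragment per node, run each fragment against that node's local blocks, and merge the partial results. All names here (`BLOCKS`, `locate_blocks`, `create_fragments`, `execute_fragment`, `run_query`) are hypothetical, the block map stands in for the location information a real cluster would supply, and a per-value row count is assumed as the query.

```python
from collections import Counter, defaultdict

# Hypothetical block map: block id -> (node holding it, rows stored in it).
BLOCKS = {
    "blk_1": ("node-a", ["us", "de", "us"]),
    "blk_2": ("node-b", ["fr", "us"]),
    "blk_3": ("node-a", ["de"]),
}

def locate_blocks(block_ids):
    """Obtain location information: which node stores each relevant block."""
    return {block: BLOCKS[block][0] for block in block_ids}

def create_fragments(locations):
    """Build a plan with one fragment per node, covering that node's blocks."""
    plan = defaultdict(list)
    for block, node in sorted(locations.items()):
        plan[node].append(block)
    return dict(plan)

def execute_fragment(blocks):
    """Run on one node: count rows per value in its local blocks only."""
    counts = Counter()
    for block in blocks:
        counts.update(BLOCKS[block][1])
    return counts

def run_query(block_ids):
    """Distribute fragments, collect intermediate counts, and merge them."""
    plan = create_fragments(locate_blocks(block_ids))
    total = Counter()
    for node, blocks in plan.items():
        total += execute_fragment(blocks)  # intermediate result from `node`
    return dict(total)

print(run_query(["blk_1", "blk_2", "blk_3"]))
```

Because counting is associative, the per-node partial counts can be merged in any order, which is what lets the final aggregation stay cheap.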

Current Assignee: Cloudera Incorporated
Original Assignee: Cloudera Incorporated
Inventors: Kornacker, Marcel; Erickson, Justin; Li, Nong; Kuff, Lenni; Robinson, Henry Noel; Choi, Alan; Behm, Alex
Primary Examiner(s): Vy, Hung T
Application Number: US15/154,727
Publication Number: US 20170132283A1
Time in Patent Office: 753 Days
Field of Search: 707718

CPC Class Codes
G06F 16/2453   Query optimisation
G06F 16/24535   of sub-queries or views
G06F 16/24542   Plan optimisation
G06F 16/24544   Join order optimisation
G06F 16/2471   Distributed queries
G06F 16/258   Data format conversion from...