Background format optimization for enhanced SQL-like queries in Hadoop
Abstract
A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
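The scheduler/converter split described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the fixed-interval policy, the comma-separated original format, the column-oriented target layout, and the names `Scheduler`, `Converter`, and `build_daemon`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Converter:
    """Converts comma-separated rows into a columnar layout (assumed target format)."""
    schema: List[str]
    converted: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def convert(self, name: str, raw_lines: List[str]) -> None:
        columns: Dict[str, List[str]] = {col: [] for col in self.schema}
        for line in raw_lines:
            for col, value in zip(self.schema, line.split(",")):
                columns[col].append(value)
        self.converted[name] = columns


@dataclass
class Scheduler:
    """Decides when conversion runs; the fixed interval is an assumed policy."""
    interval_s: float
    notify: Callable[[], None]
    last_run: float = float("-inf")

    def tick(self, now: float) -> bool:
        # Notify the converter only when the interval has elapsed.
        if now - self.last_run >= self.interval_s:
            self.last_run = now
            self.notify()
            return True
        return False


def build_daemon(schema: List[str], pending: Dict[str, List[str]], interval_s: float = 60.0):
    """Wire a scheduler to a converter for one data node (hypothetical helper)."""
    converter = Converter(schema)

    def run_conversion() -> None:
        # Convert every dataset queued on this node, then clear the queue.
        for name, lines in list(pending.items()):
            converter.convert(name, lines)
            del pending[name]

    return Scheduler(interval_s, run_conversion), converter
```

Calling `tick` with the current time is the only entry point: the scheduler owns the timing decision and the converter only runs when notified, which mirrors the division of responsibility the abstract describes.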
15 Claims
1. A system for performing queries on stored data in a Hadoop™ distributed computing cluster, the system comprising:
a plurality of data nodes forming a peer-to-peer network for the queries received from a client, each data node of the plurality of data nodes functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having an instance of a query engine running in memory, each instance of the query engine having:
a query planner configured to parse a query from the client and selectively creates query fragments based on an availability of converted data at the data node, the converted data corresponding to data associated with the query, wherein the converted data is the data associated with the query converted from an original format into a target format that is specified by a schema, and wherein the query is processed by whichever data node that receives the query;
a query coordinator configured to distribute the query fragments among the plurality of data nodes; and
a query execution engine comprising:
a transformation module configured to transform whichever local data that corresponds to a format for which the query fragments are created into in-memory tuples based on the schema; and
an execution module configured to execute the query fragments on the in-memory tuples to obtain intermediate results from other data nodes that receive the query fragments and to aggregate the intermediate results for the client.
Dependent claims: 2, 3, 4, 5
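One way to read the query planner element of claim 1 is that each fragment is tagged with the data format it will run against, depending on whether converted data is available at the target node. The sketch below is only an illustration of that reading, not the claimed implementation. `QueryFragment`, `plan_fragments`, and the `converted_on` availability map are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class QueryFragment:
    node: str         # data node that will execute the fragment
    table: str        # data the fragment reads
    data_format: str  # "converted" or "original" (assumed labels)


def plan_fragments(
    table: str,
    nodes_with_blocks: List[str],
    converted_on: Dict[str, Set[str]],
) -> List[QueryFragment]:
    """Create one fragment per node, targeting converted data where it exists."""
    fragments = []
    for node in nodes_with_blocks:
        # Assumption: availability is known per node as a set of table names.
        has_converted = table in converted_on.get(node, set())
        fmt = "converted" if has_converted else "original"
        fragments.append(QueryFragment(node, table, fmt))
    return fragments
```

For example, `plan_fragments("events", ["n1", "n2"], {"n1": {"events"}})` yields a fragment for `n1` that reads the converted copy and a fragment for `n2` that falls back to the original format.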
6. A method for performing queries on stored data in a Hadoop™ distributed computing cluster system, the method comprising:
configuring a plurality of data nodes forming a peer-to-peer network for the queries received from a client, each data node of the plurality of data nodes functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having an instance of a query engine running in memory; and
configuring each instance of the query engine to include:
a query planner that parses a query from the client and selectively creates query fragments based on an availability of converted data at the data node, the converted data corresponding to data associated with the query, wherein the converted data is the data associated with the query converted from an original format into a target format that is specified by a schema, and wherein the query is processed by whichever data node that receives the query;
a query coordinator that distributes the query fragments among the plurality of data nodes; and
a query execution engine that:
transforms whichever local data that corresponds to a format for which the query fragments are created into in-memory tuples based on the schema; and
executes the query fragments on the in-memory tuples to obtain intermediate results from other data nodes that receive the query fragments and to aggregate the intermediate results for the client.
Dependent claims: 7, 8, 9, 10
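The transform, execute, and aggregate steps of the query execution engine in claim 6 can be sketched as three small functions. This is an illustration under stated assumptions rather than the claimed method: the original format is assumed to be comma-separated lines, the converted format is assumed to be a column-to-values mapping, the fragment is reduced to a row count under a predicate, and all function names are hypothetical.

```python
from collections import namedtuple
from typing import Callable, Iterable, List


def to_tuples(schema: List[str], local_data, data_format: str) -> list:
    """Turn local data in either format into in-memory tuples shaped by the schema."""
    Row = namedtuple("Row", schema)
    if data_format == "converted":
        # Assumed converted layout: {column_name: [values...]}.
        return [Row(*values) for values in zip(*(local_data[c] for c in schema))]
    # Assumed original layout: one comma-separated record per line.
    return [Row(*line.split(",")) for line in local_data]


def execute_fragment(rows: Iterable, predicate: Callable) -> int:
    """Run a fragment on one node; here the fragment is a filtered row count."""
    return sum(1 for row in rows if predicate(row))


def aggregate(partials: Iterable[int]) -> int:
    """Combine intermediate results from the nodes into the client's answer."""
    return sum(partials)
```

A short usage example: with `schema = ["id", "region"]`, one node can hold converted data (`{"id": ["1", "2"], "region": ["eu", "us"]}`) and another original lines (`["3,eu", "4,eu"]`). Both pass through `to_tuples`, each node's rows go to `execute_fragment` with the same predicate, and `aggregate` sums the per-node counts.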
11. A non-transitory computer readable medium for performing queries on stored data in a Hadoop™ distributed computing cluster system, the medium storing a plurality of instructions which, when executed by one or more processors, cause the system to perform a method comprising:
configuring a plurality of data nodes forming a peer-to-peer network for the queries received from a client, each data node of the plurality of data nodes functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having an instance of a query engine running in memory; and
configuring each instance of the query engine to:
parse a query from the client and selectively creates query fragments based on an availability of converted data at the data node, the converted data corresponding to data associated with the query, wherein the converted data is the data associated with the query converted from an original format into a target format that is specified by a schema, and wherein the query is processed by whichever data node that receives the query;
distribute the query fragments among the plurality of data nodes; and
transform whichever local data that corresponds to a format for which the query fragments are created into in-memory tuples based on the schema; and
execute the query fragments on the in-memory tuples to obtain intermediate results from other data nodes that receive the query fragments and to aggregate the intermediate results for the client.
Dependent claims: 12, 13, 14, 15
Specification