Background format optimization for enhanced SQL-like queries in Hadoop

US 9,477,731 B2
Filed: 10/01/2013
Issued: 10/25/2016
Est. Priority Date: 10/01/2013
Status: Active Grant

5 Assignments

0 Petitions

Abstract

A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
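The division of labor described in the abstract (a scheduler that decides when to convert, and a converter that rewrites local data into a database-like format) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the class names `Scheduler`, `Converter`, and `ConversionDaemon`, the `is_idle` policy, the use of CSV as the original format, and the dict-of-columns stand-in for the database-like format are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def rows_to_columns(csv_text: str) -> Dict[str, List[str]]:
    """Turn row-oriented CSV text into a column-oriented dict.

    The dict-of-lists layout is an assumed stand-in for a database-like format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: Dict[str, List[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


@dataclass
class Converter:
    """Holds the converted copies of local data sets, keyed by name."""
    converted: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def convert(self, name: str, csv_text: str) -> None:
        self.converted[name] = rows_to_columns(csv_text)


@dataclass
class Scheduler:
    """Decides when conversion may run; the idle check is a hypothetical policy."""
    is_idle: Callable[[], bool]

    def should_convert(self) -> bool:
        return self.is_idle()


@dataclass
class ConversionDaemon:
    """One daemon per data node: converts pending data when the scheduler allows."""
    scheduler: Scheduler
    converter: Converter
    pending: Dict[str, str] = field(default_factory=dict)

    def tick(self) -> int:
        """Run one scheduling round and return how many data sets were converted."""
        if not self.scheduler.should_convert():
            return 0
        done = 0
        for name in list(self.pending):
            self.converter.convert(name, self.pending.pop(name))
            done += 1
        return done
```

In this sketch the original data is left untouched and the converted copy is stored alongside it, so a query engine could fall back to the original format whenever no converted copy exists yet.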

147 Citations

15 Claims

1. A system for performing queries on stored data in a Hadoop™ distributed computing cluster, the system comprising:

  a plurality of data nodes forming a peer-to-peer network for the queries received from a client, each data node of the plurality of data nodes functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having an instance of a query engine running in memory, each instance of the query engine having:

  a query planner configured to parse a query from the client and selectively creates query fragments based on an availability of converted data at the data node, the converted data corresponding to data associated with the query, wherein the converted data is the data associated with the query converted from an original format into a target format that is specified by a schema, and wherein the query is processed by whichever data node that receives the query;

  a query coordinator configured to distribute the query fragments among the plurality of data nodes; and

  a query execution engine comprising:

  a transformation module configured to transform whichever local data that corresponds to a format for which the query fragments are created into in-memory tuples based on the schema; and

  an execution module configured to execute the query fragments on the in-memory tuples to obtain intermediate results from other data nodes that receive the query fragments and to aggregate the intermediate results for the client.

  2. The system of claim 1, wherein the target format is a columnar format.
  3. The system of claim 1, wherein the target format is optimized for relational database processing.
  4. The system of claim 1, wherein, when the converted data is available, the query fragments are created for the target format.
  5. The system of claim 1, wherein, when the converted data is not available, the query fragments are created for the original format.
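As a reading aid for claims 1, 4, and 5, the following minimal sketch shows one way a planner could pick the fragment format from the availability of converted data, with each node then turning the matching local data into in-memory tuples and a coordinator aggregating the partial results. Everything here is an assumption made for illustration: the names `Fragment`, `plan`, `to_tuples`, and `run_query`, the comma-separated original format, the dict-of-columns target format, and the choice of a simple sum as the query.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Schema = List[str]  # ordered column names (assumed schema representation)


@dataclass(frozen=True)
class Fragment:
    node: str
    fmt: str  # "target" (converted data) or "original"


def plan(nodes: List[str], converted_available: bool) -> List[Fragment]:
    """Create one fragment per node, for the target format only if converted data exists."""
    fmt = "target" if converted_available else "original"
    return [Fragment(node, fmt) for node in nodes]


def to_tuples(fragment: Fragment, local: Dict[str, Any], schema: Schema) -> List[Tuple[str, ...]]:
    """Transform the local data matching the fragment's format into in-memory tuples."""
    if fragment.fmt == "target":
        columns = local["target"]  # assumed layout: column name -> list of values
        return list(zip(*(columns[name] for name in schema)))
    # assumed original layout: comma-separated text lines in schema order
    return [tuple(line.split(",")) for line in local["original"]]


def run_query(fragments: List[Fragment], storage: Dict[str, Dict[str, Any]],
              schema: Schema, column: str) -> int:
    """Sum one column: each fragment yields a partial sum, which is then aggregated."""
    idx = schema.index(column)
    partials = [
        sum(int(row[idx]) for row in to_tuples(f, storage[f.node], schema))
        for f in fragments
    ]
    return sum(partials)
```

Under these assumptions the query result is the same whichever format the fragments are created for; only the transformation step that produces the tuples differs.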

6. A method for performing queries on stored data in a Hadoop™ distributed computing cluster system, the method comprising:

  configuring a plurality of data nodes forming a peer-to-peer network for the queries received from a client, each data node of the plurality of data nodes functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having an instance of a query engine running in memory; and

  configuring each instance of the query engine to include:

  a query planner that parses a query from the client and selectively creates query fragments based on an availability of converted data at the data node, the converted data corresponding to data associated with the query, wherein the converted data is the data associated with the query converted from an original format into a target format that is specified by a schema, and wherein the query is processed by whichever data node that receives the query;

  a query coordinator that distributes the query fragments among the plurality of data nodes; and

  a query execution engine that:

  transforms whichever local data that corresponds to a format for which the query fragments are created into in-memory tuples based on the schema; and

  executes the query fragments on the in-memory tuples to obtain intermediate results from other data nodes that receive the query fragments and to aggregate the intermediate results for the client.

  7. The method of claim 6, wherein the target format is a columnar format.
  8. The method of claim 6, wherein the target format is optimized for relational database processing.
  9. The method of claim 6, wherein, when the converted data is available, the query fragments are created for the target format.
  10. The method of claim 6, wherein, when the converted data is not available, the query fragments are created for the original format.

11. A non-transitory computer readable medium for performing queries on stored data in a Hadoop™ distributed computing cluster system, the medium storing a plurality of instructions which, when executed by one or more processors, cause the system to perform a method comprising:

  configuring a plurality of data nodes forming a peer-to-peer network for the queries received from a client, each data node of the plurality of data nodes functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having an instance of a query engine running in memory; and

  configuring each instance of the query engine to:

  parse a query from the client and selectively creates query fragments based on an availability of converted data at the data node, the converted data corresponding to data associated with the query, wherein the converted data is the data associated with the query converted from an original format into a target format that is specified by a schema, and wherein the query is processed by whichever data node that receives the query;

  distribute the query fragments among the plurality of data nodes; and

  transform whichever local data that corresponds to a format for which the query fragments are created into in-memory tuples based on the schema; and

  execute the query fragments on the in-memory tuples to obtain intermediate results from other data nodes that receive the query fragments and to aggregate the intermediate results for the client.

  12. The non-transitory computer readable medium of claim 11, wherein the target format is a columnar format.
  13. The non-transitory computer readable medium of claim 11, wherein the target format is optimized for relational database processing.
  14. The non-transitory computer readable medium of claim 11, wherein, when the converted data is available, the query fragments are created for the target format.
  15. The non-transitory computer readable medium of claim 11, wherein, when the converted data is not available, the query fragments are created for the original format.

Current Assignee
Cloudera Incorporated
Original Assignee
Cloudera Incorporated
Inventors
Kornacker, Marcel; Erickson, Justin; Li, Nong; Kuff, Lenni; Robinson, Henry Noel; Choi, Alan; Behm, Alex
Primary Examiner(s)
Badawi, Sherief
Assistant Examiner(s)
Raab, Christopher J

Application Number
US14/043,753
Publication Number
US 20150095308A1
Time in Patent Office
1,120 Days
Field of Search
None
US Class Current
1/1
CPC Class Codes
G06F 16/24534   Query rewriting; Transforma...
G06F 16/24542   Plan optimisation
G06F 16/2471   Distributed queries
G06F 16/258   Data format conversion from...
G06F 16/27   Replication, distribution o...
