BACKGROUND FORMAT OPTIMIZATION FOR ENHANCED SQL-LIKE QUERIES IN HADOOP

US 20170032003A1
Filed: 10/12/2016
Published: 02/02/2017
Est. Priority Date: 10/01/2013
Status: Active Grant

Abstract

A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
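The scheduler and converter split described in the abstract can be illustrated with a minimal sketch. Everything named here (`Scheduler`, `Converter`, `ConversionDaemon`, `tick`, the fixed-interval trigger, and the in-memory row and column layouts) is a hypothetical stand-in chosen for illustration, not the patent's implementation or any Hadoop API.

```python
from typing import Dict, List, Optional


class Scheduler:
    """Decides when a conversion is due (assumption: a fixed interval)."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self.last_run: Optional[float] = None

    def due(self, now: float) -> bool:
        # Assumption: the first conversion runs as soon as the daemon ticks.
        return self.last_run is None or now - self.last_run >= self.interval_s

    def mark_run(self, now: float) -> None:
        self.last_run = now


class Converter:
    """Converts row-oriented records into a column-oriented layout."""

    def convert(self, rows: List[dict]) -> Dict[str, list]:
        # Collect column names in first-seen order so every column has
        # exactly one entry per row (missing values become None).
        names: List[str] = []
        for row in rows:
            for name in row:
                if name not in names:
                    names.append(name)
        return {name: [row.get(name) for row in rows] for name in names}


class ConversionDaemon:
    """Per-node daemon: the scheduler notifies, the converter converts."""

    def __init__(self, scheduler: Scheduler, converter: Converter) -> None:
        self.scheduler = scheduler
        self.converter = converter
        self.original: List[dict] = []               # data in its original format
        self.converted: Optional[Dict[str, list]] = None  # kept alongside it

    def tick(self, now: float) -> bool:
        """Run one scheduling step; return True if a conversion happened."""
        if not self.scheduler.due(now):
            return False
        self.converted = self.converter.convert(self.original)
        self.scheduler.mark_run(now)
        return True


daemon = ConversionDaemon(Scheduler(interval_s=60.0), Converter())
daemon.original = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
daemon.tick(now=0.0)
print(daemon.converted)
```

Passing the clock value into `tick` rather than reading the system time keeps the sketch deterministic; a real daemon would presumably run this step on a timer.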

20 Claims

1. A method of data processing for query execution, the method being performed by a query engine instance running on each data node of a plurality of data nodes which together form a Hadoop™ distributed computing cluster, wherein a query is processed by whichever data node that receives the query, the method comprising:
  
  storing initial data in an original format at a data node in the plurality of data nodes forming a peer-to-peer network for the query, each data node functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having an instance of a query engine running in memory;
  
  converting, at the data node, the initial data to be in a target format that is optimized for relational database processing according to a predetermined schedule; and
  
  storing the converted data together with the initial data.
  - 2. The method of claim 1, further comprising:
    - receiving a set of one or more query fragments after the converted data is stored;
      
      transforming the converted data in response to the receipt; and
      
      executing the set of one or more query fragments on the transformed data.
  - 3. The method of claim 1, wherein the predetermined schedule is periodic.
  - 4. The method of claim 1, wherein the predetermined schedule is based on a number of sets of query fragments that have been received.
  - 5. The method of claim 1, further comprising:
    - storing new data in the original format to replace the initial data, wherein the predetermined schedule is based on a number of times stored data in the original format has been replaced.
  - 6. The method of claim 1, wherein the target format is a columnar format.
  - 7. The method of claim 1, further comprising deleting the converted data according to a schedule.
  - 8. The method of claim 1, wherein the initial data and the converted data are stored on a data node in a distributed computing cluster, and wherein the set of one or more query fragments are executed on the data node.
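Claims 3 to 5 name three possible bases for the predetermined schedule: a period, the number of query-fragment sets received, and the number of times the original-format data has been replaced. A hedged sketch of how such triggers might be combined follows. The claims do not say how a count maps to a trigger, so the thresholds, the "any trigger fires" rule, and the names `ScheduleState` and `SchedulePolicy` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScheduleState:
    """Counters a node might track since its last conversion (hypothetical)."""
    seconds_since_conversion: float = 0.0
    fragment_sets_received: int = 0
    original_replacements: int = 0


@dataclass
class SchedulePolicy:
    """Each field enables one trigger; None leaves that trigger disabled."""
    period_s: Optional[float] = None              # periodic schedule (claim 3)
    fragment_set_threshold: Optional[int] = None  # fragment-set count (claim 4)
    replacement_threshold: Optional[int] = None   # replacement count (claim 5)

    def should_convert(self, state: ScheduleState) -> bool:
        # Assumption: conversion is due as soon as any enabled trigger fires.
        if self.period_s is not None and state.seconds_since_conversion >= self.period_s:
            return True
        if (self.fragment_set_threshold is not None
                and state.fragment_sets_received >= self.fragment_set_threshold):
            return True
        if (self.replacement_threshold is not None
                and state.original_replacements >= self.replacement_threshold):
            return True
        return False


policy = SchedulePolicy(period_s=3600.0, fragment_set_threshold=5)
print(policy.should_convert(ScheduleState(seconds_since_conversion=120.0,
                                          fragment_sets_received=5)))
```

The same structure could drive the scheduled deletion of converted data in claim 7 by evaluating a second policy instance, though the claim itself does not specify that schedule.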

9. A system for data processing for query execution, the system having a plurality of data nodes of a plurality of data nodes which together form a Hadoop™ distributed computing cluster, each data node having an instance of a query engine, wherein a query is processed by whichever data node that receives the query, and wherein the plurality of data nodes include:
  
  a first storing unit which stores initial data in an original format at a data node in the plurality of data nodes forming a peer-to-peer network for the query, each data node functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having the instance of the query engine running in memory;
  
  a converting unit which converts the initial data to be in a target format that is optimized for relational database processing according to a predetermined schedule; and
  
  a second storing unit which stores the converted data.
  - 10. The system of claim 9, further comprising:
    - a receiving unit which receives a set of one or more query fragments after the converted data is stored;
      
      a transforming unit which transforms the converted data in response to the receipt; and
      
      an executing unit which executes the set of one or more query fragments on the transformed data.
  - 11. The system of claim 9, wherein the target format is a columnar format.
  - 12. The system of claim 9, wherein the system is a node in distributed computing cluster.

13. A non-transitory machine-readable storage medium having stored thereon instructions which when executed by one or more processors perform a method, the method being performed by a query engine instance running on each data node of a plurality of data nodes which together form a Hadoop™ distributed computing cluster, wherein a query is processed by whichever data node that receives the query, the method comprising:
  
  storing initial data in an original format at a data node in the plurality of data nodes forming a peer-to-peer network for the query, each data node functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, each peer having the query engine instance running in memory;
  
  converting, at the data node, the initial data to be in a target format that is optimized for relational database processing according to a predetermined schedule; and
  
  storing the converted data.
  - 14. The machine-readable storage medium of claim 13, the method further comprising:
    - receiving a set of one or more query fragments after the converted data is stored;
      
      transforming the converted data in response to the receipt; and
      
      executing the set of one or more query fragments on the transformed data.
  - 15. The machine-readable storage medium of claim 13, wherein the target format is a columnar format.

16. A system for performing queries on stored data in a Hadoop™ distributed computing cluster, the system comprising:
  
  a plurality of data nodes forming a peer-to-peer network for the queries received from a client, a respective data node of the plurality of data nodes functioning as a peer in the peer-to-peer network and being capable of interacting with components of the Hadoop™ cluster, the respective data node operating an instance of a query engine that is configured to:
  
  parse a query from a client;
  
  selectively creates query fragments based on an availability of converted data at the respective data node, the converted data corresponding to data associated with the query, wherein the converted data is the data associated with the query converted from an original format into a target format that is specified by a schema, and wherein the query is processed by whichever data node that receives the query;
  
  distribute the query fragments among the plurality of data nodes;
  
  execute the query fragments on whichever local data that corresponds to a format for which the query fragments are created, based on the schema;
  
  obtain intermediate results from other data nodes that receive the query fragments; and
  
  aggregate the intermediate results for the client.
  - 17. The system of claim 16, wherein the target format is a columnar format.
  - 18. The system of claim 16, wherein the target format is optimized for relational database processing.
  - 19. The system of claim 16, wherein, when the converted data is available, the query fragments are created for the target format.
  - 20. The system of claim 16, wherein, when the converted data is not available, the query fragments are created for the original format.
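The flow in claims 16, 19 and 20 (create fragments for the target format when converted data is available, otherwise for the original format, then execute and aggregate) might look roughly like the sketch below. It is illustrative only: SQL parsing is replaced by a single hypothetical `SUM(column)` query, distribution is an in-process loop instead of a network call, availability is checked per executing node (one possible reading of the claim), and `Fragment`, `DataNode`, `plan_fragment` and `run_sum_query` are invented names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Fragment:
    """A unit of work created for one specific data format."""
    column: str
    data_format: str  # "target" (converted) or "original"


@dataclass
class DataNode:
    name: str
    original_rows: List[dict]
    converted_columns: Optional[Dict[str, list]] = None  # None until converted

    def execute(self, fragment: Fragment) -> int:
        # Run the fragment on the local data whose format it was created for.
        if fragment.data_format == "target":
            if self.converted_columns is None:
                raise ValueError("fragment targets converted data that is absent")
            return sum(self.converted_columns[fragment.column])
        return sum(row[fragment.column] for row in self.original_rows)


def plan_fragment(node: DataNode, column: str) -> Fragment:
    # Target format when converted data is available, otherwise original.
    fmt = "target" if node.converted_columns is not None else "original"
    return Fragment(column=column, data_format=fmt)


def run_sum_query(nodes: List[DataNode], column: str) -> int:
    """Plan one fragment per node, gather partial sums, and aggregate them."""
    intermediate = [node.execute(plan_fragment(node, column)) for node in nodes]
    return sum(intermediate)


node_a = DataNode("a", [{"amount": 1}, {"amount": 2}], {"amount": [1, 2]})
node_b = DataNode("b", [{"amount": 3}, {"amount": 4}])  # not yet converted
print(run_sum_query([node_a, node_b], "amount"))  # prints 10
```

Because every node keeps its original data, a query can be answered before any conversion has run; the converted copy only changes which fragment type is planned.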

Current Assignee: Cloudera Incorporated
Original Assignee: Cloudera Incorporated
Inventors: Kornacker, Marcel; Erickson, Justin; Li, Nong; Kuff, Lenni; Robinson, Henry Noel; Choi, Alan; Behm, Alex
Granted Patent: US 10,706,059 B2
US Class Current: 1/1
CPC Class Codes:
G06F 16/24534   Query rewriting; Transforma...
G06F 16/24542   Plan optimisation
G06F 16/2471   Distributed queries
G06F 16/258   Data format conversion from...
G06F 16/27   Replication, distribution o...
