BACKGROUND FORMAT OPTIMIZATION FOR ENHANCED SQL-LIKE QUERIES IN HADOOP

US 20150095308A1
Filed: 10/01/2013
Published: 04/02/2015
Est. Priority Date: 10/01/2013
Status: Active Grant

Abstract

A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
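
The division of labor described in the abstract can be illustrated with a minimal Python sketch. It assumes a purely time-based schedule and uses in-memory lists of dictionaries as a stand-in for files on a data node. `ConversionDaemon`, `to_columnar`, and `interval_s` are hypothetical names chosen for illustration; they are not taken from the patent or from any Hadoop API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

Row = Dict[str, Any]
Columns = Dict[str, List[Any]]


def to_columnar(rows: List[Row]) -> Columns:
    """Converter: turn row-oriented records into one list per column."""
    names = sorted({name for row in rows for name in row})
    # Missing fields become None so every column has the same length.
    return {name: [row.get(name) for row in rows] for name in names}


@dataclass
class ConversionDaemon:
    """Hypothetical per-node daemon combining a scheduler and a converter."""
    interval_s: float                       # assumed periodic schedule
    last_run: float = float("-inf")         # never run yet
    converted: Dict[str, Columns] = field(default_factory=dict)

    def due(self, now: float) -> bool:
        """Scheduler: decide whether a conversion should happen now."""
        return now - self.last_run >= self.interval_s

    def tick(self, now: float, store: Dict[str, List[Row]]) -> bool:
        """Run the converter over every data set if the scheduler says so."""
        if not self.due(now):
            return False
        for name, rows in store.items():
            self.converted[name] = to_columnar(rows)
        self.last_run = now
        return True
```

In this sketch the original rows in `store` are left untouched, so a query engine could read either copy. A real daemon would read and write files on the node rather than Python objects.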

19 Claims

1. A system for performing queries on stored data in a distributed computing cluster of a plurality of data nodes, comprising:
- a query engine for each data node, having:
  
  a query planner that parses a query from a client to create query fragments based on a schema specifying one or more formats in which data is stored on the data nodes, wherein, when data in a target format is stored, the query fragments are created for the target format, and when data in the target format is not stored, the query fragments are created for another format;
  
  a query coordinator that distributes the query fragments among the plurality of data nodes; and
  
  a query execution engine comprising:
  
  a transformation module that transforms the data in the format for which the query fragments are created based on the schema; and
  
  an execution module that executes the query fragments on the transformed data to obtain intermediate results that are aggregated and returned to the client.
  - 2. The system of claim 1, wherein the distributed computing cluster is a Hadoop cluster.
  - 3. The system of claim 1, wherein the target format is a columnar format.
  - 4. The system of claim 1, wherein the target format is optimized for relational database processing.
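
The format-selection rule recited in claim 1 can be sketched in Python as follows. This is an illustration only, assuming the schema is a mapping from table name to the set of formats currently stored for that table. `choose_format`, `plan_fragments`, the fragment dictionary layout, and the format labels are all hypothetical, and the fallback choice (first stored format in sorted order) is an arbitrary assumption.

```python
from typing import Dict, List, Set


def choose_format(stored_formats: Set[str], target_format: str) -> str:
    """Prefer the target format when it is stored; otherwise fall back."""
    if target_format in stored_formats:
        return target_format
    if not stored_formats:
        raise ValueError("no stored format available for this table")
    # Assumption: any other stored format is acceptable; pick one
    # deterministically.
    return sorted(stored_formats)[0]


def plan_fragments(
    query: str,
    table: str,
    schema: Dict[str, Set[str]],
    nodes: List[str],
    target_format: str = "columnar",
) -> List[dict]:
    """Create one fragment per data node, tagged with the chosen format."""
    fmt = choose_format(schema[table], target_format)
    return [
        {"node": node, "query": query, "table": table, "format": fmt}
        for node in nodes
    ]
```

For example, with `schema = {"logs": {"text"}}` the fragments are planned for `"text"`, and once a background conversion adds `"columnar"` to that set, the same call plans them for `"columnar"`.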

5. A method of data processing for query execution, comprising the steps of:
- storing initial data in an original format;
  
  converting the initial data to be in a target format that is optimized for relational database processing according to a predetermined schedule; and
  
  storing the converted data together with the initial data.
  - 6. The method of claim 5, further comprising the steps of:
    - receiving a set of one or more query fragments after the converted data is stored;
      
      transforming the converted data in response to the receipt; and
      
      executing the set of one or more query fragments on the transformed data.
  - 7. The method of claim 5, wherein the predetermined schedule is periodic.
  - 8. The method of claim 6, wherein the predetermined schedule is based on a number of sets of query fragments that have been received.
  - 9. The method of claim 5, further comprising the step of storing new data in the original format to replace the initial data, wherein the predetermined schedule is based on a number of times stored data in the original format has been replaced.
  - 10. The method of claim 5, wherein the target format is a columnar format.
  - 11. The method of claim 5, further comprising the step of deleting the converted data according to a schedule.
  - 12. The method of claim 6, wherein the initial data and the converted data are stored on a data node in a distributed computing cluster, and wherein the set of one or more query fragments are executed on the data node.
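
Claims 7 to 9 name three possible bases for the predetermined schedule: a period, a count of received query-fragment sets, and a count of replacements of the original data. A minimal Python sketch of such a trigger check follows. It assumes each basis is expressed as an optional threshold and that any one of them is enough to trigger a conversion. `ConversionPolicy`, `ScheduleState`, and their field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScheduleState:
    """Counters accumulated since the last conversion (hypothetical)."""
    seconds_since_conversion: float = 0.0
    fragment_sets_received: int = 0
    replacements: int = 0


@dataclass
class ConversionPolicy:
    """Optional thresholds; None disables that trigger."""
    period_s: Optional[float] = None                # periodic schedule
    fragment_set_threshold: Optional[int] = None    # query-driven schedule
    replacement_threshold: Optional[int] = None     # update-driven schedule

    def should_convert(self, state: ScheduleState) -> bool:
        """Return True when any enabled trigger has reached its threshold."""
        return any([
            self.period_s is not None
            and state.seconds_since_conversion >= self.period_s,
            self.fragment_set_threshold is not None
            and state.fragment_sets_received >= self.fragment_set_threshold,
            self.replacement_threshold is not None
            and state.replacements >= self.replacement_threshold,
        ])
```

The same shape could cover the deletion schedule of claim 11 by evaluating a second policy object against counters kept for the converted copy.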

13. A system for data processing for query execution, comprising:
- a first storing unit which stores initial data in an original format;
  
  a converting unit which converts the initial data to be in a target format that is optimized for relational database processing according to a predetermined schedule; and
  
  a second storing unit which stores the converted data.
  - 14. The system of claim 13, further comprising:
    - a receiving unit which receives a set of one or more query fragments after the converted data is stored;
      
      a transforming unit which transforms the converted data in response to the receipt; and
      
      an executing unit which executes the set of one or more query fragments on the transformed data.
  - 15. The system of claim 13, wherein the target format is a columnar format.
  - 16. The system of claim 13, wherein the system is a node in a distributed computing cluster.

17. A machine-readable storage medium having stored thereon instructions which when executed by one or more processors perform a method, the method comprising the steps of:
- storing initial data in an original format;
  
  converting the initial data to be in a target format that is optimized for relational database processing according to a predetermined schedule; and
  
  storing the converted data.
  - 18. The machine-readable storage medium of claim 17, the method further comprising the steps of:
    - receiving a set of one or more query fragments after the converted data is stored;
      
      transforming the converted data in response to the receipt; and
      
      executing the set of one or more query fragments on the transformed data.
  - 19. The machine-readable storage medium of claim 17, wherein the target format is a columnar format.

Current Assignee
Cloudera Incorporated
Original Assignee
Cloudera Incorporated
Inventors
Kornacker, Marcel; Erickson, Justin; Li, Nong; Kuff, Lenni; Robinson, Henry Noel; Choi, Alan; Behm, Alex

Granted Patent

US 9,477,731 B2
US Class Current

707/718
CPC Class Codes

G06F 16/24534   Query rewriting; Transforma...

G06F 16/24542   Plan optimisation

G06F 16/2471   Distributed queries

G06F 16/258   Data format conversion from...

G06F 16/27   Replication, distribution o...
