Tiered data processing for distributed data
Abstract
Data processing engines implement tiered data processing for distributed data in local and remote data stores. Requests to access distributed data including a data object in a remote data store are received at a data processing engine. A query plan is generated to service the access request. Different operations in the query plan are identified and assigned to one or more remote query processing engines that may access the remote data object. Requests to perform the different operations are sent to the one or more remote query processing engines. A final result is generated for the request based on the results received for the different operations from the remote query processing engine and results from operations performed with respect to locally stored data.
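The flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the store contents, the `scan_sum` operation, and every name here (`Operation`, `RemoteCommand`, `remote_engine_execute`, `modify_plan`, `run_query`) are hypothetical, and the remote engine is simulated by an in-process function rather than a network interface. It only shows the shape of the idea: generate a plan, reassign operations on remotely stored objects to a remote engine, then combine remote and local results.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical stand-ins for the local and remote data stores.
LOCAL_STORE = {"orders_2024": [10, 20, 30]}
REMOTE_STORE = {"orders_2019": [1, 2, 3, 4]}


@dataclass(frozen=True)
class Operation:
    kind: str          # only "scan_sum" is modeled in this sketch
    data_object: str   # name of the data object the operation reads


@dataclass(frozen=True)
class RemoteCommand:
    operation: Operation  # the query operation reassigned to the remote engine


PlanStep = Union[Operation, RemoteCommand]


def remote_engine_execute(command: RemoteCommand) -> int:
    """Simulated remote engine interface; it reads the remote store directly."""
    return sum(REMOTE_STORE[command.operation.data_object])


def generate_plan(data_objects: List[str]) -> List[Operation]:
    """Build one scan-and-sum operation per data object named in the query."""
    return [Operation("scan_sum", name) for name in data_objects]


def modify_plan(plan: List[Operation]) -> List[PlanStep]:
    """Replace operations on objects that are not stored locally with remote commands."""
    return [
        op if op.data_object in LOCAL_STORE else RemoteCommand(op)
        for op in plan
    ]


def execute(plan: List[PlanStep]) -> int:
    """Run each step locally or remotely, then combine the partial results."""
    partials = []
    for step in plan:
        if isinstance(step, RemoteCommand):
            partials.append(remote_engine_execute(step))
        else:
            partials.append(sum(LOCAL_STORE[step.data_object]))
    return sum(partials)


def run_query(data_objects: List[str]) -> int:
    """Generate the plan, reassign remote operations, and return the final result."""
    return execute(modify_plan(generate_plan(data_objects)))
```

In this sketch, `run_query(["orders_2024", "orders_2019"])` sums the local object on the compute node and the remote object through the simulated remote engine, then adds the two partial sums into one final result.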
20 Claims
1. A system, comprising:
a local data store, that stores one or more data objects of a distributed data set;
one or more compute nodes, respectively comprising at least one processor and a memory, storing program instructions that when executed cause the one or more compute nodes to perform a method, comprising:
receiving, from a client, a query directed to the distributed data set, wherein at least one data object of the distributed data set is stored in a data store that is remote to a data processing engine;
generating a query plan to execute the query, wherein the query plan comprises a plurality of different query operations;
modifying the query plan to include a command directed to an interface for a remote data processing engine, wherein the command corresponds to a query operation of the plurality of different query operations and wherein the command causes performance of the query operation at the remote data processing engine as part of reassigning the query operation from local execution at the one or more computing devices to the remote data processing engine, wherein the remote data processing engine can access the data object in the remote data store to execute the query operation;
directing execution of the different query operations, comprising sending a request to the remote data processing engine to execute the command to perform the reassigned query operation;
generating a final result for the query based, at least in part, on one or more results of the reassigned query operation received from the remote data processing engine with a result for another one of the different operations determined by the one or more compute nodes with respect to the one or more data objects in the local data store; and
sending the final result for the query to the client.
Dependent claims: 2, 3, 4
5. A method, comprising:
performing, by one or more computing devices:
receiving, from a client, a query directed to a distributed data set, wherein a data object of the distributed data set is stored in a data store that is remote to the one or more computing devices, wherein at least another one of the data objects is stored in a data store that is local to the one or more computing devices;
generating a query plan to execute the query, wherein the query plan comprises a plurality of different query operations;
modifying the query plan to include a command directed to an interface for a remote data processing engine, wherein the command corresponds to a query operation of the plurality of different operations and wherein the command causes performance of the query operation at the remote data processing engine as part of reassigning the query operation from local execution at the one or more computing devices to the remote data processing engine, wherein the remote data processing engine can access the data object in the remote data store to execute the query operation;
sending a request to the remote data processing engine to execute the command to perform the reassigned query operation; and
generating a final result for the query based, at least in part, on one or more results of the reassigned query operation received from the remote data processing engine and a result for another one of the different operations determined by the one or more computing devices with respect to the at least one other data object; and
sending the final result for the query to the client.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13
14. A non-transitory, computer-readable storage medium, storing program instructions that when executed by one or more computing devices cause the one or more computing devices to implement:
receiving a query from a client, wherein the query is directed to a distributed data set, wherein at least one data object in the distributed data set is stored in a data store that is remote to the one or more computing devices;
generating a query plan that comprises a plurality of different query operations to execute the query;
modifying the query plan to include a command directed to an interface for a remote data processing engine, wherein the command corresponds to a query operation of the plurality of different operations and wherein the command causes performance of the query operation at a remote data processing engine as part of reassigning of the query operation from local execution at the one or more computing devices to the remote data processing engine that can access the data object in the remote data store;
sending a request to execute the command to perform the reassigned query operation to the remote data processing engine;
generating a final result for the query based, at least in part, on one or more results of the reassigned query operation received from the remote data processing engine and a result for another one of the different operations determined by the one or more computing devices with respect to another data object in the distributed data set stored in a data store that is local to the one or more computing devices; and
sending the final result for the query to the client.
Dependent claims: 15, 16, 17, 18, 19, 20
Specification