Dynamic tree determination for data processing
First Claim
1. A method of determining a node hierarchy for processing a set of queries, comprising:
- receiving an instance query and a summary query to process data using a cluster of nodes;
specifying a first set of nodes in the cluster to process the instance query, the number of nodes to be included in the first set of nodes based at least in part upon a predicted number of first results and an expected capacity of each node of the first set of nodes;
determining at least a second set of nodes in the cluster based at least in part upon the predicted number of first results for the instance query to process the summary query, each node of the second set of nodes being a parent node to at least one child node in the first set of nodes and the second set of nodes being distinct from the first set of nodes;
scheduling a plurality of jobs each corresponding to at least one of the instance or summary queries, each job to be executed on a single node;
causing each node of the first set of nodes to process the instance query, at least a portion of the first set of nodes operable to process the instance query in parallel;
causing at least a portion of the second set of nodes to process the summary query when a respective portion of the first set of nodes finishes processing the instance query, at least a portion of the second set of nodes operable to process the summary query in parallel; and
storing a result of the summary query to a specified location,wherein the second set of nodes is operable to have a plurality of hierarchical levels such that a parent node of the second set of nodes processes results of at least two child nodes in the second set of nodes, the number of hierarchical levels being based at least in part upon the predicted number of first results and an expected capacity of each node of the second set of nodes.
1 Assignment
0 Petitions
Accused Products
Abstract
Data can be processed in parallel across a cluster of nodes using a parallel processing framework. Using Web services calls between components allows the number of nodes to be scaled as necessary, and allows developers to build applications on the framework using a Web services interface. A job scheduler works together with a queuing service to distribute jobs to nodes as the nodes have capacity, such that jobs can be performed in parallel as quickly as the nodes are able to process the jobs. Data can be loaded efficiently across the cluster, and levels of nodes can be determined dynamically to process queries and other requests on the system.
47 Citations
25 Claims
-
1. A method of determining a node hierarchy for processing a set of queries, comprising:
-
receiving an instance query and a summary query to process data using a cluster of nodes; specifying a first set of nodes in the cluster to process the instance query, the number of nodes to be included in the first set of nodes based at least in part upon a predicted number of first results and an expected capacity of each node of the first set of nodes; determining at least a second set of nodes in the cluster based at least in part upon the predicted number of first results for the instance query to process the summary query, each node of the second set of nodes being a parent node to at least one child node in the first set of nodes and the second set of nodes being distinct from the first set of nodes; scheduling a plurality of jobs each corresponding to at least one of the instance or summary queries, each job to be executed on a single node; causing each node of the first set of nodes to process the instance query, at least a portion of the first set of nodes operable to process the instance query in parallel; causing at least a portion of the second set of nodes to process the summary query when a respective portion of the first set of nodes finishes processing the instance query, at least a portion of the second set of nodes operable to process the summary query in parallel; and storing a result of the summary query to a specified location, wherein the second set of nodes is operable to have a plurality of hierarchical levels such that a parent node of the second set of nodes processes results of at least two child nodes in the second set of nodes, the number of hierarchical levels being based at least in part upon the predicted number of first results and an expected capacity of each node of the second set of nodes. - View Dependent Claims (2, 3)
-
-
4. A method of determining a node hierarchy for processing a request, comprising:
-
specifying a first set of nodes in a cluster to process a first portion of a request, the number of nodes in the first set of nodes determined based at least in part upon a predicted number of first results and an expected capacity of each node of the first set of nodes; determining a second set of nodes in the cluster based at least in part upon the predicted number of first results for the first portion of the request to process a second portion of the request, each node of the second set of nodes being a parent node to at least one child node in the first set of nodes and the second set of nodes being distinct from the first set of nodes; causing each node of the first set of nodes to process the first portion of the request, at least a portion of the first set of nodes operable to process the first portion in parallel; causing at least a portion of the second set of nodes to process the second portion of the request when a respective portion of the first set of nodes finishes processing the first portion of the request, at least a portion of the second set of nodes operable to process the second portion in parallel; and wherein the second set of nodes is operable to have a plurality of hierarchical levels such that a parent node of the second set of nodes processes results of at least two child nodes in the second set of nodes, the number of hierarchical levels being based at least in part upon the predicted number of first results and an expected capacity of each node of the second set of nodes. - View Dependent Claims (5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
-
-
10. A method of determining a node hierarchy for processing a request, comprising:
-
specifying a first set of nodes in a cluster to process a first portion of a request, the number of nodes in the first set of nodes determined based at least in part upon a predicted number of first results and an expected capacity of each node of the first set of nodes; determining a second set of nodes in the cluster based at least in part upon the predicted number of first results for the first portion of the request to process a second portion of the request, each node of the second set of nodes being a parent node to at least one child node in the first set of nodes and the second set of nodes being distinct from the first set of nodes; causing each node of the first set of nodes to process the first portion of the request, at least a portion of the first set of nodes operable to process the first portion in parallel; causing at least a portion of the second set of nodes to process the second portion of the request when a respective portion of the first set of nodes finishes processing the first portion of the request, at least a portion of the second set of nodes operable to process the second portion in parallel; wherein the second set of nodes is operable to have a plurality of hierarchical levels such that a parent node of the second set of nodes processes results of at least two child nodes in the second set of nodes, the number of hierarchical levels being based at least in part upon the predicted number of first results and an expected capacity of each node of the second set of nodes; and wherein causing each node of the first set of nodes to process the first portion of the request and causing each of the second set of nodes to process the second portion of the request collectively comprises; scheduling a plurality of jobs each corresponding to at least one of the first and or second portions of the request, each job to be executed on a single node; tracking dependencies between jobs such that jobs for the second set of nodes are not processed until the corresponding jobs for the first set of nodes are processed.
-
-
21. A system for determining a node hierarchy for processing a request, comprising:
-
a processor; and a memory device including instructions that, when executed by the processor, cause the processor to; specify a first set of nodes in a cluster to process a first portion of a request, the number of nodes in the first set of nodes determined based at least in part upon a predicted number of first results and an expected capacity of each node of the first set of nodes; determine a second set of nodes in the cluster based at least in part upon the predicted number of first results for the first portion of the request to process a second portion of the request, each node of the second set of nodes being a parent node to at least one child node in the first set of nodes and the second set of nodes being distinct from the first set of nodes; cause each node of the first set of nodes to process the first portion of the request, at least a portion of the first set of nodes operable to process the first portion in parallel; cause at least a portion of the second set of nodes to process the second portion of the request when a respective portion of the first set of nodes finishes processing the first portion of the request, at least a portion of the second set of nodes operable to process the second portion in parallel; and wherein the second set of nodes is operable to have a plurality of hierarchical levels such that a parent node of the second set of nodes processes results of at least two child nodes in the second set of nodes, the number of hierarchical levels being based at least in part upon the predicted number of first results and an expected capacity of each node of the second set of nodes. - View Dependent Claims (22, 23)
-
-
24. A computer program product embedded in a non-transitory computer readable storage medium having program instructions stored thereon that upon execution by a processor causes a computer system to determine a node hierarchy for processing a request, the computer program product comprising:
-
program instructions for specifying a first set of nodes in a cluster to process a first portion of a request, the number of nodes in the first set of nodes determined based at least in part upon a predicted number of first results and an expected capacity of each node of the first set of nodes; program instructions for determining a second set of nodes in the cluster based at least in part upon a predicted number of first results for the first portion of the request to process a second portion of the request, each node of the second set of nodes being a parent node to at least one child node in the first set of nodes and the second set of nodes being distinct from the first set of nodes; program instructions for causing each node of the first set of nodes to process the first portion of the request, at least a portion of the first set of nodes operable to process the first portion in parallel; program instructions for causing at least a portion of the second set of nodes to process the second portion of the request when a respective portion of the first set of nodes finishes processing the first portion of the request, at least a portion of the second set of nodes operable to process the second portion in parallel; and wherein the second set of nodes is operable to have a plurality of hierarchical levels such that a parent node of the second set of nodes processes results of at least two child nodes in the second set of nodes, the number of hierarchical levels being based at least in part upon the predicted number of first results and an expected capacity of each node of the second set of nodes. - View Dependent Claims (25)
-
Specification