Parallel processing framework

US 9,996,593 B1
Filed: 08/05/2014
Issued: 06/12/2018
Est. Priority Date: 08/28/2008
Status: Active Grant

Abstract

Data can be processed in parallel across a cluster of nodes using a parallel processing framework. Using Web services calls between components allows the number of nodes to be scaled as necessary, and allows developers to build applications on the framework using a Web services interface. A job scheduler works together with a queuing service to distribute jobs to nodes as the nodes have capacity, such that jobs can be performed in parallel as quickly as the nodes are able to process the jobs. Data can be loaded efficiently across the cluster, and levels of nodes can be determined dynamically to process queries and other requests on the system.
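
The scheduler and queue interaction described in the abstract can be pictured with a small simulation. This is a minimal sketch, not the patented implementation: the `Node` class, the `dispatch` function, and the job and node names are all hypothetical, and "capacity" is assumed to mean a fixed number of concurrent job slots per node.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical worker node with a fixed number of concurrent job slots."""
    name: str
    capacity: int
    running: list = field(default_factory=list)

    def free_slots(self) -> int:
        return self.capacity - len(self.running)


def dispatch(queue: deque, nodes: list) -> dict:
    """Hand queued jobs to nodes that currently have spare capacity.

    Returns a mapping of node name to the jobs assigned in this round.
    Jobs that do not fit stay in the queue for a later round.
    """
    assigned = {node.name: [] for node in nodes}
    for node in nodes:
        while queue and node.free_slots() > 0:
            job = queue.popleft()
            node.running.append(job)
            assigned[node.name].append(job)
    return assigned


# Illustrative data only: five jobs, two nodes with three slots in total.
jobs = deque(f"job-{i}" for i in range(5))
cluster = [Node("node-a", capacity=2), Node("node-b", capacity=1)]
first_round = dispatch(jobs, cluster)
```

In this sketch the first round fills all three slots and leaves two jobs queued. Calling `dispatch` again after a node finishes its work (for example after `cluster[0].running.clear()`) hands the remaining jobs to whichever node has room, which is the pull-as-capacity-allows behavior the abstract describes.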

18 Claims

1. A computer-implemented method, comprising:
- receiving, by a computer system, one or more queries;
  
  identifying a first set of nodes of a system cluster associated with a plurality of nodes to process a first portion of the one or more queries, the first set of nodes being a first subset of the plurality of nodes;
  
  scheduling a plurality of jobs corresponding to the one or more queries;
  
  causing individual nodes of the first set of nodes to process the first portion of the one or more queries in parallel and in accordance with the scheduled plurality of jobs; and
  
  determining a second set of nodes of the system cluster associated with the plurality of nodes based at least in part on a retention policy associated with a plurality of distributed databases of the system cluster, the second set of nodes being a second subset of the plurality of nodes, individual nodes of the second set of nodes being a parent node to at least one child node of the first set of nodes in a hierarchical node structure, wherein at least one node is able to reside in the first set and the second set of nodes such that the at least one node may operate as a child node during processing of the first portion of the one or more queries and a parent node during processing of a second portion of the one or more queries, the first set of nodes and the second set of nodes comprising an instance on a database of the plurality of distributed databases, each layer of the hierarchical node structure being formed based at least in part on first results of the first portion of the one or more queries.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The computer-implemented method of claim 1, further comprising:
    - determining at least a third set of nodes of the system cluster associated with the plurality of nodes to process second results using the second portion of the one or more queries, individual nodes of the third set of nodes being a parent to at least one child node of the second set of nodes and operable to process the second results of the second portion of the one or more queries of the child node.
  - 3. The computer-implemented method of claim 1, further comprising:
    - receiving an intermediate query;
      
      determining at least a third set of nodes of the system cluster associated with the plurality of nodes to process the intermediate query, individual nodes of the third set of nodes being a parent to at least one child node of the first set of nodes and a child to one parent node in the second set of nodes; and
      
      causing individual nodes of the third set of nodes to process the intermediate query using the first results from the first portion of the one or more queries on the child node, wherein individual nodes of the second set of nodes are able to further process the second portion of the one or more queries using second results from the intermediate query on the child node.
  - 4. The computer-implemented method of claim 1, wherein the first portion of one or more queries includes an instance query and the second portion of the one or more queries includes a summary query.
  - 5. The computer-implemented method of claim 1, further comprising:
    - tracking dependencies between jobs such that jobs for the second set of nodes are not processed until one or more corresponding jobs for the first set of nodes are processed.
  - 6. The computer-implemented method of claim 1, further comprising:
    - passing the first results from individual nodes of the first set of nodes to a corresponding node of the second set of nodes using a file transfer service.
  - 7. The computer-implemented method of claim 1, further comprising:
    - enabling a user to upload a library to replicate to the child node for use in processing at least the first portion of the one or more queries.
  - 8. The computer-implemented method of claim 1, wherein the second set of nodes is determined dynamically based upon at least a current capacity of the parent node at a time for processing the second portion of the one or more queries.
  - 9. The computer-implemented method of claim 1, further comprising:
    - distributing a schema to each of the first set of nodes and the second set of nodes; and
      
      creating at least one table on each of the first set of nodes and the second set of nodes having a structure corresponding to the schema, each table being able to store a portion of data corresponding to the first portion of the one or more queries or the second portion of the one or more queries that are loaded on each node.
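
The dependency tracking recited in claim 5 can be illustrated with a topological ordering of jobs. The following is a minimal sketch under stated assumptions: the job names are hypothetical, jobs are released in whole "waves" for simplicity, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for whatever tracking mechanism an actual system would use. Note that `node-1` appears both as a child (running an instance job) and as a parent (running a summary job), mirroring the dual role described in claim 1.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each key maps to the set of jobs it must wait for.
dependencies = {
    "instance@node-1": set(),
    "instance@node-2": set(),
    "instance@node-3": set(),
    "summary@node-1": {"instance@node-1", "instance@node-2"},
    "summary@node-3": {"instance@node-3"},
}


def run_in_waves(deps):
    """Group jobs into waves; a job is released only after its prerequisites finish."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose prerequisites are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as processed
    return waves


waves = run_in_waves(dependencies)
```

Here the first wave contains the three instance jobs and the second wave contains the two summary jobs. Releasing jobs wave by wave is a simplification: a finer-grained scheduler could call `done` as each job completes, so that `summary@node-3` starts as soon as `instance@node-3` finishes.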

10. A system, comprising:
- a processor; and
  
  a memory device including instructions that, when executed with the processor, cause the system to, at least:
  
  receive one or more queries;
  
  identify a first set of nodes of a system cluster associated with a plurality of nodes, the first set of nodes being a first subset of the plurality of nodes;
  
  schedule a plurality of jobs corresponding to the one or more queries;
  
  cause individual nodes of the first set of nodes to process a first portion of the one or more queries in parallel and in accordance with the scheduled plurality of jobs;
  
  determine a second set of nodes of the system cluster associated with the plurality of nodes based at least in part on a retention policy associated with a plurality of distributed databases of the system cluster, the second set of nodes being a second subset of the plurality of nodes, individual nodes of the second set of nodes being a parent node to at least one child node of the first set of nodes in a hierarchical node structure, wherein at least one node is able to reside in the first set and the second set of nodes, such that the at least one node may operate as a child node during processing of the first portion of the one or more queries and a parent node during processing of a second portion of the one or more queries, the first set of nodes and the second set of nodes comprising an instance on a database of the plurality of distributed databases, each layer of the hierarchical node structure being formed based at least in part on first results from the first portion of the one or more queries;
  
  cause individual nodes of the second set of nodes to process the second portion of the one or more queries using the first results from the first portion of the one or more queries on individual respective child nodes; and
  
  store second results of the second portion of the one or more queries to a specified location.
- Dependent claims (11, 12, 13, 14)
  - 11. The system of claim 10, wherein the memory device further includes instructions that, when executed by the processor, cause the system to:
    - receive an intermediate query;
      
      determine a third set of nodes of the system cluster associated with the plurality of nodes to process the intermediate query, individual nodes of the third set of nodes being a parent to at least one child node of the first set of nodes and a child to one of the nodes in the second set of nodes; and
      
      cause individual nodes of the third set of nodes to process the intermediate query using the first results from the first portion of the one or more queries on the child node, wherein individual nodes of the second set of nodes are able to further process the second portion of the one or more queries using third results from the intermediate query on the child node.
  - 12. The system of claim 10, wherein the system cluster associated with the plurality of nodes communicates using Web services and is scalable to add additional nodes.
  - 13. The system of claim 11, wherein the memory device further includes instructions that, when executed by the processor, cause the system to:
    - track dependencies between jobs such that jobs for the second set of nodes are not processed until one or more corresponding jobs for the first set of nodes are processed.
  - 14. The system of claim 10, wherein the second set of nodes is determined dynamically based upon at least a current capacity of the parent node at a time for processing the second portion of the one or more queries.
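
The two-stage flow in claim 10 (child nodes process a first portion in parallel, parent nodes then process a second portion using the children's results) can be sketched as follows. Everything here is an illustrative assumption rather than the claimed system: the shard contents, the node names, the parent-to-child mapping, and the choice of a count/total aggregate as the "instance" and "summary" queries. A thread pool stands in for separate cluster nodes, and the returned dictionary stands in for the "specified location" where second results are stored.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node data shards held by the first set of nodes.
shards = {
    "child-1": [3, 5, 8],
    "child-2": [1, 9],
    "child-3": [4, 4, 6],
    "child-4": [7],
}
# Hypothetical hierarchy: each parent node and the children it collects from.
hierarchy = {
    "parent-a": ["child-1", "child-2"],
    "parent-b": ["child-3", "child-4"],
}


def instance_query(rows):
    """First portion: partial aggregate over one node's local rows."""
    return {"count": len(rows), "total": sum(rows)}


def summary_query(partials):
    """Second portion: combine the child results gathered by one parent."""
    return {
        "count": sum(p["count"] for p in partials),
        "total": sum(p["total"] for p in partials),
    }


def run(shards, hierarchy):
    with ThreadPoolExecutor() as pool:
        children = list(shards)
        # Stage 1: every child processes its shard in parallel.
        first = dict(zip(children, pool.map(instance_query, (shards[c] for c in children))))
        parents = list(hierarchy)
        # Stage 2: parents run only after all first results are available.
        inputs = ([first[c] for c in hierarchy[p]] for p in parents)
        second = dict(zip(parents, pool.map(summary_query, inputs)))
    return first, second


first_results, second_results = run(shards, hierarchy)
```

With the sample data, `parent-a` ends up with a count of 5 and a total of 26, and `parent-b` with a count of 4 and a total of 21. Aggregates that merge associatively, such as counts and sums, suit this layout because each parent needs only its children's partial results, not their raw rows.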

15. A computer program product embedded in a non-transitory computer-readable storage medium, comprising:
- program code for receiving one or more queries;
  
  program code for identifying a first set of nodes of a system cluster associated with a plurality of nodes, the first set of nodes being a first subset of the plurality of nodes;
  
  program code for scheduling a plurality of jobs corresponding to the one or more queries;
  
  program code for causing individual nodes of the first set of nodes to process a first portion of the one or more queries in parallel and in accordance with the scheduled plurality of jobs;
  
  program code for determining a second set of nodes of the system cluster associated with the plurality of nodes based at least in part on a retention policy associated with a plurality of distributed databases of the system cluster, the second set of nodes being a second subset of the plurality of nodes, individual nodes of the second set of nodes being a parent to at least one child node of the first set of nodes in a hierarchical node structure, wherein at least one node is able to reside in the first set and the second set of nodes such that the at least one node may operate as a child node during processing of the first portion of the one or more queries and a parent node during processing of a second portion of the one or more queries, the first set of nodes and the second set of nodes comprising an instance on a database of the plurality of distributed databases, each layer of the hierarchical node structure being formed based at least in part on first results of the first portion of the one or more queries;
  
  program code for causing individual nodes of the second set of nodes to process the second portion of the one or more queries using the first results from the first portion of the one or more queries on individual respective child nodes; and
  
  program code for storing second results of the second portion of the one or more queries to a specified location.
- Dependent claims (16, 17, 18)
  - 16. The computer program product of claim 15, further comprising:
    - program code for receiving an intermediate query;
      
      program code for determining a third set of nodes of the system cluster associated with the plurality of nodes to process the intermediate query, individual nodes of the third set of nodes being a parent to at least one child node of the first set of nodes and a child to one of the nodes in the second set of nodes; and
      
      program code for causing individual nodes of the third set of nodes to process the intermediate query using the first results from the first portion of the one or more queries on the child node, wherein individual nodes of the second set of nodes are able to further process the second portion of the one or more queries using third results from the intermediate query on the child node.
  - 17. The computer program product of claim 15, further comprising:
    - program code for determining at least a third set of nodes of the system cluster associated with the plurality of nodes to process the second results using the second portion of the one or more queries, individual nodes of the third set of nodes being a parent to at least one child node of the second set of nodes and operable to process the second results of the second portion of the one or more queries of the child node.
  - 18. The computer program product of claim 15, further comprising:
    - program code for tracking dependencies between jobs such that jobs for the second set of nodes are not processed until one or more corresponding jobs for the first set of nodes are processed.

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Bacthavachalu, Govindaswamy; Gavares, Peter Grant; Badran, Ahmed A.; Scharf, Jr., James E.
Primary Examiner(s): Herndon, Heather
Assistant Examiner(s): Nguyen, Merilyn

Application Number: US14/452,146
Time in Patent Office: 1,407 Days

CPC Class Codes

G06F 16/24532   of parallel queries

G06F 16/2471   Distributed queries

G06F 16/8373   Query execution

G06F 9/505   considering the load
