Dynamically modifying a cluster of computing nodes used for distributed execution of a program
Abstract
Techniques are described for managing distributed execution of programs. In some situations, the techniques include dynamically modifying the distributed program execution in various manners, such as based on monitored status information. The dynamic modifying of the distributed program execution may include adding and/or removing computing nodes from a cluster that is executing the program, modifying the amount of computing resources that are available for the distributed program execution, terminating or temporarily suspending execution of the program (e.g., if an insufficient quantity of computing nodes of the cluster are available to perform execution), etc.
26 Claims
1. A computer-implemented method comprising:
receiving, by a computing system configured to provide a distributed program execution service having a plurality of computing nodes, configuration information indicating a quantity of computing nodes on which to execute an indicated program, wherein executing the indicated program causes a plurality of execution jobs to be executed;
selecting, by the configured computing system, the indicated quantity of computing nodes to use as part of a cluster in executing the indicated program in a distributed parallel manner;
initiating the executing of the indicated program on the computing nodes of the cluster at a first time by, for each of the multiple computing nodes of the cluster, attempting to initiate execution on the computing node of at least one of the execution jobs;
at a second time subsequent to the first time, determining whether a minimum quantity of the computing nodes of the cluster have begun to execute the execution jobs, the minimum quantity being less than the indicated quantity; and
if it is determined at the second time that the minimum quantity of the computing nodes of the cluster have not begun to execute the execution jobs, initiating termination of the executing of the indicated program on the computing nodes of the cluster without completing the executing of the indicated program, and otherwise continuing the executing of the indicated program until the executing of the indicated program is completed.
Dependent claims: 2, 3
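The second-time check recited in claim 1 can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Node` class, the `started` flag, and the function `decide_at_second_time` are hypothetical names chosen for illustration, and the sketch assumes each node simply reports whether it has begun executing at least one job.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for one computing node of the cluster."""
    node_id: str
    started: bool = False  # True once the node has begun executing a job


def decide_at_second_time(nodes: List[Node], minimum_quantity: int) -> str:
    """Return "terminate" or "continue" based on how many nodes have started.

    Mirrors the claim's condition: the minimum quantity must be less than
    the indicated (requested) quantity of nodes in the cluster.
    """
    if not 0 < minimum_quantity < len(nodes):
        raise ValueError("minimum quantity must be positive and below the cluster size")
    started_count = sum(1 for node in nodes if node.started)
    # Fewer than the minimum have begun executing: stop without completing.
    if started_count < minimum_quantity:
        return "terminate"
    # Otherwise let the program run to completion.
    return "continue"


# Example: four nodes were requested, two have begun executing jobs.
cluster = [Node("n0", True), Node("n1", True), Node("n2"), Node("n3")]
decision_strict = decide_at_second_time(cluster, minimum_quantity=3)
decision_lenient = decide_at_second_time(cluster, minimum_quantity=2)
```

With a minimum of three the sketch chooses termination, while a minimum of two lets execution continue. How a real service would detect that a node has "begun to execute" is left open here.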
4. A computer-implemented method comprising:
receiving, by one or more computing systems configured to provide a distributed program execution service having a plurality of computing nodes, configuration information regarding executing an indicated program on an indicated quantity of multiple of the plurality of computing nodes, wherein the executing of the indicated program causes a plurality of jobs to be executed;
initiating at a first time, by the one or more configured computing systems, the executing of the indicated program in a distributed manner on the multiple computing nodes in such a manner that one or more of the jobs of the indicated program are attempted to be executed on each of the multiple computing nodes;
determining, by the one or more configured computing systems at a second time subsequent to the first time, whether a minimum subset of the multiple computing nodes have begun to execute the jobs of the indicated program as expected; and
in response to the determining, initiating a change in a quantity of the multiple computing nodes that are used for executing the jobs of the indicated program.
Dependent claims: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
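Claim 4 does not say which change in node quantity is made, only that one is initiated in response to the determination. The following Python sketch shows one possible policy under stated assumptions: nodes that have not begun executing are released, and enough replacement nodes are requested to reach the minimum subset. The names `NodeChange` and `plan_node_change` are hypothetical and the policy itself is an illustration, not the patented method.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class NodeChange:
    """Hypothetical description of a cluster resize."""
    remove: List[str]  # ids of nodes to drop from the cluster
    add_count: int     # number of replacement nodes to request


def plan_node_change(all_ids: List[str], started_ids: Set[str],
                     minimum_subset: int) -> NodeChange:
    """Plan a resize from the nodes that have (or have not) begun executing jobs.

    Assumed policy: release every node that has not started, and request
    just enough new nodes to bring the started count up to the minimum.
    """
    stalled = [node_id for node_id in all_ids if node_id not in started_ids]
    shortfall = max(0, minimum_subset - len(started_ids))
    return NodeChange(remove=stalled, add_count=shortfall)


# Example: four nodes requested, only one has begun executing, minimum is three.
change = plan_node_change(["a", "b", "c", "d"], {"a"}, minimum_subset=3)
```

In this example the plan releases nodes `b`, `c`, and `d` and requests two replacements. Other policies, such as shrinking the cluster without replacement, fit the claim language equally well.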
15. A non-transitory computer-readable medium whose contents configure a computing system to perform a method comprising:
attempting to execute one or more of a plurality of jobs of a distributed program across a cluster of multiple computing nodes;
determining, by the configured computing system, that an actual quantity of computing resources that has begun to be used to execute the distributed program differs from a specified minimum quantity of computing resources that are expected to be so used; and
initiating a change in a quantity of the multiple computing nodes of the cluster that are executing the distributed program based at least in part on the determined actual quantity of computing resources.
Dependent claims: 16, 17, 18, 19, 20
21. A computing system configured to dynamically modify distributed execution of programs, comprising:
one or more processors; and
one or more components of a distributed execution service that are configured to, when executed by at least one of the one or more processors, dynamically modify distributed execution of programs for users by, for each of multiple of the users:
receiving information from the user regarding executing an indicated program in a distributed manner on a cluster of multiple computing nodes, the executing of the indicated program in the distributed manner including executing a plurality of jobs of the indicated program;
initiating the executing of the indicated program in the distributed manner on the multiple computing nodes of the cluster at a first time in such a manner that one or more of the jobs of the indicated program are attempted to be executed on each of the multiple computing nodes;
at a second time subsequent to the first time, determining that an actual quantity of the multiple computing nodes that have begun executing the jobs of the indicated program differs from a specified minimum quantity of the multiple computing nodes; and
initiating a change in the multiple computing nodes of the cluster based at least in part on the determined actual quantity of the multiple computing nodes.
Dependent claims: 22, 23, 24, 25, 26
Specification