SAVING PROGRAM EXECUTION STATE
First Claim
1. A method for a computing system of a distributed program execution service to manage distributed execution of programs, the method comprising:
- under control of the computing system of the distributed program execution service, the distributed program execution service providing a plurality of computing nodes that are configurable to execute programs of a plurality of users, receiving multiple requests to execute indicated programs using indicated input data, each of the requests being from one of the plurality of users and including indications of the program and the input data to be used for the request, and automatically responding to each request by;
automatically selecting multiple of the plurality of computing nodes for use in executing multiple execution jobs of the indicated program for the request in a distributed parallel manner, each of the multiple execution jobs having multiple operations to be performed using a subset of the indicated input data that is selected for the execution job;
for each of the multiple computing nodes, initiating execution on the computing node of one of the multiple execution jobs using the selected subset of input data for the one execution job, the initiating of the execution including configuring a portion of a distributed file system on the computing node for use in locally storing intermediate output data that is generated by completed performance of one or more of the multiple operations for the one execution job;
before the execution of at least some of the multiple execution jobs on at least some of the multiple computing nodes is completed,automatically monitoring a state of partial execution of each of the at least some execution jobs, the monitoring of each of the at least some execution jobs including identifying intermediate output data that is locally stored on the distributed file system portion for the computing node on which the execution job executes, the intermediate output data being generated by one or more operations of the execution job whose performance is completed; and
determining to terminate execution at a first time of one or more of the at least some execution jobs, and in response to the determining and for each of the one or more execution jobs, automatically initiating remote persistent storage of the intermediate output data that is stored on the distributed file system portion for the computing node on which the execution job executes;
at a later second time after the first time, for each of the one or more execution jobs, initiating a resumed execution of the execution job on a selected computing node by initiating performance of the operations of the execution job that were not completed at the first time, the resumed execution including retrieving the persistently stored intermediate output data that was stored at the first time for the execution job and initiating storage at the second time of the retrieved output data on a portion of the distributed file system on the selected computing node; and
after the execution of the multiple execution jobs of the indicated program is completed, providing final results from the execution to the one user.
1 Assignment
0 Petitions
Accused Products
Abstract
Techniques are described for managing distributed execution of programs. In at least some situations, the techniques include decomposing or otherwise separating the execution of a program into multiple distinct execution jobs that may each be executed on a distinct computing node, such as in a parallel manner with each execution job using a distinct subset of input data for the program. In addition, the techniques may include temporarily terminating and later resuming execution of at least some execution jobs, such as by persistently storing an intermediate state of the partial execution of an execution job, and later retrieving and using the stored intermediate state to resume execution of the execution job from the intermediate state. Furthermore, the techniques may be used in conjunction with a distributed program execution service that executes multiple programs on behalf of multiple customers or other users of the service.
98 Citations
32 Claims
-
1. A method for a computing system of a distributed program execution service to manage distributed execution of programs, the method comprising:
under control of the computing system of the distributed program execution service, the distributed program execution service providing a plurality of computing nodes that are configurable to execute programs of a plurality of users, receiving multiple requests to execute indicated programs using indicated input data, each of the requests being from one of the plurality of users and including indications of the program and the input data to be used for the request, and automatically responding to each request by; automatically selecting multiple of the plurality of computing nodes for use in executing multiple execution jobs of the indicated program for the request in a distributed parallel manner, each of the multiple execution jobs having multiple operations to be performed using a subset of the indicated input data that is selected for the execution job; for each of the multiple computing nodes, initiating execution on the computing node of one of the multiple execution jobs using the selected subset of input data for the one execution job, the initiating of the execution including configuring a portion of a distributed file system on the computing node for use in locally storing intermediate output data that is generated by completed performance of one or more of the multiple operations for the one execution job; before the execution of at least some of the multiple execution jobs on at least some of the multiple computing nodes is completed, automatically monitoring a state of partial execution of each of the at least some execution jobs, the monitoring of each of the at least some execution jobs including identifying intermediate output data that is locally stored on the distributed file system portion for the computing node on which the execution job executes, the intermediate output data being generated by one or more operations of the execution job whose performance is completed; and determining to terminate execution at a first time of one or more of the at least some execution jobs, and in response to the determining and for each of the one or more execution jobs, automatically initiating remote persistent storage of the intermediate output data that is stored on the distributed file system portion for the computing node on which the execution job executes; at a later second time after the first time, for each of the one or more execution jobs, initiating a resumed execution of the execution job on a selected computing node by initiating performance of the operations of the execution job that were not completed at the first time, the resumed execution including retrieving the persistently stored intermediate output data that was stored at the first time for the execution job and initiating storage at the second time of the retrieved output data on a portion of the distributed file system on the selected computing node; and after the execution of the multiple execution jobs of the indicated program is completed, providing final results from the execution to the one user. - View Dependent Claims (2, 3)
-
4. A computer-implemented method for managing distributed execution of programs, the method comprising:
under control of one or more computing systems that provide a distributed program execution service that manages distributed execution of programs for users, the distributed program execution service providing a plurality of computing nodes that are configurable to execute the programs for the users, after execution of multiple execution jobs of an indicated program is initiated on multiple of the plurality of computing nodes, the execution of the indicated program being performed on behalf of a first user and using indicated input data in such a manner that the multiple execution jobs each have one or more operations to be performed using at least some of the indicated input data, automatically tracking information about a state of the execution of the multiple execution jobs on the multiple computing nodes, the tracking including identifying intermediate results that are produced from a subset of the operations of the multiple execution jobs whose performance is complete and that are stored on the multiple computing nodes; after determining to terminate execution at a first time of at least one of the multiple execution jobs, the at least one execution jobs having at least one operation that is in the subset of operations whose performance is complete and having at least one other operation that is not in the subset and whose performance is not complete, automatically identifying the at least one operations by using the tracked information, and initiating persistent storage of the identified intermediate results produced from the at least one operations; at a later second time after the first time, initiating a resumed execution of the at least one execution jobs on at least one computing node so as to complete the performance of the at least one other operations that are not in the subset whose performance is complete and so as to not repeat the completed performance of the at least one operations in the subset, the resumed execution being performed in a manner based at least in part on the persistently stored intermediate results; and after the execution of the multiple execution jobs of the indicated program is completed, providing final results from the execution to the first user. - View Dependent Claims (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
-
24. A computer-readable medium whose contents enable a computing system to manage distributed execution of programs, by performing a method comprising:
-
tracking information about a state of distributed execution of an indicated program on multiple computing nodes, the indicated program having multiple execution jobs executing on the multiple computing nodes, the execution jobs each having one or more operations to be performed, the tracked state information including information about intermediate results produced from operations of the multiple execution jobs whose performance is complete; after the execution of at least one of the execution jobs is terminated before performance of all of the operations of the at least one execution jobs is completed, the at least one execution jobs having at least one operation whose performance is complete, continuing the distributed execution of the indicated program by automatically initiating resumed performance of the operations of the at least one execution jobs other than the at least one operations; and after the distributed execution of the indicated program is completed, providing an indication of final results from the distributed execution. - View Dependent Claims (25, 26, 27)
-
-
28. A computing system configured to manage distributed execution of programs, comprising:
-
one or more memories; and a system manager component that is configured to manage distributed execution for users of a distributed execution service by, for each of multiple of the users; receiving an indication from the user to perform distributed execution of multiple related execution jobs; initiating execution of the multiple execution jobs on multiple computing nodes; after a partial execution of at least one of the multiple execution jobs is performed but before the execution of the at least one execution jobs is completed, determining to terminate the execution of the at least one execution jobs, and automatically initiating persistent storage of an intermediate state of the partial execution of the at least one execution jobs; at a later time, retrieving the persistently stored intermediate state of the partial execution of the at least one execution jobs, and resuming the execution of the at least one execution jobs based at least in part on the retrieved persistently stored intermediate state; and after the execution of the multiple execution jobs is completed, providing final results from the execution to the user. - View Dependent Claims (29, 30, 31, 32)
-
Specification