SAVING PROGRAM EXECUTION STATE

  • US 20100153955A1
  • Filed: 12/12/2008
  • Published: 06/17/2010
  • Est. Priority Date: 12/12/2008
  • Status: Active Grant
First Claim

1. A method for a computing system of a distributed program execution service to manage distributed execution of programs, the method comprising:

  • under control of the computing system of the distributed program execution service, the distributed program execution service providing a plurality of computing nodes that are configurable to execute programs of a plurality of users, receiving multiple requests to execute indicated programs using indicated input data, each of the requests being from one of the plurality of users and including indications of the program and the input data to be used for the request, and automatically responding to each request by:

    automatically selecting multiple of the plurality of computing nodes for use in executing multiple execution jobs of the indicated program for the request in a distributed parallel manner, each of the multiple execution jobs having multiple operations to be performed using a subset of the indicated input data that is selected for the execution job;

    for each of the multiple computing nodes, initiating execution on the computing node of one of the multiple execution jobs using the selected subset of input data for the one execution job, the initiating of the execution including configuring a portion of a distributed file system on the computing node for use in locally storing intermediate output data that is generated by completed performance of one or more of the multiple operations for the one execution job;

    before the execution of at least some of the multiple execution jobs on at least some of the multiple computing nodes is completed, automatically monitoring a state of partial execution of each of the at least some execution jobs, the monitoring of each of the at least some execution jobs including identifying intermediate output data that is locally stored on the distributed file system portion for the computing node on which the execution job executes, the intermediate output data being generated by one or more operations of the execution job whose performance is completed; and

    determining to terminate execution at a first time of one or more of the at least some execution jobs, and in response to the determining and for each of the one or more execution jobs, automatically initiating remote persistent storage of the intermediate output data that is stored on the distributed file system portion for the computing node on which the execution job executes;

    at a later second time after the first time, for each of the one or more execution jobs, initiating a resumed execution of the execution job on a selected computing node by initiating performance of the operations of the execution job that were not completed at the first time, the resumed execution including retrieving the persistently stored intermediate output data that was stored at the first time for the execution job and initiating storage at the second time of the retrieved output data on a portion of the distributed file system on the selected computing node; and

    after the execution of the multiple execution jobs of the indicated program is completed, providing final results from the execution to the one user.
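
The save-and-resume flow recited in the claim can be illustrated with a minimal, single-process sketch. It is not the patented implementation: the names `ExecutionJob`, `PERSISTENT_STORE`, `suspend`, and `resume` are hypothetical, an in-memory dictionary stands in for remote persistent storage, and another dictionary stands in for the node-local portion of the distributed file system. The sketch only shows the idea that output of completed operations is saved when a job is terminated, then restored so that only the uncompleted operations are performed on resumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for remote persistent storage, keyed by job id.
PERSISTENT_STORE: Dict[str, Dict[int, int]] = {}


@dataclass
class ExecutionJob:
    job_id: str
    input_subset: List[int]
    operations: List[Callable[[List[int]], int]]
    # Stand-in for the node-local portion of the distributed file system:
    # maps operation index -> intermediate output of that completed operation.
    local_output: Dict[int, int] = field(default_factory=dict)

    def run(self, max_operations=None):
        """Perform operations not yet completed, at most max_operations of them."""
        performed = 0
        for index, operation in enumerate(self.operations):
            if index in self.local_output:
                continue  # completed earlier, possibly before a suspension
            if max_operations is not None and performed >= max_operations:
                break
            self.local_output[index] = operation(self.input_subset)
            performed += 1

    def is_complete(self):
        return len(self.local_output) == len(self.operations)


def suspend(job):
    """At the first time: copy intermediate output to persistent storage."""
    PERSISTENT_STORE[job.job_id] = dict(job.local_output)
    job.local_output.clear()  # the original node's local storage is released


def resume(job_id, input_subset, operations):
    """At the second time: restore saved output on a new node, then finish."""
    restored = dict(PERSISTENT_STORE[job_id])
    job = ExecutionJob(job_id, input_subset, operations, local_output=restored)
    job.run()  # only operations without restored output are performed
    return job


# Usage: one of three operations completes before the job is terminated.
operations = [sum, max, min]
job = ExecutionJob("job-1", [3, 1, 4], operations)
job.run(max_operations=1)
suspend(job)
resumed = resume("job-1", [3, 1, 4], operations)
```

Keying the saved output by operation index is one simple way to let the resumed job tell completed operations from uncompleted ones; the claim itself does not prescribe a particular representation.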
