Many task computing with distributed file system
First Claim
1. An apparatus comprising a processor and a storage to store instructions that, when executed by the processor, cause the processor to perform operations comprising:
receive, at the processor and from a remote device, a request to perform a job flow using a flow input data set as an input to the job flow performance, wherein:
the job flow is defined in a job flow definition that specifies a set of tasks to be performed via execution of a corresponding set of task routines during the job flow performance;
at least one result report is to be generated as an output of the job flow performance;
the job flow definition and each task routine of the set of task routines is stored as an undivided object within one storage device of a set of storage devices;
the flow input data set is either stored as an undivided object within one storage device of the set of storage devices, or stored as a set of data object blocks into which the flow input data set is divided and distributed among the set of storage devices;
each storage device of the set of storage devices incorporates a processor;
the processors of the set of storage devices cooperate to maintain a distributed file system that spans storage spaces provided by each storage device of the set of storage devices;
as part of maintaining the distributed file system, at least one processor of at least one storage device of the set of storage devices determines whether a data object received by the set of storage devices is to be stored as an undivided object or stored as a set of data object blocks into which the received data object is divided and distributed among the set of storage devices based on a size of the received data object compared to a distribution block size; and
the flow input data set is stored as a set of data object blocks of the flow input data set by the set of storage devices in response to the flow input data set having a size larger than the distribution block size;
retrieve the job flow definition and each task routine of the set of task routines from the set of storage devices;
determine whether the flow input data set is stored as an undivided object or as a set of data object blocks based on the size of the flow input data set; and
in response to a determination that the flow input data set is stored as a set of data object blocks, the processor is caused to perform operations comprising:
generate a container that contains the job flow definition and the set of task routines to enable the processor incorporated into each storage device to independently perform an instance of the job flow using one of the data object blocks of the flow input data set stored locally within the storage device as an input to the instance, wherein the performance of an instance of the job flow within each storage device generates a corresponding data object block of a set of data object blocks of the result report;
provide a copy of the container to each storage device of the set of storage devices to enable the processors incorporated into at least two storage devices of the set of storage devices to perform instances of the job flow at least partially in parallel;
retrieve, from each storage device of the set of storage devices, at least one data object block of the set of data object blocks of the result report;
assemble the result report from the set of data object blocks of the result report; and
transmit the result report to the remote device.
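The storage decision recited in the claim (compare the size of a received data object to a distribution block size, then either keep it undivided or divide and distribute it) can be illustrated with a minimal sketch. The names `store_object` and `is_divided`, the round-robin placement of blocks, and the choice of the first device for undivided objects are all assumptions made for illustration; the claim does not specify a placement policy.

```python
from typing import List


def is_divided(size: int, block_size: int) -> bool:
    """An object is divided only when it is larger than the distribution block size."""
    return size > block_size


def store_object(data: bytes, num_devices: int, block_size: int) -> List[List[bytes]]:
    """Return, for each storage device, the list of pieces it holds.

    Assumptions (not from the claim): undivided objects go to device 0,
    and blocks are placed round-robin across devices.
    """
    devices: List[List[bytes]] = [[] for _ in range(num_devices)]
    if not is_divided(len(data), block_size):
        devices[0].append(data)  # stored as an undivided object on one device
        return devices
    # Divide into blocks of at most block_size bytes and distribute them.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        devices[index % num_devices].append(block)
    return devices
```

Because the same size comparison governs storage, a requesting processor can infer how a data set was stored from its size alone, which is the determination the claim recites.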
Abstract
An apparatus includes a processor to: receive a request from a remote device to perform a job flow; retrieve a job flow definition defining the job flow and each of a set of task routines to perform tasks of the job flow from a set of storage devices where each is stored as an undivided object within one storage device; and in response to determining that a data set is stored as multiple data object blocks, generate a container containing the job flow definition and set of task routines to enable each storage device to perform the job flow using a locally stored data object block of the data set as input to generate a corresponding data object block of a result report, provide a copy of the container to each storage device, and transmit the result report assembled from the data object blocks thereof to the remote device.
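The container mechanism summarized in the abstract can be sketched as follows. This is a hedged illustration, not the patented implementation: the `Container` class, the representation of a job flow definition as an ordered list of task names, and the use of a thread pool to stand in for the processors of separate storage devices are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Container:
    """Bundles a job flow definition with the task routines it names (hypothetical layout)."""
    job_flow_definition: Sequence[str]              # ordered task names
    task_routines: Dict[str, Callable[[str], str]]  # task name -> routine

    def run(self, block: str) -> str:
        """Perform one instance of the job flow on a single locally stored block."""
        out = block
        for task in self.job_flow_definition:
            out = self.task_routines[task](out)
        return out


def perform_distributed(container: Container, local_blocks: List[str]) -> str:
    """Run one instance per block in parallel, then assemble the result report.

    Each worker thread stands in for the processor of one storage device.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(local_blocks))) as pool:
        # Executor.map returns results in input order, so block order is preserved.
        report_blocks = list(pool.map(container.run, local_blocks))
    return "".join(report_blocks)
```

For example, a container whose definition is `["strip", "upper"]` with routines `str.strip` and `str.upper` turns the blocks `[" ab ", "cd "]` into the assembled report `"ABCD"`.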
30 Claims
11. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause a processor to perform operations comprising:
receive, at the processor and from a remote device, a request to perform a job flow using a flow input data set as an input to the job flow performance, wherein:
the job flow is defined in a job flow definition that specifies a set of tasks to be performed via execution of a corresponding set of task routines during the job flow performance;
at least one result report is to be generated as an output of the job flow performance;
the job flow definition and each task routine of the set of task routines is stored as an undivided object within one storage device of a set of storage devices;
the flow input data set is either stored as an undivided object within one storage device of the set of storage devices, or stored as a set of data object blocks into which the flow input data set is divided and distributed among the set of storage devices;
each storage device of the set of storage devices incorporates a processor;
the processors of the set of storage devices cooperate to maintain a distributed file system that spans storage spaces provided by each storage device of the set of storage devices;
as part of maintaining the distributed file system, at least one processor of at least one storage device of the set of storage devices determines whether a data object received by the set of storage devices is to be stored as an undivided object or stored as a set of data object blocks into which the received data object is divided and distributed among the set of storage devices based on a size of the received data object compared to a distribution block size; and
the flow input data set is stored as a set of data object blocks of the flow input data set by the set of storage devices in response to the flow input data set having a size larger than the distribution block size;
retrieve the job flow definition and each task routine of the set of task routines from the set of storage devices;
determine whether the flow input data set is stored as an undivided object or as a set of data object blocks based on the size of the flow input data set; and
in response to a determination that the flow input data set is stored as a set of data object blocks, the processor is caused to perform operations comprising:
generate a container that contains the job flow definition and the set of task routines to enable the processor incorporated into each storage device to independently perform an instance of the job flow using one of the data object blocks of the flow input data set stored locally within the storage device as an input to the instance, wherein the performance of an instance of the job flow within each storage device generates a corresponding data object block of a set of data object blocks of the result report;
provide a copy of the container to each storage device of the set of storage devices to enable the processors incorporated into at least two storage devices of the set of storage devices to perform instances of the job flow at least partially in parallel;
retrieve, from each storage device of the set of storage devices, at least one data object block of the set of data object blocks of the result report;
assemble the result report from the set of data object blocks of the result report; and
transmit the result report to the remote device.
21. A computer-implemented method comprising:
receiving, by a processor, and from a remote device, a request to perform a job flow using a flow input data set as an input to the job flow performance, wherein:
the job flow is defined in a job flow definition that specifies a set of tasks to be performed via execution of a corresponding set of task routines during the job flow performance;
at least one result report is to be generated as an output of the job flow performance;
the job flow definition and each task routine of the set of task routines is stored as an undivided object within one storage device of a set of storage devices;
the flow input data set is either stored as an undivided object within one storage device of the set of storage devices, or stored as a set of data object blocks into which the flow input data set is divided and distributed among the set of storage devices;
each storage device of the set of storage devices incorporates a processor;
the processors of the set of storage devices cooperate to maintain a distributed file system that spans storage spaces provided by each storage device of the set of storage devices;
as part of maintaining the distributed file system, at least one processor of at least one storage device of the set of storage devices determines whether a data object received by the set of storage devices is to be stored as an undivided object or stored as a set of data object blocks into which the received data object is divided and distributed among the set of storage devices based on a size of the received data object compared to a distribution block size; and
the flow input data set is stored as a set of data object blocks of the flow input data set by the set of storage devices in response to the flow input data set having a size larger than the distribution block size;
retrieving the job flow definition and each task routine of the set of task routines from the set of storage devices;
determining, by the processor, whether the flow input data set is stored as an undivided object or as a set of data object blocks based on the size of the flow input data set; and
in response to a determination that the flow input data set is stored as a set of data object blocks, performing operations comprising:
generating, by the processor, a container that contains the job flow definition and the set of task routines to enable the processor incorporated into each storage device to independently perform an instance of the job flow using one of the data object blocks of the flow input data set stored locally within the storage device as an input to the instance, wherein the performance of an instance of the job flow within each storage device generates a corresponding data object block of a set of data object blocks of the result report;
providing a copy of the container to each storage device of the set of storage devices to enable the processors incorporated into at least two storage devices of the set of storage devices to perform instances of the job flow at least partially in parallel;
retrieving, from each storage device of the set of storage devices, at least one data object block of the set of data object blocks of the result report;
assembling, by the processor, the result report from the set of data object blocks of the result report; and
transmitting, from the processor, the result report to the remote device; or
in response to a determination that the flow input data set is stored as an undivided object within one storage device of the set of storage devices, performing operations comprising:
retrieving the flow input data set from the set of storage devices;
performing, by the processor, the job flow using the flow input data set as an input to generate the result report; and
transmitting, from the processor, the result report to the remote device.
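Claim 21 recites two alternative branches selected by the size-based determination. A minimal sketch of that dispatch, assuming a hypothetical `perform_job_flow` function in which the job flow is modeled as a single callable and the stored pieces are passed in directly rather than fetched from real storage devices:

```python
from typing import Callable, List


def perform_job_flow(
    job_flow: Callable[[bytes], bytes],
    stored_pieces: List[bytes],
    input_size: int,
    block_size: int,
) -> bytes:
    """Choose between the distributed branch and the undivided branch.

    The branch is chosen from the data set's size alone, mirroring the
    claim's determination "based on the size of the flow input data set".
    """
    if input_size > block_size:
        # Divided: one job flow instance per locally stored block
        # (run sequentially here; the claim allows parallel execution),
        # then the result report is assembled from the per-block outputs.
        report_blocks = [job_flow(block) for block in stored_pieces]
        return b"".join(report_blocks)
    # Undivided: retrieve the single object and run the job flow once.
    (whole,) = stored_pieces
    return job_flow(whole)
```

With `bytes.upper` as a stand-in job flow, the blocks `[b"ab", b"cd", b"e"]` of a 5-byte data set and a block size of 2 take the distributed branch, while a 3-byte data set with a block size of 4 is processed as one undivided object.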
Specification