Generating data streams from pre-existing data sets
First Claim
1. A system for processing data items within a data source via an on-demand code execution environment, the system comprising:
- a non-transitory data store configured to implement a backlog cache indicating data items, from the data source, that have been identified for processing at the on-demand code execution environment as backlog items;
one or more processors, in communication with the non-transitory data store, configured to:
retrieve, for a set of data items within the data source, time data indicating points in time at which individual data items from the set of data items were created or modified within the data source;
determine, from the time data, an estimated modification frequency for the data source, the estimated modification frequency indicating an estimated frequency at which data items within the data source are created or modified;
obtain a threshold period of time;
utilize the estimated modification frequency for the data source, the time data, and an anticipated rate of processing of data items at the on-demand code execution system to establish a demarcation time for the data source that is expected to result in a completion, within the threshold period of time, of processing of data items created or modified in the data source after the demarcation time, wherein data items created or modified in the data source prior to the demarcation time are considered backlogged data items, and wherein the set of data items includes at least one data item created or modified in the data source after the demarcation time;
enqueue within the backlog cache a first set of data items, from the data store, that were created or modified in the data source prior to the demarcation time;
iteratively submit data stream calls to the on-demand code execution environment, the data stream calls requesting that the on-demand code execution environment process, by execution of a task, data items from the data source that were created or modified after the demarcation time; and
while data stream calls are submitted to the on-demand code execution environment, submit backlog calls to the on-demand code execution environment, the backlog calls requesting that the on-demand code execution environment process, by execution of the task, data items from the backlog cache.
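The claim does not give a formula for the demarcation time. A minimal sketch of one possible reading follows, assuming that new items arrive at the estimated modification frequency while the environment works through the items after the demarcation time, so the stream catches up within the threshold only if the number of those items is at most (processing rate - modification frequency) x threshold. The function names, the averaging method, and the catch-up model are illustrative assumptions, not taken from the patent.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def estimate_modification_frequency(timestamps: Sequence[datetime]) -> float:
    """Average number of items created or modified per second (assumed estimator)."""
    if len(timestamps) < 2:
        return 0.0
    ordered = sorted(timestamps)
    span = (ordered[-1] - ordered[0]).total_seconds()
    return (len(ordered) - 1) / span if span > 0 else 0.0


def establish_demarcation_time(
    timestamps: Sequence[datetime],
    processing_rate: float,  # anticipated items processed per second
    threshold: timedelta,    # period within which the stream should catch up
) -> Optional[datetime]:
    """Return a demarcation time, or None if the stream can never catch up.

    Items with a timestamp after the returned value are treated as stream
    items; all others are treated as backlog (assumed convention).
    """
    frequency = estimate_modification_frequency(timestamps)
    net_rate = processing_rate - frequency
    if net_rate <= 0:
        return None  # items arrive at least as fast as they are processed
    max_stream_items = int(net_rate * threshold.total_seconds())
    if max_stream_items < 1:
        return None  # threshold too short to stream even one item
    ordered = sorted(timestamps)
    if max_stream_items >= len(ordered):
        # Every known item fits in the stream; place the boundary before all of them.
        return ordered[0] - timedelta(microseconds=1)
    # At most max_stream_items timestamps are strictly later than this one.
    return ordered[-(max_stream_items + 1)]
```

With ten items spaced one second apart, a processing rate of 3 items per second, and a 2 second threshold, this sketch estimates a frequency of 1 item per second and leaves the four most recent items after the demarcation time.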
Abstract
Systems and methods are described for transforming a data set within a data source into a series of task calls to an on-demand code execution environment or other distributed code execution environment. Such environments utilize pre-initialized virtual machine instances to enable execution of user-specified code in a rapid manner, without delays typically caused by initialization of the virtual machine instances, and are often used to process data in near-real time, as it is created. However, limitations in computing resources may inhibit a user from utilizing an on-demand code execution environment to simultaneously process a large, existing data set. The present application provides a task generation system that can iteratively retrieve data items from an existing data set and generate corresponding task calls to the on-demand computing environment, while ensuring that at least one task call for each data item within the existing data set is made.
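The iterative retrieval described in the abstract can be illustrated with a short sketch. It assumes a hypothetical `fetch_batch(offset, limit)` callable that pages through a data set that does not change during traversal, and a hypothetical `submit_call(item_id)` callable that returns `True` when the execution environment accepts a task call. Neither name comes from the patent, and the retry policy is an assumption chosen to show the "at least one task call per data item" guarantee.

```python
from typing import Callable, Hashable, List, Set


def generate_task_calls(
    fetch_batch: Callable[[int, int], List[Hashable]],  # hypothetical pager
    submit_call: Callable[[Hashable], bool],            # hypothetical submitter
    batch_size: int = 100,
    max_attempts: int = 3,
) -> Set[Hashable]:
    """Walk the data set batch by batch and submit a task call for every item.

    An item is recorded only after a call for it has been accepted, so the
    returned set contains exactly the items that received a successful call.
    """
    called: Set[Hashable] = set()
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break  # the whole data set has been traversed
        for item_id in batch:
            for _ in range(max_attempts):
                if submit_call(item_id):
                    called.add(item_id)
                    break
            else:
                # Fail loudly instead of silently skipping an item.
                raise RuntimeError(f"no task call accepted for item {item_id!r}")
        offset += len(batch)
    return called
```

Retrying before advancing means an item may be submitted more than once, which is consistent with an at-least-once guarantee but requires the task itself to tolerate duplicate calls.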
21 Claims
1. A system for processing data items within a data source via an on-demand code execution environment, the system comprising:
a non-transitory data store configured to implement a backlog cache indicating data items, from the data source, that have been identified for processing at the on-demand code execution environment as backlog items;
one or more processors, in communication with the non-transitory data store, configured to:
retrieve, for a set of data items within the data source, time data indicating points in time at which individual data items from the set of data items were created or modified within the data source;
determine, from the time data, an estimated modification frequency for the data source, the estimated modification frequency indicating an estimated frequency at which data items within the data source are created or modified;
obtain a threshold period of time;
utilize the estimated modification frequency for the data source, the time data, and an anticipated rate of processing of data items at the on-demand code execution system to establish a demarcation time for the data source that is expected to result in a completion, within the threshold period of time, of processing of data items created or modified in the data source after the demarcation time, wherein data items created or modified in the data source prior to the demarcation time are considered backlogged data items, and wherein the set of data items includes at least one data item created or modified in the data source after the demarcation time;
enqueue within the backlog cache a first set of data items, from the data store, that were created or modified in the data source prior to the demarcation time;
iteratively submit data stream calls to the on-demand code execution environment, the data stream calls requesting that the on-demand code execution environment process, by execution of a task, data items from the data source that were created or modified after the demarcation time; and
while data stream calls are submitted to the on-demand code execution environment, submit backlog calls to the on-demand code execution environment, the backlog calls requesting that the on-demand code execution environment process, by execution of the task, data items from the backlog cache.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer-implemented method to process data items within a data source via an on-demand code execution environment, the computer-implemented method comprising:
retrieving, for a set of data items within the data source, time data indicating points in time at which individual data items from the set of data items were created or modified within the data source;
determining, from the time data, an estimated modification frequency for the data source, the estimated modification frequency indicating an estimated frequency at which data items within the data source are created or modified;
obtaining a threshold period of time;
utilizing the time data, the estimated modification frequency for the data source, and an anticipated rate of processing of data items at the on-demand code execution system to establish a demarcation time for the data source that is expected to result in a completion, within the threshold period of time, of processing of data items created or modified in the data source after the demarcation time, wherein data items created or modified in the data source prior to the demarcation time are considered backlogged data items, and wherein the set of data items includes at least one data item created or modified in the data source after the demarcation time;
enqueuing within a backlog cache a first set of data items, from the data store, that were created or modified in the data source prior to the demarcation time;
iteratively submitting data stream calls to the on-demand code execution environment, the data stream calls requesting that the on-demand code execution environment process, by execution of a task, data items from the data source that were created or modified after the demarcation time; and
while data stream calls are submitted to the on-demand code execution environment, submitting backlog calls to the on-demand code execution environment, the backlog calls requesting that the on-demand code execution environment process, by execution of the task, data items from the backlog cache.
Dependent claims: 9, 10, 11, 12, 13, 14
15. Non-transitory computer readable media including computer-executable instructions to process data items within a data source via an on-demand code execution environment, wherein the computer-executable instructions, when executed by a computing system, cause the computing system to:
obtain time data indicating points in time at which individual data items from the data items within the data source were created or modified;
determine, from the time data, an estimated modification frequency for the data source, the estimated modification frequency indicating an estimated frequency at which data items within the data source are created or modified;
obtain a threshold period of time;
utilize the time data, the estimated modification frequency for the data source, and an anticipated rate of processing of data items at the on-demand code execution system, to establish a demarcation time for the data source that is expected to result in a completion, within the threshold period of time, of processing of data items created or modified in the data source after the demarcation time, wherein data items created or modified in the data source prior to the demarcation time are considered backlogged data items, and wherein the set of data items includes at least one data item created or modified in the data source after the demarcation time;
enqueue within a backlog cache a first set of data items, from the data store, that were created or modified in the data source prior to the demarcation time;
submit data stream calls to the on-demand code execution environment, the data stream calls requesting that the on-demand code execution environment process, by execution of a task, data items from the data source that were created or modified after the demarcation time; and
concurrently with submission of data stream calls to the on-demand code execution environment, submit backlog calls to the on-demand code execution environment, the backlog calls requesting that the on-demand code execution environment process, by execution of the task, data items from the backlog cache.
Dependent claims: 16, 17, 18, 19, 20, 21
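Claims 1, 8, and 15 each recite submitting backlog calls while data stream calls are also being submitted, without specifying how capacity is divided between the two. One way this could be scheduled is sketched below, assuming calls are issued in fixed-size rounds and a fixed number of slots per round is reserved for the backlog. The round model, the `plan_rounds` name, and both slot parameters are illustrative assumptions, and the stream is simplified to a finite list.

```python
from collections import deque
from typing import Hashable, Iterable, List, Tuple

Call = Tuple[str, Hashable]  # ("stream" | "backlog", item identifier)


def plan_rounds(
    stream_items: Iterable[Hashable],
    backlog_items: Iterable[Hashable],
    slots_per_round: int,
    backlog_slots: int,
) -> List[List[Call]]:
    """Plan rounds of calls that mix stream and backlog items.

    Each round reserves up to `backlog_slots` calls for the backlog, gives the
    remaining slots to stream items, and hands any unused slots back to the
    backlog so that capacity is not wasted.
    """
    if not 0 < backlog_slots < slots_per_round:
        raise ValueError("backlog_slots must be between 1 and slots_per_round - 1")
    stream_q, backlog_q = deque(stream_items), deque(backlog_items)
    rounds: List[List[Call]] = []
    while stream_q or backlog_q:
        calls: List[Call] = []
        # Reserved share: the backlog always makes progress.
        for _ in range(min(backlog_slots, len(backlog_q))):
            calls.append(("backlog", backlog_q.popleft()))
        # Remaining slots go to the data stream.
        while stream_q and len(calls) < slots_per_round:
            calls.append(("stream", stream_q.popleft()))
        # Any leftover capacity drains more of the backlog.
        while backlog_q and len(calls) < slots_per_round:
            calls.append(("backlog", backlog_q.popleft()))
        rounds.append(calls)
    return rounds
```

Reserving a share for the backlog keeps it from starving when the stream is busy, while returning unused slots lets the backlog drain faster once the stream is quiet.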
Specification