Reachability-Based Coordination for Cyclic Dataflow
First Claim
1. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, configure the processor to perform operations comprising:
- scheduling a plurality of processes each comprising a plurality of threads to operate independently on discrete partitions of data;
responsive to a first thread of the plurality of threads receiving a first partition of the data, the first thread beginning an operation on the first partition of the data;
responsive to a second thread of the plurality of threads receiving a second partition of the data, the second thread beginning the operation on the second partition of the data;
tracking progress of the operation using a replicated data structure at each thread, the replicated data structure comprising an epoch-iteration tuple;
storing the epoch-iteration tuple as a timestamp in the replicated data structure corresponding to records for processing at each thread;
determining a number of yet to be processed records from the replicated data structure for at least one of the first thread or the second thread; and
when the number of yet to be processed records for the first thread or the second thread reaches zero and the tuple includes the highest iteration in the replicated data structure for the thread, terminating the thread.
3 Assignments
0 Petitions
Accused Products
Abstract
Various embodiments provide techniques for working with large-scale collections of data pertaining to real world systems, such as a social network, a roadmap/GPS system, etc. The techniques perform incremental, iterative, and interactive parallel computation using a coordination clock protocol, which applies to scheduling computations and managing resources such as memory and network resources, etc., in cyclic graphs including those resulting from a differential dataflow model that performs computations on differences in the collections of data.
44 Citations
20 Claims
-
1. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, configure the processor to perform operations comprising:
-
scheduling a plurality of processes each comprising a plurality of threads to operate independently on discrete partitions of data; responsive to a first thread of the plurality of threads receiving a first partition of the data, the first thread beginning an operation on the first partition of the data; responsive to a second thread of the plurality of threads receiving a second partition of the data, the second thread beginning the operation on the second partition of the data; tracking progress of the operation using a replicated data structure at each thread, the replicated data structure comprising an epoch-iteration tuple; storing the epoch-iteration tuple as a timestamp in the replicated data structure corresponding to records for processing at each thread; determining a number of yet to be processed records from the replicated data structure for at least one of the first thread or the second thread; and when the number of yet to be processed records for the first thread or the second thread reaches zero and the tuple includes the highest iteration in the replicated data structure for the thread, terminating the thread. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A method comprising:
-
tracking progress of a computation by; associating each of a plurality of data items to be processed in a computation with a vertex of a dataflow graph, the dataflow graph representing a single program; associating each of the plurality of data items to be processed in the computation with a timestamp, the timestamp corresponding to an order in which data items are created in the computation; storing a vertex-timestamp pair associated with each data item; and counting a number of yet to be processed data items that are associated with each vertex-timestamp pair; and communicating by at least one processor asynchronously sending one or more messages including the number of yet to be processed data items that are associated with at least one vertex-timestamp pair. - View Dependent Claims (7, 8, 9, 10, 11, 12, 13)
-
-
14. A system comprising:
-
a plurality of processors connected to a network for asynchronously sending messages; a plurality of memories storing data comprising; a graph of at least one dataflow operator; a queue of unprocessed records associated with each dataflow operator, each unprocessed record having a timestamp; a mapping from a version, wherein a version includes a dataflow operator-timestamp pair, to an over-approximation of a number of unprocessed records for that version in the system. - View Dependent Claims (15, 16, 17, 18, 19, 20)
-
Specification