Reachability-based coordination for cyclic dataflow
First Claim
Patent Images
1. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, configure the processor to perform operations comprising:
- scheduling a plurality of processes each comprising a plurality of threads to operate independently on discrete partitions of data;
responsive to a first thread of the plurality of threads receiving a first partition of the data, the first thread beginning an operation on the first partition of the data;
responsive to a second thread of the plurality of threads receiving a second partition of the data, the second thread beginning the operation on the second partition of the data;
tracking progress of the operation using a replicated data structure at each of the first thread and the second thread, the replicated data structure comprising a logical time, the logical time being a tuple comprising at least an integer representing an epoch and an integer representing an iteration;
adding a timestamp to a record for processing at each of the first thread and the second thread, the timestamp corresponding to the logical time the record was produced;
storing the timestamp in the replicated data structure;
determining a number of yet to be processed records from the replicated data structure for at least one of the first thread or the second thread; and
when the number of yet to be processed records for the first thread or the second thread reaches zero and the timestamp includes the highest iteration in the replicated data structure for one of the first thread or the second thread, terminating the one of the first thread or the second thread.
2 Assignments
0 Petitions
Accused Products
Abstract
Various embodiments provide techniques for working with large-scale collections of data pertaining to real world systems, such as a social network, a roadmap/GPS system, etc. The techniques perform incremental, iterative, and interactive parallel computation using a coordination clock protocol, which applies to scheduling computations and managing resources such as memory and network resources, etc., in cyclic graphs including those resulting from a differential dataflow model that performs computations on differences in the collections of data.
93 Citations
20 Claims
-
1. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, configure the processor to perform operations comprising:
-
scheduling a plurality of processes each comprising a plurality of threads to operate independently on discrete partitions of data; responsive to a first thread of the plurality of threads receiving a first partition of the data, the first thread beginning an operation on the first partition of the data; responsive to a second thread of the plurality of threads receiving a second partition of the data, the second thread beginning the operation on the second partition of the data; tracking progress of the operation using a replicated data structure at each of the first thread and the second thread, the replicated data structure comprising a logical time, the logical time being a tuple comprising at least an integer representing an epoch and an integer representing an iteration; adding a timestamp to a record for processing at each of the first thread and the second thread, the timestamp corresponding to the logical time the record was produced; storing the timestamp in the replicated data structure; determining a number of yet to be processed records from the replicated data structure for at least one of the first thread or the second thread; and when the number of yet to be processed records for the first thread or the second thread reaches zero and the timestamp includes the highest iteration in the replicated data structure for one of the first thread or the second thread, terminating the one of the first thread or the second thread. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A method comprising:
-
tracking progress of a computation by; associating each of a plurality of data items to be processed in a computation with a dataflow vertex of a dataflow graph, the dataflow graph representing a single program, and the dataflow vertex comprising the output of a function and the input of a next function; associating each of the plurality of data items to be processed in the computation with a timestamp, the timestamp corresponding to an order in which data items are created in the computation; storing a vertex-time pair associated with each data item, the vertex-time pair comprising the dataflow vertex and the timestamp; and counting a number of yet to be processed data items that are associated with each vertex-time pair; and communicating by at least one processor asynchronously sending one or more messages including the number of yet to be processed data items that are associated with at least one vertex-time pair. - View Dependent Claims (7, 8, 9, 10, 11, 12, 13)
-
-
14. A system comprising:
-
a plurality of processors connected to a network for asynchronously sending messages; a plurality of memories storing data comprising; a graph of at least one dataflow operator; a queue of unprocessed records associated with each dataflow operator, each unprocessed record having a timestamp; and a mapping from a version, wherein the version includes a dataflow operator-timestamp pair, to an over-approximation of a number of unprocessed records for that version in the system, wherein at least one processor of the plurality of processors communicates the number of unprocessed records. - View Dependent Claims (15, 16, 17, 18, 19, 20)
-
Specification