Method and apparatus for time management and scheduling for synchronous processing on a cluster of processing nodes
First Claim
1. A method for processing by a first node in a distributed computing system formed by a plurality of interconnected nodes which maintain their own relative time, comprising:
monitoring completion of jobs by other nodes in the distributed computing system;
determining, after completing processing of a job in a current time interval of the first node, whether or not to start processing a job in a subsequent time interval of the first node based on at least one constraint and the monitored completion of jobs by other nodes; and
determining a time offset for a connection between the first node and the second node, based on a difference in timestamps between the first node and at least the second node, wherein each of the timestamps indicates an interval for which a corresponding job is completed by a respective node, wherein the at least one constraint is defined by a value that specifies a maximum number of jobs the first node is allowed to process ahead of a second node, and the at least one constraint prevents the first node from processing a job in the subsequent time interval of the first node before at least the second node has begun processing a job in an interval later than a subsequent time interval of the first node adjusted based on the time offset.
Abstract
Certain aspects of the present disclosure provide techniques for time management and scheduling of synchronous neural processing on a cluster of processing nodes. A slip (or offset) may be introduced between processing nodes of a distributed processing system formed by a plurality of interconnected processing nodes, to enable faster nodes to continue processing without waiting for slower nodes to catch up. In certain aspects, a processing node, after completing each processing step, may check for received completion packets and apply a defined constraint to determine whether it may start processing a subsequent step or not.
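The check described above can be illustrated with a minimal sketch. It is not the patented implementation: the class name `Node`, the field `max_slip`, and the methods `on_completion_packet`, `set_offset`, `may_start_next`, and `step` are all hypothetical names chosen for illustration. The sketch assumes that a completion packet carries the interval number of the job a peer just finished (in the peer's own relative time), that the per-connection time offset is a simple difference of two interval timestamps, and that a peer from which no packet has yet arrived imposes no constraint.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Illustrative node that keeps its own relative time (interval counter)."""
    name: str
    max_slip: int                     # assumed: max jobs this node may run ahead of a peer
    current_interval: int = 0         # last interval completed, in this node's own time
    peer_completed: dict = field(default_factory=dict)  # peer -> latest completed interval
    offsets: dict = field(default_factory=dict)         # peer -> offset for that connection

    def on_completion_packet(self, peer: str, interval: int) -> None:
        # Record the most recent interval the peer reports as completed.
        previous = self.peer_completed.get(peer, interval)
        self.peer_completed[peer] = max(previous, interval)

    def set_offset(self, peer: str, local_ts: int, peer_ts: int) -> None:
        # Assumed definition: offset = difference between the two nodes'
        # interval timestamps, used to map peer time into local time.
        self.offsets[peer] = local_ts - peer_ts

    def may_start_next(self) -> bool:
        # Apply the slip constraint against every peer heard from so far.
        next_interval = self.current_interval + 1
        for peer, done in self.peer_completed.items():
            done_in_local_time = done + self.offsets.get(peer, 0)
            if next_interval - done_in_local_time > self.max_slip:
                return False  # too far ahead of this peer; wait
        return True

    def step(self) -> bool:
        # Called after finishing a job: start the next interval only if allowed.
        if self.may_start_next():
            self.current_interval += 1
            return True
        return False
```

With `max_slip=2`, a node whose peer has reported interval 0 (and a zero offset) can advance to interval 2 and then stalls until a further completion packet arrives, which is the "slip" that lets a faster node keep working without waiting for a slower one at every step.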
37 Citations
20 Claims
1. A method for processing by a first node in a distributed computing system formed by a plurality of interconnected nodes which maintain their own relative time, comprising:
monitoring completion of jobs by other nodes in the distributed computing system;
determining, after completing processing of a job in a current time interval of the first node, whether or not to start processing a job in a subsequent time interval of the first node based on at least one constraint and the monitored completion of jobs by other nodes; and
determining a time offset for a connection between the first node and the second node, based on a difference in timestamps between the first node and at least the second node, wherein each of the timestamps indicates an interval for which a corresponding job is completed by a respective node, wherein the at least one constraint is defined by a value that specifies a maximum number of jobs the first node is allowed to process ahead of a second node, and the at least one constraint prevents the first node from processing a job in the subsequent time interval of the first node before at least the second node has begun processing a job in an interval later than a subsequent time interval of the first node adjusted based on the time offset.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. An apparatus for processing by a first node in a distributed computing system formed by a plurality of interconnected nodes which maintain their own relative time, comprising:
means for monitoring completion of jobs by other nodes in the distributed computing system;
means for determining, after completing processing of a job in a current time interval of the first node, whether or not to start processing a job in a subsequent time interval of the first node based on at least one constraint and the monitored completion of jobs by other nodes; and
means for determining a time offset for a connection between the first node and the second node, based on a difference in timestamps between the first node and at least the second node, wherein each of the timestamps indicates an interval for which a corresponding job is completed by a respective node, wherein the at least one constraint is defined by a value that specifies a maximum number of jobs the first node is allowed to process ahead of a second node, and the at least one constraint prevents the first node from processing a job in the subsequent time interval of the first node before at least the second node has begun processing a job in an interval later than a subsequent time interval of the first node adjusted based on the time offset.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. An apparatus for processing by a first node in a distributed computing system formed by a plurality of interconnected nodes which maintain their own relative time, comprising:
at least one processor configured to monitor completion of jobs by other nodes in the distributed computing system, determine, after completing processing of a job in a current time interval of the first node, whether or not to start processing a job in a subsequent time interval of the first node based on at least one constraint and the monitored completion of jobs by other nodes, and determine a time offset for a connection between the first node and the second node, based on a difference in timestamps between the first node and at least the second node, wherein each of the timestamps indicates an interval for which a corresponding job is completed by a respective node, wherein the at least one constraint is defined by a value that specifies a maximum number of jobs the first node is allowed to process ahead of a second node, and the at least one constraint prevents the first node from processing a job in the subsequent time interval of the first node before at least the second node has begun processing a job in an interval later than a subsequent time interval of the first node adjusted based on the time offset; and
a memory coupled with the at least one processor.
20. A non-transitory computer readable medium for processing by a first node in a distributed computing system formed by a plurality of interconnected nodes which maintain their own relative time, the non-transitory computer readable medium having instructions stored thereon, the instructions executable by one or more processors for:
monitoring completion of jobs by other nodes in the distributed computing system;
determining, after completing processing of a job in a current time interval of the first node, whether or not to start processing a job in a subsequent time interval of the first node based on at least one constraint and the monitored completion of jobs by other nodes; and
determining a time offset for a connection between the first node and the second node, based on a difference in timestamps between the first node and at least the second node, wherein each of the timestamps indicates an interval for which a corresponding job is completed by a respective node, wherein the at least one constraint is defined by a value that specifies a maximum number of jobs the first node is allowed to process ahead of a second node, and the at least one constraint prevents the first node from processing a job in the subsequent time interval of the first node before at least the second node has begun processing a job in an interval later than a subsequent time interval of the first node adjusted based on the time offset.
Specification