Message flow control in a multi-node computer system
Abstract
Embodiments of the invention provide for controlling message flow across a parallel computer system having multiple compute nodes by selectively grouping compute nodes of such a system into node pools and assigning message flow control policies to nodes in the node pools. The message flow control policies specify logging and/or tracing activities to be performed by instances of applications running on nodes assigned to the node pools. As the application is executed, logging and/or tracing messages are generated on the compute nodes according to message flow control policies assigned to the nodes. Optionally, the message flow is analyzed, the message flow control policies are adjusted, and duplicate messages are eliminated.
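The relationship between node pools and policies described in the abstract can be sketched as follows. This is a minimal illustration only, not the patented implementation: the names `FlowPolicy`, `NodePool`, `emit`, and the level strings `"ERROR"` and `"TRACE"` are all hypothetical, and a policy is assumed to be nothing more than the set of message levels an application instance may emit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlowPolicy:
    """Hypothetical policy: the message levels an application instance may emit."""
    name: str
    enabled_levels: set = field(default_factory=set)


@dataclass
class NodePool:
    """A group of compute nodes that all refer to one shared policy object."""
    name: str
    policy: FlowPolicy
    node_ids: set = field(default_factory=set)


def emit(pool: NodePool, node_id: int, level: str, text: str) -> Optional[str]:
    """Return a formatted message if the pool's policy enables this level, else None."""
    if node_id not in pool.node_ids:
        raise ValueError(f"node {node_id} is not in pool {pool.name}")
    if level not in pool.policy.enabled_levels:
        return None
    return f"[{pool.name}/node{node_id}] {level}: {text}"


# Two nodes in one pool, with tracing and error logging enabled (assumed values).
pool = NodePool("pool-1", FlowPolicy("verbose", {"ERROR", "TRACE"}), {0, 1})
before = [emit(pool, n, "TRACE", "entering step") for n in sorted(pool.node_ids)]

# Adjusting the shared policy changes what every node in the pool emits.
pool.policy.enabled_levels.discard("TRACE")
after = [emit(pool, n, "TRACE", "entering step") for n in sorted(pool.node_ids)]
```

Because both nodes refer to the same policy object, one adjustment silences tracing on every node in the pool while error logging stays enabled.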
25 Claims
1. A computer-implemented method for controlling message flow in a parallel computing system having a plurality of compute nodes, the method comprising:
assigning a first set of compute nodes to a first node pool;
assigning a first message flow control policy to at least two compute nodes of the first node pool, wherein the first message flow control policy specifies at least one logging activity to be performed by an instance of an application running on each of the at least two compute nodes of the first node pool, and wherein subsequent modifications to the assigned first message flow control policy affect one or more of the at least one logging activities performed by each instance of the application running on the at least two compute nodes;
initiating execution of the application on each of the compute nodes in the first node pool;
while executing the application on the at least two compute nodes of the first node pool, generating a plurality of logging messages according to the first message flow control policy; and
upon determining that two or more of the at least two compute nodes of the first node pool are generating duplicate error messages based on content of the plurality of logging messages:
assigning a selected one of the two or more compute nodes to a second node pool; and
assigning a second message flow control policy corresponding to the second node pool to the selected compute node, wherein the second message flow control policy is distinct from the first message flow control policy, and wherein logging activity performed by the instance of the application running on the selected compute node is controlled by the second message flow control policy rather than the first message flow control policy.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
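The final steps of claim 1 (detecting duplicate error messages from their content, then moving one selected node to a second pool with a distinct policy) might look roughly like the sketch below. Everything named here is an assumption made for illustration: `LogMessage`, `nodes_with_duplicate_errors`, `reassign_one`, the policy names, and the rule of selecting the lowest node id are not taken from the patent, which leaves the selection and the policy contents open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogMessage:
    node_id: int
    level: str
    text: str


def nodes_with_duplicate_errors(messages):
    """Return sorted ids of nodes whose ERROR text was also emitted by another node."""
    nodes_by_text = defaultdict(set)
    for msg in messages:
        if msg.level == "ERROR":
            nodes_by_text[msg.text].add(msg.node_id)
    duplicated = set()
    for node_ids in nodes_by_text.values():
        if len(node_ids) >= 2:  # same content seen on two or more nodes
            duplicated |= node_ids
    return sorted(duplicated)


def reassign_one(pool_of, policy_of_pool, duplicate_nodes, second_pool):
    """Move one selected node to second_pool; return (node_id, policy now in force)."""
    if len(duplicate_nodes) < 2:
        return None
    selected = duplicate_nodes[0]  # assumption: pick the lowest id
    pool_of[selected] = second_pool
    return selected, policy_of_pool[pool_of[selected]]


# Hypothetical log traffic from three nodes in the first pool.
messages = [
    LogMessage(0, "ERROR", "disk quota exceeded"),
    LogMessage(1, "ERROR", "disk quota exceeded"),
    LogMessage(2, "ERROR", "link timeout"),
    LogMessage(1, "INFO", "step done"),
]
pool_of = {0: "pool-1", 1: "pool-1", 2: "pool-1"}
policy_of_pool = {"pool-1": "log-all", "pool-2": "errors-once"}  # assumed policy names

dups = nodes_with_duplicate_errors(messages)
result = reassign_one(pool_of, policy_of_pool, dups, "pool-2")
```

In this sketch the policy in force for a node is looked up through its pool, so reassigning the node is enough to place its logging under the second policy; the other node that reported the same error stays in the first pool.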
15. A non-transitory computer-readable medium containing a program which, when executed by a processor, performs an operation for controlling message flow in a parallel computing system having a plurality of compute nodes, the operation comprising:
assigning a first set of compute nodes to a first node pool;
assigning a first message flow control policy to at least two compute nodes of the first node pool, wherein the first message flow control policy specifies at least one logging activity to be performed by an instance of an application running on each of the at least two compute nodes of the first node pool, and wherein subsequent modifications to the assigned first message flow control policy affect one or more of the at least one logging activities performed by each instance of the application running on the at least two compute nodes;
initiating execution of the application on each of the compute nodes in the first node pool;
while executing the application on the at least two compute nodes of the first node pool, generating a plurality of logging messages according to the first message flow control policy; and
upon determining that two or more of the at least two compute nodes of the first node pool are generating duplicate error messages based on content of the plurality of logging messages:
assigning a selected one of the two or more compute nodes to a second node pool; and
assigning a second message flow control policy corresponding to the second node pool to the selected compute node, wherein the second message flow control policy is distinct from the first message flow control policy, and wherein logging activity performed by the instance of the application running on the selected compute node is controlled by the second message flow control policy rather than the first message flow control policy.
Dependent claims: 16, 17, 18, 19, 20, 21, 22
23. A parallel computing system, comprising:
a plurality of compute nodes, each having at least a processor and a memory, wherein the plurality of compute nodes is configured to execute a parallel computing task; and
a service node having at least a processor and a memory and a tracing-logging control program for controlling message flow in the parallel computing system, wherein the tracing-logging control program is configured to:
assign a first set of compute nodes to a first node pool;
assign a first message flow control policy to at least two compute nodes of the first node pool, wherein the first message flow control policy specifies at least one logging activity to be performed by an instance of an application running on each of the at least two compute nodes of the first node pool, and wherein subsequent modifications to the assigned first message flow control policy affect one or more of the at least one logging activities performed by each instance of the application running on the at least two compute nodes;
initiate execution of the application on each of the compute nodes in the first node pool;
while executing the application on the at least two compute nodes of the first node pool, generate a plurality of logging messages according to the first message flow control policy; and
upon determining that two or more of the at least two compute nodes of the first node pool are generating duplicate error messages based on content of the plurality of logging messages:
assign a selected one of the two or more compute nodes to a second node pool; and
assign a second message flow control policy corresponding to the second node pool to the selected compute node, wherein the second message flow control policy is distinct from the first message flow control policy, and wherein logging activity performed by the instance of the application running on the selected compute node is controlled by the second message flow control policy rather than the first message flow control policy.
Dependent claims: 24, 25
Specification