Network Fault Detection and Reconfiguration
First Claim
1. A computer-implemented method for detecting communication faults in a parallel system, the method comprising:
- sending ping messages by a node of the parallel system to one or more destination nodes, wherein the parallel system comprises a plurality of nodes communicating with each other using a plurality of links, each node comprising a processor;
- waiting to receive acknowledgements from each destination node indicating the destination node received the ping message;
- responsive to failure to receive one or more acknowledgement messages, detecting failure of corresponding one or more ping messages to reach their destination nodes; and
- responsive to detecting failure of one or more ping messages to reach their target nodes, identifying a faulty component in the parallel system, the identifying comprising:
  - freezing communications in the parallel system by sending a request to nodes of the system to stop sending and receiving messages except for ping messages;
  - sending ping messages through different components of the parallel system;
  - identifying the faulty component based on failure to deliver a ping message through the component; and
  - unfreezing the parallel system by sending requests to the nodes of the system to restart sending and receiving messages other than ping messages.
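The identifying step of claim 1 (freeze, probe through different components, identify, unfreeze) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented implementation: the names `Node`, `ParallelSystem`, and `isolate_fault`, the representation of a path as a list of component labels, and the rule of exonerating every component that lies on a successful probe path are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node model: while frozen, only ping traffic is allowed."""
    name: str
    frozen: bool = False

    def may_send(self, kind: str) -> bool:
        return kind == "ping" or not self.frozen


@dataclass
class ParallelSystem:
    """Hypothetical system model used only for this simulation."""
    nodes: dict                               # node name -> Node
    faulty: set = field(default_factory=set)  # hidden ground truth

    def ping(self, path) -> bool:
        # A ping is acknowledged only if every component on the path works.
        return not any(component in self.faulty for component in path)


def isolate_fault(system, probe_paths):
    """Freeze the system, probe each path, and return the suspect components."""
    for node in system.nodes.values():
        node.frozen = True                    # stop all traffic except pings
    try:
        working, suspect = set(), set()
        for path in probe_paths:
            if system.ping(path):
                working.update(path)          # every component on this path is fine
            else:
                suspect.update(path)          # at least one component here is faulty
        # Assumption: a component seen on a successful path is exonerated.
        return suspect - working
    finally:
        for node in system.nodes.values():
            node.frozen = False               # resume normal traffic


if __name__ == "__main__":
    system = ParallelSystem(
        nodes={name: Node(name) for name in ("A", "B", "C")},
        faulty={"link:S1-B"},
    )
    probes = [
        ["link:A-S1", "S1", "link:S1-B"],     # A to B through switch S1
        ["link:A-S1", "S1", "link:S1-C"],     # A to C through switch S1
        ["link:A-S2", "S2", "link:S2-B"],     # A to B through switch S2
    ]
    print(isolate_fault(system, probes))
```

In this sketch the result narrows to a single component only when the probe paths overlap enough; with fewer probes the function returns a set of candidates rather than one faulty component.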
Abstract
Scalable means are provided for diagnosing the health of a parallel system comprising multiple nodes interconnected using one or more switching networks. Each node pings other nodes via different paths at regular intervals. If more than a threshold number of pings are missed from a node, the system performs fault detection by entering a freeze state in which nodes do not send or receive any messages except ping messages. If ping messages still fail to reach their destination nodes, the parallel system identifies the faulty components that are causing the ping messages to fail. Once the faulty component is identified, the parallel system is unfrozen, allowing nodes to resume sending and receiving all messages. If redundant computers and/or switches are present, the parallel system is automatically reconfigured to avoid the faulty components.
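The threshold trigger and the reconfiguration step described in the abstract can be sketched as follows. The abstract does not say whether missed pings are counted consecutively or cumulatively, so this sketch assumes consecutive misses and resets the counter on any acknowledgement. `PingMonitor` and `choose_path` are hypothetical names, and selecting the first path that avoids all faulty components is one possible reconfiguration policy, not necessarily the one used in the patent.

```python
from collections import defaultdict


class PingMonitor:
    """Counts missed pings per destination (assumption: consecutive misses)."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.missed = defaultdict(int)

    def record(self, dest: str, acked: bool) -> bool:
        """Record one ping result; return True when fault detection should start."""
        if acked:
            self.missed[dest] = 0             # assumption: an ack resets the count
            return False
        self.missed[dest] += 1
        # "More than a threshold number" of misses triggers fault detection.
        return self.missed[dest] > self.threshold


def choose_path(paths, faulty):
    """Return the first path that avoids every faulty component, else None."""
    for path in paths:
        if not faulty.intersection(path):
            return path
    return None
```

With a threshold of 2, for example, the third consecutive missed ping to the same destination is the first to return `True`, and `choose_path` would then pick a redundant switch or link if one remains after the faulty components are excluded.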
20 Claims
1. A computer-implemented method for detecting communication faults in a parallel system, the method comprising:
- sending ping messages by a node of the parallel system to one or more destination nodes, wherein the parallel system comprises a plurality of nodes communicating with each other using a plurality of links, each node comprising a processor;
- waiting to receive acknowledgements from each destination node indicating the destination node received the ping message;
- responsive to failure to receive one or more acknowledgement messages, detecting failure of corresponding one or more ping messages to reach their destination nodes; and
- responsive to detecting failure of one or more ping messages to reach their target nodes, identifying a faulty component in the parallel system, the identifying comprising:
  - freezing communications in the parallel system by sending a request to nodes of the system to stop sending and receiving messages except for ping messages;
  - sending ping messages through different components of the parallel system;
  - identifying the faulty component based on failure to deliver a ping message through the component; and
  - unfreezing the parallel system by sending requests to the nodes of the system to restart sending and receiving messages other than ping messages.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-readable storage medium storing computer-executable code for detecting communication faults in a parallel system, the code, when executed by a processor, causing the processor to:
- send ping messages by a node of the parallel system to one or more destination nodes, wherein the parallel system comprises a plurality of nodes communicating with each other using a plurality of links, each node comprising a processor;
- wait to receive acknowledgements from each destination node indicating the destination node received a ping message;
- responsive to failure to receive one or more acknowledgement messages, detect failure of corresponding one or more ping messages to reach their destination nodes; and
- responsive to detecting failure of one or more ping messages to reach their target nodes, identify a faulty component in the parallel system, the identifying causing the processor to:
  - freeze communications in the parallel system by sending a request to nodes of the system to stop sending and receiving messages except for ping messages;
  - send ping messages through different components of the parallel system;
  - identify the faulty component based on failure to deliver a ping message through the component; and
  - unfreeze the parallel system by sending requests to the nodes of the system to restart sending and receiving messages other than ping messages.
Dependent claims: 14, 15, 16
17. A computer-implemented system for detecting communication faults in a parallel system, the system comprising:
- a computer processor; and
- a computer-readable storage medium storing computer program modules configured to execute on the computer processor, the computer program modules comprising:
  - a communication module configured to:
    - send ping messages by a node of the parallel system to one or more destination nodes, wherein the parallel system comprises a plurality of nodes communicating with each other using a plurality of links, each node comprising a processor;
    - wait to receive acknowledgements from each destination node indicating the destination node received a ping message;
  - a fault detection module configured to:
    - responsive to failure to receive one or more acknowledgement messages, detect failure of corresponding one or more ping messages to reach their destination nodes; and
    - responsive to detecting failure of one or more ping messages to reach their target nodes, identify a faulty component in the parallel system, the identifying causing the processor to:
      - freeze communications in the parallel system by sending a request to nodes of the system to stop sending and receiving messages except for ping messages;
      - send ping messages through different components of the parallel system;
      - identify the faulty component based on failure to deliver a ping message through the component; and
      - unfreeze the parallel system by sending requests to the nodes of the system to restart sending and receiving messages other than ping messages.
Dependent claims: 18, 19, 20
Specification