Network Fault Detection and Reconfiguration

US 20130297976A1
Filed: 03/07/2013
Published: 11/07/2013
Est. Priority Date: 05/04/2012
Status: Active Grant

6 Assignments

0 Petitions

Abstract

Scalable means are provided for diagnosing the health of a parallel system comprising multiple nodes interconnected using one or more switching networks. Each node pings other nodes via different paths at regular intervals. If more than a threshold number of pings are missed from a node, the system performs fault detection by entering a freeze state in which nodes do not send or receive any messages except ping messages. If ping messages still fail to reach destination nodes, the parallel system identifies the faulty components that are causing ping messages to fail. Once the faulty component is identified, the parallel system is unfrozen by allowing nodes to communicate all messages. If redundant computers and/or switches are present, the parallel system is automatically reconfigured to avoid the faulty components.
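
The threshold test described in the abstract can be illustrated with a minimal Python sketch. The class name `PingMonitor`, the parameter `miss_threshold`, and the choice to reset the counter whenever an acknowledgement arrives are all assumptions made for illustration; the abstract does not say whether misses are counted consecutively or cumulatively.

```python
from collections import defaultdict


class PingMonitor:
    """Counts unacknowledged pings per destination node (hypothetical helper)."""

    def __init__(self, miss_threshold):
        self.miss_threshold = miss_threshold
        self.missed = defaultdict(int)  # destination -> current miss count

    def record(self, dest, acked):
        """Record one ping result; return True when fault detection should start."""
        if acked:
            # Assumption: an acknowledgement clears the miss count.
            self.missed[dest] = 0
        else:
            self.missed[dest] += 1
        # "More than a threshold number of pings are missed".
        return self.missed[dest] > self.miss_threshold


monitor = PingMonitor(miss_threshold=2)
results = [monitor.record("node-b", acked=False) for _ in range(3)]
```

With a threshold of 2, the first two misses return `False` and the third returns `True`, which is the point at which the freeze-and-diagnose procedure would begin.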

55 Citations

20 Claims

1. A computer-implemented method for detecting communication faults in a parallel system, the method comprising:
- sending ping messages by a node of the parallel system to one or more destination nodes, wherein the parallel system comprises a plurality of nodes communicating with each other using a plurality of links, each node comprising a processor;
- waiting to receive acknowledgements from each destination node indicating the destination node received the ping message;
- responsive to failure to receive one or more acknowledgement message, detecting failure of corresponding one or more ping messages to reach their destination nodes; and
- responsive to detecting failure of one or more ping messages to reach their target nodes, identifying faulty component in the parallel system, the identifying comprising:
  - freezing communications in the parallel system by sending a request to nodes of the system to stop sending and receiving messages except for ping messages;
  - sending ping messages through different components of the parallel system;
  - identifying the faulty component based on failure to deliver a ping message through the component; and
  - unfreezing the parallel system by sending requests to the nodes of the system to restart sending and receiving messages other than ping messages.
- Dependent claims (2 to 12):
  - 2. The computer-implemented method of claim 1, wherein the faulty component is one of a link, a node, or a switch.
  - 3. The computer-implemented method of claim 1, wherein identifying the faulty component comprises:
    - identifying a link as the faulty component responsive to the link failing to communicate a sequence of ping messages.
  - 4. The computer-implemented method of claim 1, wherein identifying the faulty component comprises:
    - identifying a node as the faulty component responsive to all links connected to the node being determined to be faulty.
  - 5. The computer-implemented method of claim 1, wherein each link is associated with a switch and identifying the faulty component comprises:
    - identifying a switch as the faulty component responsive to any links of the switch being determined as faulty.
  - 6. The computer-implemented method of claim 1, detecting failure of a ping message comprises determining a failure to receive more than a threshold number of acknowledgements corresponding to ping messages sent.
  - 7. The computer-implemented method of claim 1, wherein identifying the faulty components in the system comprises executing a hierarchical state machine comprising:
    - link state machines that track states of links of the parallel system;
    - switch state machines that track states of the switches of the parallel system; and
    - node state machines that track states of nodes of the parallel system.
  - 8. The computer-implemented method of claim 7, wherein a state of each switch state machine is determined based on states of the link state machines tracking states of the links of the switch.
  - 9. The computer-implemented method of claim 7, wherein a state of each node state machine is determined based on states of the link state machines tracking states of the links connected to the node.
  - 10. The computer-implemented method of claim 7, further comprising a system state machine, wherein the state of the system state machine is determined based on the states of the switch state machines and node state machines of the system.
  - 11. The computer-implemented method of claim 1, further comprising:
    - reconfiguring the system to allow the nodes to communicate without using the faulty component of the system.
  - 12. The computer-implemented method of claim 11, wherein each node stores structures describing a network configuration for communicating with other nodes and reconfiguring the system to communicate without using the faulty components comprises modifying the structures describing the network configuration to reflect a new network configuration that excludes the faulty components.
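
The diagnosis steps of claim 1, the classification rules of claims 3 to 5, and the reconfiguration of claims 11 and 12 can be sketched together as follows. This is a toy simulation under stated assumptions, not the patented implementation: `ParallelSystem`, `identify_faults`, `reconfigure`, the `pings_per_link` parameter, and the model in which every link joins exactly one node to one switch are all hypothetical.

```python
class ParallelSystem:
    """Toy model (assumption): each link joins one node to one switch."""

    def __init__(self, links, broken_links=()):
        self.links = dict(links)               # link_id -> (node_id, switch_id)
        self.broken_links = set(broken_links)  # simulated ground truth
        self.frozen = False

    def freeze(self):
        # Stands in for asking every node to stop all non-ping traffic.
        self.frozen = True

    def unfreeze(self):
        # Stands in for asking every node to resume normal traffic.
        self.frozen = False

    def ping_via(self, link_id):
        # A ping through a broken link is never acknowledged.
        return link_id not in self.broken_links


def identify_faults(system, pings_per_link=3):
    """Freeze, probe every link, classify components, then unfreeze."""
    system.freeze()
    try:
        # Claim 3: a link is faulty if a whole sequence of pings fails on it.
        faulty_links = {
            link for link in system.links
            if not any(system.ping_via(link) for _ in range(pings_per_link))
        }
    finally:
        system.unfreeze()

    by_node, by_switch = {}, {}
    for link, (node, switch) in system.links.items():
        by_node.setdefault(node, []).append(link)
        by_switch.setdefault(switch, []).append(link)

    # Claim 4: a node is faulty when all of its links are faulty.
    faulty_nodes = {n for n, ls in by_node.items()
                    if all(l in faulty_links for l in ls)}
    # Claim 5: a switch is faulty when any of its links is faulty.
    faulty_switches = {s for s, ls in by_switch.items()
                       if any(l in faulty_links for l in ls)}
    return faulty_links, faulty_nodes, faulty_switches


def reconfigure(links, faulty_links, faulty_nodes, faulty_switches):
    """Claims 11 and 12: return a link table that excludes faulty components."""
    return {
        link: (node, switch)
        for link, (node, switch) in links.items()
        if link not in faulty_links
        and node not in faulty_nodes
        and switch not in faulty_switches
    }


links = {"A-S1": ("A", "S1"), "A-S2": ("A", "S2"),
         "B-S1": ("B", "S1"), "B-S2": ("B", "S2")}
system = ParallelSystem(links, broken_links={"A-S1"})
faulty_links, faulty_nodes, faulty_switches = identify_faults(system)
new_links = reconfigure(links, faulty_links, faulty_nodes, faulty_switches)
```

In this example the single broken link marks switch `S1` as faulty, node `A` stays healthy because its link to `S2` still works, and the reconfigured table routes both nodes through the redundant switch `S2` only.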

13. A computer-readable storage medium storing computer-executable code for detecting communication faults in a parallel system, the code, when executed by a processor, causing the processor to:
- send ping messages by a node of the parallel system to one or more destination nodes, wherein the parallel system comprises a plurality of nodes communicating with each other using a plurality of links, each node comprising a processor;
- wait to receive acknowledgements from each destination node indicating the destination node received a ping message;
- responsive to failure to receive one or more acknowledgement message, detect failure of corresponding one or more ping messages to reach their destination nodes; and
- responsive to detecting failure of one or more ping messages to reach their target nodes, identify faulty component in the parallel system, the identifying causing the processor to:
  - freeze communications in the parallel system by sending a request to nodes of the system to stop sending and receiving messages except for ping messages;
  - send ping messages through different components of the parallel system;
  - identify the faulty component based on failure to deliver a ping message through the component; and
  - unfreeze the parallel system by sending requests to the nodes of the system to restart sending and receiving messages other than ping messages.
- Dependent claims (14 to 16):
  - 14. The computer-readable storage medium of claim 13, wherein the code causes a processor to execute a hierarchical state machine comprising:
    - link state machines that track states of links of the parallel system;
    - switch state machines that track states of the switches of the parallel system; and
    - node state machines that track states of nodes of the parallel system.
  - 15. The computer-readable storage medium of claim 14, wherein the code causes the processor to:
    - determine a state of each switch state machine based on states of the link state machines tracking states of the links of the switch.
  - 16. The computer-readable storage medium of claim 14, wherein the code causes the processor to:
    - determine a state of each node state machine based on states of the link state machines tracking states of the links connected to the node.
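
One way to picture the hierarchical state machine of claims 14 to 16 is a link-level machine driven by ping results, with switch and node states derived from the link states. The sketch below is illustrative only: the state names `UP`, `SUSPECT`, and `DOWN`, the `down_after` parameter, and the derivation rules (a switch takes the worst state of its links, a node the best state of its links, mirroring claims 4 and 5) are assumptions, since the claims do not name states or transitions.

```python
from enum import IntEnum


class State(IntEnum):
    """Hypothetical states, ordered from healthiest to worst."""
    UP = 0
    SUSPECT = 1
    DOWN = 2


class LinkStateMachine:
    """Tracks one link; moves to DOWN after `down_after` consecutive misses."""

    def __init__(self, down_after=3):
        self.down_after = down_after
        self.misses = 0
        self.state = State.UP

    def on_ping_result(self, acked):
        if acked:
            self.misses = 0
            self.state = State.UP
        else:
            self.misses += 1
            self.state = (State.DOWN if self.misses >= self.down_after
                          else State.SUSPECT)
        return self.state


def switch_state(link_machines):
    # Assumed rule: a switch is as bad as its worst link.
    return max(m.state for m in link_machines)


def node_state(link_machines):
    # Assumed rule: a node is as good as its best link.
    return min(m.state for m in link_machines)


a_s1, a_s2, b_s1 = LinkStateMachine(), LinkStateMachine(), LinkStateMachine()
for _ in range(3):
    a_s1.on_ping_result(acked=False)  # three misses on link A-S1
a_s2.on_ping_result(acked=True)
b_s1.on_ping_result(acked=True)

s1 = switch_state([a_s1, b_s1])   # links attached to switch S1
node_a = node_state([a_s1, a_s2])  # links attached to node A
```

Because the parent states are computed from the child machines rather than stored, they cannot drift out of step with the link states; a stored-state design with explicit transitions would be an equally valid reading of the claims.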

17. A computer-implemented system for detecting communication faults in a parallel system, the system comprising:
- a computer processor; and
- a computer-readable storage medium storing computer program modules configured to execute on the computer processor, the computer program modules comprising:
  - a communication module configured to:
    - send ping messages by a node of the parallel system to one or more destination nodes, wherein the parallel system comprises a plurality of nodes communicating with each other using a plurality of links, each node comprising a processor;
    - wait to receive acknowledgements from each destination node indicating the destination node received a ping message;
  - a fault detection module configured to:
    - responsive to failure to receive one or more acknowledgement message, detect failure of corresponding one or more ping messages to reach their destination nodes; and
    - responsive to detecting failure of one or more ping messages to reach their target nodes, identify faulty component in the parallel system, the identifying causing the processor to:
      - freeze communications in the parallel system by sending a request to nodes of the system to stop sending and receiving messages except for ping messages;
      - send ping messages through different components of the parallel system;
      - identify the faulty component based on failure to deliver a ping message through the component; and
      - unfreeze the parallel system by sending requests to the nodes of the system to restart sending and receiving messages other than ping messages.
- Dependent claims (18 to 20):
  - 18. The computer-readable storage medium of claim 17, wherein the computer program modules comprise a state machine manager configured to execute a hierarchical state machine comprising:
    - link state machines that track states of links of the parallel system;
    - switch state machines that track states of the switches of the parallel system; and
    - node state machines that track states of nodes of the parallel system.
  - 19. The computer-readable storage medium of claim 18, wherein the state machine manager is configured to:
    - determine a state of each switch state machine based on states of the link state machines tracking states of the links of the switch.
  - 20. The computer-readable storage medium of claim 18, wherein the state machine manager is configured to:
    - determine a state of each node state machine based on states of the link state machines tracking states of the links connected to the node.
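
The communication module of claim 17 sends pings and waits for acknowledgements. A minimal sketch of that bookkeeping is shown below, assuming sequence-numbered pings and a fixed acknowledgement timeout. The names `CommunicationModule`, `ack_timeout`, and `expired`, and the injected `send` callable and `now` timestamps, are hypothetical; the claim does not specify how waiting is implemented.

```python
class CommunicationModule:
    """Sends sequence-numbered pings and tracks which are still unacknowledged."""

    def __init__(self, send, ack_timeout):
        self.send = send                # callable(dest, seq), supplied by caller
        self.ack_timeout = ack_timeout  # seconds to wait for an acknowledgement
        self.next_seq = 0
        self.pending = {}               # seq -> (dest, deadline)

    def ping(self, dest, now):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = (dest, now + self.ack_timeout)
        self.send(dest, seq)
        return seq

    def on_ack(self, seq):
        # Late or duplicate acknowledgements are ignored.
        self.pending.pop(seq, None)

    def expired(self, now):
        """Return destinations whose pings timed out and stop tracking them."""
        late = [seq for seq, (_, deadline) in self.pending.items()
                if now >= deadline]
        return [self.pending.pop(seq)[0] for seq in late]


sent = []
comm = CommunicationModule(send=lambda dest, seq: sent.append((dest, seq)),
                           ack_timeout=1.0)
seq_b = comm.ping("node-b", now=0.0)
seq_c = comm.ping("node-c", now=0.0)
comm.on_ack(seq_b)               # node-b answers in time
missed = comm.expired(now=1.0)   # node-c never answers
```

The list returned by `expired` is what a fault detection module would consume, for example by feeding each missed destination into a per-node miss counter.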

Current Assignee: Paraccel Incorporated (Actian Corporation)
Original Assignee: Paraccel Incorporated (Actian Corporation)
Inventors: McMillen, Robert J.
Granted Patent: US 9,239,749 B2
US Class Current: 714/43
CPC Class Codes:
G06F 11/079   Root cause analysis, i.e. e...
H04L 41/0645   by additionally acting on o...
H04L 41/0659   by isolating or reconfiguri...
H04L 41/0661   by reconfiguring faulty ent...
H04L 43/10   Active monitoring, e.g. hea...
