System and method for fault tolerance in multi-node system
Abstract
A method and system for promoting fault tolerance in a multi-node computing system. The method provides deadlock-free message routing in the presence of node and/or link faults using only two rounds, and thus requires only two virtual channels to ensure deadlock freedom. A lamb set of nodes is introduced for use in message routing: each node in the lamb set is used only as a point along message routes, and not for sending or receiving messages.
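The role of the lamb set can be illustrated with a small sketch. It is not the patented routing method: it assumes a 2D mesh topology, uses plain breadth-first search in place of the two-round routing the abstract refers to, and the names `mesh_neighbors` and `route` are hypothetical. It shows only the constraint that lamb nodes may lie on a path but may never be its source or destination.

```python
from collections import deque


def mesh_neighbors(node, width, height):
    """Yield the 4-connected neighbors of node in a width x height mesh (assumed topology)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def route(src, dst, width, height, faulty, lamb_set):
    """Return a shortest fault-free path from src to dst, or None if none exists.

    Lamb nodes may appear in the middle of the path but are rejected as
    endpoints. BFS is an illustrative stand-in, not the patent's routing.
    """
    for endpoint in (src, dst):
        if endpoint in lamb_set:
            raise ValueError(f"{endpoint} is a lamb node and cannot send or receive")
        if endpoint in faulty:
            raise ValueError(f"{endpoint} is faulty")

    parent = {src: None}  # doubles as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent links back to src, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in mesh_neighbors(node, width, height):
            if nxt not in faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

For example, in a 3 x 3 mesh with faulty nodes `{(1, 0), (1, 2)}` and lamb set `{(1, 1)}`, calling `route((0, 1), (2, 1), 3, 3, faulty, lamb_set)` passes through the lamb node `(1, 1)`, while calling `route` with `(1, 1)` as either endpoint raises `ValueError`.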
40 Claims
1. A general purpose computer system having multiple nodes, comprising:
at least one processor executing method acts to promote tolerance of faults in the system, the method acts comprising:
based at least in part on the faults, determining a set of nodes; and
using nodes in the set of nodes only as points on routing paths of messages, and not using any node in the set of nodes for sending or receiving messages.
Dependent claims: 2-15.
16. A computer program device comprising:
a computer program storage device readable by a digital processing apparatus; and
a program on the program storage device and including instructions executable by the digital processing apparatus for promoting fault tolerance in a multi-node system, the program comprising:
means for designating a lamb set of nodes in the multi-node system to be used for routing messages within the system.
Dependent claims: 17-28.
29. A method for promoting fault tolerance in a multi-node system, comprising the acts of:
for each of k rounds, finding multiple partitions of nodes, each partition having a representative node;
for each representative node, determining whether the node can reach at least one predetermined other representative node within a predetermined criteria;
minimizing the number of nodes and/or partitions using a weighted graph to establish a routing set of nodes; and
returning the routing set of nodes for use thereof in routing messages through the system in the presence of one or more node and/or link faults.
Dependent claims: 30-40.
Specification