Resource management in a clustered computer system
Abstract
Methods, systems, and devices are provided for managing resources in a computing cluster. The managed resources include cluster nodes themselves, as well as sharable resources such as memory buffers and bandwidth credits that may be used by one or more nodes. Resource management includes detecting failures and possible failures by node software, node hardware, interconnects, and system area network switches and taking steps to compensate for failures and prevent problems such as uncoordinated access to a shared disk. Resource management also includes reallocating sharable resources in response to node failure, demands by application programs, or other events. Specific examples provided include failure detection by remote memory probes, emergency communication through a shared disk, and sharable resource allocation with minimal locking.
73 Claims
1. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster; and
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value;
wherein the probing step probes volatile memory in a remote cluster node.
(Dependent claims: 2, 3, 4, 5, 6, 66)
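The two steps of claim 1 can be illustrated with a small sketch. It is not the patented implementation: the `read_counter` hook stands in for whatever remote memory read the cluster hardware provides, and the `max_stalled` threshold and the rule that an unavailable counter immediately counts as a failure are assumptions made only for this example.

```python
from typing import Callable, Optional


class ProbeMonitor:
    """Tracks the last counter value seen in a remote device's memory."""

    def __init__(self, read_counter: Callable[[], Optional[int]], max_stalled: int = 3):
        self.read_counter = read_counter  # hypothetical remote-read hook; None means "unavailable"
        self.max_stalled = max_stalled    # assumed threshold, not taken from the claims
        self.last_value: Optional[int] = None
        self.stalled = 0

    def probe(self) -> bool:
        """Probe once and return True if the device is now considered failed."""
        value = self.read_counter()
        if value is None:
            # Assumption: an unavailable counter is treated as a failure at once.
            return True
        if value == self.last_value:
            self.stalled += 1             # counter did not advance since the last probe
        else:
            self.stalled = 0
            self.last_value = value
        return self.stalled >= self.max_stalled


# Simulated remote memory: the counter advances twice, then stops.
values = iter([1, 2, 2, 2, 2])
monitor = ProbeMonitor(lambda: next(values), max_stalled=3)
results = [monitor.probe() for _ in range(5)]

unreachable = ProbeMonitor(lambda: None)
unreachable_failed = unreachable.probe()
```

In this simulation the device is reported as failed only on the fifth probe, after the counter has been seen unchanged three times in a row.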
7. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster; and
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value;
wherein the probing step probes memory in a remote cluster interconnect.
(Dependent claims: 8, 9, 10, 11, 12, 67)
13. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster; and
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value;
wherein the probing step probes volatile memory in a remote system area network switch.
(Dependent claims: 14, 15, 16, 17, 18, 68)
19. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster; and
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value;
wherein the probing step probes memory in a remote cluster node and also probes memory in a remote system area network switch.
(Dependent claims: 20, 21, 22, 23, 24, 69)
25. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster; and
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value;
wherein the probing step probes nonvolatile memory in a shared storage device accessible by a remote cluster node.
(Dependent claims: 26, 27, 28, 29, 30, 70)
31. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster; and
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value;
wherein the using step is preceded by the step of determining the validity of the counter value.
(Dependent claims: 32, 33, 34, 35, 36, 71)
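Claim 31 does not say how the validity of the counter value is determined. One possible approach, shown here purely as an assumption, is to store the counter in a small record guarded by a marker word and a checksum. The record layout, the `MAGIC` constant, and the function names are all hypothetical.

```python
import struct
import zlib
from typing import Optional

MAGIC = 0x50524F42               # hypothetical marker word, not taken from the patent
RECORD = struct.Struct("<IQI")   # assumed layout: marker, counter, CRC-32 of the first 12 bytes


def pack_record(counter: int) -> bytes:
    """Build the record a probed device would keep in its memory (assumed format)."""
    head = struct.pack("<IQ", MAGIC, counter)
    return head + struct.pack("<I", zlib.crc32(head))


def valid_counter(raw: bytes) -> Optional[int]:
    """Return the counter if the probed bytes look valid, otherwise None."""
    if len(raw) != RECORD.size:
        return None
    magic, counter, crc = RECORD.unpack(raw)
    if magic != MAGIC or crc != zlib.crc32(raw[:12]):
        return None
    return counter


good = pack_record(41)
corrupted = bytearray(good)
corrupted[4] ^= 0xFF             # flip bits inside the counter field

good_value = valid_counter(good)
corrupted_value = valid_counter(bytes(corrupted))
```

Only a record that passes both checks yields a counter for the failure test; a torn or uninitialized read is rejected before the using step runs.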
37. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster;
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value; and
removing from the cluster a failed node, the node's failure being detected by the probing and using steps, wherein the removing step comprises writing to a predetermined emergency message location on a nonvolatile storage device that was and possibly still is accessible to the failed node.
(Dependent claims: 38, 39, 40, 41, 72)
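The removing step of claim 37 can be sketched with an ordinary file standing in for the shared nonvolatile storage device. The offset, the message format, and both function names are assumptions for illustration. The reader side is also an assumption about how a node that is still running might consume such a message.

```python
import os
import struct
import tempfile

EMERGENCY_OFFSET = 4096           # hypothetical predetermined location on the shared device
MESSAGE = struct.Struct("<4sI")   # assumed format: tag, id of the node being removed
TAG = b"EVIC"                     # hypothetical tag value


def write_emergency_message(disk_path: str, failed_node_id: int) -> None:
    """Write the removal notice at the predetermined location and force it to storage."""
    with open(disk_path, "r+b") as disk:
        disk.seek(EMERGENCY_OFFSET)
        disk.write(MESSAGE.pack(TAG, failed_node_id))
        disk.flush()
        os.fsync(disk.fileno())


def must_stop_using_disk(disk_path: str, my_node_id: int) -> bool:
    """Check whether the emergency location names this node as removed."""
    with open(disk_path, "rb") as disk:
        disk.seek(EMERGENCY_OFFSET)
        raw = disk.read(MESSAGE.size)
    if len(raw) != MESSAGE.size:
        return False
    tag, node_id = MESSAGE.unpack(raw)
    return tag == TAG and node_id == my_node_id


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.img")
    with open(path, "wb") as image:
        image.write(b"\0" * 8192)             # blank stand-in for the shared disk
    before = must_stop_using_disk(path, 2)
    write_emergency_message(path, 2)
    after = must_stop_using_disk(path, 2)
    other_node = must_stop_using_disk(path, 3)
```

Because the message lives on storage that the removed node may still reach, the node can learn of its removal even when it can no longer be reached over the interconnect.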
42. A method for managing resources in a cluster, the method comprising the computer-implemented steps of:
remotely probing memory in a device in the cluster;
using one of the following to determine whether the probed device has failed: a counter value retrieved by the remote probing, or the unavailability of such a counter value; and
removing from the cluster a failed node, the node's failure being detected by the probing and using steps, wherein the removing step comprises returning to a global queue resources that were previously allocated to the failed node.
(Dependent claims: 43, 44, 45, 46, 73)
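The removing step of claim 42 amounts to reclaiming a failed node's allocations. The sketch below is a single-process illustration only. The `ResourcePool` class, its method names, and the string resource labels are hypothetical, and it ignores the locking and distribution concerns a real cluster would face.

```python
from collections import deque
from typing import Deque, Dict, List


class ResourcePool:
    """Global queue of sharable resources plus a per-node allocation record."""

    def __init__(self, resources: List[str]):
        self.global_queue: Deque[str] = deque(resources)
        self.allocated: Dict[str, List[str]] = {}

    def allocate(self, node: str) -> str:
        """Hand the next free resource to a node (raises IndexError if none are free)."""
        resource = self.global_queue.popleft()
        self.allocated.setdefault(node, []).append(resource)
        return resource

    def remove_failed_node(self, node: str) -> int:
        """Return everything the failed node held to the global queue."""
        reclaimed = self.allocated.pop(node, [])
        self.global_queue.extend(reclaimed)
        return len(reclaimed)


pool = ResourcePool(["buf0", "buf1", "buf2"])
pool.allocate("A")
pool.allocate("B")
pool.allocate("A")
reclaimed_count = pool.remove_failed_node("A")
free_now = list(pool.global_queue)
```

After node A is removed, both buffers it held are available again to the surviving nodes, while node B keeps its allocation.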
47. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for detecting node failure by remotely probing memory.
48. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for detecting interconnect failure by remotely probing memory.
49. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for detecting system area network switch failure by remotely probing memory.
50. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for one node to monitor a communication path to another node by remotely probing memory at regular time intervals.
(Dependent claims: 51)
52. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for a first node to monitor a communication path to a second node by remotely probing memory in preparation for sending a message from the first node to the second node.
53. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a remote memory probing and evaluation means for a first node to distinguish between a first condition in which a second node and an interconnect connected to the second node are operating normally and a second condition in which the second node is restarting and the interconnect is operating normally.
54. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a remote memory probing and evaluation means for a first node to distinguish between a first condition in which a second node and an interconnect connected to the second node are operating normally and a second condition in which the second node is operating normally after recently restarting and the interconnect is operating normally.
55. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a remote memory probing and evaluation means for a first node to distinguish between a first condition in which a second node and an interconnect connected to the second node are operating normally and a second condition in which software running on the second node has failed and the interconnect is operating normally.
56. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a remote memory probing and evaluation means for a first node to distinguish between a first condition in which a second node and an interconnect connected to the second node are operating normally and a second condition in which the second node has yielded control to a debugger and the interconnect is operating normally.
57. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a remote memory probing and evaluation means for a first node to distinguish between a first condition in which a second node and an interconnect connected to the second node are operating normally and a second condition in which hardware within the second node has failed and the interconnect is operating normally.
58. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a remote memory probing and evaluation means for a first node to distinguish between a first condition in which an interconnect connected to a second node is operating normally and a second condition in which the interconnect has failed.
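Claims 53 through 58 recite an evaluation means that tells several conditions apart, but the claims do not state which observation maps to which condition. The sketch below covers only a subset of those conditions, and every mapping in it is an assumption: a probe that cannot be completed is read as an interconnect failure, a changed epoch as a recent restart, and a readable but unchanged counter as a software failure. The `Probe` and `Condition` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Condition(Enum):
    NORMAL = auto()
    RECENTLY_RESTARTED = auto()
    SOFTWARE_FAILED = auto()
    INTERCONNECT_FAILED = auto()


@dataclass(frozen=True)
class Probe:
    epoch: int     # assumed to change each time the node restarts
    counter: int   # assumed to advance while node software is running


def classify(previous: Probe, current: Optional[Probe]) -> Condition:
    """Compare two successive probes of the same node (assumed mapping, see lead-in)."""
    if current is None:
        return Condition.INTERCONNECT_FAILED   # simplification: the probe could not be completed
    if current.epoch != previous.epoch:
        return Condition.RECENTLY_RESTARTED
    if current.counter == previous.counter:
        return Condition.SOFTWARE_FAILED       # memory readable, but the counter is stuck
    return Condition.NORMAL


baseline = Probe(epoch=7, counter=100)
normal = classify(baseline, Probe(epoch=7, counter=101))
restarted = classify(baseline, Probe(epoch=8, counter=1))
stuck = classify(baseline, Probe(epoch=7, counter=100))
unreachable = classify(baseline, None)
```

A fuller evaluation would need more signals than these two fields, for example to separate a node that has yielded control to a debugger from one whose software has failed.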
59. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for remotely probing memory to obtain a probe structure containing a counter.
(Dependent claims: 60, 61, 62)
63. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for remotely probing memory to obtain a probe structure containing an epoch.
64. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for remotely probing memory to obtain a probe structure containing a root pointer.
65. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform; and
a management means for managing computational resources for use by the nodes, wherein the management means comprises a means for remotely probing memory to obtain a probe structure containing a status summary.
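Claims 59, 63, 64, and 65 each recite a probe structure containing one field: a counter, an epoch, a root pointer, or a status summary. The sketch below combines all four in one fixed-size record, which is an assumption, as are the field widths, the byte order, and the `ProbeStructure` name.

```python
import struct
from dataclasses import dataclass

# Assumed layout: 64-bit counter, 32-bit epoch, 64-bit root pointer, 32-bit status summary.
PROBE_LAYOUT = struct.Struct("<QIQI")


@dataclass(frozen=True)
class ProbeStructure:
    counter: int
    epoch: int
    root_pointer: int
    status_summary: int

    def pack(self) -> bytes:
        """Serialize the structure as it might sit in the probed memory."""
        return PROBE_LAYOUT.pack(
            self.counter, self.epoch, self.root_pointer, self.status_summary
        )

    @classmethod
    def unpack(cls, raw: bytes) -> "ProbeStructure":
        """Rebuild the structure from bytes returned by a remote probe."""
        return cls(*PROBE_LAYOUT.unpack(raw))


original = ProbeStructure(counter=12345, epoch=3, root_pointer=0x7F00_0000_1000, status_summary=0)
raw_bytes = original.pack()
decoded = ProbeStructure.unpack(raw_bytes)
```

Keeping the record small and fixed in size means a single remote read can fetch the whole structure.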
Specification