Resource management in a clustered computer system
Abstract
Methods, systems, and devices are provided for managing resources in a computing cluster. The managed resources include cluster nodes themselves, as well as sharable resources such as memory buffers and bandwidth credits that may be used by one or more nodes. Resource management includes detecting failures and possible failures by node software, node hardware, interconnects, and system area network switches and taking steps to compensate for failures and prevent problems such as uncoordinated access to a shared disk. Resource management also includes reallocating sharable resources in response to node failure, demands by application programs, or other events. Specific examples provided include failure detection by remote memory probes, emergency communication through a shared disk, and sharable resource allocation with minimal locking.
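As an illustration of the emergency communication through a shared disk mentioned above, the following is a minimal sketch, not the patented implementation. It assumes a regular file stands in for the shared non-volatile storage device. The sector number, the record layout, and the names `probe`, `remove_if_unreachable`, and `make_demo_disk` are all hypothetical and do not come from the source.

```python
import os
import struct
import tempfile

SECTOR_SIZE = 512        # assumed sector size
EMERGENCY_SECTOR = 8     # hypothetical predetermined sector for emergency messages
MAX_NODES = 4            # hypothetical cluster size for the demo
# Assumed record layout: node epoch (uint32) and an in-cluster flag (uint8).
RECORD = struct.Struct("<IB")


def record_offset(node_id):
    """Byte offset of a node's record inside the emergency message location."""
    return EMERGENCY_SECTOR * SECTOR_SIZE + node_id * RECORD.size


def write_record(disk_path, node_id, epoch, in_cluster):
    """Write one node record and force it out to the storage device."""
    with open(disk_path, "r+b") as disk:
        disk.seek(record_offset(node_id))
        disk.write(RECORD.pack(epoch, 1 if in_cluster else 0))
        disk.flush()
        os.fsync(disk.fileno())


def read_record(disk_path, node_id):
    """Return (epoch, in_cluster) for a node."""
    with open(disk_path, "rb") as disk:
        disk.seek(record_offset(node_id))
        epoch, flag = RECORD.unpack(disk.read(RECORD.size))
    return epoch, bool(flag)


def remove_if_unreachable(disk_path, node_id, probe):
    """If probe(node_id) reports a communication failure, mark the node removed.

    `probe` is a hypothetical callable returning True when the node still
    answers over the system area network. Returns True if the node was removed.
    """
    if probe(node_id):
        return False
    epoch, _ = read_record(disk_path, node_id)
    write_record(disk_path, node_id, epoch, in_cluster=False)
    return True


def make_demo_disk():
    """Create a zero-filled file standing in for the shared disk."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * (SECTOR_SIZE * (EMERGENCY_SECTOR + 1)))
    for node_id in range(MAX_NODES):
        write_record(path, node_id, epoch=1, in_cluster=True)
    return path
```

A caller would pass a real reachability check as `probe`. The sketch leaves out arbitration between nodes that try to update the same record at the same time.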
15 Claims
1. A method for managing resources in a cluster, the method comprising the steps of:
determining that reliable communications with a cluster node over a system area network has failed, the cluster node including a memory; and
updating a node record stored at an emergency message location on a shared non-volatile storage device to remove the node from the cluster.
3. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform, the nodes including respective memories;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure.
7. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location is specified as a predetermined sector location on a disk.
8. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location is specified as a predetermined file.
9. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location stores an emergency communication structure that identifies node epochs.
10. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location stores an emergency communication structure that identifies node roles.
11. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location stores an emergency communication structure that identifies a cluster master node.
12. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location stores an emergency communication structure that contains a status value indicating that a particular node should shut down a particular task.
13. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location stores an emergency communication structure that contains a status value indicating that a particular node should shut down all tasks.
14. A computer system comprising:
at least two interconnected nodes capable of presenting a uniform system image such that an application program views the interconnected nodes as a single computing platform;
a management means for managing computational resources for use by the nodes; and
a shared nonvolatile storage device, wherein the management means comprises an accessing means for accessing an emergency message location on the shared nonvolatile storage device in response to detection of a possible node failure, wherein the emergency message location stores an emergency communication structure that contains a status value indicating that a particular node should yield control to a debugger.
15. A computer storage medium having a configuration that represents data and instructions which will cause at least a portion of a computer system to perform method steps for managing resources in a cluster computing system, the method steps comprising the steps of determining that reliable communication with a cluster node over a system area network has failed, said cluster node including a memory, and updating a node record stored at an emergency message location on a shared nonvolatile storage device to remove the node from the cluster.
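The claims above recite an emergency communication structure that identifies node epochs, node roles, and a cluster master node, and that carries status values telling a node to shut down a particular task, shut down all tasks, or yield control to a debugger. One possible encoding is sketched below for illustration only. The field widths, the byte order, the header layout, and every name in the code are assumptions, not taken from the claims or the specification.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical node roles."""
    MEMBER = 0
    MASTER = 1


class Status(IntEnum):
    """Hypothetical status values mirroring the directives in the claims."""
    OK = 0
    SHUT_DOWN_TASK = 1        # shut down a particular task
    SHUT_DOWN_ALL_TASKS = 2   # shut down all tasks
    YIELD_TO_DEBUGGER = 3     # yield control to a debugger


@dataclass
class NodeEntry:
    epoch: int
    role: Role
    status: Status
    task_id: int = 0          # only meaningful with SHUT_DOWN_TASK


HEADER = struct.Struct("<HH")    # assumed: master node id, node count
ENTRY = struct.Struct("<IBBH")   # assumed: epoch, role, status, task id


def pack_structure(master_id, entries):
    """Serialize the structure into bytes suitable for a fixed disk location."""
    blob = HEADER.pack(master_id, len(entries))
    for entry in entries:
        blob += ENTRY.pack(entry.epoch, entry.role, entry.status, entry.task_id)
    return blob


def unpack_structure(blob):
    """Parse bytes read back from the emergency message location."""
    master_id, count = HEADER.unpack_from(blob, 0)
    entries = []
    for i in range(count):
        epoch, role, status, task_id = ENTRY.unpack_from(
            blob, HEADER.size + i * ENTRY.size
        )
        entries.append(NodeEntry(epoch, Role(role), Status(status), task_id))
    return master_id, entries
```

With these assumed widths, a two-node structure packs into 20 bytes, so it would fit within a single 512-byte sector or a small predetermined file.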
Specification