MAINTAINING HIGH AVAILABILITY OF A GROUP OF VIRTUAL MACHINES USING HEARTBEAT MESSAGES
First Claim
1. A system for maintaining high availability of a plurality of virtual machines in a fault domain, the system comprising:
a memory for storing a protected virtual machine list identifying a plurality of virtual machines within a fault domain that are to be maintained in an operating state;
a network communication interface configured to receive heartbeat messages from a plurality of hosts executing the virtual machines; and
a processor coupled to the memory and programmed to:
determine a host, within the plurality of hosts, from which the network communication interface has not received a heartbeat message within a first predetermined duration to identify an unreachable host;
determine whether the unreachable host has stored heartbeat data in a datastore within a second predetermined duration; and
restart a virtual machine executed by the unreachable host based on determining that the unreachable host has not stored heartbeat data in the datastore within the second predetermined duration.
Abstract
Embodiments maintain high availability of software application instances in a fault domain. Subordinate hosts are monitored by a master host. The subordinate hosts publish heartbeats via a network and datastores. Based at least in part on the published heartbeats, the master host determines the status of each subordinate host, distinguishing between subordinate hosts that are entirely inoperative and subordinate hosts that are operative but partitioned (e.g., unreachable via the network). The master host may restart software application instances, such as virtual machines, that are executed by inoperative subordinate hosts or that cease executing on partitioned subordinate hosts.
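The distinction the abstract draws between inoperative and partitioned hosts can be illustrated with a minimal Python sketch. It assumes the master tracks, per subordinate host, the timestamp of the last network heartbeat and the last datastore heartbeat. The names `HostStatus`, `classify_host`, and `vms_to_restart`, as well as the default timeout values, are hypothetical and do not come from the patent. The sketch also omits the case of virtual machines that cease executing on a partitioned host.

```python
from enum import Enum


class HostStatus(Enum):
    LIVE = "live"                # network heartbeat is recent
    PARTITIONED = "partitioned"  # operative, but unreachable via the network
    DEAD = "dead"                # neither heartbeat channel is recent


def classify_host(now, last_network_heartbeat, last_datastore_heartbeat,
                  network_timeout=10.0, datastore_timeout=30.0):
    """Classify a subordinate host from its two heartbeat channels.

    Timestamps are in seconds; None means no heartbeat was ever seen.
    The timeout defaults are illustrative assumptions only.
    """
    def fresh(timestamp, timeout):
        return timestamp is not None and (now - timestamp) <= timeout

    if fresh(last_network_heartbeat, network_timeout):
        return HostStatus.LIVE
    # Unreachable over the network: fall back to the datastore heartbeat.
    if fresh(last_datastore_heartbeat, datastore_timeout):
        return HostStatus.PARTITIONED
    return HostStatus.DEAD


def vms_to_restart(status, protected_vms, vms_on_host):
    """Return the protected VMs to restart; only a dead host yields any here."""
    if status is not HostStatus.DEAD:
        return []
    return [vm for vm in vms_on_host if vm in protected_vms]
```

In this sketch a host is treated as inoperative only when both channels have gone quiet, which is what keeps a network partition alone from triggering a restart.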
20 Claims
1. A system for maintaining high availability of a plurality of virtual machines in a fault domain, the system comprising:
a memory for storing a protected virtual machine list identifying a plurality of virtual machines within a fault domain that are to be maintained in an operating state;
a network communication interface configured to receive heartbeat messages from a plurality of hosts executing the virtual machines; and
a processor coupled to the memory and programmed to:
determine a host, within the plurality of hosts, from which the network communication interface has not received a heartbeat message within a first predetermined duration to identify an unreachable host;
determine whether the unreachable host has stored heartbeat data in a datastore within a second predetermined duration; and
restart a virtual machine executed by the unreachable host based on determining that the unreachable host has not stored heartbeat data in the datastore within the second predetermined duration.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method comprising:
transmitting, by a subordinate host, a heartbeat message to a master host, wherein the transmitted heartbeat message indicates that the subordinate host is operative at a current time;
storing, by the subordinate host, heartbeat data in a datastore, wherein the stored heartbeat data indicates that the subordinate host is operative at the current time; and
storing, by the subordinate host, a power-on list in the datastore, wherein the power-on list indicates software application instances for a master host to restart at a host other than the subordinate host if the subordinate host becomes inoperative.
Dependent claims: 9, 10, 12, 13, 14
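The three steps of claim 8 can be sketched in Python as follows. This is a rough illustration under stated assumptions: the datastore is modelled as a directory of JSON files, the network transmission is abstracted as a caller-supplied callable, and the function name `publish_heartbeat`, its parameters, and the file names are all hypothetical rather than anything the patent specifies.

```python
import json
import time
from pathlib import Path


def publish_heartbeat(host_id, datastore_dir, send_to_master, running_vms, now=None):
    """Publish one heartbeat round for a subordinate host.

    send_to_master is any callable that accepts a dict (a stand-in for the
    network path to the master host). The file layout is an assumption.
    """
    now = time.time() if now is None else now
    heartbeat = {"host": host_id, "timestamp": now}

    # Step 1: heartbeat message over the network to the master host.
    send_to_master(heartbeat)

    # Step 2: heartbeat data stored in the datastore.
    datastore = Path(datastore_dir)
    datastore.mkdir(parents=True, exist_ok=True)
    (datastore / f"{host_id}.heartbeat.json").write_text(json.dumps(heartbeat))

    # Step 3: power-on list naming the instances to restart elsewhere
    # if this host becomes inoperative.
    (datastore / f"{host_id}.poweron.json").write_text(json.dumps(sorted(running_vms)))
    return now
```

Passing `now` explicitly makes the function deterministic for testing; in a real deployment the call would be repeated on a timer.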
11. The method of claim 8, wherein the subordinate host is a first host, the method further comprising:
receiving, by the subordinate host, from a management server a host list that includes a version number, a plurality of host identifiers, and one or more heartbeat datastores associated with each host identifier of the plurality of host identifiers;
transmitting the version number to a second host, wherein the second host transmits to the subordinate host a request for the host list when the transmitted version number is greater than a version number included in a host list stored by the second host; and
transmitting the host list from the subordinate host to the second host in response to the request.
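The version-number exchange in claim 11 can be sketched in Python as a small in-process model. The class and method names (`HostList`, `Host`, `announce_version`, `wants_update`, `handle_request`) are hypothetical, and direct method calls stand in for whatever messages the hosts would actually exchange.

```python
from dataclasses import dataclass


@dataclass
class HostList:
    version: int
    # Maps each host identifier to its associated heartbeat datastores.
    heartbeat_datastores: dict


@dataclass
class Host:
    name: str
    host_list: HostList

    def wants_update(self, advertised_version):
        """Second host: request the list only if the advertised version is newer."""
        return advertised_version > self.host_list.version

    def handle_request(self):
        """First host: answer a request with a copy of its host list."""
        copied = {h: list(ds) for h, ds in self.host_list.heartbeat_datastores.items()}
        return HostList(self.host_list.version, copied)

    def announce_version(self, peer):
        """Send our version number to a peer; transfer the list if it asks."""
        if peer.wants_update(self.host_list.version):
            peer.host_list = self.handle_request()
            return True
        return False
```

Sending only the version number first keeps the common case cheap: the full host list is transferred only when the peer's copy is stale.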
15. One or more computer-readable storage media having computer-executable components comprising:
a datastore selection component that when executed causes at least one processor to:
identify a plurality of datastores to which a subordinate host has read-write access;
select from the plurality of datastores a first heartbeat datastore and a second heartbeat datastore based on a quantity of hosts that have access to the first and second heartbeat datastores; and
associate the first and second heartbeat datastores with the host; and
a heartbeat publication component that when executed causes at least one processor to:
transmit to a master host a heartbeat message indicating that the subordinate host is operative at a current time; and
store in the first and second heartbeat datastores heartbeat data indicating that the subordinate host is operative at the current time.
Dependent claims: 16, 17, 18, 19, 20
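The datastore selection component of claim 15 can be sketched in Python. The claim says only that selection is based on a quantity of hosts with access; this sketch assumes that means preferring the datastores reachable by the most hosts, with ties broken by name. The function name `select_heartbeat_datastores` and the shape of the `access` mapping are hypothetical.

```python
def select_heartbeat_datastores(host, access):
    """Pick up to two heartbeat datastores for a host.

    access maps a datastore name to the set of hosts with read-write access.
    Assumption: datastores shared by more hosts are preferred.
    """
    # Identify the datastores this host can read and write.
    candidates = [ds for ds, hosts in access.items() if host in hosts]
    # Rank by number of hosts with access (descending), then by name.
    ranked = sorted(candidates, key=lambda ds: (-len(access[ds]), ds))
    return ranked[:2]
```

For example, with `access = {"ds1": {"a", "b", "c"}, "ds2": {"a", "b"}, "ds3": {"a"}}`, host `"a"` would be associated with `ds1` and `ds2`, the two datastores visible to the most hosts.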
Specification