Maintaining high availability of a group of virtual machines using heartbeat messages
Abstract
Embodiments maintain high availability of software application instances in a fault domain. Subordinate hosts are monitored by a master host. The subordinate hosts publish heartbeats via a network and datastores. Based at least in part on the published heartbeats, the master host determines the status of each subordinate host, distinguishing between subordinate hosts that are entirely inoperative and subordinate hosts that are operative but partitioned (e.g., unreachable via the network). The master host may restart software application instances, such as virtual machines, that are executed by inoperative subordinate hosts or that cease executing on partitioned subordinate hosts.
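The status determination described in the abstract can be sketched as a small decision function. This is a minimal illustration only: the `HostStatus` names, the three boolean inputs, the choice to treat a ping response as "live", and the `vms_to_restart` helper are all assumptions made for this example and are not taken from the patent text.

```python
from enum import Enum
from typing import Set


class HostStatus(Enum):
    LIVE = "live"
    PARTITIONED = "partitioned"  # operative, but unreachable via the network
    DEAD = "dead"                # entirely inoperative


def classify_host(network_heartbeat_ok: bool,
                  ping_ok: bool,
                  datastore_heartbeat_fresh: bool) -> HostStatus:
    """Classify a subordinate host from the master's point of view.

    Assumption: a host that answers either heartbeats or pings is treated
    as live; the abstract does not spell out this case.
    """
    if network_heartbeat_ok or ping_ok:
        return HostStatus.LIVE
    if datastore_heartbeat_fresh:
        # Still writing heartbeats to a datastore, so it is running but cut off.
        return HostStatus.PARTITIONED
    return HostStatus.DEAD


def vms_to_restart(status: HostStatus,
                   protected: Set[str],
                   last_known_on_host: Set[str],
                   power_on_list: Set[str]) -> Set[str]:
    """Pick protected VMs to restart elsewhere (hypothetical policy)."""
    candidates = protected & last_known_on_host
    if status is HostStatus.DEAD:
        return candidates
    if status is HostStatus.PARTITIONED:
        # Only VMs that have ceased executing on the partitioned host.
        return candidates - power_on_list
    return set()
```

For a partitioned host, only the virtual machines missing from its power-on list are selected, which mirrors the abstract's distinction between inoperative hosts and hosts that are merely unreachable.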
20 Claims
1. A system for maintaining high availability of a plurality of virtual machines in a fault domain, the system comprising:
a memory for storing a list identifying a plurality of virtual machines within a fault domain that are to be maintained in an operating state;
a network communication interface configured to receive heartbeat messages from a plurality of hosts executing the virtual machines; and
a processor coupled to the memory and programmed to:
associate a datastore with a host by ranking a plurality of datastores based on their accessibility and discarding a datastore with an accessibility score that is below a threshold value, wherein the accessibility score is based on at least two of the following factors:
1) quantity of hosts that have access to the datastore,
2) whether the datastore is associated with the same storage device as another datastore, and
3) the file system type of the datastore;
monitor the heartbeat messages received from the plurality of hosts at predetermined intervals and determine the host, within the plurality of hosts, from which the network communication interface has not received a heartbeat message within a first predetermined duration to identify the host as an unreachable host;
after the unreachable host is identified, designate the unreachable host as a dead host when no response to a previously sent ping message is received from the unreachable host; and
if the dead host has stored heartbeat data in the associated datastore within a second predetermined duration, monitor a power-on list stored in the associated datastore by the dead host, wherein the power-on list indicates virtual machines being executed by the dead host, and change the status of the dead host.
Dependent claims: 2, 3, 4, 5, 6, 20
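The datastore ranking step recited in claim 1 can be illustrated with a short sketch. The claim names three factors and a threshold but gives no formula, so the `Datastore` record, the `FS_WEIGHT` table, and the additive scoring below are hypothetical choices made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Datastore:
    name: str
    host_count: int      # factor 1: hosts that have access to the datastore
    storage_device: str  # factor 2: used to detect a shared storage device
    fs_type: str         # factor 3: file system type


# Hypothetical weights; the claim does not assign values to file system types.
FS_WEIGHT: Dict[str, float] = {"vmfs": 1.0, "nfs": 0.5}


def accessibility_score(ds: Datastore, all_ds: List[Datastore], total_hosts: int) -> float:
    """Combine the three factors into one score (assumed additive form)."""
    shares_device = any(
        other is not ds and other.storage_device == ds.storage_device
        for other in all_ds
    )
    score = ds.host_count / total_hosts         # more reachable hosts is better
    score += 0.0 if shares_device else 0.5      # prefer independent devices
    score += FS_WEIGHT.get(ds.fs_type, 0.0)     # unknown file systems add nothing
    return score


def rank_datastores(all_ds: List[Datastore], total_hosts: int, threshold: float) -> List[Datastore]:
    """Rank by score, discarding datastores whose score is below the threshold."""
    scored = [(accessibility_score(ds, all_ds, total_hosts), ds) for ds in all_ds]
    kept = [(s, ds) for s, ds in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [ds for _, ds in kept]
```

A host would then be associated with the highest-ranked datastore that survives the threshold; how ties are broken is left open here.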
7. A method comprising:
associating a datastore with a subordinate host by ranking a plurality of datastores based on their accessibility and discarding a datastore with an accessibility score that is below a threshold value, wherein the accessibility score is based on at least two of the following factors:
1) quantity of hosts that have access to the datastore,
2) whether the datastore is associated with the same storage device as another datastore, and
3) the file system type of the datastore;
storing, by the subordinate host, heartbeat data in the associated datastore, wherein the stored heartbeat data indicates that the subordinate host is operative at the current time;
storing, by the subordinate host, a power-on list in the associated datastore, wherein the power-on list indicates software application instances for a master host to restart at a host other than the subordinate host if the subordinate host becomes inoperative;
wherein heartbeat messages received from the subordinate host is monitored at predetermined intervals and whether a heartbeat message is received from the subordinate host within a first predetermined duration is determined to identify an unreachable host;
wherein after the unreachable host is identified, the unreachable host is designated as a dead host when no response to a previously sent ping message is received from the unreachable host; and
wherein the power-on list stored in the associated datastore by the dead host is monitored and the status of the dead host is changed based on determining that the dead host has stored heartbeat data in the associated datastore within a second predetermined duration.
Dependent claims: 8, 9, 10, 11, 12, 13
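The two storing steps of claim 7 (heartbeat data and a power-on list written by the subordinate host) might look like the following sketch. It assumes, purely for illustration, that a datastore is reachable as a directory and that JSON files are used; the file names and payload fields are hypothetical and do not come from the patent.

```python
import json
import time
from pathlib import Path
from typing import Iterable, Optional


def publish_heartbeat(datastore_dir: Path, host_id: str, now: Optional[float] = None) -> Path:
    """Write a timestamped heartbeat record for this host (assumed file layout)."""
    path = datastore_dir / f"{host_id}.heartbeat.json"
    payload = {"host": host_id, "timestamp": time.time() if now is None else now}
    path.write_text(json.dumps(payload))
    return path


def publish_power_on_list(datastore_dir: Path, host_id: str, vm_ids: Iterable[str]) -> Path:
    """Write the list of VMs currently executing on this host (assumed file layout)."""
    path = datastore_dir / f"{host_id}.poweron.json"
    path.write_text(json.dumps(sorted(vm_ids)))
    return path
```

A subordinate host would call both functions periodically, so that a master reading the datastore can tell both that the host is still operative and which instances it is running.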
14. One or more non-transitory computer-readable storage media having computer-executable components comprising:
a datastore selection component that when executed causes at least one processor to:
identify a plurality of datastores to which a subordinate host has read-write access;
select from the plurality of datastores a first heartbeat datastore and a second heartbeat datastore based on a quantity of hosts that have access to the first and second heartbeat datastores; and
associate the first and second heartbeat datastores with the host by ranking a plurality of datastores based on their accessibility and discarding a datastore with an accessibility score that is below a threshold value, wherein the accessibility score is based on at least two of the following factors:
1) quantity of hosts that have access to the datastore,
2) whether the datastore is associated with the same storage device as another datastore, and
3) the file system type of the datastore;
a heartbeat publication component that when executed causes at least one processor to:
store in the first and second heartbeat datastores a heartbeat message indicating that the subordinate host is operative at the current time; and
an application instance monitoring component that when executed causes at least one processor to:
determine whether a heartbeat message is received from the subordinate host at the master host within a first predetermined duration to identify an unreachable host;
after the unreachable host is identified, designate the unreachable host as a dead host when no response to a previously sent ping message is received from the unreachable host; and
if the dead host has stored heartbeat data the heartbeat message in the first heartbeat datastore or in the second heartbeat datastore within a second predetermined duration, monitor a power-on list stored in the first heartbeat datastore or in the second heartbeat datastore by the dead host, wherein the power-on list indicates virtual machines being executed by the dead host, and change the status of the dead host.
Dependent claims: 15, 16, 17, 18
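Claim 14 adds a pair of heartbeat datastores, with the stored heartbeat accepted from either one. A rough sketch of that idea follows; the function names, the dictionary inputs, and the alphabetical tie-break are assumptions for illustration, not details taken from the claim.

```python
from typing import Dict, Optional, Tuple


def select_heartbeat_datastores(host_counts: Dict[str, int]) -> Tuple[str, str]:
    """Pick the two read-write datastores reachable by the most hosts.

    host_counts maps a datastore name to the quantity of hosts with access.
    Ties are broken by name (an arbitrary choice for this sketch).
    """
    if len(host_counts) < 2:
        raise ValueError("need at least two candidate datastores")
    ranked = sorted(host_counts, key=lambda name: (-host_counts[name], name))
    return ranked[0], ranked[1]


def has_fresh_heartbeat(last_written: Dict[str, Optional[float]],
                        selected: Tuple[str, str],
                        now: float,
                        second_duration: float) -> bool:
    """True if the host stored a heartbeat in either datastore recently enough."""
    for name in selected:
        ts = last_written.get(name)
        if ts is not None and now - ts <= second_duration:
            return True
    return False
```

Using two datastores means a single inaccessible datastore does not by itself make an operative host look inoperative.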
19. A method comprising:
associating a datastore with a host by ranking a plurality of datastores based on their accessibility and discarding a datastore with an accessibility score that is below a threshold value, wherein the accessibility score is based on at least two of the following factors:
1) quantity of hosts that have access to the datastore,
2) whether the datastore is associated with the same storage device as another datastore, and
3) the file system type of the datastore;
monitoring heartbeat messages received from the host at predetermined intervals and determining whether a heartbeat message is received from the host within a first predetermined duration to identify an unreachable host;
after the unreachable host is identified, designate the unreachable host as a dead host when no response to a previously sent ping message is received from the unreachable host; and
if the dead host has stored heartbeat data in the associated datastore within a second predetermined duration, monitoring a power-on list stored in the associated datastore by the dead host, wherein the power-on list indicates virtual machines being executed by the dead host, and change the status of the dead host.
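The monitoring and ping steps recited in claim 19 can be sketched as two small filters. The timestamp dictionary and the `ping` callable are hypothetical stand-ins for whatever bookkeeping and network probe an implementation would actually use.

```python
from typing import Callable, Dict, List


def find_unreachable(last_heartbeat: Dict[str, float],
                     now: float,
                     first_duration: float) -> List[str]:
    """Hosts whose last network heartbeat is older than the first duration."""
    return sorted(
        host for host, ts in last_heartbeat.items()
        if now - ts > first_duration
    )


def find_dead(unreachable: List[str], ping: Callable[[str], bool]) -> List[str]:
    """Unreachable hosts that also fail to answer a ping.

    `ping` is an assumed callable returning True when the host responded.
    """
    return [host for host in unreachable if not ping(host)]
```

Hosts returned by `find_dead` would then be checked against their associated datastore before their status is changed, as in the final step of the claim.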
Specification