Fault tolerant high availability meter
Abstract
A fault tolerant availability meter includes agents for stand-alone computers and for each node of a cluster. The agents monitor availability with timestamps and report uptime and downtime events to a server. Additionally, agents on nodes of a cluster monitor cluster, node and package availability and cluster configuration changes and report these events to the server. Events are stored locally on the stand-alone computers and nodes and, additionally, on the server. Events are tracked with sequence numbers. If the server receives an out-of-sequence event, an agent-server recovery procedure is initiated to restore the missing events from either the agents or the server. The server may generate availability reports for all monitored entities, including one or more stand-alone computers and one or more clusters of computers. Downtime is distinguished as planned or unplanned. Furthermore, unavailable and unreachable systems are identified.
16 Claims
1. A fault tolerant method of monitoring one or more computers for availability, comprising:
generating an event when a computer system detects a change in its status that affects availability;
transmitting the event from the computer system to a central repository; and
periodically re-transmitting the event if a receipt confirmation message is not received from the central repository.
2. The method of claim 1, further comprising:
storing the event in a local repository located on the computer system before transmitting the event.
3. The method of claim 1, further comprising:
holding the event in a queue if a receipt confirmation message is not received from the central repository.
4. The method of claim 3, further comprising:
receiving a status request from the central repository;
providing a status update on the computer system in response to the status request; and
providing events held in the queue to the central repository in response to the status request.
5. The method of claim 1, wherein the event is re-transmitted after one hour.
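The behaviour recited in claims 1 through 5 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Agent` class, its `send` callback (assumed to return `True` when a receipt confirmation arrives), and the shape of the status reply are all hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple

RETRY_INTERVAL_S = 3600.0  # one hour, the interval recited in claim 5


@dataclass
class Agent:
    """Hypothetical agent running on a monitored computer system."""

    # Assumption: send() returns True when a receipt confirmation is received.
    send: Callable[[str], bool]
    retry_interval_s: float = RETRY_INTERVAL_S
    local_log: List[str] = field(default_factory=list)  # local repository (claim 2)
    pending: Deque[Tuple[str, float]] = field(default_factory=deque)  # queue (claim 3)

    def report(self, event: str, now: float) -> None:
        """Store the event locally, transmit it, and queue it if unconfirmed."""
        self.local_log.append(event)
        if not self.send(event):
            self.pending.append((event, now + self.retry_interval_s))

    def tick(self, now: float) -> None:
        """Re-transmit queued events whose retry time has arrived."""
        for _ in range(len(self.pending)):
            event, due = self.pending.popleft()
            if now < due:
                self.pending.append((event, due))  # not due yet
            elif not self.send(event):
                self.pending.append((event, now + self.retry_interval_s))

    def on_status_request(self, now: float) -> dict:
        """Answer a status request and hand over the queued events (claim 4).

        Assumption: the reply itself is delivered reliably, so the queue
        can be emptied here.
        """
        queued = [event for event, _ in self.pending]
        self.pending.clear()
        return {"status": "up", "time": now, "queued_events": queued}
```

Passing the clock in as `now` keeps the sketch deterministic; a real agent would presumably drive `tick` from a timer.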
6. The method of claim 1, wherein the computer system is a cluster.
7. The method of claim 1, wherein the computer system is a stand-alone server.
8. The method of claim 1, wherein the change of status includes changes in availability and configuration.
9. The method of claim 1, wherein an event indicating a change in availability includes a timestamp, event type and source designator.
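As an illustration of the event contents recited in claim 9, the following Python sketch models an event carrying a timestamp, an event type and a source designator. The class name, the particular event types and the JSON encoding are assumptions made for the example; the claim does not prescribe a format.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class EventType(Enum):
    # Hypothetical event types; the claim only requires that a type be present.
    UP = "up"
    DOWN_PLANNED = "down_planned"
    DOWN_UNPLANNED = "down_unplanned"
    CONFIG_CHANGE = "config_change"


@dataclass(frozen=True)
class AvailabilityEvent:
    timestamp: float        # seconds since the epoch (assumed unit)
    event_type: EventType
    source: str             # source designator, e.g. a node or package name

    def to_json(self) -> str:
        """Serialize the event for transmission to the central repository."""
        record = asdict(self)
        record["event_type"] = self.event_type.value
        return json.dumps(record, sort_keys=True)
```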
10. A fault tolerant method of monitoring one or more computers for availability, comprising:
generating an event containing a sequence number when a computer system detects a change in its status that affects availability;
transmitting the event from the computer system to a central repository;
comparing the sequence number of the event with a next expected sequence number computed from reading the central repository; and
synchronizing events between the computer system and the central repository if the sequence number does not match the next expected sequence number.
11. The method of claim 10, further comprising:
storing events and sequence numbers in the central repository if the sequence number matches the next expected sequence number.
12. The method of claim 10, further comprising:
maintaining a copy of each event in a local repository on the computer system.
13. The method of claim 10, wherein the synchronizing step further comprises:
requesting missing events from the computer system if the sequence number is greater than the next expected sequence number.
14. The method of claim 10, wherein the synchronizing step further comprises:
if the sequence number is less than the next expected sequence number, determining whether the event has already been received;
transmitting missing events to the computer system from the central repository if the event has not already been received; and
discarding the event if the event has already been received.
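The sequence-number handling of claims 10 through 14 can be sketched as a single decision function. This is a simplified illustration under stated assumptions, not the patented procedure: the `reconcile` function and its action names are hypothetical, the repository is modelled as a plain dictionary, and "already received" is assumed to mean that the stored payload for that sequence number equals the incoming one.

```python
from typing import Dict, List, Tuple


def reconcile(repo: Dict[int, str], seq: int, payload: str) -> Tuple[str, List[int]]:
    """Decide what the central repository does with an incoming event.

    repo maps sequence number -> event payload for one computer system.
    Returns (action, sequence numbers the action applies to).
    """
    # Next expected number is computed from reading the repository (claim 10).
    expected = max(repo, default=0) + 1

    if seq == expected:
        repo[seq] = payload                      # claim 11: store on a match
        return ("store", [seq])

    if seq > expected:
        # Claim 13: events are missing on the server, so request them from
        # the computer system. Assumption: the incoming event is stored
        # only after the gap has been filled.
        return ("request_from_agent", list(range(expected, seq)))

    # seq < expected (claim 14)
    if repo.get(seq) == payload:
        return ("discard", [seq])                # already received
    # Not received before: the agent's history appears to be incomplete,
    # so events are sent to it from the repository. Assumption: everything
    # stored is sent, since this sketch does not track what the agent holds.
    return ("send_to_agent", sorted(repo))
```

For example, with events 1 and 2 stored, an incoming event numbered 5 yields a request for events 3 and 4.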
15. A system for measuring availability of computer systems, comprising:
a network;
a local support computer coupled to said network;
a computer system coupled to the network, said computer system programmed to monitor itself for availability and to transmit availability events to said local support node; and
a cluster of computers coupled to the network, said cluster of computers comprised of nodes and packages, each of the nodes being programmed to monitor itself for cluster, node and package availability and to transmit availability events to said local support node, wherein said local support node computes availability for the computer system and the cluster of computers based on the availability events received.
16. The system of claim 15, further comprising:
a remote support computer connectable to said local support computer for remotely operating said local support computer and for receiving availability data from said local support computer.
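One way a support node might compute availability from received events is sketched below in Python. The function name, the three event kinds and the assumption that each entity is up at the start of the reporting period are illustrative choices, not details taken from the claims.

```python
from typing import Dict, List, Tuple

# (timestamp in seconds, kind), where kind is "up", "down_planned"
# or "down_unplanned". These kind names are assumptions for the sketch.
Event = Tuple[float, str]


def availability(events: List[Event], start: float, end: float) -> Dict[str, float]:
    """Split the period [start, end] into up, planned-down and unplanned-down time.

    Assumes end > start and that the entity is up at `start`.
    """
    totals = {"up": 0.0, "down_planned": 0.0, "down_unplanned": 0.0}
    state, since = "up", start
    for ts, kind in sorted(events):
        ts = min(max(ts, start), end)     # clamp events to the reporting period
        totals[state] += ts - since       # close the interval spent in `state`
        state, since = kind, ts
    totals[state] += end - since          # time from the last event to `end`
    totals["availability_pct"] = 100.0 * totals["up"] / (end - start)
    return totals
```

Calling the function once per monitored entity (stand-alone computer, cluster, node or package) with that entity's events would give a per-entity report in which planned and unplanned downtime are kept separate.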
Specification