Using a data storage system for cluster I/O failure determination
Abstract
Techniques are disclosed relating to storing a log of write operations made to a first storage device by one of a plurality of host computers running an instance of a distributed application. The log of write operations is stored at a second storage device. The plurality of host computers communicate status information to the second storage device over respective communication paths. Upon a failure to communicate status information between one of the host computers and the second storage device, the second storage device reads from a predetermined location in the first storage device to determine whether the host computer is still performing write operations. If the second storage device reads an expected signature value written by the host computer, the host computer is deemed to have written data, which indicates that the host computer is operational but that the write operations have not been recorded by the second storage device.
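The determination described in the abstract can be sketched as a small decision function. This is only an illustrative sketch, not the patented implementation: the names `HostState`, `classify_silent_host`, and `read_signature` are hypothetical, and the sketch assumes the expected signature value is agreed between the host and the logging storage device in advance.

```python
from enum import Enum
from typing import Callable


class HostState(Enum):
    """Possible conclusions after a host stops reporting status (hypothetical names)."""
    WRITES_CEASED = "writes_ceased"        # host appears to have stopped writing
    WRITING_UNLOGGED = "writing_unlogged"  # host is alive, but its writes are not being logged


def classify_silent_host(read_signature: Callable[[], bytes], expected: bytes) -> HostState:
    """Classify a host whose status information has stopped arriving.

    read_signature is an assumed callback that reads the predetermined
    location on the storage device that receives the host's writes.
    """
    observed = read_signature()
    if observed == expected:
        # The host wrote the expected signature, so it is still operational;
        # its write operations are simply not reaching the log.
        return HostState.WRITING_UNLOGGED
    return HostState.WRITES_CEASED
```

A caller would invoke `classify_silent_host` only after a status-information failure has been detected, passing a reader bound to the predetermined location for that host.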
20 Claims
1. A method, comprising:
a first storage device storing a log of write operations to a second storage device, wherein the write operations are made by different instances of a distributed application executing on a plurality of host computer systems; and
the first storage device determining whether a failure to receive status information from a first of the plurality of host computer systems indicates a) that write operations from the first host computer system to the second storage device have ceased, or b) that write operations are being made by the first host computer system to the second storage device without being logged by the first storage device.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A non-transitory computer readable medium having program instructions stored thereon that, if executed by a first of a plurality of host computers implementing a distributed application, cause the first host computer to perform a method comprising:
receiving information indicative of a write operation from a first instance of the distributed application executing on the first host computer;
providing the information indicative of the write operation to a first storage device;
providing information to a second storage device, wherein the information provided to the second storage device is usable to recreate the write operation;
sending heartbeat information to the second storage device via a first communication path to indicate that the first host computer is operational; and
in response to detecting an error in providing the information to the second storage device via the first communication path, communicating to the second storage device via a second communication path to indicate that the first host computer remains operational.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
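The host-side fallback recited in claim 8 might look roughly like the following. This is a minimal sketch under stated assumptions: `send_heartbeat`, `first_path`, and `second_path` are hypothetical names, each path is modeled as a plain callable, and a path failure is assumed to surface as an `OSError` (which includes `ConnectionError`).

```python
from typing import Callable

# A communication path is modeled as a callable that delivers a payload
# or raises OSError when the path is unavailable (an assumption of this sketch).
Send = Callable[[bytes], None]


def send_heartbeat(first_path: Send, second_path: Send, payload: bytes = b"alive") -> str:
    """Send a heartbeat over the first path, falling back to the second on error.

    Returns the name of the path that carried the heartbeat.
    """
    try:
        first_path(payload)
        return "first"
    except OSError:
        # The first path failed; use the second path so the logging
        # storage device still learns that this host remains operational.
        second_path(payload)
        return "second"
```

If the second path also fails, the exception propagates to the caller, which leaves the retry or escalation policy open; the claim itself does not specify one.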
16. A non-transitory computer readable medium having program instructions stored thereon that, if executed by a first storage system, cause the first storage system to perform a method comprising:
a first storage system maintaining information indicative of write operations made by a plurality of host computer systems to a second storage system, wherein said maintaining includes:
receiving status information from at least a first of the plurality of host computer systems, wherein the status information indicates that the first host computer system and a first communication path between the first host computer system and the first storage system are operational; and
in response to the first storage system not receiving the status information from the first host computer system within a predetermined time period, determining a) whether write operations from the first host computer system to the second storage system have ceased, or b) whether write operations are being made by the first host computer system to the second storage system without being recorded by the first storage system.
Dependent claims: 17, 18, 19, 20
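The "predetermined time period" trigger in claim 16 could be tracked with a simple per-host timestamp table. The sketch below is illustrative only: `HeartbeatMonitor`, `record`, and `overdue` are hypothetical names, and timestamps are passed in explicitly (in seconds) so the logic does not depend on a real clock.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks when each host last reported status (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s            # the predetermined time period
        self.last_seen: Dict[str, float] = {}  # host name -> time of last status report

    def record(self, host: str, now: float) -> None:
        """Note that status information arrived from host at time now."""
        self.last_seen[host] = now

    def overdue(self, now: float) -> List[str]:
        """Return hosts whose last status report is older than the timeout."""
        return sorted(
            host for host, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Each host returned by `overdue` would then be examined further to decide between outcome a) and outcome b) of the claim.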
Specification