QUARANTINE AND REPAIR OF REPLICAS IN A QUORUM-BASED DATA STORAGE SYSTEM
First Claim
1. A method for improving data persistence and availability in a distributed data store where data is stored in a plurality of shards and a given shard is replicated across a plurality of nodes in the distributed system, where a quorum of replicas is needed for access to the given shard, the method operable within the distributed data store, the method comprising:
- detecting that a replica associated with the given shard is unavailable;
- determining whether the available replicas for the given shard represent a quorum;
- upon a determination that the available replicas do not represent a quorum, marking the unavailable replica as quarantined;
- upon a determination that the available replicas do represent a quorum, marking the unavailable replica to be deleted.
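The decision recited in the claim can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `Replica`, `Shard`, `ReplicaState`, and `handle_unavailable_replicas` are hypothetical, and the sketch assumes that a quorum means a strict majority of the configured replicas, which the claim itself does not specify. For simplicity it also applies the same marking to every unreachable replica of the shard, whereas the claim speaks of a single replica.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ReplicaState(Enum):
    AVAILABLE = "available"
    QUARANTINED = "quarantined"
    TO_BE_DELETED = "to_be_deleted"


@dataclass
class Replica:
    node_id: str
    reachable: bool = True  # result of some health check (hypothetical)
    state: ReplicaState = ReplicaState.AVAILABLE


@dataclass
class Shard:
    shard_id: str
    replicas: List[Replica] = field(default_factory=list)

    def quorum_size(self) -> int:
        # Assumption: a quorum is a strict majority of configured replicas.
        return len(self.replicas) // 2 + 1


def handle_unavailable_replicas(shard: Shard) -> None:
    """Mark each unreachable replica as quarantined or to be deleted."""
    available = [r for r in shard.replicas if r.reachable]
    has_quorum = len(available) >= shard.quorum_size()
    for replica in shard.replicas:
        if replica.reachable:
            continue
        if has_quorum:
            # Enough healthy copies remain, so the failed copy is expendable.
            replica.state = ReplicaState.TO_BE_DELETED
        else:
            # Without a quorum the failed copy may hold needed data: keep it.
            replica.state = ReplicaState.QUARANTINED
```

With three replicas, one failure leaves a majority and the failed replica is marked for deletion; two failures leave no majority and both failed replicas are quarantined, so their data is preserved for a later repair attempt.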
Abstract
A data storage system with quorum-based commits sometimes experiences replica failure, for example because a replica-hosting node becomes unavailable. In embodiments described herein, such failed replicas can be quarantined rather than deleted, and the quarantined replicas can subsequently be recovered. The teachings hereof provide data storage with improved fault tolerance, resiliency, and data availability.
66 Citations
27 Claims
1. A method for improving data persistence and availability in a distributed data store where data is stored in a plurality of shards and a given shard is replicated across a plurality of nodes in the distributed system, where a quorum of replicas is needed for access to the given shard, the method operable within the distributed data store, the method comprising:
- detecting that a replica associated with the given shard is unavailable;
- determining whether the available replicas for the given shard represent a quorum;
- upon a determination that the available replicas do not represent a quorum, marking the unavailable replica as quarantined;
- upon a determination that the available replicas do represent a quorum, marking the unavailable replica to be deleted.
Dependent claims: 2, 3, 4, 5
6. A data storage system with improved operation, comprising:
- a plurality of distributed nodes, each node comprising a microprocessor and memory storing instructions to be executed on the microprocessor for operating that node, the plurality of nodes including a plurality of storage nodes each storing at least one replica of a shard;
- one or more processes executing on one or more of the plurality of nodes to detect and handle faults in replicas, the one or more processes executing the following steps:
- detecting that a replica associated with the shard is unavailable;
- determining whether the available replicas for the shard represent a quorum;
- upon a determination that the available replicas do not represent a quorum, marking the unavailable replica as quarantined;
- upon a determination that the available replicas do represent a quorum, marking the unavailable replica to be deleted.
Dependent claims: 7, 8, 9, 10
11-21. (canceled)
22. A data storage system with improved operation, comprising:
- a plurality of distributed nodes, each node comprising means for operating that node, the plurality of nodes including a plurality of storage nodes each storing at least one replica of a shard;
- means for detecting that a replica associated with the shard is unavailable;
- means for determining whether the available replicas for the shard represent a quorum;
- means for, upon a determination that the available replicas do not represent a quorum, marking the unavailable replica as quarantined;
- means for, upon a determination that the available replicas do represent a quorum, marking the unavailable replica to be deleted.
Dependent claims: 23, 24, 25, 26
27-31. (canceled)
Specification