Distributed computing fault management
Abstract
An automated system may be employed to detect, analyze, and recover from faults occurring in a distributed computing system. Faults may be recorded in a metadata store for verification and analysis by an automated fault management process. Diagnostic procedures may confirm detected faults. The automated fault management process may perform recovery workflows involving operations such as rebooting faulting devices and excommunicating unrecoverable computing nodes from affected clusters.
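The recording and verification steps described in the abstract can be illustrated with a minimal Python sketch. Everything named here (`MetadataStore`, `FaultRecord`, `FaultStatus`, `verify_faults`, and the component names) is a hypothetical stand-in chosen for illustration; the source does not specify any data model or API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class FaultStatus(Enum):
    RECORDED = "recorded"    # reported, not yet verified
    CONFIRMED = "confirmed"  # a diagnostic reproduced the fault
    DISMISSED = "dismissed"  # a diagnostic found nothing wrong


@dataclass
class FaultRecord:
    node_id: str
    component: str
    status: FaultStatus = FaultStatus.RECORDED


@dataclass
class MetadataStore:
    """Hypothetical in-memory stand-in for the metadata store."""
    records: List[FaultRecord] = field(default_factory=list)

    def record(self, node_id: str, component: str) -> FaultRecord:
        rec = FaultRecord(node_id, component)
        self.records.append(rec)
        return rec

    def pending(self) -> List[FaultRecord]:
        # Return a new list so callers may update statuses while iterating.
        return [r for r in self.records if r.status is FaultStatus.RECORDED]


def verify_faults(store: MetadataStore,
                  diagnostics: Dict[str, Callable[[str], bool]]) -> None:
    """Run the diagnostic registered for each pending fault's component.

    A diagnostic takes a node id and returns True if it confirms the fault.
    Faults with no registered diagnostic stay pending (an assumption).
    """
    for rec in store.pending():
        diagnostic = diagnostics.get(rec.component)
        if diagnostic is None:
            continue
        confirmed = diagnostic(rec.node_id)
        rec.status = FaultStatus.CONFIRMED if confirmed else FaultStatus.DISMISSED


if __name__ == "__main__":
    store = MetadataStore()
    store.record("node-a", "disk")
    store.record("node-b", "nic")
    verify_faults(store, {"disk": lambda n: True, "nic": lambda n: False})
    for rec in store.records:
        print(rec.node_id, rec.component, rec.status.value)
```

Keeping verification separate from recording means any node can report a fault cheaply, while a single fault management process decides which reports are real before starting a recovery workflow.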
20 Claims
1. A distributed database system comprising:
- a plurality of computing nodes comprising at least a first subset of the plurality of computing nodes, the first subset configured to perform a distributed computing function, one or more of the plurality of computing nodes configured at least to:
detect a fault involving the first subset of the plurality of computing nodes;
perform one or more diagnostic procedures involving at least a component connected to a first computing node of the first subset of the plurality of computing nodes, the one or more diagnostic procedures selected based at least in part on determining that the component is a potential origin of the fault;
perform a first one or more operations involving the first computing node, the first one or more operations selected based at least in part on the performing of the one or more diagnostic procedures; and
reconfigure the first subset of the plurality of computing nodes to perform the distributed computing function without the first computing node upon determining that performing the first one or more operations has not resolved the fault.
Dependent claims: 2, 3, 4
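The sequence recited in claim 1 (detect, diagnose the suspected component, perform operations chosen from the diagnostic result, then reconfigure without the node if the fault persists) might be sketched as follows. This is an illustrative reading only, not the patented implementation: `Cluster`, `handle_fault`, the finding strings, and the callback signatures are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Cluster:
    """Hypothetical model of the subset of nodes performing the function."""
    members: Set[str]

    def reconfigure_without(self, node_id: str) -> None:
        # Continue the distributed function on the remaining members.
        self.members.discard(node_id)


def handle_fault(cluster: Cluster,
                 node_id: str,
                 suspect_component: str,
                 diagnostics: Dict[str, Callable[[str], str]],
                 operations: Dict[str, List[Callable[[str], None]]],
                 fault_resolved: Callable[[str], bool]) -> str:
    """Return "recovered" or "removed" for a detected fault on node_id."""
    # The diagnostic is selected by the component suspected as the origin.
    diagnose = diagnostics.get(suspect_component)
    finding = diagnose(node_id) if diagnose else "unknown"

    # The first operations are selected by what the diagnostic found.
    for operation in operations.get(finding, []):
        operation(node_id)

    if fault_resolved(node_id):
        return "recovered"

    # The operations did not resolve the fault: drop the node from the subset.
    cluster.reconfigure_without(node_id)
    return "removed"


if __name__ == "__main__":
    cluster = Cluster({"n1", "n2", "n3"})
    outcome = handle_fault(
        cluster, "n2", "disk",
        diagnostics={"disk": lambda n: "io_errors"},
        operations={"io_errors": [lambda n: print("rebooting", n)]},
        fault_resolved=lambda n: False,
    )
    print(outcome, sorted(cluster.members))
```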
5. A method for fault recovery comprising:
- detecting a fault involving a first subset of a plurality of computing nodes, the first subset configured to perform a distributed computing function;
- performing, by at least one of the plurality of computing nodes, one or more diagnostic procedures involving at least a component of a first computing node of the first subset of the plurality of computing nodes, the one or more diagnostic procedures selected based at least in part on determining, by at least one of the plurality of computing nodes, that the component is a potential origin of the fault;
- selecting, by at least one of the plurality of computing nodes, a first one or more operations involving the first computing node, the first one or more operations selected based in part on the performing of the one or more diagnostic procedures; and
- reconfiguring the first subset of the plurality of computing nodes to stop the first computing node from performing the distributed computing function upon determining that performing the first one or more operations has not resolved the fault.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13
14. A non-transitory computer-readable storage medium having stored thereon instructions that, upon execution by a computing device, cause the computing device at least to:
- receive information indicative of a fault involving a first subset of a plurality of computing nodes, the first subset configured to perform a distributed computing function;
- select one or more diagnostic procedures, the one or more diagnostic procedures involving at least a component of a first computing node of the first subset of the plurality of computing nodes, the one or more diagnostic procedures selected based at least in part on determining that the component is a potential origin of the fault;
- select a first one or more operations involving the first computing node, the first one or more operations selected based at least in part on performing the one or more diagnostic procedures; and
- select a second one or more operations involving the first computing node upon determining that performing the first one or more operations has not resolved the fault, wherein the second one or more operations comprises excluding the first computing node from performing the distributed computing function.
Dependent claims: 15, 16, 17, 18, 19, 20
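Claim 14 is framed as selection rather than execution: the instructions choose a first set of operations from the diagnostic result, then choose a second set that includes excluding the node once the first set has failed. A small planner function can illustrate that two-stage choice. The operation names, the `FIRST_LINE` table, and the `select_operations` signature below are hypothetical and do not come from the source.

```python
from typing import Dict, List

# Hypothetical operation name for the second-stage action.
EXCLUDE_NODE = "exclude_node"

# Hypothetical mapping from a diagnostic finding to first-stage operations.
FIRST_LINE: Dict[str, List[str]] = {
    "disk_error": ["remount_volume", "reboot_node"],
    "nic_error": ["restart_network", "reboot_node"],
}


def select_operations(finding: str,
                      first_attempted: bool,
                      resolved: bool) -> List[str]:
    """Select, but do not execute, the next operations for a faulting node."""
    if resolved:
        return []  # nothing more to do
    if not first_attempted:
        # First stage: chosen from the diagnostic finding, with an assumed
        # default of a reboot when the finding is not in the table.
        return list(FIRST_LINE.get(finding, ["reboot_node"]))
    # Second stage: the first operations did not resolve the fault.
    return [EXCLUDE_NODE]


if __name__ == "__main__":
    print(select_operations("disk_error", first_attempted=False, resolved=False))
    print(select_operations("disk_error", first_attempted=True, resolved=False))
```

Returning operation names instead of running them keeps the decision logic testable on its own and leaves execution to whichever workflow engine the system uses.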
Specification