Distributed computing fault management

US 9,274,902 B1
Filed: 08/07/2013
Issued: 03/01/2016
Est. Priority Date: 08/07/2013
Status: Active Grant



Abstract

An automated system may be employed to perform detection, analysis and recovery from faults occurring in a distributed computing system. Faults may be recorded in a metadata store for verification and analysis by an automated fault management process. Diagnostic procedures may confirm detected faults. The automated fault management process may perform recovery workflows involving operations such as rebooting faulting devices and excommunicating unrecoverable computing nodes from affected clusters.
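
As an illustration only, the workflow summarized in the abstract might be sketched as follows. The names `Node`, `Cluster`, `run_diagnostics`, and `recover` are hypothetical and do not come from the patent. The in-memory `fault_log` merely stands in for the metadata store, and the sketch assumes a reboot is the only first-line recovery operation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True            # hypothetical health flag read by diagnostics
    fixed_by_reboot: bool = False   # hypothetical: True if the fault is transient

    def reboot(self) -> None:
        # A reboot only clears the fault when the underlying problem is transient.
        if self.fixed_by_reboot:
            self.healthy = True


@dataclass
class Cluster:
    members: list
    fault_log: list = field(default_factory=list)  # stand-in for the metadata store

    def record_fault(self, node: Node, description: str) -> None:
        self.fault_log.append((node.name, description))

    def excommunicate(self, node: Node) -> None:
        # Reconfigure the cluster so it operates without the faulty node.
        self.members = [m for m in self.members if m is not node]


def run_diagnostics(node: Node) -> bool:
    """Return True when the fault is confirmed (the node is still unhealthy)."""
    return not node.healthy


def recover(cluster: Cluster, node: Node, description: str) -> str:
    """Record, confirm, attempt recovery, and exclude the node if recovery fails."""
    cluster.record_fault(node, description)
    if not run_diagnostics(node):
        return "not confirmed"
    node.reboot()
    if not run_diagnostics(node):   # repeat diagnostics after the reboot
        return "recovered"
    cluster.excommunicate(node)
    return "excommunicated"
```

In this sketch a node is excluded only after the recovery operation has been tried and the repeated diagnostic still confirms the fault, which mirrors the ordering described in the abstract.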


20 Claims

1. A distributed database system comprising:
- a plurality of computing nodes comprising at least a first subset of the plurality of computing nodes, the first subset configured to perform a distributed computing function, one or more of the plurality of computing nodes configured at least to:
  
  detect a fault involving the first subset of the plurality of computing nodes;
  
  perform one or more diagnostic procedures involving at least a component connected to a first computing node of the first subset of the plurality of computing nodes, the one or more diagnostic procedures selected based at least in part on determining that the component is a potential origin of the fault;
  
  perform a first one or more operations involving the first computing node, the first one or more operations selected based at least in part on the performing of the one or more diagnostic procedures; and
  
  reconfigure the first subset of the plurality of computing nodes to perform the distributed computing function without the first computing node upon determining that performing the first one or more operations has not resolved the fault.
  - 2. The system of claim 1, wherein the fault corresponds to a first region of a storage device, one or more of the plurality of computing nodes further configured at least to:
    - select a second region of the storage device based at least in part on association with the first region; and
      
      perform at least one diagnostic procedure, of the one or more diagnostic procedures, that performs a read or write operation on the second region of the storage device.
  - 3. The system of claim 1, further comprising one or more storage devices configured to store information indicative of the fault, one or more of the plurality of computing nodes configured at least to:
    - perform a second one or more operations upon determining that performing the first one or more operations has not resolved the fault, the second one or more operations selected based at least in part on one or more of recovery, repair, or replacement of the first computing node.
  - 4. The system of claim 1, the system further configured at least to:
    - repeat at least one of the one or more diagnostic procedures after rebooting the first computing node.
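
The region-based diagnostic of claim 2 could look roughly like the sketch below. It assumes, purely for illustration, that "association" means physical adjacency of numbered regions. The claim does not define the association, and `associated_regions`, `SimulatedDevice`, and `diagnose_neighbors` are hypothetical names.

```python
def associated_regions(first_region: int, total_regions: int, radius: int = 1) -> list:
    """Hypothetical association rule: regions within `radius` of the faulting one."""
    lo = max(0, first_region - radius)
    hi = min(total_regions - 1, first_region + radius)
    return [r for r in range(lo, hi + 1) if r != first_region]


class SimulatedDevice:
    """In-memory stand-in for a storage device with some unreadable regions."""

    def __init__(self, total_regions: int, bad_regions: set):
        self.total_regions = total_regions
        self.bad_regions = set(bad_regions)

    def read(self, region: int) -> bytes:
        if region in self.bad_regions:
            raise IOError(f"read failed at region {region}")
        return b"\x00"


def diagnose_neighbors(device: SimulatedDevice, first_region: int, radius: int = 1) -> dict:
    """Read each associated region and map it to True (readable) or False (failed)."""
    results = {}
    for region in associated_regions(first_region, device.total_regions, radius):
        try:
            device.read(region)
            results[region] = True
        except IOError:
            results[region] = False
    return results
```

A failure in a neighboring region would suggest the fault is not isolated to the region where it was first observed.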

5. A method for fault recovery comprising:
- detecting a fault involving a first subset of a plurality of computing nodes, the first subset configured to perform a distributed computing function;
  
  performing, by at least one of the plurality of computing nodes, one or more diagnostic procedures involving at least a component of a first computing node of the first subset of the plurality of computing nodes, the one or more diagnostic procedures selected based at least in part on determining, by at least one of the plurality of computing nodes, that the component is a potential origin of the fault;
  
  selecting, by at least one of the plurality of computing nodes, a first one or more operations involving the first computing node, the first one or more operations selected based in part on the performing of the one or more diagnostic procedures; and
  
  reconfiguring the first subset of the plurality of computing nodes to stop the first computing node from performing the distributed computing function upon determining that performing the first one or more operations has not resolved the fault.
  - 6. The method of claim 5, wherein the fault corresponds to a first region of a storage device, further comprising:
    - selecting a second region of the storage device based at least in part on association with the first region; and
      
      performing at least one diagnostic procedure, of the one or more diagnostic procedures, that performs a read or write operation on the second region of the storage device.
  - 7. The method of claim 5, further comprising:
    - repeating at least one of the one or more diagnostic procedures after rebooting the first computing node.
  - 8. The method of claim 5, wherein the distributed computing function involves storage and retrieval of data.
  - 9. The method of claim 8, further comprising:
    - replicating data from the first computing node to at least one of the plurality of computing nodes prior to reconfiguring the first subset of the plurality of computing nodes.
  - 10. The method of claim 5, further comprising:
    - storing information indicative of the fault on one or more storage devices; and
      
      selecting the first one or more operations based at least in part on retrieving the information from the one or more storage devices.
  - 11. The method of claim 5, further comprising:
    - storing information indicative of the first one or more operations on one or more storage devices and indicative of an order in which to perform the first one or more operations.
  - 12. The method of claim 5, further comprising:
    - selecting, by at least one of the plurality of computing nodes, a second one or more operations involving the first computing node, the second one or more operations selected based at least in part on one or more of recovery, repair, or replacement of the first computing node.
  - 13. The method of claim 12, further comprising:
    - postponing performance of at least one of the first one or more operations or the second one or more operations based at least in part on an operational status of a second subset of the plurality of computing nodes.
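
Claims 11 and 13 together describe operations kept in a stored order and postponed depending on the status of other nodes. A minimal sketch under assumed rules follows. The `min_healthy` threshold and the function name `next_operation` are hypothetical, since the claims do not say how operational status is evaluated.

```python
from collections import deque


def next_operation(pending: deque, other_subset_status: dict, min_healthy: int):
    """Return the next operation in stored order, or None if it is postponed.

    `other_subset_status` maps node names in a second subset to True (operational)
    or False. The operation is postponed while fewer than `min_healthy` of those
    nodes are operational (an assumed policy, not one stated in the claims).
    """
    healthy = sum(1 for ok in other_subset_status.values() if ok)
    if healthy < min_healthy:
        return None  # postponed: the queue is left untouched
    return pending.popleft() if pending else None
```

For example, with `pending = deque(["reboot", "replace disk"])`, the call returns `None` while only one of two peer nodes is operational, and returns `"reboot"` once both are.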

14. A non-transitory computer-readable storage medium having stored thereon instructions that, upon execution by a computing device, cause the computing device at least to:
- receive information indicative of a fault involving a first subset of a plurality of computing nodes, the first subset configured to perform a distributed computing function;
  
  select one or more diagnostic procedures, the one or more diagnostic procedures involving at least a component of a first computing node of the first subset of the plurality of computing nodes, the one or more diagnostic procedures selected based at least in part on determining that the component is a potential origin of the fault;
  
  select a first one or more operations involving the first computing node, the first one or more operations selected based at least in part on performing the one or more diagnostic procedures; and
  
  select a second one or more operations involving the first computing node upon determining that performing the first one or more operations has not resolved the fault, wherein the second one or more operations comprises excluding the first computing node from performing the distributed computing function.
  - 15. The computer-readable storage medium of claim 14, comprising further instructions that, upon execution by the computing device, cause the computing device to at least:
    - receive information indicative of a read or write operation on a first region of a storage device, the operation corresponding to the fault.
  - 16. The computer-readable storage medium of claim 14, comprising further instructions that, upon execution by the computing device, cause the computing device to at least:
    - receive information indicative of operational status corresponding to at least one of the plurality of computing nodes other than the first computing node.
  - 17. The computer-readable storage medium of claim 14, comprising further instructions that, upon execution by the computing device, cause the computing device to at least:
    - determine to replicate data from the first computing node to at least one of the plurality of computing nodes, the determination based at least in part on performing the one or more diagnostic procedures.
  - 18. The computer-readable storage medium of claim 14, further comprising selecting the second one or more operations based at least in part on one or more of recovery, repair, or replacement of the first computing node.
  - 19. The computer-readable storage medium of claim 14, comprising further instructions that, upon execution by the computing device, cause the computing device to at least:
    - calculate a level of risk associated with performing an operation involving the first computing node, the level of risk corresponding to a likelihood of ceasing to perform the distributed computing function.
  - 20. The computer-readable storage medium of claim 14, wherein at least one of the first one or more operations or the second one or more operations involves personnel scheduling.


Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Morley, Adam Douglas; Lu, Yijun; Rath, Timothy Andrew; Muniswamy-Reddy, Kiran-Kumar; Huang, Xianglong; Hunter, Barry Bailey Jr.; Zheng, Jiandan
Primary Examiner(s)
Ehne, Charles

Application Number
US13/961,720
Time in Patent Office
937 Days
Field of Search
US Class Current
1/1
CPC Class Codes
G06F 11/0709   in a distributed system con...
G06F 11/0787   Storage of error reports, e...
G06F 11/079   Root cause analysis, i.e. e...
G06F 11/0793   Remedial or corrective acti...
G06F 11/2002   where interconnections or c...
G06F 11/2094   Redundant storage or storag...
