DEFERRED, BULK MAINTENANCE IN A DISTRIBUTED STORAGE SYSTEM

US 20170277609A1
Filed: 03/22/2016
Published: 09/28/2017
Est. Priority Date: 03/22/2016
Status: Active Grant

Abstract

Failed capacity of a distributed storage system is determined. The distributed storage system includes a plurality of storage nodes, wherein the plurality of storage nodes include at least one storage device to store data objects, wherein the data objects have been divided into constituent fragments in the distributed storage system. Protection capacity of the distributed storage system is determined. Protection capacity includes the data fragments generated to allow the data objects to be rebuilt in response to at least a part of the data objects being either lost or corrupted. A probability is determined that the failed capacity overlaps with the used capacity of the distributed storage system prior to a next periodically scheduled maintenance of the distributed storage system. In response to the probability exceeding a risk threshold, a next maintenance of the distributed storage system is scheduled that comprises reducing the failed capacity.
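The abstract does not say how the probability is computed. As a minimal sketch only, the check could be modeled by assuming drive failures arrive as a Poisson process. Everything named here is an assumption for illustration and not taken from the patent: the Poisson model, the `headroom_tb` parameter (how much more capacity may fail before the failed capacity reaches the capacity being guarded), uniform drive sizes, and the function names.

```python
import math


def poisson_tail(k: int, mean: float) -> float:
    """Return P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def overlap_probability(headroom_tb: float,
                        drive_tb: float,
                        operative_drives: int,
                        failures_per_drive_day: float,
                        days_to_maintenance: float) -> float:
    """Estimate the chance that enough drives fail before the next
    periodic maintenance to use up the remaining headroom.

    Assumptions (hypothetical): independent failures at a constant
    rate, and every drive holds drive_tb terabytes.
    """
    if headroom_tb <= 0:
        return 1.0  # the failed capacity already overlaps
    drives_needed = math.ceil(headroom_tb / drive_tb)
    expected_failures = (failures_per_drive_day
                         * operative_drives
                         * days_to_maintenance)
    return poisson_tail(drives_needed, expected_failures)


def should_schedule_early(probability: float, risk_threshold: float) -> bool:
    """Schedule maintenance only when the probability exceeds the threshold."""
    return probability > risk_threshold


# Example: 10 TB of headroom, 4 TB drives, 100 drives, 30 days to go.
p = overlap_probability(10.0, 4.0, 100, 0.001, 30)
print(should_schedule_early(p, risk_threshold=0.05))
```

A different failure model (for example, one fitted to observed drive ages) could replace `poisson_tail` without changing the threshold comparison.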

20 Claims

1. A method comprising:
- determining a failed capacity of a distributed storage system, wherein the distributed storage system includes a plurality of storage nodes, wherein the plurality of storage nodes include at least one storage device to store data objects, wherein the data objects are divided into data fragments in the distributed storage system;
  
  determining a protection capacity of the distributed storage system, wherein the protection capacity comprises storage configured to store at least a portion of the data fragments generated to allow the data objects to be rebuilt in response to at least a part of the data objects being either lost or corrupted;
  
  determining a first probability that the failed capacity overlaps with the protection capacity of the distributed storage system prior to a next periodically scheduled maintenance of the distributed storage system;
  
  determining whether the first probability exceeds a first risk threshold; and
  
  in response to the first probability exceeding the first risk threshold, scheduling a next maintenance of the distributed storage system that comprises reducing the failed capacity.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, further comprising:
    - determining a used capacity of the distributed storage system;
      
      determining a second probability that the failed capacity overlaps with the used capacity;
      
      determining whether the second probability exceeds a second risk threshold; and
      
      in response to the second probability exceeding the second risk threshold, scheduling a high priority maintenance of the distributed storage system that comprises reducing the failed capacity.
  - 3. The method of claim 1, wherein reducing the failed capacity comprises replacing any failed storage drives with operative storage drives.
  - 4. The method of claim 1, further comprising:
    - in response to the first probability not exceeding the first risk threshold, deferring reducing the failed capacity until a subsequent periodic scheduled maintenance of the distributed storage system.
  - 5. The method of claim 1, wherein the next maintenance is prior to the next periodically scheduled maintenance of the distributed storage system.
  - 6. The method of claim 1, wherein the determining of the first probability is based, at least in part, on a past failure rate of storage devices in the distributed storage system.
  - 7. The method of claim 1, wherein the determining of the first probability is based, at least in part, on a past usage rate of the storage devices in the distributed storage system.
  - 8. The method of claim 1, wherein the next maintenance of the distributed storage system comprises increasing an available storage capacity.
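
Claims 1, 2, and 4 together describe a decision with three outcomes. The sketch below is one hedged reading of that logic, not the patented implementation: the `Action` enum, the `choose_action` function, and the choice to test the second (used capacity) threshold before the first are all assumptions made for illustration. The claims do not state which check takes precedence.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    DEFER = "defer to a subsequent periodic scheduled maintenance"
    SCHEDULE_NEXT = "schedule a next maintenance"
    HIGH_PRIORITY = "schedule a high priority maintenance"


def choose_action(first_probability: float,
                  first_risk_threshold: float,
                  second_probability: Optional[float] = None,
                  second_risk_threshold: Optional[float] = None) -> Action:
    """Map the claimed probabilities and thresholds to an action.

    first_probability:  chance the failed capacity overlaps the
                        protection capacity (claim 1).
    second_probability: chance the failed capacity overlaps the
                        used capacity (claim 2), if it was computed.
    """
    # Assumption: an overlap with used capacity is the more serious
    # case, so it is checked first.
    if (second_probability is not None
            and second_risk_threshold is not None
            and second_probability > second_risk_threshold):
        return Action.HIGH_PRIORITY
    if first_probability > first_risk_threshold:
        return Action.SCHEDULE_NEXT
    # Claim 4: not exceeding the first threshold defers the work.
    return Action.DEFER


print(choose_action(0.02, 0.05).value)
print(choose_action(0.20, 0.05).value)
print(choose_action(0.20, 0.05, 0.03, 0.01).value)
```

Keeping the probabilities as inputs separates the policy from whatever estimator produces them, such as one based on past failure or usage rates as in claims 6 and 7.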

9. A non-transitory machine readable medium having stored thereon instructions for performing a method comprising machine executable code which when executed by at least one machine, causes the at least one machine to:
- determine a failed capacity of the distributed storage system, wherein the distributed storage system includes a plurality of storage nodes, wherein the plurality of storage nodes include at least one storage device to store data objects, wherein the data objects are divided into data fragments in the distributed storage system;
  
  determine a protection capacity of the distributed storage system, wherein the protection capacity comprises storage configured to store at least a portion of the data fragments generated to allow the data objects to be rebuilt in response to at least a part of the data objects being either lost or corrupted;
  
  determine a first probability that the failed capacity overlaps with the protection capacity of the distributed storage system prior to a next periodically scheduled maintenance of the distributed storage system;
  
  determine whether the first probability exceeds a first risk threshold; and
  
  in response to the probability exceeding the risk threshold, schedule the next maintenance of the distributed storage system that comprises a reduction of the failed capacity.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The non-transitory machine-readable storage medium of claim 9, further comprising machine executable code which when executed by the at least one machine, causes the at least one machine to:
    - determine a used capacity of the distributed storage system;
      
      determine a second probability that the failed capacity overlaps with the used capacity;
      
      determine whether the second probability exceeds a second risk threshold; and
      
      in response to the second probability exceeding the second risk threshold, schedule a high priority maintenance of the distributed storage system that comprises the reduction of the failed capacity.
  - 11. The non-transitory machine-readable storage medium of claim 9, wherein reduction of the failed capacity comprises replacement of any failed storage drives with operative storage drives.
  - 12. The non-transitory machine-readable storage medium of claim 9, further comprising machine executable code which when executed by the at least one machine, causes the at least one machine to:
    - in response to the first probability not exceeding the first risk threshold, defer reduction of the failed capacity until a subsequent periodic scheduled maintenance of the distributed storage system.
  - 13. The non-transitory machine-readable storage medium of claim 9, wherein the next bulk maintenance is prior to the next periodically scheduled maintenance of the distributed storage system.
  - 14. The non-transitory machine-readable storage medium of claim 9, wherein the machine executable code which when executed by at least one machine, causes the at least one machine to determine the first probability based, at least in part, on a past failure rate of storage devices in the distributed storage system.
  - 15. The non-transitory machine-readable storage medium of claim 9, wherein the machine executable code which when executed by at least one machine, causes the at least one machine to determine the first probability based, at least in part, on a past usage rate of the storage devices in the distributed storage system.
  - 16. The non-transitory machine-readable storage medium of claim 9, wherein the next maintenance of the distributed storage system comprises increasing an available storage capacity.

17. A computing device comprising:
- a processor; and
  
  a machine readable medium comprising machine executable code having stored thereon instructions executable by the processor to cause the computing device to:
  
  determine a used capacity of a distributed storage system, wherein the distributed storage system includes a plurality of storage nodes, wherein the plurality of storage nodes include at least one storage device to store data objects, wherein the data objects are divided into constituent fragments in the distributed storage system;
  
  determine a failed capacity of the distributed storage system;
  
  determine a probability that the failed capacity overlaps with the used capacity of the distributed storage system prior to a next periodically scheduled maintenance of the distributed storage system;
  
  determine whether the probability exceeds a risk threshold; and
  
  in response to the probability exceeding the risk threshold, schedule, prior to the next periodically scheduled maintenance, an intermittent bulk maintenance of the distributed storage system that comprises a reduction of the failed capacity.
- Dependent claims (18, 19, 20):
  - 18. The computing device of claim 17, wherein reduction of the failed capacity comprises replacement of any failed storage drives with operative storage drives.
  - 19. The computing device of claim 17, further comprising machine executable code executable by the processor to cause the computing device to:
    - in response to the probability not exceeding the risk threshold, defer reduction of the failed capacity until a subsequent periodic scheduled maintenance of the distributed storage system.
  - 20. The computing device of claim 17, wherein the machine executable code executable by the processor to cause the computing device to determine the probability is based, at least in part, on a past failure rate of storage devices in the distributed storage system.
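
Claim 20 bases the probability on a past failure rate but does not define how that rate is measured. One simple possibility, offered purely as an assumption, is failures per device-day over an observation window. The function name, the half-open window, and the constant device count are all hypothetical.

```python
from datetime import date
from typing import Iterable


def failures_per_device_day(failure_dates: Iterable[date],
                            window_start: date,
                            window_end: date,
                            device_count: int) -> float:
    """Average failures per device per day in [window_start, window_end).

    Assumption: device_count stays constant across the window.
    """
    days = (window_end - window_start).days
    if days <= 0 or device_count <= 0:
        raise ValueError("window and device_count must be positive")
    # Count only failures that fall inside the observation window.
    failures = sum(1 for d in failure_dates
                   if window_start <= d < window_end)
    return failures / (days * device_count)


history = [date(2016, 1, 5), date(2016, 1, 17), date(2016, 1, 29)]
rate = failures_per_device_day(history, date(2016, 1, 1),
                               date(2016, 1, 31), device_count=100)
print(rate)
```

A rate measured this way could feed any estimator of the probability that the failed capacity overlaps the used capacity before the next periodically scheduled maintenance.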


Current Assignee: NetApp, Inc.
Original Assignee: NetApp, Inc.
Inventors: Slik, David Anthony
Granted Patent: US 10,055,317 B2
CPC Class Codes

G06F 11/1088 Reconstruction on already f...

G06F 11/2094 Redundant storage or storag...
