Methods to identify, handle and recover from suspect SSDs in a clustered flash array

US 9,710,317 B2
Filed: 03/30/2015
Issued: 07/18/2017
Est. Priority Date: 03/30/2015
Status: Active Grant

Abstract

A technique predicts failure of one or more storage devices of a storage array serviced by a storage system and establishes one or more threshold conditions for replacing the storage devices. The predictive technique periodically monitors soft and hard failures of the storage devices (e.g., from Self-Monitoring, Analysis and Reporting Technology), as well as various usage counters pertaining to input/output (I/O) workloads and response times of the storage devices. A heuristic procedure may be performed that combines the monitored results to calculate the predicted failure and recommend replacement of the storage devices, using one or more thresholds based on current usage and failure patterns of the storage devices. In addition, one or more policies may be provided for replacing the storage devices in a cost-effective manner that ensures non-disruptive operation and/or replacement of the SSDs, while obviating a potential catastrophic scenario based on the usage and failure patterns of the storage devices.

475 Citations

18 Claims

1. A method comprising:
- receiving a write request having data directed towards a storage container stored on a plurality of solid state storage devices (SSDs) included in a storage array connected to a node;
  
  issuing an input/output (I/O) storage operation to store the data to a first SSD of the storage array;
  
  in response to an I/O error detected from the I/O storage operation, incrementing a first counter in a memory included in the node, wherein the first counter is associated with a first I/O error type;
  
  associating a set of counters with each SSD of the storage array, wherein the first counter is included in the set of counters associated with the first SSD, wherein the first I/O error type is a recovered error; and
  
  in response to the first counter exceeding a first predetermined threshold after a first periodic interval, issuing an alert to migrate the data from the first SSD to a second SSD of the storage array, and placing the first SSD into read-only service, and wherein the first predetermined threshold is chosen such that an expected failure of the first SSD occurs after a migration period for the data.
  - 2. The method of claim 1 wherein a second counter included in the set of counters associated with the first SSD has a second I/O error type different from the first I/O error type.
  - 3. The method of claim 2 further comprising:
    - determining whether the second counter exceeds a second threshold during a second periodic interval, wherein the second I/O error type is a medium error; and
      
      in response to determining that the second counter does not exceed the second threshold during the second periodic interval, determining whether the second counter exceeds a third threshold during a third periodic interval having a longer duration than the second periodic interval, wherein the third threshold is larger than the second threshold.
  - 4. The method of claim 1 wherein the first counter is reset in response to not detecting an I/O error of the first I/O error type during a second periodic interval occurring after the first periodic interval, the second periodic interval having a same duration as the first periodic interval.
  - 5. The method of claim 1 further comprising:
    - scheduling a staged replacement of the first SSD such that a minimum level of redundancy of the storage array is maintained during the migration period.
  - 6. The method of claim 1 wherein the I/O storage operation is a write operation and wherein a storage medium of each SSD wears out after approximately a same number of write operations.
  - 7. The method of claim 1 wherein the first SSD is powered down by the node in response to determining that an attribute of the first SSD indicates a power-on hours exceeds a power-on threshold.
  - 8. The method of claim 1 wherein a second counter included in the set of counters associated with the first SSD is associated with a timeout error type of I/O error, and wherein the first SSD has multi-layer-cell flash components.
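
The counter logic recited in claims 1 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `SsdCounters`, `record_error`, `end_interval` and `medium_errors_suspect`, the threshold values, and the decision to accumulate counts across intervals until a quiet interval resets them are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ErrorType(Enum):
    RECOVERED = "recovered"
    MEDIUM = "medium"
    TIMEOUT = "timeout"


@dataclass
class SsdCounters:
    """Set of error counters kept in node memory for one SSD (hypothetical layout)."""
    name: str
    thresholds: dict  # ErrorType -> count tolerated per interval (assumed values)
    counts: dict = field(default_factory=dict)
    seen: set = field(default_factory=set)  # error types observed in the current interval
    read_only: bool = False
    alerts: list = field(default_factory=list)

    def record_error(self, kind: ErrorType) -> None:
        """Increment the counter for one detected I/O error."""
        self.counts[kind] = self.counts.get(kind, 0) + 1
        self.seen.add(kind)

    def end_interval(self, target: str) -> bool:
        """Run at the end of each periodic interval; True means the SSD is suspect."""
        for kind in list(self.counts):
            if kind not in self.seen:
                self.counts[kind] = 0  # quiet interval: reset, in the spirit of claim 4
        self.seen.clear()
        for kind, count in self.counts.items():
            limit = self.thresholds.get(kind)
            if limit is not None and count > limit and not self.read_only:
                self.read_only = True  # no new writes; reads are still served
                self.alerts.append(
                    f"migrate data from {self.name} to {target} "
                    f"({kind.value} errors: {count})"
                )
        return self.read_only


def medium_errors_suspect(short_count, short_limit, long_count, long_limit):
    """Two-tier check in the spirit of claim 3: short interval first, then a longer one."""
    if long_limit <= short_limit:
        raise ValueError("the longer interval must use the larger threshold")
    if short_count > short_limit:
        return True
    return long_count > long_limit
```

In this sketch, calling `end_interval("ssd1")` on an SSD whose recovered-error count exceeds its threshold marks it read-only and queues one migration alert; the caller would then migrate the data within the migration period.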

9. A method comprising:
- receiving a write request having data directed towards a storage container stored on a plurality of solid state storage drives (SSDs) included in a storage array connected to a node;
  
  issuing an input/output (I/O) storage operation to store the data to a first SSD of the storage array;
  
  reading an attribute of the first SSD, wherein the attribute is selected from a group consisting of a number of defective blocks, a number of reserved blocks used, and a number of reassigned blocks; and
  
  in response to determining that the attribute exceeds a threshold, wherein the threshold is normalized based on a flash component type included in the first SSD and a storage capacity of the first SSD, issuing an alert to migrate the data from the first SSD to a second SSD of the storage array, and placing the first SSD into read-only service, and wherein the predetermined threshold is chosen such that an expected failure of the first SSD occurs after a migration period for the data.
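
One way to read the normalized threshold of claim 9 is as a per-capacity allowance that depends on the flash component type. The sketch below assumes that reading; the table `BLOCKS_PER_TB`, its values, and the function names are hypothetical and do not come from the patent.

```python
# Hypothetical allowance of defective/reserved/reassigned blocks per terabyte,
# keyed by flash component type. The numbers are placeholders, not vendor data.
BLOCKS_PER_TB = {"SLC": 200, "MLC": 100, "TLC": 50}


def normalized_threshold(flash_type: str, capacity_tb: float) -> float:
    """Scale the per-terabyte allowance by the capacity of the SSD."""
    return BLOCKS_PER_TB[flash_type] * capacity_tb


def should_migrate(attribute_value: int, flash_type: str, capacity_tb: float) -> bool:
    """True when the monitored attribute exceeds the normalized threshold."""
    return attribute_value > normalized_threshold(flash_type, capacity_tb)
```

Under these assumed values, a 2 TB MLC drive tolerates 200 affected blocks, so `should_migrate(201, "MLC", 2)` would trigger the alert and read-only transition while `should_migrate(200, "MLC", 2)` would not.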

10. A system comprising:
- a node of a cluster, the node having a memory connected to a processor via a bus;
  
  a storage array connected to the node having one or more solid state drives (SSDs);
  
  a storage input/output (I/O) stack executing on the processor of the node, the storage I/O stack configured to:
  
  receive a write request having data directed towards a storage container stored on the storage array;
  
  issue an I/O storage operation to a first SSD of the storage array;
  
  in response to an I/O error detected from the I/O storage operation, increment a first counter in the memory, wherein the first counter is associated with a first type of I/O error, and wherein the first counter is included in a set of counters associated with the first SSD, wherein the first type of I/O error is a recovered error; and
  
  in response to the first counter exceeding a first predetermined threshold after a first periodic interval, issue an alert to migrate the data from the first SSD to a second SSD of the storage array, and place the first SSD into read-only service, and wherein the first predetermined threshold is chosen such that an expected failure of the first SSD occurs after a migration period for the data.
  - 11. The system of claim 10 wherein a second counter included in the set of counters associated with the first SSD has a second type of I/O error different from the first type of I/O error.
  - 12. The system of claim 11 wherein the storage I/O stack is further configured to:
    - determine whether the second counter exceeds a second threshold during a second periodic interval, wherein the second type of I/O error is a medium error; and
      
      in response to determining that the second counter does not exceed the second threshold during the second periodic interval, determine whether the second counter exceeds a third threshold during a third periodic interval having a longer duration than the second periodic interval, wherein the third threshold is larger than the second threshold.
  - 13. The system of claim 10 wherein the first counter is reset in response to not detecting an I/O error of the first type of I/O error during a second periodic interval occurring after the first periodic interval, the second periodic interval having a same duration as the first periodic interval.
  - 14. The system of claim 10 wherein the storage I/O stack is further configured to:
    - schedule a staged replacement of the first SSD such that a minimum level of redundancy of the storage array is maintained during the migration period.
  - 15. The system of claim 10 wherein the first SSD is powered down by the storage I/O stack in response to determining that an attribute of the SSD indicates a power-on hours exceeds a power-on threshold.
  - 16. The system of claim 10 wherein the I/O storage operation is a write operation and wherein the storage I/O stack is further configured to:
    - migrate all data from a first shelf of SSDs having the first SSD to a second shelf of SSDs having the second SSD, wherein one or more flash components of each SSD of the first shelf wears out after approximately a same number of write operations.
  - 17. The system of claim 16 wherein the migration of the data from the first shelf to the second shelf occurs during the migration period.
  - 18. The system of claim 10 wherein the I/O storage operation is a write operation and wherein the storage I/O stack is further configured to:
    - schedule a staged replacement of all SSDs included in a shelf of the storage array, wherein one or more flash components of each SSD of the shelf wears out after approximately a same number of write operations, and wherein a minimum level of redundancy of the storage array is maintained during the staged replacement of the SSDs.
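
The staged replacement of claims 14 and 18 can be sketched as batching SSDs so that the array never drops below its minimum redundancy. The sketch assumes redundancy is modeled as the number of simultaneous SSD losses the array can survive; `plan_staged_replacement` and its parameters are hypothetical names.

```python
def plan_staged_replacement(ssds, tolerated_failures, min_redundancy):
    """Split the SSDs to replace into stages.

    tolerated_failures: SSD losses the array can survive at once (assumed model).
    min_redundancy: losses the array must still survive during every stage.
    """
    batch = tolerated_failures - min_redundancy
    if batch < 1:
        raise ValueError("no SSD can be removed without violating minimum redundancy")
    # Each stage removes at most `batch` SSDs, then waits for rebuild or migration.
    return [ssds[i:i + batch] for i in range(0, len(ssds), batch)]
```

For example, with an array that tolerates three losses and a minimum redundancy of one, five SSDs would be replaced in stages of two, two and one.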

Current Assignee: NetApp, Inc.
Original Assignee: NetApp, Inc.
Inventors: Gupta, Anish; Mohammed, Samiullah
Primary Examiner(s): Riad, Amine
Application Number: US14/673,258
Publication Number: US 20160292025A1
Time in Patent Office: 841 Days
Field of Search: 714 472

CPC Class Codes:
G06F 11/008   Reliability or availability...
G06F 11/0727   in a storage system, e.g. i...
G06F 11/0757   by exceeding a time limit, ...
G06F 11/076   by exceeding a count or rat...
G06F 11/0772   Means for error signaling, ...
G06F 11/079   Root cause analysis, i.e. e...
G06F 11/0793   Remedial or corrective acti...
G06F 3/0616   in relation to life time, e...
G06F 3/0647   Migration mechanisms
G06F 3/0653   Monitoring storage devices ...
G06F 3/0688   Non-volatile semiconductor ...
