Methods to identify, handle and recover from suspect SSDs in a clustered flash array
First Claim
1. A method comprising:
receiving a write request having data directed towards a storage container stored on a plurality of solid state storage devices (SSDs) included in a storage array connected to a node;
issuing an input/output (I/O) storage operation to store the data to a first SSD of the storage array;
in response to an I/O error detected from the I/O storage operation, incrementing a first counter in a memory included in the node, wherein the first counter is associated with a first I/O error type;
associating a set of counters with each SSD of the storage array, wherein the first counter is included in the set of counters associated with the first SSD, wherein the first I/O error type is a recovered error; and
in response to the first counter exceeding a first predetermined threshold after a first periodic interval, issuing an alert to migrate the data from the first SSD to a second SSD of the storage array, and placing the first SSD into read-only service, and wherein the first predetermined threshold is chosen such that an expected failure of the first SSD occurs after a migration period for the data.
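The counting and threshold steps recited above can be sketched in a few lines of Python. This is only an illustrative reading of the claim, not the patented implementation: the names `SsdState`, `record_io_error`, and `periodic_check`, the error-type label, and the threshold value of 3 are all assumptions introduced for the example.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical error-type label and threshold; the claim fixes neither value.
RECOVERED = "recovered"
THRESHOLDS = {RECOVERED: 3}


@dataclass
class SsdState:
    """In-memory bookkeeping kept by the node for one SSD."""
    name: str
    counters: Counter = field(default_factory=Counter)  # one counter per I/O error type
    read_only: bool = False


def record_io_error(ssd: SsdState, error_type: str) -> None:
    """Increment the counter for this SSD and error type after a failed or recovered I/O."""
    ssd.counters[error_type] += 1


def periodic_check(ssd: SsdState, spare: SsdState, thresholds=THRESHOLDS) -> list:
    """Run once per monitoring interval; return any alerts raised for this SSD."""
    alerts = []
    for error_type, limit in thresholds.items():
        if ssd.counters[error_type] > limit and not ssd.read_only:
            ssd.read_only = True  # no new writes land on the suspect SSD
            alerts.append(
                f"migrate data from {ssd.name} to {spare.name} "
                f"({error_type} errors: {ssd.counters[error_type]})"
            )
    return alerts
```

In this sketch the counter is only compared with the threshold when `periodic_check` runs, which mirrors the "after a first periodic interval" wording; errors recorded between checks accumulate silently until the next check.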
Abstract
A technique predicts failure of one or more storage devices of a storage array serviced by a storage system and establishes one or more threshold conditions for replacing the storage devices. The predictive technique periodically monitors soft and hard failures of the storage devices (e.g., from Self-Monitoring, Analysis and Reporting Technology), as well as various usage counters pertaining to input/output (I/O) workloads and response times of the storage devices. A heuristic procedure may be performed that combines the monitored results to calculate the predicted failure and recommend replacement of the storage devices, using one or more thresholds based on current usage and failure patterns of the storage devices. In addition, one or more policies may be provided for replacing the storage devices in a cost-effective manner that ensures non-disruptive operation and/or replacement of the SSDs, while obviating a potential catastrophic scenario based on the usage and failure patterns of the storage devices.
475 Citations
18 Claims
1. A method comprising:
receiving a write request having data directed towards a storage container stored on a plurality of solid state storage devices (SSDs) included in a storage array connected to a node;
issuing an input/output (I/O) storage operation to store the data to a first SSD of the storage array;
in response to an I/O error detected from the I/O storage operation, incrementing a first counter in a memory included in the node, wherein the first counter is associated with a first I/O error type;
associating a set of counters with each SSD of the storage array, wherein the first counter is included in the set of counters associated with the first SSD, wherein the first I/O error type is a recovered error; and
in response to the first counter exceeding a first predetermined threshold after a first periodic interval, issuing an alert to migrate the data from the first SSD to a second SSD of the storage array, and placing the first SSD into read-only service, and wherein the first predetermined threshold is chosen such that an expected failure of the first SSD occurs after a migration period for the data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A method comprising:
receiving a write request having data directed towards a storage container stored on a plurality of solid state storage drives (SSDs) included in a storage array connected to a node;
issuing an input/output (I/O) storage operation to store the data to a first SSD of the storage array;
reading an attribute of the first SSD, wherein the attribute is selected from a group consisting of a number of defective blocks, a number of reserved blocks used, and a number of reassigned blocks; and
in response to determining that the attribute exceeds a threshold, wherein the threshold is normalized based on a flash component type included in the first SSD and a storage capacity of the first SSD, issuing an alert to migrate the data from the first SSD to a second SSD of the storage array, and placing the first SSD into read-only service, and wherein the predetermined threshold is chosen such that an expected failure of the first SSD occurs after a migration period for the data.
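Claim 9 leaves open how the threshold is "normalized" by flash component type and capacity. One possible reading, shown below as a sketch only, scales a per-flash-type allowance by the number of blocks the SSD holds. The per-mille allowances, the 4 MiB block size, and the function names are hypothetical values chosen for illustration; none of them comes from the claim.

```python
# Hypothetical allowance of bad blocks per 1000 blocks, by flash component type.
ALLOWANCE_PER_MILLE = {"SLC": 50, "MLC": 30, "TLC": 20}

# Assumed erase-block size used to convert capacity into a block count.
BLOCK_SIZE_BYTES = 4 * 1024 * 1024


def normalized_threshold(flash_type: str, capacity_bytes: int) -> int:
    """Scale the per-type allowance by the SSD's block count (integer arithmetic)."""
    total_blocks = capacity_bytes // BLOCK_SIZE_BYTES
    return total_blocks * ALLOWANCE_PER_MILLE[flash_type] // 1000


def should_migrate(attribute_value: int, flash_type: str, capacity_bytes: int) -> bool:
    """True when a block-count attribute (e.g. reassigned blocks) exceeds the threshold."""
    return attribute_value > normalized_threshold(flash_type, capacity_bytes)
```

Under these assumed numbers, a 400 GiB MLC drive has 102,400 blocks and a threshold of 3,072, so a reading of 3,073 reassigned blocks would trigger the alert while 3,072 would not.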
10. A system comprising:
a node of a cluster, the node having a memory connected to a processor via a bus;
a storage array connected to the node having one or more solid state drives (SSDs);
a storage input/output (I/O) stack executing on the processor of the node, the storage I/O stack configured to:
receive a write request having data directed towards a storage container stored on the storage array;
issue an I/O storage operation to a first SSD of the storage array;
in response to an I/O error detected from the I/O storage operation, increment a first counter in the memory, wherein the first counter is associated with a first type of I/O error, and wherein the first counter is included in a set of counters associated with the first SSD, wherein the first type of I/O error is a recovered error; and
in response to the first counter exceeding a first predetermined threshold after a first periodic interval, issue an alert to migrate the data from the first SSD to a second SSD of the storage array, and place the first SSD into read-only service, and wherein the first predetermined threshold is chosen such that an expected failure of the first SSD occurs after a migration period for the data.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
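Each independent claim requires the threshold to be chosen so that the expected failure of the SSD falls after the migration period. The claims do not say how that choice is made. As a rough sketch, assume errors accrue at a constant rate and that failure is expected once the counter reaches some known count; both are simplifying assumptions made here, and `pick_threshold` and its parameters are hypothetical names.

```python
import math


def pick_threshold(failure_count: float, errors_per_hour: float,
                   migration_hours: float, check_interval_hours: float) -> int:
    """Largest threshold T such that, when the counter first exceeds T, at least
    migration_hours remain before the counter is expected to reach failure_count,
    even if the crossing is only noticed one full check interval later.

    Assumes a constant error rate (linear model); real wear is rarely this regular.
    """
    # Errors expected to accrue while waiting for the next check and then migrating.
    reserve = errors_per_hour * (migration_hours + check_interval_hours)
    # The alert fires at count T + 1, so require T + 1 <= failure_count - reserve.
    return max(0, math.floor(failure_count - reserve) - 1)
```

For example, with failure expected at 1,000 errors, 2 errors per hour, a 24-hour migration, and hourly checks, the reserve is 50 errors and the sketch returns a threshold of 949.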
Specification