Partial disk failures and improved storage resiliency

US 8,049,980 B1
Filed: 04/18/2008
Issued: 11/01/2011
Est. Priority Date: 04/18/2008
Status: Active Grant

Abstract

A mass data storage system including a hard disk drive comprising heads and platter surfaces determines when a head of the disk is faulty, and the disk continues to operate as a partially failed disk using the remaining non-faulty heads. A striped parity disk array comprising disks capable of operating as partially failed disks allows data to be copied from the platter surfaces not associated with a faulty head of a partially failed disk to a spare disk. This reduces the amount of data that must be rebuilt in the rebuild process, and therefore the time the array spends in degraded mode, exposed to a total loss of data caused by a subsequent disk failure.
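As an illustration of the rebuild step that the abstract aims to minimize, the sketch below recovers one missing block of a stripe from the surviving data and parity blocks. It assumes single XOR parity, as commonly used in striped parity arrays; the record itself does not specify the parity function, and the names `xor_blocks` and `rebuild_missing` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Recover the one missing block of a stripe.

    stripe is a list of blocks (data and parity) in which the
    unreadable block is represented by None. XOR parity is an
    assumption of this sketch, not a detail taken from the record.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity recovers exactly one missing block")
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)
```

Only the blocks that sat on the surface serviced by the faulty head need this treatment; every other block can be copied directly, which is cheaper because it reads one disk instead of every other member of the stripe.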

18 Claims

1. A mass data storage system comprising:
- a plurality of member disks and a spare disk which are organized into an array, each member disk and the spare disk having a plurality of heads and a plurality of platter surfaces respectively serviced by each head during read and write operations;
  
  a disk controller of each disk which controls the heads and platter surfaces during the read and write operations and which recognizes errors arising from each head performing read and write operations on the platter surface serviced by each head, the disk controller designating any one head as faulty whenever read and write errors associated with that one head meet a predetermined threshold; and
  
  an array controller which communicates with each disk controller and controls the operation of each disk in the array during mass data storage operations, the array controller controlling the disk controllers of the spare disk and the member disk having the designated faulty head to (a) read data from the surfaces of the member disk serviced by non-faulty heads and write that data onto surfaces of the spare disk, (b) rebuild data on the spare disk written on the surface of the member disk serviced by the faulty head, and (c) thereafter perform subsequent mass data storage read and write operations on the spare disk which would otherwise be addressed to the member disk having the faulty head.
  - 2. A mass data storage system as defined in claim 1, wherein:
    - the array is a parity disk array in which parity information is recorded on at least one of the disks of the array to create data redundancy; and
      
      the array controller rebuilds the data on the spare disk from the data and parity information of the member disks other than the disk having the head designated as faulty.
  - 3. A mass data storage system as defined in claim 2, wherein the data written to the surfaces of the member disks of the array is striped across at least two member disks of the array.
  - 4. A mass data storage system as defined in claim 1, wherein:
    - the disk controller of the disk with the faulty head prevents further read and write operations by the faulty head after the head has been designated as faulty.
  - 5. A mass data storage system as defined in claim 1, wherein each of the disks further comprises:
    - a memory unit accessible by the disk controller; and
      
      wherein:
      
      data is stored in physical blocks on the platter surfaces; and
      
      the memory unit includes mappings between each of the physical blocks on the platter surfaces of the disk and one of the plurality of heads of the disk.
  - 6. A mass data storage system as defined in claim 1, wherein each of the disks further comprises:
    - a memory unit accessible by the disk controller; and
      
      wherein:
      
      the disk controller detects and counts errors arising from each head of the disk servicing the surfaces; and
      
      the memory unit includes a data structure in which the error counts for each of the plurality of heads are stored.
  - 7. A mass data storage system as defined in claim 1, wherein each of the disks further comprises:
    - a memory unit accessible by the disk controller; and
      
      wherein:
      
      the memory unit includes a data structure which stores a reference to any head designated as faulty.
  - 8. A mass data storage system as defined in claim 1, wherein:
    - data is stored in physical blocks on the platter surfaces of each disk;
      
      each platter surface includes a plurality of predetermined locations for the physical blocks;
      
      each physical block of each disk has a logical block address;
      
      each disk controller uses logical block addressing in communicating with the array controller; and
      
      each disk controller communicates to the array controller those logical block addresses which are associated with the platter surface serviced by each faulty head.
  - 9. A mass data storage system as defined in claim 8, wherein:
    - each physical block which is incapable of at least one of reliably being read from or reliably being written to constitutes a bad physical block;
      
      each of the disks has at least one platter surface containing a plurality of spare physical blocks which are not initially used in read and write operations, the plurality of spare physical blocks of each disk constituting a spare block pool for each disk;
      
      the disk controller of each disk remaps logical block numbers associated with bad physical blocks to unused spare physical blocks in the spare block pool which are not located on a surface serviced by a faulty head; and
      
      the disk controller of each disk prevents remapping a spare physical block to a logical block number if the spare physical block resides on a platter surface associated with a faulty head.
  - 10. A mass data storage system as defined in claim 9, wherein:
    - each of the platter surfaces includes spare physical blocks from the spare block pool; and
      
      the disk controller of each disk remaps read and write operations from a bad physical block to a spare physical block on the same platter surface.
  - 11. A mass data storage system as defined in claim 1, wherein:
    - the disk controller of each disk counts the number of errors occurring for each head, compares the counted number of errors for each head with a predetermined threshold, and designates any head as faulty when the counted number of errors exceeds the predetermined threshold.
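The block-to-head mapping of claim 5 and the remapping rules of claims 9 and 10 can be pictured with a small data structure. This is a minimal sketch under stated assumptions: the class `DiskMap` and its fields are hypothetical, and falling back to another healthy surface when the same surface has no spare left is one possible policy, not something the claims prescribe.

```python
class DiskMap:
    """Hypothetical per-disk bookkeeping kept in the disk's memory unit."""

    def __init__(self, block_head, spare_pool):
        self.block_head = dict(block_head)  # physical block -> head (claim 5)
        self.spare_pool = list(spare_pool)  # unused spare physical blocks
        self.lba_to_block = {}              # logical block address -> physical block
        self.faulty_heads = set()           # heads designated as faulty (claim 7)

    def remap(self, lba, bad_block):
        """Move lba off bad_block and return the chosen spare block."""
        # Claim 9: never pick a spare on a surface serviced by a faulty head.
        usable = [b for b in self.spare_pool
                  if self.block_head[b] not in self.faulty_heads]
        if not usable:
            raise RuntimeError("no spare block on a healthy surface")
        # Claim 10: prefer a spare on the same platter surface.
        same_surface = [b for b in usable
                        if self.block_head[b] == self.block_head[bad_block]]
        chosen = (same_surface or usable)[0]
        self.spare_pool.remove(chosen)
        self.lba_to_block[lba] = chosen
        return chosen
```

Once a head is added to `faulty_heads`, its spares drop out of consideration, so a bad block is never remapped onto a surface the disk can no longer service.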

12. A method of reducing the risk of data becoming unrecoverable in a mass data storage system including a plurality of disks each having a plurality of heads and a plurality of platter surfaces respectively serviced by the heads during read and write operations, at least one of the plurality of disks storing data and at least one of the plurality of disks being a spare disk, the method further comprising:
- writing data to platter surfaces with the heads in write operations;
  
  reading data from platter surfaces with the heads in read operations;
  
  detecting errors during read operations;
  
  associating each error with one of the heads from which the error arose;
  
  designating one of the heads as faulty whenever the errors associated with that head meet a predetermined threshold;
  
  continuing to perform read and write operations with the non-faulty heads of the disks;
  
  copying data from the platter surfaces serviced by the non-faulty heads of the disk having the faulty head to the spare disk; and
  
  restoring the data that was on the platter surface serviced by the faulty head onto the spare disk without copying data from the platter surface serviced by the faulty head.
  - 13. A method as defined in claim 12, wherein the plurality of disks constitute a striped parity disk array, a plurality of disks of the disk array constituting member disks, the member disks storing data and parity information, and in the case of one of the member disks having the faulty head:
    - restoring the data from the platter surface serviced by the faulty head onto the spare disk comprises rebuilding the data from the data and parity information of the member disks other than the disk with the faulty head.
  - 14. A method as defined in claim 13, wherein the copying data from the platter surfaces serviced by the non-faulty heads of the disk with the faulty head to the spare disk is performed with a mirroring copy process.
  - 15. A method as defined in claim 12, further comprising:
    - counting the number of errors occurring for each head;
      
      comparing the counted number of errors for each head with a predetermined threshold; and
      
      designating one of the heads as faulty when the counted number of errors for that head exceeds the predetermined threshold.
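The counting and designation steps of claims 12 and 15 might look like the following sketch. The class `HeadMonitor` and its method names are hypothetical. Claim 12 says the errors "meet" the threshold while claim 15 says the count "exceeds" it; the sketch follows the stricter wording of claim 15.

```python
from collections import Counter


class HeadMonitor:
    """Hypothetical per-head error bookkeeping."""

    def __init__(self, threshold):
        self.threshold = threshold     # predetermined threshold
        self.error_counts = Counter()  # head -> number of errors seen
        self.faulty = set()            # heads designated as faulty

    def record_error(self, head):
        """Count one error for head; return True if the head is now faulty."""
        self.error_counts[head] += 1
        if self.error_counts[head] > self.threshold:
            self.faulty.add(head)
        return head in self.faulty

    def may_use(self, head):
        """Operations continue only on heads not designated as faulty."""
        return head not in self.faulty
```

With a threshold of 2, the third error attributed to a head designates it as faulty, while the other heads remain available for read and write operations.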

16. A method for copying data from a first hard disk having one of a plurality of heads designated as faulty to a second hard disk, each of the plurality of heads of the first hard disk associated with a different set of physical blocks which store data, the first hard disk associating each of a plurality of logical block addresses with a different physical block, the method using a host computer connected to the first hard disk and the second hard disk, the host computer issuing write and read commands pertaining to specific logical block addresses to the first hard disk which result in the first hard disk respectively writing data to and reading data from the physical blocks associated with the specific logical block addresses, and wherein the host computer:
- copies data from the first hard disk to the second hard disk by issuing to the first hard disk read commands pertaining to logical block addresses associated with non-faulty heads of the first hard disk, the first hard disk supplying the data to the host computer in response to the read commands, the host computer issuing write commands to store the data read from the first hard disk on the second hard disk; and
  
  restores data that was previously stored on the physical blocks serviced by the faulty head to the second hard disk without copying the data from the first hard disk.
  - 17. A method as defined in claim 16, wherein the hard disk:
    - recognizes read errors in response to a failure to read data from a physical block in response to a read command; and
      
      supplies to the host computer information identifying the head associated with a read error.
  - 18. A method as defined in claim 17, wherein the host computer further:
    - counts the number of read errors associated with each head;
      
      compares the counted number of read errors for each head with a predetermined threshold value; and
      
      determines any head to be faulty when the counted number of read errors for that head exceeds the predetermined threshold value.
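The host-driven copy of claim 16 can be sketched as a loop over logical block addresses that reads only from healthy heads. Everything here is an assumption made for illustration: the function name, the `lba_head` mapping (standing in for whatever head information the disk reports), and the three callbacks, where `restore_lba` represents any recovery path that does not touch the first disk, such as a parity rebuild.

```python
def copy_to_second_disk(lba_head, faulty_heads, read_lba, write_lba, restore_lba):
    """Populate the second disk without reading through a faulty head.

    lba_head     -- dict mapping each logical block address to its head
    faulty_heads -- set of heads designated as faulty
    read_lba     -- callable(lba) reading from the first disk
    write_lba    -- callable(lba, data) writing to the second disk
    restore_lba  -- callable(lba) recovering data from another source
    Returns (copied, restored) block counts.
    """
    copied = restored = 0
    for lba in sorted(lba_head):
        if lba_head[lba] in faulty_heads:
            data = restore_lba(lba)  # never issued to the first disk
            restored += 1
        else:
            data = read_lba(lba)     # direct copy from a healthy surface
            copied += 1
        write_lba(lba, data)
    return copied, restored
```

The returned counts make the benefit visible: the fewer addresses that map to the faulty head, the smaller the share of the disk that has to go through the slower restore path.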

Current Assignee: Network Appliance Incorporated (NetApp, Inc.)
Original Assignee: Network Appliance Incorporated (NetApp, Inc.)
Inventors: Emami, Tim K.
Primary Examiner(s): Negron; Daniell L
Application Number: US12/106,020
Time in Patent Office: 1,292 Days
Field of Search: None
US Class Current: 360/31
CPC Class Codes:
G11B 20/1833   by adding special lists or ...
G11B 2220/2516   Hard disks
G11B 2220/415   Redundant array of inexpens...
G11B 27/36   Monitoring, i.e. supervisin...
