Method and system for rapidly recovering data from a “sick” disk in a RAID disk group

US 7,574,623 B1
Filed: 04/29/2005
Issued: 08/11/2009
Est. Priority Date: 04/29/2005
Status: Active Grant

Abstract

A method and system for recovering data from a “sick” disk are described. One embodiment of the invention relates to a RAID-based storage system that predicts the failure of a disk (e.g., a “sick” disk) in a RAID disk group. Accordingly, the storage system allocates a target disk, selected from several spare disks, to replace the “sick” disk in the RAID disk group upon completion of a disk-to-disk copy operation. Once a target disk has been allocated, a disk-to-disk copy operation is initiated to copy data from the “sick” disk to the target disk, thereby preventing the need to reconstruct data on the “sick” disk if the “sick” disk actually fails. During the disk-to-disk copy operation, client-initiated disk access operations continue to be serviced. Upon completion of the disk-to-disk copy operation, the storage system reconfigures the RAID disk group by swapping the target disk with the “sick” disk.
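
The copy-then-swap flow summarized in the abstract can be sketched as follows. This is only an illustration under simplifying assumptions, not the patented implementation: disks are modeled as in-memory lists of blocks, and the names `copy_and_swap`, `group`, `spares`, and `disks` are hypothetical.

```python
def copy_and_swap(group, sick, spares, disks):
    """Copy every block from the sick disk to a spare, then swap membership.

    All names are hypothetical: `group` is a list of disk names in the RAID
    group, `spares` is a list of spare disk names, and `disks` maps a disk
    name to its list of blocks.
    """
    target = spares.pop(0)                 # allocate a target from the spare pool
    disks[target] = list(disks[sick])      # disk-to-disk copy, no parity math needed
    group[group.index(sick)] = target      # reconfigure: the target replaces the sick disk
    return target


group = ["d0", "d1", "d2"]
spares = ["s0", "s1"]
disks = {"d0": [1, 2], "d1": [3, 4], "d2": [5, 6]}
target = copy_and_swap(group, "d1", spares, disks)
```

In this sketch the sick disk stays readable for the whole copy, which is the case the abstract targets: a straight copy touches only two disks, whereas reconstruction would have to read every other member of the group.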

23 Claims

1. A machine-implemented method, comprising:
- predicting an imminent failure of a particular mass storage device in a redundancy group of mass storage devices;
  
  responsive to predicting the imminent failure of the particular mass storage device, automatically initiating a device-to-device copy operation to copy data from the particular mass storage device to a spare mass storage device;
  
  during the device-to-device copy operation, receiving a client-initiated read request directed to a storage area on the particular mass storage device, and forwarding the client-initiated read request to the particular mass storage device for servicing;
  
  upon receiving, from the particular mass storage device, an error indicating the particular mass storage device failed to service the client-initiated read request, determining whether data from the storage area of the particular mass storage device has been copied to the spare mass storage device;
  
  if data from the storage area of the particular mass storage device has been copied to the spare mass storage device, forwarding the client-initiated read request to the spare mass storage device for servicing; and
  
  upon completion of the device-to-device copy operation, reconfiguring the redundancy group to replace the particular mass storage device in the redundancy group with the spare mass storage device.
  - 2. The machine-implemented method of claim 1, further comprising:
    - reconstructing data from a storage area of the particular mass storage device if the particular mass storage device fails to read data from the storage area of the particular mass storage device during a read operation associated with the device-to-device copy operation; and
      
      writing the reconstructed data to the spare mass storage device.
  - 3. The machine-implemented method of claim 2, wherein reconstructing data from a storage area of the particular mass storage device includes:
    - reading data and/or parity data from mass storage devices in the redundancy group other than the particular mass storage device; and
      
      computing the data from the storage area of the particular mass storage device using the data and/or parity data read from the mass storage devices in the redundancy group other than the particular mass storage device.
  - 4. The machine-implemented method of claim 1, further comprising:
    - during the device-to-device copy operation, in response to detecting a catastrophic failure of the particular disk, terminating the device-to-device copy operation; and
      
      initiating a data reconstruction operation to reconstruct only data from the particular mass storage device that was not copied to the target mass storage device during the disk-to-disk copy operation.
  - 5. The machine-implemented method of claim 1, further comprising:
    - during the device-to-device copy operation, preventing client-initiated write requests from being directed to the particular mass storage device by redirecting client-initiated write requests to a mass storage device in the redundancy group other than the particular mass storage device.
  - 6. The machine-implemented method of claim 1, further comprising:
    - during the device-to-device copy operation, receiving a client-initiated write request directed to a storage area on the particular mass storage device, and mirroring the write request so as to forward the write request to both the particular mass storage device and the spare mass storage device for servicing.
  - 7. The machine-implemented method of claim 1, further comprising:
    - if data from the particular storage area of the particular mass storage device has not been copied to the spare mass storage device, reconstructing data from the particular storage area of the particular mass storage device.
  - 8. The machine-implemented method of claim 7, wherein reconstructing data from the storage area of the particular mass storage device includes:
    - reading data and/or parity data from mass storage devices in the redundancy group other than the particular mass storage device; and
      
      computing the data from the storage area of the particular mass storage device using the data and/or parity data read from the mass storage devices in the redundancy group other than the particular mass storage device.
  - 9. The machine-implemented method of claim 1, wherein predicting an imminent failure of a particular mass storage device in a redundancy group of mass storage devices includes:
    - receiving an error message from the particular mass storage device, the error message indicating the imminent failure of the particular mass storage device.
  - 10. The machine-implemented method of claim 1, wherein predicting an imminent failure of a particular mass storage device in a redundancy group of mass storage devices includes:
    - receiving one or more error messages from the particular mass storage device; and
      
      automatically analyzing the one or more error messages to determine a pattern of error messages received, the pattern indicating the imminent failure of the particular mass storage device.
  - 11. The machine-implemented method of claim 1, wherein predicting an imminent failure of a particular mass storage device in a redundancy group of mass storage devices includes:
    - receiving an error message from the particular mass storage device; and
      
      automatically analyzing the error message to determine a frequency with which error messages are received exceeds an error frequency threshold for the particular mass storage device.
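
The read path recited in claims 1, 3, and 7 can be sketched as below. This is a minimal illustration, assuming a single-parity group in which a block is the XOR of the corresponding blocks on all other members; the claims themselves say only "data and/or parity data" and do not fix the parity scheme. The `Disk` class, `ReadError`, `reconstruct`, `service_read`, and the `copied` set of block indices are all hypothetical names.

```python
from functools import reduce


class ReadError(Exception):
    """Raised by a hypothetical disk when it cannot service a read."""


class Disk:
    def __init__(self, blocks, bad=()):
        self.blocks = list(blocks)
        self.bad = set(bad)              # block indices that fail on read

    def read(self, i):
        if i in self.bad:
            raise ReadError(i)
        return self.blocks[i]


def reconstruct(peers, i):
    # Assumed single-parity scheme: XOR the same block on every other member.
    return reduce(lambda a, b: a ^ b, (p.read(i) for p in peers))


def service_read(sick, spare, peers, copied, i):
    try:
        return sick.read(i)              # the sick disk is always tried first
    except ReadError:
        if i in copied:                  # block already copied: read the spare
            return spare.read(i)
        return reconstruct(peers, i)     # not yet copied: rebuild from the peers


d0 = Disk([5, 9, 12])
d1 = Disk([3, 6, 10])
parity = Disk([5 ^ 3 ^ 7, 9 ^ 6 ^ 1, 12 ^ 10 ^ 4])
sick = Disk([7, 1, 4], bad={1, 2})
spare = Disk([7, 1, 0])                  # blocks 0 and 1 copied so far
copied = {0, 1}
results = [service_read(sick, spare, [d0, d1, parity], copied, i) for i in range(3)]
```

Block 0 is served by the sick disk, block 1 by the spare after the sick disk reports an error, and block 2 is rebuilt from the remaining members because it has not been copied yet.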

12. An apparatus comprising:
- read/write hardware logic to read from and write to a plurality of mass storage devices, the plurality of mass storage devices logically configured to include a redundancy group of mass storage devices and one or more spare mass storage devices;
  
  failure prediction logic to predict imminent failure of a particular mass storage device in the redundancy group of mass storage devices;
  
  controller logic to (i) allocate a target mass storage device selected from the one or more spare mass storage devices, the target mass storage device to replace the particular mass storage device upon completion of a device-to-device copy operation, (ii) initiate a device-to-device copy operation to copy data from the particular mass storage device to the target mass storage device, and (iii) logically reconfigure the plurality of mass storage devices so as to replace the particular mass storage device with the target mass storage device in the redundancy group of mass storage devices upon completion of the device-to-device copy operation, wherein the device-to-device copy operation occurs concurrently with one or more client-initiated mass storage device access requests directed to a storage area of the particular mass storage device and wherein, if the client-initiated mass storage device access request is a read request, the read/write hardware logic is to forward the read request to the particular mass storage device for servicing and wherein the read/write hardware logic is to forward the read request to the target mass storage device for servicing, if the particular mass storage device fails to service the read request and the controller logic determines data from the storage area of the particular mass storage device has been copied to the target mass storage device.
  - 13. The apparatus of claim 12, further comprising:
    - reconstruction logic to reconstruct data from a storage area of the particular mass storage device if data from the storage area cannot be read during a read operation associated with the device-to-device copy operation.
  - 14. The apparatus of claim 13, wherein the reconstruction logic is to initiate a data reconstruction operation to reconstruct data from the storage area of the particular mass storage device if the particular mass storage device fails to service the read request.
  - 15. The apparatus of claim 12, wherein, if the client-initiated mass storage device access request is a write request, the read/write hardware logic is to mirror the write request so as to forward the write request to both the particular mass storage device and the target mass storage device for servicing.
  - 16. The apparatus of claim 12, wherein the failure prediction logic is to receive and analyze one or more error messages from the particular mass storage device.
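
The write mirroring of claim 15 and the reconstruct-on-copy-error behavior of claim 13 can be sketched together. This is a hedged illustration only: disks are plain Python lists, and `mirror_write`, `copy_blocks`, the `unreadable` set, and the `rebuild` callback are hypothetical stand-ins for the read/write logic and reconstruction logic named in the claims.

```python
def mirror_write(sick, target, block, data):
    """Apply one client write to both the sick disk and the target disk."""
    sick[block] = data
    target[block] = data


def copy_blocks(sick, target, unreadable, rebuild):
    """Copy block by block; rebuild any block the sick disk cannot read."""
    for block in range(len(sick)):
        if block in unreadable:
            target[block] = rebuild(block)   # reconstruction path
        else:
            target[block] = sick[block]      # direct copy path


sick = [1, 2, 3]
target = [None, None, None]
mirror_write(sick, target, 2, 9)             # client write arrives before the copy
copy_blocks(sick, target, unreadable={1}, rebuild=lambda block: 20)
```

Because a mirrored write lands on both disks, it does not matter in this sketch whether the copy pass reaches that block before or after the client write: the target ends up with the new data either way.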

17. A computer-implemented method, comprising:
- predicting imminent failure of a particular disk in a RAID disk group, said particular disk capable of servicing read and/or write requests within predefined time parameters;
  
  allocating a target disk selected from one or more spare disks, said target disk to replace the particular disk in the RAID disk group upon completion of a disk-to-disk copy operation;
  
  initiating the disk-to-disk copy operation to copy data directly from the particular disk to the target disk, said disk-to-disk copy operation occurring concurrently with at least one disk access operation initiated by a client application, the disk access operation requesting access to a disk block on the particular disk;
  
  if the at least one disk access operation initiated by the client application is a read request, forwarding the client-initiated read request to the particular disk for servicing;
  
  upon receiving, from the particular disk, an error indicating the particular disk failed to service the client-initiated read request, determining whether the disk block on the particular disk has been copied to the target disk;
  
  if the disk block on the particular disk has been copied to the target disk, forwarding the client-initiated read request to the target disk for servicing; and
  
  upon completion of the disk-to-disk copy operation, reconfiguring the RAID disk group so as to exclude the particular disk from the RAID disk group, and to include the target disk, in place of the particular disk, in the RAID disk group.
  - 18. The computer-implemented method of claim 17, further comprising:
    - if said disk block on the particular disk has not been copied to the target disk and if said disk block on the particular disk cannot be read by the particular disk, initiating a reconstruction operation to reconstruct the data on the disk block by utilizing data and/or parity data read from disks in the RAID disk group other than the particular disk.
  - 19. The computer-implemented method of claim 17, further comprising:
    - if said disk access operation initiated by a client application is a write request directed to a disk block on the particular disk, mirroring the write request by forwarding the write request to both the particular disk and the target disk for servicing.
  - 20. The computer-implemented method of claim 17, further comprising:
    - if a disk block on the particular disk cannot be read in response to a read request associated with the disk-to-disk copy operation, initiating a reconstruction operation to reconstruct the data on the disk block by utilizing data and/or parity data read from disks in the RAID disk group other than the particular disk; and
      
      writing the reconstructed data to the target disk.
  - 21. The computer-implemented method of claim 17, wherein predicting imminent failure of a particular disk in a RAID disk group further comprises:
    - receiving error messages from the particular disk in the RAID disk group; and
      
      automatically analyzing the error messages to determine whether the frequency with which the error messages are received exceeds a disk-error frequency threshold.
  - 22. The computer-implemented method of claim 17, wherein predicting imminent failure of a particular disk in a RAID disk group further comprises:
    - analyzing a response time associated with a client-initiated read or write request directed to the particular disk; and
      
      determining whether the response time exceeds a predetermined expected response time.
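
The two prediction tests in claims 21 and 22 (error frequency above a threshold, response time above an expected value) can be sketched as below. The claims do not say how frequency is measured, so the sliding time window here is an assumption, and `FailurePredictor` with its parameters `max_errors`, `window_s`, and `max_response_s` is a hypothetical name.

```python
from collections import deque


class FailurePredictor:
    def __init__(self, max_errors, window_s, max_response_s):
        self.max_errors = max_errors          # errors tolerated per window
        self.window_s = window_s              # assumed sliding window, in seconds
        self.max_response_s = max_response_s  # expected response time, in seconds
        self.errors = deque()                 # timestamps of recent errors

    def record_error(self, now):
        """Return True if the error frequency now exceeds the threshold."""
        self.errors.append(now)
        while self.errors and now - self.errors[0] > self.window_s:
            self.errors.popleft()             # forget errors outside the window
        return len(self.errors) > self.max_errors

    def record_response(self, elapsed_s):
        """Return True if a request took longer than the expected time."""
        return elapsed_s > self.max_response_s


p = FailurePredictor(max_errors=2, window_s=10.0, max_response_s=0.5)
flags = [p.record_error(t) for t in (0.0, 1.0, 2.0, 20.0)]
slow = p.record_response(0.8)
```

Three errors within ten seconds trip the frequency test, while the isolated error at t = 20.0 does not, since the earlier errors have aged out of the window.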

23. A machine-readable storage medium storing instructions for facilitating the rapid recovery of data from a particular disk in a RAID disk group, the instructions, when executed by a machine, cause the machine to perform the method of:
- predicting imminent failure of a particular disk in a RAID disk group, said particular disk capable of servicing read and/or write requests within predefined time parameters;
  
  allocating a target disk selected from one or more spare disks, said target disk to replace the particular disk in the RAID disk group upon completion of a disk-to-disk copy operation;
  
  initiating the disk-to-disk copy operation to copy data directly from the particular disk to the target disk thereby preventing the need to reconstruct data on the particular disk should the particular disk actually fail, said disk-to-disk copy operation occurring concurrently with at least one disk access operation initiated by a client application, the disk access operation requesting access to a data block on the particular disk;
  
  if the at least one disk access operation initiated by the client application is a read request, forwarding the client-initiated read request to the particular disk for servicing;
  
  upon receiving, from the particular disk, an error indicating the particular disk failed to service the client-initiated read request, determining whether the data block on the particular disk has been copied to the target disk;
  
  if the data block on the particular disk has been copied to the target disk, forwarding the client-initiated read request to the target disk for servicing; and
  
  upon completion of the disk-to-disk copy operation, reconfiguring the RAID disk group so as to exclude the particular disk from the RAID disk group, and to include the target disk, in place of the particular disk, in the RAID disk group.


Current Assignee: Network Appliance Incorporated (NetApp, Inc.)
Original Assignee: Network Appliance Incorporated (NetApp, Inc.)
Inventors: Atul Goel, Rajesh Sundaram, Stephen H. Strange, Tomislav Grcanac
Primary Examiner(s): Gabriel L. Chu
Application Number: US11/118,896
Time in Patent Office: 1,565 Days
Field of Search: None
US Class Current: 714/47.2
CPC Class Codes:
G06F 11/008 Reliability or availability...
G06F 11/1088 Reconstruction on already f...
