Method and system for rapidly recovering data from a “sick” disk in a RAID disk group
First Claim
1. A machine-implemented method, comprising:
predicting an imminent failure of a particular mass storage device in a redundancy group of mass storage devices;
responsive to predicting the imminent failure of the particular mass storage device, automatically initiating a device-to-device copy operation to copy data from the particular mass storage device to a spare mass storage device;
during the device-to-device copy operation, receiving a client-initiated read request directed to a storage area on the particular mass storage device, and forwarding the client-initiated read request to the particular mass storage device for servicing;
upon receiving, from the particular mass storage device, an error indicating the particular mass storage device failed to service the client-initiated read request, determining whether data from the storage area of the particular mass storage device has been copied to the spare mass storage device;
if data from the storage area of the particular mass storage device has been copied to the spare mass storage device, forwarding the client-initiated read request to the spare mass storage device for servicing; and
upon completion of the device-to-device copy operation, reconfiguring the redundancy group to replace the particular mass storage device in the redundancy group with the spare mass storage device.
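The read-handling steps of the claim can be sketched in a few lines. This is only an illustrative model, not the patented implementation: the `Disk` class, `DiskReadError`, the `service_read` function, and the `copied` set that tracks copy progress are all hypothetical names introduced here. The claim does not say what happens when the failed block has not yet been copied, so the sketch simply re-raises the error in that case.

```python
class DiskReadError(Exception):
    """Raised by the toy Disk below when it cannot service a read."""


class Disk:
    """Hypothetical in-memory disk: block number -> bytes, plus unreadable blocks."""

    def __init__(self, blocks=None, bad_blocks=()):
        self.blocks = dict(blocks or {})
        self.bad_blocks = set(bad_blocks)

    def read(self, block_no):
        if block_no in self.bad_blocks or block_no not in self.blocks:
            raise DiskReadError(block_no)
        return self.blocks[block_no]


def service_read(block_no, sick, spare, copied):
    """Route a client read while the device-to-device copy is in progress.

    `copied` is an assumed set of block numbers already copied to the spare.
    """
    try:
        # The request is forwarded to the failing ("sick") device first.
        return sick.read(block_no)
    except DiskReadError:
        # Only after an error: has this storage area been copied yet?
        if block_no in copied:
            return spare.read(block_no)
        # Not covered by the claim; left to the caller in this sketch.
        raise


# Usage: block 1 is unreadable on the sick disk but has already been copied.
sick = Disk({0: b"a", 1: b"b", 2: b"c"}, bad_blocks={1, 2})
spare = Disk({0: b"a", 1: b"b"})
copied = {0, 1}
data_from_sick = service_read(0, sick, spare, copied)
data_from_spare = service_read(1, sick, spare, copied)
```

Trying the sick device first follows the order of the claim steps; the spare is consulted only when the sick device reports an error.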
Abstract
A method and system for recovering data from a “sick” disk are described. One embodiment of the invention relates to a RAID-based storage system that predicts the failure of a disk (e.g., a “sick” disk) in a RAID disk group. Accordingly, the storage system allocates a target disk, selected from several spare disks, to replace the “sick” disk in the RAID disk group upon completion of a disk-to-disk copy operation. Once a target disk has been allocated, a disk-to-disk copy operation is initiated to copy data from the “sick” disk to the target disk, thereby preventing the need to reconstruct data on the “sick” disk if the “sick” disk actually fails. During the disk-to-disk copy operation, client-initiated disk access operations continue to be serviced. Upon completion of the disk-to-disk copy operation, the storage system reconfigures the RAID disk group by swapping the target disk with the “sick” disk.
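The sequence in the abstract (allocate a target from the spares, copy disk-to-disk, then swap the target into the group) can be modeled as a short sketch. All names here (`allocate_target`, `replace_sick_disk`, the dictionary-based disk model) are hypothetical, the "first available spare" selection policy is an assumption, and the sketch assumes every block on the sick disk is still readable during the copy.

```python
def allocate_target(spares):
    """Pick a target disk from the spare pool (assumed policy: first available)."""
    if not spares:
        raise RuntimeError("no spare disk available")
    return spares.pop(0)


def replace_sick_disk(raid_group, spares, sick, disks):
    """Copy the sick disk to a spare, then swap the spare into the RAID group.

    raid_group: list of disk names in the group
    spares:     list of spare disk names
    sick:       name of the disk predicted to fail
    disks:      mapping of disk name -> {block number: data}
    """
    target = allocate_target(spares)

    # Disk-to-disk copy: data comes straight from the sick disk,
    # so nothing has to be reconstructed from the other group members.
    for block_no, data in disks[sick].items():
        disks[target][block_no] = data

    # Reconfigure the group only after the copy has completed.
    position = raid_group.index(sick)
    raid_group[position] = target
    return target


# Usage: disk "d1" is predicted to fail and is replaced by spare "s0".
disks = {"d0": {0: b"x"}, "d1": {0: b"y", 1: b"z"}, "s0": {}, "s1": {}}
group = ["d0", "d1"]
spare_pool = ["s0", "s1"]
new_member = replace_sick_disk(group, spare_pool, "d1", disks)
```

Deferring the swap until the copy finishes mirrors the abstract's wording that reconfiguration happens "upon completion of the disk-to-disk copy operation".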
23 Claims
1. A machine-implemented method, comprising:
predicting an imminent failure of a particular mass storage device in a redundancy group of mass storage devices;
responsive to predicting the imminent failure of the particular mass storage device, automatically initiating a device-to-device copy operation to copy data from the particular mass storage device to a spare mass storage device;
during the device-to-device copy operation, receiving a client-initiated read request directed to a storage area on the particular mass storage device, and forwarding the client-initiated read request to the particular mass storage device for servicing;
upon receiving, from the particular mass storage device, an error indicating the particular mass storage device failed to service the client-initiated read request, determining whether data from the storage area of the particular mass storage device has been copied to the spare mass storage device;
if data from the storage area of the particular mass storage device has been copied to the spare mass storage device, forwarding the client-initiated read request to the spare mass storage device for servicing; and
upon completion of the device-to-device copy operation, reconfiguring the redundancy group to replace the particular mass storage device in the redundancy group with the spare mass storage device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

12. An apparatus comprising:
read/write hardware logic to read from and write to a plurality of mass storage devices, the plurality of mass storage devices logically configured to include a redundancy group of mass storage devices and one or more spare mass storage devices;
failure prediction logic to predict imminent failure of a particular mass storage device in the redundancy group of mass storage devices;
controller logic to (i) allocate a target mass storage device selected from the one or more spare mass storage devices, the target mass storage device to replace the particular mass storage device upon completion of a device-to-device copy operation, (ii) initiate a device-to-device copy operation to copy data from the particular mass storage device to the target mass storage device, and (iii) logically reconfigure the plurality of mass storage devices so as to replace the particular mass storage device with the target mass storage device in the redundancy group of mass storage devices upon completion of the device-to-device copy operation,
wherein the device-to-device copy operation occurs concurrently with one or more client-initiated mass storage device access requests directed to a storage area of the particular mass storage device and
wherein, if the client-initiated mass storage device access request is a read request, the read/write hardware logic is to forward the read request to the particular mass storage device for servicing and
wherein the read/write hardware logic is to forward the read request to the target mass storage device for servicing, if the particular mass storage device fails to service the read request and the controller logic determines data from the storage area of the particular mass storage device has been copied to the target mass storage device.
Dependent claims: 13, 14, 15, 16

17. A computer-implemented method, comprising:
predicting imminent failure of a particular disk in a RAID disk group, said particular disk capable of servicing read and/or write requests within predefined time parameters;
allocating a target disk selected from one or more spare disks, said target disk to replace the particular disk in the RAID disk group upon completion of a disk-to-disk copy operation;
initiating the disk-to-disk copy operation to copy data directly from the particular disk to the target disk, said disk-to-disk copy operation occurring concurrently with at least one disk access operation initiated by a client application, the disk access operation requesting access to a disk block on the particular disk;
if the at least one disk access operation initiated by the client application is a read request, forwarding the client-initiated read request to the particular disk for servicing;
upon receiving, from the particular disk, an error indicating the particular disk failed to service the client-initiated read request, determining whether the disk block on the particular disk has been copied to the target disk;
if the disk block on the particular disk has been copied to the target disk, forwarding the client-initiated read request to the target disk for servicing; and
upon completion of the disk-to-disk copy operation, reconfiguring the RAID disk group so as to exclude the particular disk from the RAID disk group, and to include the target disk, in place of the particular disk, in the RAID disk group.
Dependent claims: 18, 19, 20, 21, 22

23. A machine-readable storage medium storing instructions for facilitating the rapid recovery of data from a particular disk in a RAID disk group, the instructions, when executed by a machine, cause the machine to perform the method of:
predicting imminent failure of a particular disk in a RAID disk group, said particular disk capable of servicing read and/or write requests within predefined time parameters;
allocating a target disk selected from one or more spare disks, said target disk to replace the particular disk in the RAID disk group upon completion of a disk-to-disk copy operation;
initiating the disk-to-disk copy operation to copy data directly from the particular disk to the target disk thereby preventing the need to reconstruct data on the particular disk should the particular disk actually fail, said disk-to-disk copy operation occurring concurrently with at least one disk access operation initiated by a client application, the disk access operation requesting access to a data block on the particular disk;
if the at least one disk access operation initiated by the client application is a read request, forwarding the client-initiated read request to the particular disk for servicing;
upon receiving, from the particular disk, an error indicating the particular disk failed to service the client-initiated read request, determining whether the data block on the particular disk has been copied to the target disk;
if the data block on the particular disk has been copied to the target disk, forwarding the client-initiated read request to the target disk for servicing; and
upon completion of the disk-to-disk copy operation, reconfiguring the RAID disk group so as to exclude the particular disk from the RAID disk group, and to include the target disk, in place of the particular disk, in the RAID disk group.

Specification