Method and system for rapidly recovering data from a “dead” disk in a RAID disk group

US 7,587,630 B1
Filed: 04/29/2005
Issued: 09/08/2009
Est. Priority Date: 04/29/2005
Status: Active Grant


Abstract

A method and system for rapidly recovering data from a failed disk in a RAID disk group are disclosed. According to one aspect of the present invention, a RAID-based storage system identifies a particular disk in a RAID disk group as a “dead” disk (e.g., incapable of servicing client-initiated requests in a timely manner). Accordingly, a spare disk is allocated to replace the “dead” disk and client-initiated read/write requests are directed to the spare disk for servicing. In addition, a disk-to-disk copy operation is initiated. Without overwriting valid data on the target disk with stale data from the “dead” disk, the disk-to-disk copy operation copies data from the “dead” disk to the target by directly reading data from the “dead” disk while reconstructing only the data that cannot be read directly from the “dead” disk.
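The copy-with-fallback behavior described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: it assumes a single-parity XOR layout (the abstract does not say which RAID level is used), models disks as Python lists of equal-length `bytes` blocks, and uses hypothetical names (`read_failing`, `reconstruct`, `copy_disk`, `spare_valid`).

```python
from functools import reduce


class UnreadableBlock(Exception):
    """Raised when the failing disk cannot return a block."""


def read_failing(disk, block_no):
    # A block stored as None models a media error on the failing disk.
    data = disk[block_no]
    if data is None:
        raise UnreadableBlock(block_no)
    return data


def reconstruct(peers, block_no):
    # Assumed single-parity layout: the missing block is the XOR of the
    # same block on every surviving disk (data disks plus parity).
    columns = zip(*(peer[block_no] for peer in peers))
    return bytes(reduce(lambda a, b: a ^ b, col) for col in columns)


def copy_disk(failing, peers, spare_valid, spare):
    """Copy every block to the spare; return how many were reconstructed."""
    rebuilt = 0
    for block_no in range(len(failing)):
        if spare_valid[block_no]:
            continue  # newer client data is already on the spare
        try:
            data = read_failing(failing, block_no)
        except UnreadableBlock:
            data = reconstruct(peers, block_no)
            rebuilt += 1
        spare[block_no] = data
        spare_valid[block_no] = True
    return rebuilt
```

Under these assumptions, reconstruction runs only for blocks the failing disk cannot return, which is the point of reading directly from the "dead" disk wherever possible.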

26 Claims

1. A method comprising:
- responsive to identifying a particular mass storage device in a redundancy group of mass storage devices as incapable of servicing client-initiated requests in a timely manner;
  
  automatically allocating a spare mass storage device to replace the particular mass storage device in the redundancy group of mass storage devices;
  
  generating a disk cookie for the spare mass storage device, the disk cookie used to generate a data validity tag for indicating the validity of data written to the spare mass storage device;
  
  forwarding client-initiated write requests directed to the particular mass storage device to the spare mass storage device for servicing; and
  
  initiating a device-to-device copy operation to systematically read data from the particular mass storage device and write the data to the spare mass storage device without overwriting data on the spare mass storage device with stale data from the particular mass storage device.
  - 2. The method of claim 1, wherein the device-to-device copy operation to systematically read data from the particular mass storage device and write data to the spare mass storage device without overwriting data on the spare mass storage device with stale data from the particular mass storage device includes:
    - reading data from a storage area of the particular mass storage device;
      
      reading a data validity tag from a corresponding storage area of the spare mass storage device; and
      
      writing the data from the storage area of the particular mass storage device to the corresponding storage area of the spare mass storage device only if the data validity tag indicates data at the corresponding storage area of the spare mass storage device is invalid.
  - 3. The method of claim 2, wherein writing the data from the storage area of the particular mass storage device to the corresponding storage area of the spare mass storage device includes:
    - generating a data validity tag associated with data to be written to the corresponding storage area of the spare mass storage device; and
      
      writing the data validity tag to the corresponding storage area of the spare mass storage device when writing the data from the storage area of the particular mass storage device to the corresponding storage area of the spare mass storage device.
  - 4. The method of claim 2, further comprising:
    - if the particular mass storage device fails to read data from the storage area of the particular mass storage device in connection with a read request associated with the device-to-device copy operation, reconstructing the data from the storage area of the particular mass storage device;
      
      generating a data validity tag associated with the reconstructed data; and
      
      writing the data validity tag and the reconstructed data to a corresponding storage area of the spare mass storage device.
  - 5. The method of claim 1, wherein the device-to-device copy operation to systematically read data from the particular mass storage device and write data to the spare mass storage device without overwriting data on the spare mass storage device with stale data from the particular mass storage device includes:
    - for each storage area storing data to be copied from the particular mass storage device to the spare mass storage device, analyzing a dirty bitmap indicating storage areas of the spare mass storage device to which valid data has been written; and
      
      for each storage area storing data to be copied from the particular mass storage device to the spare mass storage device, copying data from the storage area of the particular mass storage to a corresponding storage area of the spare mass storage device only if the dirty bitmap indicates that the corresponding storage area of the spare mass storage device is not storing valid data.
  - 6. The method of claim 5, wherein copying data from the storage area of the particular mass storage device to a corresponding storage area of the spare mass storage device includes:
    - reading data from the storage area of the particular mass storage device;
      
      generating a data validity tag based in part on a disk cookie associated with the spare mass storage device, and writing the data validity tag and the data from the storage area of the particular mass storage device to the corresponding storage area of the spare mass storage device.
  - 7. The method of claim 1, further comprising:
    - during the device-to-device copy operation, receiving a client-initiated read request directed to a storage area of the particular mass storage device;
      
      responsive to receiving the client-initiated read request directed to a storage area of the particular mass storage device, determining whether data at a corresponding storage area of the spare mass storage device is valid; and
      
      if data at the corresponding storage area of the spare mass storage device is valid, forwarding the client-initiated read request to the spare mass storage device for servicing.
  - 8. The method of claim 7, wherein determining whether data at a corresponding storage area of the spare mass storage device is valid includes comparing an address for the storage area of the particular mass storage device with a copy progress indicator value to determine whether the data at the corresponding storage area of the spare mass storage device is valid.
  - 9. The method of claim 7, wherein determining whether data at a corresponding storage area of the spare mass storage device is valid includes analyzing a dirty bitmap, the dirty bitmap indicating which storage areas of the spare mass storage device contain valid data.
  - 10. The method of claim 1, further comprising:
    - during the device-to-device copy operation, receiving a client-initiated read request directed to a storage area of the particular mass storage device;
      
      responsive to receiving the client-initiated read request directed to a storage area of the particular mass storage device, determining whether data at a corresponding storage area of the spare mass storage device is valid; and
      
      if data at the corresponding storage area of the spare mass storage device is not valid, reconstructing the data at the storage area of the particular mass storage device.
  - 11. The method of claim 10, further comprising:
    - after reconstructing the data at the storage area of the particular mass storage device, generating a data validity tag associated with the reconstructed data, and writing the data validity tag and the reconstructed data to a corresponding storage area of the spare mass storage device.
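
One possible reading of the tag-gated copy in claims 1 through 3 is sketched below. The claims do not say how a disk cookie is produced or how a data validity tag is derived from it, so everything here is an assumption made for illustration: the cookie is a random value minted when the spare is allocated, the tag is a truncated SHA-256 digest of the cookie and the block number, and `Spare`, `make_tag`, `new_disk_cookie`, and `copy_block` are hypothetical names.

```python
import hashlib
import secrets


def new_disk_cookie():
    # Hypothetical: a random value minted when the spare is allocated.
    return secrets.token_bytes(16)


def make_tag(cookie, block_no):
    # Assumed tag format: a short digest binding the cookie to the block.
    return hashlib.sha256(cookie + block_no.to_bytes(8, "big")).digest()[:8]


class Spare:
    def __init__(self, n_blocks):
        self.cookie = new_disk_cookie()
        # Each slot holds (tag, data). Slots never written under this
        # cookie carry a non-matching tag and therefore read as invalid.
        self.slots = [(b"", b"")] * n_blocks

    def is_valid(self, block_no):
        tag, _ = self.slots[block_no]
        return tag == make_tag(self.cookie, block_no)

    def write(self, block_no, data):
        # Data and its validity tag are written together.
        self.slots[block_no] = (make_tag(self.cookie, block_no), data)


def copy_block(source, spare, block_no):
    """Write to the spare only if its tag marks the block invalid."""
    if spare.is_valid(block_no):
        return False  # a forwarded client write already landed here
    spare.write(block_no, source[block_no])
    return True
```

In this sketch a client write that reaches the spare before the copy pass stamps the block with a matching tag, so the later copy step skips it rather than replacing it with stale data from the original disk.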

12. A storage system comprising:
- controller logic to automatically allocate a spare mass storage device to replace a particular mass storage device in a redundancy group of mass storage devices in response to identifying the particular mass storage device as incapable of servicing client-initiated access requests in a timely manner;
  
  data validity tag generation logic to generate data validity tags to indicate the validity of data written to the spare mass storage device, the data validity tags based at least in part on a disk cookie associated with the spare mass storage device; and
  
  read/write hardware logic to (i) forward client-initiated write requests directed to the particular mass storage device to the spare mass storage device for servicing, and (ii) initiate a device-to-device copy operation to systematically copy data from the particular mass storage device to the spare mass storage device without overwriting data on the spare mass storage device with stale data from the particular mass storage device.
  - 13. The storage system of claim 12, further comprising:
    - data validation logic to analyze a data validity tag read from a storage area of the spare mass storage device, wherein the read/write hardware logic is to write data from a storage area of the particular mass storage device to a corresponding storage area of the spare mass storage device, in connection with a device-to-device copy operation, only if the data validation logic determines, based on the data validity tag, that the data at the corresponding storage area of the spare mass storage device is invalid.
  - 14. The storage system of claim 12, wherein the read/write hardware logic is to write a data validity tag to a particular storage area of the spare mass storage device when data from a storage area of the particular mass storage device is written to the particular storage area of the spare mass storage device.
  - 15. The storage system of claim 12, further comprising:
    - data reconstruction logic to reconstruct data from a storage area of the particular mass storage device when the particular mass storage device cannot read the data at the storage area of the particular mass storage device in connection with a read request associated with a device-to-device copy operation.
  - 16. The storage system of claim 12, further comprising:
    - data reconstruction logic to reconstruct data from a storage area of the particular mass storage device when the storage system has received a client-initiated read request directed to the storage area of the particular mass storage device, and the data from the storage area of the particular mass storage device has not been copied to a corresponding storage area of the spare mass storage device.
  - 17. The storage system of claim 12, further comprising:
    - data reconstruction logic to reconstruct data from a storage area of the particular mass storage device when the storage system has received a client-initiated read request directed to the storage area of the particular mass storage device, and the data from a corresponding storage area of the spare mass storage device is determined to be invalid.
  - 18. The storage system of claim 17, wherein, after the data reconstruction logic has reconstructed data from the storage area of the particular mass storage device in connection with a client-initiated read request directed to the storage area of the particular mass storage device, the read/write hardware logic is to write the reconstructed data to a corresponding storage area of the spare mass storage device.
  - 19. The storage system of claim 12, further comprising:
    - a dirty bitmap to indicate which storage areas of the spare mass storage device have been written with valid data since the spare mass storage device was initially allocated to replace the particular mass storage device, wherein the read/write hardware logic is to (i) analyze the dirty bitmap, and (ii) copy data from a storage area of the particular mass storage device to a corresponding storage area of the spare mass storage device only if the dirty bitmap indicates that the corresponding storage area of the spare mass storage device is not storing valid data.
  - 20. The storage system of claim 12, wherein, in response to receiving a client-initiated read request directed to a storage area of the particular mass storage device, the read/write hardware logic is to determine whether data at a corresponding storage area of the spare mass storage device is valid, and if so, the read/write hardware logic is to forward the client-initiated read request to the spare mass storage device for servicing.
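
The dirty bitmap of claim 19 might look something like the following. This is a minimal sketch under stated assumptions: one bit per storage area packed into a `bytearray`, disks modeled as Python lists, and hypothetical names (`DirtyBitmap`, `client_write`, `copy_pass`) that do not come from the patent.

```python
class DirtyBitmap:
    """One bit per storage area; a set bit means the spare holds valid data."""

    def __init__(self, n_blocks):
        self.bits = bytearray((n_blocks + 7) // 8)

    def mark(self, block_no):
        self.bits[block_no // 8] |= 1 << (block_no % 8)

    def is_set(self, block_no):
        return bool(self.bits[block_no // 8] & (1 << (block_no % 8)))


def client_write(spare, bitmap, block_no, data):
    # Forwarded client writes land on the spare and mark the area valid.
    spare[block_no] = data
    bitmap.mark(block_no)


def copy_pass(source, spare, bitmap):
    """Copy only areas the bitmap does not mark valid; return their indices."""
    copied = []
    for block_no, data in enumerate(source):
        if bitmap.is_set(block_no):
            continue  # would overwrite newer data with stale data
        spare[block_no] = data
        bitmap.mark(block_no)
        copied.append(block_no)
    return copied
```

Compared with reading a validity tag from the spare for every storage area, an in-memory bitmap lets the copy pass decide whether to skip an area without issuing a read to the spare first.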

21. A method for rapidly recovering data from a failing disk in a RAID disk group, the method comprising:
- allocating a target disk, selected from one or more spare disks, to replace the failing disk in the RAID-disk group;
  
  generating a disk cookie for the target disk, the disk cookie used to generate a data validity tag for indicating the validity of data written to the target disk;
  
  preventing the failing disk from servicing client-initiated access requests by forwarding client-initiated write requests to the target disk for servicing, and forwarding client-initiated read requests to the target disk for servicing only if the client-initiated read request is directed to data at a disk block of the failing disk that has been copied to a corresponding disk block of the target disk as part of a disk-to-disk copy operation; and
  
  systematically copying data from the failing disk to the target disk, as part of a disk-to-disk copy operation, without overwriting valid data on the target disk.
  - 22. The method of claim 21, further comprising:
    - if a client-initiated read request is directed to data at a disk block of the failing disk that has not been copied to a corresponding disk block of the target disk as part of a disk-to-disk copy operation, reconstructing the data at the disk block of the failing disk.
  - 23. The method of claim 22, further comprising:
    - generating a data validity tag associated with the reconstructed data; and
      
      writing the data validity tag and the reconstructed data to the corresponding disk block of the target disk.
  - 24. The method of claim 21, wherein systematically copying data from the failing disk to the target disk without overwriting valid data on the target disk includes:
    - reading data from a disk block of the failing disk;
      
      reading a data validity tag from a corresponding disk block of the target disk; and
      
      writing the data from the disk block of the failing disk to the corresponding disk block of the target disk only if the data validity tag indicates the data from the corresponding disk block of the target disk is invalid.
  - 25. The method of claim 21, further comprising:
    - if data from the disk block of the dead disk cannot be read, reconstructing data from the disk block of the dead disk.
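
The read routing of claims 21 through 23 could be sketched along these lines. It is a hedged illustration, not the claimed method itself: it borrows the copy progress indicator mentioned in claim 8, adds a set of blocks made valid ahead of the copy (an assumption), and takes the RAID reconstruction routine as a caller-supplied function. `Recovery`, `ahead`, `copy_next`, and `rebuild` are hypothetical names.

```python
class Recovery:
    def __init__(self, target, rebuild):
        self.target = target      # list standing in for the target disk
        self.rebuild = rebuild    # callable: block_no -> bytes (reconstruction)
        self.copy_progress = 0    # blocks below this index have been copied
        self.ahead = set()        # blocks made valid ahead of the copy

    def on_target(self, block_no):
        return block_no < self.copy_progress or block_no in self.ahead

    def client_write(self, block_no, data):
        # Client writes always go to the target disk.
        self.target[block_no] = data
        self.ahead.add(block_no)

    def client_read(self, block_no):
        """Return (data, route); the failing disk is never read for clients."""
        if self.on_target(block_no):
            return self.target[block_no], "target"
        data = self.rebuild(block_no)
        self.client_write(block_no, data)  # write reconstructed data back
        return data, "reconstructed"

    def copy_next(self, failing):
        """Advance the disk-to-disk copy by one block."""
        n = self.copy_progress
        if n not in self.ahead:
            self.target[n] = failing[n]  # skip blocks already valid on target
        self.ahead.discard(n)
        self.copy_progress = n + 1
```

Writing reconstructed data back to the target means a second read of the same block is served from the target directly, so each block is reconstructed at most once in this sketch.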

26. A machine-readable medium storing instructions that, when executed by the machine, cause the machine to:
- automatically allocate a spare mass storage device to replace a particular mass storage device in a redundancy group of mass storage devices, the particular mass storage device having been identified as incapable of servicing client-initiated access requests in a timely manner;
  
  generate a disk cookie for the spare mass storage device, the disk cookie used to generate a data validity tag for indicating the validity of data written to the spare mass storage device;
  
  forward client-initiated write requests directed to the particular mass storage device to the spare mass storage device for servicing; and
  
  initiate a device-to-device copy operation to systematically copy data from the particular mass storage device to the spare mass storage device without overwriting data on the spare mass storage device with stale data from the particular mass storage device.

Current Assignee: Network Appliance Incorporated (NetApp, Inc.)
Original Assignee: Network Appliance Incorporated (NetApp, Inc.)
Inventors: Loellyn Cassell, Atul Goel, Rajesh Sundaram, James Leong
Primary Examiner(s): Chu; Gabriel L
Application Number: US11/118,674
Time in Patent Office: 1,593 Days
Field of Search: None
US Class Current: 714/47.3
CPC Class Codes: G06F 11/1088 Reconstruction on already f...
