Selecting optimal repair strategy for mirrored files
First Claim
1. A method of repairing a disk group comprising a plurality of disks, the method comprising:
determining a threshold value for a repair time of the disk group, wherein the threshold value represents a duration during which the hard disk failure should be assumed to be transient;
maintaining a set of metadata for each hard disk in the disk group, wherein each hard disk stores a set of data blocks, each data block comprising a set of data stored on the hard disk, and wherein the set of metadata for a hard disk comprises information about whether each of the data blocks on the hard disk is current;
identifying with a computer a first hard disk in the disk group that has become unavailable, wherein the first hard disk comprises a first set of data blocks;
marking as stale, in the set of metadata for the first hard disk, each data block on the first hard disk to which a write attempt is made while the offline disk is unavailable; and
repairing the disk group, wherein repairing the disk group comprises:
(i) re-creating each of the data blocks marked as stale in the metadata for the first hard disk, if the first hard disk becomes available before the duration specified by the threshold value has expired; and
(ii) re-creating the first set of data blocks on one or more additional hard disks in the disk group, if the first hard disk does not become available before the duration specified by the threshold value has expired.
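The metadata and stale-marking steps of the claim can be illustrated with a minimal sketch. It assumes a two-way mirror and an in-memory model; the names `Disk`, `mirrored_write`, and `resync_stale` are hypothetical and do not come from the patent, which does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Disk:
    """Hypothetical model of one disk plus its per-block metadata."""
    name: str
    blocks: Dict[int, bytes] = field(default_factory=dict)
    stale: Set[int] = field(default_factory=set)  # ids of blocks that missed a write
    online: bool = True


def mirrored_write(primary: Disk, mirror: Disk, block_id: int, data: bytes) -> None:
    """Write to every available copy; mark the block stale on an unavailable copy."""
    for disk in (primary, mirror):
        if disk.online:
            disk.blocks[block_id] = data
            disk.stale.discard(block_id)
        else:
            # The write attempt cannot reach this disk, so only metadata changes.
            disk.stale.add(block_id)


def resync_stale(recovered: Disk, source: Disk) -> int:
    """Re-create only the stale blocks from the mirror copy; return how many were copied."""
    copied = 0
    for block_id in sorted(recovered.stale):
        recovered.blocks[block_id] = source.blocks[block_id]
        copied += 1
    recovered.stale.clear()
    return copied
```

In this sketch, if four blocks are mirrored across two disks and one block is rewritten while the second disk is offline, `resync_stale` copies a single block once that disk returns, rather than all four.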
Abstract
This document describes solutions that shorten the period of reduced data redundancy following transient disk failures that do not corrupt the disk. These solutions provide a way to estimate the most efficient repair strategy for the disk group, which helps to minimize the time during which data in the disk group remains unprotected. Merely by way of example, a threshold value might specify a duration during which a disk failure should be considered transient, such that if the disk is repaired within that duration, only the stale extents on the disk need to be re-created. If the disk cannot be repaired within that duration, the entire contents of the disk might be re-created on one or more other disks in the group.
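The threshold rule in the abstract can be sketched as a small decision function. This is an illustration under stated assumptions, not the patented implementation: times are plain seconds, and `Repair`, `choose_repair`, and `extents_to_copy` are hypothetical names introduced here.

```python
from enum import Enum
from typing import Optional


class Repair(Enum):
    RESYNC_STALE = "re-create only the stale extents on the returning disk"
    FULL_REBUILD = "re-create the disk's entire contents on other disks"


def choose_repair(offline_at: float, back_at: Optional[float], threshold: float) -> Repair:
    """Treat the failure as transient only if the disk returned before the threshold expired.

    back_at is None when the disk has not come back at all.
    """
    if back_at is not None and back_at - offline_at < threshold:
        return Repair.RESYNC_STALE
    return Repair.FULL_REBUILD


def extents_to_copy(strategy: Repair, stale_extents: int, total_extents: int) -> int:
    """Amount of copy work each strategy implies, counted in extents."""
    return stale_extents if strategy is Repair.RESYNC_STALE else total_extents
```

With a one-hour threshold, a disk that returns after ten minutes with 120 stale extents out of 10,000 needs 120 extents copied in this model. A disk that never returns needs all 10,000 re-created elsewhere, which is why a transient failure is cheaper to repair in place.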
123 Citations
16 Claims
1. A method of repairing a disk group comprising a plurality of disks, the method comprising:
determining a threshold value for a repair time of the disk group, wherein the threshold value represents a duration during which the hard disk failure should be assumed to be transient;
maintaining a set of metadata for each hard disk in the disk group, wherein each hard disk stores a set of data blocks, each data block comprising a set of data stored on the hard disk, and wherein the set of metadata for a hard disk comprises information about whether each of the data blocks on the hard disk is current;
identifying with a computer a first hard disk in the disk group that has become unavailable, wherein the first hard disk comprises a first set of data blocks;
marking as stale, in the set of metadata for the first hard disk, each data block on the first hard disk to which a write attempt is made while the offline disk is unavailable; and
repairing the disk group, wherein repairing the disk group comprises:
(i) re-creating each of the data blocks marked as stale in the metadata for the first hard disk, if the first hard disk becomes available before the duration specified by the threshold value has expired; and
(ii) re-creating the first set of data blocks on one or more additional hard disks in the disk group, if the first hard disk does not become available before the duration specified by the threshold value has expired.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for repairing a disk group, the system comprising:
a processor; and
a computer readable medium comprising a set of instructions executable by the processor, the set of instructions comprising:
(a) instructions to determine a threshold value for a repair time of the disk group, wherein the threshold value represents a duration during which the hard disk failure should be assumed to be transient;
(b) instructions to maintain a set of metadata for each hard disk in the disk group, wherein each hard disk stores a set of data blocks, each data block comprising a set of data stored on the hard disk, and wherein the set of metadata for a hard disk comprises information about whether each of the data blocks on the hard disk is current;
(c) instructions to identify a first hard disk in the disk group that has become unavailable, wherein the first hard disk comprises a first set of data blocks;
(d) instructions to mark as stale, in the set of metadata for the first hard disk, each data block on the first hard disk to which a write attempt is made while the offline disk is unavailable; and
(e) instructions to repair the disk group, wherein the instructions to repair the disk group comprise:
(i) instructions to re-create each of the data blocks marked as stale in the metadata for the first hard disk, if the first hard disk becomes available before the duration specified by the threshold value has expired; and
(ii) instructions to re-create the first set of data blocks on one or more additional hard disks in the disk group, if the first hard disk does not become available before the duration specified by the threshold value has expired.
12. A system, comprising:
a processor;
a disk group comprising a plurality of hard disks, each of the plurality of hard disks being in communication with the processor; and
a computer readable medium comprising a set of instructions executable by the processor, the set of instructions comprising:
(a) instructions to determine a threshold value for a repair time of the disk group, wherein the threshold value represents a duration during which the hard disk failure should be assumed to be transient;
(b) instructions to maintain a set of metadata for each hard disk in the disk group, wherein each hard disk stores a set of data blocks, each data block comprising a set of data stored on the hard disk, and wherein the set of metadata for a hard disk comprises information about whether each of the data blocks on the hard disk is current;
(c) instructions to identify a first hard disk in the disk group that has become unavailable, wherein the first hard disk comprises a first set of data blocks;
(d) instructions to mark as stale, in the set of metadata for the first hard disk, each data block on the first hard disk to which a write attempt is made while the offline disk is unavailable; and
(e) instructions to repair the disk group, wherein the instructions to repair the disk group comprise:
(i) instructions to re-create each of the data blocks marked as stale in the metadata for the first hard disk, if the first hard disk becomes available before the duration specified by the threshold value has expired; and
(ii) instructions to re-create the first set of data blocks on one or more additional hard disks in the disk group, if the first hard disk does not become available before the duration specified by the threshold value has expired.
Dependent claims: 13, 14
15. A computer program, embodied on a computer readable medium, for repairing a disk group comprising a plurality of hard disks, the computer program comprising a set of instructions executable by one or more computers, the set of instructions comprising:
instructions to determine a threshold value for a repair time of the disk group, wherein the threshold value represents a duration during which the hard disk failure should be assumed to be transient;
instructions to maintain a set of metadata for each hard disk in the disk group, wherein each hard disk stores a set of data blocks, each data block comprising a set of data stored on the hard disk, and wherein the set of metadata for a hard disk comprises information about whether each of the data blocks on the hard disk is current;
instructions to identify a first hard disk in the disk group that has become unavailable, wherein the first hard disk comprises a first set of data blocks;
instructions to mark as stale, in the set of metadata for the first hard disk, each data block on the first hard disk to which a write attempt is made while the offline disk is unavailable; and
instructions to repair the disk group, wherein the instructions to repair the disk group comprise:
(i) instructions to re-create each of the data blocks marked as stale in the metadata for the first hard disk, if the first hard disk becomes available before the duration specified by the threshold value has expired; and
(ii) instructions to re-create the first set of data blocks on one or more additional hard disks in the disk group, if the first hard disk does not become available before the duration specified by the threshold value has expired.
16. A system for repairing a disk group comprising a plurality of hard disks, the system comprising:
means for determining a threshold value for a repair time of the disk group, wherein the threshold value represents a duration during which the hard disk failure should be assumed to be transient;
means for maintaining a set of metadata for each hard disk in the disk group, wherein each hard disk stores a set of data blocks, each data block comprising a set of data stored on the hard disk, and wherein the set of metadata for a hard disk comprises information about whether each of the data blocks on the hard disk is current;
means for identifying a first hard disk in the disk group that has become unavailable, wherein the particular hard disk comprises a first set of data blocks;
means for marking as stale, in the set of metadata for the first hard disk, each data block on the first hard disk to which a write attempt is made while the offline disk is unavailable; and
means for repairing the disk group, wherein the means for repairing the disk group comprises:
(i) means for re-creating each of the data blocks marked as stale in the metadata for the first hard disk, if the first hard disk becomes available before the duration specified by the threshold value has expired; and
(ii) means for re-creating the first set of data blocks on one or more additional hard disks in the disk group, if the first hard disk does not become available before the duration specified by the threshold value has expired.
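Branch (ii) of the independent claims, re-creating the unavailable disk's blocks on other disks in the group, might be planned as sketched here. The sketch assumes two-way mirroring, so each affected block still has one surviving copy to read from. It also assumes a least-loaded placement rule, which is an illustrative choice and is not taken from the claims. `Placement` and `rebuild_elsewhere` are hypothetical names.

```python
from typing import Dict, List

# placement[block_id] lists the names of the disks holding a copy of that block.
Placement = Dict[int, List[str]]


def rebuild_elsewhere(placement: Placement, failed: str, disks: List[str]) -> Dict[int, str]:
    """Choose a new disk for every block that had a copy on the failed disk.

    Returns {block_id: target_disk} and updates placement in place.
    The surviving mirror copy is assumed to be the source for the new copy.
    """
    survivors = [d for d in disks if d != failed]
    # Current number of block copies on each surviving disk.
    load = {d: sum(d in holders for holders in placement.values()) for d in survivors}
    plan: Dict[int, str] = {}
    for block_id, holders in sorted(placement.items()):
        if failed not in holders:
            continue
        remaining = [d for d in holders if d != failed]
        # A new copy must not land on a disk that already holds this block.
        candidates = [d for d in survivors if d not in remaining]
        if not remaining or not candidates:
            raise RuntimeError(f"block {block_id} cannot be re-mirrored")
        target = min(candidates, key=lambda d: (load[d], d))
        plan[block_id] = target
        placement[block_id] = remaining + [target]
        load[target] += 1
    return plan
```

Spreading the new copies over the least-loaded survivors keeps any single disk from absorbing the whole rebuild, though a real disk group would also have to account for capacity and failure domains, which this sketch ignores.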
Specification