Method and apparatus for providing distributed sparing with uniform workload distribution in failures
First Claim
1. In a computer system having a plurality of storage units in which data and parity information is stored and retrieved under control of a controller so as to comprise a redundant storage unit array with distributed sparing such that the controller executes a rebuild process in response to a failure of a storage unit and thereby reconstructs the data and parity information with no loss of information, wherein the controller stores the data and parity information into regions comprising equal-sized address areas, a predetermined number of regions equal to a number n comprises a parity group having n-1 data regions and one parity region, from which parity information to be stored in the parity region is produced for the rebuild process, a method or operating the controller comprising the steps of:
- (A) determining if the number of storage units under control of the controller is equal to a number (n+f), wherein f is a maximum predetermined number of storage unit failures that will be permitted to occur with no loss of information, and providing an error indication if it is not;
(B) determining if the number of regions per storage unit is equal to a multiple of a number [n×
(n+1)×
(n+2)×
. . . ×
(n+f)] and providing an error indication if it is not;
(C) storing the parity and data information across the storage units in a read-write-modify process such that the parity and data information is stored in parity regions and data regions, respectively, in accordance with a storage pattern defined by the steps of(1) assigning the data regions and parity regions to storage locations of the first n storage units according to a distributed parity pattern until all regions of the first n storage units are assigned, and(2) repeatedly making an assignment of spare regions across the storage units in blocks of n×
(n+1)×
(n+2)×
. . . ×
(n+f-j) regions, where j=f, f-1, f-2, . . . , 3, 2, 1 such that the blocks of spare regions are assigned with rotation to the n+i storage unit with each repeated assignment, where i=1, 2, 3, . . . , f, until all regions of the n+f storage units are assigned; and
(D) storing and retrieving data and parity information from the storage units with the read-write-modify process in accordance with the storage pattern, whereby the data, parity, and spare regions are distributed such that the read-write-modify process workload will be equally distributed among the storage units during the storing and retrieving of information before, during, and after the rebuild process.
1 Assignment
0 Petitions
Accused Products
Abstract
Data regions, parity regions, and spare regions in a redundant array of storage units are distributed such that each storage unit in the array has the same number of parity regions before, during, and after a failure of one or more storage units such that there is a uniform workload distribution among the storage units during normal operation, during the rebuild process, and during operation after the rebuild and before repair or replacement of the failed unit. The array provides uniform workload distribution for one or more failures of storage units. The number of storage units and the number of storage regions per storage unit is specified once the number of regions in a parity group and the number of failures to be managed are specified. The data regions, parity regions, and spare regions are then placed to provide the uniform workload distribution. Data, parity, and spare regions also can be distributed across multiple redundant arrays to provide the failure-tolerant advantages with uniform workload distribution.
-
Citations
12 Claims
-
1. In a computer system having a plurality of storage units in which data and parity information is stored and retrieved under control of a controller so as to comprise a redundant storage unit array with distributed sparing such that the controller executes a rebuild process in response to a failure of a storage unit and thereby reconstructs the data and parity information with no loss of information, wherein the controller stores the data and parity information into regions comprising equal-sized address areas, a predetermined number of regions equal to a number n comprises a parity group having n-1 data regions and one parity region, from which parity information to be stored in the parity region is produced for the rebuild process, a method or operating the controller comprising the steps of:
-
(A) determining if the number of storage units under control of the controller is equal to a number (n+f), wherein f is a maximum predetermined number of storage unit failures that will be permitted to occur with no loss of information, and providing an error indication if it is not; (B) determining if the number of regions per storage unit is equal to a multiple of a number [n×
(n+1)×
(n+2)×
. . . ×
(n+f)] and providing an error indication if it is not;(C) storing the parity and data information across the storage units in a read-write-modify process such that the parity and data information is stored in parity regions and data regions, respectively, in accordance with a storage pattern defined by the steps of (1) assigning the data regions and parity regions to storage locations of the first n storage units according to a distributed parity pattern until all regions of the first n storage units are assigned, and (2) repeatedly making an assignment of spare regions across the storage units in blocks of n×
(n+1)×
(n+2)×
. . . ×
(n+f-j) regions, where j=f, f-1, f-2, . . . , 3, 2, 1 such that the blocks of spare regions are assigned with rotation to the n+i storage unit with each repeated assignment, where i=1, 2, 3, . . . , f, until all regions of the n+f storage units are assigned; and(D) storing and retrieving data and parity information from the storage units with the read-write-modify process in accordance with the storage pattern, whereby the data, parity, and spare regions are distributed such that the read-write-modify process workload will be equally distributed among the storage units during the storing and retrieving of information before, during, and after the rebuild process.
-
-
2. In a computer system having a plurality of storage units in which data and parity information is stored and retrieved under control of a controller so as to comprise at least one redundant storage unit array with distributed sparing such that the controller executes a rebuild process in response to a failure of a storage unit, wherein the controller stores information into storage unit regions comprising equal-sized address areas, a method of operating the controller comprising the steps of:
-
(A) determining if the number of storage units under control of the controller is equal to a number (kn+f), where k is a predetermined number of redundant arrays of storage units, n is a predetermined number of regions that comprises a parity group having n-1 data regions and one parity region, from which parity information to be stored in the parity region is produced for the rebuild process, and f is a predetermined maximum number of storage unit failures that will be permitted to occur with no loss of data or parity information; (B) providing an error indication if the determined number of storage units is not equal to the number (kn+f); (C) determining if the number of regions per storage unit is equal to a multiple of a number [n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f)] of regions per storage unit;(D) providing an error indication if the number of regions per storage unit is not a multiple of the number [n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f)];(E) storing the data and parity information across the storage units in a read-write-modify process such that the data and parity information is stored in data regions and parity regions, respectively, in accordance with a storage pattern defined by the steps of (1) making an assignment of data regions and parity regions to storage locations of the first kn storage units according to a distributed parity pattern for each array until all regions of the first kn storage units are assigned, and (2) repeatedly making an assignment of spare regions in blocks of n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f-j) regions, where j=f, f-1, f-2, . . . , 3, 2, 1 such that the blocks of spare regions are assigned with rotation to the kn+i storage unit, where i=1, 2, 3, . . . , f, until all regions of the kn+f storage units are assigned; and(F) storing and retrieving data and parity information from the storage units with the read-write-modify process in accordance with the storage pattern, whereby the data, parity, and spare regions are distributed such that the read-write-modify process workload will be equally distributed among the storage units during the storing and retrieving of information before, during, and after the rebuild process. - View Dependent Claims (3, 4, 5)
-
-
6. A method of operating a controller for a data storage array of a computer system having a plurality of storage units comprising a redundant array in which the controller stores and retrieves data and parity information in regions comprising equal-sized storage unit address areas organized into parity groups having a predetermined number of regions equal to a number n, with n-1 data regions and one parity region per parity group, calculates new parity information using a read-write-modify process for each modification of data information in a parity group using an exclusive-OR logical operator, and executes a rebuild process in response to a failure of one of the storage units such that information from a data region of a parity group otherwise lost due to the storage unit failure is recovered by using the exclusive-OR operator on the remaining information in the parity group, thereby ensuring that no information will be lost, the method comprising the steps of:
-
(A) determining if the number of storage units under control of the controller is equal to a number (kn+1), where k is a predetermined number of redundant arrays of storage units in the computer system, n is the number of regions that comprises a parity group, and f is a predetermined maximum number of storage unit failures that will be permitted to occur with no loss of data or parity information; (B) providing an error indication if the determined number of storage units is not equal to the number (kn+f); (C) determining if the number of regions per storage unit is equal to a multiple of a number [n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f)] of regions per storage unit, where k, n, and f are defined as recited in step (A);(D) providing an error indication if the number of regions per storage unit is not a multiple of the number [n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f)]; and(E) storing the data and parity information across the storage units in respective data regions and parity regions, along with dedicated spare regions, using a read-write-modify process such that the data regions, parity regions, and spare regions are distributed across the storage units such that spare regions are located in contiguous blocks in a pattern defined by repeated blocks or [n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f-j)] spare regions, where j=f, f-1, f-2, . . . , 3, 2, 1 such that the blocks of spare regions precess across the storage units with rotation to the kn+i storage unit, where i=1, 2, 3, . . . , f whereby the read-write-modify process workload will be equally distributed among the storage units during the storing and retrieving of information before, during, and after the rebuild process. - View Dependent Claims (7)
-
-
8. A computing system comprising:
-
a plurality of storage units having storage locations in which data and parity information is stored and retrieved across the storage units in regions comprising equal-sized storage location address areas, wherein the storage units comprise a predetermined number k of redundant arrays and wherein a predetermined number of regions equal to a number n comprises a parity group with n-1 data regions and one parity region, from which parity information to be stored in the parity region is produced using an exclusive-OR logical operator such that information from a data region of a parity group otherwise lost due to the storage unit failure can be recovered by using the exclusive-OR operator on the remaining information in the parity group in a rebuild process, thereby ensuring that no information will be lost; a data memory unit that stores data and parity information assigned to data regions and parity regions of the storage units; an XOR engine that performs the exclusive-OR logical operator; and a storage unit controller that controls operation of the data memory unit and XOR engine, and assigns data, parity, and spare regions to the storage units such that each storage unit has a number of storage regions equal to a number [n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f)] or multiples thereof, such that the number of data, parity, and spare regions is uniformly distributed among the storage units before, during, and after a failure of one or more of the storage units. - View Dependent Claims (9, 10)
-
-
11. A storage unit controller for use in a computing system that includes a plurality of storage units having storage locations in which the storage unit controller stores and retrieves data and parity information in regions comprising equal-sized storage location address areas organized into parity groups of predetermined size n with n-1 data regions and one parity region per parity group, the storage unit controller further comprising:
-
error check means for determining if the number of storage units in the computing system is equal to a number kn+f, where k is a predetermined number of redundant arrays into which the storage units are organized, n is the number of regions that comprises a parity group, and f is a predetermined maximum number of storage unit failures that will be permitted to occur with no loss of data or parity information, such that the error check means provides an error indication if the determined number of storage units is not equal to kn+f; an XOR engine that performs an exclusive-OR logical operation; and a controller central processing unit that controls operation of the XOR engine for the transfer of data and assigns data, parity, and spare regions to the storage locations of the storage units such that each storage unit has [n×
(kn+1)×
(kn+2)×
. . . ×
(kn+f)] storage locations or multiples thereof, and new parity information is calculated for each modification of a data region in a parity group using the exclusive-OR logical operation to permit recovering information lost from a data region of a parity group due to a storage unit failure by using the exclusive-OR operation on the remaining information in the parity group in a rebuilding process, such that the number of data, parity, and spare regions is uniformly distributed among the storage units before, during, and after a failure of one or more storage units. - View Dependent Claims (12)
-
Specification