SCALABLE DATA ACCESS SYSTEM AND METHODS OF ELIMINATING CONTROLLER BOTTLENECKS
1. A method of writing write data to at least one storage device comprising:
generating a write request in a computer processor;
passing the write request to a front-end storage controller (nFE_SAN);
copying the write data to a first and a second cache memory of the nFE_SAN;
generating a write lock request and transmitting the write lock request from the nFE_SAN over a network interconnect selected from a first and a second storage area interconnect to a back-end storage controller (nBE_SAN);
returning a write lock grant from the nBE_SAN to the nFE_SAN;
upon completing copying the write data to the first and second cache memory of the nFE_SAN and receiving the write lock grant from the nBE_SAN, the nFE_SAN providing a write complete signal to the computer processor;
copying the write data over a network interconnect selected from the first and second storage area interconnect to the nBE_SAN; and
writing, by the BE_SAN, the write data to the at least one storage device.
A data access system has host computers having front-end controllers (nFE_SAN) connected via a bus or network interconnect to back-end storage controllers (nBE_SAN), and physical disk drives connected via network interconnect to the nBE_SANs, to provide a distributed, high-performance, policy-based or dynamically reconfigurable, centrally managed data storage acceleration system. The hardware and software architectural solutions eliminate BE_SAN controller bottlenecks and improve performance and scalability. In an embodiment, the nBE_SAN (BE_SAN) firmware recognizes controller overload conditions, informs the Distributed Resource Manager (DRM), and, based on the optimal topology information provided by the DRM, delegates part of its workload to additional controllers. The nFE_SAN firmware and additional hardware use functionally independent and redundant CPUs and memory, which mitigates single points of failure and accelerates write performance. The nFE_SAN and FE_SAN controllers facilitate a Converged I/O Interface by simultaneously supporting storage I/O and network traffic.
- 3. A method of writing write data to at least one storage device comprising:
generating a write request in a computer processor; passing the write request to a front-end storage controller (nFE_SAN); copying the write data to a first and a second cache memory of the nFE_SAN; generating a write lock request and transmitting the write lock request from the nFE_SAN over a network interconnect selected from a first and a second storage area interconnect to a selected back-end storage controller (nBE_SAN), the nBE_SAN selected from a first and a second nBE_SAN of a plurality of nBE_SANs according to logical block addresses (LBAs) identification associated with the write request, where a first of the nBE_SANs is assigned to write data associated with a first and second LBAs, and a second nBE_SAN is assigned to write data associated with third LBAs, the first, second, and third LBAs being different and the first nBE_SAN being different from the second nBE_SAN; returning a write lock grant from the selected nBE_SAN to the nFE_SAN; upon completing copying the write data to the first and second cache memory of the nFE_SAN and receiving the write lock grant from the selected nBE_SAN, the nFE_SAN providing a write complete signal to the computer processor; copying the write data over a network interconnect selected from the first and second storage area interconnect to the selected nBE_SAN; and writing, by the selected BE_SAN, the write data to the at least one storage device.
This application is a continuation of U.S. patent application Ser. No. 15/482,726 filed Apr. 8, 2017 (Attorney Docket No. 588628), which claims priority to U.S. Provisional Patent Application No. 62/320,349 filed Apr. 8, 2016 (Attorney Docket No. 579645). This application is also related generally to improvements on the storage technology described in U.S. patent application Ser. No. 15/173,043 (now U.S. Pat. No. 9,823,866), which in turn claims priority to U.S. patent application Ser. No. 11/292,838 (now U.S. Pat. No. 8,347,010) and U.S. patent application Ser. No. 14/252,838 (now U.S. Pat. No. 9,527,190). The contents of all the aforementioned patent applications and patents are incorporated herein by reference.
This document generally relates to the field of storage controllers and Storage Area Network (SAN) systems. The goal of this storage technology is to boost performance by accelerating data access speed while improving overall datacenter efficiency to produce immediate savings in power, cooling costs, and floor space.
According to Moore's law, CPU performance improves about 2× over two years, or approximately 50% per year. Although Hard Disk Drive (HDD) capacity has historically improved at the same rate, data access speed improvements lag behind the CPU performance and HDD capacity improvements. Based on a number of published papers and study reports, CPU performance improves roughly 50% per year while storage I/O performance improves only about 5% per year. Thus, since 1980, CPU performance has increased over a quarter of a million times while the performance of legacy SAN systems (many of which are Redundant Array of Independent Disks (RAID) systems) has improved only about 12 times. Therefore, in a computer system there is a growing I/O performance gap between compute and storage performance, which may limit the maximum achievable utilization of such unbalanced computer systems. Even with new advances in Flash Memory and Solid-State Drive (SSD) technology, data access speed fails to match performance advances in CPU and memory technologies.
To further improve data storage scalability and performance, and to meet Big Data and Exascale computational requirements for ultra-high data access speeds and capacity, the technology disclosed herein leverages the architecture and methods we have disclosed in our previous U.S. Pat. No. 9,118,698, the disclosure of which is incorporated herein by reference. With the world's data more than doubling every two years, there is ever-increasing demand for more storage capacity and performance. Legacy SAN and RAID technologies available today are unable to meet performance requirements and, with prohibitively high cost, are out of reach for the majority of small and some medium-size businesses.
To reduce cost, organizations often utilize a large number of disjoint individual physical servers with one or more Virtual Servers (VMs), where each server may be dedicated to one or more specific applications, such as an email server, accounting packages, etc. However, such an approach introduces other issues, such as insufficient storage I/O performance, system and network administration and fault tolerance concerns, fragmented data storage, online storage and backup management problems, and system complexity. Data access and data sharing could be done at different levels, such as block level (shared block storage), with multiple hosts accessing the same disk drives or Logical Unit Numbers (LUNs), or file level, using legacy file systems like Network File System (NFS) and Common Internet File System (CIFS) or modern parallel file systems such as Lustre, GPFS, QFS, StorNext, etc.
In addition, TCP/IP protocol overhead together with network latency affects the performance of NFS/CIFS storage systems by significantly increasing access delays for network-attached storage when compared to locally attached disks. This slows down applications and lowers overall datacenter utilization, which may result in lower employee productivity. However, locally attached disk performance is usually lower than that of data storage subsystem implementations such as legacy SAN and RAID subsystems. Traditional SAN design and implementation, even though in many cases superior to locally attached disks, tends to significantly underutilize the aggregate data rate of all attached disk drives or SSDs by making use of time division multiplexing over, typically, a small number of relatively slow I/O ports (network links) between servers and attached SAN subsystem(s).
The present system is an improvement over the previously disclosed data storage architectures (Scalable Data Storage Architecture and Methods of Eliminating I/O Traffic Bottlenecks), by means of self-reconfiguring storage controllers and a multi-level storage architecture. In addition, the new hardware architecture and firmware algorithms provide additional data access performance scaling for bursty I/O traffic without a need for additional back-end hard disk drives (HDDs) or solid state drives (SSDs). For that reason it is called a “Scalable Data Access System” rather than a SAN. Thus, this new data access system makes possible further decoupling of I/O performance from storage capacity. The new architecture still facilitates parallel execution of the Front-End code on independent FE_SAN and nFE_SAN (new FE_SAN design) controllers and employs a locking mechanism in the Back-End code (executed on the BE_SAN and nBE_SAN (new BE_SAN design) controllers) to enforce data coherency and prevent data corruption.
To denote that either nFE_SAN or FE_SAN controller(s) may be utilized, the following abbreviated notation is used herein:
(n)FE_SAN=(nFE_SAN or FE_SAN).
Similarly, to denote that either nBE_SAN or BE_SAN controller(s) may be used, the following notation is used: (n)BE_SAN=(nBE_SAN or BE_SAN).
Furthermore, to accelerate execution of write requests and to improve reliability and resiliency, the new nFE_SAN controller hardware design enables a second copy of write data to be maintained locally without a need to first traverse the network fabric and store the second copy of the data on (n)BE_SAN controller(s). In order to prevent single points of failure, the nFE_SAN controller design includes two independent write-back cache memory buffers, each with corresponding processing, network-interface, and power components. Thus, a single nFE_SAN controller card has two operationally independent controllers: the FE_SAN controller and the redundant nFE_SAN-S sub-controller. In order to free up memory buffers with one copy of the data, de-staging of the nFE_SAN controller cache to (n)BE_SAN controller(s) is done as soon as possible, and upon its completion the duplicate memory buffer (with one copy of the data) may be released, while the nFE_SAN may optionally retain a single copy in the cache to permit ultrafast data access. Even though an nFE_SAN controller card has two different subsections, the FE_SAN controller and the nFE_SAN-S sub-controller, they may be presented to the host operating system (OS) as a single nFE_SAN controller while the underlying complexity is hidden away from the host OS and user.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more comprehensive description of embodiments of the system, as illustrated in the accompanying drawings in which like reference characters refer to similar parts throughout the different views.
Host read/write command execution and the algorithms to process storage I/O commands in parallel remain similar to those described in U.S. Pat. No. 9,118,698 entitled “Scalable Data Storage Architecture and Methods of Eliminating I/O Traffic Bottlenecks”. However, the new nFE_SAN 300 controller design enables a redundant copy of write data to be maintained locally on the nFE_SAN controller card without the need to first traverse the network fabric and store the data on BE_SAN controller(s) before returning a “COMMAND COMPLETE” message to the host. To further explain how the nFE_SAN controller improves write performance: if write-back-caching-with-mirroring is enabled, the write data is simultaneously transferred, using Copy-on-Write (CoW), to two independent memory buffers 302 and 352, and a “COMMAND COMPLETE” message is returned to the host OS as soon as the data transfer is completed and the lock(s) from the (n)BE_SAN controller(s) are acquired. Thus, the key feature of the nFE_SAN 300 controller design is that the two independent memory buffers 302 and 352 and corresponding processing 303 and 353 and network components 320 and 321 are functionally and physically self-sufficient, effectively creating two operationally independent controller cards (FE_SAN 301 controller and nFE_SAN-S 350 sub-controller). However, a single nFE_SAN 300 controller may be presented to the host OS while the underlying complexity is hidden away from the host OS and user. This design also mitigates single-point-of-failure issues. De-staging of the nFE_SAN 300 controller data cache to (n)BE_SAN controller(s) is done as soon as possible, and upon its completion one of the memory buffers 302 or 352 may be released, keeping one copy of the data in the nFE_SAN 300 controller.
However, in order to improve performance and support for virtual environments, the device driver together with nFE_SAN firmware may present multiple virtual nFE_SAN 300 controllers to the host OS, hypervisor manager or other virtualization layers, guest OSs, and their applications.
The SCSI standard defines the device interface model and SCSI command set for all SCSI devices. For purposes of this document, a storage device is a disk drive such as a traditional hard disk, a flash-memory based “thumb drive”, a solid-state disk (SSD), a non-volatile memory express (NVMe) device, an optical storage device such as a digital versatile disk (DVD-RW) or Blu-ray drive, or other block-oriented machine-readable drive as known in the computing industry. One of the key roles (functions) of a SCSI device interface is to protect data residing on the SCSI device from a misbehaving host. The SCSI command set is designed to provide efficient peer-to-peer (initiator-target) operation of SCSI devices (HBAs, disks, tapes, printers, etc.). However, with improvements in non-volatile memory, flash memory, and SSD technology, the new solid-state based persistent memory devices have outperformed the legacy SCSI interface, which became a performance bottleneck. Thus, in recent years a new set of emerging storage industry interface standards has been developed to overcome the shortcomings of the legacy SCSI standard. The new storage interface standards include SCSI Express (SCSIe), Non-Volatile Memory Express (NVMe), and other proprietary computer manufacturer peripheral device standards. (n)FE_SAN controllers can readily take advantage of the newly developed standards to further reduce latency and boost transfer rate between host memory and nFE_SAN controller memory 302 and 352. Furthermore, multiple FE_SAN/nFE_SAN controllers attached to a host via host or memory bus may be configured as a controller group conforming to legacy SCSI as well as new and emerging peripheral/storage device interface standards.
In an embodiment, on the host side, the nFE_SAN and FE_SAN controllers, in addition to conforming to Small Computer System Interface (SCSI) set of standards, also support new and emerging industry storage standards such as SCSI Express (SCSIe), Non-Volatile Memory Express (NVMe), as well as other proprietary computer manufacturer peripheral device standards. Still, multiple nFE_SAN/FE_SAN controllers attached to a host may be configured as a controller group conforming to legacy SCSI as well as new and emerging peripheral device and persistent data storage interface standards.
In an embodiment, the nFE_SAN controller is configured such that data is retained in a first write-back cache and transferred to (n)BE_SAN controllers over a first network when a first controller and power supply are operational, retained in a second write-back cache and transferred to (n)BE_SAN controllers over a second network when a second controller and power supply are operational, the first and second controllers and power supplies being independent. This configuration provides for storage and transmission of write-back cache data from the nFE_SAN controller to the (n)BE_SAN controller even if there is a failure of one unit selected from the group of the first and second controllers, the first and second network, and the first and second write-back cache, and first and second power supplies.
Furthermore, the new (n)FE_SAN controller firmware is configurable to logically partition and virtualize controller hardware and software resources to facilitate virtualized and cloud environments such as VMware, OpenBox, Microsoft Hyper-V, KVM, Xen, etc. Thus, a single physical (n)FE_SAN controller may present a number of different virtual controllers with different controller properties to different Virtual Machines (VMs) on the same physical server.
The (n)FE_SAN controller may be provided in different physical formats including, but not limited to, PCI, PCIe, SBus, or memory channel interface board formats. The (n)FE_SAN controller may also support a number of host buses such as PCI, PCIe, SBus, the IBM Coherent Accelerator Processor Interface (CAPI), QuickPath Interconnect (QPI), HyperTransport (HT), and various memory channel interface standards, to mention a few. In addition to basic functionality, such as SCSI device discovery, error handling and recovery, and some RAID functionality, each (n)FE_SAN interface card has firmware (software) that supports additional services and features such as compression, encryption, de-duplication, thin provisioning (TP), snapshots, remote replication, etc. The nBE_SAN controller (back-end) may be provided in a number of different physical formats, such as the standard disk drive enclosure format (including 19 inch rack and standalone enclosures), or as an integrated circuit that is easily adaptable to the standard interior configuration of a SAN controller.
In an embodiment, the hardware design of a nBE_SAN controller may include integration of one or more of nFE_SAN or FE_SAN controller(s) that enables the controller to operate as BE_SAN controller when servicing requests from other (n)FE_SAN controllers and to act as nFE_SAN or FE_SAN controller when initiating requests and sending data to other BE_SAN or nBE_SAN controller(s). The hardware design and new firmware features of nBE_SAN controller enable dynamic reconfiguration of the data access (storage) system and creation of multi-level (n)BE_SAN controller configurations.
A number of different storage interfaces may be supported at the back-end of a nBE_SAN controller, including legacy device standards such as SCSI, SAS, Advanced Technology Attachment (ATA), Serial ATA (SATA), FC, and emerging new standards such as NVMe, SCSIe, and other similar disk storage as well as PCI Express (PCIe), QuickPath Interconnect (QPI), HyperTransport (HT), CAPI, memory channel interfaces, etc.
Each (n)BE_SAN controller has firmware (software) that supports legacy features such as SCSI device discovery, fault management, RAID functionality, remote direct memory access (RDMA) capability, and error handling functionality as well as new features such as erasure coding, workload monitoring and when necessary workload sharing with other (n)BE_SAN controllers. The firmware on each (n)BE_SAN controller provides all necessary functionality to support legacy target disk storage interfaces such as SCSI, SAS, ATA, SATA, or FC disk drives, as well as PCIe or CAPI directly attached flash storage supporting emerging standards such as NVMe and SCSIe. In addition, (n)BE_SAN controller resources may be partitioned and virtualized to facilitate guest OSs and applications enabling user code to be executed on the (n)BE_SAN controllers.
In an embodiment, both (n)FE_SAN and (n)BE_SAN have firmware that provide Application Programming Interface (API) and Application Binary Interface (ABI) to allow host operating system (OS), guest OSs, and applications to memory-map a file or part of a file to directly access file and/or block data bypassing OS stack and SCSI/storage layers. Because of its hardware architecture and new firmware features (see
In an embodiment, (n)FE_SAN controllers are host dedicated resources while (n)BE_SAN controllers are shared resources. As with every shared resource, it is possible to encounter a condition in which the workload sent to a shared resource exceeds the capacity of that resource which may cause the resource to be overwhelmed and driven into saturation over a prolonged period of time. It is probable and expected that different parts of a storage system will be utilized at different levels and that the utilization will vary over time. Thus, if a particular I/O workload disproportionately targets a specific (n)BE_SAN controller or set of controllers exceeding the controller(s) performance limits, the overload condition can be alleviated by caching selected LBAs on other (n)BE_SAN controller(s) that are not experiencing an overload condition at the same time. Thus, when Distributed Resource Manager (DRM) detects excessive workload on a BE_SAN/nBE_SAN controller(s) (overload condition), it coordinates with other DRMs to redistribute (rebalance) the workload across one or more of additional (n)BE_SAN controllers by inserting them in the I/O path in front of the overloaded (n)BE_SAN controller. Using the same method, one or more of spare (n)BE_SAN controllers may be dynamically added to the (n)BE_SAN controller cluster configuration to further boost I/O processing and bandwidth performance in order to alleviate (n)BE_SAN controller(s) overload condition. Additionally, if a LUN is attached to two or more (n)BE_SAN controllers, DRM may find that the system is better utilized if the LUN or part of that LUN from the (n)BE_SAN that experiences overload condition is transferred to another (n)BE_SAN controller that has direct access to the LUN.
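By way of a non-limiting illustration, the rebalancing decision described above may be sketched in Python as follows. The function name `plan_offload`, the 0.9 utilization threshold, the `max_helpers` limit, and the controller labels are hypothetical and are used only for illustration; the actual DRM policy is not specified at this level of detail.

```python
from typing import Dict, List


def plan_offload(utilization: Dict[str, float],
                 threshold: float = 0.9,
                 max_helpers: int = 2) -> Dict[str, List[str]]:
    """Assign least-loaded controllers as caching helpers for overloaded ones.

    `utilization` maps a controller name to its load in the range 0.0-1.0.
    The threshold and helper limit are assumed values, not taken from the text.
    """
    overloaded = sorted(c for c, u in utilization.items() if u > threshold)
    # Candidate helpers, least loaded first.
    helpers = sorted((c for c, u in utilization.items() if u <= threshold),
                     key=lambda c: utilization[c])
    plan: Dict[str, List[str]] = {}
    for controller in overloaded:
        chosen, helpers = helpers[:max_helpers], helpers[max_helpers:]
        if chosen:
            # Helpers are inserted in the I/O path in front of `controller`.
            plan[controller] = chosen
    return plan


if __name__ == "__main__":
    load = {"BE_SAN_A": 0.97, "BE_SAN_B": 0.10, "nBE_SAN_C": 0.05}
    print(plan_offload(load))
```

In this sketch a controller is never chosen as a helper while it is itself above the threshold, which mirrors the requirement that helpers not be experiencing an overload condition at the same time.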
Furthermore, as (n)BE_SAN controller resources may be partitioned and virtualized to support guest OSs and applications, FE_SAN controller functionality may be implemented in one or more software modules in the firmware to be executed in a physical or logical partition. Yet, (n)FE_SAN chipsets may also be embedded within nBE_SAN controller for additional speed, reliability, and cost savings.
Applications running on the hosts 780 and 782 (
In an example
In this example, if all three BE_SAN/nBE_SAN 912, 932, and 950 controllers have the same performance, and assuming that the BE_SAN 932 and nBE_SAN 950 controllers were initially idle, by utilizing the described technique for the previously depicted scenario the peak aggregate performance for that specific workload could be increased as much as 300%. It is apparent that, if sufficient hardware resources are provided, significant performance gain may be attained during busy periods, and the performance gain is limited only by the available hardware resources. The long-duration average I/O bandwidth (data rate) directed to the BE_SAN controller 912 must, however, be lower than both the controller bandwidth and the aggregate bandwidth of the attached back-end storage devices.
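The arithmetic behind this example may be made explicit with a short Python sketch. The sketch reads "as much as 300%" as an upper bound of three times the single-controller peak when three equally capable controllers share the workload, which is an interpretation rather than a statement from the disclosure; the function names and the numeric rates are hypothetical.

```python
def peak_aggregate_rate(per_controller_rate: float, helper_count: int) -> float:
    """Upper bound on burst rate when idle helpers absorb part of the workload."""
    if helper_count < 0:
        raise ValueError("helper_count must be non-negative")
    return per_controller_rate * (1 + helper_count)


def is_sustainable(average_rate: float,
                   controller_bandwidth: float,
                   backend_bandwidth: float) -> bool:
    """The long-run average must stay below both limits of the owning controller."""
    return average_rate < min(controller_bandwidth, backend_bandwidth)


if __name__ == "__main__":
    # Hypothetical rates in GB/s, chosen only to exercise the functions.
    single = 2.0
    burst = peak_aggregate_rate(single, helper_count=2)
    print("burst-to-single ratio:", burst / single)
    print("sustainable:", is_sustainable(1.5, controller_bandwidth=2.0,
                                         backend_bandwidth=1.8))
```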
In the previously described example, all the configuration changes to LBA caching occur transparently with respect to the host computers 980 and 982. In addition, the BE_SAN 912 controller views the BE_SAN 932 and nBE_SAN 950 controllers as FE_SAN (nFE_SAN) controllers when receiving requests directed to the LBAs-2 and LBAs-3 segments. Furthermore, as described in paragraphs [0041 through 0045] of U.S. Pat. No. 9,118,698 (PRIOR ART), the utilized locking algorithm is the same between BE_SAN 932 and 912 (or BE_SAN 950 and nBE_SAN 912) as, for instance, between FE_SAN 972 and BE_SAN 912. Thus, with these new firmware features, the (n)BE_SAN 912, 932, and 950 controllers may play the (n)BE_SAN controller role when servicing requests from the (n)FE_SAN 972 and 974 controllers and assume the (n)FE_SAN controller's role when caching LBAs (data) from another (n)BE_SAN controller.
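One possible way to picture the LBA-based redirection is a lookup table that maps LBA segments to the controller currently serving them. The following Python sketch is illustrative only: the class name `LbaRouter` and the segment boundaries (0, 1000, 2000) are hypothetical, since the actual extents of the LBAs-1, LBAs-2, and LBAs-3 segments are not given here.

```python
import bisect
from typing import List, Tuple


class LbaRouter:
    """Maps an LBA to the controller currently serving its segment."""

    def __init__(self, segments: List[Tuple[int, str]]):
        # Each entry is (first_lba, controller); a segment extends up to
        # the first_lba of the next entry.
        self._segments = sorted(segments)
        self._starts = [start for start, _ in self._segments]

    def route(self, lba: int) -> str:
        index = bisect.bisect_right(self._starts, lba) - 1
        if index < 0:
            raise ValueError(f"LBA {lba} is below the first segment")
        return self._segments[index][1]


if __name__ == "__main__":
    # Hypothetical boundaries for the LBAs-1, LBAs-2, and LBAs-3 segments.
    router = LbaRouter([(0, "BE_SAN 912"),
                        (1000, "BE_SAN 932"),
                        (2000, "nBE_SAN 950")])
    print(router.route(1500))
```

Because only the table changes when a segment is reassigned, a redirection of this kind can remain invisible to the host computers, consistent with the transparency noted above.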
The system allows as many (n)BE_SAN controllers 950, 951, 952, 953, etc. to be added or deployed as needed to satisfy quality of service during busy periods. Typically, the additional nBE_SAN controllers may be the same as or similar to nBE_SAN 950. nBE_SAN 950 may make use of one or more SSDs, PCIe Flash Memory cards, or additional NVRAM or other persistent memory devices to mitigate a cache-full condition during extended periods of heavy writing. In addition, it may cache frequently read (accessed) LBAs to boost read performance. Furthermore, in Network Attached Storage (NAS) systems or file systems using file servers, the same concept may be used to alleviate a file server overload condition; thus, this technique may be universally applied to a broad range of applications.
In another example,
The original FE_SAN design
(nFE_SAN 300)=(FE_SAN 301)+(nFE_SAN-S 350)
The nFE_SAN-S 350 sub-controller is operationally, functionally, and electrically independent of the FE_SAN 301 controller.
To keep the cost down, the CPU performance and memory capacity requirements of the nFE_SAN-S 350 sub-controller may be significantly lower compared to the FE_SAN 301 controller, while the memory bandwidth requirement cannot be lower than the host-side bus (channel) bandwidth. Both controllers 301 and 350 are physically built into a single PCI, PCIe, or other standard card format.
Traditionally, the write-back-cache-with-mirroring feature requires maintaining at least two copies of the write data until the data is written to the back-end HDDs, SSDs, or other permanent storage media. In that respect, this implementation does not have a single point of failure, while data coherency is maintained via a locking mechanism that tracks where each piece of data may be found and retrieved from.
With the original design,
As depicted on
(nBE_SAN 370)=(BE_SAN 371+SSD 374)+(n)FE_SAN
The additional features of the nBE_SAN controller 370 enable the Data Access System to dynamically change its topology to adapt to the changing workload requirements. Thus, a (n)BE_SAN controller can be automatically and dynamically inserted in front of another (n)BE_SAN controller to boost its I/O processing capability by caching certain LBAs from the (n)BE_SAN controller. Thus, it behaves as BE_SAN 371 controller when receiving commands from (n)FE_SAN controller(s) and as nFE_SAN 300 when forwarding the data to a BE_SAN controller. nBE_SAN controllers also have new firmware with Distributed Resource Manager (DRM) and virtualization capability to run guest OS and user applications on the nBE_SAN controller.
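The dual role described above may be illustrated with a minimal Python sketch. The class `Controller` and its attributes are hypothetical simplifications; locking, mirroring, and the DRM are deliberately omitted. The sketch shows only that the same controller accepts writes in the back-end role and forwards them in the front-end role when it has been inserted in front of another controller.

```python
class Controller:
    """Simplified model of a controller that can act in either role."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream   # controller this one caches for, if any
        self.cache = {}            # lba -> data held locally
        self.disk = {}             # lba -> data written to back-end storage

    def write(self, lba, data):
        """Back-end role: accept a write from a front-end or another controller."""
        self.cache[lba] = data

    def destage(self):
        """Flush the cache: forward upstream (front-end role) or write to storage."""
        for lba, data in sorted(self.cache.items()):
            if self.upstream is not None:
                self.upstream.write(lba, data)
            else:
                self.disk[lba] = data
        self.cache.clear()


if __name__ == "__main__":
    owner = Controller("BE_SAN")
    helper = Controller("nBE_SAN", upstream=owner)  # inserted in front of owner
    helper.write(7, b"payload")    # helper behaves as a back-end controller
    helper.destage()               # helper behaves as a front-end controller
    owner.destage()
    print(owner.disk)
```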
Each (n)BE_SAN controller may incorporate a multi-core CPU server motherboard with sufficient memory and additional components such as battery backup, NVRAM and NVMe SSDs, etc. Thus, each (n)BE_SAN controller is capable of running VMware or any other virtualization software to allow user application to run on the (n)BE_SAN controllers. Therefore, (n)BE_SAN controller provides capability to locally run applications or execute user code close to the dataset that it needs to process.
As shown on
In alternative embodiments, instead of automatically offloading or splitting individual LBAs, heavily accessed data or files may be relocated from storage on one BE_SAN to a less-heavily loaded BE_SAN that holds part of the same logical drive. As disclosed in the original U.S. Pat. Nos. 8,347,010, 8,725,906, 9,118,698, and 9,361,038 entitled “Scalable data storage architecture and methods of eliminating I/O traffic bottlenecks,” the contents of which are incorporated herein by reference for disclosure purposes, logical and physical drive (volume) location may be dissociated.
FE_SAN 1808 has a first, primary, power system 1810, a first cache memory 1812, and a first network interface 1814. FE_SAN 1808 also has a secondary power system 1816, a secondary cache 1818, and in some embodiments a second network interface 1820. FE_SAN 1808 operates under control of FE_SAN firmware 1822 executing on a primary processor 1824 and FE_SAN-S firmware 1823 executing on a secondary processor 1826, primary processor 1824 powered by primary power system 1810 and secondary processor 1826 powered by secondary power system 1816. Primary network interface 1814 couples through a first storage area network interconnect 1830 to a primary network interface 1832 of a BE_SAN 1834, and second network interface 1820 couples through a second storage area network interconnect 1828 to a secondary network interface 1836 of BE_SAN 1834. BE_SAN 1834 has a primary cache 1838 powered by a primary power supply 1840, and a second cache 1842 powered by a second power supply 1844. BE_SAN 1834 has a primary processor 1846 and a secondary processor 1848 operating under control of BE_SAN firmware 1850 and 1851 respectively. BE_SAN 1834 also has two or more disk drive interfaces 1852, 1854, each coupled to one or more storage drives 1856, 1858, 1860, 1862, 1864, 1866. BE_SAN firmware 1850 is configured to operate one or more virtual drives on storage drives 1856-1866 using a Redundant Array of Independent Drives (RAID) protocol permitting reconstruction of data on the virtual drives should a failure occur of any one of storage drives 1856, 1858, 1860, 1862, 1864, 1866. Also, Erasure Coding (EC) or other method of data protection may be implemented on the system depicted in
Secondary cache 1818, secondary power system 1816, second network interface 1820, and secondary processor 1826 together form an FE_SAN-S as herein described.
Once write command 1704 is passed to FE_SAN 1808, FE_SAN 1808 passes a write lock request 1706 to BE_SAN 1834. FE_SAN 1808 also begins processing the write request by instructing 1708 driver 1805, including allocating 1709 and setting up any memory mapped buffers, to begin transferring 1710 data into both the primary and secondary cache 1812 and 1818 using “put (data)” operations. In embodiments where data is written directly to primary cache 1812, that data is copied (CoW) 1711 to secondary cache 1818.
Once the BE_SAN 1834 has locked the destination blocks of the RAID virtual drives maintained on storage devices 1856-1866 to prevent intervening writes from other FE_SANs (not shown), and has allocated buffer space in both its primary and secondary caches 1838, 1842, BE_SAN 1834 responds to FE_SAN 1808 with a lock-request-granted signal 1712.
Once lock-request-granted signal 1712 is received by FE_SAN 1808 and copies 1711 to secondary cache are complete, FE_SAN 1808 provides a write complete “ack” signal 1714 to driver 1805, which passes a command complete signal 1716 to the executing application 1804. Once lock-request-granted signal 1712 is received by FE_SAN 1808, FE_SAN 1808 begins transferring 1718 the data to BE_SAN 1834, where this data is mirrored in cache memories 1838, 1842.
Write data at this point may reside in duplicate in FE_SAN caches 1812, 1818 while data transfers 1718 continue. When all data is transferred to the BE_SAN, the BE_SAN sends a final acknowledgement signal 1720 to the FE_SAN, which may then release one or both copies in FE_SAN cache 1812, 1818.
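The ordering of the signals in the write sequence described above may be sketched with Python's `asyncio` as follows. This is a simplified model under stated assumptions: the event names are hypothetical labels that embed the reference numerals, `asyncio.sleep(0)` merely stands in for the network round trip and the data transfers, and completion of transfer 1718 is modeled after the acknowledgement even though the transfer may begin as soon as the lock is granted.

```python
import asyncio


async def handle_write(log):
    """Append event names to `log` in the order a simplified model produces them."""

    async def request_lock():
        log.append("lock_request_1706")
        await asyncio.sleep(0)          # stand-in for the network round trip
        log.append("lock_granted_1712")

    async def fill_caches():
        log.append("transfer_started_1710")
        await asyncio.sleep(0)          # stand-in for the host-bus transfer
        log.append("caches_filled_1711")

    # The lock request and the cache fill proceed concurrently.
    await asyncio.gather(request_lock(), fill_caches())

    # The host is acknowledged before the data has reached the BE_SAN.
    log.append("write_complete_1714")
    log.append("command_complete_1716")

    await asyncio.sleep(0)              # stand-in for de-staging transfer 1718
    log.append("data_transferred_1718")
    log.append("final_ack_1720")
    log.append("fe_cache_copy_released")
    return log


if __name__ == "__main__":
    for event in asyncio.run(handle_write([])):
        print(event)
```

The point of the sketch is the dependency structure: the write-complete signal waits on both the lock grant and the local cache copies, and on nothing else.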
In the event transfer is interrupted by failure of the FE_SAN primary cache 1812, primary power supply 1810, primary network interface 1814, or primary processor 1824 after write complete 1714 has been sent to driver 1805 but before all data has been transferred to the BE_SAN, the secondary processor 1826 continues to transfer data from the secondary cache 1818 to the BE_SAN to ensure completion of the write operation. In the event BE_SAN primary processor 1846, primary cache 1838, or primary power supply 1840 fails, BE_SAN secondary processor 1848 completes receiving the data over BE_SAN secondary network interface 1836 into secondary cache 1842 and completes writing data to storage devices 1856-1866. In an alternative embodiment, instead of completing the write, the data is retained in battery-backed secondary cache 1842 and alarms are sounded; data writing is completed upon repair of the BE_SAN.
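The front-end failover case may be illustrated with the following Python sketch. It assumes, purely for illustration, that the transfer proceeds block by block and that the secondary processor knows how many blocks the BE_SAN has already received; neither detail is specified above, and the function and parameter names are hypothetical.

```python
def destage_with_failover(primary_cache, secondary_cache, fail_after=None):
    """Transfer blocks to the back end, switching to the mirror on failure.

    `fail_after` is the index of the block at which the primary path fails,
    or None if it never fails. Returns (blocks received, finishing path).
    """
    received = []
    for index, block in enumerate(primary_cache):
        if fail_after is not None and index == fail_after:
            # Primary path is lost: the secondary path resumes from its own
            # mirrored copy at the first block not yet received.
            received.extend(secondary_cache[index:])
            return received, "secondary"
        received.append(block)
    return received, "primary"


if __name__ == "__main__":
    blocks = [b"b0", b"b1", b"b2", b"b3"]
    mirror = list(blocks)              # second copy held in the secondary cache
    data, path = destage_with_failover(blocks, mirror, fail_after=2)
    print(path, data == blocks)
```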
As can be seen from the diagram, data transfer between the host and the nFE_SAN controller is decoupled from the data transfer between the nFE_SAN and (n)BE_SAN controller(s). Furthermore, to complete a host write I/O transaction it is not necessary to copy the data between the (n)FE_SAN and (n)BE_SAN controllers, because two independent copies of the data are maintained on the nFE_SAN controller. Therefore, it is sufficient to obtain only the write lock(s) from the (n)BE_SAN controller(s) and complete the data transfer from the host before a “COMMAND COMPLETE” message can be sent back to the host. Once the data is mirrored between the nFE_SAN and (n)BE_SAN controller(s), the memory buffers containing the second copy of the data are released. Evidently, this technique facilitates additional performance improvements for small I/O writes because data transfer between the nFE_SAN and (n)BE_SAN controllers can be done asynchronously.
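The effect of this decoupling on host-visible write latency may be approximated with a simple model, sketched below in Python. The model is an assumption made for illustration: it treats the host transfer and the lock round trip as fully overlapped, and it treats a design that must mirror to the back end before completion as adding the full network transfer time. The function names and the sample timings are hypothetical and are not measurements.

```python
def latency_local_mirror(host_transfer, lock_round_trip):
    """Host-visible latency when both copies are kept on the nFE_SAN."""
    # Completion waits for whichever finishes last: the host transfer into
    # the two local buffers or the write-lock round trip.
    return max(host_transfer, lock_round_trip)


def latency_remote_mirror(host_transfer, lock_round_trip, network_transfer):
    """Host-visible latency when the second copy must first reach the back end."""
    return max(host_transfer, lock_round_trip) + network_transfer


if __name__ == "__main__":
    # Hypothetical timings in microseconds for a small write.
    host, lock, net = 10.0, 25.0, 40.0
    print("local mirror :", latency_local_mirror(host, lock))
    print("remote mirror:", latency_remote_mirror(host, lock, net))
```

Under this model the network data transfer drops out of the host-visible path entirely, which is why the benefit is most pronounced for small writes, where fixed per-transfer costs dominate.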
The foregoing description should not be taken as limiting. It should thus be noted that the matter contained in the above description or shown in the accompanying drawings should be interpreted as illustrative and not in a limiting sense. The following claims are intended to cover generic and specific features described herein, as well as all statements of the scope of the present method and system.