Why/where are we stuck with OSS virtualization storage?
The year is 2024, and Broadcom has just finished the acquisition of VMware. Every CIO is now scratching their head and asking "now what?!" with regard to virtualization. In the datacenter of a medium-sized company you have a few well-established standards, and vSphere was one of them.
I occasionally collaborate with a medium-sized company that has 10k virtual machines in production and about 30k in other environments. The data amounts to about 2 PiB and is stored on NVMe arrays. The company leases its hardware on 4-year operational leases, and this arrangement has worked brilliantly: they always have recent hardware with all its advantages (their power envelope stayed basically constant while their effective computing capacity grew about 5×).
Now, in 2024, vSphere is suddenly not an option anymore because it costs 3× as much. So I started looking into OpenStack again and, to my surprise, Broadcom and the storage vendors have completely messed it up.
OpenStack has a component named Cinder that is used for provisioning the storage backends for VMs (volume creation, migration, resizing, attachment, detachment, snapshotting, cloning, etc.). Basically everything that you do on a datastore in VMware.
This daemon and its driver library allow vendors to implement that functionality for their storage systems. It also includes a zone manager for Fibre Channel networks, so that it plays nicely with the vendor-provided storage drivers. As a side note, Kubernetes has an equivalent technology called CSI, and with regard to block storage most of these notes apply to CSI as well.
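To make that concrete, here is a minimal sketch of the same lifecycle driven through the OpenStack SDK's cloud layer; the cloud name, server name and volume size are placeholders, and a real deployment would also pick a volume type that maps to a specific backend.

    # Minimal sketch of the Cinder volume lifecycle via the OpenStack SDK (cloud layer).
    # The cloud name "prod" and the server name "app01" are placeholders for illustration.
    import openstack

    conn = openstack.connect(cloud="prod")        # reads credentials from clouds.yaml

    vol = conn.create_volume(size=100, name="app01-data", wait=True)  # provision a LUN
    server = conn.get_server("app01")
    conn.attach_volume(server, vol, wait=True)    # Cinder + Nova handle the SAN plumbing
    snap = conn.create_volume_snapshot(vol.id, wait=True)             # point-in-time copy
    print(vol.id, snap.id)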
In a virtualization environment with a few hundred servers, dozens of SAN-attached storage systems and tens of thousands of virtual machines, you cannot depend on a human to manually create, resize, snapshot and destroy LUNs, or to attach them to virtual machines.
A long time ago, all storage systems implemented a universal API called the Storage Management Initiative Specification (SMI-S), and this CIM-based API allowed all of these operations to be performed in a vendor-neutral way. It was used by Hyper-V and DellEMC storage systems, but the open source community preferred not to use it with Cinder/CSI, as it is a very thick specification.
Some time ago I wrote a Python Prometheus exporter for SMI-S storage systems and I nearly lost my mind getting it to work. It was tested against IBM Storwize, HPE 3PAR, DellEMC VMAX/PowerMax, DellEMC VNX2, DellEMC PowerStore and PureStorage FlashArray, and it showed that every vendor's SMI-S implementation was lacking. IBM didn't put timestamps on the metrics, didn't populate SampleInterval, didn't map NPIV ports to FCPorts and didn't allow arbitrary time granularity; PureStorage was (and still is) slow and very basic; VNX2 didn't expose block storage performance at all; and so on. Unexpectedly, HPE 3PAR actually offered the best implementation of block storage statistics, since it supported many of the statistics subsets, but it didn't support CIM Associations at all, so you couldn't discover which target (LUN, port, host, etc.) a statistic belonged to.
Except for HPE 3PAR, nobody allowed defining a custom subset of statistics or transferring them in bulk without crawling the entire CIM tree (which is very slow with many LUNs and kills the storage CPUs), nobody allowed CSV transfers, and so on. And in every implementation, basic SFP statistics (RX/TX levels, CRC errors, etc.) were completely omitted, even though we all need them. So it was clear why the open source community avoided it: it worked within the narrow parameters that Microsoft used, and that was it.
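For reference, this is roughly what querying SMI-S block statistics looks like with pywbem; the host, credentials and namespace are placeholders, since every vendor uses its own namespace and exposes a different subset of the spec.

    # Sketch of pulling SMI-S block statistics with pywbem, roughly what the exporter did.
    # Host, credentials and the namespace below are placeholders, not real values.
    import pywbem

    conn = pywbem.WBEMConnection(
        "https://smi-provider.example.com:5989",
        ("monitor", "secret"),
        default_namespace="root/ibm",    # vendor-specific, e.g. root/ibm, root/emc, ...
        no_verification=True,            # lab-only: skip TLS certificate checks
    )

    # CIM_BlockStorageStatisticalData is the SMI-S class carrying per-LUN/port counters.
    for stat in conn.EnumerateInstances("CIM_BlockStorageStatisticalData"):
        print(
            stat.get("InstanceID"),
            stat.get("StatisticTime"),   # the timestamp some vendors simply leave empty
            stat.get("KBytesTransferred"),
            stat.get("TotalIOs"),
        )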
Now that it's 2024, I decided to look into OpenStack again, and I saw that it still doesn't fit modern needs (NVMe for modern guest operating systems) for quite a few reasons.
Cinder to Storage feature parity
First, there is the lack of feature parity between the storage arrays and their Cinder drivers. DellEMC PowerStore supports NVMe over both TCP and Fibre Channel, but the Cinder driver in the latest release only supports the TCP variant, so iSCSI-style architectures that force latency-sensitive storage traffic through your L3 routers are still a thing in 2024.
PureStorage supports NVMe over TCP, Fibre Channel and RoCEv2, but the Fibre Channel variant is absent from the Cinder driver.
IBM supports only SCSI/FCP and iSCSI, so they are completely absent from the NVMe stage.
When running an RFP you need to be very careful: even if the vendor's array supports NVMe/FC, that does not mean it is available in their Cinder or CSI driver.
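As an illustration, here is a hedged cinder.conf sketch for a PowerStore backend over NVMe/TCP. The addresses and credentials are placeholders, and the powerstore_nvme option name is taken from the driver documentation but is release-dependent, so verify it against the Cinder version you actually run; the point is that there is no equivalent switch for NVMe/FC at all.

    [DEFAULT]
    enabled_backends = powerstore-nvme-tcp

    [powerstore-nvme-tcp]
    # Placeholder management address and credentials for the array.
    volume_driver = cinder.volume.drivers.dell_emc.powerstore.driver.PowerStoreDriver
    volume_backend_name = powerstore-nvme-tcp
    san_ip = 10.0.0.10
    san_login = cinder
    san_password = secret
    # Release-dependent knob for NVMe/TCP; there is no counterpart that gives you NVMe/FC.
    powerstore_nvme = True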
Cinder Driver Implementations
Secondly, you have the poor design of some Cinder drivers. The IBM Cinder driver, instead of using the storage's native API (its own REST API), still uses an SSH client to issue CLI commands. This is not transactional, so you can get conflicts with commands issued to the storage by human admins, and it is limited to 4 parallel sessions at once. So if you're deploying an entire dev stack of 300 VMs with 1,000 virtual disks (clones or new LUNs), you will run into trouble. It is just bad, lazy design.
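A back-of-the-envelope sketch of what that 4-session cap costs on a 1,000-disk deployment; the per-operation latency is an assumption, not a measurement.

    # Toy throughput model for a bulk provisioning job funnelled through N CLI sessions.
    # The 3-second figure is an assumed SSH round trip per operation, not a measurement.
    volumes = 1000
    per_op_seconds = 3.0

    for sessions in (4, 32, 128):
        waves = -(-volumes // sessions)          # ceil(volumes / sessions)
        print(f"{sessions:3d} parallel sessions -> ~{waves * per_op_seconds:6.0f} s wall clock")

    # With 4 sessions the job takes ~750 s; with 128 it takes ~24 s, roughly 30x faster.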
The Storage Networking Industry Association has published the Swordfish standard, a sibling of the Redfish standard that replaces IPMI and WBEM. It's a decent REST API that should be vendor-neutral and replace SMI-S. Redfish was adopted by all the major server vendors, so I expected all these problems to go away as soon as Swordfish was adopted by the storage vendors. The only problem: nobody did that! PureStorage made some demos a long time ago but abandoned the topic, DellEMC made no attempt except for their toy ME5 line, and IBM made no public attempt. So we are stuck with bad Cinder drivers.
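For contrast, this is roughly what vendor-neutral volume discovery could look like against a Swordfish endpoint; the host, credentials and exact resource layout are assumptions, since the few implementations that exist organise their StorageServices differently.

    # Sketch of vendor-neutral volume discovery against a hypothetical Swordfish endpoint.
    # Host, credentials and resource paths are assumptions for illustration only.
    import requests

    BASE = "https://array.example.com/redfish/v1"
    AUTH = ("monitor", "secret")

    def get(path: str) -> dict:
        resp = requests.get(f"{BASE}{path}", auth=AUTH, verify=False, timeout=10)
        resp.raise_for_status()
        return resp.json()

    # Walk the StorageServices collection, then each service's Volumes collection.
    for svc_ref in get("/StorageServices")["Members"]:
        svc_path = svc_ref["@odata.id"].removeprefix("/redfish/v1")
        for vol_ref in get(f"{svc_path}/Volumes")["Members"]:
            vol = get(vol_ref["@odata.id"].removeprefix("/redfish/v1"))
            print(vol.get("Name"), vol.get("CapacityBytes"))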
SR-IOV
All these issues could be solved if OpenStack and libvirt allowed FC adapters to be virtualized using SR-IOV. Each guest VM could have its own tiny virtual-function HBA that speaks NVMe and/or SCSI and talks directly to the storage. This is incredibly efficient from the storage perspective and allows storage statistics and QoS rules to be coupled directly to VMs. We have NPIV, which can be used for zoning and should survive migrations, and NVMe target login is based on NQNs anyway, so the WWPN/WWNNs are irrelevant to the storage. Neutron, the OpenStack networking component, has plenty of supporting infrastructure for doing exactly this with Ethernet adapters, but when it comes to storage, there's no love.
Emulex (another Broadcom company) and their main competitor, QLogic, have made no attempt to create the supporting infrastructure in OpenStack Cinder. Furthermore, the number of virtual functions is too small: while SR-IOV allows up to 64,000 VFs per PCIe adapter, Emulex is limited to 16 VFs per port. 128 would be the minimum we should have, since we're at more than 100 VMs per host in 2024.
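To be clear, libvirt can already pass a PCI virtual function straight into a guest; what's missing is the orchestration around it. A sketch of the guest XML, with a made-up PCI address standing in for an HBA VF:

    <!-- Hand a Fibre Channel HBA virtual function directly to the guest.
         The PCI address (bus 0x3b, slot 0x02, function 0x1) is a placeholder for a VF
         created on the physical HBA; today nothing in Cinder will create or zone it for you. -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x3b' slot='0x02' function='0x1'/>
      </source>
    </hostdev>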
Doing this would have a lot of benefits, starting with the fact that block I/O would no longer go through the hypervisor. Imagine 100 VMs with 300 LUNs from 3 storage arrays (3 tiers), with 4 paths on each of the two fabrics: that's 2,400 block device paths the hypervisor's multipath daemon has to manage. In my experience, the slightest perturbation of either fabric will kill the hypervisor's multipath daemon.
Newer kernels simplify this for NVMe by moving multipathing from userspace into the kernel, so you only get 300 LUNs in the hypervisor and 300 in the VMs, but it's still not an optimal approach since the processing is effectively done twice. With SR-IOV, only the VMs would see the LUNs and the I/O would be processed once, inside the VM.
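As a side note, whether a hypervisor is using the in-kernel NVMe multipathing can be checked from sysfs, for example:

    # Check whether the kernel handles NVMe multipathing natively (ANA) rather than
    # via dm-multipath. This is the standard sysfs location of the nvme_core parameter.
    from pathlib import Path

    flag = Path("/sys/module/nvme_core/parameters/multipath").read_text().strip()
    print("native NVMe multipath:", flag)   # 'Y' means the kernel manages the paths itself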
RDMA vs FC
Another missed opportunity for Fibre Channel is the lack of RDMA support. Sure, NVMe is supported, but offering a general-purpose RDMA transport instead of just NVMe would have been a better idea.
You could then run a lot of backend traffic over FC: Ceph, Gluster, IP (at least NTP and a few other services), and so on. I could even imagine real-time broadcast protocols such as Dante audio working over RDMA, and anyone who has done broadcast over Ethernet knows what a pain it becomes as the fabric grows. Furthermore, GPU workloads in some scenarios access storage directly for massive performance gains, and that opportunity is missed as well. FC could have offered to the masses everything that InfiniBand offered the elite more than a decade ago. Instead, they chose to offer just NVMe: the message transport, not the memory transport. As such, a GPU cannot use it directly.
And since Fibre Channel is fundamentally dictated by the Broadcom twins, Brocade and Emulex, if they don't support something, it doesn't exist.
Preliminary Conclusions
Everywhere open source virtualization is lacking, I see Broadcom. And storage is the missing piece (plus, in VMware parlance, DRS, Storage DRS and DPM).
I hoped that Red Hat OpenStack would help, but it has zero added value compared to upstream OpenStack and is considerably more expensive than the new VMware subscriptions (if that is even possible).
How to fix storage for open source large scale virtualization
First, recognize that SCSI should die. We love it for scanners, tape drives, block storage, robotic media changers and many other things, but it's time to let it retire in glory and let NVMe take over.
Once all of this is implemented, I will completely agree with Juan Tarrío and his analysis of the death of Fibre Channel.