Why/where are we stuck with OSS virtualization storage?

In the datacenter of a medium-sized company you have a few well-established standards:

  • 19” rack
  • x86_64 servers (but AArch64 is looking quite interesting)
  • Cisco Ethernet switches (Broadcom chips)
  • Brocade Fibre Channel switches (Broadcom)
  • Emulex HBAs (Broadcom chipsets)
  • vSphere virtualization
  • FC-attached storage arrays

The year is 2024, and Broadcom has just finished the acquisition of VMware. Every CIO is now scratching their head and asking “now what?!” with regard to virtualization. vSphere was:

  • easy to use
  • scalable
  • supported by all hardware vendors as a primary target
  • universally supported by all enterprise storage vendors

I occasionally collaborate with a medium-sized company that has 10k virtual machines in production and about 30k in other environments. Their data amounts to about 2 PiB and is stored on NVMe arrays. The company buys its hardware on a 4-year operational lease, and this arrangement has worked brilliantly: they always have new hardware with all its advantages (their power envelope stayed basically constant while their effective computing needs grew by a factor of 5×).

Now, in 2024, vSphere is suddenly not an option anymore because it costs 3× as much. So I started looking into OpenStack again and, to my surprise, Broadcom and the storage vendors have completely messed it up.

OpenStack has a component named Cinder that is used for provisioning the storage backends for VMs (volume creation, migration, resizing, attachment, detachment, snapshotting, cloning, etc.). Basically everything that you do on a datastore in VMware.

This daemon and library allow vendors to implement said functionality on their storage systems. Cinder also includes a zone manager for Fibre Channel networks, so that it plays nicely with the vendor-provided storage drivers. As a side note, Kubernetes has an equivalent technology called CSI, and with regard to block storage, most of these notes apply to CSI as well.
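To make the scope concrete, here is a minimal sketch of that volume lifecycle from the API consumer's side, using openstacksdk; the cloud name ("mycloud"), the volume type and the resource names are placeholders rather than anything from a real deployment:

    # Minimal sketch of the volume lifecycle that Cinder manages, driven
    # through openstacksdk.  "mycloud", "nvme-tier1" and the resource
    # names are placeholders for illustration only.
    import openstack

    conn = openstack.connect(cloud="mycloud")

    # Create a 100 GiB volume on a given volume type (backend/tier).
    vol = conn.block_storage.create_volume(
        name="db-data-01", size=100, volume_type="nvme-tier1")
    conn.block_storage.wait_for_status(vol, status="available")

    # Snapshot it and clone a new volume from the snapshot.
    snap = conn.block_storage.create_snapshot(
        volume_id=vol.id, name="db-data-01-snap")
    conn.block_storage.wait_for_status(snap, status="available")
    clone = conn.block_storage.create_volume(
        name="db-data-01-clone", size=100, snapshot_id=snap.id)

    # Grow the original volume to 200 GiB; attachment/detachment is then
    # orchestrated through Nova, which calls back into the Cinder driver.
    conn.block_storage.extend_volume(vol, 200)

Every one of these calls ends up in the vendor's Cinder driver, which is exactly why a missing feature or a badly written driver hurts so much.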

In a virtualization environment where you have a few hundred servers, tens of SAN-attached storage systems and tens of thousands of virtual machines, you cannot depend on a human to manually create, resize, snapshot and destroy LUNs and attach them to virtual machines.

A long time ago, all storage systems implemented a universal API called the Storage Management Initiative Specification (SMI-S), and this CIM-based API allowed all of these operations in a vendor-neutral way. It was used with Hyper-V and DellEMC storage, but the open source community preferred not to use it with Cinder/CSI, as it was a very thick specification. Some time ago I wrote a Python Prometheus exporter for SMI-S storages and I nearly lost my mind getting it to work. It was tested against IBM StorWize, HPe 3Par, DellEMC VMAX/PowerMax, DellEMC VNX2, DellEMC PowerStore and PureStorage FlashArray, and it showed that every vendor's SMI-S implementation was lacking:

  • IBM didn't put timestamps on the metrics, doesn't populate the SampleInterval, doesn't map NPIV to FCPorts and doesn't allow arbitrary time granularity.
  • PureStorage was (and still is) slow and very basic.
  • VNX2 didn't expose block storage performance at all.
  • Unexpectedly, HPe 3Par actually offered the best implementation of block storage statistics, since it allowed a lot of the subsets, but it didn't support CIM Associations at all, so you couldn't discover the relationship between a statistic and its target (LUN/Port/Host, etc.).

Except for HPe 3Par, nobody allowed defining a custom subset of statistics and transferring them in bulk without crawling the entire CIM tree (which can be very slow with a lot of LUNs and kills the storage CPUs), nobody allowed CSV transfers, and so on. And in every implementation, basic SFP statistics (RX/TX levels, CRC errors, etc.) were completely omitted, even though we all need them. So it was clear why the open source community avoided it: it worked in narrow parameters, as used by Microsoft, and that was it.
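For a taste of why that exporter was painful, here is roughly what pulling block statistics out of an SMI-S provider looks like with pywbem; the provider URL, credentials and namespace are assumptions and differ per vendor:

    # Rough sketch of querying an SMI-S (CIM/WBEM) provider with pywbem.
    # The provider URL, credentials and namespace are placeholders; each
    # vendor uses a different namespace and exposes a different subset.
    import pywbem

    conn = pywbem.WBEMConnection(
        "https://smis-provider.example.com:5989",
        creds=("monitor", "secret"),
        default_namespace="root/ibm",   # vendor-specific, an assumption here
        no_verification=True)

    # One instance per LUN/port/node; crawling these on a large array is slow.
    for inst in conn.EnumerateInstances("CIM_BlockStorageStatisticalData"):
        # Some vendors leave StatisticTime or SampleInterval unset, which is
        # exactly the kind of gap complained about above.
        print(inst.get("ElementName"), inst.get("ReadIOs"),
              inst.get("StatisticTime"), inst.get("SampleInterval"))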

Now that it's 2024, I decided to look into OpenStack again, and I saw that it still doesn't fit modern needs (NVMe for modern guest operating systems), for quite a lot of reasons.

Cinder to Storage feature parity

First, there is the lack of feature parity between the storage arrays and their Cinder drivers. DellEMC PowerStore supports NVMe over TCP and Fibre Channel, but the latest release of the Cinder driver only supports the TCP transport. So iSCSI-style architectures that force storage traffic through your L3 routers are still a thing in 2024, even though workloads are incredibly latency-sensitive.

PureStorage supports NVMe over TCP, Fibre Channel and RoCEv2, but Fibre Channel is absent from the Cinder support.

IBM supports only SCSI/FCP and iSCSI, so they are completely absent from the NVMe stage.

When doing an RFP you need to be very careful: even if NVMe/FC is supported by the vendor's hardware, it may not be available in their Cinder or CSI driver.

Cinder Driver Implementations

Secondly, there is the poor design of some of the Cinder drivers. The IBM Cinder driver, instead of using the storage's native API (their own REST API), still uses an SSH client to issue CLI commands. This is not transactional, and you might get conflicts with other commands issued to the storage by human admins. Furthermore, it is limited to 4 parallel sessions at once. So if you're deploying an entire dev stack of 300 VMs with 1000 virtual disks (clones or new LUNs), you will run into trouble. It is just bad, lazy design.
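The irony is that these arrays do expose a REST API. Here is a minimal sketch of what volume creation could look like through it instead of SSH, assuming the usual token-based /rest endpoints; the hostname, credentials and pool name are placeholders, and the details vary by firmware level:

    # Sketch: creating a volume through the Spectrum Virtualize REST API
    # instead of shelling out over SSH.  Host, credentials, pool name and
    # endpoint details are placeholders and vary by firmware release.
    import requests

    BASE = "https://svc-cluster.example.com:7443/rest"

    # Authenticate once; the returned token is reused for subsequent calls.
    resp = requests.post(f"{BASE}/auth",
                         headers={"X-Auth-Username": "cinder",
                                  "X-Auth-Password": "secret"},
                         verify=False)
    token = resp.json()["token"]

    # Create a 100 GiB volume in a pool; this is the REST equivalent of
    # the 'mkvdisk' CLI command the SSH-based driver would run.
    requests.post(f"{BASE}/mkvdisk",
                  headers={"X-Auth-Token": token},
                  json={"name": "db-data-01", "mdiskgrp": "Pool0",
                        "size": 100, "unit": "gb"},
                  verify=False)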

The Storage Networking Industry Association has published the Swordfish standard, a sibling of the Redfish standard, which replaces IPMI and WBEM. It's a decent REST API that should be vendor-neutral and replace SMI-S. Redfish was adopted by all the major server vendors, so I expected all of these problems to go away as soon as Swordfish was adopted by the storage vendors. The only problem: nobody did that! PureStorage made some demos a long time ago but abandoned the topic. DellEMC made no attempt except for their toy ME5 line. IBM made no public attempt. So we are stuck with bad Cinder drivers.
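For contrast, the same provisioning against a Swordfish-capable array would be vendor-neutral, roughly like the sketch below; the array address and credentials are placeholders and the collection layout is an approximation of the published schema:

    # Sketch: vendor-neutral volume provisioning against a Swordfish
    # (Redfish-style) service.  The array address, credentials and the
    # storage resource id are placeholders.
    import requests

    BASE = "https://array.example.com"
    AUTH = ("cinder", "secret")

    # Find the Volumes collection hanging off a Storage resource...
    storage = requests.get(f"{BASE}/redfish/v1/Storage/1",
                           auth=AUTH, verify=False).json()
    volumes_uri = storage["Volumes"]["@odata.id"]

    # ...and POST a new Volume resource into it.
    requests.post(f"{BASE}{volumes_uri}", auth=AUTH, verify=False,
                  json={"Name": "db-data-01",
                        "CapacityBytes": 100 * 1024 ** 3})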

SR-IOV

All these issues could be solved if OpenStack and libvirt allowed the FC adapters to be virtualized using SR-IOV. Each guest VM could have its own tiny virtual-function HBA that supports NVMe and/or SCSI and talks directly to the storage. This is incredibly efficient from the storage perspective and allows storage statistics and QoS rules to be coupled directly with VMs. We have NPIV that can be used for zoning and should work with migrations, and the NVMe target login is based on NQNs anyway, so the WWPNs/WWNNs are irrelevant to the storage. Neutron, the OpenStack networking component, has a lot of supporting infrastructure for doing this with Ethernet adapters, but when it comes to storage, there's no love.
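The hypervisor-side plumbing already exists in the form of generic PCI passthrough; what is missing is the Cinder/Nova orchestration and FC-aware VF management around it. A minimal sketch of handing a VF to a guest with libvirt, where the VF's PCI address and the VM name are placeholders:

    # Sketch: hot-plugging a Fibre Channel virtual function into a guest
    # as a generic PCI hostdev via libvirt.  The VF's PCI address and the
    # VM name are placeholders; there is no FC/NPIV-aware SR-IOV pool in
    # OpenStack today, which is exactly the gap described above.
    import libvirt

    VF_HOSTDEV_XML = """
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/>
      </source>
    </hostdev>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("db-vm-01")

    # Attach the VF both to the live domain and to its persistent config,
    # so the guest talks to the fabric directly from now on.
    dom.attachDeviceFlags(
        VF_HOSTDEV_XML,
        libvirt.VIR_DOMAIN_AFFECT_LIVE | libvirt.VIR_DOMAIN_AFFECT_CONFIG)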

Emulex (another Broadcom company) and their main competitor, QLogic, have made no attempt to create the supporting infrastructure in OpenStack Cinder. Furthermore, the number of virtual functions is too small: while SR-IOV allows up to 64,000 VFs per PCIe adapter, Emulex is limited to 16 VFs per port. 128 would be the minimum we should have, since we're at more than 100 VMs per host in 2024.

Doing this would have a lot of benefits: the block I/O wouldn't go through the hypervisor anymore. Imagine 100 VMs with 300 LUNs, served from 3 storage arrays (3 tiers), with 4 paths on each of the two fabrics: you get 2,400 block device paths that need to be managed by the hypervisor's multipath daemon. In my experience, the slightest perturbation of either fabric will kill the hypervisor's multipath daemon.

Newer kernels simplify this for NVMe by moving multipathing from userland into the kernel, so you only get 300 LUNs in the hypervisor and 300 in the VMs, but it's still not an optimal approach, since the processing is effectively doubled. With SR-IOV, only the VMs see the LUNs, and the I/O is processed once, in the VM.
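On recent kernels you can verify that native NVMe multipathing is handling the paths; a small sketch, with device names purely illustrative:

    # Sketch: checking that the kernel's native NVMe multipathing is
    # active and listing the controllers (paths) the hypervisor still sees.
    from pathlib import Path

    native = Path("/sys/module/nvme_core/parameters/multipath").read_text().strip()
    print("native NVMe multipath:", native)   # "Y" on recent kernels

    # With native multipathing, each subsystem exposes a single block node
    # while the individual controllers (one per path) sit underneath it.
    for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
        transport = (ctrl / "transport").read_text().strip()   # e.g. "fc", "tcp"
        print(ctrl.name, transport)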

RDMA vs FC

Another missed opportunity for Fibre Channel is the lack of RDMA support. Sure, NVMe is supported, but supporting RDMA rather than just NVMe would have been a better idea.

You could run a lot of backend traffic over FC: Ceph, Gluster, IP (at least NTP and a few other services), etc. I could even imagine real-time broadcast protocols such as Dante audio working over RDMA, and anyone who has done broadcast over Ethernet knows what a pain it can be once the fabric grows. Furthermore, in some scenarios GPUs access storage directly for massive performance gains, and this opportunity is missed as well. FC could have offered to the masses everything that InfiniBand offered the elite more than a decade ago. Instead, they chose to offer just NVMe: the message transport, not the memory transport. As such, GPUs cannot use it directly.

Since Fibre Channel is fundamentally dictated by the Broadcom twins, Brocade and Emulex, if they don't support it, it doesn't exist.

Preliminary Conclusions

Everywhere open source virtualization is lacking, I see Broadcom. And storage is the missing key (plus, in VMware parlance: DRS, Storage DRS, DPM).

I hoped that Red Hat OpenStack would help, but it has zero added value compared to upstream OpenStack and is considerably more expensive than the new VMware subscriptions (if that is even possible).

How to fix storage for open source large scale virtualization

First, recognize that SCSI should die. We love it for scanners, tape drives, block storage, robotic media changers and many other things, but it's time to let it retire in glory and let NVMe take over.

  • Give up on the crappy drivers (that means you, IBM) and move to Swordfish. Nobody cares about the corner cases. Make the zoning and storage drivers use Swordfish to ensure vendor independence. Stop using SSH for anything in drivers: SSH is for interactive sessions, not for poor man's APIs wrapped around the equipment CLI. If you keep using SSH it's going to be slow and buggy in so many ways (think of a LUN whose name is 700 decomposed (NFD) Unicode characters and try transferring that as command output via SSH).
  • Allow the Cinder driver to tell the admin what the current storage limits are (number of volumes, snapshots, etc.), since these differ based on the array's firmware version and configuration.
  • Add SR-IOV to Fibre Channel: it improves performance dramatically and simplifies the life of the hypervisor, since it no longer has to care about storage I/O.
  • Add SR-IOV support to Cinder. It should provision the WWPNs and WWNNs to the VFs at VM startup, migration and resume, and send the FC logins.
  • Just like with iSCSI, create a standardised UEFI variable for storing the VM's NQN, so that the NQN can be provisioned by OpenStack instead of being a random string that doesn't match what Cinder provisioned (see the sketch after this list). Resist the temptation to do this in cloud-init, as that is bad design. We have firmware with NVRAM for a reason, and we happen to have standardised on UEFI, for better or worse.
  • Ceph and other distributed block storage systems should also offer an NVMe target, just like they do with iSCSI.
  • Storage vendors should allow a similar number of NQNs as WWPNs (IBM allows only 16 NQNs per I/O group, PureStorage allows 64).
  • If Fibre Channel support cannot be brought up to 2024 expectations, let's combine the performance and design of SPDK with the storage offload of the other Cinder drivers. Bonus: add RDMA as an FC ULP; there are so many uses for it, especially for distributed clusters (Ceph as a 3rd tier, or vSAN for VMware environments).
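To illustrate the UEFI variable idea from the list above: if such a variable were standardised, a guest could pick up its provisioned NQN straight from firmware. The variable name and GUID below are invented purely for illustration; no such standard variable exists today.

    # Hypothetical sketch: reading a provisioned host NQN from a UEFI
    # variable via efivarfs.  "NvmeOfHostNqn" and its GUID are invented
    # for illustration; no such standardised variable exists today.
    from pathlib import Path

    EFIVARS = Path("/sys/firmware/efi/efivars")
    VAR = EFIVARS / "NvmeOfHostNqn-12345678-1234-1234-1234-1234567890ab"

    if VAR.exists():
        # efivarfs prefixes the payload with a 4-byte attributes field.
        nqn = VAR.read_bytes()[4:].decode("utf-8", "ignore").rstrip("\x00").strip()
    else:
        # Fall back to the current nvme-cli convention.
        nqn = Path("/etc/nvme/hostnqn").read_text().strip()

    print("host NQN:", nqn)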

Once all these are implemented, I will completely agree with Juan Tarrío in his analysis on the death of Fibre Channel.

Comments

Endre Peterfi, Presales Staff Solutions Engineer at Splunk (9 months ago):

They had 10 years and have not replaced VMware en masse (which I remember being the big promise!). Meanwhile, cloud came around, so... it's all history IMHO.


Intriguing insights—adoption of new technologies often hinges on a blend of cost-effectiveness, performance, and ease of integration, so it'll be interesting to see how OpenStack/KVM evolves to meet these challenges in the enterprise space.
