Getting in touch with vPMEM volumes (virtual Persistent Memory volumes)
It has been some time since IBM introduced (in 10/2019) the vPMEM (virtual Persistent Memory) feature on Power9 systems, available from a certain hypervisor firmware level (FW930) onwards.
What is vPMEM?
vPMEM is the way in which the PowerVM hypervisor (with the help of an HMC at version V9R1 M940 or later) can create and address persistent memory volumes directly from the physically installed DRAM of a given frame (CEC) and present them to a specific LPAR. Each such volume appears in the operating system as a separate non-volatile memory device / disk.
Creating vPMEM volumes comes with limitations (max. 10TB in size, sizes of created volumes cannot be changed, creation only while the LPAR is stopped), rules (max. 4 volumes per LPAR = max_num_dram_volumes), disadvantages (no persistence across CEC restarts, non-zero capacity overhead for metadata), but also important advantages (persistence across LPAR reboots, I/O performance in the GB/s range).
What is this good for?
Unlike the readily available dynamic RAM (DRAM), a vPMEM device retains its contents even during a logical partition restart (hence the persistence). This no longer holds if an IPL of the entire CEC becomes necessary (which is usually a very rare exercise). The most significant benefit of a device of this type is the speed of data access, because it is de facto access to data directly in physical DRAM memory (via a device that looks like a disk to the operating system).
New versions of operating systems, AIX 7.3 and Linux distributions for ppc64le such as RHEL (9.2 and later) or SUSE (15.1 and later), can benefit from vPMEM availability in the same way. On AIX 7.3, a vPMEM device can be used like a traditional hard disk for direct data access, as a disk with LVM/JFS2, as a paging device, or for flashcache. It cannot be used as a boot disk or as a dump device. On Linux on Power (RHEL or SUSE for ppc64le) partitions on Power9 and Power10 systems, this device makes most sense in combination with SAP HANA, an in-memory column database whose Fast Restart feature (for planned maintenance tasks) is a natural fit for a vPMEM device, leveraging data persistence across LPAR reboots. SAP HANA then accesses persistent memory via memory-mapped files from filesystems.
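To illustrate the HANA side, enabling Fast Restart on top of vPMEM boils down to pointing the database at the DAX-mounted filesystems. A minimal sketch of the relevant global.ini excerpt, assuming example mount points /hana/pmem0 and /hana/pmem1 (parameter name as documented for SAP HANA 2.0 persistent memory support; verify against your HANA release):

[persistence]
basepath_persistent_memory_volumes = /hana/pmem0;/hana/pmem1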
Performance
As already mentioned, the main advantage of using vPMEM is the speed of access to gigabytes or even terabytes of data, combined with its availability (persistence) immediately after a partition (LPAR) restart. The picture below shows the output of the nstress tool (ndisk) running on a RHEL system against a big file placed in a filesystem on a vPMEM device: 16 simultaneous processes using a random read(75):write(25) ratio with a 1MB block size achieve over 10GB/sec at over 10000 IOPS. That is a lot for a 1MB block size, well beyond what traditional SAS or even NVMe disks commonly deliver. The performance benefit is significant.
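For reference, a run of that shape can be reproduced with an ndisk64 invocation along these lines. This is a sketch: the file path and size are examples, and the exact flags should be double-checked against ndisk64 -h for your nstress build:

# 16 processes, random access, 75% read / 25% write, 1MB blocks, 2 minutes
ndisk64 -C -f /pmemfs/bigfile -s 32G -M 16 -R -r 75 -b 1m -t 120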
Implementation details
HMC
The right Hardware Management Console (HMC) version and the Power9/10 frame (CEC) firmware (FW) level are the key prerequisites for maintaining vPMEM. Both the console GUI and the CLI allow assigning a vPMEM volume to a specific LPAR, which needs to be in the shutdown state.
From the HMC command line we can get a short overview of which LPARs already have vPMEM (dram) devices assigned, and how many, via the lshwres command:
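A minimal sketch, assuming the pmem resource type available on HMC V9R1 M940 and later (the managed system name is an example; verify the exact syntax in the lshwres man page on your HMC level):

# persistent memory volumes per LPAR on a given managed system
lshwres -m Server-9009-42A -r pmem --level lpar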
As mentioned earlier, each vPMEM volume appears in the operating system as a separate non-volatile memory device / disk. AIX handles vPMEM devices like traditional hdisks (the one small difference is that there is only one path to each such disk device), while Linux needs some system utilities installed to be able to manage those vPMEM volumes. Some details are described below.
AIX (7.3)
After presenting 4 vPMEM volumes from the HMC to a standard AIX 7.3 system, we can observe that the system boots with several new disk devices presented as traditional hdisks with the description Virtual Persistent Memory Disk. Digging down into the system configuration details, we can also see one more system device, nvmem0, with the description Special Use System Memory, and exactly 4 scm devices with the description Storage Class Memory:
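These devices can be listed with the standard AIX tooling, for example:

# disk devices; the vPMEM-backed ones appear as ordinary hdiskN
lsdev -Cc disk
# the special memory devices behind them
lsdev -C | grep -Ei 'nvmem|scm'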
The default configuration of the nvmem and scm devices in the ODM is as follows:
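A quick way to inspect it (nvmem0 as named above; scm0 is an example device name):

# attributes of the nvmem device
lsattr -El nvmem0
# raw ODM entry of the first scm device
odmget -q "name=scm0" CuDv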
We can create a standard LVM setup with VGs and LVs and JFS2 (not JFS) filesystems using these vPMEM devices; however, the system notifies and WARNs us that we are dealing with a different, specific kind of device which may lose its data once the CEC (frame) is rebooted (see pictures below):
crfs -v jfs2 -d'lvsap01' -m'/usr/sap' -A'yes' -p'rw' -a options='cio,dio,noatime,log=NULL' -a agblksize='4096' -a logname='INLINE' -a isnapshot='no' -a lff='yes'
NOTE:
crfs: File system block size is not allowed to be smaller than lv block size.
LVM Setup
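A minimal end-to-end sketch of such a setup, reusing the names from the crfs example above (the hdisk numbers and LV size are examples):

# scalable volume group on the vPMEM hdisks
mkvg -S -y vgsap01 hdisk4 hdisk5
# JFS2 logical volume
mklv -y lvsap01 -t jfs2 vgsap01 512
# filesystem as created above, then mount it
crfs -v jfs2 -d lvsap01 -m /usr/sap -A yes -a logname=INLINE
mount /usr/sap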
Linux (RHEL 9.2)
After presenting some vPMEM volumes from the HMC to a ppc64le Linux system (RHEL 9.2), and having the needed system utilities installed (ndctl and optionally numactl), we can observe that the system boots with several new block devices presented as follows:
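Getting the tooling in place and taking a first look (a sketch for RHEL):

# install the persistent memory tooling
dnf install -y ndctl numactl
# NVDIMM regions presented by the PowerVM hypervisor
ndctl list -Ru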
NDCTL
ndctl - Manage "libnvdimm" subsystem devices (Non-volatile Memory)
ndctl is a utility for managing the "libnvdimm" kernel subsystem.
After installing the necessary Linux system utilities, the management of the vPMEM devices is performed as follows:
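A sketch of the typical initialization flow (region and dimm names such as region0/nmem0 are examples; check ndctl list on your system):

# labels can only be written while the region is disabled
ndctl disable-region region0
ndctl init-labels nmem0
ndctl enable-region region0
# create an fsdax namespace; this exposes a /dev/pmemN block device
ndctl create-namespace -r region0 -m fsdax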
After initializing the persistent memory block devices (init-labels, enable-region, create-namespace), the partial output could look as follows:
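To verify what was created (a minimal check):

# namespaces and their backing block devices
ndctl list -Nu
lsblk | grep pmem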
When the new non-volatile memory (pmem) block devices are available in the Linux system, we can perform the usual filesystem creation tasks and mount the filesystem with some specific options, e.g. DAX (direct access): for block devices that are memory-like (which is the case for vPMEM), the DAX code removes the extra page copies (no page caching) by performing reads and writes directly against the storage device. The Linux filesystems that support DAX are ext2, ext4 and xfs. We use XFS for now, as it is a high-performance, journaling, 64-bit filesystem optimized for parallel access to larger files.
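A sketch of the filesystem creation and DAX mount (the mount point is an example; on older kernels XFS requires reflink to be disabled for DAX to work):

# XFS without reflink, then mount with direct access enabled
mkfs.xfs -m reflink=0 /dev/pmem0
mkdir -p /hana/pmem0
mount -o dax /dev/pmem0 /hana/pmem0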
Finally, XFS filesystems over the new (vPMEM) pmem block devices are created, and the output of mount can look like:
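Which can be reproduced with (the dax flag should be visible among the mount options):

mount -t xfs | grep pmem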
Using the standard NMON tool for Linux on Power, the new high-performance XFS filesystems (pmem0, pmem1) show up as follows:
Conclusion
The vPMEM feature of IBM Power Systems hardware presented above is worth taking into account and should be considered whenever performance is in the spotlight. This feature brings another interesting option to the table.
Persistent memory, also known as Non-Volatile Memory (NVM), is not only an IBM hardware topic: very similar concepts are known and available for VMware vSphere 7.0, where the ESXi hypervisor can significantly reduce storage latencies for VMs by creating PMem reservations for them.
For OpenStack implementations (attaching virtual persistent memory to guests), there is a note that starting in the 20.0.0 (Train) release, the virtual persistent memory (vPMEM) feature in Nova allows a deployment using the libvirt compute driver to provide vPMEM devices for instances, backed by physical persistent memory (PMEM).