Leverage AIX lsmpio to discover some details about issues on SAN networks
Working in a customer environment on AIX 7.2 (TL5 SP3) systems, I came across an interesting problem on the SAN network. A running AIX system whose disks were published from one SAN storage cluster (V7K-1) redundantly via NPIV (dual VIOS, two fabrics, etc.) showed no problems.
When another SAN LUN was mapped and published from a second SAN storage cluster (V7K-2; note: different storage ports) to the same AIX client system and a standard discovery (cfgmgr) was run, the system reported read-only status for rootvg, and the paths to both the old and the new disks went into a degraded/failed state (shown as "Deg,Fai" in the output of lsmpio and lsmpio -ar).
-a flag lists parent Fibre Channel adapter information
-r flag adds information about remote ports
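As a rough illustration of how these flags might be used, the sketch below counts paths whose status flags include Deg or Fai and then prints the adapter and remote-port view. The disk name hdisk0 and the helper count_bad_paths are assumptions made for illustration, and the exact column layout of lsmpio output may differ between AIX levels.

```shell
#!/bin/sh
# Hypothetical helper: count lines of lsmpio output (read from stdin)
# whose path_status flags include Deg (degraded) or Fai (failed).
count_bad_paths() {
    grep -Ec '(^|[[:space:],])(Deg|Fai)([[:space:],]|$)' || true
}

DISK=${DISK:-hdisk0}   # assumed disk name; adjust to your system

# Run the real commands only on AIX, where lsmpio exists.
if [ "$(uname -s)" = "AIX" ]; then
    echo "bad paths on $DISK: $(lsmpio -l "$DISK" | count_bad_paths)"
    lsmpio -ar -l "$DISK"   # parent adapter and remote port details
fi
```

A non-zero count on a disk that was healthy before the last cfgmgr run is a hint to look at the adapter and remote-port details next.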
When the new SAN LUN was unmapped, the system recovered and the previous functional state was restored automatically. The key finding was that the problem occurred only when disks were published from two different V7K storage clusters. When disks were published from a single storage cluster, the problem did not appear.
Originally, the virtual FC (vFC) ports of the AIX client partition (dyntrk=yes, fc_err_recov=fast_fail), mapped via NPIV to individual V7K-1 storage ports, were as follows:
[Figure: LPAR vFC ports mapped to V7K-1 storage ports]
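The dyntrk and fc_err_recov settings mentioned above can be checked per FC protocol device. The following is a minimal sketch, assuming the devices are named fscsi*; the helper report_mismatch is hypothetical and simply prints any attribute that differs from the values named in the text.

```shell
#!/bin/sh
# Hypothetical helper: read "lsattr -El <dev>" output on stdin and print
# any attribute whose value differs from dyntrk=yes / fc_err_recov=fast_fail.
report_mismatch() {
    awk '($1 == "dyntrk" && $2 != "yes") ||
         ($1 == "fc_err_recov" && $2 != "fast_fail") { print $1 "=" $2 }'
}

# Run the real commands only on AIX.
if [ "$(uname -s)" = "AIX" ]; then
    # Assumption: the FC protocol devices are named fscsi0, fscsi1, ...
    for dev in $(lsdev -C -F name | grep '^fscsi'); do
        echo "== $dev"
        lsattr -El "$dev" -a dyntrk -a fc_err_recov | report_mismatch
    done
fi
```

No output under a device name means both attributes match the expected values.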
After expanding the disk configuration of the AIX client partition (SAN zoning was done automatically and correctly by PowerVC automation), the following connections were added to the V7K-2 storage ports:
[Figure: connections added to the V7K-2 storage ports]
After the next standard discovery (cfgmgr) on the AIX client partition, the system entered the read-only state ("read permission only"). The reported system error corresponds to the definition of the error message and value in the errno.h system header file. The only question was why.
[Figure: LPAR]
In the next picture, note that the adapter WWPN is 0, the paths are failed, and the SAN IDs are N/A for each port.
[Figure: LPAR]
Discussing these findings with IBM Support led to the following conclusion:
Decoding the errlog file, we see VFC4_ERR15 with VFC_ERR_LOC_248, indicating that the VIO servers forwarded a link-down event due to an SCN received for either Fabric or Domain; however, we do not see any related issue on the VIOSes.
It looks like something happened on the storage that led to this SCN, as there was no real link-down issue on the physical ports, and the VIOS did not report any link down.
The entries match a known issue when using NPIV, described in HIPER APAR IJ31604 and APAR IJ32895, which are currently missing on the host.
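To look for the same signature on another host, something like the sketch below could be used. It counts error log entries by label in the detailed errpt report and asks instfix whether the two APARs are installed. The helper count_label is hypothetical, and the label VFC4_ERR15 is taken as quoted above; the label spelling on a given system may differ.

```shell
#!/bin/sh
# Hypothetical helper: count entries carrying a given LABEL in
# "errpt -a" output read from stdin.
count_label() {
    awk -v label="$1" '$1 == "LABEL:" && $2 == label { n++ } END { print n + 0 }'
}

# Run the real commands only on AIX.
if [ "$(uname -s)" = "AIX" ]; then
    echo "VFC4_ERR15 entries: $(errpt -a | count_label VFC4_ERR15)"
    # Report whether the filesets for each APAR are installed.
    instfix -ik IJ31604
    instfix -ik IJ32895
fi
```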
Summary
Combo fix IJ32895m2a (devices.vdevice.IBM.vfc-client.rte) for APAR IJ32895 (DOMAIN RSCN CAUSE IO PATH FAILURE) and APAR IJ31604 (FABRIC FORMATTED RSCN CAUSE IO PATH FAILURE) has been issued for this specific problem on the SAN network. Applying the combo fix requires a reboot.
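One possible way to confirm whether the combo fix is already present is sketched below. It shows the installed fileset level and looks for the fix label in the interim fix list. The helper has_ifix is hypothetical, and the sketch assumes the label appears verbatim in the emgr -l output; the package file itself is not named here, so the install step is left as a comment with a placeholder variable.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given label appears in the
# interim fix listing read from stdin.
has_ifix() {
    grep -q -- "$1"
}

LABEL=IJ32895m2a   # fix label as named in the text

# Run the real commands only on AIX (emgr typically needs root).
if [ "$(uname -s)" = "AIX" ]; then
    lslpp -L devices.vdevice.IBM.vfc-client.rte
    if emgr -l 2>/dev/null | has_ifix "$LABEL"; then
        echo "$LABEL is listed as installed"
    else
        echo "$LABEL is not listed"
        # Preview, then install, with PKG set to the downloaded package path
        # (PKG is a placeholder, not a real file name):
        #   emgr -p -e "$PKG"
        #   emgr -e "$PKG"
    fi
fi
```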
References:
# IJ31604: FABRIC FORMATTED RSCN MAY CAUSE IO PATH FAILURE APPLIES TO AIX 7200-05, 06 December 2022
https://www.ibm.com/support/pages/apar/IJ31604
# IJ32895: DOMAIN RSCN MAY CAUSE IO PATH FAILURE WHICH NEVER RECOVERS
https://www.ibm.com/support/pages/apar/IJ32895