An elegant way to fix the state of an IBM Power host from the OpenStack level (PowerVC)
We recently noticed an interesting issue in a customer environment (it occurred after testing of the power supply branches in the datacenter), and I would like to share how this issue, most probably rare and related to that event, could be fixed.
PowerVC
Those who run workloads on IBM Power systems managed by PowerVC (1.x or 2.x) may know the situation in which a managed frame is reported in an Unknown (or Error) state in the PowerVC GUI, or in a down state in the CLI, even though all workloads on the frame continue running without any issues and without any impact on tenants.
Openstack
The picture above the title of this article shows the IBM hosts table output from an openstack (nova) command: the host with S/N ending in ..66EW is in state down although its status is enabled. This region runs 8 frames, and 1 of them was reported as down. As mentioned earlier, nothing had gone wrong with the frame itself and all workloads were running just fine. Only the state of the host was incorrect.
The issue can therefore only lie in the OpenStack representation of the host, which is in an uncertain state because the primary HMC connection for this particular managed host is missing in PowerVC.
In this case, other openstack compute commands such as nova hypervisor-list and nova hypervisor-show <host-id>, together with the openstack nova tables, give us more details about the health status of the host and about the host itself.
In this case the missing primary HMC connection is represented by hmc_uuid = 'c5eecf48-9699-39b4-8b18-36bad376b7a5'.
The openstack table NOVA.compute_node_health_status reveals an error message with a reason and a description that relate to the host but are tricky to understand:
I am not going into deeper detail here on how to work with openstack (on the PowerVC CLI) and the openstack databases (earlier DB2, later MySQL/MariaDB), which consist of several tables for the basic openstack services (nova, glance, cinder, neutron, keystone, swift, ceilometer, etc.) and their values and attributes.
MySQL
There is a neat and useful way to use the python/python3 modules of the particular PowerVC version to connect to its openstack databases with the correct credentials. See the shortened output below:
In the openstack table NOVA.ibm_hmc_hosts there are 2 records for the given host (with S/N ..66EW) that have the deleted_at attribute set to NULL (which is correct for all active hosts) but have no primary HMC console assigned: is_primary_hmc = 0 on both, which is not correct.
The SQL statements above return those 2 records, which cause the uncertain state (Error, Unknown, down), because normally only one such record is expected (and allowed) for each active host/frame.
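The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite table instead of the real PowerVC MariaDB/DB2 database, models only the columns mentioned in the text, and the column name host_name, the host names and the hmc_uuid values are hypothetical placeholders.

```python
import sqlite3

# In-memory stand-in for NOVA.ibm_hmc_hosts. Only the columns discussed in
# the text are modelled; host_name is an assumed column name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ibm_hmc_hosts ("
    " id INTEGER PRIMARY KEY, host_name TEXT, hmc_uuid TEXT,"
    " is_primary_hmc INTEGER, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO ibm_hmc_hosts VALUES (?, ?, ?, ?, ?)",
    [
        (10, "host_OK", "hmc-uuid-a", 1, None),    # healthy host (placeholder)
        (71, "host_66EW", "hmc-uuid-a", 0, None),  # affected host, no primary
        (82, "host_66EW", "hmc-uuid-b", 0, None),  # affected host, no primary
    ],
)


def active_non_primary_records(conn, host_name):
    """Return (id, hmc_uuid) of active records that are not marked primary."""
    query = (
        "SELECT id, hmc_uuid FROM ibm_hmc_hosts"
        " WHERE host_name = ? AND deleted_at IS NULL AND is_primary_hmc = 0"
        " ORDER BY id"
    )
    return conn.execute(query, (host_name,)).fetchall()


suspect_records = active_non_primary_records(conn, "host_66EW")
print(suspect_records)
```

With the sample rows, the affected host yields the two records with id 71 and 82, while the healthy placeholder host yields none.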
According to what is defined in the table NOVA.ibm_hmc_hosts for the other active IBM hosts that are in the OK state, only one such record should be defined there. Let's verify that:
This confirms what was suggested: is_primary_hmc=1 for all other hosts.
Fixing: Although it would probably be possible to delete one of the records with id=71 or id=82 (for the host with S/N ..66EW) and set is_primary_hmc to 1 for the remaining record, it is probably better not to delete anything from the openstack tables and only update the attribute is_primary_hmc=1 for the latest record, the one with id=82.
Conclusion: The key point was to update the appropriate attribute in one of the nova tables of the openstack database, and that was enough to fix the improper state reported by the PowerVC GUI. After several minutes the scheduler picked up the changed state, the host went into state OK (all green), and the PowerVC CLI also showed the host as 'up' (OK).
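A region-wide version of this check might look like the sketch below. It counts the primary HMC records per active host and flags every host whose count is not exactly 1. As before, it runs against an in-memory SQLite stand-in, and the column name host_name, the host names and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# In-memory stand-in for NOVA.ibm_hmc_hosts (assumed, simplified columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ibm_hmc_hosts ("
    " id INTEGER PRIMARY KEY, host_name TEXT, hmc_uuid TEXT,"
    " is_primary_hmc INTEGER, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO ibm_hmc_hosts VALUES (?, ?, ?, ?, ?)",
    [
        (5, "host_B", "hmc-uuid-a", 0, "2023-01-01 00:00:00"),  # soft-deleted
        (10, "host_A", "hmc-uuid-a", 1, None),
        (11, "host_B", "hmc-uuid-a", 1, None),
        (71, "host_66EW", "hmc-uuid-a", 0, None),
        (82, "host_66EW", "hmc-uuid-b", 0, None),
    ],
)


def primary_count_per_active_host(conn):
    """Map each host to the number of active records flagged as primary."""
    query = (
        "SELECT host_name, SUM(is_primary_hmc) FROM ibm_hmc_hosts"
        " WHERE deleted_at IS NULL"
        " GROUP BY host_name ORDER BY host_name"
    )
    return dict(conn.execute(query).fetchall())


counts = primary_count_per_active_host(conn)
# A host is suspicious when it does not have exactly one primary HMC record.
hosts_without_single_primary = sorted(h for h, n in counts.items() if n != 1)
print(counts)
print(hosts_without_single_primary)
```

Soft-deleted rows (deleted_at not NULL) are ignored on purpose, since only active records matter for the reported state.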
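The update described above could be sketched like this. It is a minimal illustration, not the exact statement used on the customer system: the table is an in-memory SQLite stand-in, host_name is an assumed column name, and the helper set_primary_hmc is hypothetical. The update runs inside a transaction and is rolled back unless exactly one active record is changed.

```python
import sqlite3

# In-memory stand-in for NOVA.ibm_hmc_hosts (assumed, simplified columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ibm_hmc_hosts ("
    " id INTEGER PRIMARY KEY, host_name TEXT, hmc_uuid TEXT,"
    " is_primary_hmc INTEGER, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO ibm_hmc_hosts VALUES (?, ?, ?, ?, ?)",
    [
        (71, "host_66EW", "hmc-uuid-a", 0, None),
        (82, "host_66EW", "hmc-uuid-b", 0, None),
    ],
)
conn.commit()


def set_primary_hmc(conn, record_id):
    """Mark one active record as primary; roll back if it does not match."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE ibm_hmc_hosts SET is_primary_hmc = 1"
            " WHERE id = ? AND deleted_at IS NULL",
            (record_id,),
        )
        if cur.rowcount != 1:
            raise ValueError(f"expected 1 updated row, got {cur.rowcount}")


# Only the latest record (id=82) is updated; nothing is deleted.
set_primary_hmc(conn, 82)
after = conn.execute(
    "SELECT id, is_primary_hmc FROM ibm_hmc_hosts ORDER BY id"
).fetchall()
print(after)
```

Restricting the update to a single id and checking the affected row count keeps the change as narrow as the text recommends: one attribute on one record.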