Elegant way to fix the state of IBM Power host from the OpenStack level (PowerVC)

We recently noticed an interesting issue in a customer environment (it occurred after testing of the power supply branches in the datacenter), and I would like to share how this issue, most probably rare and related to that event, can be fixed.

Improper state of the IBM frame in PowerVC management GUI with health status OK

Those who run workloads on IBM Power systems managed by PowerVC (1.x or 2.x) may know the situation in which a managed frame is reported in an Unknown (or Error) state in the PowerVC GUI, or in a down state in the CLI, even though all workloads on the frame continue to run without any issues and without any impact on tenants.

Getting the PowerVC version and other info using the built-in powervc python modules

The picture above the title of this article shows the IBM hosts table output from an openstack (nova) command, in which the host with S/N ending in ..66EW is in state down although its status is enabled. This region runs 8 frames, and 1 of them was reported as down. As I mentioned earlier, nothing had gone wrong with the frame itself and all workloads were running just fine. Only the state of the host was incorrect.

The issue can therefore only lie in the OpenStack representation of the host, which is in an uncertain state because the primary HMC connection for this particular managed host is missing in PowerVC.

In this case, other openstack compute commands such as nova hypervisor-list and nova hypervisor-show <host-id> give more details about the health status of the host, taken from the openstack nova tables, as well as more details about the host itself.

In this case, the missing primary HMC connection is represented by hmc_uuid = 'c5eecf48-9699-39b4-8b18-36bad376b7a5'.

The openstack table NOVA.compute_node_health_status reveals an error message with a reason and a description that relate to the host but are tricky to understand:

error message in table compute_node_health_status of NOVA database

I am not going into deeper detail here on how to work with openstack (on the PowerVC CLI) and the openstack databases (earlier DB2, later MySQL/MariaDB), which consist of several tables for the basic openstack services (such as nova, glance, cinder, neutron, keystone, swift, ceilometer, etc.) with their values and attributes.

There is a neat and useful way to use the python/python3 modules of a particular PowerVC version to connect to its openstack databases with the correct credentials. See, for example, the shortened output below:

How to see and get openstack tables for database NOVA service (using python modules)

In the openstack table NOVA.ibm_hmc_hosts, there are 2 records for the given host (S/N ..66EW) that have the deleted_at attribute set to NULL (which is correct for all active hosts) but have no primary HMC console assigned (is_primary_hmc = 0, which is not correct).

Statements to select NULL records for particular host from specified NOVA table

The SQL statements above return those 2 records, which cause the uncertain state (Error, Unknown, down), because normally only one such record exists (and is allowed) for each active host/frame.

NULL records for particular host (duplicated, improper state)
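
The selection described above can be sketched roughly as follows. This is only a minimal illustration that uses an in-memory SQLite table as a stand-in for NOVA.ibm_hmc_hosts. The column host_name, the hmc_uuid values, the additional rows and the timestamp are assumptions made up for the example; only id, deleted_at and is_primary_hmc are taken from the description above, and the real table in the PowerVC database may be laid out differently.

```python
import sqlite3

# In-memory stand-in for NOVA.ibm_hmc_hosts (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ibm_hmc_hosts (
        id             INTEGER PRIMARY KEY,
        host_name      TEXT,     -- hypothetical column name
        hmc_uuid       TEXT,
        is_primary_hmc INTEGER,
        deleted_at     TEXT      -- NULL means the record is active
    )
""")
conn.executemany(
    "INSERT INTO ibm_hmc_hosts VALUES (?, ?, ?, ?, ?)",
    [
        # Sample rows; only ids 71 and 82 mirror the case in the article.
        (60, "..66EW", "hmc-a", 1, "2020-01-01 00:00:00"),  # soft-deleted
        (71, "..66EW", "hmc-a", 0, None),
        (82, "..66EW", "hmc-b", 0, None),
        (90, "other-host", "hmc-a", 1, None),
    ],
)

# Active (deleted_at IS NULL) records for the affected host.
active_rows = conn.execute(
    "SELECT id, hmc_uuid, is_primary_hmc FROM ibm_hmc_hosts "
    "WHERE host_name = ? AND deleted_at IS NULL ORDER BY id",
    ("..66EW",),
).fetchall()

for row in active_rows:
    print(row)
```

With this sample data the query returns the two active records (ids 71 and 82), both with is_primary_hmc = 0, while the soft-deleted record is filtered out.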

Judging by what the table NOVA.ibm_hmc_hosts holds for the other active IBM hosts that are in the OK state, only one such record should be defined per host. Let's verify that:

Statements used to select other active IBM hosts with defined primary HMC connections

The output confirms what was suggested: is_primary_hmc=1 for all other hosts.

all other records for particular hosts having HMC connections defined
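
A check of this kind could also be written as a single aggregate query that lists every host whose active records do not contain exactly one primary HMC. The sketch below again runs against an in-memory SQLite stand-in; the column host_name, the host names host-a and host-b, the hmc_uuid values and the function name hosts_without_single_primary are hypothetical, and the rule "exactly one primary per active host" is my reading of the observation above rather than a documented PowerVC constraint.

```python
import sqlite3


def hosts_without_single_primary(conn):
    """Return (host_name, active_records, primaries) for suspicious hosts."""
    return conn.execute("""
        SELECT host_name, COUNT(*), SUM(is_primary_hmc)
        FROM ibm_hmc_hosts
        WHERE deleted_at IS NULL
        GROUP BY host_name
        HAVING SUM(is_primary_hmc) <> 1
        ORDER BY host_name
    """).fetchall()


# In-memory stand-in for NOVA.ibm_hmc_hosts (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ibm_hmc_hosts (
        id INTEGER PRIMARY KEY,
        host_name TEXT,          -- hypothetical column name
        hmc_uuid TEXT,
        is_primary_hmc INTEGER,
        deleted_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO ibm_hmc_hosts VALUES (?, ?, ?, ?, ?)",
    [
        (10, "host-a", "hmc-a", 1, None),   # healthy host (made-up name)
        (11, "host-b", "hmc-a", 1, None),   # healthy host (made-up name)
        (71, "..66EW", "hmc-a", 0, None),   # affected host, no primary
        (82, "..66EW", "hmc-b", 0, None),
    ],
)

suspicious = hosts_without_single_primary(conn)
print(suspicious)
```

With this sample data only the host ..66EW is reported, with 2 active records and 0 primaries.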


Fixing: Although it would probably be possible to delete one of the records with id=71 or id=82 (for the host with S/N ..66EW) and set is_primary_hmc to 1 for the remaining record, it is probably better not to delete anything from the openstack tables and only update the attribute to is_primary_hmc=1, and only for the latest record with id=82.


SQL Statements used to update attribute is_primary_hmc for IBM hosts in incorrect state
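
As a rough sketch of such an update, the example below sets is_primary_hmc = 1 for a single record and commits only if exactly one row was changed. It runs against an in-memory SQLite stand-in, not the real PowerVC database; the column host_name, the hmc_uuid values and the helper name set_primary_hmc are assumptions for illustration, and the statements actually used in the customer environment may have looked different.

```python
import sqlite3


def set_primary_hmc(conn, record_id):
    """Mark one active ibm_hmc_hosts record as primary; return True on success."""
    cur = conn.execute(
        "UPDATE ibm_hmc_hosts SET is_primary_hmc = 1 "
        "WHERE id = ? AND deleted_at IS NULL",
        (record_id,),
    )
    if cur.rowcount != 1:
        conn.rollback()  # nothing matched: undo and report failure
        return False
    conn.commit()
    return True


# In-memory stand-in for NOVA.ibm_hmc_hosts (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ibm_hmc_hosts (
        id INTEGER PRIMARY KEY,
        host_name TEXT,          -- hypothetical column name
        hmc_uuid TEXT,
        is_primary_hmc INTEGER,
        deleted_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO ibm_hmc_hosts VALUES (?, ?, ?, ?, ?)",
    [
        (71, "..66EW", "hmc-a", 0, None),
        (82, "..66EW", "hmc-b", 0, None),
    ],
)
conn.commit()  # keep the sample rows even if a later update is rolled back

updated = set_primary_hmc(conn, 82)
state = dict(conn.execute("SELECT id, is_primary_hmc FROM ibm_hmc_hosts"))
print(updated, state)
```

Filtering on the record id and checking the affected row count keeps the change limited to the one record that is meant to become primary; record 71 stays untouched.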

Conclusion: The key point was to update the appropriate attribute in one of the nova tables of the openstack database, and that was enough to fix the improper state reported by the PowerVC GUI. Several minutes later the scheduler picked up the changed state, the host went into state OK (all green), and the PowerVC CLI also showed the host as 'up' (OK).


