IBM PowerHA SystemMirror clusters - how to maintain and not go crazy
Michal Wiktorek
Unix Systems Administrator | AIX/Linux | IBM Power | Santander Bank Polska
PowerHA clusters (previously known as HACMP) provide extensive capabilities for managing availability and handling failures. Clusters are effective tools for system recovery and data integrity, but improper use and a lack of understanding can also lead to unexpected consequences.
I've decided to share my experiences and make someone's life easier, reducing the stress of maintaining this product :)
I tried to keep the text concise and focused on the most practical aspects. If you find the text helpful, leave me a comment or a like – it will give me feedback on whether to publish additional parts of the material.
Remember that I'm writing from a customer perspective, not as an IBM employee. Always prioritize IBM recommendations and official documentation over my advice. In the text, I refer to PowerHA version 7.2 for the AIX system.
I apologize in advance for any language errors, as this text has been translated from my article in Polish.
Simpler means better
Simplicity is a luxury we may not always be able to afford, but if we have the opportunity or choice, let's allow ourselves this luxury :)
PowerHA clusters offer many features and configuration options, but remember that the simplest cluster will be the easiest to maintain. Why? Because the less unique the cluster is, the greater the probability that a potential issue we experience has already been encountered by many clients around the world, and for support (and for us) it will not only be easier to analyze the configuration, but a fix or solution may already exist. Personally, I've had many occasions to be the world's first client to encounter a particular problem, and being such a pioneer is not a source of pride but rather frustration. It's essential to remember that any feature that is less popular in global configurations exposes us to being "testers" within a much smaller population.
I think we can compare it to owning an exotic car - it may look nice, but the problem arises when it turns out that no mechanic knows how to service it.
Above all, let's adhere to official documentation and try not to complicate configurations with non-standard elements.
If we have multiple clusters in the organization, let's strive for all of them to follow a single configuration standard and to be created in the same way.
Don't play the hero – open a case
You might be a highly skilled administrator, and you may have mastered cluster configurations, but there's always a chance of encountering a bug or a problem that will be exceptionally challenging to solve on your own.
If you come across an error or situation for the first time and you're working within a limited downtime window, don't waste time – start generating the cluster SNAP as early as possible (snap -ec on the primary node) and simultaneously open a case with IBM Support. Provide the essential information – when the incident occurred and what error or issue you're facing. If you have another specialist to assist you, share the crucial information with them and ask them to open a case, so you can concurrently conduct an analysis or work on restoring the system to its proper state (use multithreading like Power servers ;)).
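For reference, collecting the cluster snap is a one-liner run as root on the affected node; the -e flag gathers PowerHA/HACMP-specific data and -c packs everything into a compressed archive (typically placed under /tmp/ibmsupt, so make sure there is enough free space there):
# snap -ec
You can then attach the resulting archive directly to the case.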
If the problem occurs during a several-hour maintenance window, it's possible that support will provide a solution before the window's end.
If you manage to open a case early with the appropriate severity level and submit the SNAP file, you can still conduct your analysis. Meanwhile, someone from the support team may already be working on your behalf. It's entirely possible that you'll find a solution to the problem before the support engineer does, but it's worse if that doesn't happen, and you're left with nothing at the end of the service window.
If you manage to solve the problem independently before the IBM engineer, I recommend documenting in the service request before its closure what resolved the issue. It's very likely that six months later, we won't remember how the problem was solved, but a solution in the Case Portal database will remain.
Keep an eye on PowerHA versions and check for known APARs
ALWAYS ensure that you are working with a supported version of PowerHA, and don't wait until the last moment to upgrade. If you encounter a problem in an outdated version, support will likely have to reject your request. It's very possible that the issue you face has already been addressed in new, supported versions. Use the following page to avoid missing the end-of-support date for a particular version.
Is it worth upgrading the cluster to the latest version?
You should be aware that not everything can be tested in the lab before releasing a version to customers, and the first ones to use it may become testers for others – this applies to various products in many industries, not just software.
In my opinion, it's worth waiting until a new Technology Level (TL) has at least the first or second Service Pack. I would be cautious about using GA versions for production systems.
Personally, I prefer when other clients encounter a problem before me for which there is no APAR/Fix yet, and I can install a Service Pack that already includes it. Of course, one should not go to the extreme and wait ages for the version to "mature," as this approach is likely to be much worse, and we risk a shorter support time for a given release.
Remember to periodically check for updates for AIX and PowerHA, especially of the HIPER type, on the following pages. To stay up to date, it's advisable to subscribe to the IBM newsletter.
Back up your cluster before performing maintenance
The AIX system is equipped with tools such as mksysb or alt_disk_copy, which are absent in popular Linux distributions (even those labeled as Enterprise).
It is definitely worth using them, as they can significantly reduce downtime and restore the system to a usable state.
The mksysb tool allows us to create a bootable image of the AIX system, so it's advisable to run it periodically and directly before engaging in service activities and making changes to the system. Save the image in a location that will remain accessible outside of the system being backed up. The procedure for restoring the system from a mksysb image via the NIM server is not much different from installing a new system. An example of usage is provided below:
# mksysb -m /aix_system01.mksysb
Using the alt_disk_copy tool allows for creating a clone of the operating system on a backup disk. If we want to revert changes made to the operating system, or if for some reason it cannot boot at all, simply booting the LPAR from the backup disk is sufficient; the repair and rollback of changes then only cost us a system restart. Example of usage (without modifying the bootlist):
# alt_disk_copy -B -d hdisk0
Snapshot of the cluster configuration - the current cluster configuration can be saved to a file. It's advisable to store it on disk space outside the cluster and perform it before any operation related to upgrading the cluster version or making significant configuration changes.
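A minimal sketch of taking such a snapshot from the command line with clmgr is shown below; the snapshot name and description are placeholders, and attribute names may differ slightly between PowerHA 7.2 levels:
clmgr add snapshot pre_change_snap DESCRIPTION="before TL/SP upgrade"
The resulting snapshot files (by default under /usr/es/sbin/cluster/snapshots) should then be copied to storage outside the cluster nodes.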
Personally, I always perform these three safeguards and recommend them to you as well. Each of them can be executed without interruption to the operation of the system.
If you have the capability, you can attempt to create a copy at the disk array level (not only the system disk but also disks with application data). However, keep in mind that such an operation may freeze IO for a certain period, which, in turn, could be interpreted by the cluster or the Dead Man Switch mechanism as disk unavailability, triggering actions (depending on set wait times for storage access). Consult with your Storage Team to find out if it's feasible and safe for your specific system and disk array. I personally do not recommend cloning the CAA disk at the disk array level but rather rely on the cluster switching mechanism to the backup CAA disk (CAA is continuously monitored by the cluster, so freezing it carries risks).
Ensure that the start/stop scripts are of good quality
Scripts responsible for safely starting and stopping an application within a cluster are crucial. Neglecting them is not advisable, because without them, even with a very good configuration, the cluster loses much of its usefulness. Unfortunately, these scripts are often the weakest link in the entire cluster. Even if PowerHA performs the switchover seamlessly as part of fault handling, the failure to correctly start a business-critical application on the backup node can negate the entire effectiveness of the solution.
It is essential for the script to conclude its operation with a status of 0 (zero). At the beginning of the start script, it's beneficial to include code that checks whether the application is already up and running. If it is, the script should simply exit 0. This is particularly important when cluster operation is resumed from UNMANAGE mode, because the startup script will be executed again.
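A minimal ksh sketch of such a start script is shown below; the process name, user, and start command are purely hypothetical placeholders and need to be adapted to your application:
#!/bin/ksh
# Hypothetical PowerHA application start script (placeholder names).
APP_PROC="myapp_server"                  # process pattern identifying the application
APP_START="/opt/myapp/bin/start.sh"      # assumed application start command

# Already running (e.g. cluster resumed from UNMANAGE)? Report success and exit.
if ps -ef | grep -v grep | grep -q "$APP_PROC"; then
    exit 0
fi

# Start the application as its owner and pass its return code back to the cluster.
su - appuser -c "$APP_START"
exit $?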
The script that stops the database/application should also return status 0, even when the application/database is already stopped. In the event of an application crash, an improperly written script might extend the cluster failover time, for example, by attempting to kill processes that no longer exist.
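A corresponding stop script sketch, again with hypothetical names, first checks whether there is anything to stop at all and still returns 0 when the application is already down:
#!/bin/ksh
# Hypothetical PowerHA application stop script (placeholder names).
APP_PROC="myapp_server"
APP_STOP="/opt/myapp/bin/stop.sh"

# Application already down (e.g. after a crash)? Nothing to do - report success.
ps -ef | grep -v grep | grep -q "$APP_PROC" || exit 0

# Regular shutdown as the application owner.
su - appuser -c "$APP_STOP"

# Kill leftover processes only if any actually remain, then report success.
PIDS=$(ps -ef | grep -v grep | grep "$APP_PROC" | awk '{print $2}')
[ -n "$PIDS" ] && kill -9 $PIDS
exit 0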
If the cluster receives an incorrect status from the stopping script while stopping the Resource Group, the RG may enter an ERROR state. In such a case, it is necessary to verify the application/database shutdown and determine the cause. To manually clear the ERROR state from the RG, perform the following action: smitty sysmirror --> Problem Determination Tools --> Recover From PowerHA SystemMirror Script Failure.
If the script takes a long time to return a status, the cluster will display the message config_too_long, and the default waiting time is quite long. Investigate what is causing the hang in the entire stopping process. It could be an issue with killing processes, waiting for an NFS resource, application freeze, etc. Check if the Filesystem belonging to the Resource Group is not in use by any process, using tools like lsof or fuser.
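For example, to see which processes are keeping a filesystem from the Resource Group busy (the mount point /appdata below is just an example):
# fuser -cux /appdata
# lsof /appdata
(lsof is not part of base AIX, so use it only if it is installed on your systems.)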
For a Resource Group to switch over to another node, the situation must be clean. The cluster will not allow the Resource Group, and consequently the application/database, to be started on the backup node until everything is completely shut down and the filesystems, VGs, and disks are deactivated. Allowing data corruption is worse than unavailability.
Handling the erroneous completion of a script is somewhat intricate, so I refer you to the official IBM documentation. In summary, let's ensure that scripts finish correctly and return a status of "0". We configure PowerHA clusters so that we can sleep more peacefully - neglecting the start/stop scripts is not worthwhile, as they are a vital element of the entire process of automating the switchover of business services in the event of a failure.
Upgrade the cluster version smarter :)
Upgrading the PowerHA cluster version can be done in several ways. If we have the opportunity to carry out the operation during a maintenance window, where service unavailability is expected, it is worthwhile to take advantage of it and directly conduct failover tests for Resource Groups after upgrading the cluster version.
Personally, I prefer to perform the version upgrade seamlessly, i.e., in UNMANAGE mode. This provides much greater flexibility when it comes to scheduling the operation, and failover tests with service unavailability can be conducted at an independent time.
To make life easier, I recommend using the cl_ezupdate command, which does almost everything for us. The tool automatically performs the update on all nodes, one by one, starting with the backup node that does not hold active Resource Groups.
Before performing the update, make sure an LPP_SOURCE with the filesets of the specific TL/SP or fix has been created on the NIM server (or on an NFS resource) and is available to all cluster nodes. It's always worth checking whether we can make a leap of several Technology Levels at once. You can verify this possibility on the website:
Examples of cl_ezupdate usage
Querying the NIM server for the visibility of LPP_SOURCE for cluster nodes:
cl_ezupdate -Q NIM
Preview installation from LPP_SOURCE (worth performing to verify potential permission issues with the resource):
cl_ezupdate -P -S PowerHA_72_TL6
Installation of PowerHA TL6 from LPP_SOURCE:
cl_ezupdate -A -S PowerHA_72_TL6
Installation of the Service Pack for TL6 from LPP_SOURCE:
cl_ezupdate -A -S PowerHA_72_TL6_SP2
Personally, I try not to mix TL and SP within the same LPP_SOURCE. Instead, I simply perform the upgrade to TL first and then proceed with another run for the SP installation.
For more information, refer to the documentation on the website:
Note: If application startup scripts do not verify whether the application is already running and can start it again, before updating in UNMANAGE mode, you can temporarily block the script's operation, for example, by adding the line "exit 0" at the beginning of the code.
Watch out for the Dead Man Switch (DMS)
I'm not a specialist in trains, but I know that in railway safety automation there's a mechanism that cyclically checks the presence of the train driver. At regular intervals, the driver must press a pedal or button to signal that they are awake and watchful. If they fail to do so, emergency braking of the locomotive is triggered.
I think we can compare a cluster node to a speeding train here. If a node stops responding, a rather drastic action occurs - stopping the LPAR. Such behavior may be somewhat controversial, but I interpret it as the "lesser evil."
For me, the DMS is like a vigilant guard dog. It ensures our safety, but if we provoke it, it might bite our hand.
Since the Dead Man Switch monitors the responsiveness of a cluster node, situations where we manually interfere with disk I/O (e.g., freezing it during disk array cloning), with network continuity, or with the restart of system services such as RSCT are very risky. In various versions of AIX and PowerHA, I've encountered situations where a node was suddenly stopped in the middle of an AIX update, while filesets were being updated and RSCT services restarted.
This is a very dangerous situation because the bosboot command runs directly after the RSCT update. In some cases, the AIX system could not be booted afterwards, and a system restore or a challenging recovery from Maintenance Mode was necessary.
A surprising fact for many administrators is that simply turning off cluster services is not enough to deactivate DMS. After stopping services, for example, through smitty clstop, the mechanism is still active. You can verify this with the command: lssrc -ls cthags – the last line of the output is crucial; if DMS is active, it will look like: "Critical clients will be terminated if unresponsive."
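A quick way to look at just that last line (the exact wording of the output may vary slightly between RSCT levels):
lssrc -ls cthags | tail -1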
How do you protect yourself from DMS triggering when you fear that your planned action may make the node appear unresponsive (e.g., you want to upgrade RSCT or an MPIO driver)?
DMS operates at both the RSCT and CAA levels. Therefore, to be 100% certain before an AIX update (which will also update RSCT filesets), you can stop the cluster services with the STOP_CAA=yes parameter. Remember that this option is not available from the SMIT tool. If you are not planning a system update, try to avoid using this parameter.
clmgr offline node NODE01 STOP_CAA=yes
If you stopped cluster services this way, to restart them, you must use the START_CAA=yes parameter.
clmgr online node NODE01 START_CAA=yes
Note: Make sure that CAA has actually stopped, e.g., using the command lspv | grep caa_private. If it hasn't, repeat the stopping process (see APAR IJ27046). Also, don't stop CAA if you plan to run AIX on replicated LUNs because you may have trouble booting after renumbering the newly detected "hdisk."
In cases other than AIX updates, where you don't want to stop cluster services, you can try temporarily extending heartbeat intervals or deactivating RSCT monitoring. However, remember that CAA will still be active and monitored. If you want to suspend DMS during Live Partition Mobility (LPM), you don't need to take additional actions - PowerHA clusters in newer versions detect LPM and automatically extend the HEARTBEAT and temporarily stop RSCT monitoring.
Stopping RSCT monitoring:
/usr/sbin/rsct/bin/hags_disable_client_kill -s cthags
/usr/sbin/rsct/bin/dms/stopdms -s cthags
Activating RSCT monitoring:
/usr/sbin/rsct/bin/dms/startdms -s cthags
/usr/sbin/rsct/bin/hags_enable_client_kill -s cthags
Something isn't working? Start from the basics
It happens that solutions to complex problems are surprisingly simple... if only you stumble upon them first. Use dedicated commands to check the cluster status, such as cldump, cldisp, clRGinfo, cltopinfo, etc.
Personally, I start with cldump as a priority because if the command doesn't return a status or takes too long, it is already a cause for concern, indicating that the cluster or its required services are not functioning correctly, or there is a communication problem between nodes.
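As a quick checklist (all of these live under /usr/es/sbin/cluster/utilities; add that directory to PATH if the commands are not found):
cltopinfo - shows the cluster topology: nodes, networks, interfaces
clRGinfo - shows the current location and state of each Resource Group
cldump - gives a quick snapshot of cluster, node, network, and RG state
cldisp - presents an application-centric view of the configuration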
If communication issues between nodes arise during installation, take care of the fundamentals. It's always worth checking typical things like filesystem usage, RAM, PowerHA logs, errpt logs, etc., because the problem may not be with the cluster itself but with the configuration or state of the operating system.
Simple errors at the level of entries in /etc/hosts or the absence of active services, such as inetd, can lead to long and tedious analyses, searching for problems where they don't exist. I recommend checking the prerequisites in IBM Redbooks or on the following website:
There's always a risk that you've encountered a bug in a specific version. Check APARs for your AIX and PowerHA levels. For this purpose, I refer you again to the following page, to the "PowerHA SystemMirror APARs" and "AIX APARs" sections:
Check the correctness of the SSH connection to external systems. If the cluster controls the direction of replication and, for this purpose, connects via SSH to the disk array interface, or uses ROHA and connects to the HMC console, check whether the fingerprint has been confirmed. This can be a very frustrating problem because it may be hard to spot in the logs, but the solution is quite simple. It may occur, for example, on the first connection, or when a change has occurred at the SSH server/client level or to the address.
# ssh server01
The authenticity of host 'server01 (XXX.XXX.XXX.XXX)' can't be established.
ECDSA key fingerprint is XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX.
Are you sure you want to continue connecting (yes/no)?
Still can't figure it out? Maybe it's not worth struggling with it and wasting time; instead, just ask IBM support. During your free time, you can read something to relax... or the Redbook titled "Troubleshooting PowerHA SystemMirror" :)
In the end, I wish you only stable clusters :)
If you've noticed an error in the text, disagree with any of my assumptions, or find them not in line with best practices, feel free to contact me through LinkedIn.