IBM PowerHA SystemMirror – is it still worth it?
In this article I share my personal thoughts on PowerHA clusters. If you are looking for more practical advice, see my earlier article: https://www.dhirubhai.net/pulse/ibm-powerha-systemmirror-clusters-how-maintain-without-wiktorek-ivvfc
I apologize in advance for any language errors, as this text has been translated from my article in Polish.
Introduction
HA (High Availability) clusters developed by IBM have a fairly long history: they were already in production use at a time when systems could not rely on server virtualization or on redundancy at the SAN and LAN network level. LPM (Live Partition Mobility), the ability to move a running system to another physical machine, appeared many years later.
The first version of the PowerHA cluster (then named HACMP, IBM High Availability Cluster Multiprocessing) was released as early as 1991, so the product has been on the market for quite a long time. The current full name of the product is "IBM PowerHA SystemMirror".
Many people may wonder whether PowerHA clusters still make sense in today's IT world, considering that the solution was designed and operated when servers were large bare-metal machines without any virtualization or containerization. Of course, in an ideal world everything would be cloud-based, and blowing up half the globe wouldn't cause any service unavailability. In our imperfect world, however, brave administrators still fight for the life of infrastructure and bare-metal servers in on-premises data centers, and the large systems behind our daily conveniences (banking, telecommunications services, and so on) are often extremely difficult or unprofitable to rewrite into something more flexible.
Personally, I believe that perfect solutions do not exist, and the art is to choose the right tools and solutions for a given case. PowerHA clusters are, of course, not suitable for everything. There will certainly be cases where they are overkill, or where they can be replaced by a solution that ensures availability at the application level. Perhaps for a given application a simple mechanism of stopping an LPAR and starting the system on a different LPAR profile will be sufficient.
However, there will also be many cases where, after analyzing the available options, it turns out that there is nothing better. Like any other product, PowerHA has its pros and cons, but does it evolve? In my opinion, definitely yes. New PowerHA Technology Levels (TLs) expand functionality, including integration with Ansible and easier installation and configuration from a dedicated GUI, which lowers the operational difficulty of managing clusters. From my experience I can say that over the past few years, areas such as starting clusters on disk replicas (e.g., in a disaster recovery location) and upgrading cluster versions have improved significantly; the latter can usually be done seamlessly, without impacting business services.
Maintenance
PowerHA requires a certain level of commitment and a willingness to understand it. Be prepared that even the most adept administrator won't solve some problems without contacting IBM support; not everything is described in the online documentation or in IBM Redbooks. In my opinion, system administrators are often less confident when dealing with clustered systems. Why? I think it stems from the fact that clusters are usually set up where systems are critical to business continuity, and in such cases human error is more painful.
A certain difficulty is that clusters require a specific approach to typical administrative tasks. For a typical AIX or Linux administrator, expanding space in LVM is not a problem, but if they have to do it on a clustered system... a warning light goes on in their head. In older versions of PowerHA many tasks indeed required a cluster-specific approach, but current versions handle this much better. Someone executed chfs instead of cli_chfs? No problem, the cluster handles it. Unfortunately, for many less typical actions, such as splitting an LVM mirror or migrating between disk arrays, special caution is still required: the clustered equivalents of system commands do not always behave exactly the same and may support different parameters.
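As a sketch, this is what the cluster-aware path looks like for the filesystem example above (the C-SPOC wrappers live under /usr/es/sbin/cluster/cspoc; the filesystem name is made up):

```shell
# Cluster-aware filesystem expansion: the C-SPOC wrapper propagates
# the LVM change to every node in the cluster.
cli_chfs -a size=+2G /oradata

# On current PowerHA releases the plain AIX command is also picked up
# by the cluster, but the C-SPOC path remains the safer habit:
chfs -a size=+2G /oradata
```

The cli_* commands deliberately mirror the syntax of their AIX counterparts, which keeps the learning curve for cluster-aware administration low.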
Clusters reduce unplanned downtime but at the same time can increase the amount of planned downtime.
If you have a cluster, it's worth periodically testing its failover to avoid surprises during a real failure. Is this a big problem? It seems to me that in today's IT world, boasting about high uptime is more a cause for embarrassment than pride: systems need to be updated and vulnerabilities patched, and the "If it ain't broke, don't fix it" approach no longer seems very wise. Planning a cluster failover, and thus a short downtime, at least once a year should not be a big issue (and if it is, perhaps the problem lies not with the system but within the organization itself).
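Such a planned failover test can be as simple as moving the resource group to the standby node and back (the resource group and node names are invented, and the flags are from memory, so verify them against the clRGmove documentation for your release):

```shell
# Move resource group app_rg to the standby node for the test...
clRGmove -g app_rg -n nodeB -m
# ...verify the application works there, then move it back:
clRGmove -g app_rg -n nodeA -m
```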
Cluster switchover is one case of unavailability, but what about upgrading PowerHA?
From experience I can say that in newer versions of PowerHA, upgrading can be quite painless. It's worth getting acquainted with the cl_ezupdate tool, which automates the upgrade process and, in UNMANAGE mode, does so without interrupting the Resource Groups, and thus without impact on business services. It's even possible to make a bigger version jump, for example directly from 7.2 TL4 to 7.2 TL7; before upgrading, it's always worth checking IBM's migration compatibility matrix.
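To use the compatibility matrix you first need the current levels; on a PowerHA system the standard commands below report them (the fileset pattern is shown for illustration, and halevel lives under /usr/es/sbin/cluster/utilities):

```shell
# Installed PowerHA filesets and their levels:
lslpp -L "cluster.es.*"
# PowerHA level including service pack (PowerHA's own utility):
halevel -s
# AIX technology level, also needed for the compatibility matrix:
oslevel -s
```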
Just a few years ago, a cluster version upgrade required a lengthy procedure to ensure everything went according to plan, so there is a very significant improvement here, in my opinion.
Frequently Asked Questions
The justification for using clusters is sometimes questioned. Below I confront questions I have occasionally heard, ones that stemmed more from a reluctance to maintain a cluster or a misunderstanding of the topic than from rational contraindications.
Let's not forget that LPM is not an HA solution. It does offer tremendous flexibility and allows the seamless transfer of a system to another physical machine, which is extremely useful (especially when a hardware component fails on the physical server and we need to "escape" from it), but it will not act automatically in case of failure and requires planning. It also doesn't protect against problems such as operating system damage caused by human error or software defects.
I believe it's always worth questioning existing concepts and looking for new solutions, but the argument that "we have never had a server failure" is like saying "there's never been a fire in the building, so there's no need for a fire suppression system."
The fact that there has never been a failure that forced the cluster to switch over may mean that we were simply lucky. It could also mean that the system was saved by redundancy of physical components at the Power platform level, SEA failover at the VIOS level, multipathing of disk devices, or external mechanisms outside the platform that protect network continuity and access to disk resources (brave infrastructure administrators also play a huge role here :)). Redundancy, however, won't always save us. We can never predict when we might encounter a kernel panic; on AIX this happens extremely rarely, but Murphy's Law is always alive :)
Many things can be done manually.
But why, if they can be automated :)
The idea seems valid: why pay for expensive cluster licenses if a simple script could stop and start the application, move the IP address, and remount resources between servers? On further thought, it should also prevent duplicate IP addresses when the primary server comes back to life after a failure... It should prevent the primary and backup servers from writing to the same disks, because the outcome would be worse than the system's unavailability... It should make sure the same database/application never runs on both servers at the same time, because transactions scattered between two systems could end badly... It would be nice if the nodes could communicate so each knows which one is malfunctioning... And maybe protect against a split-brain scenario?
Well... when you consider the various failure scenarios and cases to handle, the list of things those "simple scripts" must manage becomes long, and it's worth asking yourself: wouldn't it be easier to use a ready-made solution such as PowerHA, which already does all this and is developed by people who have been dealing with it for years? :)
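To illustrate just one of those cases, here is a minimal sketch (not PowerHA; all names are invented) of the takeover decision a hand-rolled script has to get right to avoid split-brain, assuming two nodes and a disk-based tiebreaker:

```shell
# decide_takeover: should the standby node take over the resources?
#   $1 = 1 if the peer answered the network heartbeat, 0 if silent
#   $2 = 1 if we managed to grab the disk tiebreaker reservation
decide_takeover() {
    peer_heartbeat=$1
    disk_lock_free=$2
    if [ "$peer_heartbeat" -eq 1 ]; then
        # Peer is alive: taking over now would mean two active nodes.
        echo "stay-standby"
    elif [ "$disk_lock_free" -eq 1 ]; then
        # Peer is silent AND we hold the tiebreaker: real node failure.
        echo "take-over"
    else
        # Peer is silent but the tiebreaker is held elsewhere: this is
        # a network partition, not a crash, so do NOT start resources.
        echo "stay-standby"
    fi
}

decide_takeover 0 1   # prints: take-over
```

Even this toy version needs a tiebreaker to tell a dead peer from a partitioned one, which is exactly the kind of logic PowerHA's cluster layer already provides and has battle-tested for years.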
Personally, I believe that where possible it's worth using the dedicated HA solution from the database/application vendor. It won't always be the better option, so it's best to run a Proof of Concept and verify that it works as well as the vendor promised :) (everything always works during presentations)
Many applications were developed without any concept of HA, and it's in exactly such cases that PowerHA performs quite well. Why? Because in a typical case, all that is needed for an application to stop and start on another server is a movable IP address, movable (or shared) storage, and a start/stop script. PowerHA provides this (and much more) and lets us ensure HA even for our own program or script.
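As a sketch of that last point, the start/stop script that PowerHA's application controller calls can be this simple (the function name and application are invented; a real script would launch and kill your actual binary):

```shell
# Minimal application controller wrapper: PowerHA calls it with
# "start" on the node acquiring the resource group and "stop" on
# the node releasing it.
app_ctl() {
    case "$1" in
        start) echo "myapp started" ;;   # real life: nohup /opt/myapp/bin/server &
        stop)  echo "myapp stopped" ;;   # real life: kill "$(cat /var/run/myapp.pid)"
        *)     echo "usage: app_ctl {start|stop}"; return 1 ;;
    esac
}

app_ctl start   # prints: myapp started
```

Keeping the start and stop paths in one small wrapper like this makes the script easy to test by hand on either node before handing it to the cluster.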
Advantages and disadvantages (my subjective opinion)
Pros
Cons
Summary
Is it still worth using PowerHA clusters? As is often the case, the answer is "it depends" :) In my opinion, it is worthwhile, as long as they are used according to their purpose and where they are justified.
Certainly, it is not a dead product and is still being actively developed. Even though PowerHA clusters require some expert knowledge, they still have their advantages, and in many cases, it is hard to find an alternative solution that would be equally versatile for various applications and databases.
I hope you found the text interesting. If you've noticed an error in the text or disagree with any of my assumptions, feel free to contact me through LinkedIn :)