IBM PowerHA SystemMirror – is it still worth it?

In this article I share my personal thoughts on PowerHA clusters. If you are looking for more practical advice, see my earlier article: https://www.dhirubhai.net/pulse/ibm-powerha-systemmirror-clusters-how-maintain-without-wiktorek-ivvfc

I apologize in advance for any language errors, as this text has been translated from my article in Polish.

Introduction

HA (High Availability) clusters developed by IBM have a fairly long history: they were already in production use back when systems enjoyed no server virtualization and no redundancy at the SAN and LAN network level. LPM (Live Partition Mobility), the ability to move a running system to another physical machine, appeared many years later.

The first version of the PowerHA cluster (then named HACMP – IBM High Availability Cluster Multiprocessing) was released back in 1991, so the product has been on the market for quite a long time. The current full name of the product is "IBM PowerHA SystemMirror".

Many people may wonder whether PowerHA clusters still make sense in today's IT world, considering that the solution was designed and operated when servers were large bare-metal machines without any virtualization or containerization. Of course, in an ideal world everything would be cloud-based, and blowing up half the globe wouldn't cause any service unavailability. In our imperfect world, however, brave administrators keep fighting for the life of bare-metal servers in on-premises data centers, and the large systems that support our daily convenience (banking, telecommunications services, and so on) are often extremely difficult or unprofitable to rewrite into something more flexible.

Personally, I believe that perfect solutions do not exist, and the art is to choose the right tools and solutions for a given case. PowerHA clusters are, of course, not suitable for everything. There will certainly be cases where they are overkill, or where a solution that ensures continuity of availability at the application level can replace them. Perhaps for a given application a simple mechanism of stopping an LPAR and starting the system on a different LPAR profile will be sufficient.

However, there will also be many cases where, after analyzing the available options, it turns out that there is nothing better. Like any other product, PowerHA has its pros and cons, but does it evolve? In my opinion, definitely yes. New PowerHA Technology Levels (TLs) expand the functionality, including integration with Ansible and easier installation and configuration from a dedicated GUI, which lowers the operational difficulty of managing clusters. From my experience, I can say that over the past few years areas such as launching clusters on disk replicas (e.g., in a disaster recovery location) and upgrading cluster versions have been significantly improved; upgrades can usually be done seamlessly, without impacting the operation of business services.

IBM PowerHA GUI - source: "PowerHA SystemMirror Graphical User Interface" Redbook

Maintenance

PowerHA requires a certain level of commitment and willingness to understand it. Be prepared for the fact that even the most adept administrator won't overcome some problems without contacting IBM support: not everything is described in the documentation available online or in IBM Redbooks. In my opinion, system administrators often have less confidence when dealing with clustered systems. Why? I think it stems from the fact that clusters are usually set up where systems are critical to the continuity of business services, and in such cases human error is more painful.

A certain difficulty is that clusters require a specific approach to typical administrative tasks. For the typical AIX or Linux administrator, expanding space in LVM is not a problem, but if they have to do it on a clustered system... a warning light goes on in their head. In older versions of PowerHA many tasks indeed required a cluster-specific approach, but current versions handle this much better. Someone executed chfs instead of cli_chfs? No problem – the cluster handles it. Unfortunately, many less typical actions, such as splitting an LVM mirror or migrating between disk arrays, still require special caution, as the clustered equivalents of system commands do not always work exactly the same and may support different parameters.
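As an illustration, the cluster-aware C-SPOC equivalents of common LVM commands are typically found under /usr/es/sbin/cluster/cspoc; the filesystem name and size delta below are hypothetical examples, not taken from any real configuration:

```shell
# Grow a filesystem on a shared volume group with the C-SPOC CLI
# (/data01 and the +2G delta are hypothetical; the syntax mirrors chfs):
/usr/es/sbin/cluster/cspoc/cli_chfs -a size=+2G /data01

# Plain chfs on the same filesystem is tolerated by recent PowerHA
# versions, but the C-SPOC variant keeps the LVM metadata in sync
# across all cluster nodes in one step.
```

The C-SPOC route is the safe default; it is the less typical operations (mirror splits, array migrations) where the cluster equivalents diverge from the plain system commands.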

Clusters reduce unplanned downtime but at the same time can increase the amount of planned downtime.

If you have a cluster, it's worthwhile to periodically test its failover to avoid surprises during a real emergency. Is this a big problem? It seems to me that in today's IT world boasting about years of uptime is more of an embarrassment than a point of pride – after all, systems need to be updated and vulnerabilities patched, and the "If it ain't broke, don't fix it" approach no longer seems very wise. Planning a cluster failover, and thus causing a short downtime, at least once a year should not be a big issue (and if it is, perhaps the problem lies not with the system but within the organization itself).

Cluster switchover is one case of unavailability, but what about upgrading PowerHA?

From experience, I can say that in newer versions of PowerHA upgrading can be quite painless. It's worth getting acquainted with the cl_ezupdate tool, which automates the upgrade process and does so without interrupting the Resource Groups (in UNMANAGE mode), thus without impact on business services. It's even possible to make a bigger version jump, for example directly from 7.2 TL4 to 7.2 TL7 – before upgrading, always check the migration compatibility matrix on the page:

https://www.ibm.com/docs/en/powerha-aix/7.2?topic=reference-information

Just a few years ago, a cluster version upgrade required a lengthy procedure to ensure everything went according to plan, so there is a very significant improvement here, in my opinion.

Frequently Asked Questions

The justification for using clusters is sometimes questioned. Below I confront questions I have occasionally heard – questions that stemmed more from a reluctance to maintain a cluster, or from a misunderstanding of the topic, than from rational contraindications.

  • Why do we need a cluster if there's LPM?

Let's not forget that LPM is not an HA solution. Indeed, it offers tremendous flexibility and allows for the seamless transfer of a system to another physical machine, which is extremely useful (especially when there's a hardware component failure on the physical server and we need to "escape" from it), but it will not operate automatically in case of failure and requires planning – it also doesn't protect against issues such as operating system damage due to human error or software defects.

  • We have a cluster but we have never had a server failure - shouldn't we stop using it?

I believe it's always worth questioning existing concepts and looking for new solutions, but the argument that "we have never had a server failure" could be presented as – "there's never been a fire in the building, so there's no need for a fire extinguishing system."

The fact that there has never been a failure forcing the cluster to switch over may mean we were simply lucky, but it could also mean the system was saved by the redundancy of physical components at the Power platform level, SEA failover at the VIOS level, multipathing of disk devices, or external mechanisms operating outside the platform that protect the continuity of network operations and access to disk resources (brave infrastructure administrators also play a huge role here :)). However, redundancy won't always save us. We can never predict when we might encounter a kernel panic – in AIX this happens extremely rarely, but Murphy's Law is always alive :)

  • Why do we need a cluster if the system or services can be manually started on another server?

Many things can be done manually.

But why, if they can be automated :)

  • PowerHA cluster is too complicated – can't resource switching be managed with our own scripts?

The idea seems valid. Why pay for expensive cluster licenses if a simple script could suffice to stop and start an application, transfer the IP address, and remount resources between servers. Upon further consideration, it might also prevent duplicate IP addresses when the primary server is resurrected after a failure... It would be nice to also prevent a situation where the primary and backup servers start writing to the same disks because the outcome would be worse than the system's unavailability... It would be good to do this in a way that prevents the same database/application from running on both servers at the same time because if transactions are scattered between two systems, it could end badly... It would be nice also if the systems could communicate so that they know which one is not working properly... And maybe protect against a Split-Brain scenario?

Well... when you consider the various failure scenarios and cases to handle, the list of things those "simple scripts" must manage becomes long, and it's worth asking oneself: wouldn't it be easier to use a ready-made solution such as PowerHA, which already has all of this and is developed by people who have been dealing with it for years? :)
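To make the point concrete, here is a minimal sketch of the very first iteration of such a "simple script" – everything in it (the health-check variable, the service address) is a hypothetical placeholder, and the comments mark exactly the gaps that real cluster software has to close:

```shell
#!/bin/sh
# Naive failover "simple script" -- a sketch showing what it does NOT handle.
# PEER_RESPONDED stands in for a real health check (heartbeat, ping, ...);
# SERVICE_IP is a hypothetical placeholder address.
PEER_RESPONDED="${PEER_RESPONDED:-no}"
SERVICE_IP="10.0.0.50"

decide_action() {
    if [ "$PEER_RESPONDED" = "yes" ]; then
        echo "standby"   # peer looks alive -- do nothing
    else
        # The peer is silent. But is it down, or merely unreachable from here?
        # If it is still running, taking over $SERVICE_IP and mounting the
        # same disks means duplicate addresses and corrupted data -- exactly
        # the Split-Brain scenario a real cluster's tie-breaker prevents.
        echo "takeover"
    fi
}

decide_action
```

A single yes/no check cannot distinguish a dead node from a network partition; this is why clusters heartbeat over multiple independent paths and fence or arbitrate before taking resources over.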

  • Application XXX has its own HA solution – wouldn't it be better to use that?

Personally, I believe that where possible it's worth using a dedicated solution from the provider of the database/application – it won't always be the better option, so it's best to run a Proof of Concept, test, and verify whether it works as well as the provider promised :) (everything always works during presentations).

Many applications were developed without any concept of HA, and it's in such cases that PowerHA performs quite well. Why? Because in a typical case, for an application to shut down and start on another server, a switchable IP address, switchable (or shared) storage, and a script to start/stop the application are needed – PowerHA provides this (and much more) and allows us to ensure HA even for our own program/script.
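As a rough sketch (not IBM's template – the paths and names below are hypothetical placeholders), the start script a PowerHA application controller calls can be as small as this; the convention that matters is the exit code, which tells the cluster whether the start succeeded:

```shell
#!/bin/sh
# Minimal start script for a PowerHA application controller (sketch).
# APP and LOG are hypothetical placeholders -- point them at your binary.
APP="${APP:-/opt/myapp/bin/server}"
LOG="${LOG:-/tmp/myapp_start.log}"

start_app() {
    if [ ! -x "$APP" ]; then
        echo "cannot start: $APP is missing" >&2
        return 1             # non-zero: the cluster treats the start as failed
    fi
    "$APP" >>"$LOG" 2>&1 &   # launch in the background, keep a log
    echo "started"
    return 0                 # zero: the resource group can come online
}

# The cluster invokes this script when bringing the resource group online;
# a matching stop script does the reverse.
```

Together with a service IP and the shared volume group, such a start/stop pair is all a resource group needs to make even a home-grown program highly available.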

Advantages and disadvantages (my subjective opinion)

Pros

  • Great capabilities for providing HA for very diverse applications and databases
  • The majority of configuration work can be done on an active cluster
  • Possibility to upgrade TL/SP in Online mode
  • Quite good documentation in the form of IBM Redbooks
  • Support for Ansible in the new TL (in my opinion, this is a good direction)
  • Configuration and management through the SMIT tool (though in a limited scope)
  • In my assessment, a significant improvement in product stability in recent years

Cons

  • High entry barrier to the technology, which may discourage new specialists
  • Difficult to master and problematic Dead Man Switch mechanism, which is active even when cluster services are stopped
  • A large portion of knowledge about the product is locked within IBM, unavailable to the client
  • Clusters are not well liked by administrators due to their complexity and the high responsibility associated with maintaining them; in my opinion, this is why it is hard to find administrators for this product
  • CAA is like a black box (clients have limited capabilities to look inside CAA, e.g., for problem analysis)
  • Not all important options are available through the SMIT tool (e.g., no STOP_CAA option for smit clstop)

Summary

Is it still worth using PowerHA clusters? As is often the case, the answer is "it depends" :) In my opinion, it is worthwhile, as long as they are used according to their purpose and where they are justified.

Certainly, it is not a dead product and is still being actively developed. Even though PowerHA clusters require some expert knowledge, they still have their advantages, and in many cases, it is hard to find an alternative solution that would be equally versatile for various applications and databases.


I hope you found the text interesting. If you've noticed an error in the text or disagree with any of my assumptions, feel free to contact me through Linkedin :)

Ander Ochoa Gilo

Senior IT Architect, Presales, Open Source and HW Systems Team leader.

Is it still worth it? ... more than ever; NFRs (non-functional requirements) are the most important bits of any production environment.

Shawn B.

Senior IT Consultant | IBM Champion| POWER, Storage, Watson, Linux, AIX | POWERHAguy

STOP_CAA functionality is in the SMUI. I personally don’t think it should be in either place nor a default as it should rarely ever be used.

Chris Petersen

Do-er of the Difficult, Wizard of Why Not, and Certified IT Curmudgeon

Very much agreed, and the introduction of Simplified Remote Restart and VM Recovery Manager (HA and DR now) that uses it under the covers has muddied the waters as well. Most of the places I've been that deemed PowerHA "not worth it" or "causing more problems than it solves" were in that class of organizations that didn't devote the strategy, design, planning, documentation of procedures, and run-time testing + familiarization that it requires to really shine. Additionally, there are a lot of "bad" clusters out there in the wild where admins have been forced into bad tech designs because of well-meaning business decisions. Yes, of course, we can run 3 databases and 3 full app stacks in a single resource group and active/passive cluster to simplify some other group's lives, but ......... all kinds of downstream problems and pitfalls ensue.

I have had this question bothering me for years: Oracle RAC spreads databases across LPARs with no downtime using load balancers, and IBM VIOS has excellent load-balancing techniques using SEA – why shouldn't something similar be replicated in SystemMirror, to give users better application availability where DB RAC cannot be used?

Alexander Trekin

Senior Regional Sales Director at Precisely | Better data means better decisions.

You may also wish to consider a solution from people who developed HACMP for IBM back in 1992: https://www.precisely.com/product/precisely-assure/assure-mimix-for-aix
