IBM PowerHA SystemMirror for AIX - Operational Basics

IBM PowerHA clusters are used where the high availability of business-critical applications and databases is paramount. Configuring and maintaining these clusters requires specialized knowledge, which is why working with them can be a source of concern for many IT employees. Not everyone needs to be a cluster expert, though. Where dedicated individuals, teams, or further support lines are responsible for cluster maintenance, basic operational knowledge may be sufficient for the administrators of applications protected by the PowerHA cluster.

The IBM PowerHA product documentation can be overwhelming in its sheer volume. In this article, I would therefore like to demonstrate the most fundamental PowerHA cluster management operations in the simplest way possible. The text is intended for specialists who do not need to know installation, configuration, or troubleshooting, but only need to perform the most practical tasks, such as:

Checking the current status of the cluster and Resource Group (RG)
Stopping or starting cluster services
Switching a Resource Group between cluster nodes or between Data Centers

How to check if AIX is running in a PowerHA cluster?

The simplest way is to run one of the commands that list the current cluster status, such as cltopinfo, clRGinfo, or cldump (clRGinfo and cldump may not work if the cluster is stopped). On an AIX system without PowerHA, none of these commands should run successfully, so if the cluster status is displayed, we have a quick confirmation.

If cluster services are stopped on all nodes, the commands cltopinfo or cldisp should still return information about the PowerHA configuration. When a cluster is configured, we should also be able to check the status of the Cluster Manager with the command:

lssrc -ls clstrmgrES
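For scripted checks, the state can be extracted from that output. The following is only a sketch: it assumes the output contains a line of the form "Current state: ST_STABLE" (ST_STABLE on a running, stable node, ST_INIT when cluster services are stopped), and the function name clstrmgr_state is made up for illustration.

```sh
# Print the Cluster Manager state reported by lssrc (e.g. ST_STABLE, ST_INIT).
# Assumption: the output contains a line such as "Current state: ST_STABLE".
# Returns non-zero if no such line is found (e.g. subsystem not defined).
clstrmgr_state() {
    lssrc -ls clstrmgrES 2>/dev/null |
        awk -F': *' '
            /^Current state:/ { sub(/[ \t]+$/, "", $2); print $2; found = 1; exit }
            END { if (!found) exit 1 }'
}

# Example call (hypothetical usage):
#   state=$(clstrmgr_state) || state=UNKNOWN
#   echo "Cluster Manager state: $state"
```

A non-zero return code is used instead of an empty string so that a caller can tell "no state found" apart from a real state value.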

If the cluster is disabled or unconfigured, we can check if the AIX system has the PowerHA filesets installed:

lslpp -l "cluster.*"

This command will not confirm if the cluster has been configured, but only that AIX has the PowerHA software installed. If AIX does not have a cluster configuration but the cluster filesets are installed, it is worth considering their removal so as not to pay for unused licenses.

The installed PowerHA version can be checked with the command:

halevel -s

How to check on which node an RG is online, how many nodes there are, and what their status is?

clRGinfo

The most basic view is provided by the previously mentioned clRGinfo command. It shows the names of the cluster nodes, the nodes on which the Resource Groups are located, and the Site (if sites are configured).
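When the answer is needed in a script, the colon-separated output of clRGinfo -s is easier to parse than the table. This is a sketch under one loud assumption: that each record starts with the fields group name, state and node. Verify this on your PowerHA level before relying on it. The function name rg_online_node is hypothetical.

```sh
# PowerHA utilities are often not in the default PATH.
PATH=$PATH:/usr/es/sbin/cluster/utilities

# Print the node(s) on which the given Resource Group is ONLINE.
# Assumption: "clRGinfo -s" prints records of the form group:state:node:...
rg_online_node() {
    clRGinfo -s 2>/dev/null |
        awk -F: -v rg="$1" '$1 == rg && $2 == "ONLINE" { print $3 }'
}

# Example call (hypothetical Resource Group name):
#   rg_online_node rg1
```

The state is compared exactly, so an instance reported as ONLINE SECONDARY is not listed as the active one.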

cldump

This is the command I personally run first, but many people don’t quite know what to look at and are a bit overwhelmed by the layout of its output. Of course, one could argue that everything is visible and there is nothing to explain, but under stress, e.g., right after a failure, people tend to skim the information rather than carefully read everything on the screen.

The first section (yellow border) talks about the cluster status: cluster name, state (e.g., UP), and substate (e.g., STABLE, UNSTABLE).
The next section (blue border) concerns the cluster nodes, and the status of their network resources with associated IP addresses and Service IP Labels.
The section further down (green border) describes the policies and status of the Resource Groups, including where a given RG is currently located. This example shows only one Resource Group; if there are many, the cldump output will be significantly longer, because the information in the green box is repeated sequentially for each Resource Group.

In the case of a Standard cluster, the RG status is rather obvious, but if the cluster has nodes in different Sites and is configured as Stretched or Linked, then we might see an RG in the “Online Secondary” state, which may surprise someone. The difference between these statuses is as follows:

Online – The Resource Group is active on that node.
Online Secondary – The Resource Group is not active on that node, but the node is designated as the backup node for that RG, and the RG will be activated on it if the primary node fails.

cltopinfo

I have marked the sections with colors analogously to the cldump command. Here we can notice additional information, such as the cluster type (Standard/Stretched/Linked), Heartbeat Type, and the repository disk.

How to perform a manual switchover?

Colloquially, people often say that we are “switching over the cluster,” but, to be precise, we are actually moving a Resource Group between the cluster nodes. A single cluster can have many Resource Groups, and we don’t necessarily have to switch all of them at once.

Moving a Resource Group to another node

The easiest way to perform this action is with the SMIT tool, launched from the command line with the fast path smitty sysmirror or smitty hacmp:

smitty sysmirror --> System Management (C-SPOC) --> Resource Group and Applications --> Move Resource Groups to Another Node

Why am I showing these screens?

Because it often happens that someone wants to know what will appear after selecting a given item in SMIT, but is not sure whether pressing Enter will lead to the next selection screen or execute the actual action. Unintentionally moving an RG can cost us a bit of stress… 🙂

In the first step, we select the appropriate Resource Group (i.e., the one with the Online status, if we want to move the RG that is currently active):

In the next step, we select the target node on which we want to start the chosen RG.

If the Resource Group is in OFFLINE mode, we do not move it (because there is nowhere to move it from ;)). We simply start it on the target node.

smitty sysmirror --> System Management (C-SPOC) --> Resource Group and Applications --> Bring a Resource Group Online

Moving a Resource Group to another Site

The procedure is very similar to moving to another node. The only difference is that instead of the target node, we select the target Site. Naturally, this option is only available if ‘site policies’ have been defined previously.
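The SMIT steps for moving an RG to a node or to a Site have a command-line counterpart in clmgr (move resource_group with a NODE or SITE attribute). The wrapper below is only a sketch: the function name move_rg and the dry-run default are my own, chosen so that nothing is moved by accident.

```sh
# Build (and optionally run) a clmgr command that moves a Resource Group.
# Usage: move_rg <rg> NODE|SITE <target> [run]
# Without "run" as fourth argument the command is only printed (dry run).
move_rg() {
    rg=$1 kind=$2 target=$3 mode=${4:-dry}
    case $kind in
        NODE|SITE) ;;
        *) echo "kind must be NODE or SITE" >&2; return 2 ;;
    esac
    cmd="clmgr move resource_group $rg $kind=$target"
    if [ "$mode" = run ]; then
        $cmd            # really moves the Resource Group
    else
        echo "$cmd"     # dry run: show what would be executed
    fi
}

# Example calls (hypothetical names):
#   move_rg rg1 NODE node2        # prints the command only
#   move_rg rg1 SITE siteB run    # executes the move
```

Printing by default mirrors the concern above about not knowing whether Enter executes the action: the command can be reviewed before it is run.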

Stopping cluster services

It is worth remembering that in common language, “stopping the cluster” does not necessarily mean stopping the operating system. It is always important to define precisely whether the situation involves stopping the cluster nodes including the AIX systems/LPARs, or just stopping the cluster services without stopping the operating system.

In the standard case (i.e., stopping the cluster services), it is sufficient to use the command smitty clstop or select the appropriate option from the C-SPOC menu.

smitty sysmirror --> System Management (C-SPOC) --> PowerHA SystemMirror Services

It is important to pay attention to which action we will take regarding the Resource Groups currently running on the node we intend to shut down. There are 3 actions to choose from.

Bring Resource Groups Offline
Move Resource Groups
Unmanage Resource Groups

While the first two options are rather obvious, the “Unmanage Resource Groups” option may raise questions. In this mode, the PowerHA services are stopped, but the RG remains active. This is a very useful option for performing a PowerHA version upgrade or applying a fix while keeping the RG (and thus the database or application) running. However, it is definitely not appropriate before an AIX restart, because in that case the application’s stop script (one of the Application Controller Scripts) will not be executed.
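On the command line, the three choices correspond to the MANAGE attribute of clmgr when taking a node offline. The sketch below only prints the command. It assumes the clmgr offline node syntax with WHEN and MANAGE attributes as I know it from PowerHA 7.x, and the function name stop_services_cmd is hypothetical.

```sh
# Print the clmgr command that stops cluster services on one node.
# Usage: stop_services_cmd <node> offline|move|unmanage
#   offline  = Bring Resource Groups Offline
#   move     = Move Resource Groups
#   unmanage = Unmanage Resource Groups (RG keeps running, stop script not run)
stop_services_cmd() {
    node=$1 manage=$2
    case $manage in
        offline|move|unmanage)
            echo "clmgr offline node $node WHEN=now MANAGE=$manage" ;;
        *)
            echo "unknown Resource Group action: $manage" >&2
            return 2 ;;
    esac
}

# Example call (hypothetical node name):
#   stop_services_cmd node1 unmanage
```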

It should be remembered that stopping a Resource Group will also trigger the execution of the script used to stop the database/application. To check the location of the scripts, you can use the cllsserv command, which will display all application servers (which, simply put, are the assigned stop/start scripts linked to the Resource Group).

cllsserv

NOTE: Before stopping a Resource Group (RG), remember that if the database/application has already been stopped manually by an administrator, the stop script may try to stop processes that no longer exist. In theory, every such script should recognize the current application state, but in practice not all scripts are written perfectly. As a result, the script may run for a very long time, or the cluster services may stop with an error. If the RG has already gone into ERROR status and the cause has been identified, you can restore the correct state by following the documentation: https://www.ibm.com/docs/en/powerha-aix/7.2.x?topic=tools-recovering-from-powerha-systemmirror-script-failure

If the database or application needs to be stopped manually for a justified reason (and the script was not written to recognize this), it is absolutely necessary to temporarily disable the content of the stop script before shutting down the Resource Group. The simplest way to do this is to add the line exit 0 at the beginning of the script. The PowerHA cluster expects the scripts to return a status of 0; if this does not happen, the cluster waits and, after some time, initiates a recovery action. This is why it is so important that the scripts finish with a correct status. In the example below, the line exit 0 has been added, so the Oracle database stop command further down in the script will not be executed:

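#!/usr/bin/ksh
# Temporary: return success immediately so PowerHA treats the stop as done
exit 0
# Not reached while the exit 0 above is in place
/u01/oracle/stop_oracle.sh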
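A more durable alternative is a stop script that recognizes the application state by itself, as the note above suggests. The following is a minimal sketch of that idea, not a ready-made script: the process pattern ora_pmon_ and the function names are assumptions for illustration, and /u01/oracle/stop_oracle.sh is simply taken from the example above.

```sh
#!/usr/bin/ksh
# Sketch of a stop script that tolerates an already stopped application.
# APP_PATTERN is an assumed process name pattern; adjust it to the application.
APP_PATTERN=${APP_PATTERN:-ora_pmon_}
STOP_CMD=${STOP_CMD:-/u01/oracle/stop_oracle.sh}

# Succeeds if at least one process matching APP_PATTERN exists.
app_is_running() {
    ps -e -o args | grep -v grep | grep -- "$APP_PATTERN" >/dev/null
}

# Stop the application only if it is running.
# Returns 0 when there is nothing to do, otherwise the status of STOP_CMD.
safe_stop() {
    if app_is_running; then
        "$STOP_CMD"
    else
        echo "application already stopped, nothing to do"
        return 0
    fi
}

# In the real script the last line would be (left commented out in this sketch):
#   safe_stop; exit $?
```

With this approach the script returns 0 when the administrator has already stopped the application, so there is no need to edit it before taking the Resource Group offline.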

If an operating system restart is planned, it should only take place after the Resource Group and cluster services have been stopped. If the RG is in the RELEASING or ONLINE status, the AIX system definitely should not be restarted yet.
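This rule can be turned into a small pre-restart check. The sketch below is an illustration only: it assumes that clRGinfo -s prints colon-separated records starting with group name, state and node, it additionally treats ACQUIRING and UNMANAGED as "do not restart" (my own choice, based on the Unmanage caveat earlier), and the function name safe_to_reboot is hypothetical. It does not check whether cluster services themselves are stopped.

```sh
# PowerHA utilities are often not in the default PATH.
PATH=$PATH:/usr/es/sbin/cluster/utilities

# Return 0 if no Resource Group instance is active on <node>,
#        1 if one is ONLINE, RELEASING, ACQUIRING or UNMANAGED there,
#        2 if clRGinfo could not be queried (state unknown).
# Assumption: "clRGinfo -s" prints records of the form group:state:node:...
safe_to_reboot() {
    out=$(clRGinfo -s 2>/dev/null) || return 2
    printf '%s\n' "$out" |
        awk -F: -v n="$1" '
            $3 == n && $2 ~ /^(ONLINE|RELEASING|ACQUIRING|UNMANAGED)/ { busy = 1 }
            END { exit busy ? 1 : 0 }'
}

# Example call (hypothetical node name):
#   safe_to_reboot node1 && echo "no active RG on node1"
```

Return code 2 is kept separate because, as mentioned earlier, clRGinfo may not work when the cluster is stopped, and "unknown" should not be read as "safe".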

If PowerHA cluster services are being stopped in order to update AIX software, first read the following section of the IBM documentation: https://www.ibm.com/docs/en/powerha-aix/7.2.x?topic=maintenance-updating-software-cluster

Note! If the update upgrades the RSCT filesets, it may affect the PowerHA cluster and the operation of the DMS (Dead Man Switch) mechanism. In that case, you must disable cthags or consider the STOP_CAA/START_CAA option.

Starting cluster services

In the common case, it is sufficient to use the command smitty clstart or select the appropriate option from the C-SPOC menu.

smitty sysmirror --> System Management (C-SPOC) --> PowerHA SystemMirror Services

It is worth paying attention to the Manage Resource Groups setting, which determines whether, along with starting the cluster services, we also start the Resource Group (Automatically), or just the cluster services without the RG (Manually).
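The same choice exists on the command line as the MANAGE attribute of clmgr when bringing a node online. As before, this is only a sketch that prints the command. It assumes the clmgr online node syntax with WHEN and MANAGE attributes as I know it from PowerHA 7.x, and start_services_cmd is a hypothetical name.

```sh
# Print the clmgr command that starts cluster services on one node.
# Usage: start_services_cmd <node> auto|manual
#   auto   = start cluster services and the Resource Groups (Automatically)
#   manual = start cluster services only, RGs stay offline (Manually)
start_services_cmd() {
    node=$1 manage=$2
    case $manage in
        auto|manual)
            echo "clmgr online node $node WHEN=now MANAGE=$manage" ;;
        *)
            echo "MANAGE must be auto or manual" >&2
            return 2 ;;
    esac
}

# Example call (hypothetical node name):
#   start_services_cmd node1 manual
```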

Summary

I tried to prepare this text to be as simple and understandable as possible. If you are interested in the topic of PowerHA SystemMirror clusters for the AIX system, I invite you to check out my other publications:

IBM PowerHA SystemMirror clusters – How to maintain and not go crazy – Part1

IBM PowerHA SystemMirror clusters – how to maintain and not go crazy – part2