AIX System Dump: Your Server’s “Flight Recorder”

Even on the best-maintained LPARs, system crashes can occur that you simply can’t predict. When the system suddenly goes down, you have two options: read tea leaves, or open the memory dump. Without data from the system dump, root cause analysis (RCA) becomes a guessing game, and IBM Support has little to work with. In these moments, the system dump is your only source of truth.

What is a System Dump?

Think of a system dump as a forensic snapshot. At the moment of a crash, the AIX kernel freezes everything and writes a memory image to a dedicated device. Unlike standard logs, which only tell you what happened, a dump shows you how it happened: it contains raw kernel structures, process stacks, and CPU registers from the exact second the system went down.

Without a solid dump, you’re just guessing. A usable memory image is the only way for IBM Support to pinpoint the real culprit, be it a buggy driver, a kernel extension memory leak, or a silent hardware failure.

Dump types: Traditional vs. Firmware-Assisted

Dump technology has come a long way since the early days of POWER architecture:

Traditional System Dump: The crashed kernel itself is responsible for saving the dump.
⚠️ Note: If the failure involves the I/O subsystem, the kernel may not be able to write data to disk.
Firmware-Assisted Dump (FW-assisted): Available starting with POWER6 processors. At the moment of a crash, the kernel hands control to the hypervisor (PHYP). The partition is restarted, but the contents of RAM are preserved. The dump is then written after the restart, using working drivers, which makes it far more likely that the data is actually saved.

Verify current settings

You can check the current configuration using the command: sysdumpdev -l

primary              /dev/lg_dumplv
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    TRUE
dump compression     ON
type of dump         fw-assisted
full memory dump     disallow

Key parameters:

primary: Primary dump device. A dedicated logical volume is recommended (e.g. /dev/lg_dumplv).
secondary: Backup device. Usually set to /dev/sysdumpnull.
forced copy flag: Specifies whether the system should force the dump to be copied to the directory given in copy directory after a restart. If the flag is FALSE and there isn’t enough space at the destination, the dump may be lost.
dump compression: Enabled by default starting with AIX 6.1. Compression helps the dump fit on a smaller logical volume.
type of dump: fw-assisted (recommended) or traditional.

Forget about `/dev/hd6` (Paging Space)

Many administrators still leave the configuration set up so that the dump is written to paging space. This is a high-risk mistake.

Why? Paging space is critical for AIX immediately after startup. If a dump ends up there, the system will start overwriting it with its own processes before you even have a chance to think about copying it. The result? A corrupted, unreadable dump and no chance of diagnosis.
The rule is simple: always create a dedicated logical volume of type dump (e.g. lg_dumplv). Disk space is cheap these days, but time lost during a failure is not.
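
One way to audit an existing partition for this mistake is to parse the output of sysdumpdev -l. The sketch below is only an illustration: check_dump_config is a hypothetical helper name, not an AIX tool, and the sample input is the listing shown earlier in this article rather than live output.

```shell
#!/bin/sh
# check_dump_config: hypothetical helper (not an AIX tool).
# Reads `sysdumpdev -l` style text on stdin and prints one line per finding.
check_dump_config() {
    awk '
        $1 == "primary"  { primary = $2 }
        /^type of dump/  { dtype = $4 }
        END {
            if (primary == "/dev/hd6")
                print "WARN: primary dump device is paging space (/dev/hd6)"
            else if (primary == "" || primary == "/dev/sysdumpnull")
                print "WARN: no usable primary dump device"
            else
                print "OK: primary dump device is " primary
            if (dtype == "fw-assisted")
                print "OK: dump type is fw-assisted"
            else if (dtype == "")
                print "WARN: dump type not reported"
            else
                print "WARN: dump type is " dtype
        }'
}

# On AIX you would run:  sysdumpdev -l | check_dump_config
# Here the sample listing from this article stands in for live output.
check_dump_config <<'EOF'
primary              /dev/lg_dumplv
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    TRUE
dump compression     ON
type of dump         fw-assisted
full memory dump     disallow
EOF
```

The function only reads text, so it is safe to run against a saved listing from another machine as well.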

Is the system frozen and unresponsive? Force it down manually.

The worst-case scenario is when the system doesn’t respond to pings or SSH requests but doesn’t reboot on its own. In that case, you have no choice: you have to force a dump and restart manually to find out what is causing the system to freeze.

You have two options, depending on what’s still working:

From the HMC (the most reliable method): If the partition doesn’t respond, go straight to the hypervisor.

GUI: Operations -> Restart -> Dump.
CLI: chsysstate -m ManagedSystem -r lpar -n LPARname -o dumprestart

From the shell (if the terminal is still running):
Use the command: sysdumpstart -p (dump to the primary device) or sysdumpstart -s (dump to the secondary device)

⚠️ A quick heads-up: patience pays off. Forcing a dump takes significantly longer than a simple “hard reset”, because writing gigabytes of memory to disk isn’t instant. That extra downtime is your safety net, though. A cold reboot during a freeze is a wasted opportunity for a diagnosis, while a manually triggered dump gives you a real chance to expose the root cause and avoid the exact same crash tomorrow.

Configuring settings and carving out space

Adjusting your dump settings is straightforward with the sysdumpdev command. Here’s how to tweak the parameters and, more importantly, how to set up a dedicated logical volume so you’re not caught off guard.

Setting	Change command
Primary device	`sysdumpdev -P -p /dev/lg_dumplv`
Secondary device	`sysdumpdev -P -s /dev/sysdumpnull`
Type of dump	`sysdumpdev -t traditional` or `sysdumpdev -t fw-assisted`
Full memory dump	`sysdumpdev -f allow` or `sysdumpdev -f disallow`

Two quick steps to carve out your dump space:

Create a dump-type LV in rootvg (the trailing 16 is the number of logical partitions): mklv -y lg_dumplv -t dump rootvg 16
Make the new device permanent: sysdumpdev -P -p /dev/lg_dumplv

Sizing your dump device: How much space do you actually need?

You can get an estimate with the command: sysdumpdev -e

# sysdumpdev -e
Estimated dump size in bytes: 1443406806
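
To turn that estimate into a size for mklv, divide it by the physical partition (PP) size of the volume group and round up. The sketch below assumes a 128 MB PP size and 20% headroom; both are illustrative values chosen for this example, not IBM recommendations (lsvg rootvg reports the real PP size). The helper names estimate_bytes and lps_needed are hypothetical.

```shell
#!/bin/sh
# estimate_bytes: pull the number out of `sysdumpdev -e` style output (stdin).
estimate_bytes() {
    awk -F': *' '/Estimated dump size in bytes/ { print $2 }'
}

# lps_needed BYTES PP_MB HEADROOM_PCT
# Prints the number of logical partitions needed, rounded up.
# awk is used so the math does not depend on the shell's integer width.
lps_needed() {
    awk -v bytes="$1" -v pp_mb="$2" -v pct="$3" 'BEGIN {
        need = bytes * (100 + pct) / 100      # estimate plus headroom
        x = need / (pp_mb * 1024 * 1024)      # size expressed in partitions
        n = int(x)
        if (n < x) n++                        # round up to a whole partition
        printf "%d\n", n
    }'
}

# On AIX: bytes=$(sysdumpdev -e | estimate_bytes)
# Here the sample output above stands in for the live command.
bytes=$(echo 'Estimated dump size in bytes: 1443406806' | estimate_bytes)
lps_needed "$bytes" 128 20
```

With these inputs the script prints 13, so under the stated assumptions the LV would need at least 13 logical partitions of 128 MB.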

When should you run the verification? Ideally during peak system load and after each RAM upgrade. AIX also checks the dump device size automatically through a dumpcheck entry in the root crontab:

0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1

This ensures the system checks available space once a day. Just remember that the cron job may not run during peak load. If you see the following entry in errpt: “dumpcheck: The largest dump device is too small”, don’t delay: enlarge the volume using extendlv.

Remember: dumpcheck can be a bit of a risk-taking optimist

This script is based on the output of sysdumpdev -l, but it has one key quirk: if compression is enabled (dump compression ON), dumpcheck assumes that the data will shrink by exactly 50%. It therefore halves the estimated size and compares that figure against your dump device. If the real dump compresses worse than that, a device that passed the check can still be too small.
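
The effect of that assumption is easy to see with a little arithmetic. The sketch below does not reproduce the dumpcheck script; it only applies the 50% rule described above next to a less favorable ratio. The sizes (a 4 GiB uncompressed estimate, a 2.5 GiB device) and the 30% figure are made-up values, and the helper names are hypothetical.

```shell
#!/bin/sh
# compressed_size BYTES SAVED_PCT
# Prints the size left after compression removes SAVED_PCT percent.
compressed_size() {
    awk -v b="$1" -v p="$2" 'BEGIN { printf "%.0f\n", b * (100 - p) / 100 }'
}

# fits NEEDED_BYTES DEVICE_BYTES: prints "fits" or "too small".
fits() {
    awk -v n="$1" -v d="$2" 'BEGIN {
        if (n + 0 <= d + 0) print "fits"; else print "too small"
    }'
}

raw=4294967296      # hypothetical uncompressed estimate: 4 GiB
device=2684354560   # hypothetical dump device: 2.5 GiB

echo "assumed 50% saving: $(fits "$(compressed_size "$raw" 50)" "$device")"
echo "actual 30% saving:  $(fits "$(compressed_size "$raw" 30)" "$device")"
```

The first line reports that the dump fits and the second that the device is too small, which is the gap a passing dumpcheck can hide.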

Gathering data for analysis

Once the system is up and running, you need to collect a complete set of data for IBM Support. The quickest way to do this is with the snap or savecore tool.

Using snap -ac will generate a compressed package containing the dump and logs. A quick warning: the diagnostic package can be quite large. If you don’t have enough space in /tmp (the default location), use the -d switch to redirect the output file to another, larger filesystem: snap -ac -d /path/to/large/fs
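
Before starting snap, it can help to confirm that the target filesystem has room for the package. The following is a minimal sketch that relies on the POSIX output format of df -Pk; the helper names free_kb and enough_space are hypothetical, and the 2 GB threshold is an invented figure that you would replace with your own estimate.

```shell
#!/bin/sh
# free_kb DIR: print the free space of the filesystem holding DIR, in KB.
# `df -Pk` prints a header line, then "Available" as the fourth column.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space DIR NEEDED_KB: succeed if DIR has at least NEEDED_KB free.
enough_space() {
    awk -v have="$(free_kb "$1")" -v need="$2" \
        'BEGIN { exit !(have + 0 >= need + 0) }'
}

dir=/tmp            # default location mentioned in the text above
needed_kb=2097152   # hypothetical requirement: 2 GB

if enough_space "$dir" "$needed_kb"; then
    echo "$dir has enough free space"
else
    echo "$dir is too small; consider: snap -ac -d /path/to/large/fs"
fi
```

The script only reports; it never launches snap itself, so it can be run at any time without side effects.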

Why is that even important?

Tools like nmon or Grafana are fantastic for performance tracking, but they’re completely blind when a kernel panic hits. Standard logs often won’t show a thing because the crash happened deep within the kernel code, where typical logging can’t reach.

That’s where the memory dump becomes your only source of truth. Just look at these APAR examples:

These fixes point to specific code bugs that no monitoring tool could ever catch. Setting up your dump device properly takes five minutes and a tiny fraction of disk space. It’s a ridiculously small price to pay compared to hours of unexplained downtime—and the nightmare of waiting for the next crash just to finally get a clue.

Don’t leave your RCA to chance. Configure your dump today.