The Myth of Infinite Uptime in High Availability Systems

I often come across posts online where people boast about their system’s impressive uptime. “Server running without interruption for 5 years!” – sounds impressive, right? But is it really a reason for uncritical admiration? Let me offer my perspective.

Of course, stability and continuous operation of systems are crucial for business. No one wants unplanned downtime. However, long uptime, in my opinion, often hides potential problems that are not immediately apparent.

When I hear about a system that hasn’t been restarted in years, the following issues immediately come to mind:

  • Unpatched Security Vulnerabilities: Patches and updates, even critical ones, sit waiting to be applied. This is like leaving a door open to potential threats.
  • Accumulating Configuration Errors: Minor changes and modifications that can accumulate over time and lead to unpredictable system behavior.
  • Outdated Dependencies: Systems with long uptime may still be running on older versions of libraries, leading to compatibility issues with new applications.
  • Memory Leaks: Small leaks in long-running processes add up over months and years, gradually degrading performance.
  • Untested Boot Path: After years without a reboot, no one actually knows whether the system will come back up.
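
Some of these risks are easy to surface routinely. Here is a minimal sketch of such a check on Linux: it reads `/proc/uptime` and warns once uptime exceeds a threshold. The 90-day value is an assumed policy, not a standard; pick whatever matches your patch cycle.

```shell
# Hypothetical health check: warn when uptime exceeds a patch-cycle threshold.
# /proc/uptime field 1 is seconds since boot; 90 days is an assumed policy value.
UPTIME_DAYS=$(awk '{ printf "%d", $1 / 86400 }' /proc/uptime)
THRESHOLD_DAYS=90
if [ "$UPTIME_DAYS" -gt "$THRESHOLD_DAYS" ]; then
    echo "WARN: up ${UPTIME_DAYS} days - schedule a maintenance reboot"
else
    echo "OK: up ${UPTIME_DAYS} days"
fi
# On Debian/Ubuntu, package updates flag a pending reboot via this file:
[ -f /var/run/reboot-required ] && echo "Reboot required by pending updates"
```

A check like this, dropped into monitoring, turns “how long since the last reboot?” from a point of pride into an actionable alert.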

As a result, when the time finally comes for a restart – whether planned or forced – the system may… simply not come back up. Unfortunately, this happens in practice.

Of course, there are mechanisms like Linux Kernel Live Patching or Live Kernel Update (LKU) in AIX that allow you to update the system kernel without a restart. These are great tools, but their purpose shouldn’t be to extend uptime indefinitely. They serve to increase flexibility and separate the update window from the restart window. They allow for the deployment of fixes “on the fly,” but they don’t replace regular, planned restarts.
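
One way to see the gap between “updated” and “rebooted” on an ordinary Linux box is to compare the running kernel with the newest one installed on disk – a minimal sketch, assuming the typical `/boot/vmlinuz-*` naming (which varies by distribution):

```shell
# Compare the running kernel with the newest kernel image on disk.
# A mismatch means updates were installed but not yet activated by a reboot
# (or were applied via live patching instead).
RUNNING=$(uname -r)
echo "Running kernel: ${RUNNING}"
NEWEST=$(ls /boot/vmlinuz-* 2>/dev/null | sed 's|.*/vmlinuz-||' | sort -V | tail -n 1)
echo "Newest on disk: ${NEWEST:-unknown}"
[ "$RUNNING" = "$NEWEST" ] || echo "Kernel on disk differs - a reboot (or live patch) is pending"
```

Live patching closes this gap temporarily, but the comparison above is a useful reminder that the on-disk kernel has never been exercised until the machine actually boots from it.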

The same applies to high-availability servers, such as IBM Z (mainframe) or IBM Power, which feature the concept of Reliability, Availability, and Serviceability (RAS). Redundancy of key components and the ability to replace I/O elements without shutting down the system are fantastic solutions. But they are also not meant for the system to work continuously “until death.” Their goal is to ensure that we decide when the restart occurs, not chance.

High availability is not synonymous with infinite uptime. It’s rather the ability to manage the system’s life cycle in a controlled and planned manner.

I know that for many experienced administrators, what I’m writing is obvious. However, for people just starting out in IT, as well as for some managers, long uptime can be misinterpreted as an indicator of success. “10 years without a restart!” sounds impressive, but is it really something to be proud of? It’s more likely a warning sign that the system is outdated and hiding numerous problems that will only be revealed after a restart.

I believe that a common-sense approach to uptime is worthwhile. Let’s not chase records, but focus on regular, planned system maintenance. Because ultimately, it’s about systems running stably and securely, not just for a long time. And that often requires… regular restarts.
