
How to Evaluate Replication Latency in IBM i Disaster Recovery

Replication latency is one of the most frequently discussed topics in IBM i disaster recovery, yet it remains one of the most misunderstood. It is often reduced to a single number on a dashboard, quoted as evidence that a system is protected, without sufficient consideration of what that number actually represents or whether it reflects real business exposure.


Latency is not just a technical performance metric. It is the practical expression of Recovery Point Objective (RPO). At any given moment, it represents how much business activity may not yet be usable on a recovery system if the primary environment were to fail.


The challenge is that many organizations measure replication latency in ways that are technically correct, but operationally misleading. To evaluate it properly, IBM i teams need to understand where latency is introduced, how it behaves under real workloads, and whether the data on the target system is genuinely ready to support business continuity. This is particularly relevant when assessing logical replication solutions such as Maxava HA, where latency must be understood end to end rather than inferred from a single indicator.



Latency is not a single delay


One of the most common mistakes in evaluating replication latency is assuming it is a single, uniform delay. In reality, replication is a pipeline with multiple stages, and different approaches expose different parts of that pipeline.


On IBM i, changes are typically captured through journaling, transmitted to a target system, and then processed so that they are usable by applications. Latency can be introduced at each of these stages. A transaction may be committed on the source, written to the journal, transmitted quickly across the network, yet still not be fully usable on the target if downstream processing cannot keep pace.
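The stages above can be made concrete with a small sketch. The timestamps here are hypothetical, purely to show how end-to-end latency decomposes across capture, transport, and apply; real values would come from your own instrumentation:

```python
from datetime import datetime

# Hypothetical timestamps for one transaction, captured at each
# replication stage: commit on source -> journal write -> receipt
# on target -> applied and usable on target.
stages = {
    "commit_on_source":   datetime(2024, 1, 15, 10, 0, 0, 0),
    "journal_write":      datetime(2024, 1, 15, 10, 0, 0, 120000),
    "received_on_target": datetime(2024, 1, 15, 10, 0, 0, 450000),
    "usable_on_target":   datetime(2024, 1, 15, 10, 0, 7, 900000),
}

def stage_delays(ts):
    """Return the delay contributed by each consecutive stage, in seconds."""
    names = list(ts)
    return {
        f"{a} -> {b}": (ts[b] - ts[a]).total_seconds()
        for a, b in zip(names, names[1:])
    }

for stage, secs in stage_delays(stages).items():
    print(f"{stage}: {secs:.3f}s")

# End-to-end latency is what reflects business RPO, not the
# transport leg alone.
end_to_end = (stages["usable_on_target"]
              - stages["commit_on_source"]).total_seconds()
print(f"end-to-end: {end_to_end:.3f}s")
```

Note how the transport leg here is a fraction of a second while the end-to-end figure is several seconds: exactly the gap a dashboard showing only transmission delay would hide.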


This distinction matters because many reported latency figures only describe part of the journey. A low transport delay does not necessarily mean a low business RPO. What ultimately matters is end-to-end delay, from commit on the source to a state on the target where applications can safely resume processing. Modern IBM i HA solutions, including Maxava HA, are designed around preserving transactional sequence so that this end-to-end view can be evaluated accurately.


Why IBM i replication latency is workload driven


It is tempting to think of replication latency as a network problem, something that can be solved by adding bandwidth or reducing distance. In practice, IBM i replication latency is far more dependent on workload characteristics and system behavior than on raw network capacity.


IBM documentation makes this clear. Remote journaling performance is influenced by the rate at which journal entries are generated, the delivery mode in use, processor utilization on both systems, disk behavior, and the quality of the network connection. During quiet periods, replication may appear almost instantaneous. During heavy batch processing or peak transactional load, latency can increase rapidly even on well-provisioned links.


Synchronous delivery modes prioritize keeping the target current, but can introduce response time impact on the source. Asynchronous modes minimize impact on production workloads, but allow lag to build if change rates exceed transmission or processing capacity. Neither approach is inherently wrong. What matters is understanding how each behaves under the workloads that matter most to the business. Logical replication architectures such as Maxava HA are specifically designed to minimize production impact while maintaining predictable behavior as workloads fluctuate.
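The asynchronous-lag behavior described above can be modeled in a few lines. This is a deliberately minimal queueing sketch with invented rates, not a measurement of any real system: backlog accumulates whenever the change-generation rate exceeds transmission and apply capacity, then drains when load drops.

```python
# Minimal model of asynchronous lag. All rates are illustrative.
capacity_mb_per_min = 100  # sustained transmit/apply capacity

# Generation rate per minute: quiet period, batch peak, quiet again.
workload = [40] * 10 + [250] * 15 + [40] * 20

backlog_mb = 0.0
peak_backlog = 0.0
for generated in workload:
    # Backlog grows by the excess of generation over capacity,
    # and can never go negative.
    backlog_mb = max(0.0, backlog_mb + generated - capacity_mb_per_min)
    peak_backlog = max(peak_backlog, backlog_mb)

print(f"peak backlog: {peak_backlog:.0f} MB")
print(f"backlog after window: {backlog_mb:.0f} MB")
```

Even this toy model shows the key behavior: a 15-minute burst can leave a backlog that takes far longer than 15 minutes to drain, which is why latency must be evaluated against peak workloads, not averages.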


Measuring latency where it matters


IBM i provides a useful indicator for asynchronous remote journaling, often referred to as the estimated time behind. This value represents the estimated delay between when journal entries are written to disk on the source system and when they are received on the target. When monitored over time, it can provide valuable insight into how replication behaves during normal operations and how it responds to bursts of activity.


Used correctly, this metric allows teams to observe trends rather than snapshots. It highlights sustained increases in lag, exposes patterns tied to batch windows or maintenance activity, and provides early warning when replication is no longer keeping pace with the workload.
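A trend-oriented view of that metric might look like the following sketch. How the "estimated time behind" value is actually collected is environment specific; the samples here are hard-coded for illustration, and the window and threshold are arbitrary choices, not recommendations:

```python
from statistics import mean

# Hypothetical "estimated time behind" samples in seconds,
# taken at regular intervals.
samples = [2, 3, 2, 4, 35, 180, 240, 310, 150, 20, 3, 2]

def sustained_lag(samples, window=3, threshold_secs=60):
    """Flag windows where the *average* lag stays above the
    threshold -- a sustained trend, not a one-off spike."""
    flags = []
    for i in range(len(samples) - window + 1):
        w = samples[i:i + window]
        if mean(w) > threshold_secs:
            flags.append((i, round(mean(w), 1)))
    return flags

for start, avg in sustained_lag(samples):
    print(f"samples {start}..{start + 2}: avg lag {avg}s above threshold")
```

Averaging over a sliding window is what separates a momentary spike (which async replication absorbs by design) from replication genuinely failing to keep pace.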


However, it is important to understand what this metric does and does not represent. It describes receipt on the target, not full application readiness. It is an indicator of transport behavior, not a complete measure of recoverability. Treating it as a proxy for business RPO without further validation can create a false sense of security. This is why IBM i HA platforms such as Maxava HA complement latency indicators with integrity validation and role-swap testing to confirm true recoverability.


Looking beyond dashboards with receiver analysis


For organizations that want to go deeper, IBM provides a practical method for analyzing replication behavior using journal receiver timing. This approach examines how quickly receivers are filled on the source and how quickly they are created on the target, allowing teams to calculate effective throughput and identify backlogs.


What makes this technique particularly valuable is that it separates data generation from data transmission. It becomes immediately clear whether replication is falling behind because the business is generating change faster than expected, because the network cannot sustain the load, or because target-side processing is constrained.
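The receiver-timing idea can be sketched as below. The receiver sizes and timestamps are hypothetical; in practice they would come from receiver attributes on the source and target systems:

```python
from datetime import datetime

# Hypothetical journal receiver records: when each receiver was
# filled on the source vs. created on the target, plus its size.
receivers = [
    # (size_mb, filled_on_source,           created_on_target)
    (1024, datetime(2024, 1, 15, 1, 0),  datetime(2024, 1, 15, 1, 4)),
    (1024, datetime(2024, 1, 15, 1, 30), datetime(2024, 1, 15, 1, 52)),
    (1024, datetime(2024, 1, 15, 1, 45), datetime(2024, 1, 15, 2, 40)),
]

for size_mb, filled, created in receivers:
    lag_min = (created - filled).total_seconds() / 60
    print(f"{size_mb} MB receiver: target lagged source by {lag_min:.0f} min")

# Generation rate on the source over the window, computed separately
# from the per-receiver lag. Rising generation with flat lag points at
# the business producing more change; flat generation with rising lag
# points at transport or target-side processing.
window_min = (receivers[-1][1] - receivers[0][1]).total_seconds() / 60
gen_rate = sum(r[0] for r in receivers) / window_min
print(f"generation rate: {gen_rate:.1f} MB/min over {window_min:.0f} min")
```

The value of the technique is visible even in this toy data: the generation rate is steady, but the per-receiver lag grows from minutes to nearly an hour, which points away from the network and toward a sustained backlog.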


This type of analysis is especially useful when replication performance is questioned during peak periods. Rather than relying on assumptions or anecdotal evidence, teams can point to measured rates and observed behavior over defined time windows. These techniques are commonly used when validating IBM i HA environments built on logical replication, including Maxava HA deployments.


The hidden impact of network quality


Another area that is often overlooked in latency evaluation is network stability. IBM documentation highlights that network retransmissions can materially affect remote journaling performance. Even when bandwidth appears sufficient, retransmissions can introduce delays that are difficult to diagnose without explicit monitoring.


In practice, replication issues are often the first place where network problems surface, simply because of the volume and continuity of journal traffic. Incorporating basic retransmission checks into regular evaluation helps avoid misattributing latency to replication design when the underlying issue is line quality or configuration. This applies equally to all IBM i logical replication solutions, including Maxava HA.
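A basic retransmission check reduces to comparing two counter samples taken an interval apart. How the counters are collected varies by platform and tooling; the numbers below are illustrative, and the 1% threshold is a common rule of thumb rather than an IBM-specified limit:

```python
# Two hypothetical samples of TCP counters for the replication link.
before = {"segments_sent": 1_200_000, "segments_retransmitted": 600}
after  = {"segments_sent": 1_450_000, "segments_retransmitted": 3_100}

# Work with deltas so long-lived absolute counters don't mask a
# recent deterioration in line quality.
sent = after["segments_sent"] - before["segments_sent"]
retrans = after["segments_retransmitted"] - before["segments_retransmitted"]
rate_pct = 100.0 * retrans / sent

print(f"{retrans} retransmissions over {sent} segments ({rate_pct:.2f}%)")
if rate_pct >= 1.0:
    print("investigate network quality before tuning replication")
```

The point of using deltas rather than lifetime totals is that a link can show a healthy cumulative rate while currently retransmitting heavily, which is exactly the condition that shows up first as unexplained replication lag.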


Journaling configuration and its trade-offs


Replication latency is also influenced by how journaling itself is configured. Options such as journal caching can improve throughput by buffering entries in memory before writing them to disk, reducing I/O pressure during busy periods.


However, IBM documentation is clear about the trade-offs. Cached journal entries are not preserved if the system fails before they are written to disk, and they may not be immediately visible to retrieval functions. For environments where near-zero data loss is a hard requirement, these trade-offs must be carefully considered.


When evaluating latency, journaling configuration should always be part of the discussion. Performance improvements that compromise recoverability undermine the purpose of disaster recovery. Architectures like Maxava HA are designed to balance performance optimization with recoverability requirements rather than prioritizing raw speed alone.


Interpreting latency in business terms


There is no single “good” replication latency number that applies to every IBM i environment. The correct interpretation depends on workload patterns, business criticality, and tolerance for data loss during different operational periods.


A more useful approach is to define acceptable latency ranges for normal operations, peak processing, and recovery catch-up. This allows technical measurements to be translated into meaningful business exposure and supports honest discussions with stakeholders about what is achievable in practice.
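Defining those ranges can be as simple as a lookup that turns a measured latency into a business-facing status. The thresholds below are placeholders; real values must come from RPO discussions with stakeholders, not from this sketch:

```python
# Illustrative acceptable-latency limits (seconds) per operational
# period. Real thresholds are a business decision, not a default.
acceptable = {
    "normal":   30,    # steady-state transactional load
    "peak":     300,   # batch windows, month-end, promotions
    "catch_up": 1800,  # recovery after an outage or resync
}

def classify(period, observed_secs):
    """Translate a measured latency into business exposure terms."""
    limit = acceptable[period]
    status = "within tolerance" if observed_secs <= limit else "exceeds tolerance"
    return f"{period}: {observed_secs}s vs limit {limit}s -> {status}"

print(classify("normal", 12))
print(classify("peak", 420))
print(classify("catch_up", 900))
```

The same measured number can be acceptable in one period and a genuine exposure in another, which is why a single dashboard threshold is rarely enough.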


Replication latency should never be evaluated only during quiet periods. It should be measured and understood at the moments when failure would hurt most. IBM i HA solutions such as Maxava HA are typically evaluated using this multi-scenario approach rather than single-point measurements.


Why this discipline matters


Replication latency is easy to underestimate and difficult to explain after an incident. When it is evaluated rigorously, it becomes a powerful assurance tool. It connects technical behavior to business risk, exposes weaknesses before they result in outages, and supports realistic recovery commitments.


In IBM i disaster recovery, low latency only matters if it reflects recoverable, usable data. Understanding how to measure it, interpret it, and validate it under real conditions is what separates confidence from assumption. This is the standard that modern IBM i HA platforms, including Maxava HA, are designed to support.
