F5 LTM High Availability: Building Bulletproof Load Balancer Pairs

Single points of failure are unacceptable in production environments. That’s why nearly every enterprise F5 LTM deployment runs in high availability (HA) pairs—two devices working together to ensure load balancing services remain available even when hardware fails, software crashes, or maintenance is required. Let’s dive into how F5 LTM HA actually works, the different deployment models, and the gotchas you’ll encounter when building resilient load balancer infrastructure.

What Is F5 LTM High Availability?

F5 LTM High Availability is a clustering technology that pairs two (or more) BIG-IP devices to eliminate single points of failure. When configured correctly, an HA pair ensures that if one device fails, the other seamlessly takes over—maintaining application availability without user impact.

Core HA Capabilities

Configuration Synchronization: Changes made on one device automatically replicate to its partner
Automatic Failover: When the active device fails, the standby becomes active within seconds
Connection Mirroring: Active connections can be synchronized so failover is stateful (optional)
Health Monitoring: Devices continuously monitor each other’s health via heartbeat mechanisms
Shared Floating IPs: Virtual IP addresses (VIPs) move between devices during failover

Analogy: Think of an HA pair like two pilots in a cockpit. The captain (active device) flies the plane while the first officer (standby device) monitors everything and stays ready. If the captain becomes incapacitated, the first officer immediately takes the controls. Passengers (users) never notice the transition.

HA Deployment Models

F5 supports multiple HA configurations, each with different use cases and trade-offs:

1. Active-Standby (Most Common)

How it works:

One device is Active and processes all traffic
The other device is Standby and ready to take over
Floating IP addresses (self IPs and VIPs) live on the active device
During failover, IPs move to the standby device (which becomes active)

Traffic Flow:

Normal Operation:
[Clients] → [Active LTM] → [Servers]
             ↓
        [Standby LTM] (idle, monitoring)

After Failover:
[Clients] → [New Active LTM (was standby)] → [Servers]
             ↓
        [Failed LTM] (offline)

Pros:

Simple to understand and troubleshoot
Standby has full capacity available during failover
Clean separation of roles (one device actively processing)
Best for most enterprise deployments

Cons:

50% of hardware capacity sits idle
Standby device doesn’t process traffic (wasted investment)

2. Active-Active

How it works:

Both devices are Active and process traffic simultaneously
Different VIPs are configured on each device (or same VIPs with traffic splitting)
During failover, the surviving device takes over all VIPs

Example Setup:

Device A (Active): Handles VIP 10.1.1.10 (Web App)
Device B (Active): Handles VIP 10.1.1.20 (API App)

During Normal Operation:
[Web Clients] → [Device A] → [Web Servers]
[API Clients] → [Device B] → [API Servers]

If Device A Fails:
[Web Clients] → [Device B (takes over VIP 10.1.1.10)] → [Web Servers]
[API Clients] → [Device B (already handling)] → [API Servers]

Pros:

100% hardware utilization (no idle capacity)
Better ROI on hardware investment
Load distribution across both devices

Cons:

More complex configuration and troubleshooting
During failover, the surviving device carries both devices' traffic, roughly double its normal load (must be sized accordingly)
Connection mirroring is more complicated
Higher risk of performance degradation during failure

When to use: When hardware utilization is more important than operational simplicity, and you’ve sized each device to handle 100% of traffic alone.
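
The sizing rule comes down to simple arithmetic. Here is a minimal sketch, assuming you have a peak load figure for each device and a rated capacity for a single device. The function name, the sample numbers, and the 80% headroom default are illustrative assumptions, not F5 guidance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: can one device carry both devices' peak load?
# Arguments: peak load of device A, peak load of device B, rated capacity
# of a single device (all in the same unit, e.g. Mbps), and an optional
# headroom percentage (assumed default: 80).
can_survive_failover() {
    local peak_a=$1 peak_b=$2 capacity=$3 headroom_pct=${4:-80}
    local combined=$(( peak_a + peak_b ))
    local usable=$(( capacity * headroom_pct / 100 ))
    if (( combined <= usable )); then
        echo "OK: combined peak ${combined} fits within ${usable} usable"
    else
        echo "UNDERSIZED: combined peak ${combined} exceeds ${usable} usable"
        return 1
    fi
}
```

For example, `can_survive_failover 3000 4000 10000` passes, while `can_survive_failover 5000 4000 10000` reports the pair as undersized for a single-device failure.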

Device Service Clustering (DSC): The Foundation

F5’s HA functionality is built on Device Service Clustering (DSC)—the framework that enables devices to work together.

Key DSC Components

1. Device Trust

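Before devices can cluster, they must establish trust by exchanging certificates. The initial exchange runs over HTTPS (TCP 443) to the peer's management address; once trust exists, config sync traffic between the devices uses TCP 4353:

# On Device A, add Device B to the local trust domain
# (<device-b-mgmt-ip>, <admin-user>, and <password> are placeholders)
tmsh modify cm trust-domain Root ca-devices add { <device-b-mgmt-ip> } name device-b.example.com username <admin-user> password <password>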

2. Device Groups

Device Groups define which devices work together and what gets synchronized:

Sync-Failover Group: Devices that sync config AND handle failover together (typical HA pair)
Sync-Only Group: Devices that only sync config (no failover coordination)

# Create sync-failover device group
# (written as one shell command; a bare "{" at end of line would end the command early)
tmsh create cm device-group my-ha-pair \
    type sync-failover \
    devices add { device-a.example.com device-b.example.com } \
    auto-sync enabled \
    network-failover enabled

3. Traffic Groups

Traffic Groups define which floating IP addresses move together during failover:

Floating Self IPs (shared addresses that follow the active device, often used as the servers' default gateway)
Virtual Server IPs (VIPs that clients connect to)
SNAT IPs (if used)

In Active-Standby, you typically have one traffic group. In Active-Active, you have multiple traffic groups distributed across devices.
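
To make the Active-Active case concrete, here is a hedged sketch that splits the two example VIPs across two traffic groups. It assumes the default `traffic-group-1` already holds everything, that a virtual address object named `10.1.1.20` exists for the API app, and that `traffic-group-2` is a name chosen for illustration.

```shell
# Create a second traffic group (traffic-group-1 exists by default)
tmsh create cm traffic-group traffic-group-2

# Move the API virtual address into the new traffic group
tmsh modify ltm virtual-address 10.1.1.20 traffic-group traffic-group-2

# Run on the device that should NOT own traffic-group-2:
# it goes standby for that group only, so the peer becomes active for it
tmsh run sys failover standby traffic-group traffic-group-2

# Confirm which device is active for each traffic group
tmsh show cm traffic-group
```

Each traffic group fails over independently, which is why the surviving device ends up owning both groups when its peer fails.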

How Failover Actually Works

Failover Triggers

Failover can be triggered by:

Hardware failure: Power loss, CPU failure, memory failure
Software failure: TMOS crash, kernel panic, critical daemon failure
Network failure: Loss of network connectivity (monitored interfaces down)
Manual failover: Administrator forces failover for maintenance
Gateway pool failure: All gateway pool members down (if configured)

Failover Sequence

When failover occurs:

Detection: Standby detects active failure (missed heartbeats, interface down, etc.)
Transition: Standby promotes itself to Active state
IP Migration: Floating IPs (Self IPs, VIPs, SNATs) move to new active device
Gratuitous ARP: New active sends GARP to update network switch MAC tables
Traffic Resumption: New active begins processing traffic
Connection Recovery: Existing connections either break (stateless) or continue (if mirrored)

Typical failover time: 3-10 seconds for network failover, longer if connection mirroring is enabled.
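
Rather than trusting a typical figure, you can measure failover time in your own environment during a planned test. The sketch below is one assumed approach: probe a VIP about once per second with curl, then report the longest run of failed probes. The function names and the example URL are hypothetical.

```shell
#!/usr/bin/env bash
# Probe a URL roughly once per second; prints "<epoch> up" or "<epoch> down".
probe_vip() {
    local url=$1 count=$2 i
    for (( i = 0; i < count; i++ )); do
        if curl -s -o /dev/null --max-time 1 "$url"; then
            echo "$(date +%s) up"
        else
            echo "$(date +%s) down"
        fi
        sleep 1
    done
}

# Read probe lines on stdin and print the longest run of consecutive
# "down" probes (an approximation of the outage length in probe intervals).
longest_outage() {
    awk '$2 == "down" { run++; if (run > max) max = run; next }
         { run = 0 }
         END { print max + 0 }'
}
```

A run such as `probe_vip http://10.1.1.10/ 60 | longest_outage`, started just before a manual failover, gives a rough outage length. Because each failed probe can take up to a second before the sleep, treat the result as a count of probes, not an exact number of seconds.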

Connection Mirroring: Stateful Failover

By default, failover is stateless—existing connections break and clients must reconnect. For mission-critical applications, you can enable connection mirroring:

# Enable mirroring on a virtual server
tmsh modify ltm virtual my-vip mirror enabled

How it works:

Active device continuously replicates connection state to standby via dedicated mirroring network
Standby maintains a synchronized connection table
During failover, standby already knows about all active connections
Connections continue seamlessly (from client perspective)

Trade-offs:

Pro: Zero connection loss during failover
Con: Significant performance overhead (each connection requires mirroring traffic)
Con: Requires dedicated high-bandwidth mirroring VLAN
Con: Only mirrors certain connection types (not all protocols supported)

When to use: Long-lived connections (FTP, database, SSH) where reconnection is expensive or disruptive. Not worth it for short HTTP requests.
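
Enabling mirroring on a virtual server only helps if each device also knows which address to mirror over. A minimal sketch, assuming a dedicated mirroring VLAN with non-floating self IPs 192.168.2.10 and 192.168.2.11 (these addresses are illustrative and do not appear elsewhere in this article):

```shell
# Run on Device A: set the local address used for connection mirroring
tmsh modify cm device device-a.example.com mirror-ip 192.168.2.10

# Run on Device B: set its own mirroring address
tmsh modify cm device device-b.example.com mirror-ip 192.168.2.11

# Confirm the mirror setting on the virtual server
tmsh list ltm virtual my-vip mirror
```

Each device sets its own mirror address, so the first two commands are run on different devices.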

Network Connectivity Requirements

HA pairs require specific network connectivity:

1. HA VLAN (ConfigSync/Failover)

Purpose: Configuration synchronization and heartbeat monitoring

Dedicated VLAN connecting both devices
Carries config sync traffic (TCP 4353)
Carries network failover heartbeat traffic (UDP 1026) for failover detection
Typically uses non-floating Self IPs

Best practice: Use a dedicated physical interface (not shared with data traffic) on a private VLAN.

2. Network Failover VLAN

Purpose: Redundant heartbeat path

Secondary heartbeat mechanism (separate from HA VLAN)
Prevents false failovers from single link failures
Can share data VLANs or use dedicated link

Recommendation: Always configure network failover on at least one additional VLAN beyond the HA VLAN.

3. Mirroring VLAN (Optional)

Purpose: Connection state synchronization

High-bandwidth dedicated link for connection mirroring
Should be separate from HA VLAN (mirroring is bandwidth-intensive)
10G+ recommended for high-throughput environments

[Device A]                    [Device B]
    |                              |
    |--- HA VLAN (1.1) ------------|  (Config Sync, Heartbeat)
    |                              |
    |--- Mirror VLAN (1.2) --------|  (Connection Mirroring)
    |                              |
    |--- Client VLAN (10.1) -------|  (Data + Network Failover)
    |                              |
    |--- Server VLAN (10.2) -------|  (Data + Network Failover)

Configuration Walkthrough: Building an Active-Standby Pair

Here’s the step-by-step process for configuring a basic Active-Standby HA pair:

Step 1: Configure Management and HA Interfaces

On both devices, configure:

# Device A
tmsh create net vlan ha-vlan interfaces add { 1.1 }
tmsh create net self 192.168.1.10 address 192.168.1.10/24 vlan ha-vlan allow-service default

# Device B
tmsh create net vlan ha-vlan interfaces add { 1.1 }
tmsh create net self 192.168.1.11 address 192.168.1.11/24 vlan ha-vlan allow-service default

Step 2: Establish Device Trust

On Device A:

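# Set the ConfigSync and failover unicast addresses for Device A
# (run the equivalent commands on Device B using 192.168.1.11)
tmsh modify cm device device-a.example.com configsync-ip 192.168.1.10
tmsh modify cm device device-a.example.com unicast-address { { ip 192.168.1.10 } }

# Add Device B to the trust domain
# (<device-b-mgmt-ip>, <admin-user>, and <password> are placeholders for Device B's details)
tmsh modify cm trust-domain Root ca-devices add { <device-b-mgmt-ip> } name device-b.example.com username <admin-user> password <password>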

Step 3: Create Device Group

# On Device A (will sync to Device B)
# (written as one shell command; a bare "{" at end of line would end the command early)
tmsh create cm device-group my-ha-pair \
    type sync-failover \
    devices add { device-a.example.com device-b.example.com } \
    auto-sync enabled \
    network-failover enabled

Step 4: Configure Floating IPs

# Create client-facing VLAN on both devices (already done in initial setup)
# Then create FLOATING Self IP (will move during failover)
tmsh create net self 10.1.1.10 address 10.1.1.10/24 vlan client-vlan traffic-group traffic-group-1 allow-service none

Step 5: Configure Network Failover

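# Enable network failover on client VLAN by adding a second unicast address.
# Use Device A's NON-floating self IP on that VLAN; 10.1.1.11 is an assumed example.
# The floating IP from Step 4 is unsuitable because it moves during failover.
tmsh modify cm device device-a.example.com unicast-address add { { ip 10.1.1.11 } }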

Step 6: Perform Initial Sync

# Force sync from Device A to Device B
tmsh run cm config-sync to-group my-ha-pair

Step 7: Verify HA Status

# Check sync status
tmsh show cm sync-status

# Check failover status
tmsh show cm failover-status

# Verify device group
tmsh show cm device-group my-ha-pair

You should see Device A as Active and Device B as Standby, with sync status showing In Sync.

Common HA Problems and Solutions

Problem 1: Config Sync Fails

Symptom: “Changes Pending” or “Awaiting Initial Sync” that never resolves.

Causes:

Config sync connectivity broken (TCP 4353 blocked between the ConfigSync addresses)
Certificate trust issues
Version mismatch between devices
Device group misconfiguration

Solutions:

# Verify config sync connectivity to the peer's ConfigSync address
telnet <peer-ip> 4353

# Check sync status details
tmsh show cm sync-status detail

# Force sync from known-good device
tmsh run cm config-sync to-group my-ha-pair

# Nuclear option: reset device trust on the affected device
# (this removes ALL trust relationships on the device where it runs)
tmsh delete cm trust-domain all
# Re-establish trust and device group

Problem 2: Split-Brain (Both Devices Active)

Symptom: Both devices think they’re active, both serving traffic.

Cause: Heartbeat communication failed on ALL monitored paths, so each device assumes the other is dead.

Prevention:

Configure network failover on multiple VLANs
Use dedicated HA VLAN separate from data VLANs
Monitor HA link health proactively

Recovery:

# Force one device to standby
tmsh run sys failover standby

# Investigate why heartbeat failed
# Fix network connectivity
# Verify heartbeat restored before trusting HA again
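
Split-brain is easy to miss unless something compares the state each device reports about itself. The helper below is a hypothetical sketch for an Active-Standby pair: it takes the two reported states as arguments (however your monitoring collects them) and classifies the pair. In an Active-Active design both devices being active is normal, so this check would need to be applied per traffic group instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an Active-Standby pair from the failover
# state each device reports about itself ("active" or "standby").
classify_pair() {
    local a b
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    case "$a/$b" in
        active/active)                 echo "split-brain" ;;
        active/standby|standby/active) echo "healthy" ;;
        *)                             echo "degraded" ;;
    esac
}
```

Anything other than exactly one active and one standby device is reported as `degraded`, which also covers the case where neither device is active.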

Problem 3: Failover Takes Too Long

Symptom: Failover takes 30+ seconds, causing extended outages.

Causes:

Connection mirroring enabled on high-connection-count VIPs
Network convergence delays (STP, routing protocols)
Gateway pool checks delaying transition

Solutions:

Disable connection mirroring unless absolutely necessary
Use portfast/RSTP on HA switch ports
Tune gateway pool monitor intervals
Consider static routes instead of dynamic routing on HA links

Problem 4: Flapping (Repeated Failovers)

Symptom: Devices keep failing over back and forth.

Causes:

Intermittent network connectivity
Resource exhaustion (CPU, memory) causing heartbeat delays
Gateway pool flapping
Hardware issues (failing NIC, power supply)

Solutions:

Check `/var/log/ltm` for failover reason codes
Monitor resource utilization (CPU, memory, network)
Verify physical connectivity and cable health
Tune gateway pool monitors to be less sensitive
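
A quick way to confirm flapping is to count state changes in the log. The sketch below assumes that the failover daemon logs as `sod[<pid>]` and that its state-change messages contain the word "Active" or "Standby". Both are assumptions about the log format, so adjust the pattern to what your TMOS version actually writes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough count of failover state changes in a log file.
# The pattern is an assumption about the message format, not a guarantee.
count_failover_events() {
    local logfile=${1:-/var/log/ltm}
    # grep -c prints 0 and exits non-zero when nothing matches; tolerate that
    grep -cE 'sod\[[0-9]+\]:.*(Active|Standby)' "$logfile" || true
}
```

More than a handful of matches in a short window, with no maintenance scheduled, is a good reason to start checking links, resource usage, and gateway pool monitors.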

Monitoring HA Health

Proactive monitoring prevents HA failures from becoming outages:

Critical Metrics to Monitor

Sync status: Should always be “In Sync”
Failover status: Active/Standby as expected (not both active)
Heartbeat health: All monitored paths sending heartbeats
Traffic group location: Floating IPs on expected device
Failover event count: Alert on unexpected failovers
Certificate expiration: Device trust certs

Monitoring via iControl REST

# Check sync status
GET https://ltm-ip/mgmt/tm/cm/sync-status

# Check failover status
GET https://ltm-ip/mgmt/tm/cm/failover-status

# Check device status
GET https://ltm-ip/mgmt/tm/cm/device

# Check traffic group status
GET https://ltm-ip/mgmt/tm/cm/traffic-group

Integrate these API calls into Prometheus, Zabbix, or your monitoring platform to alert on HA issues before they cause outages.
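
As a starting point for such an integration, here is a minimal sketch of a sync check built on curl. It assumes basic authentication with credentials in the hypothetical environment variables `F5_USER` and `F5_PASS`, a self-signed certificate (hence `-k`), and that the JSON response carries a description field with the text "In Sync". Verify that last assumption against a real response from your device.

```shell
#!/usr/bin/env bash
# Fetch the sync status JSON from a BIG-IP (host passed as first argument).
# F5_USER and F5_PASS are assumed environment variables.
fetch_sync_status() {
    curl -sk -u "${F5_USER}:${F5_PASS}" "https://$1/mgmt/tm/cm/sync-status"
}

# Succeed only if the JSON on stdin contains a description of "In Sync".
is_in_sync() {
    grep -q '"description"[[:space:]]*:[[:space:]]*"In Sync"'
}
```

A cron job or monitoring plugin could then run `fetch_sync_status ltm-ip | is_in_sync || echo "ALERT: HA pair not in sync"`. Matching text with grep keeps the sketch dependency-free; a JSON parser is the sturdier choice for production checks.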

Best Practices

Use identical hardware: HA pairs should have matching models, memory, CPU
Keep versions in sync: Run the same TMOS version on both devices
Dedicated HA VLAN: Don’t share HA traffic with production data
Multiple heartbeat paths: Network failover on at least 2 VLANs
Auto-sync enabled: Reduces manual sync operations and human error
Test failover regularly: Don’t wait for real failure to discover problems
Document traffic group mappings: Know which VIPs are in which traffic groups
Monitor sync status: Alert on “Changes Pending” that persist > 5 minutes
Avoid connection mirroring unless necessary: Performance overhead is significant
Plan capacity for Active-Active: Each device must handle 100% load alone

Conclusion

F5 LTM High Availability transforms load balancers from single points of failure into resilient infrastructure. When configured correctly, HA pairs provide seamless failover, automated configuration synchronization, and the peace of mind that comes from knowing your application delivery tier can survive hardware failures, software crashes, and planned maintenance.

The key to successful HA deployments:

Understand the different deployment models (Active-Standby vs Active-Active)
Configure redundant heartbeat paths
Monitor sync and failover status proactively
Test failover regularly (don’t wait for production failures)
Keep devices matched (hardware, software, configuration)

Get HA right, and your F5 infrastructure becomes bulletproof. Get it wrong, and you have two expensive single points of failure that can’t talk to each other.

Building F5 HA pairs or troubleshooting sync issues? Let’s connect on LinkedIn.
