  • F5 LTM High Availability: Building Bulletproof Load Balancer Pairs

    Single points of failure are unacceptable in production environments. That’s why nearly every enterprise F5 LTM deployment runs in high availability (HA) pairs—two devices working together to ensure load balancing services remain available even when hardware fails, software crashes, or maintenance is required. Let’s dive into how F5 LTM HA actually works, the different deployment models, and the gotchas you’ll encounter when building resilient load balancer infrastructure.


    What Is F5 LTM High Availability?

    F5 LTM High Availability is a clustering technology that pairs two (or more) BIG-IP devices to eliminate single points of failure. When configured correctly, an HA pair ensures that if one device fails, the other seamlessly takes over—maintaining application availability without user impact.

    Core HA Capabilities

    • Configuration Synchronization: Changes made on one device automatically replicate to its partner
    • Automatic Failover: When the active device fails, the standby becomes active within seconds
    • Connection Mirroring: Active connections can be synchronized so failover is stateful (optional)
    • Health Monitoring: Devices continuously monitor each other’s health via heartbeat mechanisms
    • Shared Floating IPs: Virtual IP addresses (VIPs) move between devices during failover

    Analogy: Think of an HA pair like two pilots in a cockpit. The captain (active device) flies the plane while the first officer (standby device) monitors everything and stays ready. If the captain becomes incapacitated, the first officer immediately takes the controls. Passengers (users) never notice the transition.

    HA Deployment Models

    F5 supports multiple HA configurations, each with different use cases and trade-offs:

    1. Active-Standby (Most Common)

    How it works:

    • One device is Active and processes all traffic
    • The other device is Standby and ready to take over
    • Floating IP addresses (self IPs and VIPs) live on the active device
    • During failover, IPs move to the standby device (which becomes active)

    Traffic Flow:

    Normal Operation:
    [Clients] → [Active LTM] → [Servers]
    [Standby LTM] (idle, monitoring)
    
    After Failover:
    [Clients] → [New Active LTM (was standby)] → [Servers]
    [Failed LTM] (offline)

    Pros:

    • Simple to understand and troubleshoot
    • Standby has full capacity available during failover
    • Clean separation of roles (one device actively processing)
    • Best for most enterprise deployments

    Cons:

    • 50% of hardware capacity sits idle
    • Standby device doesn’t process traffic (wasted investment)

    2. Active-Active

    How it works:

    • Both devices are Active and process traffic simultaneously
    • Different VIPs are configured on each device (or same VIPs with traffic splitting)
    • During failover, the surviving device takes over all VIPs

    Example Setup:

    Device A (Active): Handles VIP 10.1.1.10 (Web App)
    Device B (Active): Handles VIP 10.1.1.20 (API App)
    
    During Normal Operation:
    [Web Clients] → [Device A] → [Web Servers]
    [API Clients] → [Device B] → [API Servers]
    
    If Device A Fails:
    [Web Clients] → [Device B (takes over VIP 10.1.1.10)] → [Web Servers]
    [API Clients] → [Device B (already handling)] → [API Servers]

    Pros:

    • 100% hardware utilization (no idle capacity)
    • Better ROI on hardware investment
    • Load distribution across both devices

    Cons:

    • More complex configuration and troubleshooting
    • During failover, the surviving device carries the combined load of both devices, up to double its normal load (must be sized accordingly)
    • Connection mirroring is more complicated
    • Higher risk of performance degradation during failure

    When to use: When hardware utilization is more important than operational simplicity, and you’ve sized each device to handle 100% of traffic alone.
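
    The sizing rule above is simple arithmetic, and a minimal sketch makes it explicit. The function names and the Gbps figures below are illustrative assumptions, not F5 tooling or measured values; substitute your own peak load and the rated capacity of your platform.

```python
def failover_headroom(load_a: float, load_b: float, capacity: float) -> float:
    """Spare capacity left on the survivor after it absorbs both devices' peak load.

    All three values must use the same unit (for example Gbps or connections/sec).
    A negative result means the survivor would be overloaded.
    """
    return capacity - (load_a + load_b)


def is_safe_active_active(load_a: float, load_b: float, capacity: float) -> bool:
    """True when one device alone can carry the combined peak load."""
    return failover_headroom(load_a, load_b, capacity) >= 0


if __name__ == "__main__":
    # Hypothetical peak throughput in Gbps on devices rated for 10 Gbps each.
    print(is_safe_active_active(4.0, 3.5, 10.0))  # combined 7.5 fits within 10
    print(is_safe_active_active(7.0, 6.0, 10.0))  # combined 13 does not fit
```

    In practice this means each device in an Active-Active pair should run at or below roughly half of its capacity during normal operation.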

    Device Service Clustering (DSC): The Foundation

    F5’s HA functionality is built on Device Service Clustering (DSC)—the framework that enables devices to work together.

    Key DSC Components

    1. Device Trust

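    Before devices can cluster, they must establish trust by exchanging certificates. The initial exchange runs over HTTPS (TCP 443) to the peer; once trust exists, ConfigSync traffic uses TCP 4353. (iQuery is the BIG-IP DNS/GTM protocol, which shares port 4353, so the two are often confused.)

    # On Device A, add Device B to the local trust domain.
    # <device-b-mgmt-ip>, <admin-user> and <admin-password> are placeholders for Device B's values.
    tmsh modify cm trust-domain Root ca-devices add { <device-b-mgmt-ip> } name device-b.example.com username <admin-user> password <admin-password>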

    2. Device Groups

    Device Groups define which devices work together and what gets synchronized:

    • Sync-Failover Group: Devices that sync config AND handle failover together (typical HA pair)
    • Sync-Only Group: Devices that only sync config (no failover coordination)

    # Create sync-failover device group (single line so it also works from the bash prompt)
    tmsh create cm device-group my-ha-pair devices add { device-a.example.com device-b.example.com } type sync-failover auto-sync enabled network-failover enabled

    3. Traffic Groups

    Traffic Groups define which floating IP addresses move together during failover:

    • Floating Self IPs (shared addresses that servers typically use as their default gateway)
    • Virtual Server IPs (VIPs that clients connect to)
    • SNAT IPs (if used)

    In Active-Standby, you typically have one traffic group. In Active-Active, you have multiple traffic groups distributed across devices.
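
    One way to picture traffic groups is as named bundles of floating addresses, each owned by exactly one device at a time. The sketch below is a toy model under that assumption, not F5 code; `traffic-group-2` and the class and function names are hypothetical, while the two VIP addresses are the ones from the Active-Active example earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrafficGroup:
    name: str
    owner: str  # device that is currently active for this group
    floating_ips: List[str] = field(default_factory=list)


def fail_over(groups: List[TrafficGroup], failed_device: str, survivor: str) -> List[str]:
    """Reassign every traffic group owned by failed_device to survivor.

    Returns the names of the groups that moved. Addresses move as a unit:
    the group changes owner, its member addresses are never split.
    """
    moved = []
    for group in groups:
        if group.owner == failed_device:
            group.owner = survivor
            moved.append(group.name)
    return moved


if __name__ == "__main__":
    groups = [
        TrafficGroup("traffic-group-1", "device-a", ["10.1.1.10"]),
        TrafficGroup("traffic-group-2", "device-b", ["10.1.1.20"]),
    ]
    print(fail_over(groups, failed_device="device-a", survivor="device-b"))
    print([(g.name, g.owner) for g in groups])
```

    With a single group this reduces to Active-Standby: whichever device owns the only traffic group is the active one.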

    How Failover Actually Works

    Failover Triggers

    Failover can be triggered by:

    • Hardware failure: Power loss, CPU failure, memory failure
    • Software failure: TMOS crash, kernel panic, critical daemon failure
    • Network failure: Loss of network connectivity (monitored interfaces down)
    • Manual failover: Administrator forces failover for maintenance
    • Gateway pool failure: All gateway pool members down (if configured)

    Failover Sequence

    When failover occurs:

    1. Detection: Standby detects active failure (missed heartbeats, interface down, etc.)
    2. Transition: Standby promotes itself to Active state
    3. IP Migration: Floating IPs (Self IPs, VIPs, SNATs) move to new active device
    4. Gratuitous ARP: New active sends GARP to update network switch MAC tables
    5. Traffic Resumption: New active begins processing traffic
    6. Connection Recovery: Existing connections either break (stateless) or continue (if mirrored)

    Typical failover time: 3-10 seconds for network failover, longer if connection mirroring is enabled.
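
    Most of that time is detection: the standby has to wait long enough to be sure heartbeats have really stopped. The simulation below is a rough sketch of timeout-based detection, not the actual BIG-IP failover daemon; the 3-second timeout, the 0.5-second polling tick, and the function name are assumptions chosen for illustration.

```python
from typing import Iterable, Optional


def detection_time(heartbeats: Iterable[float], timeout: float = 3.0,
                   tick: float = 0.5, horizon: float = 60.0) -> Optional[float]:
    """Return the first poll time at which the peer has been silent for
    longer than `timeout` seconds, or None if that never happens.

    heartbeats: arrival times (seconds) of heartbeats from the active device.
    tick:       how often the standby re-evaluates the peer's health.
    horizon:    when the simulation stops.
    """
    pending = sorted(heartbeats)
    index = 0
    last_seen = None
    t = 0.0
    while t <= horizon:
        # Consume every heartbeat that has arrived by time t.
        while index < len(pending) and pending[index] <= t:
            last_seen = pending[index]
            index += 1
        if last_seen is not None and t - last_seen > timeout:
            return t
        t += tick
    return None


if __name__ == "__main__":
    # The active device sends one heartbeat per second, then dies after t=4.
    print(detection_time([0, 1, 2, 3, 4]))
```

    Lowering the timeout shortens detection but makes the pair more sensitive to brief heartbeat delays, which is the same trade-off that shows up later as flapping.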

    Connection Mirroring: Stateful Failover

    By default, failover is stateless—existing connections break and clients must reconnect. For mission-critical applications, you can enable connection mirroring:

    # Enable mirroring on a virtual server
    tmsh modify ltm virtual my-vip mirror enabled

    How it works:

    • Active device continuously replicates connection state to standby via dedicated mirroring network
    • Standby maintains a synchronized connection table
    • During failover, standby already knows about all active connections
    • Connections continue seamlessly (from client perspective)

    Trade-offs:

    • Pro: Zero connection loss during failover
    • Con: Significant performance overhead (each connection requires mirroring traffic)
    • Con: Requires dedicated high-bandwidth mirroring VLAN
    • Con: Only mirrors certain connection types (not all protocols supported)

    When to use: Long-lived connections (FTP, database, SSH) where reconnection is expensive or disruptive. Not worth it for short HTTP requests.

    Network Connectivity Requirements

    HA pairs require specific network connectivity:

    1. HA VLAN (ConfigSync/Failover)

    Purpose: Configuration synchronization and heartbeat monitoring

    • Dedicated VLAN connecting both devices
    • Carries ConfigSync traffic (TCP 4353)
    • Carries heartbeat traffic for failover detection (UDP 1026 by default)
    • Typically uses non-floating Self IPs

    Best practice: Use a dedicated physical interface (not shared with data traffic) on a private VLAN.

    2. Network Failover VLAN

    Purpose: Redundant heartbeat path

    • Secondary heartbeat mechanism (separate from HA VLAN)
    • Prevents false failovers from single link failures
    • Can share data VLANs or use dedicated link

    Recommendation: Always configure network failover on at least one additional VLAN beyond the HA VLAN.

    3. Mirroring VLAN (Optional)

    Purpose: Connection state synchronization

    • High-bandwidth dedicated link for connection mirroring
    • Should be separate from HA VLAN (mirroring is bandwidth-intensive)
    • 10G+ recommended for high-throughput environments

    [Device A]                    [Device B]
        |                              |
        |--- HA VLAN (1.1) ------------|  (Config Sync, Heartbeat)
        |                              |
        |--- Mirror VLAN (1.2) --------|  (Connection Mirroring)
        |                              |
        |--- Client VLAN (10.1) -------|  (Data + Network Failover)
        |                              |
        |--- Server VLAN (10.2) -------|  (Data + Network Failover)

    Configuration Walkthrough: Building an Active-Standby Pair

    Here’s the step-by-step process for configuring a basic Active-Standby HA pair:

    Step 1: Configure Management and HA Interfaces

    On both devices, configure:

    # Device A
    tmsh create net vlan ha-vlan interfaces add { 1.1 }
    tmsh create net self 192.168.1.10 address 192.168.1.10/24 vlan ha-vlan allow-service default
    
    # Device B
    tmsh create net vlan ha-vlan interfaces add { 1.1 }
    tmsh create net self 192.168.1.11 address 192.168.1.11/24 vlan ha-vlan allow-service default

    Step 2: Establish Device Trust

    On Device A:

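    # Set Device A's ConfigSync and failover unicast addresses (repeat on Device B with 192.168.1.11)
    tmsh modify cm device device-a.example.com configsync-ip 192.168.1.10
    tmsh modify cm device device-a.example.com unicast-address { { ip 192.168.1.10 } }
    
    # Add Device B to the trust domain. The credentials are passed on the command line;
    # <device-b-mgmt-ip>, <admin-user> and <admin-password> are placeholders for Device B's values.
    tmsh modify cm trust-domain Root ca-devices add { <device-b-mgmt-ip> } name device-b.example.com username <admin-user> password <admin-password>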

    Step 3: Create Device Group

    # On Device A (will sync to Device B)
    tmsh create cm device-group my-ha-pair {
        type sync-failover
        devices { device-a.example.com device-b.example.com }
        auto-sync enabled
        network-failover enabled
    }Code language: PHP (php)

    Step 4: Configure Floating IPs

    # Create client-facing VLAN on both devices (already done in initial setup)
    # Then create FLOATING Self IP (will move during failover)
    tmsh create net self 10.1.1.10 address 10.1.1.10/24 vlan client-vlan traffic-group traffic-group-1 allow-service none

    Step 5: Configure Network Failover

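    # Add a second failover heartbeat address on the client VLAN.
    # Use Device A's NON-floating self IP on client-vlan (placeholder below). The floating
    # address 10.1.1.10 exists only on the active device, so it cannot be a heartbeat endpoint.
    tmsh modify cm device device-a.example.com unicast-address add { { ip <device-a-client-self-ip> } }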

    Step 6: Perform Initial Sync

    # Force sync from Device A to Device B
    tmsh run cm config-sync to-group my-ha-pair

    Step 7: Verify HA Status

    # Check sync status
    tmsh show cm sync-status
    
    # Check failover status
    tmsh show cm failover-status
    
    # Verify device group
    tmsh show cm device-group my-ha-pair

    You should see Device A as Active and Device B as Standby, with sync status showing In Sync.

    Common HA Problems and Solutions

    Problem 1: Config Sync Fails

    Symptom: “Changes Pending” or “Awaiting Initial Sync” that never resolves.

    Causes:

    • ConfigSync connectivity broken (TCP 4353 blocked between the ConfigSync self IPs)
    • Certificate trust issues
    • Version mismatch between devices
    • Device group misconfiguration

    Solutions:

    # Verify ConfigSync connectivity (TCP 4353) to the peer
    telnet <peer-ip> 4353
    
    # Check sync status details
    tmsh show cm sync-status detail
    
    # Force sync from known-good device
    tmsh run cm config-sync to-group my-ha-pair
    
    # Nuclear option: remove and re-add device to trust
    tmsh delete cm device <device-name>
    # Re-establish trust and device group

    Problem 2: Split-Brain (Both Devices Active)

    Symptom: Both devices think they’re active, both serving traffic.

    Cause: Heartbeat communication failed on ALL monitored paths, so each device assumes the other is dead.

    Prevention:

    • Configure network failover on multiple VLANs
    • Use dedicated HA VLAN separate from data VLANs
    • Monitor HA link health proactively
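
    The reason multiple heartbeat paths help can be shown with a small sketch: the peer should be declared dead only when every path has gone silent. This is an illustrative model, not BIG-IP's implementation; the function name and the 3-second timeout are assumptions, and the path names reuse the VLAN names from this article.

```python
from typing import Dict


def peer_seems_dead(last_seen_by_path: Dict[str, float], now: float,
                    timeout: float = 3.0) -> bool:
    """True only when EVERY heartbeat path has been silent longer than timeout.

    last_seen_by_path maps a path name to the time (seconds) of the last
    heartbeat received on that path.
    """
    if not last_seen_by_path:
        raise ValueError("at least one heartbeat path is required")
    return all(now - seen > timeout for seen in last_seen_by_path.values())


if __name__ == "__main__":
    # Scenario: the HA link is cut at t=10 while the active device stays healthy.
    single_path = {"ha-vlan": 10.0}
    dual_path = {"ha-vlan": 10.0, "client-vlan": 14.5}

    # One path: the standby wrongly concludes the peer is dead (split-brain).
    print(peer_seems_dead(single_path, now=15.0))
    # Two paths: heartbeats still arrive on client-vlan, so no takeover.
    print(peer_seems_dead(dual_path, now=15.0))
```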

    Recovery:

    # Force one device to standby
    tmsh run sys failover standby
    
    # Investigate why heartbeat failed
    # Fix network connectivity
    # Verify heartbeat restored before trusting HA again

    Problem 3: Failover Takes Too Long

    Symptom: Failover takes 30+ seconds, causing extended outages.

    Causes:

    • Connection mirroring enabled on high-connection-count VIPs
    • Network convergence delays (STP, routing protocols)
    • Gateway pool checks delaying transition

    Solutions:

    • Disable connection mirroring unless absolutely necessary
    • Use portfast/RSTP on HA switch ports
    • Tune gateway pool monitor intervals
    • Consider static routes instead of dynamic routing on HA links

    Problem 4: Flapping (Repeated Failovers)

    Symptom: Devices keep failing over back and forth.

    Causes:

    • Intermittent network connectivity
    • Resource exhaustion (CPU, memory) causing heartbeat delays
    • Gateway pool flapping
    • Hardware issues (failing NIC, power supply)

    Solutions:

    • Check `/var/log/ltm` for failover reason codes
    • Monitor resource utilization (CPU, memory, network)
    • Verify physical connectivity and cable health
    • Tune gateway pool monitors to be less sensitive
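
    Flapping is easier to act on when it has a concrete definition. A minimal sketch, assuming you have already extracted failover event timestamps (for example from log entries) into a list of seconds: the 10-minute window and the threshold of 3 events are arbitrary assumptions to tune for your environment, and the function name is hypothetical.

```python
from typing import Iterable


def is_flapping(event_times: Iterable[float], window: float = 600.0,
                threshold: int = 3) -> bool:
    """True if any `window`-second span contains at least `threshold` failovers."""
    times = sorted(event_times)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window` seconds.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


if __name__ == "__main__":
    print(is_flapping([0, 120, 300]))      # three failovers within five minutes
    print(is_flapping([0, 3600, 7200]))    # one failover per hour
```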

    Monitoring HA Health

    Proactive monitoring prevents HA failures from becoming outages:

    Critical Metrics to Monitor

    • Sync status: Should always be “In Sync”
    • Failover status: Active/Standby as expected (not both active)
    • Heartbeat health: All monitored paths sending heartbeats
    • Traffic group location: Floating IPs on expected device
    • Failover event count: Alert on unexpected failovers
    • Certificate expiration: Device trust certs

    Monitoring via iControl REST

    # Check sync status
    GET https://ltm-ip/mgmt/tm/cm/sync-status
    
    # Check failover status
    GET https://ltm-ip/mgmt/tm/cm/failover-status
    
    # Check device status
    GET https://ltm-ip/mgmt/tm/cm/device
    
    # Check traffic group status
    GET https://ltm-ip/mgmt/tm/cm/traffic-group

    Integrate these API calls into Prometheus, Zabbix, or your monitoring platform to alert on HA issues before they cause outages.
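
    As a starting point for such an integration, here is a hedged sketch of a sync-status check using only the Python standard library. It assumes HTTP basic authentication is enabled for the account and that the response nests values as `{"status": {"description": ...}}` somewhere in the payload; the walker searches recursively so it does not depend on the exact nesting. The function names are hypothetical, and the sample payload in the usage section is hand-written for illustration, not captured from a device.

```python
import base64
import json
import ssl
import urllib.request


def fetch_json(host, path, user, password, verify_tls=True):
    """GET an iControl REST path with HTTP basic auth and return parsed JSON."""
    url = f"https://{host}{path}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    context = ssl.create_default_context()
    if not verify_tls:  # lab use only, e.g. self-signed device certificates
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def find_descriptions(node, key):
    """Recursively collect every value shaped like {key: {"description": value}}."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, dict) and "description" in value:
                found.append(value["description"])
            else:
                found.extend(find_descriptions(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_descriptions(item, key))
    return found


def sync_is_healthy(payload):
    """True when at least one status was found and all of them read "In Sync"."""
    statuses = find_descriptions(payload, "status")
    return bool(statuses) and all(status == "In Sync" for status in statuses)


if __name__ == "__main__":
    # Hand-written sample; on a real device you would call:
    #   payload = fetch_json("ltm-ip", "/mgmt/tm/cm/sync-status", user, password)
    payload = {"entries": {"sync-status/0": {"nestedStats": {"entries": {
        "status": {"description": "In Sync"}}}}}}
    print(sync_is_healthy(payload))
```

    Treating an empty or unparseable payload as unhealthy is a deliberate choice here: a monitor that cannot read the status should alert rather than stay silent.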

    Best Practices

    1. Use identical hardware: HA pairs should have matching models, memory, CPU
    2. Keep versions in sync: Run the same TMOS version on both devices
    3. Dedicated HA VLAN: Don’t share HA traffic with production data
    4. Multiple heartbeat paths: Network failover on at least 2 VLANs
    5. Auto-sync enabled: Reduces manual sync operations and human error
    6. Test failover regularly: Don’t wait for real failure to discover problems
    7. Document traffic group mappings: Know which VIPs are in which traffic groups
    8. Monitor sync status: Alert on “Changes Pending” that persist > 5 minutes
    9. Avoid connection mirroring unless necessary: Performance overhead is significant
    10. Plan capacity for Active-Active: Each device must handle 100% load alone
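
    The rule in item 8 (alert only when a pending state persists) avoids paging on the brief "Changes Pending" window that every normal config change produces. A minimal sketch of that debounce logic, assuming your poller records `(timestamp_seconds, status)` samples in time order; the function name is hypothetical and the 300-second limit mirrors the 5 minutes suggested above.

```python
from typing import Iterable, Tuple


def should_alert(samples: Iterable[Tuple[float, str]], max_pending: float = 300.0) -> bool:
    """True if the status stayed out of "In Sync" for longer than max_pending seconds.

    samples: (timestamp_seconds, status) pairs in chronological order.
    """
    pending_since = None
    for timestamp, status in samples:
        if status == "In Sync":
            pending_since = None  # recovered: reset the timer
            continue
        if pending_since is None:
            pending_since = timestamp  # first sample of this pending streak
        if timestamp - pending_since > max_pending:
            return True
    return False


if __name__ == "__main__":
    brief = [(0, "In Sync"), (60, "Changes Pending"), (120, "In Sync")]
    stuck = [(0, "Changes Pending"), (200, "Changes Pending"), (400, "Changes Pending")]
    print(should_alert(brief))
    print(should_alert(stuck))
```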

    Conclusion

    F5 LTM High Availability transforms load balancers from single points of failure into resilient infrastructure. When configured correctly, HA pairs provide seamless failover, automated configuration synchronization, and the peace of mind that comes from knowing your application delivery tier can survive hardware failures, software crashes, and planned maintenance.

    The key to successful HA deployments:

    • Understand the different deployment models (Active-Standby vs Active-Active)
    • Configure redundant heartbeat paths
    • Monitor sync and failover status proactively
    • Test failover regularly (don’t wait for production failures)
    • Keep devices matched (hardware, software, configuration)

    Get HA right, and your F5 infrastructure becomes bulletproof. Get it wrong, and you have two expensive single points of failure that can’t talk to each other.

