Author: Mike

  • F5 LTM High Availability: Building Bulletproof Load Balancer Pairs

    Single points of failure are unacceptable in production environments. That’s why nearly every enterprise F5 LTM deployment runs in high availability (HA) pairs—two devices working together to ensure load balancing services remain available even when hardware fails, software crashes, or maintenance is required. Let’s dive into how F5 LTM HA actually works, the different deployment models, and the gotchas you’ll encounter when building resilient load balancer infrastructure.


    What Is F5 LTM High Availability?

    F5 LTM High Availability is a clustering technology that pairs two (or more) BIG-IP devices to eliminate single points of failure. When configured correctly, an HA pair ensures that if one device fails, the other seamlessly takes over—maintaining application availability without user impact.

    Core HA Capabilities

    • Configuration Synchronization: Changes made on one device automatically replicate to its partner
    • Automatic Failover: When the active device fails, the standby becomes active within seconds
    • Connection Mirroring: Active connections can be synchronized so failover is stateful (optional)
    • Health Monitoring: Devices continuously monitor each other’s health via heartbeat mechanisms
    • Shared Floating IPs: Virtual IP addresses (VIPs) move between devices during failover

    Analogy: Think of an HA pair like two pilots in a cockpit. The captain (active device) flies the plane while the first officer (standby device) monitors everything and stays ready. If the captain becomes incapacitated, the first officer immediately takes the controls. Passengers (users) never notice the transition.

    HA Deployment Models

    F5 supports multiple HA configurations, each with different use cases and trade-offs:

    1. Active-Standby (Most Common)

    How it works:

    • One device is Active and processes all traffic
    • The other device is Standby and ready to take over
    • Floating IP addresses (self IPs and VIPs) live on the active device
    • During failover, IPs move to the standby device (which becomes active)

    Traffic Flow:

    Normal Operation:
    [Clients] → [Active LTM] → [Servers]
    [Standby LTM] (idle, monitoring)

    After Failover:
    [Clients] → [New Active LTM (was standby)] → [Servers]
    [Failed LTM] (offline)

    Pros:

    • Simple to understand and troubleshoot
    • Standby has full capacity available during failover
    • Clean separation of roles (one device actively processing)
    • Best for most enterprise deployments

    Cons:

    • 50% of hardware capacity sits idle
    • Standby device doesn’t process traffic (wasted investment)

    2. Active-Active

    How it works:

    • Both devices are Active and process traffic simultaneously
    • Different VIPs are configured on each device (or same VIPs with traffic splitting)
    • During failover, the surviving device takes over all VIPs

    Example Setup:

    Device A (Active): Handles VIP 10.1.1.10 (Web App)
    Device B (Active): Handles VIP 10.1.1.20 (API App)
    
    During Normal Operation:
    [Web Clients] → [Device A] → [Web Servers]
    [API Clients] → [Device B] → [API Servers]

    If Device A Fails:
    [Web Clients] → [Device B (takes over VIP 10.1.1.10)] → [Web Servers]
    [API Clients] → [Device B (already handling)] → [API Servers]

    Pros:

    • 100% hardware utilization (no idle capacity)
    • Better ROI on hardware investment
    • Load distribution across both devices

    Cons:

    • More complex configuration and troubleshooting
    • During failover, the surviving device must carry both devices’ traffic, so each device has to be sized for the full combined load
    • Connection mirroring is more complicated
    • Higher risk of performance degradation during failure

    When to use: When hardware utilization is more important than operational simplicity, and you’ve sized each device to handle 100% of traffic alone.

    Device Service Clustering (DSC): The Foundation

    F5’s HA functionality is built on Device Service Clustering (DSC)—the framework that enables devices to work together.

    Key DSC Components

    1. Device Trust

    Before devices can cluster, they must establish trust via certificate exchange (using the iQuery protocol on TCP 4353):

    # On Device A, add Device B to the trust domain, then push the initial sync
    tmsh modify cm device-group device_trust_group devices add { device-b.example.com }
    tmsh run cm config-sync to-group device_trust_group

    2. Device Groups

    Device Groups define which devices work together and what gets synchronized:

    • Sync-Failover Group: Devices that sync config AND handle failover together (typical HA pair)
    • Sync-Only Group: Devices that only sync config (no failover coordination)
    # Create sync-failover device group
    tmsh create cm device-group my-ha-pair {
        type sync-failover
        devices { device-a.example.com device-b.example.com }
        auto-sync enabled
        network-failover enabled
    }

    3. Traffic Groups

    Traffic Groups define which floating IP addresses move together during failover:

    • Floating Self IPs (device management/communication IPs)
    • Virtual Server IPs (VIPs that clients connect to)
    • SNAT IPs (if used)

    In Active-Standby, you typically have one traffic group. In Active-Active, you have multiple traffic groups distributed across devices.

    How Failover Actually Works

    Failover Triggers

    Failover can be triggered by:

    • Hardware failure: Power loss, CPU failure, memory failure
    • Software failure: TMOS crash, kernel panic, critical daemon failure
    • Network failure: Loss of network connectivity (monitored interfaces down)
    • Manual failover: Administrator forces failover for maintenance
    • Gateway pool failure: All gateway pool members down (if configured)

    Failover Sequence

    When failover occurs:

    1. Detection: Standby detects active failure (missed heartbeats, interface down, etc.)
    2. Transition: Standby promotes itself to Active state
    3. IP Migration: Floating IPs (Self IPs, VIPs, SNATs) move to new active device
    4. Gratuitous ARP: New active sends GARP to update network switch MAC tables
    5. Traffic Resumption: New active begins processing traffic
    6. Connection Recovery: Existing connections either break (stateless) or continue (if mirrored)

    Typical failover time: 3-10 seconds for network failover, longer if connection mirroring is enabled.
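    The detection step above is essentially a missed-heartbeat timer. Here is a minimal sketch of that logic in Python; the 1-second interval and 3-miss threshold are illustrative assumptions, not BIG-IP's actual internal timers:

```python
import time
from typing import Optional

class HeartbeatMonitor:
    """Toy model of heartbeat-based failure detection (not F5 internals)."""

    def __init__(self, interval: float = 1.0, max_missed: int = 3):
        self.interval = interval        # expected seconds between heartbeats
        self.max_missed = max_missed    # consecutive misses before declaring failure
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self) -> None:
        """Call each time a heartbeat arrives from the peer."""
        self.last_heartbeat = time.monotonic()

    def peer_failed(self, now: Optional[float] = None) -> bool:
        """True once max_missed intervals elapse with no heartbeat."""
        if now is None:
            now = time.monotonic()
        return (now - self.last_heartbeat) > self.interval * self.max_missed
```

    A standby would effectively run one such monitor per heartbeat path and promote itself only when every monitored path reports failure, which is also why losing all heartbeat paths at once produces the split-brain scenario discussed below.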

    Connection Mirroring: Stateful Failover

    By default, failover is stateless—existing connections break and clients must reconnect. For mission-critical applications, you can enable connection mirroring:

    # Enable mirroring on a virtual server
    tmsh modify ltm virtual my-vip mirror enabled

    How it works:

    • Active device continuously replicates connection state to the standby via a dedicated mirroring network
    • Standby maintains a synchronized connection table
    • During failover, standby already knows about all active connections
    • Connections continue seamlessly (from client perspective)

    Trade-offs:

    • Pro: Zero connection loss during failover
    • Con: Significant performance overhead (each connection requires mirroring traffic)
    • Con: Requires dedicated high-bandwidth mirroring VLAN
    • Con: Only mirrors certain connection types (not all protocols supported)

    When to use: Long-lived connections (FTP, database, SSH) where reconnection is expensive or disruptive. Not worth it for short HTTP requests.

    Network Connectivity Requirements

    HA pairs require specific network connectivity:

    1. HA VLAN (ConfigSync/Failover)

    Purpose: Configuration synchronization and heartbeat monitoring

    • Dedicated VLAN connecting both devices
    • Carries iQuery traffic (TCP 4353) for config sync
    • Carries heartbeat traffic for failover detection
    • Typically uses non-floating Self IPs

    Best practice: Use a dedicated physical interface (not shared with data traffic) on a private VLAN.

    2. Network Failover VLAN

    Purpose: Redundant heartbeat path

    • Secondary heartbeat mechanism (separate from HA VLAN)
    • Prevents false failovers from single link failures
    • Can share data VLANs or use dedicated link

    Recommendation: Always configure network failover on at least one additional VLAN beyond the HA VLAN.

    3. Mirroring VLAN (Optional)

    Purpose: Connection state synchronization

    • High-bandwidth dedicated link for connection mirroring
    • Should be separate from HA VLAN (mirroring is bandwidth-intensive)
    • 10G+ recommended for high-throughput environments

    [Device A]                    [Device B]
        |                              |
        |--- HA VLAN (1.1) ------------|  (Config Sync, Heartbeat)
        |                              |
        |--- Mirror VLAN (1.2) --------|  (Connection Mirroring)
        |                              |
        |--- Client VLAN (10.1) -------|  (Data + Network Failover)
        |                              |
        |--- Server VLAN (10.2) -------|  (Data + Network Failover)

    Configuration Walkthrough: Building an Active-Standby Pair

    Here’s the step-by-step process for configuring a basic Active-Standby HA pair:

    Step 1: Configure Management and HA Interfaces

    On both devices, configure:

    # Device A
    tmsh create net vlan ha-vlan interfaces add { 1.1 }
    tmsh create net self 192.168.1.10 address 192.168.1.10/24 vlan ha-vlan allow-service default
    
    # Device B
    tmsh create net vlan ha-vlan interfaces add { 1.1 }
    tmsh create net self 192.168.1.11 address 192.168.1.11/24 vlan ha-vlan allow-service default

    Step 2: Establish Device Trust

    On Device A:

    # Configure Device A's config sync and failover (unicast) addresses
    tmsh modify cm device device-a.example.com configsync-ip 192.168.1.10
    tmsh modify cm device device-a.example.com unicast-address { { ip 192.168.1.10 } }
    
    # Add Device B to trust domain (enter Device B's credentials when prompted)
    tmsh run cm config-sync to-group datasync-global-dg

    Step 3: Create Device Group

    # On Device A (will sync to Device B)
    tmsh create cm device-group my-ha-pair {
        type sync-failover
        devices { device-a.example.com device-b.example.com }
        auto-sync enabled
        network-failover enabled
    }

    Step 4: Configure Floating IPs

    # Create client-facing VLAN on both devices (already done in initial setup)
    # Then create FLOATING Self IP (will move during failover)
    tmsh create net self 10.1.1.10 address 10.1.1.10/24 vlan client-vlan traffic-group traffic-group-1 allow-service none

    Step 5: Configure Network Failover

    # Enable network failover on client VLAN
    tmsh modify cm device device-a.example.com unicast-address add { { ip 10.1.1.10 } }

    Step 6: Perform Initial Sync

    # Force sync from Device A to Device B
    tmsh run cm config-sync to-group my-ha-pair

    Step 7: Verify HA Status

    # Check sync status
    tmsh show cm sync-status
    
    # Check failover status
    tmsh show cm failover-status
    
    # Verify device group
    tmsh show cm device-group my-ha-pair

    You should see Device A as Active and Device B as Standby, with sync status showing In Sync.

    Common HA Problems and Solutions

    Problem 1: Config Sync Fails

    Symptom: “Changes Pending” or “Awaiting Initial Sync” that never resolves.

    Causes:

    • iQuery connectivity broken (TCP 4353 blocked)
    • Certificate trust issues
    • Version mismatch between devices
    • Device group misconfiguration

    Solutions:

    # Verify iQuery connectivity
    telnet <peer-ip> 4353
    
    # Check sync status details
    tmsh show cm sync-status detail
    
    # Force sync from known-good device
    tmsh run cm config-sync to-group my-ha-pair
    
    # Nuclear option: remove and re-add device to trust
    tmsh delete cm device <device-name>
    # Re-establish trust and device group

    Problem 2: Split-Brain (Both Devices Active)

    Symptom: Both devices think they’re active, both serving traffic.

    Cause: Heartbeat communication failed on ALL monitored paths, so each device assumes the other is dead.

    Prevention:

    • Configure network failover on multiple VLANs
    • Use dedicated HA VLAN separate from data VLANs
    • Monitor HA link health proactively

    Recovery:

    # Force one device to standby
    tmsh run sys failover standby
    
    # Investigate why heartbeat failed
    # Fix network connectivity
    # Verify heartbeat restored before trusting HA again

    Problem 3: Failover Takes Too Long

    Symptom: Failover takes 30+ seconds, causing extended outages.

    Causes:

    • Connection mirroring enabled on high-connection-count VIPs
    • Network convergence delays (STP, routing protocols)
    • Gateway pool checks delaying transition

    Solutions:

    • Disable connection mirroring unless absolutely necessary
    • Use portfast/RSTP on HA switch ports
    • Tune gateway pool monitor intervals
    • Consider static routes instead of dynamic routing on HA links

    Problem 4: Flapping (Repeated Failovers)

    Symptom: Devices keep failing over back and forth.

    Causes:

    • Intermittent network connectivity
    • Resource exhaustion (CPU, memory) causing heartbeat delays
    • Gateway pool flapping
    • Hardware issues (failing NIC, power supply)

    Solutions:

    • Check `/var/log/ltm` for failover reason codes
    • Monitor resource utilization (CPU, memory, network)
    • Verify physical connectivity and cable health
    • Tune gateway pool monitors to be less sensitive

    Monitoring HA Health

    Proactive monitoring prevents HA failures from becoming outages:

    Critical Metrics to Monitor

    • Sync status: Should always be “In Sync”
    • Failover status: Active/Standby as expected (not both active)
    • Heartbeat health: All monitored paths sending heartbeats
    • Traffic group location: Floating IPs on expected device
    • Failover event count: Alert on unexpected failovers
    • Certificate expiration: Device trust certs

    Monitoring via iControl REST

    # Check sync status
    GET https://ltm-ip/mgmt/tm/cm/sync-status
    
    # Check failover status
    GET https://ltm-ip/mgmt/tm/cm/failover-status
    
    # Check device status
    GET https://ltm-ip/mgmt/tm/cm/device
    
    # Check traffic group status
    GET https://ltm-ip/mgmt/tm/cm/traffic-group

    Integrate these API calls into Prometheus, Zabbix, or your monitoring platform to alert on HA issues before they cause outages.
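    As a concrete sketch, a polling check might parse the sync-status response like this. The nested-stats layout shown is a simplified assumption about the JSON that GET /mgmt/tm/cm/sync-status returns; verify the field paths against your TMOS version before relying on them:

```python
def extract_sync_status(payload: dict) -> str:
    """Pull the sync-status description out of an iControl REST response.

    Assumes the nestedStats layout commonly returned by
    GET /mgmt/tm/cm/sync-status (a single top-level entry keyed by a
    selfLink-style URL); treat this shape as an assumption to verify.
    """
    entries = payload["entries"]
    (stats,) = entries.values()  # exactly one top-level entry expected
    fields = stats["nestedStats"]["entries"]
    return fields["status"]["description"]

def is_in_sync(payload: dict) -> bool:
    """Alert condition: anything other than 'In Sync' deserves attention."""
    return extract_sync_status(payload) == "In Sync"
```

    Fetching the payload itself is an ordinary authenticated HTTPS GET; feed the boolean into whatever alerting pipeline you already run.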

    Best Practices

    1. Use identical hardware: HA pairs should have matching models, memory, CPU
    2. Keep versions in sync: Run the same TMOS version on both devices
    3. Dedicated HA VLAN: Don’t share HA traffic with production data
    4. Multiple heartbeat paths: Network failover on at least 2 VLANs
    5. Auto-sync enabled: Reduces manual sync operations and human error
    6. Test failover regularly: Don’t wait for real failure to discover problems
    7. Document traffic group mappings: Know which VIPs are in which traffic groups
    8. Monitor sync status: Alert on “Changes Pending” that persist > 5 minutes
    9. Avoid connection mirroring unless necessary: Performance overhead is significant
    10. Plan capacity for Active-Active: Each device must handle 100% load alone

    Conclusion

    F5 LTM High Availability transforms load balancers from single points of failure into resilient infrastructure. When configured correctly, HA pairs provide seamless failover, automated configuration synchronization, and the peace of mind that comes from knowing your application delivery tier can survive hardware failures, software crashes, and planned maintenance.

    The key to successful HA deployments:

    • Understand the different deployment models (Active-Standby vs Active-Active)
    • Configure redundant heartbeat paths
    • Monitor sync and failover status proactively
    • Test failover regularly (don’t wait for production failures)
    • Keep devices matched (hardware, software, configuration)

    Get HA right, and your F5 infrastructure becomes bulletproof. Get it wrong, and you have two expensive single points of failure that can’t talk to each other.


    Building F5 HA pairs or troubleshooting sync issues? Let’s connect on LinkedIn.

  • F5 iQuery: The Silent Protocol That Makes GTM Actually Work

    If you’ve ever configured F5 GTM, set up an LTM HA pair, or joined BIG-IP devices into a Device Service Cluster, you’ve used iQuery—even if you didn’t realize it. iQuery is F5’s proprietary communication protocol that enables BIG-IP devices to discover each other, exchange configuration data, share health status, and synchronize state. It’s the invisible backbone of nearly every multi-device F5 deployment, yet it’s often overlooked until something breaks. Let’s explore what iQuery actually is, where it’s used, and why it matters.


    What Is iQuery?

    iQuery is F5’s proprietary protocol for BIG-IP device-to-device communication. It’s the universal language that allows BIG-IP systems to discover each other, establish trust, exchange data, and coordinate operations—regardless of whether they’re LTMs, GTMs, or any other BIG-IP module.

    Technical Details

    • Protocol: Encrypted TCP-based communication
    • Default Port: TCP 4353
    • Encryption: SSL/TLS with certificate-based mutual authentication
    • Scope: Device trust, config sync, health monitoring, state sharing
    • Firewall Requirements: Must allow TCP 4353 between all BIG-IP devices that need to communicate

    Think of iQuery as the nervous system connecting all your BIG-IP devices. It’s how they talk to each other, trust each other, and coordinate their actions.

    Where iQuery Is Used

    iQuery powers multiple critical F5 features across different deployment scenarios:

    1. LTM High Availability (Device Service Clustering)

    Use Case: Active-Standby or Active-Active LTM pairs

    When you set up an LTM HA pair, iQuery handles:

    • Device trust establishment: Initial pairing and certificate exchange
    • Configuration synchronization: Keeping both devices’ configs identical
    • Failover coordination: Detecting failures and triggering failover
    • Connection mirroring setup: Synchronizing connection tables for stateful failover

    Example Scenario:

    1. You create a virtual server on the active LTM
    2. iQuery synchronizes that configuration to the standby LTM
    3. Both devices now have identical configs
    4. If active fails, standby takes over seamlessly

    Without iQuery: Your HA pair can’t sync configs, coordinate failover, or mirror connections. You’d have to manually configure both devices and hope they stay in sync.

    2. GTM to LTM Communication

    Use Case: Global load balancing with GTM managing remote LTM pools

    This is where iQuery becomes highly visible and absolutely critical:

    The Scenario: GTM in New York making global load balancing decisions for LTM pools in:

    • New York data center (local LTM)
    • London data center (remote LTM)
    • Singapore data center (remote LTM)

    How iQuery enables this:

    1. GTM establishes iQuery connections to all three LTMs
    2. Each LTM reports pool member health status via iQuery
    3. LTMs share performance metrics (connections, throughput, response times)
    4. GTM uses this real-time data to make intelligent routing decisions

    Without iQuery: GTM has no idea if London’s web servers are down or Singapore is experiencing high latency. It would blindly send traffic to dead pools.

    3. GTM to GTM Synchronization

    Use Case: Redundant GTM pairs (active-active or active-standby)

    iQuery synchronizes between GTM devices:

    • Configuration changes: Wide IPs, pools, data centers
    • Wide IP states: Enabled/disabled status
    • Topology records: Geographic routing rules
    • Listener decisions: DNS query handling

    4. Device Trust and Discovery

    Use Case: Any multi-device BIG-IP deployment

    Before BIG-IP devices can work together, they must establish trust via iQuery:

    1. Administrator initiates device discovery
    2. Devices exchange SSL certificates via iQuery
    3. Mutual authentication validates both devices
    4. Trust relationship established
    5. Devices can now sync configs, share data, coordinate operations

    This certificate-based trust is the foundation for all other iQuery functionality.

    How iQuery Works: A Deep Dive

    Step 1: Certificate Exchange and Trust

    Every BIG-IP device has a unique SSL certificate. When you add a device to a trust domain or Device Service Cluster:

    1. Discovery: You specify the remote device’s IP address
    2. Connection: Device A connects to Device B on TCP 4353
    3. Certificate Exchange: Both devices share their SSL certificates
    4. Validation: Each device validates the other’s certificate
    5. Trust Established: Encrypted iQuery channel is now active

    This mutual authentication ensures only authorized BIG-IP devices can participate in the cluster.

    Step 2: Ongoing Communication

    Once trust is established, iQuery carries different types of data depending on the use case:

    For LTM HA:

    • Configuration changes (immediate sync)
    • Heartbeat signals (continuous)
    • Failover state (event-driven)
    • Connection mirror data (if enabled)

    For GTM → LTM:

    • Virtual server status (polling, typically every few seconds)
    • Pool member health (continuous monitoring)
    • Performance metrics (periodic updates)
    • System resources (CPU, memory, connections)

    Step 3: Encrypted Transport

    All iQuery traffic is encrypted with SSL/TLS, so:

    • Configuration data can’t be intercepted
    • Health status remains confidential
    • Performance metrics are protected
    • Only trusted devices can decrypt the data

    Configuration Examples

    Example 1: Setting Up LTM HA (Device Trust)

    On Device A (192.168.1.10):

    # Add Device B to the trust domain
    tmsh modify cm device-group device_trust_group devices add { device-b.example.com }
    tmsh run cm config-sync to-group device_trust_group

    Behind the scenes:

    1. Device A initiates iQuery connection to Device B (192.168.1.11:4353)
    2. Certificates exchanged and validated
    3. Device trust established
    4. Configuration sync begins via iQuery

    Example 2: Adding LTM Servers to GTM

    On GTM:

    # Create datacenter
    tmsh create gtm datacenter NYC address 10.1.1.1
    
    # Add LTM server
    tmsh create gtm server nyc-ltm1 {
        datacenter NYC
        addresses { 10.1.1.100 }
        product bigip
    }
    
    # GTM automatically discovers virtual servers via iQuery

    Behind the scenes:

    1. GTM connects to LTM at 10.1.1.100:4353
    2. Certificate exchange and validation
    3. GTM queries LTM for available virtual servers
    4. LTM begins reporting health/performance data via iQuery

    How Important Is iQuery?

    For Any Multi-Device F5 Deployment: Critical

    iQuery is not optional for multi-device F5 deployments. Here’s what breaks without it:

    LTM HA Failures:

    • Configuration sync stops working
    • HA pair can’t coordinate failover
    • Connection mirroring fails
    • Config drift between devices
    • Manual intervention required for every change

    GTM Failures:

    • GTM cannot determine pool member health
    • Load balancing decisions become stale and inaccurate
    • Traffic sent to failed data centers
    • Performance-based algorithms stop working
    • “Global” load balancing degrades to DNS round-robin

    Real-World Impact

    I’ve seen iQuery failures cause:

    • Split-brain HA pairs: Both devices think they’re active because they can’t communicate
    • Configuration drift: Changes on active LTM never sync to standby, then failover reveals completely different configs
    • GTM sending traffic to offline data centers: No iQuery = no health visibility
    • Unbalanced load distribution: One DC overwhelmed while others idle

    Common iQuery Problems and Solutions

    Problem 1: Firewall Blocking Port 4353

    Symptom: Devices show as “Unknown” or config sync fails with connection errors.

    Cause: Firewall between devices is blocking TCP 4353.

    Solution:

    # Test connectivity
    telnet <remote-device-ip> 4353
    
    # Check iQuery status
    tmsh show cm device
    
    # For GTM specifically
    tmsh show gtm server <server-name>
    
    # Verify device is listening
    netstat -an | grep 4353

    Work with your network team to allow bidirectional TCP 4353 between all BIG-IP devices that need to communicate.
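    The same reachability test is easy to script for monitoring. This sketch just wraps a TCP connect; a successful connect only proves port 4353 is reachable, not that certificate trust will succeed:

```python
import socket

def port_open(host: str, port: int = 4353, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Scriptable equivalent of the telnet test above. Firewalls that
    silently drop packets show up here as a timeout (False).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

    Run it against every peer a device needs iQuery connectivity to, in both directions, since asymmetric firewall rules are a common culprit.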

    Problem 2: Certificate Mismatch or Expiration

    Symptom: iQuery connection fails with SSL/certificate errors in `/var/log/ltm`.

    Cause: Certificates were regenerated, expired, or trust relationship corrupted.

    Solution for LTM HA:

    # Remove device from trust
    tmsh delete cm device <device-name>
    
    # Re-establish trust
    tmsh modify cm device-group device_trust_group devices add { <device-name> }
    
    # Force config sync
    tmsh run cm config-sync to-group device_trust_group

    Solution for GTM:

    # Remove and re-add server to force certificate re-exchange
    tmsh delete gtm server <server-name>
    tmsh create gtm server <server-name> addresses { <ltm-ip> } datacenter <dc-name>

    Problem 3: Version Mismatch

    Symptom: Some features don’t work, partial data sync, or connection instability.

    Cause: Devices running significantly different TMOS versions with incompatible iQuery protocol changes.

    Solution: While iQuery is generally backward-compatible, F5 recommends keeping device versions within 2-3 major releases. Upgrade devices to align versions.

    Problem 4: Config Sync Failures

    Symptom: “Awaiting Initial Sync” or “Changes Pending” that never resolve.

    Cause: iQuery connection issues or sync-failover device group problems.

    Solution:

    # Check sync status
    tmsh show cm sync-status
    
    # Force sync from known-good device
    tmsh run cm config-sync to-group <device-group-name>
    
    # Last resort: restart mcpd (disruptive; this restarts the entire control plane)
    tmsh restart sys service mcpd

    Monitoring iQuery Health

    Proactive monitoring prevents iQuery failures from causing outages:

    Key Metrics to Monitor

    For LTM HA:

    • Device trust status: All devices should show as trusted
    • Config sync state: Should be “In Sync”
    • Failover status: Active/Standby as expected
    • Certificate expiration: Monitor device certs

    For GTM:

    • Server status: All GTM servers should show “Available (Enabled)”
    • Virtual server status: Monitor state of all VS objects
    • iQuery connection count: Should match expected number of LTMs
    • Last update timestamp: Data should be fresh (< 10 seconds)

    Monitoring via iControl REST API

    # Check LTM HA sync status
    GET https://ltm-ip/mgmt/tm/cm/sync-status
    
    # Check device trust
    GET https://ltm-ip/mgmt/tm/cm/device
    
    # Query GTM server status
    GET https://gtm-ip/mgmt/tm/gtm/server
    
    # Check GTM virtual server health
    GET https://gtm-ip/mgmt/tm/gtm/server/~Common~ltm-server/virtual-servers/stats

    Integrate these checks into your monitoring platform (Prometheus, Zabbix, Nagios) to alert on iQuery failures before users are impacted.
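    A sketch of the availability and freshness alerting logic described above. The status string matches the "Available (Enabled)" value mentioned earlier, but the dictionary keys are illustrative assumptions about how you would shape the polled data, not fields iControl REST returns directly:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_update: datetime, max_age_seconds: int = 10) -> bool:
    """True if the last successful stats poll is older than the threshold."""
    return datetime.now(timezone.utc) - last_update > timedelta(seconds=max_age_seconds)

def gtm_server_alerts(servers):
    """Yield alert messages for GTM servers that are down or reporting stale data.

    Each item is a dict with illustrative keys: name, status, last_update.
    """
    for s in servers:
        if s["status"] != "Available (Enabled)":
            yield f"{s['name']}: status is {s['status']}"
        elif is_stale(s["last_update"]):
            yield f"{s['name']}: iQuery data is stale"
```

    Anything this yields maps directly to a monitoring alert: a non-available server means iQuery (or the LTM itself) is down, and stale data means GTM is making decisions on old health information.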

    Security Considerations

    1. Mutual Certificate Authentication

    iQuery’s certificate-based mutual auth is strong, but:

    • Protect certificate private keys on all devices
    • Monitor for unauthorized devices attempting iQuery connections
    • Rotate certificates periodically (though F5 doesn’t make this easy)

    2. Network Segmentation

    Limit TCP 4353 access:

    • Only allow between trusted BIG-IP devices
    • Don’t expose port 4353 to the internet
    • Use management VLANs for iQuery traffic when possible
    • Implement firewall rules between data centers

    3. Encryption

    iQuery traffic is encrypted by default (SSL/TLS), so passive sniffing won’t reveal configuration or health data. Ensure you’re running modern TMOS versions with up-to-date cipher suites.

    The Bottom Line: iQuery’s Importance

    iQuery is the universal glue that holds multi-device F5 deployments together.

    • For LTM HA: iQuery enables config sync, failover coordination, and connection mirroring
    • For GTM: iQuery provides the health visibility that makes intelligent global load balancing possible
    • For any multi-device deployment: iQuery is how devices discover, trust, and communicate with each other

    Without iQuery, you don’t have high availability, you don’t have global load balancing, and you don’t have device clustering. You just have isolated BIG-IP boxes that happen to be on the same network.

    Key Takeaways

    1. iQuery is the universal BIG-IP device-to-device protocol, not just for GTM
    2. Runs on TCP port 4353 with SSL/TLS encryption
    3. Powers LTM HA: config sync, failover, connection mirroring
    4. Enables GTM intelligence: health monitoring and performance metrics from LTMs
    5. Requires device trust via certificate exchange before communication
    6. Firewall rules must permit TCP 4353 between all communicating devices
    7. Monitor iQuery health proactively to prevent deployment failures

    Conclusion

    iQuery is one of those foundational technologies that “just works” until it doesn’t—and when it breaks, entire F5 deployments fail. LTM HA pairs can’t sync. GTM sends traffic to dead pools. Failovers don’t happen. It’s catastrophic.

    Understanding iQuery, ensuring TCP 4353 connectivity, monitoring certificate health, and watching for sync failures will save you from 2 AM pages about your load balancers being in split-brain or your global traffic manager routing everyone to an offline data center.

    If you manage F5 infrastructure—whether LTM HA pairs or global GTM deployments—treat iQuery health as seriously as you treat power and network connectivity. It’s the invisible backbone holding everything together.


    Managing F5 infrastructure or troubleshooting iQuery? Let’s connect on LinkedIn.

  • F5 iControl: The API That Powers Everything

    If you’ve ever used the F5 BIG-IP GUI, deployed an iApp, or run a Terraform script against your load balancers, you’ve used iControl—even if you didn’t realize it. iControl is the foundational API layer that sits beneath nearly every interaction with F5 devices. Let’s demystify what iControl actually is, how it works, and why it matters for modern F5 management.


    What Is iControl?

    iControl is F5’s programmatic interface for managing BIG-IP systems. It’s the API layer that allows external applications, scripts, and tools to interact with the BIG-IP platform without touching the command line or GUI.

    The Core Components

    iControl isn’t a single thing—it’s actually a family of APIs:

    • iControl SOAP API: The original SOAP-based web services interface (legacy, still supported)
    • iControl REST API: Modern RESTful API introduced in TMOS v11.5+ (current standard)
    • iControl Extensions: Specialized APIs for specific functions (LX for custom JavaScript workers)

    When people say “iControl” today, they almost always mean the iControl REST API.

    What Can iControl Do?

    Anything you can do through the GUI or CLI, you can do through iControl:

    • Create/modify/delete virtual servers, pools, nodes, monitors
    • Upload SSL certificates and manage profiles
    • Deploy iRules and iApps
    • Query statistics and performance metrics
    • Manage device configuration and system settings
    • Handle failover and high availability operations
    • Pull logs and troubleshooting data

    Think of iControl as the universal remote control for your F5 infrastructure.

    iControl REST: The Modern Standard

    The iControl REST API is what you’ll interact with in modern F5 environments. It follows standard REST principles:

    • HTTP verbs: GET (read), POST (create), PUT/PATCH (update), DELETE (remove)
    • JSON format: Requests and responses use JSON
    • URI structure: Resources are accessed via hierarchical URLs
    • Stateless: Each request contains all necessary information

    Basic REST Endpoint Structure

    All iControl REST API calls follow this pattern:

    https://<BIG-IP-IP>/mgmt/tm/<module>/<component>/<object>

    Examples:

    # List all virtual servers
    GET https://192.168.1.100/mgmt/tm/ltm/virtual
    
    # Get details of a specific pool
    GET https://192.168.1.100/mgmt/tm/ltm/pool/~Common~web_pool
    
    # View pool member statistics
    GET https://192.168.1.100/mgmt/tm/ltm/pool/~Common~web_pool/members/stats
    
    # Query system information
    GET https://192.168.1.100/mgmt/tm/sys/global-settings

    Authentication

    iControl REST supports two authentication methods:

    1. Basic Authentication (simple, but credentials sent with every request):

    curl -u admin:password \
      https://192.168.1.100/mgmt/tm/ltm/virtual

    2. Token-Based Authentication (recommended for automation):

    # Get a token
    curl -X POST \
      -u admin:password \
      https://192.168.1.100/mgmt/shared/authn/login \
      -d '{"username":"admin","password":"password","loginProviderName":"tmos"}'
    
    # Use the token
    curl -H "X-F5-Auth-Token: <token>" \
      https://192.168.1.100/mgmt/tm/ltm/virtual

    Real-World Examples: iControl in Action

    Example 1: Creating a Pool

    POST https://192.168.1.100/mgmt/tm/ltm/pool
    
    {
      "name": "web_pool",
      "monitor": "/Common/http",
      "loadBalancingMode": "round-robin",
      "members": [
        {
          "name": "192.168.10.10:80",
          "address": "192.168.10.10"
        },
        {
          "name": "192.168.10.11:80",
          "address": "192.168.10.11"
        }
      ]
    }

    Example 2: Querying Pool Member Status

    GET https://192.168.1.100/mgmt/tm/ltm/pool/~Common~web_pool/members/stats
    
    # Returns JSON with member state, connection counts, etc.

    Example 3: Disabling a Pool Member

    PATCH https://192.168.1.100/mgmt/tm/ltm/pool/~Common~web_pool/members/~Common~192.168.10.10:80
    
    {
      "state": "user-down",
      "session": "user-disabled"
    }

    Why iControl Matters

    1. Automation and Infrastructure-as-Code

    iControl is the foundation for all F5 automation:

    • Ansible: F5 modules use iControl REST under the hood
    • Terraform: F5 provider leverages iControl API
    • Python scripts: f5-sdk library wraps iControl calls
    • Custom integrations: ServiceNow, CI/CD pipelines, monitoring tools

    Without iControl, there would be no programmatic F5 management.

    2. The GUI Uses iControl

    Here’s something most people don’t realize: the F5 web GUI is just a pretty wrapper around iControl REST calls.

    When you click “Create” on a virtual server in the GUI, it’s making an iControl REST POST behind the scenes. You can actually watch this happen in your browser’s developer tools—every GUI action translates to API calls.

    This means anything you can do in the GUI, you can do via API (and vice versa).

    3. Multi-Device Management

    iControl makes it trivial to manage dozens or hundreds of F5 devices consistently:

    • Deploy identical configurations across multiple BIG-IPs
    • Query status from all devices simultaneously
    • Implement configuration drift detection
    • Orchestrate complex multi-device workflows

    4. Monitoring and Observability

    iControl enables deep integration with monitoring platforms:

    • Pull real-time statistics (connections, throughput, CPU, memory)
    • Query pool member health states
    • Extract virtual server performance metrics
    • Retrieve event logs and alerts

    Tools like Prometheus exporters, Grafana dashboards, and custom monitoring scripts all rely on iControl to gather data.

    iControl vs. TMSH: Which Should You Use?

    F5 devices also have a command-line interface called TMSH (Traffic Management Shell). How does it compare to iControl?

    Feature             | iControl REST API             | TMSH
    --------------------|-------------------------------|------------------------------
    Access Method       | HTTP/HTTPS (remote)           | SSH (direct access required)
    Format              | JSON (structured data)        | Text output (parsing required)
    Automation-Friendly | Excellent (designed for it)   | Good (with scripting)
    Idempotency         | Native REST semantics         | Manual implementation
    Cross-Platform      | Any HTTP client               | SSH client required
    Firewall-Friendly   | Yes (HTTPS, port 443)         | SSH, port 22
    Learning Curve      | Moderate (REST/JSON)          | Low (CLI-based)
    Best For            | Automation, integration, apps | Manual admin, troubleshooting

    General rule: Use iControl for automation and programmatic access. Use TMSH for interactive troubleshooting and one-off administrative tasks.

    Common iControl Use Cases

    1. Blue-Green Deployments

    Script iControl calls to:

    1. Deploy new application version to “green” pool
    2. Run health checks via API
    3. Switch traffic from “blue” to “green” pool
    4. Disable old pool members
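
    As a sketch of step 3, the traffic switch is a single PATCH that repoints the virtual server at the green pool. The device address, virtual server name, and pool names below are illustrative assumptions, not a canonical recipe:

```python
import json

# Hypothetical BIG-IP management address for illustration
BIGIP = "https://192.168.1.100/mgmt/tm"

def switch_pool_request(virtual, new_pool, partition="Common"):
    """Return (method, url, body) that repoints a virtual server at a new pool."""
    # iControl REST encodes /Common/app_vs as ~Common~app_vs in the URL path
    url = f"{BIGIP}/ltm/virtual/~{partition}~{virtual}"
    body = json.dumps({"pool": f"/{partition}/{new_pool}"})
    return "PATCH", url, body

method, url, body = switch_pool_request("app_vs", "green_pool")
print(method, url, body)
# Send with any HTTP client, e.g. requests.patch(url, data=body,
#   headers={"X-F5-Auth-Token": token, "Content-Type": "application/json"})
```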

    2. Dynamic Scaling

    Integrate with orchestration platforms (Kubernetes, AWS Auto Scaling) to:

    • Automatically add pool members when containers/instances launch
    • Remove pool members when instances terminate
    • Adjust connection limits based on demand

    3. Configuration Backup and Disaster Recovery

    Use iControl to:

    • Export UCS archives programmatically
    • Pull configuration as JSON for version control
    • Compare configurations across devices
    • Restore configurations automatically
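
    For the UCS export, the REST endpoint mirrors tmsh’s save sys ucs command. A minimal sketch; the device address and filename are illustrative, and you should verify the endpoint against your TMOS version:

```python
import json
from datetime import date

# Hypothetical BIG-IP management address for illustration
BIGIP = "https://192.168.1.100/mgmt/tm"

def save_ucs_request(name):
    """Return (url, body) for the POST that saves a UCS archive on-box."""
    return f"{BIGIP}/sys/ucs", json.dumps({"command": "save", "name": name})

url, body = save_ucs_request(f"nightly-{date.today().isoformat()}.ucs")
print(url, body)
# POST with an authenticated client, then pull the archive off-box
# for safekeeping (the download path varies by TMOS version).
```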

    4. Security and Compliance Auditing

    Query iControl to:

    • Verify SSL/TLS cipher suites across all virtual servers
    • Check certificate expiration dates
    • Audit unused objects and orphaned configurations
    • Generate compliance reports
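
    As an example of the certificate check: GET /mgmt/tm/sys/file/ssl-cert returns one item per certificate, each carrying an expiration timestamp. A sketch that flags anything expiring soon; the field name is assumed from memory (verify against your TMOS version) and the sample data is fabricated:

```python
import time

def expiring_certs(items, within_days=30, now=None):
    """Return names of certs whose expiration falls inside the window."""
    now = now if now is not None else time.time()
    cutoff = now + within_days * 86400
    # Certs with no expirationDate default to 0 and get flagged loudly
    return [c["name"] for c in items if c.get("expirationDate", 0) <= cutoff]

# Fabricated sample shaped like the items list from /mgmt/tm/sys/file/ssl-cert
sample = [
    {"name": "soon.crt", "expirationDate": 1_000_000 + 5 * 86400},
    {"name": "fine.crt", "expirationDate": 1_000_000 + 90 * 86400},
]
print(expiring_certs(sample, within_days=30, now=1_000_000))  # ['soon.crt']
```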

    The Gotchas and Limitations

    1. URI Encoding Hell

    F5 object paths contain forward slashes (e.g., /Common/web_pool) that can’t appear literally inside a REST URL path, so iControl REST substitutes tildes for them:

    # Partition "Common", pool "web_pool"
    Wrong: /mgmt/tm/ltm/pool/Common/web_pool
    Right: /mgmt/tm/ltm/pool/~Common~web_pool

    Forgetting this substitution is a common source of “404 Not Found” errors.
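
    A tiny helper keeps scripts from hand-building tilde paths:

```python
def f5_path(full_path: str) -> str:
    """Convert '/Common/web_pool' into '~Common~web_pool' for REST URLs."""
    return full_path.replace("/", "~")

print(f5_path("/Common/web_pool"))         # ~Common~web_pool
print(f5_path("/Common/app.example/vs1"))  # ~Common~app.example~vs1
```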

    2. Transaction Support is Limited

    iControl REST supports transactions for atomic multi-object changes, but they’re clunky and not widely used. Most automation tools just make sequential API calls and hope nothing breaks mid-flight.

    3. Rate Limiting and Performance

    The F5 API has limits:

    • Default maximum of 10 concurrent connections per user
    • Heavy API usage can impact control plane performance
    • Large configuration changes (hundreds of objects) can be slow

    Plan accordingly when building high-volume automation.

    4. Documentation Can Be Dense

    F5’s official iControl REST documentation is comprehensive but overwhelming. Finding the exact API endpoint and payload structure for your use case requires patience and experimentation.

    Pro tip: Use the GUI with browser developer tools open to see what API calls it makes—this is often faster than reading documentation.

    Getting Started with iControl

    Tools and Libraries

    Python:

    # Official F5 SDK
    pip install f5-sdk
    
    # Example usage
    from f5.bigip import ManagementRoot
    mgmt = ManagementRoot('192.168.1.100', 'admin', 'password')
    pools = mgmt.tm.ltm.pools.get_collection()
    for pool in pools:
        print(pool.name)

    curl (for quick testing):

    curl -sku admin:password \
      https://192.168.1.100/mgmt/tm/ltm/virtual | jq .

    Postman: Great for exploring the API interactively

    Best Practices

    1. Use token authentication for scripts and automation
    2. Implement idempotency: Check if object exists before creating
    3. Handle errors gracefully: Don’t assume API calls always succeed
    4. Log API interactions for debugging and audit trails
    5. Test in dev/lab first: Never prototype against production
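
    A minimal sketch of best practice #2 (idempotency), assuming a session object that already carries authentication (anything with requests-style get/post); the pool name and device address are illustrative:

```python
import json

# Hypothetical BIG-IP management address for illustration
BIGIP = "https://192.168.1.100/mgmt/tm"

def ensure_pool(session, name, partition="Common"):
    """Create the pool only if it doesn't exist yet. Returns True if created."""
    r = session.get(f"{BIGIP}/ltm/pool/~{partition}~{name}")
    if r.status_code == 200:
        return False  # already present: re-running the script is a no-op
    body = json.dumps({"name": name, "partition": partition})
    session.post(f"{BIGIP}/ltm/pool", data=body).raise_for_status()
    return True
```

Because the function checks before creating, running it twice converges on the same state instead of erroring out on a duplicate-object conflict.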

    Conclusion

    iControl is the invisible foundation of modern F5 management. Whether you’re clicking buttons in the GUI, running Ansible playbooks, or building custom integrations, it all flows through iControl.

    Understanding iControl unlocks the full potential of F5 automation:

    • Automate repetitive tasks
    • Integrate F5 into CI/CD pipelines
    • Build self-service portals for application teams
    • Implement advanced monitoring and observability
    • Scale F5 management across large deployments

    If you manage F5 devices and haven’t explored iControl yet, you’re missing out on the most powerful tool in your toolbox. Start simple—query some pool stats, create a test object, watch what the GUI does—and build from there.

    The API is there, it’s well-supported, and it’s waiting for you to automate away the mundane parts of F5 administration.


    Building F5 automation or have iControl questions? Connect with me on LinkedIn.

  • F5 iApps: The Promise vs. The Reality

    If you’ve worked with F5 BIG-IP for any length of time, you’ve probably encountered iApps—F5’s application template framework designed to simplify complex configurations. On paper, they sound great: standardized deployments, reduced errors, faster provisioning. In practice? Well, let’s talk about what iApps actually are, when you should use them, and whether they live up to the hype.


    What Are F5 iApps?

    iApps (Application Services) are pre-built configuration templates that bundle together all the components needed to deploy an application on F5 BIG-IP. Instead of manually creating virtual servers, pools, profiles, monitors, and iRules individually, an iApp presents you with a guided form that handles the orchestration for you.

    The Core Concept

    Think of iApps as Infrastructure-as-Code templates for F5. You answer questions about your application (IP addresses, ports, SSL requirements, pool members, health checks), and the iApp generates and manages all the underlying BIG-IP objects as a single logical unit.

    Key characteristics:

    • Atomic deployments: All components are created/updated together
    • Reconfiguration protection: Objects managed by iApps can’t be modified outside the template (without breaking the iApp)
    • Standardization: Enforces consistent configurations across deployments
    • Abstraction: Hides complexity from users who may not be F5 experts

    Built-In vs. Custom iApps

    F5 ships with built-in iApps for common applications:

    • Microsoft Exchange
    • Microsoft SharePoint
    • Microsoft Lync/Skype for Business
    • Oracle E-Business Suite
    • SAP NetWeaver
    • Citrix XenApp/XenDesktop
    • Generic HTTP/HTTPS applications

    Organizations can also develop custom iApps using the iApp template language (Tcl-based) to standardize their own application deployments.

    The Intended Use Cases

    F5 designed iApps to solve specific problems:

    1. Standardization Across Teams

    In large organizations with multiple F5 administrators, iApps ensure everyone configures applications the same way. No more “this admin uses FastL4, that admin uses Standard virtual servers” inconsistencies.

    2. Reducing Configuration Errors

    Manually configuring an SSL-offloaded application with SNAT, persistence, connection limits, and custom iRules leaves room for mistakes. iApps bundle best practices into validated templates.

    3. Delegating to Non-Experts

    The vision: application teams can deploy their own services through iApps without deep F5 knowledge. Fill out the form, click deploy, done.

    4. Faster Time-to-Production

    Pre-built templates for complex applications (Exchange, SharePoint, SAP) theoretically reduce deployment time from hours to minutes.

    The Reality: When iApps Work Well

    Let’s be fair—iApps can be useful in specific scenarios:

    Scenario 1: Cookie-Cutter Deployments

    If you deploy the same application configuration repeatedly (e.g., hosting 50 identical web applications for different customers), iApps shine. One template, multiple instances, guaranteed consistency.

    Example: MSPs hosting identical WordPress sites for multiple clients.

    Scenario 2: Mature Built-In Templates

    F5’s Exchange and SharePoint iApps are well-tested and handle the complexity of these Microsoft products better than most admins would manually. If you’re deploying one of these specific applications, the built-in iApp is genuinely helpful.

    Scenario 3: Self-Service Portals

    Organizations with automation frameworks (ServiceNow, custom portals) can integrate iApps as the backend for application provisioning workflows. The iApp enforces standards while the portal provides the user interface.

    The Reality: Where iApps Fall Short

    Now for the uncomfortable truth most F5 engineers have experienced:

    Problem 1: Rigidity and Lack of Flexibility

    iApps are opinionated. They enforce a specific configuration pattern, and deviating from that pattern is difficult or impossible. Real-world applications rarely fit perfectly into templates.

    Example frustration: You need to add a custom iRule that the iApp doesn’t support. Your options:

    • Modify the iApp template (requires Tcl knowledge, testing, ongoing maintenance)
    • Break the iApp and manage objects manually (defeats the purpose)
    • Give up on your requirement (unacceptable in production)

    Problem 2: The Lock-In Effect

    Once you deploy an application via iApp, all objects it creates are managed by that iApp. You can’t casually edit a pool member or tweak a profile setting through the GUI—you must go back to the iApp interface and reconfigure there.

    This is fine when it works. When the iApp doesn’t expose the setting you need to change? You’re stuck.

    Problem 3: Troubleshooting Complexity

    Debugging an iApp-deployed application is harder than debugging manually created objects. The iApp abstracts away the actual configuration, so you’re looking at generated objects with auto-generated names and relationships you didn’t explicitly create.

    Analogy: It’s like troubleshooting compiled code when you only have access to the high-level source. You know what the iApp was supposed to do, but figuring out what it actually did requires reverse-engineering.

    Problem 4: Version Drift and Upgrades

    iApp templates are versioned. If F5 releases an updated template, you need to:

    1. Import the new template version
    2. Test it in a lab
    3. Reconfigure existing deployments to use the new version
    4. Hope nothing breaks

    Many organizations avoid this pain by just… not upgrading iApp templates. Which means you’re running outdated configurations with known issues.

    Problem 5: Limited Adoption and Expertise

    Custom iApp development requires Tcl scripting knowledge and deep understanding of F5 internals. Most organizations don’t have this expertise in-house, so they’re limited to F5’s built-in templates—which may or may not fit their needs.

    The Decline of iApps: AS3 and Declarative Configurations

    F5 has largely moved away from promoting iApps in favor of AS3 (Application Services 3), a newer declarative configuration framework that addresses many of iApps’ shortcomings:

    Feature              | iApps                      | AS3
    ---------------------|----------------------------|----------------------
    Configuration Format | GUI forms + Tcl templates  | JSON declarations
    Flexibility          | Limited by template design | Highly flexible
    Version Control      | Difficult                  | JSON files in Git
    API-Friendly         | Clunky                     | Native REST API
    Learning Curve       | Moderate (GUI-based)       | Steeper (JSON + API)
    F5 Support           | Legacy/maintenance mode    | Active development

    AS3 treats F5 configurations as declarative JSON documents. You describe the desired state, POST it to the API, and AS3 figures out how to configure the BIG-IP to match. No more template lock-in, no more Tcl scripting.
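
    To make “declarative” concrete, here’s a minimal AS3 declaration (tenant, application, names, and addresses are all illustrative). POST it to /mgmt/shared/appsvcs/declare and AS3 builds the virtual server, pool, and members for you:

```json
{
  "class": "ADC",
  "schemaVersion": "3.0.0",
  "Tenant1": {
    "class": "Tenant",
    "WebApp": {
      "class": "Application",
      "template": "http",
      "serviceMain": {
        "class": "Service_HTTP",
        "virtualAddresses": ["192.0.2.10"],
        "pool": "web_pool"
      },
      "web_pool": {
        "class": "Pool",
        "monitors": ["http"],
        "members": [{
          "servicePort": 80,
          "serverAddresses": ["192.168.10.10", "192.168.10.11"]
        }]
      }
    }
  }
}
```

POST the same document again and nothing changes; edit the JSON in Git and re-POST, and AS3 reconciles the device to the new desired state.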

    So… Should You Use iApps?

    Use iApps If:

    • You’re deploying one of F5’s well-supported built-in applications (Exchange, SharePoint, etc.)
    • You have truly cookie-cutter deployments with zero customization needs
    • You already have mature custom iApps that work well and meet your needs
    • You’re in a legacy environment where migrating away isn’t feasible

    Avoid iApps If:

    • You need flexibility and customization
    • Your applications have unique requirements not covered by templates
    • You’re starting fresh and can adopt AS3/declarative configs instead
    • You value visibility into exactly what’s configured and why
    • You want to integrate F5 into modern CI/CD pipelines

    The Middle Ground: Hybrid Approach

    Some organizations use iApps for initial deployment and then “orphan” the configuration by managing objects manually afterward. This gives you the standardization benefit of iApps without the long-term lock-in.

    Process:

    1. Deploy via iApp to get a baseline configuration
    2. Document the generated objects
    3. Break the iApp association
    4. Manage objects manually going forward

    This isn’t ideal, but it’s pragmatic.

    Real-World Perspective: What I’ve Seen

    After 13+ years working with F5 in enterprise environments, here’s my honest take:

    iApps looked great in 2013. They promised standardization and simplification at a time when F5 configurations were becoming increasingly complex. The vision of application teams self-provisioning load balancers through templates was compelling.

    By 2018, most teams had moved on. The rigidity became a problem as applications evolved. Custom iApps required expertise most teams didn’t have. Troubleshooting was painful. And when something didn’t fit the template, you were stuck.

    In 2026, iApps are legacy. New deployments should use AS3 or manual configurations with proper automation (Ansible, Terraform). Existing iApp deployments are maintained but not expanded.

    The Verdict

    iApps solved real problems—standardization, error reduction, and faster deployments. For specific use cases (built-in templates, cookie-cutter apps), they still work fine.

    But they didn’t age well. The lack of flexibility, troubleshooting complexity, and lock-in effects became deal-breakers as infrastructure-as-code practices matured. F5’s own pivot to AS3 signals that even they recognize iApps’ limitations.

    For new deployments in 2026: Skip iApps. Use AS3 for API-driven automation, or stick with manual configurations wrapped in proper version control and automation tooling. Your future self will thank you.

    For existing iApp deployments: They’re not going away overnight. Keep them running if they work, but plan a migration strategy to more flexible approaches when opportunities arise.


    The Bottom Line: iApps are useful in narrow scenarios but generally not worth adopting today. The future of F5 automation lies in declarative configurations and modern API-driven workflows.


    Working with F5 or struggling with iApps? Let’s connect on LinkedIn and compare war stories.

  • DNS Records on the F5 GTM

    In a standard environment, DNS is simple. But when you are managing ZoneRunner on an F5 BIG-IP, the stakes are higher. You aren’t just managing names; you’re managing entry points for global traffic. While there are dozens of record types, these are the ones that keep the enterprise running.

    The Essentials: A, AAAA, and CNAME

    These are the bread and butter of your zone files. If you get these wrong, nothing else matters.

    • A (Address): The classic. Maps a hostname to a 32-bit IPv4 address. On the GTM, an A record is typically what a Wide IP ultimately returns for a load-balanced pool member.
    • AAAA (IPv6 Address): The 128-bit counterpart. Essential for modern “Mobile First” deployments.
    • CNAME (Canonical Name): An alias. Pro-Tip: In GTM/DNS setups, we often use CNAMEs to point a user-friendly URL (www.mmooresystems.com) to a GTM Wide IP (www.gslb.mmooresystems.com).

    The “Infrastructure” Records: SOA and NS

    You cannot have a functional zone without these. They define the “Who’s in Charge” logic of your network.

    • SOA (Start of Authority): The first record in any zone file. It tells the world that this BIG-IP is the best source of truth for the domain. It contains your serial numbers and refresh timers.
    • NS (Name Server): Defines the actual servers responsible for the zone. Without an NS record pointing to your Listeners, your GTM will never receive a query.

    The Modern “Service” Stack: MX, SRV, and TXT

    Modern networking relies heavily on these for discovery and security.

    • MX (Mail Exchanger): Tells the world where to send your email.
    • SRV (Service): Used heavily in Active Directory and VoIP (SIP) environments. It doesn’t just point to an IP; it points to a specific Service and Port (e.g., pointing _sip._tcp to your load balancer).
    • TXT (Text): The “junk drawer” that became a security powerhouse. Today, TXT records are primarily used for SPF, DKIM, and DMARC to prevent email spoofing.

    Advanced & Specialized Records

    When things get complex, ZoneRunner supports the heavy hitters:

    Record | Usage in BIG-IP DNS
    -------|--------------------
    PTR    | The “Reverse Lookup.” Used to prove an IP belongs to a name (essential for SMTP).
    NAPTR  | Name Authority Pointer. Used for URN mapping, often in complex Telecom/IMS environments.
    DNAME  | Like a CNAME, but for an entire subtree of the DNS tree. Useful for IPv6 reverse lookups.
    HINFO  | Standard host info (Hardware/OS). Rarely used today for security reasons (don’t give attackers a map!).

    Closing Thought: ZoneRunner vs. Manual BIND

    The beauty of ZoneRunner is that it validates your syntax. If you try to create two SOA records or a CNAME that conflicts with an A-record, ZoneRunner will stop you before you reload the BIND configuration and break your production DNS. It’s the “safety rail” every network engineer needs.

  • F5 BIG-IP DNS: Demystifying ZoneRunner and the BIND Handshake

    If you’ve ever stepped into the F5 BIG-IP DNS (formerly GTM) world, you’ve likely encountered a service called ZoneRunner. To the uninitiated, it looks like a redundant layer of management. To the power user, it is the bridge between standard DNS and F5’s Intelligent Traffic Management. Here is how to understand the “magic” happening under the hood.

    1. The Foundation: What is ZoneRunner?

    At its core, ZoneRunner is a configuration daemon (zrd) that manages a local instance of ISC BIND running on the BIG-IP. F5 didn’t reinvent the wheel for DNS records; they simply packaged BIND and built a management layer to handle the zone files. When you create a record in the F5 GUI under DNS > Zones > ZoneRunner, the F5 is essentially writing a standard BIND zone file for you.

    When Should You Actually Use ZoneRunner?

    In many GSLB (Global Server Load Balancing) environments, the F5 is just a “smart proxy” for a few URLs. But you need ZoneRunner when:

    • The F5 is the Authoritative Master: If the BIG-IP is the “Start of Authority” (SOA) for a specific sub-domain (e.g., gslb.mmooresystems.com).
    • Defining “Glue” Records: When you need static A-records, MX records, or TXT records that don’t require intelligent load balancing.
    • Providing a Safety Net: ZoneRunner acts as the “fallback” answer if the GTM layer doesn’t have a dynamic answer ready.

    2. iQuery: The Nervous System of GTM

    If ZoneRunner is the “Database,” then iQuery is the nervous system. iQuery is a proprietary F5 protocol running over TCP port 4353. It is the “secret sauce” that allows a GTM in one data center to talk to an LTM in another.

    Without iQuery, your GTM is “blind.” It uses this connection to:

    • Monitor Health: Instead of the GTM pinging every server, it asks the local LTM via iQuery: “Are your Virtual Servers healthy?”
    • Exchange Metrics: It shares CPU and connection loads so the GTM can steer traffic to the least-burdened data center.
    • Sync Everything: It ensures that a configuration change on one GTM is instantly replicated to its peers in the Sync Group.

    3. The Handshake: How it All Flows

    The magic happens when a DNS query actually hits your Listener (the Virtual Server waiting on UDP/53). The BIG-IP performs a high-speed logic check:

    1. The GTM Intercept: If the query matches a Wide IP, the GTM layer takes over. It checks the iQuery data for health and path metrics and provides an “Intelligent” answer.
    2. The BIND Fallback: If the query doesn’t match a Wide IP, the F5 hands the request down to the ZoneRunner/BIND backend to see if a static record exists.
    3. The Silence: If neither layer has an answer, it returns NXDOMAIN.

    Pro-Tips for Greenfield Deployments

    Setting this up from scratch? Keep these two “gotchas” in mind:

    Watch Your Clocks: iQuery relies on SSL certificates for the bigip_add / gtm_add handshake. If your NTP isn’t synced, the certificates will be rejected, and your iQuery mesh will fail before it starts.

    The Listener is King: You can have the most perfect ZoneRunner records and iQuery health checks, but without a DNS Listener defined on a Self-IP or Virtual Server, the BIG-IP will never answer the phone.

    Have questions about your GTM mesh or general networking? Reach out!

  • Silence the Noise: A Guide to Zabbix Maintenance Mode

    We’ve all been there. You’ve scheduled a 2:00 AM window to upgrade a core pfSense firewall or a database cluster. You initiate the reboot, and within seconds, your phone is a vibrating brick of Slack notifications, PagerDuty alerts, and automated emails telling you exactly what you already know: The host is down.

    In the world of monitoring, context is everything. Zabbix Maintenance Mode is the feature that gives your monitoring system that context, turning it from a nagging alarm into a professional quiet-period tool.

    Why Use Maintenance Mode?

    The primary goal isn’t just to stop emails; it’s to maintain Data Integrity.

    1. Alert Suppression: Prevent “Action” operations (emails, scripts, webhooks) from triggering for known downtime.
    2. SLA Accuracy: If you report on uptime for clients or management, Maintenance Mode allows you to exclude “Scheduled Downtime” from your availability percentages.
    3. Dashboards with Context: Instead of a red “Problem” state, your Zabbix dashboard shows a blue or orange wrench icon, telling other team members, “Someone is working on this; don’t panic.”

    The Two Types: With vs. Without Data Collection

    When you create a maintenance period in Zabbix, you have a critical choice:

    • With Data Collection: Zabbix continues to poll the host and store history. You can still see CPU spikes during an upgrade or how long the reboot took in your graphs—you just won’t get alerted. (Highly Recommended for Upgrades).
    • No Data Collection: Zabbix stops the pollers entirely for that host. This is best for hardware replacements where the device is physically powered off for a long duration.
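
    For automation, the same choice surfaces as the maintenance_type field in the Zabbix API’s maintenance.create method (0 = with data collection, 1 = without). A sketch of the request body; exact field names shift between Zabbix versions (e.g., host targeting moved from hostids to hosts objects), so treat this as a template to verify against your API docs:

```python
import json
import time

def maintenance_payload(name, host_ids, hours=1, collect_data=True, now=None):
    """Build a maintenance.create request body for the Zabbix JSON-RPC API."""
    start = int(now if now is not None else time.time())
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "active_since": start,
            "active_till": start + hours * 3600,           # master window
            "maintenance_type": 0 if collect_data else 1,  # 0 = keep polling
            "hostids": [str(h) for h in host_ids],
            "timeperiods": [{"timeperiod_type": 0,         # one-time period
                             "period": hours * 3600}],
        },
        "id": 1,
    }

# Hypothetical host ID for illustration
req = maintenance_payload("pfSense upgrade", [10105])
print(json.dumps(req, indent=2))
```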

    Best Practices for the “Clean Upgrade”

    1. Use the “Buffer” Strategy

    If you think an upgrade will take 15 minutes, set your Maintenance Period for 30. If the upgrade fails (say, kernel memory exhaustion or a slow filesystem check), you don’t want the alerts to start firing while you’re mid-troubleshooting.

    2. Understand “Active Since” vs. “Period”

    This is the most common point of failure for new Zabbix users.

    • Active Since/Till: The “Master Window” (The badge that lets you in the building).
    • Period: The “Execution Time” (The shift you actually work). Your maintenance won’t start unless the current time falls inside both.
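
    The rule reduces to a simple boolean, which this sketch makes explicit (the dates are arbitrary examples):

```python
from datetime import datetime, timedelta

def in_maintenance(now, active_since, active_till, period_start, period_len):
    """Maintenance applies only when 'now' is inside BOTH windows."""
    in_window = active_since <= now <= active_till          # the badge
    in_period = period_start <= now <= period_start + period_len  # the shift
    return in_window and in_period

since = datetime(2025, 6, 1, 0, 0)
till = since + timedelta(days=30)
period = datetime(2025, 6, 15, 2, 0)  # a 02:00 change window

print(in_maintenance(datetime(2025, 6, 15, 2, 30), since, till,
                     period, timedelta(hours=1)))  # True: inside both
print(in_maintenance(datetime(2025, 6, 15, 4, 0), since, till,
                     period, timedelta(hours=1)))  # False: window yes, period no
```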

    3. Target Host Groups, Not Just Hosts

    Instead of creating a new maintenance entry for every individual server, create a group like “Maintenance_Windows_Sunday.” By simply moving a host into that group, it inherits the maintenance schedule automatically.

    When to Pull the Trigger?

    • OS/Firmware Upgrades: Essential for firewalls (pfSense/OPNsense) and hypervisors.
    • Database Migrations: High-load operations often trigger “Slow Query” or “I/O Wait” alerts.
    • Testing New Triggers: If you’re “tuning” a new Zabbix template and don’t want to spam your team while you find the right thresholds.

    A Real-World Reality Check

    I was actually writing this post while performing a pfSense Plus upgrade. The upgrade hit a snag—a “failed to reclaim memory” error (exit code 137) during the PHP 8.5 package extraction. Because I had Zabbix in Maintenance Mode with Data Collection, I could see the CPU spike and memory flatline in my dashboard without my phone exploding with alerts. It gave me the quiet headspace to jump into the SSH console and fix the dependency issue manually.

    The takeaway: Maintenance mode isn’t just for when things go right; it’s your best friend when things go wrong.

  • Tagged Layer 3 Interfaces vs Router-on-a-Stick: Two Sides of the Same Coin

    Both tagged Layer 3 interfaces and router-on-a-stick use 802.1Q VLAN tagging to multiplex multiple Layer 3 networks over a single physical link. The concepts are nearly identical—the main differences lie in the platform, scale, and typical use cases. Let’s break down what makes them similar and where they diverge.


    The Foundation: 802.1Q VLAN Tagging

    Both designs rely on 802.1Q trunking to carry multiple VLANs across a single physical interface. Each VLAN gets its own Layer 3 subinterface (or logical unit), allowing a single link to handle multiple routed networks simultaneously.

    Think of it like a single fiber optic cable carrying multiple wavelengths of light (DWDM). One physical medium, multiple logical channels.

    Router-on-a-Stick: The Classic Pattern

    How It Works

    Router-on-a-stick connects a router to a Layer 2 switch via a single 802.1Q trunk. The router creates multiple subinterfaces on one physical port, with each subinterface handling routing for a specific VLAN.

    Configuration Example (Cisco Router):

    interface GigabitEthernet0/0
     description Trunk to Layer 2 Switch
     no ip address
    
    interface GigabitEthernet0/0.10
     description VLAN 10 - Finance
     encapsulation dot1Q 10
     ip address 192.168.10.1 255.255.255.0
    
    interface GigabitEthernet0/0.20
     description VLAN 20 - Engineering  
     encapsulation dot1Q 20
     ip address 192.168.20.1 255.255.255.0
    
    interface GigabitEthernet0/0.30
     description VLAN 30 - Guest
     encapsulation dot1Q 30
     ip address 192.168.30.1 255.255.255.0

    Primary Use Case

    Inter-VLAN routing in small to medium environments:

    • Branch offices with Layer 2 switches
    • Small campus networks
    • Budget-constrained deployments
    • Networks with light to moderate inter-VLAN traffic

    Tagged Layer 3 Interfaces: The Enterprise Pattern

    How It Works

    Tagged Layer 3 interfaces use the same 802.1Q subinterface concept, but typically on enterprise routers or Layer 3 switches connecting to other Layer 3 devices or provider networks. Rather than inter-VLAN routing for local users, these interfaces often carry:

    • Multiple customer connections (ISP/carrier use case)
    • Different VRFs or routing instances
    • Segregated services over shared infrastructure
    • WAN connections with multiple circuits

    Configuration Examples

    Juniper (Logical Units):

    set interfaces et-0/0/1 description "Carrier_Circuit_to_DMZ_Switch"
    set interfaces et-0/0/1 vlan-tagging
    
    set interfaces et-0/0/1 unit 200 description "ATT"
    set interfaces et-0/0/1 unit 200 vlan-id 200
    set interfaces et-0/0/1 unit 200 family inet address 10.23.59.1/30
    
    set interfaces et-0/0/1 unit 308 description "Zayo"
    set interfaces et-0/0/1 unit 308 vlan-id 308
    set interfaces et-0/0/1 unit 308 family inet address 10.23.58.1/30
    
    set interfaces et-0/0/1 unit 322 description "Lumen"
    set interfaces et-0/0/1 unit 322 vlan-id 322
    set interfaces et-0/0/1 unit 322 family inet address 10.23.57.1/30
    
    set interfaces et-0/0/1 unit 337 description "Verizon"
    set interfaces et-0/0/1 unit 337 vlan-id 337
    set interfaces et-0/0/1 unit 337 family inet address 10.23.56.1/30

    Arista (Subinterfaces with VRFs):

    interface Ethernet3
       description "Verizon"
       no switchport
    
    interface Ethernet3.3011
       description "Customer1"
       encapsulation dot1q vlan 3011
       vrf Cust1
       ip address 10.140.242.45/31
    
    interface Ethernet3.3012
       description "Customer2"
       encapsulation dot1q vlan 3012
       vrf Cust2
       ip address 10.140.242.49/31
    
    interface Ethernet3.3018
       description "Customer3"
       encapsulation dot1q vlan 3018
       vrf Customer3
       ip address 10.140.242.53/31

    Primary Use Cases

    Service multiplexing and network segregation:

    • Carrier/ISP networks serving multiple customers over shared infrastructure
    • Enterprise edge routers with multiple WAN circuits or partners
    • Data center interconnects (DCI) carrying multiple tenants
    • MPLS PE routers with VRF-segregated customers
    • DMZ/extranet environments with strict segmentation requirements
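    Configs like the Juniper example above get repetitive fast, which is why this pattern templates well. A sketch that emits Junos-style set commands from a table of circuits — the carrier names, VLAN IDs, and addresses are illustrative placeholders:

```python
# Emit Junos-style "set" commands for tagged Layer 3 units on one interface.
# The carrier list, VLAN IDs, and addresses are illustrative placeholders.
def junos_units(interface, units):
    lines = [f"set interfaces {interface} vlan-tagging"]
    for name, vlan_id, address in units:
        prefix = f"set interfaces {interface} unit {vlan_id}"
        lines += [
            f'{prefix} description "{name}"',
            f"{prefix} vlan-id {vlan_id}",
            f"{prefix} family inet address {address}",
        ]
    return "\n".join(lines)

config = junos_units("et-0/0/1", [
    ("ATT", 200, "10.23.59.1/30"),
    ("Zayo", 308, "10.23.58.1/30"),
])
print(config)
```

    Generating config from a circuit table also gives you a single source of truth to diff against the running device.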

    Key Differences

    Feature | Router-on-a-Stick | Tagged Layer 3 Interfaces
    ------- | ----------------- | --------------------------
    Typical Platform | Small branch routers (ISR, etc.) | Enterprise routers (MX, ASR, 7xxx)
    Connected To | Layer 2 access switch | Layer 3 device, carrier, or upstream
    Primary Purpose | Inter-VLAN routing for end users | Service multiplexing, WAN aggregation
    Traffic Pattern | East-west (VLAN to VLAN) | North-south (external connections)
    VRF Usage | Rarely used | Common (customer/service isolation)
    Scale | Typically 3-10 VLANs | Dozens to hundreds
    Port Speed | 1G typical | 10G/40G/100G common
    Routing Complexity | Simple (default gateway role) | Complex (BGP, OSPF, policy routing)

    The Real Difference: Context and Scale

    Technically, both designs are doing the same thing: using 802.1Q tagging to create multiple Layer 3 interfaces on a single physical port. The distinctions come down to:

    1. Network Location

    • Router-on-a-stick: Access layer, connecting to end-user VLANs
    • Tagged L3 interfaces: Edge/core, connecting to WAN, partners, or other infrastructure

    2. Traffic Type

    • Router-on-a-stick: Internal traffic between VLANs (Finance ↔ Engineering)
    • Tagged L3 interfaces: External services, customers, or carriers (Bank of America, Wells Fargo, Verizon, AT&T)

    3. Isolation Requirements

    • Router-on-a-stick: Simple VLAN separation, shared routing table
    • Tagged L3 interfaces: Often uses VRFs for strict routing isolation between customers/services

    4. Performance Expectations

    • Router-on-a-stick: Bandwidth bottleneck is an accepted trade-off for simplicity
    • Tagged L3 interfaces: High-speed links (10G+) with hardware-accelerated forwarding

    Real-World Example: Financial Services Edge Router

    In the Arista example above, a single interface facing a carrier (described as “Verizon” in the config) carries three completely isolated customer networks:

    • VLAN 3011: Customer1 connection (VRF: Cust1)
    • VLAN 3012: Customer2 connection (VRF: Cust2)
    • VLAN 3018: Customer3 connection (VRF: Customer3)

    Each subinterface exists in a separate VRF, ensuring complete routing isolation. Traffic in the Cust1 VRF can never leak into Cust2 or Customer3, even though all three share the same physical wire.

    This is service multiplexing—using 802.1Q to deliver multiple isolated services over shared infrastructure.
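    Conceptually, VRF isolation is just per-VRF routing tables: a lookup only ever consults the table of the VRF the packet arrived in. A toy Python model of that behavior (not any vendor's implementation; the interface and VRF names echo the Arista example):

```python
import ipaddress

# Toy model of VRF isolation: one routing table per VRF, and a lookup only
# ever consults the table for the VRF it was asked about.
tables = {
    "Cust1": {"10.140.242.44/31": "Ethernet3.3011"},
    "Cust2": {"10.140.242.48/31": "Ethernet3.3012"},
}

def lookup(vrf, dest):
    """Longest-prefix match restricted to a single VRF's table."""
    addr = ipaddress.ip_address(dest)
    matches = [p for p in tables.get(vrf, {}) if addr in ipaddress.ip_network(p)]
    if not matches:
        return None  # no route in this VRF, even if another VRF could reach it
    best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
    return tables[vrf][best]

print(lookup("Cust1", "10.140.242.45"))  # Ethernet3.3011
print(lookup("Cust2", "10.140.242.45"))  # None -- Cust2 never sees Cust1 routes
```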

    When to Use Each Design

    Use Router-on-a-Stick When:

    • You need inter-VLAN routing in a small office or branch
    • You have Layer 2 switches and one router
    • Budget constraints prevent Layer 3 switching
    • Inter-VLAN traffic is moderate and predictable

    Use Tagged Layer 3 Interfaces When:

    • Connecting to carriers, partners, or WAN providers
    • You need strict traffic segregation (VRFs)
    • Multiplexing multiple customers or services over shared links
    • Building data center interconnects or MPLS PE infrastructure
    • Working with high-bandwidth circuits (10G+)

    Common Pitfalls and Considerations

    MTU and Fragmentation

    802.1Q inserts 4 bytes into the Ethernet header, growing a full-size frame from 1518 to 1522 bytes on the wire. Most switch and router ports accept these slightly oversized (“baby giant”) frames without reducing the IP MTU, but if any device in the path enforces a strict 1500-byte Layer 2 payload, tagged traffic effectively loses 4 bytes of Layer 3 MTU (1496). Always verify MTU settings match on both ends to avoid fragmentation and hard-to-diagnose drops.
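    The arithmetic is worth pinning down. A few lines of Python showing why a full-size tagged frame needs links that accept more than the classic 1518-byte maximum:

```python
# Frame sizing for tagged traffic (back-of-the-envelope, Ethernet II framing).
ETH_OVERHEAD = 18  # dst MAC (6) + src MAC (6) + EtherType (2) + FCS (4)
DOT1Q_TAG = 4      # the 802.1Q tag inserted into the header

def wire_size(ip_mtu, tagged=True):
    return ip_mtu + ETH_OVERHEAD + (DOT1Q_TAG if tagged else 0)

print(wire_size(1500, tagged=False))  # 1518 -- classic maximum Ethernet frame
print(wire_size(1500))                # 1522 -- needs "baby giant" support
print(wire_size(1496))                # 1518 -- or shrink the IP MTU instead
```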

    Native VLAN Considerations

    Some platforms allow a “native” (untagged) VLAN on trunk ports. Be explicit about whether you’re using this feature to avoid misconfigurations and potential security issues.

    Performance Monitoring

    Monitor each subinterface individually—don’t just look at the physical interface utilization. One busy subinterface can saturate the link and affect all others.
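    If you're already pulling byte counters (SNMP or streaming telemetry), per-subinterface utilization is simple arithmetic. A sketch with made-up interface names and counter samples:

```python
# Per-subinterface utilization from two byte-counter samples taken
# interval_s seconds apart (interface names and counters are made up).
def utilization_mbps(bytes_t0, bytes_t1, interval_s):
    return (bytes_t1 - bytes_t0) * 8 / interval_s / 1e6

samples = {  # (counter at t0, counter at t0 + 30s)
    "Ethernet3.3011": (1_000_000_000, 2_125_000_000),
    "Ethernet3.3012": (4_000_000_000, 4_003_000_000),
}
for ifname, (t0, t1) in samples.items():
    print(f"{ifname}: {utilization_mbps(t0, t1, 30):.1f} Mbps")
# One busy subinterface can crowd out its quiet neighbors long before the
# physical interface graph looks alarming.
```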

    QoS and Traffic Shaping

    When multiplexing critical services, implement QoS policies to ensure high-priority traffic (e.g., VoIP, financial transactions) isn’t starved by bulk data transfers.

    Conclusion

    Router-on-a-stick and tagged Layer 3 interfaces are fundamentally the same technology—802.1Q subinterfaces providing Layer 3 routing over a single physical link. The key differences are:

    • Router-on-a-stick: Small-scale inter-VLAN routing for local users
    • Tagged L3 interfaces: Enterprise-scale service multiplexing with VRF isolation

    Both have their place in modern networks. Understanding when and why to use each pattern is essential for designing efficient, scalable infrastructure—whether you’re building a branch office network or connecting to major financial institutions over carrier circuits.


    Working with VLANs, VRFs, or enterprise routing? Let’s connect on LinkedIn

  • Fixing XCP-ng Live Migration Failures: Mixed CPU Generations in a Homelab Pool

    The Problem: When Your Homelab Becomes a Lesson in Enterprise Architecture

    I recently ran into an interesting issue with my XCP-ng homelab that taught me a valuable lesson about virtualization infrastructure design. If you’re running a mixed-hardware pool and your rolling updates keep failing with cryptic CANNOT_EVACUATE_HOST errors, this post is for you.

    The Setup

    My homelab consists of two hosts in an XCP-ng pool (managed via Xen Orchestra):

    • Hera: HP Z640 with Intel Xeon E5-2670 v3 (Haswell, 12c/24t @ 2.30GHz)
    • Zeus: Dell server with Intel Xeon E5-2650 v2 (Ivy Bridge, 8c/16t @ 2.60GHz)

    Seems reasonable, right? Both are Xeon E5 v2/v3 generation processors, both support virtualization, and they’ve been running happily together in a pool for quite some time.

    The Failure: Rolling Updates Hit a Wall

    When I attempted to perform a rolling pool update through Xen Orchestra, I was greeted with this error:

    CANNOT_EVACUATE_HOST(VM_INCOMPATIBLE_WITH_THIS_HOST,
    OpaqueRef:1de8f41d-c39c-b097-026d-c8b687dee6a1,
    OpaqueRef:4f9c343b-8ebd-9ade-7f9b-eaa22844b7dd,
    VM last booted on a CPU with features this host's CPU does not have.)
    

    Similarly, attempting to put Hera into maintenance mode resulted in:

    VM_INCOMPATIBLE_WITH_THIS_HOST(
    OpaqueRef:dd8ccb61-2e86-4853-880f-49f078b0e10d,
    OpaqueRef:4f9c343b-8ebd-9ade-7f9b-eaa22844b7dd,
    VM last booted on a CPU with features this host's CPU does not have.)
    

    The error message is clear enough: some VMs couldn’t be migrated because they were using CPU features that didn’t exist on the destination host.

    Understanding the Root Cause

    Here’s what was actually happening:

    The CPU Generation Gap

    While both hosts use Intel Xeon E5 processors, they’re from different microarchitecture generations:

    Feature | Hera (E5-2670 v3) | Zeus (E5-2650 v2)
    ------- | ----------------- | -----------------
    Architecture | Haswell (2014) | Ivy Bridge (2013)
    Instruction Sets | AVX, AVX2, BMI2, FMA3 | AVX only
    Cores/Threads | 12c/24t | 8c/16t
    L3 Cache | 30 MB | 20 MB

    The Haswell architecture (v3) introduced several new instruction sets that Ivy Bridge (v2) doesn’t support, including:

    • AVX2 (Advanced Vector Extensions 2)
    • BMI2 (Bit Manipulation Instructions 2)
    • FMA3 (three-operand Fused Multiply-Add)

    How VMs Lock to CPU Features

    When a VM boots on a host, it discovers and can utilize all available CPU features. The hypervisor essentially tells the VM: “Here are all the CPU instructions you can use.”

    Once a VM starts using these features, it expects them to remain available. During live migration, XCP-ng checks: “Does the destination host support all the CPU features this running VM is currently using?”

    In my case:

    • VMs booted on Hera discovered and started using AVX2 and other Haswell-specific features
    • When XCP-ng tried to migrate them to Zeus for patching, Zeus said “I don’t have AVX2”
    • Migration blocked → Pool evacuation failed → Rolling update failed
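    The check XCP-ng performs boils down to set containment: does the destination have every feature the VM booted with? A simplified Python model — real XAPI compares full CPUID feature masks, so these short sets are just stand-ins:

```python
# Simplified model of the live-migration compatibility check. Real XAPI
# compares full CPUID feature masks; these short sets are stand-ins.
HERA = {"sse4_2", "avx", "avx2", "bmi2", "fma3"}  # Haswell (E5-2670 v3)
ZEUS = {"sse4_2", "avx"}                           # Ivy Bridge (E5-2650 v2)

def can_migrate(vm_features, dest_cpu_features):
    """A VM may move only if the destination has every feature it booted with."""
    missing = vm_features - dest_cpu_features
    return not missing, missing

ok, missing = can_migrate(HERA, ZEUS)  # VM booted on Hera, target is Zeus
print(ok, sorted(missing))  # False ['avx2', 'bmi2', 'fma3']
```

    Note the asymmetry: a VM booted on Zeus migrates to Hera just fine, because Zeus's feature set is a subset of Hera's.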

    The Simple Analogy

    Think of it like a phone app that requires iOS 17 trying to run on a phone with iOS 16. The app expects certain APIs to be available, and when they’re not, it simply won’t run. You can’t hot-swap the phone’s OS mid-operation.

    Finding the Problematic VMs

    The OpaqueRefs in the error messages are internal XAPI object references, not directly useful for identifying VMs. Here’s how I tracked down the culprits:

    List VMs by Host

    # Show all running VMs on Hera
    xe vm-list resident-on=$(xe host-list name-label="Hera - z640" --minimal) \
      power-state=running is-control-domain=false params=name-label,uuid

    Trial and Error Method

    Since I had a manageable number of VMs, I:

    1. Identified all VMs running on Hera
    2. Attempted to manually migrate each one to Zeus through XO
    3. The ones that failed were my incompatible VMs

    Through this process, I identified two VMs that couldn’t migrate.

    The Solution: CPU Compatibility Mode

    XCP-ng provides a way to constrain VMs to use only CPU features available across all pool members. This is done via the platform:cpu-type parameter.

    Applying the Fix

    For each problematic VM:

    # Set CPU type to generic (lowest common denominator)
    xe vm-param-set uuid=<VM-UUID> platform:cpu-type=generic
    
    # Verify the setting
    xe vm-param-get uuid=<VM-UUID> param-name=platform
    
    # Reboot the VM for changes to take effect
    xe vm-reboot uuid=<VM-UUID>

    After rebooting, the VMs now only use CPU instructions available on both Haswell (v3) and Ivy Bridge (v2) processors.

    What “Generic” Actually Does

    Setting cpu-type=generic instructs the hypervisor to present the VM with a baseline CPU feature set that’s compatible across all hosts in the pool. The VM essentially runs in “compatibility mode,” using only the CPU features guaranteed to exist everywhere.
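    In other words, the baseline is the intersection of every pool member's feature set. Sketched in Python (abbreviated stand-ins for the real CPUID masks):

```python
# "Lowest common denominator" is literally the intersection of every host's
# feature set (abbreviated stand-ins for the real CPUID masks).
pool = {
    "Hera": {"sse4_2", "avx", "avx2", "bmi2", "fma3"},
    "Zeus": {"sse4_2", "avx"},
}
baseline = set.intersection(*pool.values())
print(sorted(baseline))  # ['avx', 'sse4_2']
```

    A VM constrained to this baseline can boot on either host and migrate freely between them.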

    Performance Impact

    For most workloads, the performance impact is negligible:

    • General compute: No noticeable difference
    • I/O-bound workloads: Unaffected
    • Specific AVX2-optimized applications: Minor performance reduction (typically <5%)

    The trade-off of slightly reduced performance for operational flexibility is well worth it in a homelab environment.

    Verification and Testing

    After applying the fix and rebooting the VMs:

    1. Test manual migration: Successfully migrated both VMs from Hera to Zeus
    2. Maintenance mode: Hera successfully evacuated all VMs to Zeus
    3. Rolling pool update: Completed without errors

    Success! The pool is now fully functional for automated updates.

    Prevention: Applying Pool-Wide

    To prevent this issue from occurring with other VMs in the future, you can apply CPU compatibility mode pool-wide:

    # Apply to all VMs in the pool
    for vm in $(xe vm-list is-control-domain=false params=uuid --minimal | tr ',' ' '); do 
      echo "Setting CPU compatibility for: $(xe vm-param-get uuid=$vm param-name=name-label)"
      xe vm-param-set uuid=$vm platform:cpu-type=generic
    done

    Important: VMs must be rebooted for this change to take effect. You can do this gradually during normal maintenance windows.

    The Bigger Lesson: Infrastructure Homogeneity

    This experience reinforced a fundamental principle of enterprise virtualization: infrastructure homogeneity matters.

    Why Matching Hardware is Critical

    Live Migration Requirements:

    • CPU instruction set compatibility
    • Same virtualization extensions (VT-x/AMD-V)
    • Compatible storage and network interfaces

    Operational Simplicity:

    • Predictable performance across the cluster
    • Simplified capacity planning
    • Reduced troubleshooting complexity

    High Availability:

    • VMs can failover to any host without constraints
    • Automated DRS/anti-affinity rules work seamlessly

    Enterprise Best Practices

    In production environments:

    1. Buy in matched sets: Purchase servers in pairs or groups with identical specs
    2. Lifecycle management: Refresh entire clusters together, not piecemeal
    3. Spare parts consistency: Keep compatible spare components
    4. Firmware alignment: Maintain consistent BIOS/firmware versions

    Homelab Reality

    Of course, homelabs are different:

    • We buy what’s affordable or available
    • Hardware comes from various sources (eBay, liquidation sales, hand-me-downs)
    • Mix-and-match is the norm, not the exception

    The good news? XCP-ng provides tools like CPU compatibility mode to work around these limitations.

    Alternative Solutions

    If CPU compatibility mode isn’t acceptable for your use case, consider these alternatives:

    Option 1: Separate Pools

    Run incompatible hosts as separate pools:

    Pros:

    • Each pool runs at full CPU capability
    • No performance compromises

    Cons:

    • No live migration between pools
    • More complex management
    • Reduced flexibility for workload placement

    Option 2: Hardware Standardization

    Upgrade or replace hosts to match specifications:

    Pros:

    • Full feature utilization
    • Operational simplicity
    • Better long-term scalability

    Cons:

    • Higher upfront cost
    • Requires hardware acquisition

    For my homelab, I’m keeping the CPU compatibility mode approach for now. Matching the hosts would mean moving Zeus to the Haswell platform: E5-2600 v3 chips are cheap on the secondary market (~$20-40), but they require an LGA2011-3 board, so it’s a board-and-CPU swap rather than a drop-in upgrade—a potential future project.

    Which CPU is Actually Better?

    For those curious, despite Zeus having a higher base clock (2.6 GHz vs 2.3 GHz), Hera is the superior host:

    • 50% more cores: 12c/24t vs 8c/16t = significantly better VM density
    • Newer architecture: Better IPC (instructions per clock)
    • Larger cache: 30MB vs 20MB
    • Advanced instructions: AVX2, BMI2, FMA3 for optimized workloads

    The lesson? More cores and newer architecture generally trump raw clock speed for virtualization workloads.

    Key Takeaways

    1. CPU compatibility matters: Mixed CPU generations in a pool can prevent live migration and automated updates
    2. CPU compatibility mode exists: The platform:cpu-type=generic parameter solves most heterogeneous pool issues
    3. Performance impact is minimal: For most workloads, compatibility mode has negligible performance cost
    4. Homogeneous infrastructure is ideal: Matching hardware simplifies operations and prevents these issues
    5. Homelabs are different: We work with what we have and use workarounds when necessary

    Troubleshooting Checklist

    If you encounter similar issues:

    • ☐ Check CPU models across all pool members
    • ☐ Verify CPU architecture generations match
    • ☐ Review VM placement and migration history
    • ☐ Test manual VM migration to identify incompatible VMs
    • ☐ Apply platform:cpu-type=generic to problematic VMs
    • ☐ Reboot VMs after applying CPU compatibility settings
    • ☐ Consider pool-wide application for future-proofing

    Conclusion

    What started as a frustrating “why won’t my rolling update work?” turned into a valuable learning experience about virtualization architecture fundamentals. The issue was quickly resolved with XCP-ng’s built-in CPU compatibility features, and I gained a deeper appreciation for why enterprise environments invest in hardware consistency.

    For fellow homelabbers running mixed hardware: don’t let CPU generation differences stop you. Apply CPU compatibility mode, reboot your VMs, and get back to the fun stuff—learning, breaking things, and building your infrastructure skills.

    Have you encountered similar issues in your homelab? How did you solve them? Connect with me on LinkedIn and let’s discuss!


    Environment Details:

    • Hypervisor: XCP-ng 8.x
    • Management: Xen Orchestra (latest)
    • Pool: 2 hosts (mixed Intel Xeon E5 v2/v3)
    • Issue: Rolling pool updates failing on CPU incompatibility


    Questions or thoughts? Connect with me on LinkedIn | About mmooresystems

  • Welcome to my journey


    After years of tinkering, breaking things, and occasionally fixing them in my homelab, I figured it was time to start documenting the journey.

    This site is where I’ll be sharing the lessons learned from building enterprise-grade infrastructure at home, the networking concepts that keep me up at night (in a good way), and the occasional “why didn’t anyone tell me this sooner?” moment.

    What to expect:

    • Deep dives into networking protocols (because understanding BGP shouldn’t require a PhD)
    • Homelab projects that actually work (and the 17 failed attempts before that)
    • Infrastructure tutorials for building resilient systems
    • The truth about working in network engineering and SRE roles

    First real post coming soon. In the meantime, check out the About Me page to learn more about who’s behind this chaos.

    Thanks for stopping by.

    – Mike


    Questions? Connect with me on LinkedIn.