Category: Virtualization

  • Fixing XCP-ng Live Migration Failures: Mixed CPU Generations in a Homelab Pool

    The Problem: When Your Homelab Becomes a Lesson in Enterprise Architecture

    I recently ran into an interesting issue with my XCP-ng homelab that taught me a valuable lesson about virtualization infrastructure design. If you’re running a mixed-hardware pool and your rolling updates keep failing with cryptic CANNOT_EVACUATE_HOST errors, this post is for you.

    The Setup

    My homelab consists of two hosts in an XCP-ng pool (managed via Xen Orchestra):

    • Hera: HP Z640 with Intel Xeon E5-2670 v3 (Haswell, 12c/24t @ 2.30GHz)
    • Zeus: Dell server with Intel Xeon E5-2650 v2 (Ivy Bridge, 8c/16t @ 2.60GHz)

    Seems reasonable, right? Both are Xeon E5 processors just one generation apart (v2 and v3), both support virtualization, and they’ve been running happily together in a pool for quite some time.

    The Failure: Rolling Updates Hit a Wall

    When I attempted to perform a rolling pool update through Xen Orchestra, I was greeted with this error:

    CANNOT_EVACUATE_HOST(VM_INCOMPATIBLE_WITH_THIS_HOST,
    OpaqueRef:1de8f41d-c39c-b097-026d-c8b687dee6a1,
    OpaqueRef:4f9c343b-8ebd-9ade-7f9b-eaa22844b7dd,
    VM last booted on a CPU with features this host's CPU does not have.)
    

    Similarly, attempting to put Hera into maintenance mode resulted in:

    VM_INCOMPATIBLE_WITH_THIS_HOST(
    OpaqueRef:dd8ccb61-2e86-4853-880f-49f078b0e10d,
    OpaqueRef:4f9c343b-8ebd-9ade-7f9b-eaa22844b7dd,
    VM last booted on a CPU with features this host's CPU does not have.)
    

    The message points at the cause: some VMs couldn’t be migrated because they had last booted with CPU features that the destination host doesn’t have.

    Understanding the Root Cause

    Here’s what was actually happening:

    The CPU Generation Gap

    While both hosts use Intel Xeon E5 processors, they’re from different microarchitecture generations:

    Feature            Hera (E5-2670 v3)        Zeus (E5-2650 v2)
    Architecture       Haswell (2014)           Ivy Bridge (2013)
    Instruction Sets   AVX, AVX2, BMI2, FMA3    AVX only
    Cores/Threads      12c/24t                  8c/16t
    L3 Cache           30 MB                    20 MB

    The Haswell architecture (v3) introduced several new instruction sets that Ivy Bridge (v2) doesn’t support, including:

    • AVX2 (Advanced Vector Extensions 2)
    • BMI2 (Bit Manipulation Instructions 2)
    • FMA3 (Fused Multiply-Add 3)

    How VMs Lock to CPU Features

    When a VM boots on a host, the hypervisor presents it with a set of CPU features, essentially telling it: “Here are all the CPU instructions you can use.” The guest OS and its applications detect those features and pick their code paths accordingly.

    Once a VM has been told a feature exists, it expects that feature to remain available for as long as it runs. XAPI therefore records the feature set each VM was given at boot, and during live migration XCP-ng checks: “Does the destination host support every CPU feature this VM last booted with?”

    In my case:

    • VMs booted on Hera discovered and started using AVX2 and other Haswell-specific features
    • When XCP-ng tried to migrate them to Zeus for patching, Zeus said “I don’t have AVX2”
    • Migration blocked → Pool evacuation failed → Rolling update failed
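
    The check in the last step amounts to a subset test on bitmasks. Here is a minimal sketch of that idea, assuming feature sets are written as dash-separated 32-bit hex words (the form xe host-cpu-info prints for its feature fields). The function name is made up for illustration, and this is not XAPI’s actual code:

```shell
# Succeeds (exit 0) when every feature bit set in $1 is also set in $2.
# $1: feature set the VM last booted with; $2: destination host's feature set.
featureset_is_subset() {
  local need="$1" have="$2" n h
  while [ -n "$need" ]; do
    n=${need%%-*}          # next 32-bit word required by the VM
    h=${have%%-*}          # matching word offered by the host
    [ -n "$h" ] || h=0     # host string ran out: treat missing words as zero
    # A bit set in the VM word but clear in the host word blocks migration.
    if [ $(( 0x$n & ~0x$h )) -ne 0 ]; then
      return 1
    fi
    case $need in *-*) need=${need#*-} ;; *) need= ;; esac
    case $have in *-*) have=${have#*-} ;; *) have= ;; esac
  done
  return 0
}
```

    With invented sample values, featureset_is_subset 0000002f-00000001 0000000f-00000001 fails because bit 0x20 of the first word is missing on the destination, which is the situation my Haswell-booted VMs were in.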

    The Simple Analogy

    Think of it like a phone app that requires iOS 17 trying to run on a phone with iOS 16. The app expects certain APIs to be available, and when they’re not, it simply won’t run. You can’t hot-swap the phone’s OS mid-operation.

    Finding the Problematic VMs

    The OpaqueRefs in the error messages are internal XAPI object references, not directly useful for identifying VMs. Here’s how I tracked down the culprits:

    List VMs by Host

    # Show all running VMs on Hera
    xe vm-list resident-on=$(xe host-list name-label="Hera - z640" --minimal) \
      power-state=running is-control-domain=false params=name-label,uuid
    

    Trial and Error Method

    Since I had a manageable number of VMs, I:

    1. Identified all VMs running on Hera
    2. Attempted to manually migrate each one to Zeus through XO
    3. The ones that failed were my incompatible VMs

    Through this process, I identified two VMs that couldn’t migrate.
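
    With more VMs, the same trial-and-error loop could be scripted. This is a rough sketch rather than what I actually ran: the function name is hypothetical, it assumes both arguments are host UUIDs, and note that any VM that can migrate really will be moved to the destination host.

```shell
# Tries to live-migrate every running VM from one host to another and prints
# "uuid name-label" for each VM whose migration is refused.
# $1: source host UUID, $2: destination host UUID.
find_unmigratable_vms() {
  local src="$1" dest="$2" vm
  for vm in $(xe vm-list resident-on="$src" power-state=running \
                is-control-domain=false params=uuid --minimal | tr ',' ' '); do
    # xe exits non-zero when XAPI rejects the migration.
    if ! xe vm-migrate uuid="$vm" host="$dest" live=true >/dev/null 2>&1; then
      echo "$vm $(xe vm-param-get uuid="$vm" param-name=name-label)"
    fi
  done
}
```

    Called as find_unmigratable_vms "$(xe host-list name-label="Hera - z640" --minimal)" <ZEUS-UUID>, it would list only the VMs that refuse to move.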

    The Solution: CPU Compatibility Mode

    XCP-ng provides a way to constrain VMs to use only CPU features available across all pool members. This is done via the platform:cpu-type parameter.

    Applying the Fix

    For each problematic VM:

    # Set CPU type to generic (lowest common denominator)
    xe vm-param-set uuid=<VM-UUID> platform:cpu-type=generic
    
    # Verify the setting
    xe vm-param-get uuid=<VM-UUID> param-name=platform
    
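    # Restart the VM for changes to take effect
    # (a full shutdown and start is the safe choice; an in-place reboot may keep the old feature set)
    xe vm-shutdown uuid=<VM-UUID>
    xe vm-start uuid=<VM-UUID>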
    

    After the restart, the VMs use only CPU instructions available on both Haswell (v3) and Ivy Bridge (v2) processors.

    What “Generic” Actually Does

    Setting cpu-type=generic instructs the hypervisor to present the VM with a baseline CPU feature set that’s compatible across all hosts in the pool. The VM essentially runs in “compatibility mode,” using only the CPU features guaranteed to exist everywhere.
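
    Conceptually, a feature set that exists everywhere is the intersection of what each host offers. A small sketch of that arithmetic, again assuming dash-separated 32-bit hex words as the input format; the function name is invented, and this illustrates the idea rather than how XCP-ng computes its baseline:

```shell
# Prints the bitwise AND of two dash-separated hex feature sets, word by word.
# Stops at the shorter input, so words present on only one host are dropped.
featureset_intersect() {
  local a="$1" b="$2" out="" x y
  while [ -n "$a" ] && [ -n "$b" ]; do
    x=${a%%-*}
    y=${b%%-*}
    # Keep only the bits that both hosts advertise.
    out="${out}${out:+-}$(printf '%08x' $(( 0x$x & 0x$y )))"
    case $a in *-*) a=${a#*-} ;; *) a= ;; esac
    case $b in *-*) b=${b#*-} ;; *) b= ;; esac
  done
  printf '%s\n' "$out"
}
```

    Any feature bit that only Hera has, such as AVX2, comes out as zero in the result, which is why a VM limited to the common set can run on either host.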

    Performance Impact

    For most workloads, the performance impact is negligible:

    • General compute: No noticeable difference
    • I/O-bound workloads: Unaffected
    • Specific AVX2-optimized applications: Slower, because they fall back to AVX or SSE code paths; how much depends on the workload

    The trade-off of slightly reduced performance for operational flexibility is well worth it in a homelab environment.

    Verification and Testing

    After applying the fix and rebooting the VMs:

    1. Test manual migration: Successfully migrated both VMs from Hera to Zeus
    2. Maintenance mode: Hera successfully evacuated all VMs to Zeus
    3. Rolling pool update: Completed without errors

    Success! The pool is now fully functional for automated updates.

    Prevention: Applying Pool-Wide

    To prevent this issue from occurring with other VMs in the future, you can apply CPU compatibility mode pool-wide:

    # Apply to all VMs in the pool (control domains excluded)
    for vm in $(xe vm-list is-control-domain=false params=uuid --minimal | tr ',' ' '); do
      echo "Setting CPU compatibility for: $(xe vm-param-get uuid="$vm" param-name=name-label)"
      xe vm-param-set uuid="$vm" platform:cpu-type=generic
    done
    

    Important: VMs must be rebooted for this change to take effect. You can do this gradually during normal maintenance windows.
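
    To keep track of which VMs still lack the setting, something like the following audit could help. It is a sketch under two assumptions: that xe exits non-zero when the requested map key is absent, and that the platform key is named cpu-type as used above. The function name is hypothetical.

```shell
# Prints "uuid name-label" for every non-control-domain VM whose platform map
# has no cpu-type entry yet.
vms_missing_cpu_type() {
  local vm
  for vm in $(xe vm-list is-control-domain=false params=uuid --minimal | tr ',' ' '); do
    # param-key reads a single entry of a map parameter;
    # assumed here: the command fails if that entry does not exist.
    if ! xe vm-param-get uuid="$vm" param-name=platform param-key=cpu-type >/dev/null 2>&1; then
      echo "$vm $(xe vm-param-get uuid="$vm" param-name=name-label)"
    fi
  done
}
```

    It only reports whether the parameter is set. It cannot tell whether a VM has been restarted since, so that still needs tracking separately.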

    The Bigger Lesson: Infrastructure Homogeneity

    This experience reinforced a fundamental principle of enterprise virtualization: infrastructure homogeneity matters.

    Why Matching Hardware is Critical

    Live Migration Requirements:

    • CPU instruction set compatibility
    • Same virtualization extensions (VT-x/AMD-V)
    • Compatible storage and network interfaces

    Operational Simplicity:

    • Predictable performance across the cluster
    • Simplified capacity planning
    • Reduced troubleshooting complexity

    High Availability:

    • VMs can failover to any host without constraints
    • Automated DRS/anti-affinity rules work seamlessly

    Enterprise Best Practices

    In production environments:

    1. Buy in matched sets: Purchase servers in pairs or groups with identical specs
    2. Lifecycle management: Refresh entire clusters together, not piecemeal
    3. Spare parts consistency: Keep compatible spare components
    4. Firmware alignment: Maintain consistent BIOS/firmware versions

    Homelab Reality

    Of course, homelabs are different:

    • We buy what’s affordable or available
    • Hardware comes from various sources (eBay, liquidation sales, hand-me-downs)
    • Mix-and-match is the norm, not the exception

    The good news? XCP-ng provides tools like CPU compatibility mode to work around these limitations.

    Alternative Solutions

    If CPU compatibility mode isn’t acceptable for your use case, consider these alternatives:

    Option 1: Separate Pools

    Run incompatible hosts as separate pools:

    Pros:

    • Each pool runs at full CPU capability
    • No performance compromises

    Cons:

    • No live migration between pools
    • More complex management
    • Reduced flexibility for workload placement

    Option 2: Hardware Standardization

    Upgrade or replace hosts to match specifications:

    Pros:

    • Full feature utilization
    • Operational simplicity
    • Better long-term scalability

    Cons:

    • Higher upfront cost
    • Requires hardware acquisition

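    For my homelab, I’m keeping the CPU compatibility mode approach for now. E5-2670 v3 processors are relatively inexpensive on the secondary market (~$20-40), but they need an LGA 2011-3 board with DDR4 memory, so one will not drop into an E5 v2 system. Bringing Zeus up to match Hera would mean replacing the server rather than just the CPU, which makes it a potential future project.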

    Which CPU is Actually Better?

    For those curious, despite Zeus having a higher base clock (2.6 GHz vs 2.3 GHz), Hera is the superior host:

    • 50% more cores: 12c/24t vs 8c/16t = significantly better VM density
    • Newer architecture: Better IPC (instructions per clock)
    • Larger cache: 30MB vs 20MB
    • Advanced instructions: AVX2, BMI2, FMA3 for optimized workloads

    The lesson? More cores and newer architecture generally trump raw clock speed for virtualization workloads.

    Key Takeaways

    1. CPU compatibility matters: Mixed CPU generations in a pool can prevent live migration and automated updates
    2. CPU compatibility mode exists: The platform:cpu-type=generic parameter resolved the CPU feature mismatch in this mixed pool
    3. Performance impact is minimal: For most workloads, compatibility mode has negligible performance cost
    4. Homogeneous infrastructure is ideal: Matching hardware simplifies operations and prevents these issues
    5. Homelabs are different: We work with what we have and use workarounds when necessary

    Troubleshooting Checklist

    If you encounter similar issues:

    • ☐ Check CPU models across all pool members
    • ☐ Verify CPU architecture generations match
    • ☐ Review VM placement and migration history
    • ☐ Test manual VM migration to identify incompatible VMs
    • ☐ Apply platform:cpu-type=generic to problematic VMs
    • ☐ Reboot VMs after applying CPU compatibility settings
    • ☐ Consider pool-wide application for future-proofing
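
    The first item on the checklist can be done from any pool member. A minimal sketch, assuming the host cpu_info map exposes a modelname key; the function name is made up:

```shell
# Prints "host-name: CPU model" for every host in the pool.
list_host_cpu_models() {
  local host
  for host in $(xe host-list params=uuid --minimal | tr ',' ' '); do
    printf '%s: %s\n' \
      "$(xe host-param-get uuid="$host" param-name=name-label)" \
      "$(xe host-param-get uuid="$host" param-name=cpu_info param-key=modelname)"
  done
}
```

    If the models differ, xe host-cpu-info on each host shows the full feature details to compare.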

    Conclusion

    What started as a frustrating “why won’t my rolling update work?” turned into a valuable learning experience about virtualization architecture fundamentals. The issue was quickly resolved with XCP-ng’s built-in CPU compatibility features, and I gained a deeper appreciation for why enterprise environments invest in hardware consistency.

    For fellow homelabbers running mixed hardware: don’t let CPU generation differences stop you. Apply CPU compatibility mode, reboot your VMs, and get back to the fun stuff—learning, breaking things, and building your infrastructure skills.

    Have you encountered similar issues in your homelab? How did you solve them? Connect with me on LinkedIn and let’s discuss!


    Environment Details:

    • Hypervisor: XCP-ng 8.x
    • Management: Xen Orchestra (latest)
    • Pool: 2 hosts (mixed Intel Xeon E5 v2/v3)
    • Issue: Rolling pool updates failing on CPU incompatibility

    Questions or thoughts? Connect with me on LinkedIn | About mmooresystems