Machine and Container updates (Day 19)

Photo by Francesco Ungaro / Unsplash

Today was all about updates.

What started as just routine maintenance turned into a reminder of why we keep backups (and backups of backups).

The Update Plan

  • Update Proxmox nodes
  • Update VMs
  • Update OPNsense to version 25

Things Go Sideways

After the updates and reboots, OPNsense decided to forget about its VLANs and misconfigure WAN and LAN interfaces.

This cascaded into:

  • Everything losing connectivity
  • DNS becoming unreachable
  • General network chaos

Recovery Process

  1. Direct connection to Proxmox node (thank goodness for out-of-band management)
  2. Tried the built-in backup list - no luck
  3. Remembered the lesson from the last time I had to do a reinstall at 1:20 am: keep config backups locally
  4. Reset OPNsense, restored from local backup
  5. Fixed an interface mismatch
  6. Network starts coming back to life
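The "keep config backups locally" habit from step 3 is easy to script. Here's a rough sketch of how it could look, assuming the config has already been exported to a file on the workstation (the `config.xml` name, the backup folder, and the `archive_config` helper are all made up for illustration, not anything OPNsense ships). It copies the export under a timestamped name and keeps only the newest few copies.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_config(exported: Path, backup_dir: Path, keep: int = 10) -> Path:
    """Copy an exported config into backup_dir under a timestamped name.

    Hypothetical helper: `exported` is whatever file you downloaded from
    the firewall; nothing here talks to OPNsense itself.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    backup_dir.mkdir(parents=True, exist_ok=True)
    # Microseconds in the stamp keep names unique for back-to-back runs.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    target = backup_dir / f"{exported.stem}-{stamp}{exported.suffix}"
    shutil.copy2(exported, target)

    # Timestamped names sort chronologically, so the oldest come first.
    copies = sorted(backup_dir.glob(f"{exported.stem}-*{exported.suffix}"))
    for old in copies[:-keep]:
        old.unlink()
    return target
```

Calling something like `archive_config(Path("config.xml"), Path.home() / "fw-backups")` after every export would leave a small local history to restore from when the built-in backup list comes up empty.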

DNS

Everything seemed fixed until I noticed I still had no internet.

OPNsense looked good, but the DNS server was unreachable despite appearing online and healthy. So basically "everything's fine but nothing works."

After some troubleshooting, I replaced the VM's NIC and re-assigned it the same static IP in OPNsense. After that the node was reachable and my DNS was working again.

Most services recovered quickly once DNS and OPNsense were back, though TrueNAS took its time and couldn't update its catalogs, so I added Quad9 as a fallback for next time (because there probably will be one).
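To make the fallback idea concrete, here's a small sketch that probes a list of resolvers in order and returns the first one that actually answers a DNS query. It's only an illustration under some assumptions: `192.168.1.53` stands in for the local DNS server (not my real address), `9.9.9.9` is Quad9's public resolver, and the function names are invented for this example. The probe hand-builds a minimal A-record query so it doesn't depend on the system resolver, which is exactly the thing that might be broken.

```python
import socket
import struct

LOCAL_DNS = "192.168.1.53"  # hypothetical address of the local DNS server
QUAD9 = "9.9.9.9"           # Quad9 public resolver


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def answers(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server` sends back a DNS response over UDP port 53."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return False
    # A real response echoes our query ID and has the QR bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def first_working_resolver(servers, name="example.com", probe=answers):
    """Try each resolver in order; return the first that answers, else None."""
    for server in servers:
        if probe(server, name):
            return server
    return None
```

Running `first_working_resolver([LOCAL_DNS, QUAD9])` would return the local server when it's healthy and fall through to Quad9 when it isn't. The `probe` parameter is there so the ordering logic can be exercised without touching the network.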

So

  1. Keep multiple backups in different locations (the built-in backups aren't always enough)
  2. Added Quad9 to some nodes like TrueNAS as a fallback DNS for future resilience
  3. When debugging network issues, don't trust what "looks fine" - verify connectivity layer by layer
  4. Updates, while necessary, can turn out to be, well...
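For point 3, "layer by layer" can be written down as an ordered list of checks that stops at the first one that fails. The sketch below is one way to do that, with a few assumptions flagged up front: the `ping` flags are the Linux ones (`-c 1 -W 2`), the DNS server is assumed to accept TCP on port 53, and every function name and address is a placeholder rather than something from my actual setup.

```python
import socket
import subprocess
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def ping(host: str) -> bool:
    """IP reachability: one ICMP echo via the system ping (Linux-style flags)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:  # no ping binary available
        return False
    return result.returncode == 0


def tcp_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Transport: can we complete a TCP handshake with host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def resolves(name: str) -> bool:
    """Name resolution through whatever resolver the system is configured with."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the label of the first failure, or None."""
    for label, check in checks:
        ok = check()
        print(f"{'ok  ' if ok else 'FAIL'} {label}")
        if not ok:
            return label
    return None


def home_lab_checks(gateway: str, dns_server: str,
                    test_name: str = "example.com") -> List[Check]:
    """Bottom-up list of checks; addresses are supplied by the caller."""
    return [
        ("ping gateway", lambda: ping(gateway)),
        ("ping DNS server", lambda: ping(dns_server)),
        ("DNS port 53 open", lambda: tcp_open(dns_server, 53)),
        ("name resolves", lambda: resolves(test_name)),
        ("internet reachable", lambda: tcp_open(test_name, 443)),
    ]
```

Something like `first_failure(home_lab_checks("192.168.1.1", "192.168.1.53"))` (both addresses invented) would have pointed straight at the DNS VM in my case: a host that shows as online and healthy but fails the ping or port check.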

And the Kubernetes clusters (both k3s and the HA one)? Everything just came back online like nothing had happened, without needing to touch a single node (including the two HAProxy nodes).

At least now I have a fallback plan for DNS issues, and another validation of why Kubernetes is great for self-healing infrastructure.