Nomadin

Moving Homelab SSH and SOPS keys onto a YubiKey

Vince — Sun, 14 Jun 2026 13:49:00 GMT

I wanted my homelab access to depend less on normal private keys sitting around on a laptop.

SSH keys, SOPS age keys, GPG keys, bootstrap keys. It is very easy to grow a pile of local secrets that quietly become part of the machine. That works until I reinstall the OS, move to another laptop, or the keys leak.

I wanted to keep the signing and decrypting material on the YubiKey where possible, and leave the laptop with handles, config, and public keys. A new laptop is one case where this matters, but it is not the point of the setup. The point is that the laptop should not be the place where all the long-lived access keys live.

This ended up being two separate tracks.

OpenSSH resident FIDO keys for host access
SOPS access through a YubiKey-backed OpenPGP key

I originally wanted the SOPS part to use age-plugin-yubikey, but my YubiKey setup got in the way.

SSH keys on the YubiKey

For SSH, I used OpenSSH security-key (FIDO) keys.

ssh-keygen -t ed25519-sk \
  -O resident \
  -O verify-required \
  -O application=ssh:realm-nano \
  -f ~/.ssh/id_ed25519_sk_realm_nano \
  -C "lab@realm"

The resident option stores the key handle on the YubiKey, so the local machine does not need a normal exportable SSH private key for this access path. If the local files disappear, I can pull the key-handle files back down from the token. verify-required makes OpenSSH ask for the FIDO PIN before the YubiKey signs.

On macOS I had to use Homebrew OpenSSH. The system OpenSSH could list the sk-* key types, but failed before it reached the YubiKey because the FIDO provider path was not available. Homebrew OpenSSH worked for this.

brew install openssh

which ssh
which ssh-keygen
ssh -V

The commands should resolve to /opt/homebrew/bin/ssh and /opt/homebrew/bin/ssh-keygen, not /usr/bin.
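A quick way to confirm which binaries the shell picks up is a small check like the one below. It is only a sketch: it assumes the /opt/homebrew prefix mentioned above, and the function name is made up for illustration.

```shell
# Hypothetical helper: succeeds only if the given path sits under the
# Homebrew prefix assumed in this post (/opt/homebrew/bin).
is_homebrew_bin() {
  case "$1" in
    /opt/homebrew/bin/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report where ssh and ssh-keygen resolve on this machine.
for tool in ssh ssh-keygen; do
  path=$(command -v "$tool" 2>/dev/null || true)
  if is_homebrew_bin "$path"; then
    echo "$tool: $path (Homebrew)"
  else
    echo "$tool: ${path:-not found} (not Homebrew, check PATH order)"
  fi
done
```

If either tool reports a non-Homebrew path, the usual cause is /usr/bin coming before /opt/homebrew/bin in PATH.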

Once the public key was copied to a host, the SSH config only needed a normal alias. This is the shape, using one host as the example.

Host avalon-yk
  HostName 
  User root
  IdentityFile ~/.ssh/id_ed25519_sk_realm_nano
  IdentityFile ~/.ssh/id_ed25519_sk_realm_nfc
  IdentitiesOnly yes
  PreferredAuthentications publickey
  ControlMaster auto
  ControlPath ~/.ssh/control-%C
  ControlPersist 30m

There are two identity files because one YubiKey acts as a backup for the other.

I used aliases like this for the Proxmox hosts, OPNsense, TrueNAS, and the DNS host. The hostnames and users change, but the alias shape stays the same.

The first login needs a real local terminal because of the PIN and touch flow.

ssh avalon-yk

After that, multiplexing keeps the next few commands on the same connection.

ssh -O check avalon-yk
ssh avalon-yk 'hostname && whoami'
scp some-file avalon-yk:/tmp/

With ControlPersist 30m, I can unlock one SSH connection with the YubiKey and then run follow-up commands through the same master connection for half an hour.

Pulling resident SSH handles back down

If the local key-handle files are missing, they can be pulled from the YubiKey with ssh-keygen -K.

mkdir -p ~/.ssh
chmod 700 ~/.ssh
cd ~/.ssh

ssh-keygen -K

This works because the keys were created with -O resident. OpenSSH knows how to talk to FIDO authenticators directly, either through its built-in USB HID support or an SSH_SK_PROVIDER / -w provider override. The YubiKey is just the FIDO2 authenticator in this case.

ssh-keygen -K writes public and private key-handle files into the current directory. It is different from ssh-add -K, which loads resident keys into the agent. For this setup I want files back in ~/.ssh, so ssh-keygen -K is the one I care about.

OpenSSH downloads resident keys from the first FIDO authenticator that gets touched. If both YubiKeys are plugged in, I would do one at a time. Plug in the nano, run ssh-keygen -K, rename the files, then repeat with the other key.

After downloading, check the comments and fingerprints.

ls -l id_*_sk*
ssh-keygen -lf ./*.pub

Then rename the local key-handle files to match the SSH config.

mv  ~/.ssh/id_ed25519_sk_realm_nano
mv  ~/.ssh/id_ed25519_sk_realm_nano.pub

chmod 600 ~/.ssh/id_ed25519_sk_realm_nano
chmod 644 ~/.ssh/id_ed25519_sk_realm_nano.pub

The local ed25519-sk private file is not a normal private key in the old SSH sense. It is a key handle, and the YubiKey still has to sign. I would not commit it or pass it around, but having that file by itself is not enough to log in.

Installing a public key on a host

If a host still only has an old bootstrap key, I do not need anything fancy. Show the public key locally.

cat ~/.ssh/id_ed25519_sk_realm_nano.pub

Then SSH in with whatever still works and paste that line into ~/.ssh/authorized_keys on the host.

ssh -i ~/.ssh/ root@

mkdir -p ~/.ssh
chmod 700 ~/.ssh
vi ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

After that, test the YubiKey alias from the laptop.

ssh avalon-yk 'hostname && whoami'

Once the YubiKey alias works, the bootstrap key can stop being the normal path.

Git over SSH

I used the same idea for GitHub and my internal GitLab, but with a separate pair of FIDO SSH keys.

~/.ssh/id_ed25519_sk_git
~/.ssh/id_ed25519_sk_git.pub
~/.ssh/id_ed25519_sk_git_nfc
~/.ssh/id_ed25519_sk_git_nfc.pub

The SSH config for GitHub and GitLab points at those keys instead of the homelab keys.

Host github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_ed25519_sk_git
  IdentityFile ~/.ssh/id_ed25519_sk_git_nfc
  IdentitiesOnly yes
  PreferredAuthentications publickey

Git signing also moved to SSH signatures, using the Git FIDO public key.

git config --global gpg.format ssh
git config --global user.signingkey ~/.ssh/id_ed25519_sk_git.pub
git config --global commit.gpgsign true

For local verification, I added both Git signing public keys to ~/.ssh/allowed_signers and pointed Git at it.

git config --global gpg.ssh.allowedSignersFile ~/.ssh/allowed_signers
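Each line in an allowed_signers file has the shape: principal, optional options, key type, base64 key. A minimal sketch of generating such lines from the public key files follows. The helper name and the email address are placeholders, and restricting the key to the "git" namespace is my assumption rather than something this setup requires.

```shell
# Hypothetical helper: print one allowed_signers line for a public key file.
# Line format: <principal> [options] <key-type> <base64-key>
allowed_signer_line() {
  email=$1
  pubfile=$2
  # Keep only the key type and the base64 blob; drop the trailing comment.
  awk -v email="$email" \
    '{ printf "%s namespaces=\"git\" %s %s\n", email, $1, $2 }' "$pubfile"
}

# Usage (the email is a placeholder):
# allowed_signer_line you@example.com ~/.ssh/id_ed25519_sk_git.pub     >> ~/.ssh/allowed_signers
# allowed_signer_line you@example.com ~/.ssh/id_ed25519_sk_git_nfc.pub >> ~/.ssh/allowed_signers
```

With the file in place, git log --show-signature should report whether a commit was signed by one of the listed keys.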

That keeps Git auth and Git signing on the YubiKey path too, without reusing the same SSH key I use for homelab hosts.

SOPS was the awkward part

The homelab repo already used SOPS with an age recipient. That still makes sense for GitOps, because Argo CD can use the in-cluster age key when rendering Helmfile.

For my local machine, I wanted a hardware-backed route too. The first attempt was age-plugin-yubikey.

age-plugin-yubikey --list
age-plugin-yubikey --identity

age-plugin-yubikey --generate \
  --serial  \
  --name "sops temp" \
  --pin-policy once \
  --touch-policy cached

That failed on my nano because of the PIV management-key setup.

AES protected management key is unsupported by age-plugin-yubikey

Checking the PIV metadata showed why.

ykman --device  piv info

The management key was AES192, stored on the YubiKey and protected by PIN. I could probably have gone deeper into changing the PIV setup, but that was not what I wanted to do while wiring up repo secrets. The goal was to avoid local secret material lying around, not to turn the PIV management key into its own side quest.

So I switched the SOPS hardware-backed path to OpenPGP/GPG.

PGP on the YubiKey

The working SOPS key ended up as a YubiKey-backed OpenPGP key.

gpg --list-secret-keys --keyid-format LONG --with-keygrip
gpg --card-status
ykman openpgp info

The key fingerprint is available from GPG:

gpg --list-secret-keys --keyid-format LONG

After moving the key to the YubiKey, GPG showed it as card-backed (sec> / ssb>), and a disposable SOPS round trip worked.

sops --encrypt \
  --pgp  \
  /tmp/sops-ykgpg-test.yaml > /tmp/sops-ykgpg-test.enc.yaml

sops --decrypt /tmp/sops-ykgpg-test.enc.yaml

Decrypting prompts for the OpenPGP user PIN. Keep in mind that YubiKey PIN retries are limited.

Keeping both age and PGP recipients

I did not replace age with PGP. I added PGP next to age.

The repo-level .sops.yaml now has both recipients.

creation_rules:
  - path_regex: .*\.yaml$
    age: 
    pgp:

The age recipient keeps the existing GitOps path working. The PGP recipient gives me a local YubiKey-backed decrypt path. After changing .sops.yaml, the encrypted file metadata has to be updated.

sops updatekeys secrets/secrets.enc.yaml
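To see which files the path_regex in the creation rule covers, the pattern can be tried against a few paths. This is a rough sketch: grep -E stands in for the Go regular expression engine that SOPS uses, which should behave the same for a pattern this simple, and the helper name and sample paths are made up.

```shell
# Hypothetical helper: does a path match the creation rule's path_regex?
# grep -E is a stand-in here; SOPS evaluates the pattern with Go's regexp.
matches_rule() {
  printf '%s\n' "$1" | grep -Eq '.*\.yaml$'
}

for f in secrets/secrets.enc.yaml values.yml README.md; do
  if matches_rule "$f"; then
    echo "$f: covered by the rule"
  else
    echo "$f: not covered"
  fi
done
```

Anything ending in .yaml gets both recipients, while .yml files fall outside the rule.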

GPG agent cache

The YubiKey OpenPGP path works, but the PIN prompts get old quickly when doing several SOPS operations. I set the GPG agent cache to eight hours.

cat > ~/.gnupg/gpg-agent.conf <<'EOF'
default-cache-ttl 28800
max-cache-ttl 28800
default-cache-ttl-ssh 28800
max-cache-ttl-ssh 28800
EOF

chmod 600 ~/.gnupg/gpg-agent.conf
gpgconf --reload gpg-agent

This only controls how long GPG remembers the PIN locally.

Current shape

SSH access is now mostly a YubiKey plus SSH config problem. SOPS access is still age for the cluster, with a YubiKey-backed PGP recipient for my local decrypt path.

If I have to set this up again, I would do it in this order:

install Homebrew OpenSSH, GnuPG, SOPS, and ykman
pull resident SSH keys with ssh-keygen -K
restore the SSH aliases
import or refresh the GPG card stubs for the OpenPGP key
verify sops --decrypt works with the YubiKey

I did not end up with one YubiKey mechanism for everything. SSH uses resident OpenSSH FIDO keys. SOPS still uses age for the cluster, and PGP for my local YubiKey decrypt path.

NVMe Partitions for ZFS SLOG and L2ARC on TrueNAS

Vince — Sun, 08 Mar 2026 20:34:00 GMT

I bought two 512GB NVMe drives (Patriot P320) for my NAS rebuild. Two NVMe slots, and I wanted a SLOG and L2ARC on my ZFS pool.

I actually did this twice.

The first time the pool was a 4-drive RAIDZ1 and I also added a special (metadata) vdev on the NVMe drives. The pool had been running for over a year without one, and after a week with it I didn't notice a difference, probably because my workload isn't heavy enough to surface it.

The catch is that a special vdev holds real data and if you lose the mirror you lose the entire pool, which is a steep price for a performance gain I couldn't even perceive.

I was already planning to rebuild as RAIDZ2 (added a fifth drive), so I used TrueNAS replication to rebuild and dropped the special vdev. This time I kept it to just SLOG and L2ARC, both of which are safe to lose without taking the pool with them.

Partition layout

Each drive gets two partitions; the rest is left unused.

per drive (512GB Patriot P320):
  p1: 64GB   -> SLOG (mirrored across both drives)
  p2: 250GB  -> L2ARC (not mirrored)
  ~163GB     -> unused
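The leftover figure only adds up once units are taken into account. A "512GB" drive is sold in decimal gigabytes, while sgdisk's G suffix means GiB. A small sketch of the arithmetic, assuming the drive holds exactly 512 * 10^9 bytes and ignoring GPT overhead and alignment (the function name is illustrative):

```shell
# Leftover space per drive, in GiB.
# Assumes sgdisk's "G" suffix means GiB (2^30 bytes) and ignores GPT overhead.
unused_gib() {
  awk -v bytes="$1" -v slog="$2" -v l2arc="$3" \
    'BEGIN { printf "%.0f\n", bytes / (1024 * 1024 * 1024) - slog - l2arc }'
}

# 512 * 10^9 bytes is about 476.8 GiB; minus 64 GiB and 250 GiB.
unused_gib 512000000000 64 250   # prints 163
```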

SLOG buffers sync writes, which is what happens when an app (NFS, databases) waits for confirmation that data hit stable storage before moving on. Without a SLOG, ZFS has to flush all the way to spinning rust before acknowledging. With one, it acknowledges from the NVMe and flushes in the background. If the SLOG dies, ZFS just falls back to writing sync directly to the pool, so it's slower but no data loss.

L2ARC is a read cache that extends the in-memory ARC onto NVMe, so frequently read blocks that don't fit in RAM get served from SSD instead of spinning drives. Same deal if it dies, reads just go back to the pool and nothing is lost.

Partitioning the drives

The DXP8800 Plus ships with a 128GB NVMe (nvme1n1) that I wiped and installed TrueNAS on.

The two drives I added are nvme0n1 and nvme2n1. Worth checking lsblk first so you don't accidentally wipe the boot drive.

lsblk -d -o NAME,SIZE,ROTA,MODEL /dev/nvme0n1 /dev/nvme2n1

sudo smartctl -H -i -A /dev/nvme0n1
sudo smartctl -H -i -A /dev/nvme2n1

Both drives get identical partitions:

sudo sgdisk -Z /dev/nvme0n1
sudo sgdisk \
  -n 1:0:+64G  -t 1:bf01 -c 1:slog \
  -n 2:0:+250G -t 2:bf01 -c 2:l2arc \
  /dev/nvme0n1

sudo sgdisk -Z /dev/nvme2n1
sudo sgdisk \
  -n 1:0:+64G  -t 1:bf01 -c 1:slog \
  -n 2:0:+250G -t 2:bf01 -c 2:l2arc \
  /dev/nvme2n1

-Z wipes existing partition tables. -n creates a partition (number:start:end, where an end of +64G means 64 GiB past the start). -t sets the type (bf01 is Solaris/ZFS). -c gives it a label. Verify the result with sgdisk -p.

Adding them to the pool

I added the vdevs one at a time so I could verify after each. The -f flag forces the add if the drives had previous ZFS labels.

sudo zpool add -f swamp log mirror /dev/nvme0n1p1 /dev/nvme2n1p1
sudo zpool status swamp

sudo zpool add -f swamp cache /dev/nvme0n1p2 /dev/nvme2n1p2
sudo zpool status swamp

The SLOG is a mirror because ideally you want it redundant while it's in use: a single drive failure shouldn't degrade sync write performance. The L2ARC is just cache with no mirror, since it's throwaway read data anyway.

After both adds, zpool status looked like this:

  pool: swamp
  ...
    raidz2-0        (5 drives)
    logs
      mirror-1      (nvme0n1p1 + nvme2n1p1)
    cache
      nvme0n1p2
      nvme2n1p2

NFS tuning

While testing NFS read/write speeds over 10G, I tried bumping the network buffer sizes to see if I got any improvement. I set them through midclt, TrueNAS Scale's middleware CLI:

sudo midclt call tunable.create \
  '{"var": "net.core.rmem_max", "value": "16777216", "type": "SYSCTL", "enabled": true}'

sudo midclt call tunable.create \
  '{"var": "net.core.wmem_max", "value": "16777216", "type": "SYSCTL", "enabled": true}'

sudo midclt call tunable.create \
  '{"var": "sunrpc.tcp_slot_table_entries", "value": "128", "type": "SYSCTL", "enabled": true}'

You can do the same thing in the GUI under System > Advanced > Sysctl. Either way they persist through reboots. I have been trying to use commands more often because they are easy to reproduce on systems that lack, say, a Terraform provider. The same rmem_max, wmem_max, and sunrpc.tcp_slot_table_entries values need to be set on the NFS clients too (the Proxmox nodes in my case), since both ends need the larger buffers.
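On the client side, one way to carry the same values is a sysctl drop-in. The sketch below only prints the file contents. The drop-in file name in the usage note is a made-up example, and the sunrpc key may only exist once the sunrpc kernel module is loaded, so treat this as a starting point rather than a tested recipe.

```shell
# Print a sysctl drop-in with the same three values used on the NAS.
nfs_client_sysctl() {
  cat <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
sunrpc.tcp_slot_table_entries = 128
EOF
}

nfs_client_sysctl
```

To apply it on a client, something like nfs_client_sysctl | sudo tee /etc/sysctl.d/90-nfs-buffers.conf followed by sudo sysctl --system would load the values.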

If / when a drive dies

If one NVMe fails, the mirrored SLOG degrades but keeps working and the L2ARC just loses that cache device. The replacement process would look something like this:

sudo zpool replace swamp /dev/old_nvmeXn1p1 /dev/new_nvmeXn1p1

sudo zpool remove swamp /dev/old_nvmeXn1p2
sudo zpool add swamp cache /dev/new_nvmeXn1p2

ZFS resilvers the SLOG mirror in place. L2ARC devices are simply removed and re-added.

MikroTik VLANs are six commands from the CLI

Vince — Sat, 28 Feb 2026 21:53:00 GMT

I recently picked up a MikroTik CRS309-1G-8S+ for 10G switching between my Proxmox nodes, NAS, and OPNsense box. Eight SFP+ ports, hardware-offloaded switching, RouterOS.

This replaced two Netgear 1G switches (an 8-port and a 5-port) I'd daisy-chained together, so it was a jump from 1G to 10G for the whole homelab.

Should be straightforward. Well, it was, once I stopped fighting the GUI.

The WebFig GUI is powerful and can do quite a lot, and that's probably the problem. VLAN configuration throws you into bridge ports, VLAN tables, and filtering toggles spread across multiple tabs with no obvious order of operations. After an hour of clicking around I went looking at MikroTik's own wiki docs for VLANs, and even those use CLI commands in their examples. That was the hint. I SSH'd in instead.

After that things started to make sense. RouterOS has a hierarchical CLI that maps directly to the config structure. /interface/bridge/vlan is exactly where VLAN entries live, /interface/bridge/port is where port settings go. Tab completion and ? show you what's available at every level.

The whole VLAN setup, start to finish, was this:

/interface/bridge/port remove [find interface=ether1]

/interface/bridge/vlan add bridge=bridge tagged=sfp-sfpplus1,sfp-sfpplus2,sfp-sfpplus3,sfp-sfpplus4,sfp-sfpplus8 untagged=sfp-sfpplus5 vlan-ids=10
/interface/bridge/vlan add bridge=bridge tagged=sfp-sfpplus1,sfp-sfpplus2,sfp-sfpplus3,sfp-sfpplus4,sfp-sfpplus8 vlan-ids=30
/interface/bridge/vlan add bridge=bridge tagged=sfp-sfpplus1,sfp-sfpplus2,sfp-sfpplus3,sfp-sfpplus4,sfp-sfpplus8 vlan-ids=50

/interface/bridge/port set [find interface=sfp-sfpplus5] pvid=10 frame-types=admit-only-untagged-and-priority-tagged

/interface/bridge set bridge vlan-filtering=yes

That's it. Three VLANs (10, 30, 50) trunked across five ports (each Proxmox host runs VMs on different VLANs, so trunking all three over one SFP+ link beats wasting a NIC per VLAN), one access port for the AP on VLAN 10, and VLAN filtering enabled.

The first command pulls ether1 out of the bridge to keep a dedicated management interface. Misconfigure VLANs and you can still get back in through the management port.

The order matters here. You define the VLAN entries before enabling filtering. If you flip vlan-filtering=yes first, the bridge starts enforcing rules against an empty VLAN table and every port goes dead. Not a fun way to learn that lesson (though the management port would save you).

I also disabled the two spare ports (to be used for LAGG experiments later) since they still have a default PVID of 1 even with no VLAN membership:

/interface/bridge/port remove [find interface=sfp-sfpplus6]
/interface/bridge/port remove [find interface=sfp-sfpplus7]
/interface disable sfp-sfpplus6
/interface disable sfp-sfpplus7

After cabling everything up, here is a quick iperf3 between one of the Proxmox nodes and the NAS:

$ iperf3 -c 192.168.50.43
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.9 GBytes  9.38 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  10.9 GBytes  9.38 Gbits/sec                  receiver

9.38 Gbits/sec, zero retransmits. About as good as 10G gets. The H flag on /interface/bridge/port print confirms hardware offload is active, so the switch ASIC forwards at full line rate without touching the CPU.
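For context on why 9.38 is close to the ceiling, here is a back-of-the-envelope estimate of the best-case TCP goodput on a 10 Gbit/s link. It assumes a 1500-byte MTU, IPv4, and TCP timestamps. None of those were measured here, so the numbers are illustrative.

```shell
# Best-case TCP goodput in Gbit/s for a given link speed.
# Assumptions: 1500-byte MTU, IPv4, TCP timestamps (12 bytes of options).
#   payload per frame = 1500 - 20 (IP) - 20 (TCP) - 12 (timestamps) = 1448 bytes
#   wire size = 1500 + 14 (Ethernet) + 4 (FCS) + 8 (preamble) + 12 (gap) = 1538 bytes
# Optional second argument: extra bytes per frame, e.g. 4 for a VLAN tag.
max_goodput_gbps() {
  awk -v link="$1" -v extra="${2:-0}" \
    'BEGIN { printf "%.2f\n", link * 1448 / (1538 + extra) }'
}

max_goodput_gbps 10      # untagged frames: prints 9.41
max_goodput_gbps 10 4    # with a 4-byte VLAN tag: prints 9.39
```

Under those assumptions the measured 9.38 Gbits/sec sits within a fraction of a percent of what the wire can carry.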

I will admit, coming from two 1G Netgear switches, seeing that number for the first time was pretty satisfying.

A few RouterOS CLI quirks worth knowing: there's no ls, you use print (or print detail for all properties). .. goes up one level, / goes to root. [find ...] locates items by property rather than index number, so your commands survive reordering. /export dumps the current section as re-pasteable commands, which is how I built these notes.

The MikroTik SSH experience turned out to be genuinely enjoyable compared to the Netgear web UIs (or WebFig, for that matter), where VLAN config meant several pages of dropdowns and hoping the apply button actually saved. The config hierarchy makes sense, the commands read clearly, and you can get a full VLAN setup done in under five minutes once you know the pattern.

A Year of Homelabbing

Vince — Sun, 04 Jan 2026 23:27:32 GMT

This is a year of building, breaking, and rebuilding my homelab.

Before the homelab

I never liked minikube. A potentially bold statement to make, but something about it felt too abstracted.

I remember installing it, running minikube addons enable ingress, and honestly feeling like something was off. What's actually happening here? What can I mess around with? (For local Kubernetes now, Rancher Desktop is a much better starting point imho.)

So I went straight to kubeadm. First on my Mac, then when I got dedicated hardware. Bash scripts that SSH'd in and ran kubeadm commands. Not elegant, but it taught me what actually happens when you bootstrap a cluster: certificates, etcd, kubelet config etc.

Eventually went HA with HAProxy and Keepalived for a floating VIP.

The Ansible detour

At some point I tried Ansible. Wrote playbooks for HAProxy and kubeadm setup.

It lasted maybe two weeks.

Ansible is good at what it does, but in my case it felt cumbersome and too abstracted for something I figured a Makefile could do with less convolution. So I went back to Makefiles.

Multiple clusters

The old repo had three cluster approaches running simultaneously:

Atlas: The kubeadm cluster, my original setup
Prism: A K3s cluster, the idea being that it hosts always-available components in case I shut down some machines in the other cluster.
Talos: Which I later migrated to and which is the currently active cluster

Each had its own directory, its own tooling, its own domain (*.atlas.home.mrdvince.me, *.prism.home.mrdvince.me, *.talos.home.mrdvince.me).

This did create maintenance overhead though, and to be honest, one doesn't really need three clusters in a homelab.

Networking

Networking ended up being the most stable part once it was set up right.

Got a managed switch and OPNsense as the router. Set up VLANs for segmentation: Proxmox plus other non-k8s VMs on one, storage on another, cluster traffic on a third, and home devices on a fourth.

With firewall rules configured and CrowdSec added for intrusion detection, the main config was mostly done. Now I just add a new VLAN when I need one, among other operational changes.

Tailscale ties it together for access from anywhere.

On the Kubernetes side, started with Cilium and MetalLB. Eventually dropped MetalLB and let Cilium handle LoadBalancer IPs directly. Dropped kube-proxy too, letting Cilium do everything with eBPF.

Storage: the backbone

TrueNAS Scale became the foundation. Started with a USB controller passthrough setup (Day 13-15 in the blog) which was more involved than expected. ZFS with RAIDZ1, eventually extended with VDEV extension (Day 25).

This worked very well for a long time. I have since switched to a UGREEN NAS, installed TrueNAS Scale on it, and moved to RAIDZ2.

The storage architecture went through iterations:

Local storage only (early days)
NFS mounts from TrueNAS
Longhorn for distributed block storage
MinIO for S3-compatible object storage
RustFS replaced MinIO (current)

Longhorn on Talos deserves its own mention. Talos is immutable, which means you can't just install packages. Getting Longhorn to work required Talos extensions (iscsi-tools, util-linux-tools) and kubelet mount patches.

The storage layer now handles: Terraform state, database backups via CloudNativePG, GitLab artifacts, container registry storage, and anything else that needs persistence.

GitOps

ArgoCD with the app-of-apps pattern was and still is the deployment model.

The pattern: push to git, ArgoCD syncs, applications deploy.

For secrets, played with sealed-secrets briefly but settled on SOPS with age encryption. It works with the Helmfile plugin to decrypt when ArgoCD applies.

Day 26 and Day 30 in the blog cover the secrets journey in detail.

Talos won

Day 32 came after I got back from KubeCon London. Talos had been on my list of things to try for a month before the conference, and I finally decided to give it a go. I set up Talos in HA mode and experimented with it for a while.

I started figuring out extensions, and got Tailscale running on the Talos nodes too. The reason I needed Tailscale on the nodes was to join an instance running in the cloud to my cluster, so I set out to figure out how to join non-Talos nodes to the cluster. The goal was to try out Kueue, but I never really got to it.

Talos is opinionated in ways that initially frustrated me. There is no SSH, and configuring a machine through an API was new to me. But those constraints made sense once I wrapped my head around them. The cluster is reproducible.

I wrote a Talos module that handles cluster setup and upgrades. The entire setup is now managed by Terragrunt. I can destroy and recreate the cluster and know exactly what I'll get.

The current setup runs Talos v1.12. Control plane on one Proxmox node (avalon), workers on another (elysium). One cluster. Maybe a second for testing.

What actually changed

The main change over the whole span was going from multiple active clusters managed with Makefiles to a single active cluster. I am currently migrating all the apps from the old Talos cluster to the new config.

Tooling shifted from a combination of Makefiles, shell scripts and Terragrunt to just Terragrunt. Charts moved from ChartMuseum to GitLab's package registry. Container images now sync to a private registry via GitLab CI.

App structure is mostly the same, just switched to ApplicationSets in ArgoCD for discovery.

The rebuild

I'm rebuilding the homelab again now. Not because something broke, but because a second pass lets me incorporate everything year-ago me didn't know. Plus, it's homelabbing after all.

Now I know which apps I actually use, which monitoring metrics matter, which complexity was necessary. Goal this time: two clusters max. One main, one playground.

Current state

The stack as it stands:

Infrastructure: Proxmox VE, Terragrunt/OpenTofu
Kubernetes: Talos, Cilium CNI, Traefik ingress
GitOps: ArgoCD with Helmfile plugin
Observability: Prometheus, Grafana, Loki, Tempo, Alloy, Pyroscope
Storage: Longhorn (block), CloudNativePG (Postgres), RustFS (S3), NFS (csi-driver-nfs)
Auth: Authentik with OIDC for everything
Secrets: SOPS with age

Apps are still being migrated over from the old cluster. Access is behind Tailscale.

I also decided to make the repo public: github.com/mrdvince/homelab

Closing the 100 Days

Vince — Sun, 04 Jan 2026 22:31:15 GMT

Hey folks, it's been a while since I've written anything here.

I started the 100 Days of Homelab challenge back in January 2025. Made it to Day 39, then stopped.

Now it's 2026 and I figured I should close this out properly before moving on.

When I started, I'd already been tinkering for a couple of months. The challenge was meant to force documentation, to stop just doing things and actually write about them.

Finding things to write about wasn't the issue. The issue was that posting "still stress testing my NAS" or "still debugging why this secret won't decrypt" felt like nothing much.

The work wasn't done yet, so the writing felt inconsequential.

There's also the reality of finishing something at 11pm and choosing between writing about it or going to bed because work exists tomorrow.

Bed usually won.

Some people write fast and can bang out a post in 20 minutes (at least in my head that's the perception). That's not me. Turning my rough notes into something readable takes time I didn't always have.

So I grouped days. "Day 5-6" then "Day 11-12". Then weeks would pass because forcing out something half-baked felt pointless.

Thinking about it, I think it's similar to journaling. Some people can journal daily and it works.

I've journaled for five years but only when I feel like it. Daily never stuck. Same angle here. Forcing the cadence created friction that eventually won.

Most of those posts did become very useful references though, especially the Talos extensions guide and the one on joining non-Talos nodes to a Talos cluster.

39 days over 6 months isn't 100 days, but it wasn't nothing either, so: series closed.

I'm not stopping writing though. I'm just dropping the daily format, which I felt guilty about not following.

See this post covering a year of building, rebuilding, and what actually changed.

Joining a non-Talos node to a Talos cluster (Day 39)

Vince — Sun, 13 Jul 2025 08:57:22 GMT

I have been meaning to figure out a way to get a non-Talos node to join my Talos cluster for a while, because I have this idea of running GPU machines from the cloud, and I would like them to show up as regular nodes.

Thanks to this GitHub issue, it ended up being surprisingly easy to do.

The taskfile

Created a Taskfile to automate the entire process. Here are the key parts:

version: '3'

vars:
  VIP: '{{.VIP | default "10.30.30.155"}}'
  TARGET: '{{.TARGET | default "10.30.10.101"}}'
  SSH_KEY: '{{.SSH_KEY | default "~/.ssh/devkey"}}'
  KUBE_VERSION: '{{.KUBE_VERSION | default ""}}' # Empty means auto-detect

tasks:
  join-ubuntu:
    desc: Join Ubuntu node to Talos cluster
    deps:
      - validate
    cmds:
      - task: prepare-node
      - task: copy-configs
      - task: setup-haproxy
      - task: start-kubelet
      - task: verify

Node prep

The prepare-node task sets up the prerequisites and installs needed packages:

prepare-node:
  desc: Prepare Ubuntu node
  cmds:
    - |
      {{.SSH_CMD}} 'sudo bash -s' << 'EOF'
      # Get node IP (adjust interface name as needed)
      NODE_IP=$(ip -4 addr show enp6s18 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
      swapoff -a
      sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
      modprobe overlay
      modprobe br_netfilter
      cat > /etc/sysctl.d/k8s.conf < /etc/containerd/config.toml
      systemctl restart containerd

      # Install Kubernetes components
      if [ -n "{{.KUBE_VERSION}}" ]; then
        KUBE_VER="{{.KUBE_VERSION}}"
      else
        KUBE_VER=$(curl -L -s https://dl.k8s.io/release/stable.txt | awk 'BEGIN { FS="." } { printf "%s.%s", $1, $2 }')
      fi
      
      mkdir -p /etc/apt/keyrings
      curl -fsSL https://pkgs.k8s.io/core:/stable:/${KUBE_VER}/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --yes
      echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/${KUBE_VER}/deb/ /" > /etc/apt/sources.list.d/kubernetes.list

      apt-get update
      apt-get install -y kubelet kubeadm kubectl
      apt-mark hold kubelet kubeadm kubectl

      crictl config \
          --set runtime-endpoint=unix:///run/containerd/containerd.sock \
          --set image-endpoint=unix:///run/containerd/containerd.sock
      cat > /etc/default/kubelet <
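The version auto-detection in prepare-node can be tried in isolation. The sketch below reuses the same awk expression on a fixed tag instead of fetching stable.txt; the function name is illustrative.

```shell
# Reduce a full release tag to "major.minor", using the same awk
# expression as the KUBE_VER pipeline in prepare-node.
major_minor() {
  printf '%s\n' "$1" | awk 'BEGIN { FS="." } { printf "%s.%s", $1, $2 }'
}

major_minor v1.33.2   # prints v1.33
echo
```

Auto-detect follows whatever upstream currently marks as stable, which can be newer than the cluster. Passing KUBE_VERSION pins the repo to a specific minor instead.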

Talos configs

Here we get the needed files using talosctl and copy them to the Ubuntu machine:

copy-configs:
  desc: Copy Kubernetes configuration from Talos
  vars:
    OUT_DIR: _out
  cmds:
    - mkdir -p {{.OUT_DIR}}
    
    # Copy files from Talos
    - talosctl -n {{.VIP}} cat /etc/kubernetes/kubeconfig-kubelet > {{.OUT_DIR}}/kubelet.conf
    - talosctl -n {{.VIP}} cat /etc/kubernetes/bootstrap-kubeconfig > {{.OUT_DIR}}/bootstrap-kubelet.conf
    - talosctl -n {{.VIP}} cat /etc/kubernetes/pki/ca.crt > {{.OUT_DIR}}/ca.crt
    
    - 'perl -pi -e "s|server:.*|server: https://{{.VIP}}:6443|g" {{.OUT_DIR}}/kubelet.conf'
    - 'perl -pi -e "s|server:.*|server: https://{{.VIP}}:6443|g" {{.OUT_DIR}}/bootstrap-kubelet.conf'
    
    - |
      clusterDomain=$(talosctl -n {{.VIP}} get kubeletconfig -o jsonpath="{.spec.clusterDomain}")
      clusterDNS=$(talosctl -n {{.VIP}} get kubeletconfig -o jsonpath="{.spec.clusterDNS}")
      
      cat > {{.OUT_DIR}}/config.yaml <

HAProxy

Talos uses KubePrism, a built-in load balancer that listens on 127.0.0.1:7445 and forwards to the API server.

When you join a non-Talos node, it doesn't have KubePrism, so components like Cilium fail to connect, and the node stays in a NotReady state.

The solution for this, without disabling KubePrism, is to use HAProxy on the Ubuntu node to mimic KubePrism. The HAProxy setup:

setup-haproxy:
  desc: Install and configure HAProxy to mimic KubePrism
  cmds:
    - |
      {{.SSH_CMD}} 'sudo bash -s' << 'EOF'
      apt-get update -qq
      apt-get install -y haproxy
      
      # Check if KubePrism configuration already exists
      if ! grep -q "frontend kubeprism" /etc/haproxy/haproxy.cfg; then
        echo "Adding KubePrism configuration to HAProxy..."
        cat >> /etc/haproxy/haproxy.cfg <

Usage

Join an Ubuntu node with auto-detected Kubernetes version:

VIP=10.30.30.155 TARGET=10.30.10.101 task join-ubuntu

Or specify a version to match the Talos cluster:

VIP=10.30.30.155 TARGET=10.30.10.101 KUBE_VERSION=v1.33 task join-ubuntu

Check status:

# Check node status
kubectl get nodes

# Check logs
TARGET=10.30.10.101 task logs
TARGET=10.30.10.101 SERVICE=haproxy task logs

Getting the nodes now shows that the Ubuntu node has joined the cluster.

NAME          STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
daedalus-01   Ready    control-plane   71d   v1.32.3   10.30.30.141   <none>        Talos (v1.10.4)      6.12.31-talos      containerd://2.0.5
daedalus-02   Ready    control-plane   71d   v1.32.3   10.30.30.142   <none>        Talos (v1.10.4)      6.12.31-talos      containerd://2.0.5
daedalus-03   Ready    control-plane   71d   v1.32.3   10.30.30.143   <none>        Talos (v1.10.4)      6.12.31-talos      containerd://2.0.5
daedalus-21   Ready    <none>          71d   v1.32.3   10.30.30.134   <none>        Talos (v1.10.4)      6.12.31-talos      containerd://2.0.5
daedalus-22   Ready    <none>          71d   v1.32.3   10.30.30.135   <none>        Talos (v1.10.4)      6.12.31-talos      containerd://2.0.5
ubuntu        Ready    <none>          77s   v1.33.2   10.30.10.101   <none>        Ubuntu 24.04.1 LTS   6.8.0-63-generic   containerd://1.7.27

HAProxy on the node intercepts connections to localhost:7445 and forwards them to the Talos k8s API server, making it work with the existing cluster's KubePrism configuration.
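
A quick way to confirm that something is actually listening on the KubePrism port on the Ubuntu node is a plain TCP connect test. This is a minimal sketch, not part of the original Taskfile; the function name `can_connect` is made up for illustration, and it only checks that the port accepts connections, not that the API server behind it is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on the node, `can_connect("127.0.0.1", 7445)` should return True once HAProxy is up.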



Tailscale on Talos (Day 38)
Vince — Sat, 12 Jul 2025 23:10:14 GMT
Tailscale Extension
First, add the Tailscale system extension to your Talos configuration:
# extensions.yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/tailscale
Build a custom Talos image with the extension:
# Generate custom image with Tailscale extension
curl -X POST --data-binary @extensions.yaml https://factory.talos.dev/schematics

# Returns a schematic ID like: 8cdf4cd0a3a9fa4771aab65437032804940f2115b1b1ef6872274dde261fa319
Upgrade your Talos nodes to use the custom image:
# Upgrade node with the new image (talosctl manages the Talos OS lifecycle)
talosctl upgrade --preserve --nodes 10.30.30.155 \
  --image factory.talos.dev/installer/8cdf4cd0a3a9fa4771aab65437032804940f2115b1b1ef6872274dde261fa319:v1.10.4
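
The installer image reference is just the schematic ID plus the Talos version. A small sketch of gluing the two together, assuming the factory responds with a JSON body of the form {"id": "..."} and that schematic IDs are 64 hex characters like the one above; the function names are hypothetical.

```python
import json


def schematic_id_from_response(body: str) -> str:
    """Pull the schematic ID out of the JSON body returned by the factory."""
    return json.loads(body)["id"]


def installer_image(schematic_id: str, talos_version: str) -> str:
    """Build the installer image reference passed to talosctl upgrade --image."""
    # Assumption: IDs are 64 lowercase hex characters, like the example ID in this post.
    if len(schematic_id) != 64 or any(c not in "0123456789abcdef" for c in schematic_id):
        raise ValueError("schematic ID should be 64 lowercase hex characters")
    if not talos_version.startswith("v"):
        raise ValueError("Talos version should look like v1.10.4")
    return f"factory.talos.dev/installer/{schematic_id}:{talos_version}"
```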
Tailscale Configuration
Configure Tailscale with your auth key (SOPS-encrypted for security):
# tailscale-config.yaml (decrypted view)
apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: tailscale
environment:
    - TS_AUTHKEY=tskey-auth-  # Your Tailscale auth key
    - TS_EXTRA_ARGS=--accept-routes --reset  # Accept subnet routes and reset on conflicts
Apply the configuration to your node:
# Patch machine config to add Tailscale configuration
talosctl -n 10.30.30.155 -e 10.30.30.155 patch mc -p @tailscale-config.yaml
After applying, Tailscale will start automatically and connect your Talos node to your tailnet. The node will appear in your Tailscale admin console with its hostname.
SOPS Encryption
The actual tailscale-config.yaml is SOPS-encrypted to protect the auth key:
# Encrypt your config
sops -e tailscale-config.yaml > tailscale-config.enc.yaml

# Decrypt when applying
sops -d tailscale-config.enc.yaml | talosctl -n <node-ip> patch mc -p -
This keeps the Tailscale auth key secure when it is checked into git.


Terraform Provider in Rust (Day 36-37)
Vince — Sat, 05 Jul 2025 23:02:00 GMT
I was curious about Terraform providers and wanted to explore whether we could write one in a language other than Go (the language of choice was Rust, for no particular reason). This also coincided with my wanting to see what using Claude Code was like.
So I chose to try something I had been thinking about: "let's get rid of those Terragrunt hooks for OIDC configuration and use Terraform resources."
Yes, another Proxmox provider exists that is more feature-rich than the Telmate Proxmox provider; however, I did this solely as an experiment.
The Framework First
Started by building tfplug (needs a better name), which is a framework that implements the Terraform Plugin Protocol v6.9. This handles all the gRPC communication and type conversions so providers can focus on their logic.
The framework handles:
Schema builders and Dynamic value handling
Resource and data source traits with full lifecycle support
Plan modifiers, validators, and defaults
Error handling with diagnostics
The framework exposes these core traits:
pub trait Provider: Send + Sync {
    fn type_name(&self) -> &str;

    async fn configure(
        &mut self,
        ctx: Context,
        request: ConfigureProviderRequest,
    ) -> ConfigureProviderResponse;

    fn resources(&self) -> HashMap;
    fn data_sources(&self) -> HashMap;
    // Plus metadata, schema, validate, etc.
}
And resource traits like:
pub trait Resource: Send + Sync {
    fn type_name(&self) -> &str;

    async fn schema(
        &self,
        ctx: Context,
        request: ResourceSchemaRequest,
    ) -> ResourceSchemaResponse;

    async fn create(
        &self,
        ctx: Context,
        request: CreateResourceRequest,
    ) -> CreateResourceResponse;

    async fn read(
        &self,
        ctx: Context,
        request: ReadResourceRequest,
    ) -> ReadResourceResponse;

    async fn update(
        &self,
        ctx: Context,
        request: UpdateResourceRequest,
    ) -> UpdateResourceResponse;

    async fn delete(
        &self,
        ctx: Context,
        request: DeleteResourceRequest,
    ) -> DeleteResourceResponse;
    // Plus metadata, validate
}
The Proxmox Provider
With the framework handling the protocol, the Proxmox provider just implements the traits:
pub struct ProxmoxProvider {
    client: Option,
}

impl Provider for ProxmoxProvider {
    fn type_name(&self) -> &str {
        "proxmox"
    }

    fn resources(&self) -> HashMap {
        let mut resources = HashMap::new();

        // Register realm resource
        resources.insert(
            "proxmox_realm".to_string(),
            Box::new(|| Box::new(RealmResource::new())),
        );

        // Register VM resource
        resources.insert(
            "proxmox_qemu_vm".to_string(),
            Box::new(|| Box::new(QemuVmResource::new())),
        );

        resources
    }
}
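
The registration above maps a resource type name to a factory closure, so the framework can build a fresh resource instance whenever Terraform asks for one. The same idea, sketched in Python purely as an illustration: the class names mirror the Rust snippet, but this is not tfplug's API and `new_resource` is a hypothetical helper.

```python
from typing import Callable, Dict


class Resource:
    """Stand-in for the Resource trait."""


class RealmResource(Resource):
    pass


class QemuVmResource(Resource):
    pass


# A factory is any zero-argument callable that returns a new resource.
ResourceFactory = Callable[[], Resource]


def resources() -> Dict[str, ResourceFactory]:
    """Map Terraform type names to factories, like Provider::resources above."""
    return {
        "proxmox_realm": RealmResource,
        "proxmox_qemu_vm": QemuVmResource,
    }


def new_resource(type_name: str) -> Resource:
    """Look up the factory for a type name and build a fresh instance."""
    factories = resources()
    if type_name not in factories:
        raise KeyError(f"unknown resource type: {type_name}")
    return factories[type_name]()
```

Storing factories rather than instances means each request gets its own resource object, so no state leaks between calls.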
Resources get provider data through a separate trait:
pub trait ResourceWithConfigure: Resource {
    fn configure(&mut self, ctx: Context, data: Arc);
}
And then in the realm resource:
impl ResourceWithConfigure for RealmResource {
    fn configure(&mut self, _ctx: Context, data: Arc) {
        if let Some(provider_data) = data.downcast_ref::() {
            self.provider_data = Some(provider_data.clone());
        }
    }
}
Realm Resource
The schema definition shows all the OIDC fields we needed:
async fn schema(&self, _ctx: Context, _request: ResourceSchemaRequest) -> ResourceSchemaResponse {
    let schema = SchemaBuilder::new()
        .version(0)
        .description("Manages authentication realms in Proxmox VE")
        .attribute(
            AttributeBuilder::new("realm", AttributeType::String)
                .description("The realm identifier")
                .required()
                .build(),
        )
        .attribute(
            AttributeBuilder::new("type", AttributeType::String)
                .description("The authentication type")
                .required()
                .build(),
        )
        .attribute(
            AttributeBuilder::new("issuer_url", AttributeType::String)
                .optional()
                .build(),
        )
        // ... more OIDC fields
        .build();

    ResourceSchemaResponse { schema, diagnostics: vec![] }
}
The CRUD operations integrate with the Proxmox API:
async fn create(
      &self,
      _ctx: Context,
      request: CreateResourceRequest,
  ) -> CreateResourceResponse {
      let mut diagnostics = vec![];

      let provider_data = match &self.provider_data {
          Some(data) => data,
          None => {
              diagnostics.push(Diagnostic::error(
                  "Provider not configured",
                  "Provider data was not properly configured",
              ));
              return CreateResourceResponse {
                  new_state: request.planned_state,
                  private: vec![],
                  diagnostics,
              };
          }
      };

      // Extract realm configuration from request
      match self.extract_realm_config(&request.config) {
          Ok(realm_config) => {
              // Build and send create request to API
              let create_request = CreateRealmRequest {
                  realm: realm_config.realm.clone(),
                  realm_type: realm_config.realm_type.clone(),
                  issuer_url: realm_config.issuer_url.clone(),
                  client_id: realm_config.client_id.clone(),
                  client_key: realm_config.client_key.clone(),
                  // ... other fields
              };

              match provider_data.client.access().realms().create(&create_request).await {
                  Ok(()) => CreateResourceResponse {
                      new_state: request.planned_state,
                      private: vec![],
                      diagnostics,
                  },
                  Err(e) => {
                      diagnostics.push(Diagnostic::error(
                          "Failed to create realm",
                          format!("API error: {}", e),
                      ));
                      CreateResourceResponse {
                          new_state: request.planned_state,
                          private: vec![],
                          diagnostics,
                      }
                  }
              }
          }
          Err(diag) => {
              diagnostics.push(diag);
              CreateResourceResponse {
                  new_state: request.planned_state,
                  private: vec![],
                  diagnostics,
              }
          }
      }
  }
And now in Terraform:
resource "proxmox_realm" "authentik" {
  realm             = "authentik"
  type              = "openid"
  issuer_url        = "https://auth.example.com/application/o/proxmox/"
  client_id         = var.client_id
  client_key        = var.client_key
  username_claim    = "username"
  autocreate        = true
  default           = true
}
No more hooks! The realm resource handles all the OIDC configuration directly.
VM Resource
Since we were building a provider anyway, why stop at realms? The VM resource supports full QEMU configuration with a comprehensive schema:

async fn schema(
    &self,
    _ctx: Context,
    _request: ResourceSchemaRequest,
) -> ResourceSchemaResponse {
    let schema = SchemaBuilder::new()
        .version(0)
        .description("Manages QEMU/KVM virtual machines in Proxmox VE")
        .attribute(
            AttributeBuilder::new("node", AttributeType::String)
                .description("The name of the Proxmox node where the VM will be created")
                .required()
                .build(),
        )
        .attribute(
            AttributeBuilder::new("vmid", AttributeType::Number)
                .description("The VM identifier")
                .required()
                .build(),
        )
        .attribute(
            AttributeBuilder::new("name", AttributeType::String)
                .description("The VM name")
                .required()
                .build(),
        )
        .attribute(
            AttributeBuilder::new("cores", AttributeType::Number)
                .description("Number of CPU cores per socket")
                .optional()
                .build(),
        )
        .attribute(
            AttributeBuilder::new("sockets", AttributeType::Number)
                .description("Number of CPU sockets")
                .optional()
                .build(),
        )
        .attribute(
            AttributeBuilder::new("memory", AttributeType::Number)
                .description("Memory size in MB")
                .optional()
                .build(),
        )
        // ... more fields

        .build();

    ResourceSchemaResponse {
        schema,
        diagnostics: vec![],
    }
}
And in Terraform:
resource "proxmox_qemu_vm" "control_plane" {
  node      = "mjolnir"
  vmid      = 9001
  name      = "k8s-master"
  cores     = 2
  memory    = 4096
  
  scsi0     = "local-lvm:20,format=raw"
  net0      = "virtio,bridge=vmbr0,tag=30"
  
  ciuser    = "ubuntu"
  sshkeys   = file("~/.ssh/id_rsa.pub")
  ipconfig0 = "ip=dhcp"
  
  start     = true  # Auto-start after creation
}
Current State
Is it production-ready? No (and it never will be; please use something like Telmate or bpg/proxmox). Does it work for a homelab? Pretty much. (I have slowly started dogfooding it for the realms and VMs.)
The code is available at https://github.com/mrdvince/surtr


Proxmox OIDC integration and terragrunt hooks (Day 36)
Vince — Sun, 01 Jun 2025 21:09:51 GMT
Turns out the Telmate Proxmox provider doesn't have resource support for creating authentication realms or configuring OIDC. 
But since Proxmox has a REST API, I could work around the provider limitations, and so I ended up with:
terraform {
  source = "."

  after_hook "create_realm" {
    commands = ["apply"]
    execute = ["bash", "-c", <<-BASH
        # Proxmox returns a 500 if the realm doesn't exist, so just check for a 200
        STATUS=$(curl -k -s -o /dev/null -w "%%{http_code}" \
        "${local.pm_api_url}/access/domains/authentik" \
        -H "Authorization: PVEAPIToken=${local.pm_api_token_id}=${local.pm_api_token_secret}")
        
        if [ "$STATUS" = "200" ]; then
            echo "Realm 'authentik' already exists"
        else
            echo "Creating realm 'authentik'"
            curl -k -X POST "${local.pm_api_url}/access/domains" \
                -H "Authorization: PVEAPIToken=${local.pm_api_token_id}=${local.pm_api_token_secret}" \
                -H "Content-Type: application/x-www-form-urlencoded" \
                --data-urlencode "realm=authentik" \
                --data-urlencode "type=openid" \
                --data-urlencode "issuer-url=${local.issuerurl}" \
                --data-urlencode "client-id=${dependency.authentik.outputs.client_id["prx-avalon"]}" \
                --data-urlencode "client-key=${dependency.authentik.outputs.client_secret["prx-avalon"]}" \
                --data-urlencode "username-claim=username" \
                --data-urlencode "autocreate=1" \
                --data-urlencode "default=1"
        fi
    BASH
    ]
  }

  before_hook "delete_realm" {
    commands = ["destroy"]
    execute = ["bash", "-c", <<-BASH
      curl -k -X DELETE "${local.pm_api_url}/access/domains/authentik" \
        -H "Authorization: PVEAPIToken=${local.pm_api_token_id}=${local.pm_api_token_secret}"
    BASH
    ]
  }
}
The after_hook runs after apply and creates the OIDC realm in Proxmox if it doesn't exist. 
The before_hook cleans it up on destroy. The client ID and secret come from the Authentik module outputs, which keep everything connected.
It's not pretty, but it works.
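
The hook boils down to "look the realm up, and only create it if the lookup does not return 200". A sketch of that logic with the HTTP calls injected as plain callables, so it runs without a Proxmox server; `ensure_realm` and `realm_form` are hypothetical names, and the form fields are copied from the curl call above.

```python
from typing import Callable
from urllib.parse import urlencode


def realm_form(issuer_url: str, client_id: str, client_key: str) -> str:
    """Build the form-encoded body that the create request sends."""
    return urlencode({
        "realm": "authentik",
        "type": "openid",
        "issuer-url": issuer_url,
        "client-id": client_id,
        "client-key": client_key,
        "username-claim": "username",
        "autocreate": "1",
        "default": "1",
    })


def ensure_realm(get_status: Callable[[], int], create: Callable[[], None]) -> str:
    """Create the realm only when the lookup does not return HTTP 200."""
    if get_status() == 200:
        return "exists"
    create()
    return "created"
```

In real use, `get_status` would wrap the GET on /access/domains/authentik and `create` would POST the body from `realm_form`.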


Authentik OAuth2 with Terraform (Day 35)
Vince — Fri, 30 May 2025 20:08:00 GMT
I recently started using Authentik to provide auth for my services and applications in the homelab.
Authentik is an open-source identity provider that supports OAuth2, SAML, and more, and comes with a Terraform provider, so naturally, I defaulted to managing everything that way.
This means I no longer need to deal with multiple logins. Authentik acts as a single sign-on solution, letting me authenticate once and access everything.
See https://goauthentik.io/ for more info.
Set up
Create a token in Authentik's console under "Tokens and App passwords" and use it to set up the Terraform provider:
provider "authentik" {
  url   = ""
  token = "" 
}
Remember to use something like sops to encrypt your token / secrets.
Get the existing flows that Authentik provides out of the box:
data "authentik_flow" "default-authorization-flow" {
  slug = "default-provider-authorization-explicit-consent"
}

data "authentik_flow" "default-invalidation-flow" {
  slug = "default-provider-invalidation-flow"
}
These flows handle the authorization process and session invalidation. One can still create other flows, but there are plenty of preconfigured flows that just work.
Creating OAuth2 providers
Then create the authentik_provider_oauth2 resource for each application that needs and supports it.
resource "authentik_provider_oauth2" "this" {
  for_each               = var.authentik_application
  name                   = each.key
  client_id              = random_string.client_id[each.key].id
  client_secret          = random_password.client_secret[each.key].result
  authorization_flow     = data.authentik_flow.default-authorization-flow.id
  invalidation_flow      = data.authentik_flow.default-invalidation-flow.id
  refresh_token_validity = var.refresh_token_validity
  allowed_redirect_uris  = each.value.allowed_redirect_uris
  property_mappings      = var.property_mappings
  sub_mode               = var.sub_mode
}
Adding access policies
Add expression policies and bind them to the application. Expression policies control who can access what, and are Python expressions that evaluate to true or false:
resource "authentik_policy_expression" "policy" {
  name       = var.policy_expression.name
  expression = var.policy_expression.expression
}

resource "authentik_policy_binding" "app-access" {
  for_each = var.authentik_application
  target   = authentik_application.this[each.key].uuid
  policy   = authentik_policy_expression.policy.id
  order    = 0
}
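
To give a feel for what such an expression might contain, here is a standalone sketch of a group-membership check. In Authentik itself the policy body would be closer to a single line like return ak_is_group_member(request.user, name="homelab-admins"); the `User` class and the local `ak_is_group_member` below are stand-ins so the example runs on its own, and the group name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    """Minimal stand-in for the user object a policy receives."""
    username: str
    groups: Set[str] = field(default_factory=set)


def ak_is_group_member(user: User, name: str) -> bool:
    """Local stand-in for Authentik's helper of the same name."""
    return name in user.groups


def policy(user: User) -> bool:
    """Allow access only to members of the (hypothetical) homelab-admins group."""
    return ak_is_group_member(user, name="homelab-admins")
```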
Creating applications
The applications themselves are straightforward and tie everything together:
resource "authentik_application" "this" {
  for_each          = var.authentik_application
  name              = try(each.value.name, each.key)
  slug              = each.key
  meta_icon         = var.app_meta_icon
  protocol_provider = authentik_provider_oauth2.this[each.key].id
}
Property mappings
These tripped me up initially. They define what user information gets passed to the application during authentication:
variable "property_mappings" {
  # authentik default OAuth Mapping: OpenID 'email'
  # authentik default OAuth Mapping: OpenID 'openid'
  # authentik default OAuth Mapping: OpenID 'profile'
  default = [
    "4c94fd1d-1655-498f-94dc-e3be8506e0ec",
    "8bb80d61-1994-4538-9942-633b45ecd879",
    "660390cb-184a-4260-a4f0-7d69488a3037",
  ]
}
These UUIDs correspond to the default mappings in Authentik. Without them, authentication might work, but in Proxmox, for example, you may get a lot of 401s. I suppose that's because it wasn't receiving the required user info.
Next Steps
Integrating more services and maybe exploring SAML for applications that don't play nice with OAuth2.
Figure out how to get Cilium ingress and Authentik working together


Talos extensions & Longhorn (Day 34)
Vince — Sun, 04 May 2025 19:32:00 GMT
I wanted to install Longhorn on my Talos cluster and found out how involved it can be, especially if you are not used to immutable OSes.
Longhorn needs iscsi-tools and util-linux-tools. This is how I ended up installing them.
Extensions
First, create an extensions.yaml (it can have any name):
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
      - siderolabs/qemu-guest-agent
I also added qemu-guest-agent, which makes the NICs and other info available in Proxmox.
Get an image ID 
curl -X POST --data-binary @extensions.yaml https://factory.talos.dev/schematics
{"id":"613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245"}
Apply the image
Using the ID from the previous response, run the following Talos upgrade command:
talosctl upgrade --preserve --nodes <node-ip> --image factory.talos.dev/installer/613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.0
Check extensions were applied
talosctl get extensions --nodes 10.30.30.141
NODE           NAMESPACE   TYPE              ID   VERSION   NAME               VERSION
10.30.30.141   runtime     ExtensionStatus   0    1         iscsi-tools        v0.2.0
10.30.30.141   runtime     ExtensionStatus   1    1         util-linux-tools   2.40.4
10.30.30.141   runtime     ExtensionStatus   2    1         qemu-guest-agent   9.2.3
Here, checking one of my nodes, it shows the extensions were added okay.
Mounts
Next, create a patch file, e.g., longhorn.patch.yaml, with the following:
machine:
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options:
          - bind
          - rshared
          - rw
And apply the patch to your Talos nodes using:
talosctl -n 10.30.30.143 patch machineconfig -m reboot -p @longhorn.patch.yaml
Once the reboot is done, you can proceed with the Longhorn installation (either through Argo or helm, or installation of your choice).


VLAN Trunking and Proxmox Clusters (Day 33)
Vince — Sun, 04 May 2025 10:22:02 GMT
Following the post systemd-and-proxmox-day-3, I eventually declustered my Proxmox nodes. Now with the addition of a new MinisForum MS01, I did some rebuilding and went back to Proxmox clusters.
The current setup includes both PCs in a cluster, deliberately excluding the OPNsense Proxmox node, with the main advantage of clustering being how easy VM migration is between nodes.
For this cluster, I decided to also try VLAN trunking on Proxmox, allowing a single NIC to handle multiple network segments. Here's how to set it up:
Step 1: Configure Proxmox's VLAN-aware bridge
Edit the network configuration by modifying /etc/network/interfaces:
auto vmbr0
iface vmbr0 inet static
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

# VLAN 50 interface; the address below is your Proxmox IP
# (comments must sit on their own line in /etc/network/interfaces)
auto vmbr0.50
iface vmbr0.50 inet static
        address 192.168.50.240/24
        gateway 192.168.50.1
Step 2: Reboot Proxmox
After rebooting, note that you won't immediately regain access to Proxmox's web interface. This is expected behavior since the node now requires properly tagged VLAN 50 traffic.
Step 3: Configure your network switch
Set up the switch port connected to Proxmox as a trunk port, and configure it to carry all VLANs you plan to use with Proxmox and its VMs, with VLAN tagging enabled. 
This allows the Proxmox host to receive properly tagged traffic for each network segment. The main advantage here is that each VM instance can operate on separate VLANs while using only a single physical port on your switch, reducing the number of switch ports needed.
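
Before tagging a VM NIC, it can help to confirm the tag falls inside the bridge-vids range configured on the bridge. A small sketch that parses a bridge-vids value (space-separated IDs and ranges) and checks a tag against it; the function names are made up for illustration.

```python
from typing import Set


def parse_vids(spec: str) -> Set[int]:
    """Expand a bridge-vids value such as '2-4094' or '10 20 30-32' into a set of VLAN IDs."""
    vids: Set[int] = set()
    for part in spec.split():
        if "-" in part:
            low, high = part.split("-", 1)
            vids.update(range(int(low), int(high) + 1))
        else:
            vids.add(int(part))
    return vids


def vlan_allowed(spec: str, tag: int) -> bool:
    """Return True if the bridge would carry traffic tagged with this VLAN ID."""
    return tag in parse_vids(spec)
```

With the configuration above, `vlan_allowed("2-4094", 50)` is True, while VLAN 1 falls outside the range.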


Setting up Talos in HA Mode (Day 32)
Vince — Sat, 19 Apr 2025 14:17:29 GMT
I decided to migrate from kubeadm and Ansible playbooks to Talos (mostly out of curiosity, and because it looks like an easier way to manage and upgrade clusters).
Why Talos?
What makes Talos interesting:
Immutable infrastructure (no SSH, no shell)
API-driven configuration
Designed from the ground up for Kubernetes
Also, I did say I would try it after this year's KubeCon EU, so...
Setting Up HA Control Plane
I didn't want to set up an external HAProxy load balancer (though I plan to use OPNsense instead, a bit different from my existing clusters), so I defaulted to Talos's built-in VIP support.
Here's how I approached it:
First, create a controlplane patch file for configuration overrides:
machine:
  network:
    interfaces:
      - interface: enp6s18 # Use talosctl -n <node-ip> get links --insecure
        dhcp: true
        vip:
          ip: 10.30.30.135
cluster:
  apiServer:
    certSANs:
      - 10.30.30.135
      - 10.30.30.131
      - 10.30.30.132
      - 10.30.30.133
    admissionControl:
      - name: PodSecurity
        configuration:
          defaults:
            audit: privileged
            audit-version: latest
            enforce: privileged
            enforce-version: latest
            warn: privileged
            warn-version: latest
  network:
    cni:
      name: none
    podSubnets:
      - 10.244.0.0/16
    serviceSubnets:
      - 10.96.0.0/16
  proxy:
    disabled: true
The patch disables the CNI and kube-proxy, as I plan to use Cilium as a replacement for both later.
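
One thing worth sanity-checking in a patch like this is that the pod and service subnets don't overlap each other or the node network. A sketch using the standard ipaddress module; the node network 10.30.30.0/24 is an assumption inferred from the node IPs in this post, and `overlapping` is a hypothetical helper.

```python
import ipaddress
from typing import List, Tuple


def overlapping(cidrs: List[str]) -> List[Tuple[str, str]]:
    """Return every pair of CIDR blocks in the list that overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
        if a.overlaps(b)
    ]
```

For the values in the patch, `overlapping(["10.244.0.0/16", "10.96.0.0/16", "10.30.30.0/24"])` returns an empty list.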
Configuration Generation
Generate configs for your HA setup with the VIP:
# Use the VIP as the cluster endpoint.
# If you have worker patches, apply them too (otherwise drop the last flag).
talosctl gen config daedalus https://10.30.30.135:6443 \
  --output-dir _out \
  --with-cluster-discovery \
  --config-patch-control-plane @controlplane.yaml \
  --config-patch-worker @worker.yaml
Applying Configurations
Apply to control plane nodes:
talosctl apply-config --insecure --nodes 10.30.30.131 --file _out/controlplane.yaml
talosctl apply-config --insecure --nodes 10.30.30.132 --file _out/controlplane.yaml
talosctl apply-config --insecure --nodes 10.30.30.133 --file _out/controlplane.yaml
Apply to worker nodes:
talosctl apply-config --insecure --nodes 10.30.30.134 --file _out/worker.yaml
After applying the config, the nodes reboot. Wait for the reboot to finish, then bootstrap one of the control plane nodes.
Bootstrapping
After Talos installs and reboots, run:
export TALOSCONFIG=$(pwd)/_out/talosconfig
talosctl config endpoint 10.30.30.131 10.30.30.132 10.30.30.133
talosctl config node 10.30.30.131
talosctl bootstrap
Health Check and Kubeconfig
Check cluster health:
talosctl health
This command might stall at waiting for all k8s nodes to report ready if you set CNI to none in your config. 
As long as the kubelet, apiserver, controller-manager, and scheduler are ready, you can proceed to install a CNI plugin. I went with Cilium, as always.
Generate kubeconfig:
talosctl kubeconfig --nodes 10.30.30.131 --endpoints 10.30.30.135 -f
talosctl config endpoint 10.30.30.135
Automating
I created an Ansible playbook to automate this entire process, but I just found there's a terraform provider for talos, so I may be switching to that instead.
UPDATE: While switching, I ended up with Makefiles instead; with the amount of recreating I was doing, I needed something to just run all the Terragrunt, Helmfile, etc. commands.
First Impressions
The biggest challenge was understanding the bootstrapping process and how the VIP gets managed, but once configured, I pointed the deployed Argo instance and had my deployments up and running.


TIL: DNS Search Domains (Day 31)
Vince — Mon, 24 Mar 2025 21:00:00 GMT
What Are Search Domains?
Search domains are DNS suffixes automatically appended to unqualified hostnames to help resolve local network resources. When you type server1 instead of server1.home.network, your system will try both.
The Problem
When combined with wildcard DNS records (*.domain.tld), search domains can cause external domains to incorrectly resolve to internal IPs.
I needed pods in my clusters to resolve DNS using my self-hosted DNS resolver, AdGuard Home.
After a few attempts, I settled on modifying CoreDNS and having it use AdGuard for certain domains:
forward . 192.168.50.120
This worked initially. However, I immediately noticed Argo was "broken": everything was stuck in Unknown, and then an error came up saying it couldn't resolve GitHub (somehow GitHub was being resolved to my load balancer, which isn't right).
# from argo 
Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: 
code = Unknown desc = failed to list refs: dial tcp .....: connect: connection timed out
Some Diagnostics
Test DNS Resolution
k run dnstest --image=nicolaka/netshoot -it --rm --restart=Never -- nslookup github.com/mrdvince
Server:        10.96.0.10
Address:    10.96.0.10#53
Non-authoritative answer:
Name:    github.com/mrdvince.home.mrdvince.me 
Address: 192.168.50.10
Notice the github.com/mrdvince.home.mrdvince.me, so I go "what even is this? How did this come to be?"
I tried the same thing on the k8s node, which resolved just fine:
nslookup github.com
Server:        192.168.50.120
Address:    192.168.50.120#53
Non-authoritative answer:
Name:    github.com
Address: 140.82.121.4
The next thing is to check what the resolv.conf contains:
k run dnstest --image=nicolaka/netshoot -it --rm --restart=Never -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local home.mrdvince.me
nameserver 10.96.0.10
options ndots:5
pod "dnstest" deleted
It turns out that, because of ndots:5, names with fewer than 5 dots are treated as unqualified and trigger search domain appending.
For those unqualified names, the resolver tries every search domain, including home.mrdvince.me. Since I have a wildcard record for that domain, my home DNS resolves the expanded name to an internally set IP.
So where is this search domain coming from? Turns out it's coming from OPNsense, and there's no way to disable it. (The default is to use the domain name of the system as the default domain name provided by DHCP; you can specify a different one, but you can't fully get rid of it.)
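
To see why github.com ends up hitting the wildcard, it helps to list the names the resolver actually tries. This is a rough sketch of glibc-style search-list behaviour, not the resolver's real code; `candidate_queries` is a hypothetical name, and other libc implementations differ in the details.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int) -> List[str]:
    """List the fully qualified names tried for a lookup, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search domains are applied.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, and only then try the bare name.
    return expanded + [absolute]
```

With the pod's resolv.conf above, github.com has one dot, so the search domains go first and github.com.home.mrdvince.me is queried before github.com itself. The wildcard record answers that query, and the resolver never gets to the real name.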
Solutions
Use specific DNS records instead of wildcards
Prioritize external DNS servers (e.g., forward . 1.1.1.1 local_dns_server)
Use more specific wildcard patterns
Change the system domain to something non-conflicting
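The eventual fix, while I redo the DNS records and replace the record additions with external DNS, was to prioritize external DNS and fall back to the internal one. Note that the CoreDNS forward plugin picks between upstreams at random by default, so strict ordering needs its policy sequential option.
The quick and dirty fix: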
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    # Prioritize external DNS to avoid search domain problems
    forward . 1.1.1.1 192.168.50.120
    cache 30
    loop
    reload
    loadbalance
}
So looks like the meme was right, it's always dns after all.


Automating Kubernetes Secrets with ArgoCD and SOPS (Day 30)
Vince — Sun, 23 Mar 2025 20:32:00 GMT
So I've been using SOPS to encrypt secrets and store them in git, but the catch was that I had to manually decrypt and pipe them to kubectl each time I needed to apply them to the cluster.
# The manual way
sops -d secret.yaml | kubectl apply -f -
Secret Management 
ArgoCD is unopinionated about secret management, which is both a blessing and a curse. The blessing: flexibility. The curse: you have to figure it out yourself.
Since I was already using Helmfile, I decided to leverage the helm-secrets plugin that comes bundled with the Helmfile container image.
Setting It Up
First, I checked what plugins were available in the Helmfile container:
docker run --rm -it ghcr.io/helmfile/helmfile:v0.171.0 helm plugin list
NAME    	VERSION	DESCRIPTION
diff    	3.9.14 	Preview helm upgrade changes as a diff
helm-git	0.16.0 	Get non-packaged Charts directly from Git.
s3      	0.16.2 	Provides AWS S3 protocol support for charts and repos.
secrets 	4.6.0  	This plugin provides secrets values encryption for Helm charts
Perfect! The secrets plugin is already there.
Step 1: Create an AGE Key
Generate an age key (you could use your existing keys too):
age-keygen > key.txt
kubectl -n argocd create secret generic age --from-file=./key.txt
Step 2: Register the Plugin with ArgoCD
ArgoCD uses a ConfigManagementPlugin system that can be configured in the Helm values:
configs:
  cmp:
    create: true
    plugins:
      helmfile:
        allowConcurrency: true
        discover:
          fileName: helmfile.yaml
        generate:
          command:
            - bash
            - "-c"
            - |
              if [[ -v ENV_NAME ]]; then
                helmfile -n "$ARGOCD_APP_NAMESPACE" -e "$ENV_NAME" template --include-crds -q
              elif [[ -v ARGOCD_ENV_ENV_NAME ]]; then
                helmfile -n "$ARGOCD_APP_NAMESPACE" -e "$ARGOCD_ENV_ENV_NAME" template --include-crds -q
              else
                helmfile -n "$ARGOCD_APP_NAMESPACE" template --include-crds -q
              fi
        lockRepo: false
This config tells ArgoCD: "If you find a helmfile.yaml, use the helmfile command to process it."
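
The generate command picks the Helmfile environment from ENV_NAME if it is set, then from ARGOCD_ENV_ENV_NAME (ArgoCD prefixes application-supplied plugin variables with ARGOCD_ENV_), and otherwise runs without -e. The same selection logic as a small sketch; `helmfile_args` is a hypothetical name and the environment is passed in as a plain dict so it is easy to try out.

```python
from typing import Dict, List


def helmfile_args(env: Dict[str, str]) -> List[str]:
    """Build the helmfile command line the plugin would run for this environment."""
    args = ["helmfile", "-n", env["ARGOCD_APP_NAMESPACE"]]
    # Mirrors the bash [[ -v VAR ]] checks: a variable counts as long as it is set.
    if "ENV_NAME" in env:
        args += ["-e", env["ENV_NAME"]]
    elif "ARGOCD_ENV_ENV_NAME" in env:
        args += ["-e", env["ARGOCD_ENV_ENV_NAME"]]
    return args + ["template", "--include-crds", "-q"]
```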
Step 3: Add the Helmfile Container to the Repo Server
Then I added the helmfile container to ArgoCD's repo server:
repoServer:
  extraContainers:
    - name: helmfile
      image: ghcr.io/helmfile/helmfile:v0.171.0
      command: ["/var/run/argocd/argocd-cmp-server"]
      env:
        - name: SOPS_AGE_KEY_FILE
          value: /app/config/age/key.txt
        - name: HELM_CACHE_HOME
          value: /tmp/helm/cache
        - name: HELM_CONFIG_HOME
          value: /tmp/helm/config
        - name: HELMFILE_CACHE_HOME
          value: /tmp/helmfile/cache
        - name: HELMFILE_TEMPDIR
          value: /tmp/helmfile/tmp
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
      volumeMounts:
        - mountPath: /var/run/argocd
          name: var-files
        - mountPath: /home/argocd/cmp-server/plugins
          name: plugins
        - mountPath: /home/argocd/cmp-server/config/plugin.yaml
          subPath: helmfile.yaml
          name: argocd-cmp-cm
        - mountPath: /tmp
          name: cmp-tmp
        - mountPath: /app/config/age/
          name: age
Note the SOPS_AGE_KEY_FILE variable and the mounted age secret. SOPS checks for this environment variable when decrypting secrets.
I also had to direct all the cache folders to /tmp, otherwise I'd get:
Error: mkdir /helm/.config: permission denied COMBINED OUTPUT: Error: mkdir /helm/.config: permission denied
Managing the Secrets in Helmfile
With the ArgoCD setup complete, I then structured my Helmfile to handle the secrets. In the releases block add a secrets section:
releases:
  - name: grafana
    namespace: monitoring
    createNamespace: true
    chart: grafana/grafana
    version: 8.10.4
    values:
      - ./values.yaml.gotmpl
    needs:
      - monitoring/grafana-auth
      
  - name: grafana-auth
    namespace: monitoring
    createNamespace: true
    chart: ../../../../../charts/secrets/
    version: 0.1.0
    secrets:
      - ../../../../secrets/sealed-grafana-auth-secret.yaml
The trick is to have a minimal Helm chart that takes these decrypted values and creates a Kubernetes secret:
apiVersion: v1
kind: Secret
type: Opaque
metadata:
    name: {{ include "secrets.fullname" . }}
    namespace: {{ .Release.Namespace }}
    labels:
      {{- include "secrets.labels" . | nindent 6}}
{{- with .Values.data }}
data:
  {{- range $key, $value := .}}
  {{$key }}: {{ $value | b64enc }}
  {{- end }}
{{- end }}
{{- with .Values.stringData }}
stringData:
  {{- range $key, $value := .}}
  {{$key }}: {{ $value | quote }}
  {{- end }}
{{- end }}
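
A Kubernetes Secret expects base64-encoded values under data and plain strings under stringData, which is the distinction the chart has to respect. A quick sketch of the encoding that b64enc performs on the data values; `encode_data` is a hypothetical helper and the sample credentials are placeholders.

```python
import base64
from typing import Dict


def encode_data(values: Dict[str, str]) -> Dict[str, str]:
    """Base64-encode each value the way a Secret's data field expects."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


def decode_data(data: Dict[str, str]) -> Dict[str, str]:
    """Reverse of encode_data: what a pod ultimately sees as the secret value."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }
```

For example, `encode_data({"admin-user": "admin"})` gives `{"admin-user": "YWRtaW4="}`. Values placed under stringData skip this step, because the API server encodes them on write.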
Then in the Grafana admin block values, I reference this secret:
existingSecret: grafana-auth-secret
userKey: admin-user
passwordKey: admin-password
So now:
No more manual decryption: ArgoCD now handles the secret decryption and application automatically.
GitOps all the things: Everything—including secrets—is now managed declaratively through git.