When Simple Solutions Become Complex (Day 5 & 6)
2025-01-15
After the whole VM boot ordering saga refused to work the way I wanted, I finally decided to uncluster the nodes.
(Also very likely a skill issue).
To be fair, having the nodes in a cluster made centralized management and VM migrations easier.
From LXC to K3s
This is more of day 6 now.
Now, remember the experiment running Traefik in an LXC container? Well, I thought - why not make this more interesting? Instead of systemd services, why not run it on k3s? It would make it easier to manage Traefik and run some other containers without reaching for Portainer. Plus, I already have Helm charts from the other kubeadm setup.
Here’s how it went (or is going, depending on when you read this):
First, the k3s install: disabling kube-proxy and using Cilium for networking (almost muscle memory at this point):
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--disable=traefik --disable-kube-proxy --disable-network-policy --flannel-backend=none --write-kubeconfig-mode=644 --etcd-expose-metrics=true' sh -
As expected, pods weren’t running yet - no networking configured. An easy fix (as always).
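If you want to see that state for yourself, it looks something like this right after the install:

```shell
# With flannel disabled and no CNI installed yet, the node reports
# NotReady and any pod that needs a pod IP (coredns, metrics-server)
# sits in Pending:
kubectl get nodes
kubectl get pods -n kube-system
```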
I won’t bore you with the details, but let’s just say it wasn’t as straightforward as I thought it would be. Anyway, I got Cilium running after some debugging - all it needed in the end was this one value: ipam.operator.clusterPoolIPv4PodCIDRList: "10.42.0.0/16" (k3s’s default pod CIDR).
Final helmfile config section:
repositories:
  - name: cilium
    url: https://helm.cilium.io/

releases:
  - name: cilium
    namespace: kube-system
    chart: cilium/cilium
    version: 1.16.5
    values:
      - ipam.operator.clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
      - operator.replicas: 1
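Applying it is then just a matter of (assuming helmfile and its helm-diff plugin are installed; the rollout check is one way to confirm Cilium actually came up):

```shell
# Deploy the releases defined in helmfile.yaml, then wait for the
# Cilium agent DaemonSet to finish rolling out.
helmfile apply
kubectl -n kube-system rollout status daemonset/cilium --timeout=120s
```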
Secrets
Next up was getting the secrets in place. I pulled my age keys from Bitwarden Secrets Manager:
export SOPS_AGE_KEY=$(bws secret get <uuid> | jq .value | xargs)
export AGE_PUBLIC_KEY=$(bws secret get <uuid> | jq .value | xargs)
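As an aside, the `jq | xargs` combo is just a quick way to strip the JSON quotes. Here’s the same pattern against a faked `bws` payload (the real call needs a Bitwarden machine-account token; the JSON structure here is my assumption):

```shell
# Fake stand-in for `bws secret get <uuid>` output:
fake_bws_json='{"id":"0000","key":"sops-age-key","value":"AGE-SECRET-KEY-1EXAMPLE"}'

# `jq .value` prints the value with its quotes; `xargs` strips them.
SOPS_AGE_KEY=$(echo "$fake_bws_json" | jq .value | xargs)
echo "$SOPS_AGE_KEY"   # AGE-SECRET-KEY-1EXAMPLE
```

`jq -r .value` would do the same thing without the `xargs` step.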
Applied them to the cluster (well, it’s just one node, but I don’t know what else to call it):
sops --decrypt --encrypted-regex '^(data|stringData)$' ../atlas/manifests/secrets/traefik-auth-secret.yaml | k apply -f -
sops --decrypt --encrypted-regex '^(data|stringData)$' ../atlas/manifests/secrets/cloudflare-token-secret.yaml | k apply -f -
These are needed for the Traefik auth middleware and for the cluster issuer’s Cloudflare DNS solver in cert-manager.
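For reference, the cluster issuer side looks roughly like this (a sketch - the issuer name, email, and the `api-token` key are my assumptions; the secret name matches the one applied above):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: me@example.com                     # assumption - use your own
    privateKeySecretRef:
      name: letsencrypt-staging-account-key   # assumption
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-token-secret   # the sops-decrypted secret
              key: api-token                  # assumption - key inside the secret
```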
The Certificate
Then came the fun part. cert-manager was failing to verify certificates with an interesting error:
Warning ErrInitIssuer 7m27s (x8 over 22m) cert-manager-clusterissuers Error initializing issuer: Get "https://acme-staging-v02.api.letsencrypt.org/directory": tls: failed to verify certificate: x509: certificate is valid for OPNsense.localdomain, not acme-staging-v02.api.letsencrypt.org
Warning ErrInitIssuer 2m27s (x2 over 2m27s) cert-manager-clusterissuers Error initializing issuer: Get "https://acme-staging-v02.api.letsencrypt.org/directory": tls: failed to verify certificate: x509: certificate is valid for 01d3241326f8a773d75e9f119eb7de02.2a95698c2bc1a035feb46fa4cfb29c0f.traefik.default, not acme-staging-v02.api.letsencrypt.org
Strange. Okay, so let’s see - do I have any special filtering on my OPNsense? Hmm!!! And why OPNsense.localdomain and traefik.default?
Some DNS debugging:
nslookup acme-staging-v02.api.letsencrypt.org
Looked fine…
curl -v https://acme-staging-v02.api.letsencrypt.org/directory
All seemed fine…
Then, aha!!! It turned out to be an interesting DNS loop: the instance gets its DNS from OPNsense, which I had set to use only AdGuard for the VLAN the instance was running on, and AdGuard had a rewrite rule for the domain pointing back at Traefik. Sort of a circular dependency.
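The quickest way I know to spot this kind of split-horizon surprise is to ask two resolvers the same question and compare (1.1.1.1 used here only as a known-good public resolver):

```shell
# What does my LAN resolver (OPNsense -> AdGuard) say?
nslookup acme-staging-v02.api.letsencrypt.org

# And what does the outside world say? A mismatch means a local
# rewrite/override is in play.
nslookup acme-staging-v02.api.letsencrypt.org 1.1.1.1
```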
After fixing the DNS setup, the certificate finally became ready:
kubectl get certificates -A
NAMESPACE NAME READY SECRET AGE
kube-system traefik-certificate True traefik-tls-staging 40m
But hold up! While the certificate is issued and ready, Traefik isn’t picking it up - it’s still serving the default self-signed certificate when I access the URL.
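One suspect for next time: Traefik doesn’t automatically use a certificate just because the secret exists - it has to be wired in somewhere, for example as the default certificate in a `TLSStore` (sketch; the secret name comes from the output above, the rest is my assumption):

```yaml
apiVersion: traefik.io/v1alpha1
kind: TLSStore
metadata:
  name: default            # Traefik only honours the TLSStore named "default"
  namespace: kube-system
spec:
  defaultCertificate:
    secretName: traefik-tls-staging
```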
At this point it’s approaching midnight and I’m tired (this could be the issue).
To be continued…
Note: for this k3s setup, I’m keeping things simpler and using helmfile to deploy the charts instead of Argo CD.