When Simple Solutions Become Complex (Day 5 & 6)
2025-01-15
After the whole VM boot ordering saga refused to work the way I wanted, I finally decided to uncluster the nodes.
(Also very likely a skill issue).
To be fair, having the nodes in a cluster made centralized management and VM migrations easier.
From LXC to K3s
This is more of day 6 now.
Now, remember the experiment running Traefik in an LXC container? Well, I thought - why not make this more interesting? Instead of systemd services, why not run it on k3s? It would make it easier to manage Traefik and run some other containers without reaching for Portainer. Plus, I already have Helm charts from the other kubeadm setup.
Here’s how it went (or is going, depending on when you read this):
First, the k3s install: disabling kube-proxy and using Cilium for networking (almost muscle memory at this point):
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--disable=traefik --disable-kube-proxy --disable-network-policy --flannel-backend=none --write-kubeconfig-mode=644 --etcd-expose-metrics=true' sh -
As expected, pods weren’t running yet - no networking configured. An easy fix (as always).
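If you want to see that state for yourself, it looks something like this right after the install:

```shell
# With flannel disabled and no CNI installed yet, the node reports
# NotReady and any pod that needs a pod IP (coredns, metrics-server)
# sits in Pending:
kubectl get nodes
kubectl get pods -n kube-system
```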
I won’t bore you with the details, but let’s just say it wasn’t as straightforward as I thought it would be. Anyway, I got Cilium running after some debugging - all it needed in the end was this one value: ipam.operator.clusterPoolIPv4PodCIDRList: "10.42.0.0/16" (k3s’s default pod CIDR).
Final helmfile config section:
repositories:
  - name: cilium
    url: https://helm.cilium.io/

releases:
  - name: cilium
    namespace: kube-system
    chart: cilium/cilium
    version: 1.16.5
    values:
      - ipam.operator.clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
      - operator.replicas: 1
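Applying it is then just a matter of (assuming helmfile and its helm-diff plugin are installed; the rollout check is one way to confirm Cilium actually came up):

```shell
# Deploy the releases defined in helmfile.yaml, then wait for the
# Cilium agent DaemonSet to finish rolling out.
helmfile apply
kubectl -n kube-system rollout status daemonset/cilium --timeout=120s
```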
Secrets
Next up was getting the secrets in place. I pulled my age keys from Bitwarden Secrets Manager:
export SOPS_AGE_KEY=$(bws secret get <uuid> | jq .value | xargs)
export AGE_PUBLIC_KEY=$(bws secret get <uuid> | jq .value | xargs)
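As an aside, the `jq | xargs` combo is just a quick way to strip the JSON quotes. Here’s the same pattern against a faked `bws` payload (the real call needs a Bitwarden machine-account token; the JSON structure here is my assumption):

```shell
# Fake stand-in for `bws secret get <uuid>` output:
fake_bws_json='{"id":"0000","key":"sops-age-key","value":"AGE-SECRET-KEY-1EXAMPLE"}'

# `jq .value` prints the value with its quotes; `xargs` strips them.
SOPS_AGE_KEY=$(echo "$fake_bws_json" | jq .value | xargs)
echo "$SOPS_AGE_KEY"   # AGE-SECRET-KEY-1EXAMPLE
```

`jq -r .value` would do the same thing without the `xargs` step.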
Applied them to the cluster (well, it’s just one node, but I don’t know what else to call it):
sops --decrypt --encrypted-regex '^(data|stringData)$' ../atlas/manifests/secrets/traefik-auth-secret.yaml | k apply -f -
sops --decrypt --encrypted-regex '^(data|stringData)$' ../atlas/manifests/secrets/cloudflare-token-secret.yaml | k apply -f -
These are needed for the Traefik auth middleware and for the cluster issuer’s Cloudflare DNS solver in cert-manager.
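For reference, the cluster issuer side looks roughly like this (a sketch - the issuer name, email, and the `api-token` key are my assumptions; the secret name matches the one applied above):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: me@example.com                     # assumption - use your own
    privateKeySecretRef:
      name: letsencrypt-staging-account-key   # assumption
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-token-secret   # the sops-decrypted secret
              key: api-token                  # assumption - key inside the secret
```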
The Certificate
Then came the fun part. cert-manager was failing to verify certificates with an interesting error:
Warning ErrInitIssuer 7m27s (x8 over 22m) cert-manager-clusterissuers Error initializing issuer: Get "https://acme-staging-v02.api.letsencrypt.org/directory": tls: failed to verify certificate: x509: certificate is valid for OPNsense.localdomain, not acme-staging-v02.api.letsencrypt.org
Warning ErrInitIssuer 2m27s (x2 over 2m27s) cert-manager-clusterissuers Error initializing issuer: Get "https://acme-staging-v02.api.letsencrypt.org/directory": tls: failed to verify certificate: x509: certificate is valid for 01d3241326f8a773d75e9f119eb7de02.2a95698c2bc1a035feb46fa4cfb29c0f.traefik.default, not acme-staging-v02.api.letsencrypt.org
Strange. Okay, so let’s see - do I have any special filtering on my OPNsense? Hmm!!! And why OPNsense.localdomain and traefik.default?
Some DNS debugging:
nslookup acme-staging-v02.api.letsencrypt.org
Looked fine…
curl -v https://acme-staging-v02.api.letsencrypt.org/directory
All seemed fine…
Then, aha!!! It turned out to be an interesting DNS loop: the instance gets its DNS from OPNsense, which I had set to use only AdGuard for the VLAN the instance was running on, and AdGuard had a rewrite rule for the domain pointing back at Traefik. Sort of a circular dependency.
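The quickest way I know to spot this kind of split-horizon surprise is to ask two resolvers the same question and compare (1.1.1.1 used here only as a known-good public resolver):

```shell
# What does my LAN resolver (OPNsense -> AdGuard) say?
nslookup acme-staging-v02.api.letsencrypt.org

# And what does the outside world say? A mismatch means a local
# rewrite/override is in play.
nslookup acme-staging-v02.api.letsencrypt.org 1.1.1.1
```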
After fixing the DNS setup, the certificate finally became ready:
kubectl get certificates -A
NAMESPACE NAME READY SECRET AGE
kube-system traefik-certificate True traefik-tls-staging 40m
But hold up! While the certificate is issued and ready, Traefik isn’t picking it up - it’s still serving the default self-signed certificate when I access the URL.
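One suspect for next time: Traefik doesn’t automatically use a certificate just because the secret exists - it has to be wired in somewhere, for example as the default certificate in a `TLSStore` (sketch; the secret name comes from the output above, the rest is my assumption):

```yaml
apiVersion: traefik.io/v1alpha1
kind: TLSStore
metadata:
  name: default            # Traefik only honours the TLSStore named "default"
  namespace: kube-system
spec:
  defaultCertificate:
    secretName: traefik-tls-staging
```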
At this point it’s approaching midnight and I’m tired (this could be the issue).
To be continued…
Note: for this k3s setup, I’m keeping things simpler and using helmfile to deploy the charts instead of Argo CD.