CrowdSec is a security tool that detects and blocks malicious IPs using a collaborative approach to share threat intelligence across users.
I initially planned to run CrowdSec just on Traefik, but having it at the firewall level provides more protection for all devices on the network.
CrowdSec has a convenient plugin for OPNsense that makes installation straightforward:
Once installed, you'll find CrowdSec under the Services tab:
By default, CrowdSec creates floating rules to block incoming connections from malicious IP addresses.
However, we can use the automatically created crowdsec_blacklists and crowdsec6_blacklists aliases to create custom floating rules that block all outgoing connections to malicious IPs.
This is useful in case a device on the network is already compromised and tries to connect back to a blocklisted IP.
To verify that CrowdSec is working properly, you can temporarily ban an IP address:
cscli decisions add -t ban -d 1m -i <IP address>
This will ban the specified IP for one minute.
If you use your own IP, expect your connection to freeze, confirming that the ban is working.
To view active decisions (bans):
cscli decisions list
CrowdSec also has a Prometheus endpoint for metrics collection, so I'll look into integrating it with Grafana for visualization.
When working with infrastructure as code and Kubernetes, you inevitably face the challenge of managing secrets securely.
API tokens and other sensitive information shouldn't be stored in plain text in your Git repositories, but they still need to be accessible for deployments.
SOPS (Secrets OPerationS) is a powerful tool that supports multiple encryption providers including AWS KMS, GCP KMS, Azure Key Vault, age, and PGP.
But for those of us without cloud provider resources, age offers a lightweight, modern alternative for encryption.
The first question with age is: where do you store your keys securely, especially when you need to access them across multiple machines? And what happens if you reset or lose your machine and lose the keys with it?
I went looking for options and found Bitwarden Secrets Manager. It offers an elegant way to store cryptographic keys and access them securely, and if you're already using Bitwarden for password management, why not try it.
First, install age and SOPS:
# Install age and SOPS (commands will vary by OS; on macOS you can use brew)
age-keygen -o key.txt
The generated file contains two important pieces:
age1... (the public key, used for encryption)
AGE-SECRET-KEY-... (the private key, used for decryption)
Follow the Bitwarden Secrets Manager guide to set up your account and store both keys.
Install the Bitwarden Secrets CLI (bws) and set up your access token:
export BWS_ACCESS_TOKEN=<your token>
Add this function to your shell profile (e.g., .zshrc) to easily load keys when needed:
load_age_secrets() {
export SOPS_AGE_KEY=$(bws secret get <secret id> | jq -r .value)   # secret holding the private key
export AGE_PUBLIC_KEY=$(bws secret get <secret id> | jq -r .value) # secret holding the public key
echo "export SOPS_AGE_KEY='$SOPS_AGE_KEY'" > /tmp/.secrets_exports
echo "export AGE_PUBLIC_KEY='$AGE_PUBLIC_KEY'" >> /tmp/.secrets_exports
echo "Secrets loaded"
}
# this loads the keys as env variables if the values exist
[[ -f /tmp/.secrets_exports ]] && source /tmp/.secrets_exports
The <secret id> is the secret UUID, which can be found by running bws secret list.
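The caching through /tmp/.secrets_exports is plain shell: new shells pick up previously loaded keys without another bws call. A minimal sketch of the mechanism (with a dummy value, not a real key):

```shell
# Simulate what load_age_secrets writes after fetching from bws (dummy value)
echo "export SOPS_AGE_KEY='AGE-SECRET-KEY-DUMMY'" > /tmp/.secrets_exports

# What every new shell does on startup: source the cached exports if present
[ -f /tmp/.secrets_exports ] && . /tmp/.secrets_exports

echo "$SOPS_AGE_KEY"   # AGE-SECRET-KEY-DUMMY
```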
An example of using encrypted secrets with Terragrunt for infra provisioning (in this case, for Proxmox):
# auth.yaml
proxmox_token: "your_super_secret_token_that_no_one_should_know"
proxmox_user_id: "your_user_id_which_we_also_encrypted_because_why_not"
Using YAML because we will use Terragrunt's yamldecode function to parse the decrypted secrets later.
sops --encrypt --age $AGE_PUBLIC_KEY --in-place auth.yaml
--in-place overwrites the file with the encrypted version.
locals {
  secret_vars = yamldecode(sops_decrypt_file(find_in_parent_folders("auth.yaml")))
  # ...
  pm_api_token_secret = local.secret_vars.proxmox_token
  pm_api_token_id     = local.secret_vars.proxmox_user_id
}
generate "provider" {
path = "provider.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
provider "proxmox" {
pm_api_url = "${local.pm_api_url}"
pm_api_token_id = "${local.pm_api_token_id}"
pm_api_token_secret = "${local.pm_api_token_secret}"
pm_tls_insecure = false
pm_parallel = 10
}
EOF
}
Terragrunt commands work as usual, e.g. when running terragrunt apply
the secret gets decrypted and used to authenticate with the Proxmox API.
You can also encrypt Kubernetes secrets:
# secret.yaml
apiVersion: v1
data:
key1: c3VwZXJzZWNyZXQ=
key2: dG9wc2VjcmV0
kind: Secret
metadata:
name: my-secret
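Note the data values above are just base64 of the plaintext (supersecret and topsecret), which is encoding, not encryption - hence the need for SOPS on top:

```shell
# base64 is reversible encoding, not encryption
echo -n supersecret | base64   # c3VwZXJzZWNyZXQ=
echo -n topsecret | base64     # dG9wc2VjcmV0
```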
sops --encrypt --age $AGE_PUBLIC_KEY --encrypted-regex '^(data|stringData)$' secret.yaml
To apply it, you can pipe the decrypted output to kubectl, e.g.:
sops --decrypt --encrypted-regex '^(data|stringData)$' secret.yaml | kubectl apply -f -
So this setup gives you a nice foundation for secret management.
Got a new drive to add to my storage pool, and TrueNAS Scale now supports RAIDZ VDEV extension - a relatively new feature, introduced in TrueNAS 24.10 (Electric Eel).
Steps:
See the docs here. The plus side of this process is that your NAS remains fully functional during the extension.
You can continue using all services while the VDEV rebuilds in the background.
There's an interesting caveat with the extension process:
The expanded vdev uses the pre-expanded parity ratio, which reduces the total vdev capacity. To reset the vdev parity ratio and fully use the new capacity, manually rewrite all data in the vdev. This process takes time and is irreversible.
In practical terms, this means you won't immediately get the full theoretical capacity increase. The system recovers this "lost headroom" over time as data naturally gets modified or deleted. From the TrueNAS docs:
Extended VDEVs recover lost headroom as existing data is read and rewritten to the new parity ratio. This can occur naturally over the lifetime of the pool as you modify or delete data. To manually recover capacity, simply replicate and rewrite the data to the extended pool.
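As a rough back-of-the-envelope illustration (hypothetical numbers, not my pool): a 3-wide RAIDZ1 of 12TB drives has a 2/3 data ratio; extend it to 4 drives and, until the data is rewritten, the old ratio still applies:

```shell
# Old data ratio (2 data : 1 parity) applied to the new 4-drive vdev
echo $((4 * 12 * 2 / 3))   # 32 -> roughly 32TB usable for now
# Ideal ratio after a full rewrite (3 data : 1 parity)
echo $((4 * 12 * 3 / 4))   # 36 -> roughly 36TB usable
```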
However, scripts like this exist for ZFS rebalancing - e.g. the linked one rewrites the data in place.
For those wanting to calculate potential capacity gains, TrueNAS provides a handy Extension Calculator.
The extension process is fairly time-consuming - my current extension has been running for over 5 hours.
However, the entire system remains usable during this time, with all services continuing to function (that said, there's not much load on the NAS; I can't say the same if it were under heavy use).
So I needed to share some services with friends outside my tailnet, but:
The current infrastructure setup includes:
After doing some research and trials, the approach I went with was to expose the Traefik LoadBalancer service directly to Tailscale using their Kubernetes operator.
Tailscale includes a blog post on how to do this; I recommend checking it out.
Important: Create an OAuth client in the Tailscale console with Devices Core and Auth Keys write scopes first (see the full post here).
Install the Tailscale operator using Helm:
Add the Tailscale Helm chart repository (https://pkgs.tailscale.com/helmcharts) to your local Helm repositories:
helm repo add tailscale https://pkgs.tailscale.com/helmcharts
helm repo update
helm upgrade \
--install \
tailscale-operator \
tailscale/tailscale-operator \
--namespace=tailscale \
--create-namespace \
--set-string oauth.clientId="<client_id>" \
--set-string oauth.clientSecret="<client_secret>" \
--wait
This can also be made part of a helmfile template or an Argo deployment.
With the operator running, exposing a service is as simple as adding an annotation:
annotations:
tailscale.com/expose: "true"
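For context, here's a minimal sketch of how that annotation sits on a Service manifest (the name and port are hypothetical, not my actual Traefik values):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  annotations:
    tailscale.com/expose: "true"   # the operator picks this up and exposes the service on the tailnet
spec:
  type: LoadBalancer
  ports:
    - port: 443
```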
From the Tailscale console:
One useful thing is to set up ACLs to restrict what autogroup:shared and specific tags (the operator is a tagged device) can access.
This ensures users only have access to the services you explicitly want to share.
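A rough sketch of what that could look like in the tailnet policy file (HuJSON; the tag name and port here are assumptions - adjust to your setup):

```json
{
  "acls": [
    // shared-in users may only reach HTTPS on the operator's tagged devices
    {"action": "accept", "src": ["autogroup:shared"], "dst": ["tag:k8s-operator:443"]}
  ]
}
```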
The benefits are essentially:
I found some old hard drives from my campus days (surprisingly still working) with a bunch of songs. Rather than letting these sit idle, figured it was time to make this collection accessible on the go.
So I went looking for something I could use and found Navidrome.
First, create two datasets through the TrueNAS GUI (I prefer this over regular folders for better permission control):
navidrome
├── data
└── music
TrueNAS Scale's Electric Eel release moved to Docker for apps (instead of Kubernetes)
Install:
This part assumes you already have Traefik set up as a reverse proxy with cert-manager and HTTPS redirect middleware configured.
If you're starting fresh, you'll want to get those pieces in place first.
Once Navidrome is running, we need to make it accessible through a reverse proxy. This requires two pieces:
An external service definition:
apiVersion: v1
kind: Service
metadata:
name: navidrome
namespace: routes
spec:
ports:
- port: <port>
targetPort: <port>
type: ExternalName
externalName: <Ip>
HTTP redirect route:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: navidrome-redirect
namespace: routes
spec:
entryPoints:
- web
routes:
- match: Host(`<host>`)
kind: Rule
middlewares:
- name: https-redirect
services:
- name: noop@internal
kind: TraefikService
And the actual route over HTTPS:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: navidrome
namespace: routes
spec:
entryPoints:
- websecure
routes:
- match: Host(`<host>`)
kind: Rule
services:
- name: navidrome
port: <port>
scheme: http
tls:
secretName: <ssl cert secret>
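For reference, the https-redirect middleware referenced in the redirect route (assumed already configured per the prerequisites) is just a redirectScheme middleware, along these lines:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: https-redirect
  namespace: routes
spec:
  redirectScheme:
    scheme: https
    permanent: true
```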
While Navidrome's web interface is solid, one of its strengths is Subsonic API compatibility.
This means you can use various Subsonic-compatible apps as front-ends for your music collection.
I chose Symfonium as my client and it's been impressive.
Went for a photo walk, and with gapless playback and a smart queue that keeps playing similar tracks (like Spotify's song radio), it just works; I forgot it was a self-hosted thing. Also, thanks to Tailscale, I can stream my music anywhere without noticing any difference from other services.
Now this isn't a replacement, and I will still most definitely keep my Spotify playlists.
Upgrading K3s is remarkably straightforward. You just use the same install command you used when first creating your cluster. For me that's:
NB: This is a single node k3s cluster
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_EXEC='--disable=traefik,disable-kube-proxy,disable-network-policy --flannel-backend=none --write-kubeconfig-mode=644 --etcd-expose-metrics true' sh -
Looking at the install command, you might notice several flags:
--disable=traefik: Disabled because I'm running my own managed version of Traefik
--disable-kube-proxy, --flannel-backend=none: Both disabled as Cilium handles these functions (CNI and service networking)
--write-kubeconfig-mode=644: Sets readable permissions on the kubeconfig file right from the start
--etcd-expose-metrics true: Exposes etcd metrics.

In my case, the output showed:
[INFO] Finding release for channel stable
[INFO] Using v1.31.5+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.31.5+k3s1/sha256sum-amd64.txt
[INFO] Skipping binary downloaded, installed k3s matches hash
[INFO] Skipping installation of SELinux RPM
...
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
...
After the upgrade, your workloads should continue running without interruption. You can verify the new version with:
k3s --version
That's really all there is to it - K3s keeps things refreshingly simple.
I have been setting up Immich, i.e. destroying and recreating it until I got my photo organization and volumes just right.
The next step after testing was a mass import. The CLI tool makes mass imports pretty straightforward, but I kept having certain files fail to import, especially when trying to upload large files (like 2GB).
I would generally get a very generic upload failed error.
Everything was working fine for regular photos, and my phone syncing without issues.
Did some digging and I found that the issue was due to Traefik timeouts; see the Traefik page here.
So I adjusted the timeouts in the config i.e.:
ports:
  web:
    redirections:
      # ...
  websecure:
    tls:
      enabled: true
    transport:
      respondingTimeouts:
        readTimeout: 20m
        writeTimeout: 20m # 20 minutes - adjust based on your needs
And while at it updated traefik to v3.3.3 (helm chart 34.3.0)
and changed from redirectTo
to redirections
(diff):
ports:
web:
- redirectTo:
- port: websecure
+ redirections:
+ entryPoint:
+ to: websecure
+ scheme: https
+ permanent: true
Large file uploads now work just peachy.
PPS: Planning to move the days of homelab to a dedicated page and reduce the amount of logs on the homepage
While browsing and editing large RAW photos over SMB, I noticed some high latency and got to wondering if it could be reduced.
Some research later I found something called bufferbloat.
I found this analogy of bufferbloat from Waveform.com that's worth sharing here:
Think of your internet connection like a sink with a narrow drain (your bandwidth limit). When someone downloads a large file, it's like dumping a bucket of water into the sink. Now if you try to do something time-sensitive - like gaming or a video call - those packets are like drops of oil trying to get through a sink full of water. They have to wait for all that "water" to drain first, causing lag and delays. That's bufferbloat.
Check out Waveform's bufferbloat test tool, and a more detailed ELI5 explanation.
In OPNsense, we can address this using traffic shaping: setting up pipes and queues with FlowQueue-CoDel, which ensures that packets from small flows are sent in a timely fashion while large flows share the bottleneck's capacity.
I initially found guides for pfSense, but OPNsense has its own really nice guide on how to address bufferbloat here
Before messing around with creating pipes, queues, and rules, it's advisable to run some tests to establish a baseline.
My initial bufferbloat grade was a B, with some concerning latency spikes:
as seen in the screenshot below:
After setting up and tuning the traffic shaping rules:
as seen in the screenshot below:
So I traded some raw speed for consistency.
After seeing improvements on the WAN side, I got more specific with my internal network.
I set it up for:
I really just needed it for low latency, especially when editing RAW files over SMB, and it did help; not a drastic difference, but noticeable. So I disabled the WAN-side optimizations.
Today was all about updates.
What started as just routine maintenance turned into a reminder of why we keep backups (and backups of backups).
After the updates and reboots, OPNsense decided to forget about its VLANs and misconfigure WAN and LAN interfaces.
This cascaded into:
Everything seemed fixed until I noticed I still had no internet.
OPNsense looked good, but the DNS server was unreachable despite appearing online and healthy. So basically "everything's fine but nothing works."
After some troubleshooting, replacing the VM's NIC, and re-assigning it the same static IP in OPNsense, the node was reachable again and my DNS was working.
Most services recovered quickly once DNS and OPNsense were back, though TrueNAS took its time and couldn't update catalogs, so I added Quad9 as a DNS fallback for next time (because there probably will be one).
And the Kubernetes clusters (both k3s and the HA one)? Everything just came back online like nothing had happened, without needing to touch a single node (including the 2 HAProxy nodes).
At least now I have a fallback plan for DNS issues, and another validation of why Kubernetes is great for self-healing infrastructure.]]>
Spent today setting up Homepage to organize access to all the URLs I currently have.
Current dashboard
The base setup is defined in the settings.yaml file and looks like this:
title: <Your title name>
theme: dark
color: slate
background:
image: <image>
blur: sm
saturate: 50
brightness: 50
opacity: 50
....
<rest of config including layouts>
Services are grouped logically, making it easy to find what you need.
For icons, you can use:
When using Material Design Icons or Simple Icons, prefix the icon name accordingly, e.g. mdi-<icon-name> or si-<icon-name>.
An example definition in the services.yaml:
- Hypervisor:
- Avalon:
icon: proxmox.svg
href: <proxmox url>
description: Main Compute Node
- Aegis:
icon: proxmox.svg
href: <proxmox url>
description: Network Node
- Network:
- Vale:
icon: opnsense.svg
href: <url>
description: OPNsense
- DNS:
icon: adguard.svg
href: <url>
description: DNS Management
The page auto-refreshes as you edit the config, so you see the changes in real time.
You can also add widgets, e.g. showing the time:
- datetime:
text_size: xl
format:
timeStyle: short
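Another widget worth mentioning is the search bar (the provider here is just an example):

```yaml
- search:
    provider: duckduckgo   # or google, bing, custom
    target: _blank
```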
And bookmarks for quick access:
- Blogs:
- Local Blog:
- abbr: Blog
href: <your blog url>
I still need to:
For now, it's a functional start that makes navigating between services easier, and the cool thing is you can set it as the default browser landing page.
With Minio running on our new storage, we've now got S3-compatible storage. First use case: moving VM state off the local filesystem.
First, head to Minio and create an access key - you'll need both the Access Key and Secret Key for the next steps.
Download the file or copy the created keys (once you close the popup you can no longer copy the keys).
After installing the AWS CLI (see the install guide here), we need to configure it:
aws configure
AWS Access Key ID [None]: <Key ID copied from minio>
AWS Secret Access Key [None]: <Secret key copied from minio>
Default region name [None]: eu-west-1
Default output format [None]:
# Set signature version
aws configure set default.s3.signature_version s3v4
Add your Minio endpoint to the config (if you are using a reverse proxy make sure the URL is pointing to port 9000, not the GUI port):
# ~/.aws/config
[default]
region = eu-west-1
endpoint_url=https://minio-s3.<your domain name> <- Add this line
s3 =
signature_version = s3v4
Let's make sure everything works:
# Create a bucket
aws s3 mb s3://test-minio-bucket
make_bucket: test-minio-bucket
# List buckets
aws s3 ls
2025-01-29 20:33:06 test-minio-bucket
# Remove the test bucket
aws s3 rb s3://test-minio-bucket
remove_bucket: test-minio-bucket
To use this with Terragrunt, you'll need to add some config to the remote_state
block. (See example repo here).
Since we're using Minio rather than actual S3, we need to disable several S3-specific features - otherwise Terragrunt will try to do S3 specific modifications to the bucket settings:
# remote_state block
remote_state {
backend = "s3"
config = {
bucket = "terragrunt-state"
key = "${path_relative_to_include()}/baldr.tfstate"
region = local.aws_region
endpoint = local.environment_vars.locals.endpoint_url
# Skip various S3 features we don't need
skip_bucket_ssencryption = true
skip_bucket_public_access_blocking = true
skip_bucket_enforced_tls = true
skip_bucket_root_access = true
skip_credentials_validation = true
force_path_style = true
}
generate = {
path = "backend.tf"
if_exists = "overwrite_terragrunt"
}
}
Note: The endpoint needs to be set in the Terragrunt config even if it's in the AWS config file - I found Terragrunt doesn't pick it up from there.
This is a continuation of this post - the final piece arrived today: another 12TB drive to complete the storage setup.
Created a new pool using RAIDZ1 with a 256GB SSD as ZFS L2ARC read-cache. Unlike the previous mirror setup from my testing, RAIDZ1 gives me more usable space while still protecting against a single drive failure.
Made some datasets:
backups: For system and VM backups
k8s: Kubernetes persistent storage
iscsi: Testing ground for VM storage over iSCSI

I will create more datasets as the need arises.
Swapped out the built-in fan in the drive enclosure with a Noctua - because if you're running 24/7 storage, noise matters. The difference is noticeable (Noctua fans are awesome).
Starting the Minio setup - finally getting back to what started this whole thing.
This started as a simple "let's set up a container registry".
Instead, it turned into a deep dive into USB architecture, storage performance, and a lesson in why not all USB controllers are created equal, that lasted a whole weekend.
I needed to set up a container registry, which meant thinking about storage. The options were:
The registry actually supports using S3 as a backend for image storage. But cloud providers aren't in the budget, so the next best thing is Minio. I had recently got my hands on two 12TB drives (waiting on a third for RAIDZ1), so this seemed like a perfect use case.
But there was a catch - I'm running mini PCs. No fancy drive bays or PCIe expansion slots, just USB. And this marks the beginning of a very long rabbit hole learning about USB and PCIe devices and things I had no idea about.
The first bit of good news was that my external enclosure supports USB 3.2 Gen 2 and UASP (USB Attached SCSI Protocol). If you're like me, you had no idea what UASP was either.
You can see this with lsusb -t:
/: Bus 10.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 10000M
| Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 10000M
| Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 10000M
| Port 2: Dev 4, If 0, Class=Mass Storage, Driver=uas, 10000M
| Port 4: Dev 5, If 0, Class=Mass Storage, Driver=uas, 10000M
See that Driver=uas?
Digging deeper into the USB setup with lspci -k | grep -i usb we see:
06:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #3
06:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #4
07:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #8
07:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #5
07:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #6
Okay so I have USB 4 ports, cool, but little did I know these weren't all created equal...
Getting TrueNAS set up was straightforward enough - grab ISO, upload to Proxmox, create VM (using q35, UEFI, disabled memory ballooning), install.
Now the "fun" started when trying to get the drives to the VM.
First attempt: pass through the USB controller as a PCIe device.
So looking at our lsusb -t command output, we have:
To find which PCI device controls a USB bus, we run readlink /sys/bus/usb/devices/usb10, which shows:
../../../devices/pci0000:00/0000:00:08.3/0000:07:00.4/usb10
This path shows:
IOMMU groups show which devices can be passed through independently.
So checking IOMMU groups with find /sys/kernel/iommu_groups/ -type l | grep "0000:07:00":
/sys/kernel/iommu_groups/26/devices/0000:07:00.0
/sys/kernel/iommu_groups/27/devices/0000:07:00.3
/sys/kernel/iommu_groups/28/devices/0000:07:00.4
So if I understand this correctly, each device being in its own IOMMU group means it can be passed through independently.
I passed through the device in group 28 with "all functions" enabled. The drives disappeared from Proxmox (expected), but so did a bunch of other stuff (kind of makes sense given the PCI thing has other things attached to it not just the USB controller).
However the VM wouldn't even boot (interesting):
error writing '1' to '/sys/bus/pci/devices/0000:07:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:07:00.0', but trying to continue as not all devices need a reset
error writing '1' to '/sys/bus/pci/devices/0000:07:00.3/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:07:00.3', but trying to continue as not all devices need a reset
error writing '1' to '/sys/bus/pci/devices/0000:07:00.4/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:07:00.4', but trying to continue as not all devices need a reset
kvm: ../hw/pci/pci.c:1633: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1
Looking at what "all functions" is about, it basically means "pass everything about this device through".
So I unchecked "all functions" (which gave the host some of the previous "disappeared" devices) and the VM started.
Created a pool, everything seemed fine until:
Critical
Pool test state is SUSPENDED: One or more devices are faulted in response to IO failures
Well, that's not good, this usually suggests disk or hardware issues.
The TrueNAS UI also shows the pool is in an unhealthy state.
I go to the console and check the pool status with zpool status test, then check the errors in dmesg with dmesg | grep -i error. The errors point at sdc, both reads and writes.
I try smartctl and lsblk to see what's up with sdc, and the drive has disappeared (this also happened for the other drives later).
No sdc, yet the errors are full of references to sdc. This suggests the drive was present (as sdc) when the errors started occurring, but has since completely disappeared from the system - which could mean:
I do an ls -l /dev/disk/by-id/
and confirm the drives are still there, using different names.
usb-ASMT_<serial>-0:0 -> sde
usb-ASMT_<serial>-0:0 -> sdf
usb-ASMT_<serial>-0:0 -> sdg
So remember when I said all USB ports and USB controllers are not made the same?
After hours of debugging, it turns out that under heavy load the USB controller on the port I was using would "crash" and the drives would "shift around", getting remounted with different paths. (There's more to this.)
Not exactly ideal for a storage system.
So I went looking for ways to have TrueNAS use the drives' wwn identifiers instead of the dev path, but could not find anything that helped.
Time for Plan B. Instead of passing through the controller, pass through individual drives.
Install lshw and run lshw -class disk -class storage to see the drives and their serial numbers.
Then do an ls -l /dev/disk/by-id and copy the wwn path or the ata path of the drives.
1. Get drive info:
➜ ~ ls -l /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root 9 Jan 26 20:16 ata-<name-with-serial-number> -> ../../sde
lrwxrwxrwx 1 root root 9 Jan 26 20:16 ata-<name-with-serial-number> -> ../../sdd
lrwxrwxrwx 1 root root 9 Jan 26 20:16 ata-<name-with-serial-number> -> ../../sdf
2. Set up SCSI drives:
Then, using the wwn path or the ata path, set up the SCSI drives with the following commands:
qm set <vm-id> -scsi1 /dev/disk/by-id/ata-<name-with-serial-number>
qm set <vm-id> -scsi2 /dev/disk/by-id/ata-<name-with-serial-number>
qm set <vm-id> -scsi3 /dev/disk/by-id/ata-<name-with-serial-number>
3. Add serial numbers to config:
➜ ~ vim /etc/pve/qemu-server/<vm-id>.conf
... other config
scsi1: /dev/disk/by-id/ata-<name-with-serial-number>,size=11176G,serial=<serial-number>
scsi2: /dev/disk/by-id/ata-<name-with-serial-number>,size=11176G,serial=<serial-number>
scsi3: /dev/disk/by-id/ata-<name-with-serial-number>,size=250059096K,serial=<serial-number>
... other config
On the Proxmox GUI, you should see the drives attached and the serial numbers set under the Hardware tab.
And then I found that moving the USB cable to the USB-C port fixed the flakiness seen when using the other, albeit still USB 4, ports.
To make sure this works, I did some stress tests with fio, and well, the results speak for themselves:
Reading a 10G file in 3.89 seconds (2632MiB/s throughput):
fio --name=test --rw=read --bs=1m --size=10g --filename=./testfile
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][75.0%][r=2639MiB/s][r=2638 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=7607: Sun Jan 26 05:46:15 2025
read: IOPS=2632, BW=2632MiB/s (2760MB/s)(10.0GiB/3890msec)
clat (usec): min=352, max=1613, avg=379.02, stdev=30.05
lat (usec): min=352, max=1613, avg=379.07, stdev=30.06
clat percentiles (usec):
| 1.00th=[ 359], 5.00th=[ 363], 10.00th=[ 367], 20.00th=[ 367],
| 30.00th=[ 371], 40.00th=[ 371], 50.00th=[ 375], 60.00th=[ 379],
| 70.00th=[ 383], 80.00th=[ 388], 90.00th=[ 396], 95.00th=[ 404],
| 99.00th=[ 441], 99.50th=[ 465], 99.90th=[ 562], 99.95th=[ 1090],
| 99.99th=[ 1467]
bw ( MiB/s): min= 2604, max= 2650, per=100.00%, avg=2633.71, stdev=16.18, samples=7
iops : min= 2604, max= 2650, avg=2633.71, stdev=16.18, samples=7
lat (usec) : 500=99.79%, 750=0.12%, 1000=0.03%
lat (msec) : 2=0.07%
cpu : usr=0.28%, sys=99.49%, ctx=86, majf=0, minf=266
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=2632MiB/s (2760MB/s), 2632MiB/s-2632MiB/s (2760MB/s-2760MB/s), io=10.0GiB (10.7GB), run=3890-3890msec
Next, a bigger file, adding --direct=1 to the command to bypass the RAM cache.
Reading a 50G file in 129 seconds (396MiB/s throughput):
fio --name=test --rw=read --bs=1m --size=50g --filename=./bigtest --direct=1
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 51200MiB)
Jobs: 1 (f=1): [R(1)][100.0%][r=267MiB/s][r=267 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=9526: Sun Jan 26 05:54:51 2025
read: IOPS=396, BW=396MiB/s (415MB/s)(50.0GiB/129239msec)
clat (usec): min=133, max=180737, avg=2522.21, stdev=3642.09
lat (usec): min=133, max=180738, avg=2522.37, stdev=3642.10
clat percentiles (usec):
| 1.00th=[ 149], 5.00th=[ 194], 10.00th=[ 212], 20.00th=[ 260],
| 30.00th=[ 799], 40.00th=[ 1500], 50.00th=[ 1876], 60.00th=[ 2245],
| 70.00th=[ 2737], 80.00th=[ 3458], 90.00th=[ 5997], 95.00th=[ 8029],
| 99.00th=[ 12387], 99.50th=[ 16712], 99.90th=[ 34341], 99.95th=[ 45876],
| 99.99th=[116917]
bw ( KiB/s): min=51200, max=2932736, per=100.00%, avg=406019.97, stdev=193023.81, samples=258
iops : min= 50, max= 2864, avg=396.50, stdev=188.50, samples=258
lat (usec) : 250=17.75%, 500=11.38%, 750=0.74%, 1000=1.46%
lat (msec) : 2=21.94%, 4=30.99%, 10=13.34%, 20=2.05%, 50=0.31%
lat (msec) : 100=0.02%, 250=0.02%
cpu : usr=0.14%, sys=10.08%, ctx=37023, majf=0, minf=269
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=51200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=396MiB/s (415MB/s), 396MiB/s-396MiB/s (415MB/s-415MB/s), io=50.0GiB (53.7GB), run=129239-129239msec
Okay, I could live with those numbers as long as they are stable and consistent.
So I do 5 runs of reading an 800G file (which includes a write during the initial file creation) and writing a 900G file, with a mix of both reading and writing at the same time.
The idea is to see if something breaks, so I'm also monitoring the logs and drive temps.
I'll probably never experience this kind of load in one go outside of a resilver, so if this is stable I'm good with that.
I leave these running and go have a drink with some friends; it's the weekend after all.
Reading an 800G file:
fio --name=read --rw=read --bs=1m --size=800g --filename=./bigtest
read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=310MiB/s][r=310 IOPS][eta 00m:01s]
read: (groupid=0, jobs=1): err= 0: pid=10516: Sun Jan 26 21:32:56 2025
read: IOPS=255, BW=256MiB/s (268MB/s)(800GiB/3205833msec)
clat (usec): min=356, max=303293, avg=3911.61, stdev=6703.53
lat (usec): min=356, max=303293, avg=3911.78, stdev=6703.52
clat percentiles (usec):
| 1.00th=[ 416], 5.00th=[ 429], 10.00th=[ 437], 20.00th=[ 457],
| 30.00th=[ 506], 40.00th=[ 603], 50.00th=[ 676], 60.00th=[ 3982],
| 70.00th=[ 4555], 80.00th=[ 5276], 90.00th=[ 11731], 95.00th=[ 12387],
| 99.00th=[ 20579], 99.50th=[ 24773], 99.90th=[ 77071], 99.95th=[125305],
| 99.99th=[198181]
bw ( KiB/s): min=43008, max=555008, per=100.00%, avg=261727.05, stdev=73610.29, samples=6411
iops : min= 42, max= 542, avg=255.54, stdev=71.87, samples=6411
lat (usec) : 500=29.19%, 750=21.95%, 1000=0.45%
lat (msec) : 2=1.53%, 4=7.32%, 10=24.63%, 20=13.79%, 50=0.95%
lat (msec) : 100=0.12%, 250=0.07%, 500=0.01%
cpu : usr=0.10%, sys=13.54%, ctx=402475, majf=0, minf=269
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=819200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=256MiB/s (268MB/s), 256MiB/s-256MiB/s (268MB/s-268MB/s), io=800GiB (859GB), run=3205833-3205833msec
and writing a 900G file:
fio --name=write --rw=write --bs=1m --size=900g --filename=./big2testfile
write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=216MiB/s][w=216 IOPS][eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=24687: Sun Jan 26 23:32:36 2025
write: IOPS=179, BW=180MiB/s (188MB/s)(900GiB/5127844msec); 0 zone resets
clat (usec): min=118, max=400029, avg=5542.57, stdev=5208.15
lat (usec): min=120, max=400047, avg=5561.90, stdev=5208.03
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
| 30.00th=[ 5], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
| 70.00th=[ 6], 80.00th=[ 6], 90.00th=[ 7], 95.00th=[ 8],
| 99.00th=[ 22], 99.50th=[ 35], 99.90th=[ 87], 99.95th=[ 101],
| 99.99th=[ 144]
bw ( KiB/s): min= 6144, max=2981888, per=100.00%, avg=184084.36, stdev=65812.35, samples=10254
iops : min= 6, max= 2912, avg=179.74, stdev=64.27, samples=10254
lat (usec) : 250=0.10%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=1.37%, 10=95.23%, 20=2.08%, 50=0.88%
lat (msec) : 100=0.20%, 250=0.07%, 500=0.01%
cpu : usr=0.45%, sys=3.40%, ctx=931630, majf=0, minf=16
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,921600,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s), io=900GiB (966GB), run=5127844-5127844msec
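The simultaneous read/write runs mentioned above can also be done in a single fio invocation using the mixed rw mode; a sketch (file name and size are placeholders, --rwmixread sets the read share):

```shell
# Sequential mixed workload: roughly 50% reads, 50% writes in one job
fio --name=mixed --rw=rw --rwmixread=50 --bs=1m --size=100g --filename=./mixtest
```

This reports separate READ and WRITE lines in the run status group, which makes it easy to compare against the standalone runs.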
From the TrueNAS GUI
In general, I think I learned a whole lot about USB, things I had no idea about before.
I used this loop to check drive temps:
for drive in sda sdb sdc sdd; do echo "=== /dev/$drive ==="; smartctl -A /dev/$drive | grep -i temp; done
It's hard to see Argo CD mentioned without GitOps coming up (though to be fair, that's the point of Argo).
GitOps is a way to manage your Kubernetes clusters where your desired state lives in Git, and tools like Argo CD continuously sync this state to your cluster.
Think of it like "infrastructure as code" but for Kubernetes resources.
Well for starters, given how often I kept rebuilding everything from scratch, being able to just point Argo CD at my repo and have it apply everything was 👌.
Anyway, why GitOps?
GitOps helps keep your desired state versioned and reviewable in Git, makes rebuilding a cluster repeatable, and leaves an audit trail of every change.
Before jumping into it, you need your cluster in a "usable" state, i.e. secrets handled with sops (decrypt and pipe to kubectl).
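The decrypt-and-pipe step is just that; a sketch, assuming sops is set up with an age key and using a hypothetical file name:

```shell
# Decrypt a sops-encrypted Secret manifest and apply it directly,
# so the plaintext never lands on disk
sops -d secrets/cluster-secrets.enc.yaml | kubectl apply -f -
```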
It's a tree structure (of sorts) where you have one root application that points to all your other applications.
When Argo syncs this root app, it creates and manages everything defined in your repo.
I ended up preferring Helm charts for this, though other methods exist.
Assuming you've already installed Argo CD:
First, grab the initial admin password:
k -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
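For context, Kubernetes stores Secret values base64-encoded, which is why the jsonpath output gets piped through base64 -d; the decoding step alone looks like this (with a made-up password):

```shell
# Secret data fields are base64-encoded; decoding reverses it
encoded=$(printf 'hunter2' | base64)
printf '%s' "$encoded" | base64 -d
# prints: hunter2
```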
Port-forward so you can authenticate via the CLI (you can also access the UI this way):
k -n argocd port-forward services/argocd-server 8080:80
Log in on the CLI:
argocd login localhost:8080
Change the default password:
argocd account update-password --new-password "<your password>"
Add your git repo:
argocd repo add [email protected]:mrdvince/<your repo>.git --ssh-private-key-path <your ssh key path>
You can create the root app either via CLI:
argocd app create apps \
--dest-namespace argocd \
--dest-server https://kubernetes.default.svc \
--repo [email protected]:mrdvince/<your repo>.git \
--path apps/argo_apps
Then sync it:
argocd app sync apps
Or apply a manifest with kubectl:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: [email protected]:mrdvince/<your repo>.git
    targetRevision: HEAD
    path: apps/argo_apps/
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true
      - ApplyOutOfSyncOnly=true
Note: path: apps/argo_apps/ is relative to the base of the repo.
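For reference, each child app in that directory is just another Application manifest; a hypothetical one for a Traefik deployment living at apps/traefik in the same repo might look like:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: traefik
  namespace: argocd
spec:
  project: default
  source:
    repoURL: [email protected]:mrdvince/<your repo>.git
    targetRevision: HEAD
    path: apps/traefik
  destination:
    server: https://kubernetes.default.svc
    namespace: traefik
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

The root app picks this up automatically once it lands in the repo.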
Once everything is set up, the workflow is simple: commit and push changes to the repo, then sit back and watch as Argo CD creates and manages all the applications defined there.
You should then be able to see a dashboard that looks like my screenshot below
After setting up and debugging various parts, I thought I'd share some basic tips that have helped me along the way.
Here's how to merge multiple kubeconfig files:
KUBECONFIG=~/.kube/config:~/.kube/config.cluster2 kubectl config view --flatten > ~/.kube/config.merged
cp ~/.kube/config ~/.kube/config.backup
mv ~/.kube/config.merged ~/.kube/config
You can then rename contexts for better clarity:
kubectl config rename-context default prism
kubectl config rename-context kubernetes-admin@kubernetes atlas
And set proper permissions on your kube config:
chmod 600 ~/.kube/config
If pods aren't scheduling on control plane nodes (I'm using 3 control plane nodes), check for taints:
kubectl get nodes -o json | jq '.items[].spec.taints'
To remove control-plane taints if needed:
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
In general, most issues can be found and solved by following a pattern:
An example of a certificate issue:
Follow the chain of resources when debugging cert-manager:
kubectl get certificate -n argocd
kubectl -n argocd describe certificate argocd-certificate
kubectl -n argocd describe certificaterequests.cert-manager.io argocd-certificate-1
kubectl -n argocd describe order argocd-certificate-1-1494176820
kubectl -n cert-manager logs pods/cert-manager-<some-hash>
Other times, just deleting a resource and letting it get recreated solves the issue. For example, when switching from staging to production Let's Encrypt, you may need to delete the old secrets or orders so they get recreated:
e.g. kubectl -n argocd delete secrets argocd-tls
When services aren't reachable:
- Use dig or nslookup to verify DNS resolution
- Use tcpdump and netstat for network debugging:
# Check listening ports
netstat -tlpn
# Monitor ARP requests
tcpdump -i any -n arp
If setting up a new cluster with kubeadm (not on the cloud), use MetalLB or Cilium to hand out load balancer IP addresses.
If using Cilium, here's a sample configuration:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "lb-pool"
spec:
  blocks:
    - cidr: "192.168.30.140/30"
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: cilium-l2-announce
spec:
  externalIPs: true
  loadBalancerIPs: true
  interfaces:
    - eth0
All services run through Traefik, so a few load balancer IPs are plenty.
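To confirm Cilium actually handed out an address from the pool, you can list LoadBalancer services; a sketch:

```shell
# EXTERNAL-IP should show an address from the 192.168.30.140/30 block
kubectl get svc -A | grep LoadBalancer
```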
To debug Argo CD applications, you can render out the chart:
helm template . -f values.yaml > rendered-app.yaml
And for helmfile:
helmfile template > rendered.yaml