Debugging and General tips (Day 10)

Common debugging patterns and tips.

Photo by Christophe Hautier / Unsplash

After setting up and debugging various parts, I thought I'd share some basic tips that have helped me along the way.

Managing Multiple Clusters

Here's how to merge multiple kubeconfig files:

KUBECONFIG=~/.kube/config:~/.kube/config.cluster2 kubectl config view --flatten > ~/.kube/config.merged
cp ~/.kube/config ~/.kube/config.backup
mv ~/.kube/config.merged ~/.kube/config

You can then rename contexts for better clarity:

kubectl config rename-context default prism
kubectl config rename-context kubernetes-admin@kubernetes atlas

And restrict permissions on your kubeconfig, since it contains cluster credentials:

chmod 600 ~/.kube/config
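If you merge configs often, the steps above can be wrapped in a small helper. This is only a sketch: the function names `kubeconfig_path` and `merge_kubeconfigs` are my own, and it assumes your live config is at `~/.kube/config`. It takes the backup before touching anything and refuses to overwrite the config with an empty merge result.

```shell
# Join existing kubeconfig files with ':' as KUBECONFIG expects; missing files are skipped.
kubeconfig_path() (
  joined=""
  for f in "$@"; do
    [ -f "$f" ] || continue
    joined="${joined:+$joined:}$f"
  done
  printf '%s\n' "$joined"
)

# Back up ~/.kube/config first, then replace it with the flattened merge.
merge_kubeconfigs() (
  target="$HOME/.kube/config"
  merged="$(mktemp)" || exit 1
  cp "$target" "$target.backup" || exit 1
  KUBECONFIG="$(kubeconfig_path "$target" "$@")" kubectl config view --flatten > "$merged" || exit 1
  # Refuse to overwrite the live config with an empty result.
  [ -s "$merged" ] || { echo "merge produced an empty file" >&2; exit 1; }
  mv "$merged" "$target" && chmod 600 "$target"
)
```

Usage would look like `merge_kubeconfigs ~/.kube/config.cluster2`, followed by `kubectl config get-contexts` to confirm both clusters show up.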

Node Scheduling Issues

If pods aren't scheduling on control plane nodes (I'm using 3 control plane nodes), check for taints:

kubectl get nodes -o json | jq '.items[].spec.taints'

To remove control-plane taints if needed:

kubectl taint nodes --all node-role.kubernetes.io/control-plane-
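The jq output above lists taints without saying which node they belong to, and `--all` removes the taint everywhere. A slightly more targeted variant might look like this; the function names are made up for illustration, and the second helper assumes the taint effect is `NoSchedule`, which is what kubeadm applies by default.

```shell
# List each node next to the keys of its taints (no jq needed).
list_node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Remove the control-plane NoSchedule taint from a single node only.
untaint_control_plane() {
  node="${1:-}"
  [ -n "$node" ] || { echo "usage: untaint_control_plane <node>" >&2; return 2; }
  kubectl taint nodes "$node" node-role.kubernetes.io/control-plane:NoSchedule-
}
```

Untainting one node at a time makes it easier to keep some control plane nodes free of regular workloads.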

Troubleshooting Tips

Most issues can be tracked down by following the same pattern:

  • Get the resource
  • Describe it
  • Follow the trail of related resources
  • Check the related logs
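The first steps of that pattern can be sketched as one helper. `inspect_resource` is a hypothetical name, not a kubectl feature; it runs get, describe, and an events query for a single object, which usually tells you which related resource to look at next.

```shell
# Walk the get -> describe -> events trail for one resource.
# Usage: inspect_resource <namespace> <kind> <name>
inspect_resource() {
  ns="$1"; kind="$2"; name="$3"
  kubectl -n "$ns" get "$kind" "$name" -o wide
  kubectl -n "$ns" describe "$kind" "$name"
  # Events often name the next resource to look at.
  kubectl -n "$ns" get events \
    --field-selector "involvedObject.name=$name" \
    --sort-by=.lastTimestamp
}
```

For example, `inspect_resource argocd certificate argocd-certificate` would cover the first two commands in the certificate walkthrough.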

An example of a certificate issue:

Certificate Issues

Follow the chain of resources when debugging cert-manager:

kubectl get certificate -n argocd
kubectl -n argocd describe certificate argocd-certificate
kubectl -n argocd describe certificaterequests.cert-manager.io argocd-certificate-1
kubectl -n argocd describe order argocd-certificate-1-1494176820

kubectl -n cert-manager logs pods/cert-manager-<some-hash>
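To see the whole chain at a glance before describing individual objects, something like the following may help. The function names are my own, and `cert_manager_logs` assumes the controller runs as a Deployment named `cert-manager` in the `cert-manager` namespace, which may differ in your install.

```shell
# Show the whole cert-manager chain for a namespace in one call.
cert_chain() {
  kubectl -n "$1" get certificate,certificaterequest,order,challenge
}

# Tail the controller logs without looking up the pod hash.
# Assumes a Deployment named "cert-manager" in namespace "cert-manager".
cert_manager_logs() {
  kubectl -n cert-manager logs deploy/cert-manager --tail="${1:-50}"
}
```

`cert_chain argocd` lists every link in the chain, so it is quick to spot which one is stuck.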

Other times, simply deleting a resource and letting it be recreated solves the issue. For example, when switching from staging to production Let's Encrypt, you may need to delete the old secrets or orders so that they are recreated:

kubectl -n argocd delete secrets argocd-tls

Network Debugging

When services aren't reachable:

  • Check firewall rules and network policies between VLANs
  • Use dig or nslookup to verify DNS resolution
  • Verify LoadBalancer IP assignments
  • Use tcpdump and netstat for network debugging:

# Check listening ports
netstat -tlpn

# Monitor ARP requests
tcpdump -i any -n arp  
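For the "verify LoadBalancer IP assignments" step, a quick filter can show which services are still waiting for an address. This is a sketch with a made-up function name, and it relies on the default column layout of `kubectl get svc -A` (namespace, name, type, cluster IP, external IP).

```shell
# Print LoadBalancer services whose EXTERNAL-IP is still <pending>.
# Column positions assume the default `kubectl get svc -A` table layout.
pending_loadbalancers() {
  kubectl get svc -A --no-headers |
    awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}
```

An empty result means every LoadBalancer service has been given an IP, so the problem is more likely DNS, a firewall rule, or L2 announcement.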

LoadBalancer Configuration

If you're setting up a new cluster with kubeadm (not on a cloud provider), use MetalLB or Cilium to assign LoadBalancer IP addresses.

If using Cilium, here's a sample configuration:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "lb-pool"
spec:
  blocks:
    - cidr: "192.168.30.140/30"
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: cilium-l2-announce
spec:
  externalIPs: true
  loadBalancerIPs: true
  interfaces:
    - eth0

All services run through Traefik, so a few LoadBalancer IPs are plenty.
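As a quick sanity check on pool size, the number of addresses in a CIDR block follows directly from the prefix length. The helper name `cidr_size` is just for illustration. Whether the first and last address of the block are actually handed out depends on your Cilium version and pool settings, so treat the result as an upper bound.

```shell
# Number of addresses covered by an IPv4 prefix length.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

cidr_size 30   # prints 4: 192.168.30.140/30 spans .140 through .143
```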

Helm and Argo CD Debugging

To debug Argo CD applications, render the chart locally and inspect the output:

helm template . -f values.yaml > rendered-app.yaml

And for helmfile:

helmfile template > rendered.yaml
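Once the chart renders, it can be useful to compare the result with what is actually running. A minimal sketch, assuming you render with the same release name and values file that Argo CD uses (otherwise the diff will be noisy); `render_and_diff` is a hypothetical helper name.

```shell
# Render the chart and compare it with the live objects in the cluster.
# kubectl diff exits 1 when differences exist, which is not an error here.
# Usage: render_and_diff <release> [chart-dir] [values-file]
render_and_diff() {
  release="$1"; chart="${2:-.}"; values="${3:-values.yaml}"
  helm template "$release" "$chart" -f "$values" | kubectl diff -f -
}
```

Running `render_and_diff argocd` from the chart directory shows what would change without applying anything.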