Systemd and Proxmox (Day 3)
2025-01-12
It turns out that Proxmox’s quorum requirements are not as “simple” as I thought.
The initial solution of setting quorum expectations to 1 worked… sort of. Here’s what happened:
When a node booted up (remember it can’t initially “see” the other node), OPNsense would start (great!), provide DHCP and network connectivity (also great!), but then things got interesting. Once the network was up and the Proxmox nodes could talk to each other, the other VMs would fail to start with cryptic errors like:
generating cloud-init ISO
TASK ERROR: start failed: command ........ <very long kvm command> failed: got timeout
The issue? Trying to set the expected quorum votes to 1 fails once both nodes are available in the cluster.
The solution ended up being two-fold:
- Set the OPNsense VM to start last in the Start/Shutdown order settings
- Make the systemd service smarter about when to adjust quorum expectations
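The start order can also be set from the command line instead of the web UI. The following is a minimal sketch using `qm set --startup`; the helper name `set_startup_order` and the VM IDs in the usage line are hypothetical, so substitute the IDs from your own cluster.

```shell
#!/bin/sh
# Assign ascending startup order values to the given VM IDs, in the order listed.
# QM can be overridden (for example QM=echo) to print the commands instead of running them.
set_startup_order() {
    qm_bin="${QM:-qm}"
    order=1
    for vmid in "$@"; do
        "$qm_bin" set "$vmid" --startup "order=$order"
        order=$((order + 1))
    done
}
```

For example, `set_startup_order 101 102 100` would give a hypothetical OPNsense VM with ID 100 the highest order value, so it starts after the other two guests.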
Here’s the updated service that checks node count before trying to set quorum:
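[Unit]
Description=Set Proxmox quorum expectations
# Wait for the cluster stack before querying membership
After=corosync.service pve-cluster.service
Requires=corosync.service pve-cluster.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Give corosync a moment to find the other node, if it is reachable
ExecStartPre=/bin/sleep 10
# Only lower expected votes when this node is the sole cluster member
ExecStart=/bin/bash -c 'if [ "$(/usr/bin/pvecm status | grep "Nodes:" | awk "{print \$2}")" = "1" ]; then /usr/bin/pvecm expected 1; fi'

[Install]
WantedBy=multi-user.target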
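The inline one-liner is compact but awkward to read and to test by hand. As a sketch only, the same guard could live in a small standalone script that `ExecStart` points at. The function names and the `--apply` argument are hypothetical, and the parsing assumes `pvecm status` prints a `Nodes:` line as it does in the check above.

```shell
#!/bin/sh
# Sketch of the same guard as a standalone script (hypothetical helper names).
# Assumes `pvecm status` prints a line such as "Nodes:            1".

# Read `pvecm status` output on stdin and print the node count.
node_count() {
    awk '$1 == "Nodes:" { print $2; exit }'
}

# Succeed only when the status text on stdin reports exactly one node.
is_single_node() {
    [ "$(node_count)" = "1" ]
}

# Lower expected votes only when this node is alone in the cluster.
adjust_expected_votes() {
    if /usr/bin/pvecm status | is_single_node; then
        /usr/bin/pvecm expected 1
    fi
}

# Only touch the cluster when explicitly asked to.
if [ "${1:-}" = "--apply" ]; then
    adjust_expected_votes
fi
```

Keeping the parsing in `node_count` means it can be checked against saved `pvecm status` output without touching a live cluster.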
Now the cluster behaves as expected: VMs start properly whether we’re running on one node or two, and OPNsense starts when it should.
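To confirm the result after a reboot, the relevant fields of `pvecm status` can be condensed into one line. This is a rough sketch with a hypothetical helper name, and it assumes the `Nodes:`, `Expected votes:` and `Quorate:` lines appear in the output.

```shell
#!/bin/sh
# Read `pvecm status` output on stdin and print a one-line quorum summary.
quorum_summary() {
    awk '
        /^Nodes:/          { nodes = $2 }
        /^Expected votes:/ { expected = $3 }
        /^Quorate:/        { quorate = $2 }
        END { printf "nodes=%s expected=%s quorate=%s\n", nodes, expected, quorate }
    '
}
```

Run it as `pvecm status | quorum_summary` on either node, once with the other node powered off and once with both up.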