Storage (TrueNAS and USB controllers) (Day 13-15)

What started as needing storage turned into a deep dive into USB and storage performance, and lessons about things like UASP, unstable drive paths, and temperamental USB controllers.

This started as a simple "let's set up a container registry".

Instead, it turned into a weekend-long deep dive into USB architecture, storage performance, and a lesson in why not all USB controllers are created equal.

The Registry

I needed to set up a container registry, which meant thinking about storage. The options were:

  • Use storage on the K8s VMs and let Longhorn handle replication
  • Go with external storage

The registry actually supports using S3 as a backend for image storage. But cloud providers aren't in the budget, so the next best thing is MinIO. I had recently got my hands on two 12TB drives (waiting on a third for RAIDZ1), so this seemed like a perfect use case.
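
For context, this is roughly what pointing the registry at MinIO looks like - a minimal sketch assuming the standard CNCF Distribution registry, with the endpoint, bucket, and credentials as placeholders:

# minimal registry config.yml using the S3 storage driver against MinIO
# (host, bucket, and credentials below are placeholders)
cat > config.yml <<'EOF'
version: 0.1
storage:
  s3:
    regionendpoint: http://<minio-host>:9000
    bucket: registry
    accesskey: <access-key>
    secretkey: <secret-key>
    region: us-east-1
    secure: false
http:
  addr: :5000
EOF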

But there was a catch - I'm running mini PCs. No fancy drive bays or PCIe expansion slots, just USB. And this marks the beginning of a very long rabbit hole: learning about USB and PCIe devices and things I had no idea about.

UASP

The first bit of good news was that my external enclosure supports USB 3.2 Gen 2 and UASP (USB Attached SCSI Protocol). If you're like me, you had no idea what UASP was:

  • It's basically a protocol that gets you closer to native SCSI/SATA performance.
  • And it supports parallel commands and better command queuing, making it behave more like direct disk access.

You can see this with lsusb -t:

/:  Bus 10.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 10000M
    | Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 10000M
        | Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 10000M
        | Port 2: Dev 4, If 0, Class=Mass Storage, Driver=uas, 10000M
        | Port 4: Dev 5, If 0, Class=Mass Storage, Driver=uas, 10000M

See that Driver=uas?
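
One thing worth knowing: if the enclosure (or cable, or port) doesn't support UASP, the kernel silently falls back to the older usb-storage (Bulk-Only Transport) driver. A quick way to double-check, beyond lsusb -t:

# see which driver each USB storage device actually got (uas vs usb-storage)
dmesg | grep -iE "uas|usb-storage"
lsusb -t | grep -i "mass storage"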

USB Controllers

Digging deeper into the USB setup with lspci -k | grep -i usb we see:

06:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #3
06:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #4
07:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #8
07:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #5
07:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Rembrandt USB4 XHCI controller #6

Okay so I have USB 4 ports, cool, but little did I know these weren't all created equal...

TrueNAS

Getting TrueNAS set up was straightforward enough - grab the ISO, upload it to Proxmox, create the VM (q35 machine type, UEFI, memory ballooning disabled), install.
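
I did this through the GUI, but roughly the same VM can be created from the Proxmox shell - a sketch where the VM ID, storage names, sizes, and ISO filename are all placeholders:

# create the TrueNAS VM: q35 machine type, UEFI (OVMF) firmware, ballooning disabled
qm create <vm-id> --name truenas --machine q35 --bios ovmf --memory 16384 --balloon 0 \
  --cores 4 --scsihw virtio-scsi-pci --net0 virtio,bridge=vmbr0 \
  --efidisk0 local-lvm:1 --scsi0 local-lvm:32 --cdrom local:iso/<truenas-iso>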

Now the "fun" started when trying to get the drives to the VM.

Attempt 1: PCIe Passthrough

First attempt: pass through the USB controller as a PCIe device.

So looking at the lsusb -t output above, we have:

  • Root hub (Bus 10)
  • A physical USB hub connected to it
  • Three storage devices through that hub

To find which PCI device controls a USB bus we do a readlink /sys/bus/usb/devices/usb10 which shows:

../../../devices/pci0000:00/0000:00:08.3/0000:07:00.4/usb10

This path shows:

  • USB Bus 10
  • Controlled by PCI device 07:00.4
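
To get the full picture, a small loop maps every USB bus to its parent controller (a sketch; it just walks the same sysfs links as the readlink above):

# map each USB root hub (bus) to the PCI controller it hangs off
for bus in /sys/bus/usb/devices/usb*; do
  pci=$(readlink -f "$bus" | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' | tail -1)
  echo "$(basename "$bus") -> $pci"
done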

IOMMU Groups

IOMMU groups define which devices can be isolated from each other - everything in a group has to be passed through together, so a device alone in its group can be passed through independently.

So checking IOMMU groups with find /sys/kernel/iommu_groups/ -type l | grep "0000:07:00":

/sys/kernel/iommu_groups/26/devices/0000:07:00.0
/sys/kernel/iommu_groups/27/devices/0000:07:00.3
/sys/kernel/iommu_groups/28/devices/0000:07:00.4

So if I understand this correctly, each of these devices sits in its own group and can therefore be passed through on its own.
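
For a fuller view, a little loop prints every group with the devices in it (a sketch; lspci just resolves the device names):

# list every IOMMU group and the PCI devices inside it
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group $(basename "$g"):"
  for d in "$g"/devices/*; do
    echo "    $(lspci -nns "${d##*/}")"
  done
done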

I passed through the device in group 28 with "all functions" enabled. The drives disappeared from Proxmox (expected), but so did a bunch of other stuff (which kind of makes sense, given that the PCI device has other functions hanging off it, not just the USB controller).

However the VM wouldn't even boot (interesting):

error writing '1' to '/sys/bus/pci/devices/0000:07:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:07:00.0', but trying to continue as not all devices need a reset
error writing '1' to '/sys/bus/pci/devices/0000:07:00.3/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:07:00.3', but trying to continue as not all devices need a reset
error writing '1' to '/sys/bus/pci/devices/0000:07:00.4/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:07:00.4', but trying to continue as not all devices need a reset
kvm: ../hw/pci/pci.c:1633: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1

Looking into what "all functions" means, it basically says "pass every function of this PCI device through".

So I unchecked "all functions" (which gave the host back some of the previously "disappeared" devices) and the VM started.

Created a pool, everything seemed fine until:

Critical
Pool test state is SUSPENDED: One or more devices are faulted in response to IO failures

Well, that's not good - this usually suggests disk or hardware issues.

The TrueNAS UI also shows the pool in an unhealthy state.

I go to the console, check the pool status with zpool status test, and check the errors with dmesg | grep -i error.

So:

  • I've got a lot of I/O errors on the device sdc, both reads and writes
  • The pool is in a SUSPENDED state with 9366 data errors
  • The mirror seems to still be ONLINE but there are serious I/O issues

I try smartctl and lsblk to see what's up with sdc, and the drive has disappeared (this also happened to the other drives later).

No sdc, and the errors are full of references to sdc. This suggests the drive was present (as sdc) when the errors started occurring, but has since completely disappeared from the system - which could mean:

  • The drive physically disconnected (Which I didn't do)
  • The USB/SATA connection failed
  • The drive completely failed and is no longer being recognized (This would suck)

I do an ls -l /dev/disk/by-id/ and confirm the drives are still there, just under different names.

usb-ASMT_<serial>-0:0 -> sde
usb-ASMT_<serial>-0:0 -> sdf
usb-ASMT_<serial>-0:0 -> sdg

So remember when I said not all USB controllers are created equal?

After hours of debugging, it turns out that under heavy load the USB controller on the port I was using would "crash" and the drives would "shift around", getting remounted with different paths. (There's more to this.)

Not exactly ideal for a storage system.

So I went looking for ways to have TrueNAS use the drives' WWNs instead of the dev paths, but could not find anything that helped.

Attempt 2: Direct Disk Passthrough

Time for Plan B. Instead of passing through the controller, pass through individual drives.

Install lshw and do an lshw -class disk -class storage to see the drives and their serial numbers.

Do an ls -l /dev/disk/by-id and copy the WWN path or the ATA path of the drives.

1. Get drive info:

➜  ~ ls -l /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 Jan 26 20:16 ata-<name-with-serial-number> -> ../../sde
lrwxrwxrwx 1 root root  9 Jan 26 20:16 ata-<name-with-serial-number> -> ../../sdd
lrwxrwxrwx 1 root root  9 Jan 26 20:16 ata-<name-with-serial-number> -> ../../sdf

2. Set up SCSI drives:

Then, using the WWN path or the ATA path, attach the drives as SCSI devices with the following commands:

qm set <vm-id> -scsi1 /dev/disk/by-id/ata-<name-with-serial-number>
qm set <vm-id> -scsi2 /dev/disk/by-id/ata-<name-with-serial-number>
qm set <vm-id> -scsi3 /dev/disk/by-id/ata-<name-with-serial-number>

3. Add serial numbers to config:

➜  ~ vim /etc/pve/qemu-server/<vm-id>.conf
... other config
scsi1: /dev/disk/by-id/ata-<name-with-serial-number>,size=11176G,serial=<serial-number>
scsi2: /dev/disk/by-id/ata-<name-with-serial-number>,size=11176G,serial=<serial-number>
scsi3: /dev/disk/by-id/ata-<name-with-serial-number>,size=250059096K,serial=<serial-number>
... other config

In the Proxmox GUI, you should see the drives attached and their serial numbers set under the Hardware tab.
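
From inside the TrueNAS shell you can also sanity-check that the serials made it through to the guest (a quick check, nothing more):

# confirm the passed-through disks show their real serial numbers inside the VM
lsblk -o NAME,SIZE,SERIAL,MODEL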

And then I found that moving the USB cable to the USB-C port fixed the flakiness seen on the other (albeit still USB4) ports.

Performance Testing

To make sure this works, I did some stress tests with fio, and the results speak for themselves:

Cache performance

Reading a 10G file in 3.89 seconds (2632MiB/s throughput):

fio --name=test --rw=read --bs=1m --size=10g --filename=./testfile
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][75.0%][r=2639MiB/s][r=2638 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=7607: Sun Jan 26 05:46:15 2025
  read: IOPS=2632, BW=2632MiB/s (2760MB/s)(10.0GiB/3890msec)
    clat (usec): min=352, max=1613, avg=379.02, stdev=30.05
     lat (usec): min=352, max=1613, avg=379.07, stdev=30.06
    clat percentiles (usec):
     |  1.00th=[  359],  5.00th=[  363], 10.00th=[  367], 20.00th=[  367],
     | 30.00th=[  371], 40.00th=[  371], 50.00th=[  375], 60.00th=[  379],
     | 70.00th=[  383], 80.00th=[  388], 90.00th=[  396], 95.00th=[  404],
     | 99.00th=[  441], 99.50th=[  465], 99.90th=[  562], 99.95th=[ 1090],
     | 99.99th=[ 1467]
   bw (  MiB/s): min= 2604, max= 2650, per=100.00%, avg=2633.71, stdev=16.18, samples=7
   iops        : min= 2604, max= 2650, avg=2633.71, stdev=16.18, samples=7
  lat (usec)   : 500=99.79%, 750=0.12%, 1000=0.03%
  lat (msec)   : 2=0.07%
  cpu          : usr=0.28%, sys=99.49%, ctx=86, majf=0, minf=266
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2632MiB/s (2760MB/s), 2632MiB/s-2632MiB/s (2760MB/s-2760MB/s), io=10.0GiB (10.7GB), run=3890-3890msec

Bigger file and Direct I/O

A bigger file, and adding --direct=1 to the command to bypass the RAM cache.

Reading a 50G file in 129 seconds (396MiB/s throughput):

fio --name=test --rw=read --bs=1m --size=50g --filename=./bigtest --direct=1
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 51200MiB)
Jobs: 1 (f=1): [R(1)][100.0%][r=267MiB/s][r=267 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=9526: Sun Jan 26 05:54:51 2025
  read: IOPS=396, BW=396MiB/s (415MB/s)(50.0GiB/129239msec)
    clat (usec): min=133, max=180737, avg=2522.21, stdev=3642.09
     lat (usec): min=133, max=180738, avg=2522.37, stdev=3642.10
    clat percentiles (usec):
     |  1.00th=[   149],  5.00th=[   194], 10.00th=[   212], 20.00th=[   260],
     | 30.00th=[   799], 40.00th=[  1500], 50.00th=[  1876], 60.00th=[  2245],
     | 70.00th=[  2737], 80.00th=[  3458], 90.00th=[  5997], 95.00th=[  8029],
     | 99.00th=[ 12387], 99.50th=[ 16712], 99.90th=[ 34341], 99.95th=[ 45876],
     | 99.99th=[116917]
   bw (  KiB/s): min=51200, max=2932736, per=100.00%, avg=406019.97, stdev=193023.81, samples=258
   iops        : min=   50, max= 2864, avg=396.50, stdev=188.50, samples=258
  lat (usec)   : 250=17.75%, 500=11.38%, 750=0.74%, 1000=1.46%
  lat (msec)   : 2=21.94%, 4=30.99%, 10=13.34%, 20=2.05%, 50=0.31%
  lat (msec)   : 100=0.02%, 250=0.02%
  cpu          : usr=0.14%, sys=10.08%, ctx=37023, majf=0, minf=269
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=51200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=396MiB/s (415MB/s), 396MiB/s-396MiB/s (415MB/s-415MB/s), io=50.0GiB (53.7GB), run=129239-129239msec

Okay, I could live with those numbers as long as they are stable and consistent.

Sustained Operations

So I do 5 runs of reading an 800G file (which does include a write during the initial file creation) and writing a 900G file, with a mix of reading and writing at the same time.
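
The mixed part was just another fio job doing reads and writes against the same pool at the same time - a sketch of the shape of it, the exact sizes and flags may have differed:

# simultaneous sequential read/write load against the pool (50/50 mix, 1M blocks)
fio --name=mixed --rw=rw --rwmixread=50 --bs=1m --size=100g --filename=./mixedtest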

The idea is to see if something breaks, so I'm also monitoring the logs and drive temps.
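
For the logs side, something like this running in another terminal surfaces USB resets and I/O errors as they happen:

# follow kernel messages and flag anything USB/UAS/reset/error related
journalctl -kf | grep -iE "usb|uas|reset|i/o error"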

I will probably never experience this kind of load in one go outside of a resilver, so if this is stable, I'm good with that.

I leave these running and go have a drink with some friends - it's the weekend, after all.

Reading an 800G file:

fio --name=read --rw=read --bs=1m --size=800g --filename=./bigtest
read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=310MiB/s][r=310 IOPS][eta 00m:01s]
read: (groupid=0, jobs=1): err= 0: pid=10516: Sun Jan 26 21:32:56 2025
  read: IOPS=255, BW=256MiB/s (268MB/s)(800GiB/3205833msec)
    clat (usec): min=356, max=303293, avg=3911.61, stdev=6703.53
     lat (usec): min=356, max=303293, avg=3911.78, stdev=6703.52
    clat percentiles (usec):
     |  1.00th=[   416],  5.00th=[   429], 10.00th=[   437], 20.00th=[   457],
     | 30.00th=[   506], 40.00th=[   603], 50.00th=[   676], 60.00th=[  3982],
     | 70.00th=[  4555], 80.00th=[  5276], 90.00th=[ 11731], 95.00th=[ 12387],
     | 99.00th=[ 20579], 99.50th=[ 24773], 99.90th=[ 77071], 99.95th=[125305],
     | 99.99th=[198181]
   bw (  KiB/s): min=43008, max=555008, per=100.00%, avg=261727.05, stdev=73610.29, samples=6411
   iops        : min=   42, max=  542, avg=255.54, stdev=71.87, samples=6411
  lat (usec)   : 500=29.19%, 750=21.95%, 1000=0.45%
  lat (msec)   : 2=1.53%, 4=7.32%, 10=24.63%, 20=13.79%, 50=0.95%
  lat (msec)   : 100=0.12%, 250=0.07%, 500=0.01%
  cpu          : usr=0.10%, sys=13.54%, ctx=402475, majf=0, minf=269
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=819200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=256MiB/s (268MB/s), 256MiB/s-256MiB/s (268MB/s-268MB/s), io=800GiB (859GB), run=3205833-3205833msec

and writing a 900G file:

fio --name=write --rw=write --bs=1m --size=900g --filename=./big2testfile
write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=216MiB/s][w=216 IOPS][eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=24687: Sun Jan 26 23:32:36 2025
  write: IOPS=179, BW=180MiB/s (188MB/s)(900GiB/5127844msec); 0 zone resets
    clat (usec): min=118, max=400029, avg=5542.57, stdev=5208.15
     lat (usec): min=120, max=400047, avg=5561.90, stdev=5208.03
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    6], 80.00th=[    6], 90.00th=[    7], 95.00th=[    8],
     | 99.00th=[   22], 99.50th=[   35], 99.90th=[   87], 99.95th=[  101],
     | 99.99th=[  144]
   bw (  KiB/s): min= 6144, max=2981888, per=100.00%, avg=184084.36, stdev=65812.35, samples=10254
   iops        : min=    6, max= 2912, avg=179.74, stdev=64.27, samples=10254
  lat (usec)   : 250=0.10%, 500=0.02%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=1.37%, 10=95.23%, 20=2.08%, 50=0.88%
  lat (msec)   : 100=0.20%, 250=0.07%, 500=0.01%
  cpu          : usr=0.45%, sys=3.40%, ctx=931630, majf=0, minf=16
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,921600,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s), io=900GiB (966GB), run=5127844-5127844msec

From the TrueNAS GUI

What I Learned

In general, I learned a whole lot about USB - things I had no idea about before.
  1. Not all USB controllers handle sustained operations equally - the USB-C controller in my setup proved more stable
  2. It's worth checking for UASP when buying enclosures
  3. When passing through USB storage to VMs, direct disk passthrough is maybe more reliable than attempting to pass through the whole controller.
  4. Monitor temperatures during stress testing (I had journalctl and temp monitoring running throughout). For drive temps, something like: for drive in sda sdb sdc sdd; do echo "=== /dev/$drive ==="; smartctl -A /dev/$drive | grep -i temp; done