Figured out the IO pressure: Just sharing into the void
from GeekyOnion@lemmy.world to selfhosted@lemmy.world on 11 Nov 16:21
https://lemmy.world/post/38657883

I’ve been rebuilding everything I had hosted on a Synology NAS + Proxmox on a NUC, moving it all to a dedicated box with beefy/brutal specs. For a while I was messing around with Proxmox and unprivileged LXC containers, using a ZFS pool on the host and passing it through with mount points while mapping the users in the container to users/groups on the host. It was going pretty well except for what I thought was insanely odd and inconsistent behavior. In short, in the same LXC, I could pass through two mount points with the same users and permissions (etc.), and one would show up mapped correctly while the other wouldn’t.
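For anyone curious, the setup was roughly this shape. A minimal sketch, where the container ID, dataset paths, and UID/GID 1000 are examples rather than my exact values:

```
# /etc/pve/lxc/101.conf — unprivileged LXC with two bind mounts from the host pool
mp0: /tank/media,mp=/mnt/media
mp1: /tank/downloads,mp=/mnt/downloads
# map container UID/GID 1000 straight through to the host; shift everything else
lxc.idmap: u 0 100000 1000
lxc.idmap: g 0 100000 1000
lxc.idmap: u 1000 1000 1
lxc.idmap: g 1000 1000 1
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 1001 101001 64535

# /etc/subuid and /etc/subgid on the host both need root allowed to map that ID:
root:1000:1
```

With both mount points configured identically like that, one would show the right owner inside the container and the other wouldn’t (typically coming up as nobody/nogroup).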

I gave up on that approach after a few unhelpful responses of “you’re doing it wrong.” That may be the case, but I was more interested in why the behavior was inconsistent rather than failing outright.

I’m now running an Unraid VM with my HBA (and USB stick) passed through, lots of RAM, and an 8-pack of processors. I thought Unraid was pretty slick when I ran the trial a while ago, so I was kind of unimpressed with its performance in this configuration. But after getting all the drives configured correctly (I made the mistake of mixing up “array” and “pool” after my initial foray into ZFS), weeding out three bad drives from ServerPartDeals, getting a stable array, configuring all my LXC containers on Proxmox, running NFS over a dedicated local bridge (10.10 for the WIN!), and moving my data over from the old NAS, I was pretty happy.
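If anyone wants the passthrough part, on the Proxmox side it boils down to something like this. The VM ID, PCI address, and USB vendor:product ID are placeholders for whatever lspci/lsusb report on your box:

```
# find the HBA's PCI address
lspci | grep -iE 'sas|lsi|hba'

# hand the whole HBA to the Unraid VM (VM 100 is a placeholder)
qm set 100 -hostpci0 0000:01:00.0

# find and pass through the USB stick Unraid boots from
lsusb
qm set 100 -usb0 host=0781:5583
```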

During the whole process, I had been watching lots of odd behavior on Proxmox, with Unraid, and with my data transfers. My Pihole instance was going crazy with load averages, even though it was reserved for the LXCs on the host rather than serving the whole house, and the IO pressure stall was constantly over 90%. Given that several of the disks I ordered from the supplier turned out to be bad, I thought I was dealing with some crazy hardware problem. I was taking down the LXCs and the VM one by one, trying to find where that stall pressure was coming from.
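For anyone who hasn’t dug into it, the IO pressure number Proxmox graphs comes from the kernel’s pressure stall information (PSI), and you can read it straight off the host. The figures below are just illustrative of the kind of thing I was staring at:

```
cat /proc/pressure/io
# some avg10=91.22 avg60=89.47 avg300=90.01 total=...
# full avg10=88.90 avg60=87.12 avg300=88.53 total=...
```

“some” means at least one task is stalled waiting on IO; “full” means basically everything is. Sitting above 90 like mine was means the box is spending nearly all its time waiting on disks.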

As I was troubleshooting, I wondered if it was maybe IO pressure on the host OS disks (NVMe drives attached directly to the motherboard, ZFS mirrored), and did a quick “zpool list.” Hmm. That’s funny. Why is my old destroyed (or so I thought) pool still showing up??? When I first switched to Unraid, I exported my pool (doom-pool) and then imported it in Unraid after I passed through the HBA. After deciding that ZFS was nice but not necessary, I destroyed the pool in Unraid and reconfigured for a standard XFS array. It looked like the export, import, and destroy had somehow left the host thinking the drives were still online and in use. I tried to kill the pool again on the host, and everything would just sit and spin.
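My working theory on the mechanism (take it with salt): the host still had doom-pool in its ZFS cache file, so at boot the import service brought it back and kept trying to touch disks that actually belonged to the Unraid VM through the passed-through HBA. Roughly the checks involved; doom-pool is the real name, everything else will vary:

```
# the pool I thought was destroyed was still listed on the host
zpool list
zpool status doom-pool

# getting rid of it is the part that just sat and spun for me
zpool export -f doom-pool

# the service that re-imports pools at boot from /etc/zfs/zpool.cache
systemctl status zfs-import-cache.service
```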

I ended up shutting down the host and having to cut power (the ZFS services had been hung for about 12 minutes before I decided it was OK), and when I rebooted, the old pool was gone from the host and (holy moly) everything was working better. The IO pressure was gone. The CPU spikes and lag were gone. Pihole wasn’t going nuts any more.

The one thing I haven’t tried yet is some disk-to-disk copies on Unraid. This was one of the places where I saw aberrant behavior, with transfers limited to 120MB/s (I have 14TB 12Gb/s SAS drives in the array), but I don’t have any heavy files I need to move right now. I’m just happy that it wasn’t more bogus hardware, or a problem with my HBA or motherboard or something. Anywho, just wanted to share.
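When I do get around to testing, the plan is a read-only pass against the raw disks first, something like this. Device names are placeholders, and even though both commands only read, double-check the target before pointing anything at a disk with data on it:

```
# quick sequential read test of one drive (read-only)
hdparm -t /dev/sdb

# longer read-only pass with fio
fio --name=seqread --filename=/dev/sdb --readonly \
    --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=8 \
    --runtime=30 --time_based
```

For what it’s worth, as I understand Unraid, ~120MB/s on writes into the array may partly be the parity tax, since every array write also updates the parity drive; reads from a single 12Gb/s SAS drive should come in well above that.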

#selfhosted


ragingHungryPanda@piefed.keyboardvagabond.com on 11 Nov 18:11

what an experience! thank you for sharing :)