Filesystem mounting ro with heavy NVMe I/O

Krait@discuss.tchncs.de · edit-2 11 months ago

Krait@discuss.tchncs.de · 11 months ago

Happens with both drives. I have tried every possible permutation (Samsung in slot 1 and 2, Kioxia in slot 1 and 2, and even installing only one drive at a time).

Atemu@lemmy.ml · 11 months ago

Boot a live ISO with the flags recommended in the kernel message and do some tests on the bare drives. That way you won’t have the filesystem and subsequently the rest of the system giving out on you while you’re debugging.

Krait@discuss.tchncs.de · 11 months ago

Boot a live ISO with the flags recommended in the kernel message and do some tests on the bare drives. That way you won’t have the filesystem and subsequently the rest of the system giving out on you while you’re debugging.

Which tests are you referring to exactly? I have read about badblocks, for example, and that it is not much use for SSDs in general: bad-block remapping happens automatically in the drive’s controller, so bad blocks stay invisible to the OS. SMART values look great for both drives, with about 20 TBW on the Samsung drive and a lot less on the Kioxia drive.

Atemu@lemmy.ml · 11 months ago

I’d start by generating some synthetic workloads such as writing some sequential data to it and then reading it back a few times.
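A minimal sketch of such a sequential write and read-back test, in Python. The function name `write_then_verify` and its parameters are made up for illustration. It overwrites whatever `path` points to, so only aim it at a scratch file or at a bare drive whose contents you don't need. Also note that the read passes may be served from the page cache; to really hit the device you'd want a size larger than RAM, or drop caches between passes.

```python
import hashlib
import os


def write_then_verify(path, total_bytes, chunk_bytes=1 << 20, read_passes=3):
    """Write random data sequentially to path, then read it back several times.

    Returns True if every read pass yields the same SHA-256 digest as the
    data that was written. WARNING: this overwrites the target.
    """
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the device

    for _ in range(read_passes):
        seen = hashlib.sha256()
        left = total_bytes
        with open(path, "rb") as f:
            while left > 0:
                chunk = f.read(min(chunk_bytes, left))
                if not chunk:  # short read: digest will not match
                    break
                seen.update(chunk)
                left -= len(chunk)
        if seen.digest() != written.digest():
            return False
    return True
```

Something like `write_then_verify("/mnt/scratch/test.bin", 8 * 1024**3)` would push 8 GiB through (the path is just an example). If the drive drops out mid-run you'd expect an `OSError` rather than a clean `False`, which is useful information in itself.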

badblocks concerns partial failure of the device where (usually) just a few blocks misbehave while the rest remains accessible. The failure mode seen here is that the entire drive becomes inaccessible and it’s likely not due to the drive itself but how it’s connected.

If synthetic loads fail to reproduce the error, I’d put a filesystem on it and perhaps copy over some real data. Put some load on it that mimics a real system, to try to get it to fail without the OS actually being run off the drive.

Krait@discuss.tchncs.de · 11 months ago

Thanks, I’ll try that. I loaded the drive using dd a couple of times, and that did bring the system down a couple of times. I was writing to the filesystem, though, while the system was booted.

Atemu@lemmy.ml · 11 months ago

Did you boot with the kernel flags from the log?

Could you show the dmesg from the point onwards when the drive dropped out?

Krait@discuss.tchncs.de · 11 months ago

I did, yes, but to no avail. The dmesg output I posted is from after the drive was remounted ro, and is the best I could get. After some time, the system stops responding completely.

Atemu@lemmy.ml · 11 months ago

Your system stops responding even if it’s not booted from those drives but a live ISO?

bazsy@lemmy.world · 11 months ago

Are both drives fully encrypted with LUKS? Is trim enabled in both crypttab and fstab?

Krait@discuss.tchncs.de · 11 months ago

Both drives were encrypted: the Samsung as root drive, encrypted except for the EFI partition, and the Kioxia fully encrypted, mounted via crypttab with a key file residing on the encrypted Samsung partition for automatic unlock. I have been reinstalling quite often since then and couldn’t be bothered to set up the encryption for the second drive again, so it stays unused atm. Trim is enabled via a kernel parameter, but no longer in the fstab directly (I’m running BTRFS now, and from what I’ve gathered passing the ssd option to BTRFS is enough to enable trim; verified with lsblk --discard).
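For reference, a rough sketch of reading the per-device discard limits straight from sysfs. As far as I know these are the same values lsblk --discard shows in its DISC-GRAN and DISC-MAX columns. The helper name `discard_limits` is made up. A non-zero `discard_max_bytes` only tells you that the block device accepts discards, not that the filesystem actually issues them. On a LUKS setup, my understanding is that the mapped dm device reports 0 unless discards are allowed through dm-crypt.

```python
from pathlib import Path


def discard_limits(sys_block="/sys/block"):
    """Map each block device name to (discard_granularity, discard_max_bytes).

    Devices without readable discard attributes are skipped.
    """
    root = Path(sys_block)
    if not root.is_dir():
        return {}
    result = {}
    for dev in sorted(root.iterdir()):
        queue = dev / "queue"
        try:
            gran = int((queue / "discard_granularity").read_text().strip())
            max_bytes = int((queue / "discard_max_bytes").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable for this device
        result[dev.name] = (gran, max_bytes)
    return result


if __name__ == "__main__":
    for name, (gran, max_bytes) in discard_limits().items():
        state = "supported" if max_bytes > 0 else "not supported"
        print(f"{name}: discard {state} (granularity={gran}, max={max_bytes})")
```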
