Hi everyone,

I have been experiencing some weird problems lately, starting with the OS becoming unresponsive “randomly”. After reinstalling multiple times (different filesystems, XFS and BTRFS; different NVMe slots with different NVMe drives; same results), I have narrowed it down to heavy I/O on the NVMe drive. Most of the time I can’t even pull up dmesg and have to force a shutdown, as ZSH returns an Input/Output error no matter the command. A couple of times I was lucky enough that the system stayed somewhat responsive, so I could pull up dmesg.

It gives a “controller is down, resetting” message, which I’ve seen on the ArchWiki for some older Kingston and Samsung NVMe drives, together with kernel parameters to try (they didn’t help much; they pretty much disable ASPM on PCIe).

What did help a bit was reverting a recent BIOS upgrade on my MSI Z490 Tomahawk: the system no longer crashes immediately under heavy I/O, but instead remounts the filesystem as ro. The issue still persists, though. I have additionally run memtest86 for 8 passes, with no issues there.

I have tried running the LTS kernel, but this didn’t help. The strange thing is, this error does not happen on Windows 11.

Has anyone experienced this before and can give some pointers on what to try next? I’m at my wits’ end here. EDIT: When this issue first appeared, I assumed the Kioxia drive was defective, and the manufacturer replaced it. The issue still happens with the replacement drive, as well as with the Samsung drive. I thus assume that neither drive is defective (smartctl also seems to think so).

Here are hardware and software details:

  • Arch with the latest Zen kernel, 6.7.4; it happened with older kernels too (tried regular, LTS and Zen)
  • BTRFS on LUKS
  • i9-10850K
  • MSI Z490 Tomahawk
  • G.Skill 3200 MHz RAM, 32GB, DDR4
  • Samsung 970 Evo 1TB & Kioxia Exceria G2 1TB (tested both drives, in both slots each, over multiple installs)
  • Vega 56 GPU
  • Be quiet Straight Power 11 750W PSU
  • Krait@discuss.tchncs.de (OP)
    11 months ago

    It happens with both drives. I have tried every possible permutation (Samsung in slot 1 and 2, Kioxia in slot 1 and 2, and even installing only one drive at a time).

    • Atemu@lemmy.ml
      11 months ago

      Boot a live ISO with the flags recommended in the kernel message and do some tests on the bare drives. That way you won’t have the filesystem and subsequently the rest of the system giving out on you while you’re debugging.

      • Krait@discuss.tchncs.de (OP)
        11 months ago

        Boot a live ISO with the flags recommended in the kernel message and do some tests on the bare drives. That way you won’t have the filesystem and subsequently the rest of the system giving out on you while you’re debugging.

        Which tests are you referring to exactly? I have read about badblocks, for example, and that it is not much use for SSDs in general: their automatic bad-block remapping happens entirely in the drive’s controller, so bad blocks remain invisible to the OS. SMART values look great for both drives, with about 20 TB written on the Samsung drive and a lot less on the Kioxia drive.

        • Atemu@lemmy.ml
          11 months ago

          I’d start by generating some synthetic workloads such as writing some sequential data to it and then reading it back a few times.

          badblocks concerns partial failure of the device where (usually) just a few blocks misbehave while the rest remains accessible. The failure mode seen here is that the entire drive becomes inaccessible and it’s likely not due to the drive itself but how it’s connected.

          If synthetic loads fail to reproduce the error, I’d put a filesystem on it and perhaps copy over some real data. Put on some load that mimics a real system to try and get it to fail without the OS actually being run off the drive.
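A minimal sketch of the write-then-read-back idea described above, in Python. The function name `write_and_verify` and its parameters are made up for illustration. It assumes `path` points at a scratch file on the drive under test; pointing it at a bare device node would overwrite everything on that device. Reads may be served from the page cache unless the amount written is larger than RAM, so treat small runs as a smoke test only.

```python
import hashlib
import os


def write_and_verify(path, total_bytes, chunk_size=1 << 20, read_passes=3):
    """Write total_bytes of random data to path sequentially, then read it
    back read_passes times and compare SHA-256 digests.

    Returns a list of booleans, one per read pass (True = digest matched).
    """
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the device

    results = []
    for _ in range(read_passes):
        seen = hashlib.sha256()
        with open(path, "rb") as f:
            # Read sequentially in chunk_size pieces until EOF.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                seen.update(chunk)
        results.append(seen.digest() == written.digest())
    return results
```

If the drive drops out mid-run, the call will most likely raise an `OSError` (the same Input/Output error the shell reports) instead of returning, which is itself the reproduction you are after.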

          • Krait@discuss.tchncs.de (OP)
            11 months ago

            Thanks, I’ll try that. I put load on the drive using dd a few times, and that did bring the system down a couple of times. I was writing to the filesystem though, while the system was booted.

            • Atemu@lemmy.ml
              11 months ago

              Did you boot with the kernel flags from the log?

              Could you show the dmesg from the point onwards when the drive dropped out?
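One way to double-check that the flags really reached the running kernel is to compare them against `/proc/cmdline`. A small Python sketch follows; the helper name `missing_kernel_params` is made up, and the two parameters in `WANTED` are an assumption about what the log suggested, so substitute whatever your own kernel message recommends.

```python
def missing_kernel_params(cmdline, wanted):
    """Return the entries of wanted that are not whole tokens in cmdline."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


# Assumed flags; replace with the ones from your own kernel log.
WANTED = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]

# Usage on the affected machine:
#   with open("/proc/cmdline") as f:
#       print(missing_kernel_params(f.read(), WANTED))
```

An empty list means every flag was passed at boot; anything listed never made it onto the command line, for example because the bootloader config was not regenerated.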

                • Krait@discuss.tchncs.de (OP)
                11 months ago

                I did, yes, but to no avail. The dmesg output I posted is from after the drive was remounted as ro, and it is the best I could get. After some time, the system stops responding completely.

                • Atemu@lemmy.ml
                  11 months ago

                  Your system stops responding even if it’s not booted from those drives but a live ISO?

    • bazsy@lemmy.world
      11 months ago

      Are both drives fully encrypted with LUKS? Is trim enabled in both crypttab and fstab?

      • Krait@discuss.tchncs.de (OP)
        11 months ago

        Both drives were encrypted (the Samsung as root drive, encrypted except for the EFI partition, and the Kioxia fully encrypted and mounted via crypttab, with a key file residing on the encrypted Samsung partition for automatic unlock). Since I have been reinstalling quite often, I couldn’t be bothered to set up the encryption for the second drive again, so it stays unused atm. Trim is enabled via a kernel parameter, but not in the fstab directly anymore (I’m running BTRFS now, and from what I’ve gathered, passing the ssd option to BTRFS is enough to enable trim; verified with lsblk --discard).
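For reference, the DISC-MAX column of `lsblk --discard` comes from the `discard_max_bytes` attribute in sysfs, so the same check can be scripted. The Python sketch below uses made-up helper names. As far as I know, a non-zero value only says that this layer accepts discards (for a LUKS mapping, that it was opened with discards allowed), not that the filesystem actually issues them.

```python
from pathlib import Path


def discard_max_bytes(device, sysfs_root="/sys/block"):
    """Return the discard_max_bytes value the kernel reports for a block device."""
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    return int(path.read_text().strip())


def accepts_discard(device, sysfs_root="/sys/block"):
    """True if the device's queue advertises a non-zero discard limit."""
    return discard_max_bytes(device, sysfs_root) > 0
```

Called as `accepts_discard("nvme0n1")` and again for the mapped device (for example `dm-0`; both names are placeholders), it shows whether discards can pass through each layer of the stack.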