Spoiler warning: the problem is not yet solved, though a few attempted measures are in place and being tested at the time of writing. There is also a question for the alarm maintainers about the builds for this board, repeated here for convenience: Why is the default value of vm.min_free_kbytes so low?
My Exotic Setup
I have an Odroid-XU4 inside a CloudShell2 case. As a reminder, the CloudShell2 has one 15V/4A PSU, and the board powers its 2 SATA ports and the XU4 over the GPIO pins.
Inside the CloudShell2 I have two 3.5" (7200rpm) HDDs. These are connected to the XU4 via one of its USB3 ports.
Additionally, I have some "Orico" 4-bay USB3-SATA docks, connected to the XU4 via a powered USB3 hub on the XU4's other USB3 port. These docks contain a few 3.5" 7200rpm disks and, more recently, a few 2.5" 5400rpm disks.
I have a couple of mdadm arrays spanning the disk pool. In total:
- 3x6TB (2 of which are in the CloudShell2)
- 3x6TB
- 5x4TB
Each mdadm array is formatted as a single XFS volume (of 16TB or under, because of the 32-bit limitation of the XU4). Each volume is exported over NFS on the LAN and is populated with various files and media. Transmission-daemon uses a Downloads folder on one volume and actively reads from and writes to it; most other disk access consists of NAS reads from various devices on the LAN.
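For reference, the setup for one such volume looks roughly like the following. This is only a sketch: the device names, mount point, RAID level, and export subnet are all placeholders, since the post doesn't record the exact commands used.

```shell
# Sketch only: /dev/sd[abc], /dev/md0, the RAID level, and the subnet are
# assumptions, not the actual values used on this system.
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc

# Format as XFS; the resulting volume must stay at or under 16TB on a
# 32-bit kernel, which is why the arrays are sized the way they are.
mkfs.xfs /dev/md0
mkdir -p /srv/media
mount /dev/md0 /srv/media

# Export it over NFS to the LAN.
echo '/srv/media 192.168.0.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra
```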
There's also a mumble (umurmur) server and a ZNC session, the latter of which causes a small amount of logging to the microSD card, but not much. SSHD is accessible over WAN but only accepts key-based logins, and I have a working fail2ban jail set up.
For completeness, I'm running 4.14.180-1-ARCH and the system is up to date. It seems we're still waiting for a 5.x kernel on this board, which is a shame, but that warrants a separate discussion entirely.
The Problem Symptoms
For quite some time, the server would just... hang. Usually this would prevent any new connections from forming, but open connections would sometimes persist, especially NAS reads - for a while, anyway, before the system died entirely and required a hard reboot (usually followed by reassembling the mdadm arrays and checking the filesystems on them). Other times it would hang outright.
In this broken state it was difficult to get any information about what went wrong. The closest I ever got was with the UART cable: almost nothing worked, but I was able to see that no volumes were mounted at all, not even root.
Eventually I set up the watchdog to reboot the system under various conditions, which works as expected. Whenever the problem does occur, within a minute the system is back online, though I still tend to need to reassemble and recheck the mdadm arrays in an SSH session.
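For anyone wanting to replicate the recovery setup: the watchdog(8) daemon is configured via /etc/watchdog.conf. The fragment below is a sketch with illustrative thresholds, not my exact configuration.

```shell
# /etc/watchdog.conf fragment (sketch; thresholds are illustrative).
# The XU4's hardware watchdog device:
watchdog-device = /dev/watchdog

# Reboot if the 1-minute load average climbs past this value:
max-load-1 = 24

# Reboot if free memory (in pages) drops below this value:
min-memory = 1

# How often (seconds) the daemon pets the watchdog:
interval = 10
```

If the daemon stops petting the watchdog for any reason (hang, OOM, storage gone), the hardware resets the board on its own, which is what gets the system back online within a minute.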
When the problem occurs, the UART cable never seems to "see" what actually goes wrong, and there's never any trace whatsoever on the system that there was a problem, other than the fact that the system started booting again. I suppose it's the sort of thing you would expect if either all the power was cut at once, or if all storage devices were yanked at once. Maddening.
Miraculously, this does not seem to have ever resulted in any actual data loss, but that doesn't mean the situation is acceptable. It is my intention to leave this server running at my parents' place as I move to a foreign country.
Attempts to Solve the Problem
For most of this time, I suspected the microSD card of being the primary culprit. I'm aware that microSD cards can fail, and can misbehave in other ways under load. My line of reasoning was that, if the microSD were at fault, the root device could be unmounting, not only preventing any logs from being written but also causing most services and connections to break.
So I tried migrating my install to several microSD cards (carefully creating an offset partition table as per the install guide, using dd for the first so many sectors, then using the ext4 tools to copy and resize my root partition onto the new card). No dice - the same problem kept occurring.
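In outline, each migration looked something like this. The device names are placeholders (verify with lsblk before touching anything; dd to the wrong device destroys data), and the sector count for the initial dd comes from the install guide, so it's left unspecified here.

```shell
# /dev/sdX = old card, /dev/sdY = new card -- placeholders, verify first!

# Copy the first N sectors (partition table plus the board's boot blobs);
# N is whatever the ALARM install guide for the XU4 specifies.
dd if=/dev/sdX of=/dev/sdY bs=512 count=N

# Copy only the used blocks of the old root filesystem to the new
# partition, then check it and grow it to fill the partition.
e2image -ra -p /dev/sdX1 /dev/sdY1
e2fsck -f /dev/sdY1
resize2fs /dev/sdY1
```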
My next idea was that it wasn't the microSD card itself that was the problem, but rather how much I was accessing it. To test this, I reduced the amount of persistent systemd journal logging and ZNC logging. Unfortunately, this also had no effect on the problem.
From here, it was becoming clearer that the problem occurred specifically during periods of high I/O, though it was still unclear whether sustained I/O, burst I/O, or some more complex I/O scenario was responsible. When trying to deliberately trigger the issue (e.g. via a huge (~10TB) rsync from one array onto a fresh one, or even the process of creating that fresh array via mdadm), it was still essentially random whether and when it actually happened. Typically, though, it occurs within 6 to 10 hours of I/O, especially if other disks on the system are also active, e.g. Transmission is busy downloading or seeding.
One thing I did discover, though, is that if the I/O is throttled for some reason, the bug is sidestepped, or at least occurs much less frequently. I discovered this by accident while trying to trigger the bug with an rsync copy to an mdadm array formatted as btrfs instead of my usual XFS. The write performance of btrfs was so poor that only a couple of terabytes had transferred over three days of constant I/O - and yet the system was still running.
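If throttled I/O really is what sidesteps the bug, rsync's built-in bandwidth limit should be a cleaner way to test that than relying on a slow filesystem. The paths and the rate below are illustrative.

```shell
# Cap the transfer rate (40 MB/s here is an arbitrary test value) to see
# whether throttled sustained I/O avoids the hang, without switching the
# destination array to btrfs. Paths are placeholders.
rsync -aHAX --progress --bwlimit=40m /mnt/array-old/ /mnt/array-new/
```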
Current Status
I started asking around in various IRC communities for brainstorming purposes.
One of the most promising suggestions came from a user in the #odroid freenode room. They asked me to check the value of vm.min_free_kbytes. Since I hadn't changed it, it was still the default, which on my setup seems to have been 3454. Politely put, they were horrified, and said that with ~10 disks attached I should set this to something much higher, for example 65536.
According to https://www.kernel.org/doc/Documentation/sysctl/vm.txt:

"Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will become subtly broken, and prone to deadlock under high loads."
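So that's the measure currently in place and under test: raising vm.min_free_kbytes well above the default. For reference, checking and setting it looks like this (the drop-in filename is just a convention, pick whatever you like):

```shell
# Check the current value (this reported 3454 on my board).
sysctl vm.min_free_kbytes

# Raise it for the running system (requires root).
sysctl -w vm.min_free_kbytes=65536

# Persist it across reboots via a sysctl.d drop-in.
echo 'vm.min_free_kbytes = 65536' > /etc/sysctl.d/99-min-free.conf
```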