[ODROID-XU4] [CloudShell2] Instability under high I/O loads

Post by aphirst » Wed Mar 03, 2021 12:27 pm

I've been having this issue for at least a year, but due to its sporadic nature it's been difficult to realise that it's happening, and even more difficult to attribute it to specific causes or circumstances. Here follows a (mostly) comprehensive write-up.

Spoiler warning: the problem is not yet solved, though a few attempted measures are in place and being tested at the time of writing. There is also a question for the alarm maintainers about the builds for this board, repeated here for convenience: Why is the default value of vm.min_free_kbytes so low?

My Exotic Setup

I have an Odroid-XU4 inside a CloudShell2 case. As a reminder, the CloudShell2 has one 15V/4A PSU, and the board powers its 2 SATA ports and the XU4 over the GPIO pins.

Inside the CloudShell2 I have two 3.5" (7200rpm) HDDs. These are connected to the XU4 via one of its USB3 ports.

Additionally I have some "Orico" 4-bay USB3-SATA docks. These are connected to the XU4 via a (powered) USB3 hub connected to the XU4's other USB3 port. These docks contain a few 3.5" 7200rpm disks, and also recently a few 2.5" 5400rpm disks.

I have a couple of mdadm arrays spanning the disk pool. In total:
  • 3x6TB (2 of which are in the CloudShell2)
  • 3x6TB
  • 5x4TB
(It's worth bearing in mind that this issue was present before the 4TB disks were ever attached to the system.)

Each mdadm array is formatted to an XFS volume (of 16TB or under, because of the 32bit limitation of the XU4). Each volume is made available as NFS over LAN and is populated with various files and media. Transmission-daemon uses a Downloads folder on one volume and actively reads/writes to it. Most other disk access is NAS reads from various devices on LAN.

There's also a mumble (umurmur) server and a ZNC session; the latter causes a small amount of logging to microSD, but not much. SSHD is accessible over WAN but only accepts key logins, and I have a working fail2ban jail set up.

For completeness, I'm running 4.14.180-1-ARCH and the system is up to date. It seems we're still waiting for a 5.x kernel on this board, which is a shame, but warrants a separate discussion entirely.

The Problem Symptoms

For quite some time, the server would just... hang. Usually this would prevent any new connections from forming, while open connections (especially NAS reads) would sometimes persist for a while before the system died entirely and required a hard reboot, usually followed by reassembling the mdadm volumes and checking the filesystems on them. Other times it would hang outright.

In this broken state it was difficult to get any information about what went wrong. The closest I ever got was using the UART cable - almost nothing worked, but I was able to see that no volumes were mounted at all, not even root.

Eventually I set up the watchdog to reboot the system under various conditions, which works as expected. Whenever the problem does occur, within a minute the system is back online, though I still tend to need to reassemble and recheck the mdadm arrays in an SSH session.
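The "do the arrays need reassembling?" check after a watchdog reboot could be automated by reading /proc/mdstat. Below is a minimal Python sketch of that check only. The sample text, device names, and function names are illustrative assumptions, not output from this system, and the actual reassembly (for example with mdadm --assemble --scan) is deliberately left out.

```python
import re

# Illustrative /proc/mdstat-style text; device names and sizes are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[0] sdd1[1] sde1[3]
      11720780800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
md1 : inactive sdf1[0](S) sdg1[1](S)
      7813772288 blocks super 1.2

unused devices: <none>
"""

def array_states(mdstat_text):
    """Return {array_name: state} from /proc/mdstat-style text."""
    states = {}
    for line in mdstat_text.splitlines():
        # Array lines look like "md0 : active raid5 ..." or "md1 : inactive ...".
        match = re.match(r"^(md\S*)\s*:\s*(\S+)", line)
        if match:
            states[match.group(1)] = match.group(2)
    return states

def inactive_arrays(mdstat_text):
    """Return the sorted names of arrays whose state is not 'active'."""
    return sorted(name for name, state in array_states(mdstat_text).items()
                  if state != "active")

print(inactive_arrays(SAMPLE_MDSTAT))
```

On the real system the text would come from open("/proc/mdstat").read() instead of the sample string.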

When the problem occurs, the UART cable never seems to "see" what actually goes wrong, and there's never any trace whatsoever on the system that there was a problem, other than the fact that the system started booting again. I suppose it's the sort of thing you would expect if either all the power was cut at once, or if all storage devices were yanked at once. Maddening.

Miraculously, this does not seem to have ever resulted in any actual data loss, but that doesn't mean the situation is acceptable. It is my intention to leave this server running at my parents' place as I move to a foreign country.

Attempts to Solve the Problem

For most of the time, I suspected the microSD card as being the primary culprit. I am aware that microSD cards can fail, and can have other problems under load. My line of reasoning was that, if the microSD is at fault, the root device could be unmounting, not only preventing any logs from being written, but also causing most services and connections to break.

So, I tried migrating my install to several microSD cards (carefully creating an offset partition table as per the install guide, using dd for the first so many sectors, then using the ext4 tools to copy and resize my root partition onto the new card). No dice - I kept getting the same problem.

My next idea was that it wasn't the microSD card itself that was the problem, but rather how much I was accessing it. To test this, I reduced the amount of persistent systemd journal logging and ZNC logging. Unfortunately, this also had no effect on the problem.

From here, it was becoming clearer that the problem seemed to occur specifically during periods of high I/O access, though it was still unclear whether it was mainly sustained, burst, or some more complex I/O scenario that was responsible. When trying to deliberately trigger the issue (e.g. via a huge approx 10TB rsync from one array onto a fresh one, or even the process of creating that fresh array via mdadm), it was still essentially random when and whether it actually happened. Typically, though, it occurs within 6 to 10 hours of I/O, especially if other disks on the system are active, e.g. Transmission is busy downloading or seeding.

One thing I did discover, though, is that if the I/O is throttled for some reason, the bug is sidestepped, or at least occurs much less frequently. I discovered this by accident while trying to trigger the bug using an rsync copy to an mdadm array formatted as btrfs instead of my usual XFS. The write performance of btrfs was so slow that only a couple of terabytes had transferred over a three-day period, and the system was still running after those three days of constant I/O.
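To test the throttling observation deliberately rather than relying on a slow filesystem, the copy rate can be capped. rsync's --bwlimit option does this directly. The Python sketch below shows the same idea by sleeping between chunks. The function name, chunk size, and rate are illustrative assumptions, and since writes go through the page cache, this limits how fast data is submitted rather than the exact rate at which it reaches the disks.

```python
import time

def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=1024 * 1024):
    """Copy src to dst, sleeping as needed to stay under max_bytes_per_sec.

    Returns the number of bytes copied.
    """
    copied = 0
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            # Earliest moment at which this many bytes may have been written.
            earliest = start + copied / max_bytes_per_sec
            delay = earliest - time.monotonic()
            if delay > 0:
                time.sleep(delay)
    return copied
```

For example, throttled_copy("big.iso", "/mnt/array/big.iso", 20 * 1024 * 1024) would cap the copy at roughly 20 MiB/s; both paths are placeholders.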

Current Status

I started asking around in various IRC communities for brainstorming purposes.

One of the most promising suggestions was from a user in the #odroid freenode room. They asked me to check the value of "vm.min_free_kbytes". Since I hadn't changed it, it was still the default value, which on my setup seems to have been 3454. Politely put, they were horrified, and said that with ~10 disks attached I should set this to something much higher, for example 65536.

According to https://www.kernel.org/doc/Documentation/sysctl/vm.txt:
"Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will become subtly broken, and prone to deadlock under high loads."
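For context, a default of 3454 looks like something the kernel computes at boot rather than a value a distribution picks. As far as I can tell from mm/page_alloc.c in 4.x kernels, the default is roughly sqrt(16 * lowmem_kbytes), clamped to the range 128..65536. The Python sketch below reproduces that arithmetic. The function name and the low-memory figure (about 728 MiB, which would be plausible for a 32-bit kernel on a 2 GB board) are assumptions for illustration, not values read from the XU4.

```python
import math

def default_min_free_kbytes(lowmem_kbytes):
    """Approximate the kernel's boot-time default for vm.min_free_kbytes.

    Assumption: mirrors the sqrt(16 * lowmem) rule with a 128..65536 clamp
    as found in 4.x kernels. This is not an official API.
    """
    value = math.isqrt(lowmem_kbytes * 16)
    return max(128, min(value, 65536))

# Hypothetical low-memory size (about 728 MiB) chosen to match the observed default.
print(default_min_free_kbytes(745_633))
```

Under that rule the default only grows with the square root of low memory, so it stays in the low thousands of kilobytes on a board like this regardless of how many disks are attached.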

I will admit that I don't fully understand why this value in particular might be causing problems, since my XU4's RAM usage seems to never go over 400M or so, leaving 1600M free. Nonetheless, I have set vm.min_free_kbytes to 65536 using sysctl and am once again trying to trigger the problem. The system has been alive for about 1 day 3 hours, which is making me cautiously optimistic, though it's important to remember how nondeterministic the problem has been, and it will be difficult to ever be sure that it's fixed.
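One possible explanation for the apparent contradiction, offered as a guess: if the 1600M figure counts page cache as free, then under heavy I/O the truly free pool (MemFree) may be far smaller than it looks, and that is the pool the reserve protects. A minimal Python sketch for comparing the two alongside the current reserve, assuming the standard /proc/meminfo field names; the helper names are hypothetical.

```python
from pathlib import Path

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value}, mostly in kB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def report(meminfo_text, min_free_kbytes):
    """Summarise truly free memory, page cache, and the configured reserve."""
    info = parse_meminfo(meminfo_text)
    return ("MemFree={} kB, Cached={} kB, vm.min_free_kbytes={} kB".format(
        info.get("MemFree", 0), info.get("Cached", 0), min_free_kbytes))

meminfo_path = Path("/proc/meminfo")
reserve_path = Path("/proc/sys/vm/min_free_kbytes")
# Only touch /proc when it is actually there (i.e. on Linux).
if meminfo_path.exists() and reserve_path.exists():
    print(report(meminfo_path.read_text(), int(reserve_path.read_text())))
```

Running this during a large rsync and again while idle would show whether MemFree actually drops toward the reserve under load.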

My question for the alarm maintainers: Why is the default value of vm.min_free_kbytes so low?

Future Diagnostics

If setting vm.min_free_kbytes ultimately fails to resolve my issue, I will have to shift my focus elsewhere. Presently my only likely candidates are:

  • Disastrous hardware fault with my XU4 (esp. the USB3 controller/ports)
  • Disastrous kernel bug
  • Disastrous hardware fault with my powered USB3 hub
  • Disastrous hardware fault with my USB3-SATA docks
  • 3.5" 7200rpm HDDs draw too much power from the CloudShell2, starving the XU4 of power, causing kernel panics or other instabilities

I have used the USB3-SATA docks with my laptops quite extensively, so I am inclined to trust them. The easiest possibility to test here would be the last one, as I could move the disks around so that the CloudShell2 instead contains 2 of the 2.5" 5400rpm HDDs (via some appropriate caddies that I fortunately already own).

Closing Thoughts

I would appreciate any and all feedback on this topic. I may realise later that there were significant omissions in my report or thought process, and I will try to edit those back in if I can.