Elusive Raspberry Pi Problem


Postby ecod00m » Thu Dec 26, 2024 3:57 am

[b]UPDATE[/b]: I finally caught the little bastard red-handed! See addendum.

Hi. Ordinarily I don't post on forums, because almost always I can find the answer I am looking for without needing to create yet another question. After decades, I guess it was inevitable I would one day have to join one and post. So here goes...

A stock Raspberry Pi 4 (Model B Rev 1.4),
with 2GiB of RAM,
running archlinuxarm, booting via u-boot,
using an external USB-connected NVMe drive as its only drive,
is every day, or every few days,
becoming unresponsive in a number of ways, requiring a power-cycle to resume operations.

This particular Pi acts as a headless server, with the following (and several more undisclosed) responsibilities:
[list]
- DNS Server (PiHole + Cloudflared)
- DHCP Server (PiHole/dnsmasq)
- Prometheus Server + Alertmanager
- Loki DB Server (log shipping via Vector.dev, including self-shipping)
- Mail Server (Stalwart-Mail)
- Reverse Proxy (Caddy)
[/list]

Needless to say, it is an important little device, doing an important little job for our network. And when it goes down, everything grinds to a halt, computer-wise. Up until recent weeks, it had been doing extremely well, reliably. On this small network, I've used this hardware arrangement for many years. Over the last several months, I have worked very hard to completely wipe out all but Arch Linux on our servers and infrastructure devices. All servers now run Arch Linux or Arch Linux ARM.

The Problem:
The nature of this problem is hard to pin down, so I will list the symptoms and see what you make of them. It's been a real head-scratcher for me, and a real frustration too.

Initial Symptoms:
[list]
- Some ordinarily reliable network request doesn't work, or,
- Some ordinarily reliable external or internal site is inaccessible, or,
- NFS mounts suddenly cause workstations to lock up, or,
- .... etc etc etc
[/list]

Initial Response:
[list]
- Check Grafana first (I have extensive cgroup reporting on all servers, among much else)
- Ping the device
- Try to SSH in to the device (it sits on my desk, but SSH is my primary means of access for everything)
- Those usually fail on one or both counts, so I pull the power, wait, plug it back in, and then:
- Check the journal messages since boot
- Check the journal messages from the previous boot, in reverse
[/list]
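
For anyone following along, the two journal checks above can be sketched as below. The function names are made up for illustration. The journalctl flags are standard, and reading the previous boot assumes the journal is stored persistently (e.g. under /var/log/journal).

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical helpers, not part of any package.

# Messages from the current boot, oldest first.
# Extra arguments are passed through, e.g. -k for kernel messages only.
journal_since_boot() {
    journalctl -b 0 --no-pager "$@"
}

# Messages from the previous boot, newest first, so the last records
# written before the hang appear at the top. Defaults to 200 lines.
journal_prev_boot_reverse() {
    local lines="${1:-200}"
    journalctl -b -1 -r -n "$lines" --no-pager
}
```

Selecting by boot with -b rather than by time range keeps records grouped by boot ID, which helps when the timestamps themselves cannot be trusted.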

Initial Findings:
[list]
- There is almost always a large memory spike to max RAM (2GiB) for about 2 minutes just prior to the event
- The swap at this time does not appear to be utilised as one would expect if RAM was so low...
- This seems to be caused by PiHole rebuilding its gravity database (run by systemd)
- No OOM events are written to the journal
- nohang (installed and active) does not fire
- Other journal records seem to indicate the system is continuing just fine. However, the Raspberry Pi has no hardware clock, so these could be records from the **next** boot, written with a stale timestamp before the fake hardware clock is loaded
[/list]
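
To cross-check the RAM and swap observations above independently of Grafana, a minimal sketch like the following could log available memory and used swap. The function name and output format are assumptions for illustration. The field names are the standard ones in /proc/meminfo, whose values are reported in kB.

```shell
#!/usr/bin/env bash
# Sketch only: mem_snapshot is a hypothetical helper.
# Usage: mem_snapshot [FILE]   (FILE defaults to /proc/meminfo)
# Prints available RAM and used swap, both in MiB, on one line.
mem_snapshot() {
    awk '
        $1 == "MemAvailable:" { avail  = $2 }
        $1 == "SwapTotal:"    { stotal = $2 }
        $1 == "SwapFree:"     { sfree  = $2 }
        END {
            printf "mem_available_mib=%d swap_used_mib=%d\n",
                   avail / 1024, (stotal - sfree) / 1024
        }
    ' "${1:-/proc/meminfo}"
}
```

Run in a loop with the output sent to another machine, this would leave a record that survives even if the local drive disappears mid-event.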

Physical (Pi) Symptoms:
[list]
- The power light is on
- The usually regularly flashing green access light is unchangingly OFF
- The PWM-controlled cooling fan, which I designed myself, is not spinning. This is actually a useful indicator: it is designed to run at full speed if the control pin is in a hi-z state or is HIGH. During these events the fan is OFF most of the time, which indicates that the control pin is being held LOW, meaning the system is still booted and ON, and hasn't shut down (due to thermal overload or whatever).
- The usually actively flashing blue access light on the USB-connected NVMe device is OFF
- The ethernet port is still flashing as it would during peak activity (normal)
- Plugging a USB keyboard in does not activate the keyboard
- Plugging a screen in does not activate the screen
[/list]

Inferences:
[list]
- The USB bus could have been powered down
- With the drive off, the swap and all disk access would be gone instantly (so no journal to figure things out with)
[/list]

Things Tried:
[list]
- Removing perceived memory hogs (Infisical, for instance, was bloating to 300MiB while idle - wtf)
- Installing nohang
- Setting up cgroup accounting at the kernel line, logging via prometheus
- Learning more about swap handling (with a view to tuning it, but I feel it MAY not be the whole issue)
- Turning on a monitor and leaving it connected overnight (didn't catch it, thought it was fixed, damn it)
[/list]

Next Steps:
[list=1]
[*] Join this forum
[*] Write this post
[*] Add an additional ship destination to Vector.dev on the device to ship the journal to another machine, on which I'll set up Loki
[*] Plug the monitor back in to the device
[/list]

Have Considered:
[list]
- Wipe a decent microSD card and set it up as extra swap space and journal space (or Loki space or...) for the device. But the root mount is the external USB drive, so if the root mount goes, can anything even cd to anywhere else?
- Flash an ESP32 with a serial logger and connect it to the Pi UART
[/list]

[code]
# we're running vanilla kernel here - no linux-rpi as of yet
> uname -a
Linux arthur 6.12.1-2-aarch64-ARCH #1 SMP PREEMPT_DYNAMIC Tue Dec 3 19:28:08 MST 2024 aarch64 GNU/Linux
[/code]

[code]
# I have since set this to 100, but at the time this was the setting
vm.swappiness = 10
[/code]
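
As a reference for the setting above, here is a minimal sketch of generating a sysctl drop-in line for it. The function name and the drop-in filename in the comments are hypothetical, and this is not necessarily how the value was changed on this machine. On kernels from 5.8 onward, vm.swappiness accepts values from 0 to 200.

```shell
#!/usr/bin/env bash
# Sketch only: swappiness_dropin is a hypothetical helper.
# Prints a sysctl.d-style line for the given value, rejecting anything
# outside the 0-200 range accepted by kernels 5.8 and later.
swappiness_dropin() {
    local value="$1"
    case "$value" in
        ''|*[!0-9]*)
            echo "swappiness must be a non-negative integer" >&2
            return 1
            ;;
    esac
    if [ "$value" -gt 200 ]; then
        echo "swappiness must be between 0 and 200" >&2
        return 1
    fi
    printf 'vm.swappiness = %s\n' "$value"
}

# Possible usage (filename is an example, not taken from this system):
#   sysctl vm.swappiness        # show the live value
#   swappiness_dropin 100 | sudo tee /etc/sysctl.d/99-swappiness.conf
#   sudo sysctl --system        # reload all sysctl configuration files
```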

This has got me pretty flustered.
Please assist in preserving what's left of my sanity.

P.S. My native language is Markdown, and I can't write BBCode for sh_t it seems?

[b]UPDATE![/b]

I left the screen attached for a few days and just this morning an event occurred. I checked the device. The green light was off, but the USB drive was still powered ON. So it's not the USB system powering down. It was a KERNEL PANIC. I have uploaded the photo I took (optimised) of the screen output.

[url]https://imgdrop.io/image/caught-it50pct.Uym6d[/url]
