Troubleshooting crashes / panics / etc

This forum is for topics dealing with problems with software specifically in the ARMv7h repo.


Post by andrewboring » Wed Jun 20, 2018 7:56 pm

Hi all.

I'm a bit new to Arch Linux and the ARM architecture in general, and I am working with a client on some basic systems/OS work as they develop a node.js application on a Raspberry Pi 3B. They're using Benja, which combines Arch Linux Arm, Electron, and some other stuff for kiosk applications.

I was originally brought in to troubleshoot spotty network connectivity, but it looks like the root cause is random system crashes taking these units offline. Unfortunately, they are headless units driving a display that is not going to show a kernel panic or other info on the screen. Whenever they go offline, we have to ask someone on-site to power cycle them so we can troubleshoot afterward. The few times I've personally been on-site, plugging in a keyboard/monitor didn't work until *after* the power cycle. As best I can tell, the system journal just stops logging and then shows the reboot. The application logs simply stop too, including those from a script that runs as a cron job, checks various activities, and restarts services as needed.

I read through https://wiki.archlinux.org/index.php/Co ... ng_systemd and took a look at the core files left there. They're all application-specific and do not coincide with any of the hard system stops.

The SD card has no swap partition, and no swap file is configured. There's space available, so I'm willing to create one if it gets me a core dump or other useful info, but I don't want the system itself to start using swap and slow everything down. I'm thinking any out-of-memory issues should be noted in the system journal.
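
A minimal sketch of that out-of-memory check, assuming the journal is persistent so the boot before the power cycle is still on disk. `find_oom_lines` is a hypothetical helper name, and the patterns are the usual kernel OOM-killer phrases rather than anything specific to this setup. A hard hang may never flush its last messages to the SD card, so an empty result would not rule out memory pressure.

```shell
#!/bin/sh
# Filter kernel log text read from stdin for OOM-killer activity.
# Returns non-zero (grep's exit status) when nothing matches.
find_oom_lines() {
    grep -i -E 'out of memory|oom-killer|oom_reaper'
}

# Intended use on the Pi (assumes persistent journald storage):
#   journalctl -k -b -1 --no-pager | find_oom_lines
# -k limits output to kernel messages, -b -1 selects the previous boot.
```

If this prints nothing for the boots that ended in a hang, the journal at least offers no evidence of the OOM killer firing before the stop.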

I have two constraints at the moment:
    1. Updating the system is not possible just yet (it breaks the application), but I have a separate workstream to change that. I suspect some problems will disappear (and we will probably find some new ones) once we get to the latest and greatest.

    $ uname -a
    Linux [hostname redacted] 4.14.34-1-ARCH #1 SMP Mon Apr 16 19:15:19 UTC 2018 armv7l GNU/Linux

    2. I cannot reproduce this in our lab. It only seems to happen at the client's site, and it is random enough across a dozen of these units that I don't think it's a hardware issue with any given RPi.

So my question is: what other steps can I take to debug/troubleshoot these systems? Coming from the datacenter world, I've always had physical access and crash carts (or even a serial port) to see what's going on. Any options I can enable to get better kernel info after a crash in a headless, embedded system?

Guidance, suggestions, pointers, and links to appropriate docs are welcome.
andrewboring
 
Posts: 13
Joined: Wed Jun 20, 2018 5:07 pm

Re: Troubleshooting crashes / panics / etc

Post by andrewboring » Fri Jun 22, 2018 1:14 pm

Ah. I believe this was the info I was looking for:
https://wiki.archlinux.org/index.php/Kdump
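
Before going down the kdump route, it is probably worth checking whether the running kernel has any crash memory reserved at all. The following is a rough sketch, not a tested procedure for this kernel: `crashkernel_param` is a hypothetical helper name, and whether the 4.14.34-1-ARCH build supports kexec on the Pi is an assumption that still needs verifying.

```shell
#!/bin/sh
# Print the crashkernel= token from a kernel command line read on stdin.
# Returns non-zero when no crash memory reservation was requested.
crashkernel_param() {
    tr -s ' ' '\n' | grep '^crashkernel='
}

# Intended use on the Pi:
#   crashkernel_param < /proc/cmdline
#   cat /sys/kernel/kexec_crash_loaded   # 1 once a capture kernel is loaded
# The sysfs file only exists if the kernel was built with kexec support.
```

If the first command prints nothing, the boot command line needs a `crashkernel=` reservation before a capture kernel can be loaded.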

