After about five months or so I have finally accepted that this issue is not going to magically fix itself and it has stubbornly resisted all my attempts to resolve it so far... time to ask for help.
I have an RPi4 8GB instance called "cakey" that lives behind the TV in my workspace and runs Kodi 24/7 - it's probably my favourite computer out of the 50 or so I own and keeps me entertained all day while I work. About a year (?) ago I experimented with setting it up as root-on-ZFS booting aarch64 from a USB SSD. It worked perfectly and has remained in constant daily use ever since with absolutely no problems, except one: cakey will not upgrade to any kernel past linux-raspberrypi4-5.10.77-1-aarch64, and I really do mean any kernel.
cakey is upgraded daily (I like software updates) and is always completely up to date except for the kernel + matching headers package which I have had to blacklist since linux-raspberrypi4-5.10.77-1 and remain "stuck" on to this day. The problem first manifested on the day linux-raspberrypi4-5.10.78-1-aarch64 was released (2021-11-09 I think) when cakey failed to boot post-upgrade. Rolling back to the previous 5.10.77-1-aarch64 kernel + headers restored normal operation.
Since then once a week or so I remove the IgnorePkg directive blacklisting both packages and let cakey upgrade them. Not a single kernel past 5.10.77-1-aarch64 has worked. Every other package on the system (including the bootloader) is still updated daily however. My most recent test was today (2022-04-01) with kernel + headers linux-rpi-5.15.32-3, same result.
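For anyone wanting to see exactly what is being held back, here is a minimal sketch that lists the packages named on IgnorePkg lines in a pacman.conf-style file. The helper name `ignored_pkgs` and the package names in the comment are illustrative assumptions, not copied from cakey's actual config.

```shell
# Print the packages held back by IgnorePkg lines in a pacman.conf-style file.
# Example line in /etc/pacman.conf (package names are illustrative only):
#   IgnorePkg = linux-rpi linux-rpi-headers
# ignored_pkgs is a hypothetical helper name, not part of pacman.
ignored_pkgs() {
    conf=${1:-/etc/pacman.conf}
    # Keep only the value part of active (uncommented) IgnorePkg lines,
    # then print one package per line.
    sed -n -E 's/^[[:space:]]*IgnorePkg[[:space:]]*=[[:space:]]*//p' "$conf" \
        | tr -s ' ' '\n'
}
```

Running `ignored_pkgs` with no argument reads /etc/pacman.conf; commenting out the IgnorePkg line makes the output empty and lets the next upgrade pull in the kernel and headers again.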
The issue occurs early in boot after the system has successfully initialized the kernel and initramfs from the USB SSD. I've transcribed the relevant chunk of output from a phone pic for legibility omitting timestamps:
Freeing unused kernel memory: 2624K
Run /init as init process
:: running early hook [udev]
mmc1: new high speed SDIO card at address 0001
usb 1-1: new high speed USB device number 2 using xhci_hcd
Starting version 250.4-2-arch
:: running hook [udev]
:: Triggering uevents...
usb 1-1: New USB device found, idVendor=2109, idProduct=3431, bcdDevice= 4.21
usb 1-1: New USB device strings: Mfr=0, Product=1, SerialNumber=0
usb 1-1: Product USB2.0 Hub
hub 1-1:1.0: USB hub found
hub 1-1:1.0: 4 ports detected
usb 2-1: new SuperSpeed USB device number 2 using xhci_hcd
usb 2-1: New USB device found, idVendor=174c, idProduct=55aa, bcdDevice= 1.00
usb 2-1: New USB strings: Mfr=2, Product=3, SerialNumber=1
usb 2-1: Product: ASM105x
usb 2-1: Manufacturer: ASMT
usb 2-1: SerialNumber: 12345678E9FC
scsi host0: uas
scsi 0:0:0:0: Direct-Access ASMT ASM105x 0 PQ: 0 ANSI: 6
sd 0:0:0:0: [sda] 117231408 512-byte logical blocks: (60.0 GB/55 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] Optimal transfer size 33553920 bytes
:: running hook [zfs]
sda: sda1 sda2
usb 1-1.3: new low speed USB device number 3 using xhci_hcd
sd 0:0:0:0: [sda] Attached SCSI disk
usb 1-1.3: New USB device found, idVendor=413c, idProduct=2105, bcdDevice= 3.52
usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 1-1.3: Product Dell USB Keyboard
usb 1-1.3: Manufacturer: Dell
ZFS: Importing pool rpool.
The ZFS modules are not loaded.
Try running '/sbin/modprobe zfs' as root to load them.
/init: line 51: die: not found
The ZFS modules are not loaded.
Try running '/sbin/modprobe zfs' as root to load them.
:: running late hook [zfs]
input: Dell Dell USB Keyboard as /devices/platform/scb/fd500000,pcie,blahblah
The ZFS modules are not loaded.
Try running '/sbin/modprobe zfs' as root to load them.
:: running cleanup hook [udev]
spl: loading out-of-tree module taints kernel.
icp: module license 'CDDL' taints kernel.
Disabling lock debugging due to kernel taint
ERROR: failed to mount the real root device.
Bailing out, you are on your own, Good luck.
sh: can't access tty: job control turned off
I'm extremely familiar with linux + zfs and troubleshooting it when it goes wrong: normally at this point dropping to the emergency prompt and issuing the following will load the modules, set up the system and resume a normal boot:
modprobe zfs
zpool import rpool
zfs mount -a
exit
However, on cakey with any affected kernel this only works up to the exit: the modules load correctly, the pool is imported and the zfs datasets are mounted, but on exiting the emergency shell cakey faults immediately with "cannot find root filesystem" instead of resuming the boot.
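To be explicit about what "modules are loaded correctly" means here, this is a minimal sketch of the kind of check that can be run from the emergency shell before exiting. The helper name `module_loaded` is hypothetical; it only assumes the usual /proc/modules layout, where each line starts with the module name followed by a space.

```shell
# Return 0 if the named kernel module appears in a /proc/modules-style file.
# module_loaded is a hypothetical helper; the second argument exists only so
# the check can be pointed at a saved copy instead of the live /proc/modules.
module_loaded() {
    modules_file=${2:-/proc/modules}
    grep -q "^$1 " "$modules_file"
}

# Typical use at the emergency prompt after 'modprobe zfs':
#   module_loaded zfs && zpool list -H -o name
```

If the check passes and `zpool list` prints rpool, the modules and pool are in the state described above, and the failure is happening later, at the hand-off after `exit`.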
I've spent a fair bit of time on/off probing this issue and trying to resolve it with absolutely zero progress so far. I have tried:
Modifying config.txt and cmdline.txt with every variation known to man
Switching bootloaders
Testing the latest available kernel every week or two for five months
Blacklisting or upgrading only certain packages (bootloader, etc.) while trying upgrades in stages
Manually rebuilding the zfs modules via dkms and rebuilding the initramfs
Combing through this forum, Google, the GitHub issues pages for relevant projects and everywhere else for hints
Changing the hooks order in /etc/mkinitcpio.conf
Trying different compatible adapters, SSDs and another RPi4 8GB to rule out individual units
After installing a "bad" kernel I can repair cakey by moving the SSD to a handy RPi3B that runs exactly the same software stack but boots from a regular ext4 partition. The RPi3B is also aarch64 with the same zfs variant installed and has no issues with the same kernel upgrades: there I can chroot in, roll back the kernel to linux-raspberrypi4-5.10.77-1-aarch64 and restore normal working order. I don't normally bother, though: I'm so used to this by now that I image the entire SSD to a file or spare SSD first, and after a failed install I simply dd the clone back, which is faster and easier.
If I examine the initramfs from a failed kernel install I can see all the relevant zfs bits are indeed present:
comrade@failbot:~/RPIKODI$ lsinitramfs FAILEDBOOTDUMP/initramfs-linux.img | egrep -i 'zfs|zpool'
etc/zfs/
etc/zfs/zed.d/
hooks/zfs
usr/bin/fsck.zfs
usr/bin/mount.zfs
usr/bin/zfs
usr/bin/zpool
usr/lib/libzfs_core.so.3
usr/lib/libzfs.so.4
usr/lib/libzpool.so.5
usr/lib/modules/5.15.32-3-rpi-ARCH/kernel/zfs.ko
usr/lib/udev/rules.d/90-zfs.rules
This correctly matches the output of the same command run against a working initramfs from the same system.
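A minimal sketch of how that comparison can be made repeatable, assuming each listing has first been saved to a text file (for example with `lsinitramfs IMAGE > listing.txt`). The helper names `zfs_entries` and `compare_listings` and the `KVER` placeholder are my own inventions for illustration. The kernel version directory is replaced with the placeholder so that images built for different kernels can be compared line by line.

```shell
# Filter an initramfs file listing (stdin) down to ZFS-related entries and
# replace the kernel version directory under usr/lib/modules/ with the
# placeholder KVER, then sort so the order of entries does not matter.
zfs_entries() {
    grep -Ei 'zfs|zpool' \
        | sed -E 's#^(usr/lib/modules/)[^/]+/#\1KVER/#' \
        | sort
}

# Compare two saved listings (working first, failed second).
# Prints a unified diff and returns non-zero if the ZFS entries differ.
compare_listings() {
    good_tmp=$(mktemp) || return 2
    bad_tmp=$(mktemp) || return 2
    zfs_entries < "$1" > "$good_tmp"
    zfs_entries < "$2" > "$bad_tmp"
    status=0
    diff -u "$good_tmp" "$bad_tmp" || status=$?
    rm -f "$good_tmp" "$bad_tmp"
    return "$status"
}
```

Called as `compare_listings good.txt bad.txt`, it prints nothing and returns 0 when both images carry the same set of ZFS files. Note that the filter only catches paths containing "zfs" or "zpool", so a difference in a companion module with another name would not show up here.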
So it seems that although all the right pieces are in place for a normal boot, something is wrong. I would very much appreciate any pointers and have made a bet with myself that it's going to be something really, really simple that I am overlooking. Anyone?
As this post has somehow got rather long I'll follow up in a minute with the details of the relevant software, bootloader, firmware etc that cakey is currently using plus the boot stanzas from cmdline.txt and config.txt.