bcmgenet transmit queue timed out

by **clorden** » Sat Jul 22, 2023 4:08 pm

Hello All,
Since 4 days my raspberry Pi 4 is experiencing a NIC connectivity problem. The NIC stops working with the following backtrace:

[code]
Jul 21 06:02:17 xxx.xxxxx kernel: ------------[ cut here ]------------
Jul 21 06:02:17 xxx.xxxxx kernel: NETDEV WATCHDOG: end0 (bcmgenet): transmit queue 0 timed out
Jul 21 06:02:17 xxx.xxxxx kernel: WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x278/0x280
Jul 21 06:02:17 xxx.xxxxx kernel: Modules linked in: xt_LOG nf_log_syslog xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_conntrack iptable_filter xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 brcmfmac brcmutil hci_uart btbcm bluetooth rpivid_hevc(C) bcm2835_codec(C) bcm2835_v4l2(C) >
Jul 21 06:02:17 xxx.xxxxx kernel: CPU: 3 PID: 0 Comm: swapper/3 Tainted: G C 6.1.38-2-rpi-ARCH #1
Jul 21 06:02:17 xxx.xxxxx kernel: Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
Jul 21 06:02:17 xxx.xxxxx kernel: pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Jul 21 06:02:17 xxx.xxxxx kernel: pc : dev_watchdog+0x278/0x280
Jul 21 06:02:17 xxx.xxxxx kernel: lr : dev_watchdog+0x278/0x280
Jul 21 06:02:17 xxx.xxxxx kernel: sp : ffffffc00801bd90
Jul 21 06:02:17 xxx.xxxxx kernel: x29: ffffffc00801bd90 x28: ffffffd83475b274 x27: ffffffc00801beb0
Jul 21 06:02:17 xxx.xxxxx kernel: x26: ffffffd834e18008 x25: 0000000000000000 x24: ffffffd83514d860
Jul 21 06:02:17 xxx.xxxxx kernel: x23: ffffffd835146000 x22: 0000000000000000 x21: ffffff8102fe83dc
Jul 21 06:02:17 xxx.xxxxx kernel: x20: ffffff8102fe8000 x19: ffffff8102fe8488 x18: 0000000000000006
Jul 21 06:02:17 xxx.xxxxx kernel: x17: ffffffa9ca1aa000 x16: 0000000000000010 x15: 0000000000000001
Jul 21 06:02:17 xxx.xxxxx kernel: x14: 0000000020000000 x13: 0000000000000002 x12: 0000000000000000
Jul 21 06:02:17 xxx.xxxxx kernel: x11: 0000000000000000 x10: ffffffd8351c3740 x9 : ffffffd833cfa564
Jul 21 06:02:17 xxx.xxxxx kernel: x8 : 00000000ffffefff x7 : ffffffd8351c3740 x6 : 80000000fffff000
Jul 21 06:02:17 xxx.xxxxx kernel: x5 : ffffff81fefc4990 x4 : 0000000000000003 x3 : 0000000000000004
Jul 21 06:02:17 xxx.xxxxx kernel: x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff81002c3e00
Jul 21 06:02:17 xxx.xxxxx kernel: Call trace:
Jul 21 06:02:17 xxx.xxxxx kernel: dev_watchdog+0x278/0x280
Jul 21 06:02:17 xxx.xxxxx kernel: call_timer_fn+0x3c/0x1cc
Jul 21 06:02:17 xxx.xxxxx kernel: __run_timers+0x258/0x314
Jul 21 06:02:17 xxx.xxxxx kernel: run_timer_softirq+0x38/0x60
Jul 21 06:02:17 xxx.xxxxx kernel: __do_softirq+0x198/0x4d8
Jul 21 06:02:17 xxx.xxxxx kernel: ____do_softirq+0x18/0x24
Jul 21 06:02:17 xxx.xxxxx kernel: call_on_irq_stack+0x24/0x54
Jul 21 06:02:17 xxx.xxxxx kernel: do_softirq_own_stack+0x24/0x3c
Jul 21 06:02:17 xxx.xxxxx kernel: __irq_exit_rcu+0xd4/0x120
Jul 21 06:02:17 xxx.xxxxx kernel: irq_exit_rcu+0x18/0x50
Jul 21 06:02:17 xxx.xxxxx kernel: el1_interrupt+0x38/0x70
Jul 21 06:02:17 xxx.xxxxx kernel: el1h_64_irq_handler+0x18/0x2c
Jul 21 06:02:17 xxx.xxxxx kernel: el1h_64_irq+0x64/0x68
Jul 21 06:02:17 xxx.xxxxx kernel: arch_cpu_idle+0x18/0x2c
Jul 21 06:02:17 xxx.xxxxx kernel: default_idle_call+0x54/0x19c
Jul 21 06:02:17 xxx.xxxxx kernel: do_idle+0x26c/0x2b0
Jul 21 06:02:17 xxx.xxxxx kernel: cpu_startup_entry+0x30/0x3c
Jul 21 06:02:17 xxx.xxxxx kernel: secondary_start_kernel+0x128/0x150
Jul 21 06:02:17 xxx.xxxxx kernel: __secondary_switched+0xb0/0xb4
Jul 21 06:02:17 xxx.xxxxx kernel: ---[ end trace 0000000000000000 ]---
[/code]

No traffic goes through, LEDs on the NIC are blinking though.
The ip command shows the NIC is up, ethtool also reports no problem. I don't see much else besides the tx errors counter increasing.

Interestingly enough rebooting does [b]not[/b] fix the issue: upon reboot the issue happens again immediately after the NIC goes up.
Bringing the NIC down and up again does not fix it either.

The only fix I found so far is physically disconnecting the RJ45 and plugging it back again: traffic resumes only in this case.

I tried upgrading the kernel from 6.1.35-4 to 6.1.38-2 to no avail. I just upgraded to 6.1.39-2. I will see if tonight the issue happens again.

Any idea?

Thanks!

by **clorden** » Tue Jul 25, 2023 7:26 pm

Hello again,
I have an update and I believe this might be a regression.

After upgrading to the kernel 6.1.39-1-rpi the issue occurred again this morning:

[code]
Jul 25 06:52:32 xxx.xxxxx kernel: ------------[ cut here ]------------
Jul 25 06:52:32 xxx.xxxxx kernel: NETDEV WATCHDOG: end0 (bcmgenet): transmit queue 3 timed out
Jul 25 06:52:32 xxx.xxxxx kernel: WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x278/0x280
Jul 25 06:52:32 xxx.xxxxx kernel: Modules linked in: wireguard libchacha20poly1305 chacha_neon poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha xt_LOG nf_log_syslog xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_conntrack iptable_filter xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defr>
Jul 25 06:52:32 xxx.xxxxx kernel: drm_panel_orientation_quirks backlight
Jul 25 06:52:32 xxx.xxxxx kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G C 6.1.39-1-rpi-ARCH #1
Jul 25 06:52:32 xxx.xxxxx kernel: Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
Jul 25 06:52:32 xxx.xxxxx kernel: pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Jul 25 06:52:32 xxx.xxxxx kernel: pc : dev_watchdog+0x278/0x280
Jul 25 06:52:32 xxx.xxxxx kernel: lr : dev_watchdog+0x278/0x280
Jul 25 06:52:32 xxx.xxxxx kernel: sp : ffffffc008013d90
Jul 25 06:52:32 xxx.xxxxx kernel: x29: ffffffc008013d90 x28: ffffffeee135da74 x27: ffffffc008013eb0
Jul 25 06:52:32 xxx.xxxxx kernel: x26: ffffffeee1a18008 x25: 0000000000000000 x24: ffffffeee1d4d860
Jul 25 06:52:32 xxx.xxxxx kernel: x23: ffffffeee1d46000 x22: 0000000000000003 x21: ffffff81031503dc
Jul 25 06:52:32 xxx.xxxxx kernel: x20: ffffff8103150000 x19: ffffff8103150488 x18: 0000000000000006
Jul 25 06:52:32 xxx.xxxxx kernel: x17: ffffff931d58d000 x16: 0000000000000010 x15: 0000000000000001
Jul 25 06:52:32 xxx.xxxxx kernel: x14: 0000000020000000 x13: 0000000000000002 x12: 0000000000000000
Jul 25 06:52:32 xxx.xxxxx kernel: x11: 0000000000000000 x10: ffffffeee1dc3740 x9 : ffffffeee08fa564
Jul 25 06:52:32 xxx.xxxxx kernel: x8 : 00000000ffffefff x7 : ffffffeee1dc3740 x6 : 80000000fffff000
Jul 25 06:52:32 xxx.xxxxx kernel: x5 : ffffff81fefa7990 x4 : 0000000000000040 x3 : 0000000000000004
Jul 25 06:52:32 xxx.xxxxx kernel: x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff81002c1f00
Jul 25 06:52:32 xxx.xxxxx kernel: Call trace:
Jul 25 06:52:32 xxx.xxxxx kernel: dev_watchdog+0x278/0x280
Jul 25 06:52:32 xxx.xxxxx kernel: call_timer_fn+0x3c/0x1cc
Jul 25 06:52:32 xxx.xxxxx kernel: __run_timers+0x258/0x314
Jul 25 06:52:32 xxx.xxxxx kernel: run_timer_softirq+0x38/0x60
Jul 25 06:52:32 xxx.xxxxx kernel: __do_softirq+0x198/0x4d8
Jul 25 06:52:32 xxx.xxxxx kernel: ____do_softirq+0x18/0x24
Jul 25 06:52:32 xxx.xxxxx kernel: call_on_irq_stack+0x24/0x54
Jul 25 06:52:32 xxx.xxxxx kernel: do_softirq_own_stack+0x24/0x3c
Jul 25 06:52:32 xxx.xxxxx kernel: __irq_exit_rcu+0xd4/0x120
Jul 25 06:52:32 xxx.xxxxx kernel: irq_exit_rcu+0x18/0x50
Jul 25 06:52:32 xxx.xxxxx kernel: el1_interrupt+0x38/0x70
Jul 25 06:52:32 xxx.xxxxx kernel: el1h_64_irq_handler+0x18/0x2c
Jul 25 06:52:32 xxx.xxxxx kernel: el1h_64_irq+0x64/0x68
Jul 25 06:52:32 xxx.xxxxx kernel: arch_cpu_idle+0x18/0x2c
Jul 25 06:52:32 xxx.xxxxx kernel: default_idle_call+0x54/0x19c
Jul 25 06:52:32 xxx.xxxxx kernel: do_idle+0x26c/0x2b0
Jul 25 06:52:32 xxx.xxxxx kernel: cpu_startup_entry+0x30/0x3c
Jul 25 06:52:32 xxx.xxxxx kernel: secondary_start_kernel+0x128/0x150
Jul 25 06:52:32 xxx.xxxxx kernel: __secondary_switched+0xb0/0xb4
Jul 25 06:52:32 xxx.xxxxx kernel: ---[ end trace 0000000000000000 ]---
[/code]

This seems a fairly similar issue as the one reported upstream last year:
make_clickable_callback(MAGIC_URL_FULL, '
', 'https://lore.kernel.org/netdev/f0f12f2a-b2ac-6292-55eb-a207c8a965f1@arm.com/T/#t', '', ' class="postlink"')

Following that investigation, I checked the behavior of the interrupt lines, this is with kernel 6.1.39-2 after about a day from booting (yes it's not a busy box ATM):

[code]
$ cat /proc/interrupts | grep end0
31: 177656 0 0 0 GICv2 189 Level end0
32: 1310 0 0 0 GICv2 190 Level end0
[/code]

While using the kernel 6.1.39 there seems to be a large difference between the number of interrupts: the second line barely ever increases.

I then downgraded the kernel to the last known working version, I.E 6.1.29-2-rpi. Here is the result after a couple of minutes since rebooting:

[code]
$ cat /proc/interrupts | grep end0
31: 3214 0 0 0 GICv2 189 Level end0
32: 970 0 0 0 GICv2 190 Level end0
[/code]

Both interrupt lines are increasing at a comparable pace.

I will keep the old kernel running to see if the issue goes away.

In any case, looking at the differences between Linux 6.1.29 and 6.1.3X there was a change in relation to the NIC driver itself, and I believe it is the only one:
make_clickable_callback(MAGIC_URL_FULL, '
', 'https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/ethernet/broadcom/genet/bcmmii.c?h=v6.1.41&id=5df74018d1e6b8fdf4bba9be6e8f9bc8e638e243', '', ' class="postlink"')

And yes, I have EEE disabled.

If this is a regression, it would be an upstream bug, unless there is something else I am missing.
However our kernel is tainted, AFAIK. Any idea on how to go ahead in this case?

Thanks!

bcmgenet transmit queue timed out

bcmgenet transmit queue timed out

Re: bcmgenet transmit queue timed out

Who is online