kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155

LKML Archive mirror
 help / color / mirror / Atom feed

* kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
@ 2024-04-24  0:54 Chris Packham
  2024-04-24  1:38 ` Chris Packham
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Packham @ 2024-04-24  0:54 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, Greg Kroah-Hartman
  Cc: linux-nfs@vger.kernel.org, netdev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

Hi Jeff, Chuck, Greg,

After updating one of our builds along the 5.15.y LTS branch our testing 
caught a new kernel bug. Output below.

I haven't dug into it yet but wondered if it rang any bells.

Thanks,
Chris

[   91.605109] ------------[ cut here ]------------
[   91.605122] kernel BUG at net/sunrpc/svc.c:570!
[   91.605129] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[   91.610643] Modules linked in: mvcpss(O) platform_driver(O) ipifwd(O) 
xt_l2tp xt_hashlimit xt_conntrack xt_addrtype xt_LOG xt_CHECKSUM wp512 
vxlan veth twofish_generic twofish_common sr9800 smsc95xx smsc75xx smsc 
sm3_generic sha512_arm64 sha3_generic serpent_generic rtl8150 
rpcsec_gss_krb5 rmd160 poly1305_generic plusb pegasus optee_rng nbd 
microchip md4 md_mod mcs7830 lrw libpoly1305 lan78xx l2tp_ip6 l2tp_ip 
l2tp_eth l2tp_netlink l2tp_core udp_tunnel ipt_REJECT nf_reject_ipv4 
ip6table_nat ip6table_mangle ip6table_filter ip6t_ipv6header ip6t_REJECT 
ip6_udp_tunnel ip6_tables dm9601 dm_zero dm_mirror dm_region_hash dm_log 
dm_mod diag tipc cuse cts cpufreq_powersave cpufreq_conservative 
chacha_generic chacha20poly1305 chacha_neon libchacha cast6_generic 
cast5_generic cast_common camellia_generic blowfish_generic 
blowfish_common auth_rpcgss oid_registry at25 arm_smccc_trng 
aes_neon_blk idprom_mtd(O) idprom_i2c(O) epi3_boardinfo_i2c(O) x250(O) 
psuslot_epi3_register(O) psuslot_gpio_group(O)
[   91.610809]  psuslot(O)
[   91.611822] watchdog: watchdog1: watchdog did not stop!
[   91.697065]  gpiopins_boardinfo(O) idprom(O) epi3_boardinfo(O) 
boardinfo(O) i2c_gpio i2c_algo_bit i2c_mv64xxx pluggable(O) 
led_enable(O) omap_rng rng_core atl_reset(O) sbsa_gwdt uio_pdrv_genirq
[   91.697096] CPU: 2 PID: 1770 Comm: nfsd Kdump: loaded Tainted: 
G           O      5.15.155 #1
[   91.697103] Hardware name: Allied Telesis x250-28XTm (DT)
[   91.697107] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS 
BTYPE=--)
[   91.697112] pc : svc_destroy+0x84/0xac
[   91.701202] watchdog: watchdog0: watchdog did not stop!
[   91.702215] lr : svc_destroy+0x2c/0xac
[   91.702220] sp : ffff80000bb3bde0
[   91.702223] x29: ffff80000bb3bde0 x28: 0000000000000000 x27: 
0000000000000000
[   91.746095] x26: 0000000000000000 x25: ffff00000dbfaa40 x24: 
ffff000016c14000
[   91.746101] x23: ffff800008395c00 x22: ffff00000ee9f284 x21: 
ffff00000eea9e10
[   91.746108] x20: ffff00000eea9e00 x19: ffff00000eea9e14 x18: 
ffff800008e99000
[   91.769526] x17: 0000000000000006 x16: 0000000000000000 x15: 
0000000000000001
[   91.776782] x14: 00000000fffffffd x13: fffffc0000000000 x12: 
ffff800076bc2000
[   91.784031] x11: ffff00007fba5c10 x10: ffff800076bc2000 x9 : 
ffff8000092207c0
[   91.784038] x8 : fffffc000055eb08 x7 : ffff00000ef6c4c0 x6 : 
fffffc0001f872c8
[   91.795823] x5 : 0000000000000100 x4 : ffff00007fbaeda8 x3 : 
0000000000000000
[   91.801684] x2 : 0000000000000000 x1 : ffff00000d8f8018 x0 : 
ffff00000eea9e30
[   91.807545] Call trace:
[   91.810088]  svc_destroy+0x84/0xac
[   91.813586]  svc_exit_thread+0x108/0x15c
[   91.816998]  nfsd+0x178/0x1a0
[   91.818673]  kthread+0x150/0x160
[   91.820610]  ret_from_fork+0x10/0x20
[   91.820620] Code: a94153f3 a8c27bfd d50323bf d65f03c0 (d4210000)
[   91.820629] SMP: stopping secondary CPUs
[   91.830433] Starting crashdump kernel...
[   91.833064] Bye!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-24  0:54 kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155 Chris Packham
@ 2024-04-24  1:38 ` Chris Packham
  2024-04-24  7:42   ` Chris Packham
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Packham @ 2024-04-24  1:38 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, Greg Kroah-Hartman
  Cc: linux-nfs@vger.kernel.org, netdev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org


On 24/04/24 12:54, Chris Packham wrote:
> Hi Jeff, Chuck, Greg,
>
> After updating one of our builds along the 5.15.y LTS branch our 
> testing caught a new kernel bug. Output below.
>
> I haven't dug into it yet but wondered if it rang any bells.

A bit more info. This is happening at "reboot" for us. Our embedded 
devices use a bit of a hacked up reboot process so that they come back 
faster in the case of a failure.

It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B

I can trigger it with `killall -9 nfsd` which I'm not sure is a 
completely legit thing to do to kernel threads but it's probably close 
to what our customized reboot does.

>
> Thanks,
> Chris
>
> [   91.605109] ------------[ cut here ]------------
> [   91.605122] kernel BUG at net/sunrpc/svc.c:570!
> [   91.605129] Internal error: Oops - BUG: 00000000f2000800 [#1] 
> PREEMPT SMP
> [   91.610643] Modules linked in: mvcpss(O) platform_driver(O) 
> ipifwd(O) xt_l2tp xt_hashlimit xt_conntrack xt_addrtype xt_LOG 
> xt_CHECKSUM wp512 vxlan veth twofish_generic twofish_common sr9800 
> smsc95xx smsc75xx smsc sm3_generic sha512_arm64 sha3_generic 
> serpent_generic rtl8150 rpcsec_gss_krb5 rmd160 poly1305_generic plusb 
> pegasus optee_rng nbd microchip md4 md_mod mcs7830 lrw libpoly1305 
> lan78xx l2tp_ip6 l2tp_ip l2tp_eth l2tp_netlink l2tp_core udp_tunnel 
> ipt_REJECT nf_reject_ipv4 ip6table_nat ip6table_mangle ip6table_filter 
> ip6t_ipv6header ip6t_REJECT ip6_udp_tunnel ip6_tables dm9601 dm_zero 
> dm_mirror dm_region_hash dm_log dm_mod diag tipc cuse cts 
> cpufreq_powersave cpufreq_conservative chacha_generic chacha20poly1305 
> chacha_neon libchacha cast6_generic cast5_generic cast_common 
> camellia_generic blowfish_generic blowfish_common auth_rpcgss 
> oid_registry at25 arm_smccc_trng aes_neon_blk idprom_mtd(O) 
> idprom_i2c(O) epi3_boardinfo_i2c(O) x250(O) psuslot_epi3_register(O) 
> psuslot_gpio_group(O)
> [   91.610809]  psuslot(O)
> [   91.611822] watchdog: watchdog1: watchdog did not stop!
> [   91.697065]  gpiopins_boardinfo(O) idprom(O) epi3_boardinfo(O) 
> boardinfo(O) i2c_gpio i2c_algo_bit i2c_mv64xxx pluggable(O) 
> led_enable(O) omap_rng rng_core atl_reset(O) sbsa_gwdt uio_pdrv_genirq
> [   91.697096] CPU: 2 PID: 1770 Comm: nfsd Kdump: loaded Tainted: 
> G           O      5.15.155 #1
> [   91.697103] Hardware name: Allied Telesis x250-28XTm (DT)
> [   91.697107] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS 
> BTYPE=--)
> [   91.697112] pc : svc_destroy+0x84/0xac
> [   91.701202] watchdog: watchdog0: watchdog did not stop!
> [   91.702215] lr : svc_destroy+0x2c/0xac
> [   91.702220] sp : ffff80000bb3bde0
> [   91.702223] x29: ffff80000bb3bde0 x28: 0000000000000000 x27: 
> 0000000000000000
> [   91.746095] x26: 0000000000000000 x25: ffff00000dbfaa40 x24: 
> ffff000016c14000
> [   91.746101] x23: ffff800008395c00 x22: ffff00000ee9f284 x21: 
> ffff00000eea9e10
> [   91.746108] x20: ffff00000eea9e00 x19: ffff00000eea9e14 x18: 
> ffff800008e99000
> [   91.769526] x17: 0000000000000006 x16: 0000000000000000 x15: 
> 0000000000000001
> [   91.776782] x14: 00000000fffffffd x13: fffffc0000000000 x12: 
> ffff800076bc2000
> [   91.784031] x11: ffff00007fba5c10 x10: ffff800076bc2000 x9 : 
> ffff8000092207c0
> [   91.784038] x8 : fffffc000055eb08 x7 : ffff00000ef6c4c0 x6 : 
> fffffc0001f872c8
> [   91.795823] x5 : 0000000000000100 x4 : ffff00007fbaeda8 x3 : 
> 0000000000000000
> [   91.801684] x2 : 0000000000000000 x1 : ffff00000d8f8018 x0 : 
> ffff00000eea9e30
> [   91.807545] Call trace:
> [   91.810088]  svc_destroy+0x84/0xac
> [   91.813586]  svc_exit_thread+0x108/0x15c
> [   91.816998]  nfsd+0x178/0x1a0
> [   91.818673]  kthread+0x150/0x160
> [   91.820610]  ret_from_fork+0x10/0x20
> [   91.820620] Code: a94153f3 a8c27bfd d50323bf d65f03c0 (d4210000)
> [   91.820629] SMP: stopping secondary CPUs
> [   91.830433] Starting crashdump kernel...
> [   91.833064] Bye!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-24  1:38 ` Chris Packham
@ 2024-04-24  7:42   ` Chris Packham
  2024-04-24 13:33     ` Chuck Lever III
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Packham @ 2024-04-24  7:42 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, Greg Kroah-Hartman, neilb@suse.de
  Cc: linux-nfs@vger.kernel.org, netdev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org


On 24/04/24 13:38, Chris Packham wrote:
>
> On 24/04/24 12:54, Chris Packham wrote:
>> Hi Jeff, Chuck, Greg,
>>
>> After updating one of our builds along the 5.15.y LTS branch our 
>> testing caught a new kernel bug. Output below.
>>
>> I haven't dug into it yet but wondered if it rang any bells.
>
> A bit more info. This is happening at "reboot" for us. Our embedded 
> devices use a bit of a hacked up reboot process so that they come back 
> faster in the case of a failure.
>
> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
>
> I can trigger it with `killall -9 nfsd` which I'm not sure is a 
> completely legit thing to do to kernel threads but it's probably close 
> to what our customized reboot does.

I've bisected between v5.15.153 and v5.15.155 and identified commit 
dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in 
nfsd()") as the first bad commit. Based on the context that seems to 
line up with my reproduction. I'm wondering if perhaps something got 
missed out of the stable track? Unfortunately I'm not able to run a more 
recent kernel with all of the nfs related setup that is being used on  
the system in question.

>
>>
>> Thanks,
>> Chris
>>
>> [   91.605109] ------------[ cut here ]------------
>> [   91.605122] kernel BUG at net/sunrpc/svc.c:570!
>> [   91.605129] Internal error: Oops - BUG: 00000000f2000800 [#1] 
>> PREEMPT SMP
>> [   91.610643] Modules linked in: mvcpss(O) platform_driver(O) 
>> ipifwd(O) xt_l2tp xt_hashlimit xt_conntrack xt_addrtype xt_LOG 
>> xt_CHECKSUM wp512 vxlan veth twofish_generic twofish_common sr9800 
>> smsc95xx smsc75xx smsc sm3_generic sha512_arm64 sha3_generic 
>> serpent_generic rtl8150 rpcsec_gss_krb5 rmd160 poly1305_generic plusb 
>> pegasus optee_rng nbd microchip md4 md_mod mcs7830 lrw libpoly1305 
>> lan78xx l2tp_ip6 l2tp_ip l2tp_eth l2tp_netlink l2tp_core udp_tunnel 
>> ipt_REJECT nf_reject_ipv4 ip6table_nat ip6table_mangle 
>> ip6table_filter ip6t_ipv6header ip6t_REJECT ip6_udp_tunnel ip6_tables 
>> dm9601 dm_zero dm_mirror dm_region_hash dm_log dm_mod diag tipc cuse 
>> cts cpufreq_powersave cpufreq_conservative chacha_generic 
>> chacha20poly1305 chacha_neon libchacha cast6_generic cast5_generic 
>> cast_common camellia_generic blowfish_generic blowfish_common 
>> auth_rpcgss oid_registry at25 arm_smccc_trng aes_neon_blk 
>> idprom_mtd(O) idprom_i2c(O) epi3_boardinfo_i2c(O) x250(O) 
>> psuslot_epi3_register(O) psuslot_gpio_group(O)
>> [   91.610809]  psuslot(O)
>> [   91.611822] watchdog: watchdog1: watchdog did not stop!
>> [   91.697065]  gpiopins_boardinfo(O) idprom(O) epi3_boardinfo(O) 
>> boardinfo(O) i2c_gpio i2c_algo_bit i2c_mv64xxx pluggable(O) 
>> led_enable(O) omap_rng rng_core atl_reset(O) sbsa_gwdt uio_pdrv_genirq
>> [   91.697096] CPU: 2 PID: 1770 Comm: nfsd Kdump: loaded Tainted: 
>> G           O      5.15.155 #1
>> [   91.697103] Hardware name: Allied Telesis x250-28XTm (DT)
>> [   91.697107] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS 
>> BTYPE=--)
>> [   91.697112] pc : svc_destroy+0x84/0xac
>> [   91.701202] watchdog: watchdog0: watchdog did not stop!
>> [   91.702215] lr : svc_destroy+0x2c/0xac
>> [   91.702220] sp : ffff80000bb3bde0
>> [   91.702223] x29: ffff80000bb3bde0 x28: 0000000000000000 x27: 
>> 0000000000000000
>> [   91.746095] x26: 0000000000000000 x25: ffff00000dbfaa40 x24: 
>> ffff000016c14000
>> [   91.746101] x23: ffff800008395c00 x22: ffff00000ee9f284 x21: 
>> ffff00000eea9e10
>> [   91.746108] x20: ffff00000eea9e00 x19: ffff00000eea9e14 x18: 
>> ffff800008e99000
>> [   91.769526] x17: 0000000000000006 x16: 0000000000000000 x15: 
>> 0000000000000001
>> [   91.776782] x14: 00000000fffffffd x13: fffffc0000000000 x12: 
>> ffff800076bc2000
>> [   91.784031] x11: ffff00007fba5c10 x10: ffff800076bc2000 x9 : 
>> ffff8000092207c0
>> [   91.784038] x8 : fffffc000055eb08 x7 : ffff00000ef6c4c0 x6 : 
>> fffffc0001f872c8
>> [   91.795823] x5 : 0000000000000100 x4 : ffff00007fbaeda8 x3 : 
>> 0000000000000000
>> [   91.801684] x2 : 0000000000000000 x1 : ffff00000d8f8018 x0 : 
>> ffff00000eea9e30
>> [   91.807545] Call trace:
>> [   91.810088]  svc_destroy+0x84/0xac
>> [   91.813586]  svc_exit_thread+0x108/0x15c
>> [   91.816998]  nfsd+0x178/0x1a0
>> [   91.818673]  kthread+0x150/0x160
>> [   91.820610]  ret_from_fork+0x10/0x20
>> [   91.820620] Code: a94153f3 a8c27bfd d50323bf d65f03c0 (d4210000)
>> [   91.820629] SMP: stopping secondary CPUs
>> [   91.830433] Starting crashdump kernel...
>> [   91.833064] Bye!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-24  7:42   ` Chris Packham
@ 2024-04-24 13:33     ` Chuck Lever III
  2024-04-24 14:03       ` Chuck Lever III
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever III @ 2024-04-24 13:33 UTC (permalink / raw)
  To: Chris Packham, Neil Brown
  Cc: Jeff Layton, Greg Kroah-Hartman, Linux NFS Mailing List, netdev,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org



> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> 
> 
> On 24/04/24 13:38, Chris Packham wrote:
>> 
>> On 24/04/24 12:54, Chris Packham wrote:
>>> Hi Jeff, Chuck, Greg,
>>> 
>>> After updating one of our builds along the 5.15.y LTS branch our 
>>> testing caught a new kernel bug. Output below.
>>> 
>>> I haven't dug into it yet but wondered if it rang any bells.
>> 
>> A bit more info. This is happening at "reboot" for us. Our embedded 
>> devices use a bit of a hacked up reboot process so that they come back 
>> faster in the case of a failure.
>> 
>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
>> 
>> I can trigger it with `killall -9 nfsd` which I'm not sure is a 
>> completely legit thing to do to kernel threads but it's probably close 
>> to what our customized reboot does.
> 
> I've bisected between v5.15.153 and v5.15.155 and identified commit 
> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in 
> nfsd()") as the first bad commit. Based on the context that seems to 
> line up with my reproduction. I'm wondering if perhaps something got 
> missed out of the stable track? Unfortunately I'm not able to run a more 
> recent kernel with all of the nfs related setup that is being used on  
> the system in question.

Thanks for bisecting, that would have been my first suggestion.

The backport included all of the NFSD patches up to v6.2, but
there might be a missing server-side SunRPC patch. I'll try to
have a look soon, but this week is an NFS plug-fest. Neil,
maybe you can have a look too?


>>> Thanks,
>>> Chris
>>> 
>>> [   91.605109] ------------[ cut here ]------------
>>> [   91.605122] kernel BUG at net/sunrpc/svc.c:570!
>>> [   91.605129] Internal error: Oops - BUG: 00000000f2000800 [#1] 
>>> PREEMPT SMP
>>> [   91.610643] Modules linked in: mvcpss(O) platform_driver(O) 
>>> ipifwd(O) xt_l2tp xt_hashlimit xt_conntrack xt_addrtype xt_LOG 
>>> xt_CHECKSUM wp512 vxlan veth twofish_generic twofish_common sr9800 
>>> smsc95xx smsc75xx smsc sm3_generic sha512_arm64 sha3_generic 
>>> serpent_generic rtl8150 rpcsec_gss_krb5 rmd160 poly1305_generic plusb 
>>> pegasus optee_rng nbd microchip md4 md_mod mcs7830 lrw libpoly1305 
>>> lan78xx l2tp_ip6 l2tp_ip l2tp_eth l2tp_netlink l2tp_core udp_tunnel 
>>> ipt_REJECT nf_reject_ipv4 ip6table_nat ip6table_mangle 
>>> ip6table_filter ip6t_ipv6header ip6t_REJECT ip6_udp_tunnel ip6_tables 
>>> dm9601 dm_zero dm_mirror dm_region_hash dm_log dm_mod diag tipc cuse 
>>> cts cpufreq_powersave cpufreq_conservative chacha_generic 
>>> chacha20poly1305 chacha_neon libchacha cast6_generic cast5_generic 
>>> cast_common camellia_generic blowfish_generic blowfish_common 
>>> auth_rpcgss oid_registry at25 arm_smccc_trng aes_neon_blk 
>>> idprom_mtd(O) idprom_i2c(O) epi3_boardinfo_i2c(O) x250(O) 
>>> psuslot_epi3_register(O) psuslot_gpio_group(O)
>>> [   91.610809]  psuslot(O)
>>> [   91.611822] watchdog: watchdog1: watchdog did not stop!
>>> [   91.697065]  gpiopins_boardinfo(O) idprom(O) epi3_boardinfo(O) 
>>> boardinfo(O) i2c_gpio i2c_algo_bit i2c_mv64xxx pluggable(O) 
>>> led_enable(O) omap_rng rng_core atl_reset(O) sbsa_gwdt uio_pdrv_genirq
>>> [   91.697096] CPU: 2 PID: 1770 Comm: nfsd Kdump: loaded Tainted: 
>>> G           O      5.15.155 #1
>>> [   91.697103] Hardware name: Allied Telesis x250-28XTm (DT)
>>> [   91.697107] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS 
>>> BTYPE=--)
>>> [   91.697112] pc : svc_destroy+0x84/0xac
>>> [   91.701202] watchdog: watchdog0: watchdog did not stop!
>>> [   91.702215] lr : svc_destroy+0x2c/0xac
>>> [   91.702220] sp : ffff80000bb3bde0
>>> [   91.702223] x29: ffff80000bb3bde0 x28: 0000000000000000 x27: 
>>> 0000000000000000
>>> [   91.746095] x26: 0000000000000000 x25: ffff00000dbfaa40 x24: 
>>> ffff000016c14000
>>> [   91.746101] x23: ffff800008395c00 x22: ffff00000ee9f284 x21: 
>>> ffff00000eea9e10
>>> [   91.746108] x20: ffff00000eea9e00 x19: ffff00000eea9e14 x18: 
>>> ffff800008e99000
>>> [   91.769526] x17: 0000000000000006 x16: 0000000000000000 x15: 
>>> 0000000000000001
>>> [   91.776782] x14: 00000000fffffffd x13: fffffc0000000000 x12: 
>>> ffff800076bc2000
>>> [   91.784031] x11: ffff00007fba5c10 x10: ffff800076bc2000 x9 : 
>>> ffff8000092207c0
>>> [   91.784038] x8 : fffffc000055eb08 x7 : ffff00000ef6c4c0 x6 : 
>>> fffffc0001f872c8
>>> [   91.795823] x5 : 0000000000000100 x4 : ffff00007fbaeda8 x3 : 
>>> 0000000000000000
>>> [   91.801684] x2 : 0000000000000000 x1 : ffff00000d8f8018 x0 : 
>>> ffff00000eea9e30
>>> [   91.807545] Call trace:
>>> [   91.810088]  svc_destroy+0x84/0xac
>>> [   91.813586]  svc_exit_thread+0x108/0x15c
>>> [   91.816998]  nfsd+0x178/0x1a0
>>> [   91.818673]  kthread+0x150/0x160
>>> [   91.820610]  ret_from_fork+0x10/0x20
>>> [   91.820620] Code: a94153f3 a8c27bfd d50323bf d65f03c0 (d4210000)
>>> [   91.820629] SMP: stopping secondary CPUs
>>> [   91.830433] Starting crashdump kernel...
>>> [   91.833064] Bye!


--
Chuck Lever



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-24 13:33     ` Chuck Lever III
@ 2024-04-24 14:03       ` Chuck Lever III
  2024-04-24 23:37         ` NeilBrown
  2024-04-25 15:50         ` Chuck Lever
  0 siblings, 2 replies; 11+ messages in thread
From: Chuck Lever III @ 2024-04-24 14:03 UTC (permalink / raw)
  To: Chris Packham, Neil Brown
  Cc: Jeff Layton, Greg Kroah-Hartman, Linux NFS Mailing List, netdev,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org


> On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
>> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
>> 
>> On 24/04/24 13:38, Chris Packham wrote:
>>> 
>>> On 24/04/24 12:54, Chris Packham wrote:
>>>> Hi Jeff, Chuck, Greg,
>>>> 
>>>> After updating one of our builds along the 5.15.y LTS branch our 
>>>> testing caught a new kernel bug. Output below.
>>>> 
>>>> I haven't dug into it yet but wondered if it rang any bells.
>>> 
>>> A bit more info. This is happening at "reboot" for us. Our embedded 
>>> devices use a bit of a hacked up reboot process so that they come back 
>>> faster in the case of a failure.
>>> 
>>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
>>> 
>>> I can trigger it with `killall -9 nfsd` which I'm not sure is a 
>>> completely legit thing to do to kernel threads but it's probably close 
>>> to what our customized reboot does.
>> 
>> I've bisected between v5.15.153 and v5.15.155 and identified commit 
>> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in 
>> nfsd()") as the first bad commit. Based on the context that seems to 
>> line up with my reproduction. I'm wondering if perhaps something got 
>> missed out of the stable track? Unfortunately I'm not able to run a more 
>> recent kernel with all of the nfs related setup that is being used on  
>> the system in question.
> 
> Thanks for bisecting, that would have been my first suggestion.
> 
> The backport included all of the NFSD patches up to v6.2, but
> there might be a missing server-side SunRPC patch.

So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
call in  nfsd()") is from v6.6, so it was applied to v5.15.y
only to get a subsequent NFSD fix to apply.

The immediately previous upstream commit is missing:

  390390240145 ("nfsd: don't allow nfsd threads to be signalled.")

For testing, I've applied this to my nfsd-5.15.y branch here:

  https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

However even if that fixes the reported crash, this suggests
that after v6.6, nfsd threads are not going to respond to
"killall -9 nfsd".


--
Chuck Lever



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-24 14:03       ` Chuck Lever III
@ 2024-04-24 23:37         ` NeilBrown
  2024-04-25 20:51           ` Chris Packham
  2024-04-25 15:50         ` Chuck Lever
  1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2024-04-24 23:37 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Chris Packham, Jeff Layton, Greg Kroah-Hartman,
	Linux NFS Mailing List, netdev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

On Thu, 25 Apr 2024, Chuck Lever III wrote:
> 
> > On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > 
> >> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> >> 
> >> On 24/04/24 13:38, Chris Packham wrote:
> >>> 
> >>> On 24/04/24 12:54, Chris Packham wrote:
> >>>> Hi Jeff, Chuck, Greg,
> >>>> 
> >>>> After updating one of our builds along the 5.15.y LTS branch our 
> >>>> testing caught a new kernel bug. Output below.
> >>>> 
> >>>> I haven't dug into it yet but wondered if it rang any bells.
> >>> 
> >>> A bit more info. This is happening at "reboot" for us. Our embedded 
> >>> devices use a bit of a hacked up reboot process so that they come back 
> >>> faster in the case of a failure.
> >>> 
> >>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
> >>> 
> >>> I can trigger it with `killall -9 nfsd` which I'm not sure is a 
> >>> completely legit thing to do to kernel threads but it's probably close 
> >>> to what our customized reboot does.
> >> 
> >> I've bisected between v5.15.153 and v5.15.155 and identified commit 
> >> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in 
> >> nfsd()") as the first bad commit. Based on the context that seems to 
> >> line up with my reproduction. I'm wondering if perhaps something got 
> >> missed out of the stable track? Unfortunately I'm not able to run a more 
> >> recent kernel with all of the nfs related setup that is being used on  
> >> the system in question.
> > 
> > Thanks for bisecting, that would have been my first suggestion.
> > 
> > The backport included all of the NFSD patches up to v6.2, but
> > there might be a missing server-side SunRPC patch.
> 
> So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
> call in  nfsd()") is from v6.6, so it was applied to v5.15.y
> only to get a subsequent NFSD fix to apply.
> 
> The immediately previous upstream commit is missing:
> 
>   390390240145 ("nfsd: don't allow nfsd threads to be signalled.")
> 
> For testing, I've applied this to my nfsd-5.15.y branch here:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
> 
> However even if that fixes the reported crash, this suggests
> that after v6.6, nfsd threads are not going to respond to
> "killall -9 nfsd".

I think this likely is the problem.  The nfsd threads must be being
killed by a signal.
One only other cause for an nfsd thread to exit is if
svc_set_num_threads() is called, and all places that call that hold a
ref on the serv structure so the final put won't happen when the thread
exits.

Before the patch that bisect found, the nfsd thread would exit with

 svc_get();
 svc_exit_thread();
 nfsd_put();

This also holds a ref across the svc_exit_thread(), and ensures the
final 'put' happens from nfsD_put(), not svc_put() (in
svc_exit_thread()).

Chris: what was the context when the crash happened?  Could the nfsd
threads have been signalled?  That hasn't been the standard way to stop
nfsd threads for a long time, so I'm a little surprised that it is
happening.

NeilBrown

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-24 14:03       ` Chuck Lever III
  2024-04-24 23:37         ` NeilBrown
@ 2024-04-25 15:50         ` Chuck Lever
  1 sibling, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2024-04-25 15:50 UTC (permalink / raw)
  To: Chris Packham, Neil Brown
  Cc: Jeff Layton, Greg Kroah-Hartman, Linux NFS Mailing List, netdev,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On Wed, Apr 24, 2024 at 02:03:22PM +0000, Chuck Lever III wrote:
> 
> > On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > 
> >> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> >> 
> >> On 24/04/24 13:38, Chris Packham wrote:
> >>> 
> >>> On 24/04/24 12:54, Chris Packham wrote:
> >>>> Hi Jeff, Chuck, Greg,
> >>>> 
> >>>> After updating one of our builds along the 5.15.y LTS branch our 
> >>>> testing caught a new kernel bug. Output below.
> >>>> 
> >>>> I haven't dug into it yet but wondered if it rang any bells.
> >>> 
> >>> A bit more info. This is happening at "reboot" for us. Our embedded 
> >>> devices use a bit of a hacked up reboot process so that they come back 
> >>> faster in the case of a failure.
> >>> 
> >>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
> >>> 
> >>> I can trigger it with `killall -9 nfsd` which I'm not sure is a 
> >>> completely legit thing to do to kernel threads but it's probably close 
> >>> to what our customized reboot does.
> >> 
> >> I've bisected between v5.15.153 and v5.15.155 and identified commit 
> >> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in 
> >> nfsd()") as the first bad commit. Based on the context that seems to 
> >> line up with my reproduction. I'm wondering if perhaps something got 
> >> missed out of the stable track? Unfortunately I'm not able to run a more 
> >> recent kernel with all of the nfs related setup that is being used on  
> >> the system in question.
> > 
> > Thanks for bisecting, that would have been my first suggestion.
> > 
> > The backport included all of the NFSD patches up to v6.2, but
> > there might be a missing server-side SunRPC patch.
> 
> So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
> call in  nfsd()") is from v6.6, so it was applied to v5.15.y
> only to get a subsequent NFSD fix to apply.
> 
> The immediately previous upstream commit is missing:
> 
>   390390240145 ("nfsd: don't allow nfsd threads to be signalled.")
> 
> For testing, I've applied this to my nfsd-5.15.y branch here:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
> 
> However even if that fixes the reported crash, this suggests
> that after v6.6, nfsd threads are not going to respond to
> "killall -9 nfsd".

I cannot reproduce a crash when using "killall -9 nfsd" on the fixed
kernel above. On that kernel, the killall command does not terminate
the nfsd threads either.

Either we can apply 390390240145 to 5.15.y, or if the imperviousness
to "kill" is considered a regression, I can look into stripping out
dec6b8bcac73 and the subsequent commits that depend on it.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-24 23:37         ` NeilBrown
@ 2024-04-25 20:51           ` Chris Packham
  2024-04-25 21:05             ` Chuck Lever III
  2024-04-25 23:18             ` NeilBrown
  0 siblings, 2 replies; 11+ messages in thread
From: Chris Packham @ 2024-04-25 20:51 UTC (permalink / raw)
  To: NeilBrown, Chuck Lever III
  Cc: Jeff Layton, Greg Kroah-Hartman, Linux NFS Mailing List, netdev,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org


On 25/04/24 11:37, NeilBrown wrote:
> On Thu, 25 Apr 2024, Chuck Lever III wrote:
>>> On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>
>>>> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
>>>>
>>>> On 24/04/24 13:38, Chris Packham wrote:
>>>>> On 24/04/24 12:54, Chris Packham wrote:
>>>>>> Hi Jeff, Chuck, Greg,
>>>>>>
>>>>>> After updating one of our builds along the 5.15.y LTS branch our
>>>>>> testing caught a new kernel bug. Output below.
>>>>>>
>>>>>> I haven't dug into it yet but wondered if it rang any bells.
>>>>> A bit more info. This is happening at "reboot" for us. Our embedded
>>>>> devices use a bit of a hacked up reboot process so that they come back
>>>>> faster in the case of a failure.
>>>>>
>>>>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
>>>>>
>>>>> I can trigger it with `killall -9 nfsd` which I'm not sure is a
>>>>> completely legit thing to do to kernel threads but it's probably close
>>>>> to what our customized reboot does.
>>>> I've bisected between v5.15.153 and v5.15.155 and identified commit
>>>> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in
>>>> nfsd()") as the first bad commit. Based on the context that seems to
>>>> line up with my reproduction. I'm wondering if perhaps something got
>>>> missed out of the stable track? Unfortunately I'm not able to run a more
>>>> recent kernel with all of the nfs related setup that is being used on
>>>> the system in question.
>>> Thanks for bisecting, that would have been my first suggestion.
>>>
>>> The backport included all of the NFSD patches up to v6.2, but
>>> there might be a missing server-side SunRPC patch.
>> So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
>> call in  nfsd()") is from v6.6, so it was applied to v5.15.y
>> only to get a subsequent NFSD fix to apply.
>>
>> The immediately previous upstream commit is missing:
>>
>>    390390240145 ("nfsd: don't allow nfsd threads to be signalled.")
>>
>> For testing, I've applied this to my nfsd-5.15.y branch here:
>>
>>    https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>>
>> However even if that fixes the reported crash, this suggests
>> that after v6.6, nfsd threads are not going to respond to
>> "killall -9 nfsd".
> I think this likely is the problem.  The nfsd threads must be being
> killed by a signal.
> One only other cause for an nfsd thread to exit is if
> svc_set_num_threads() is called, and all places that call that hold a
> ref on the serv structure so the final put won't happen when the thread
> exits.
>
> Before the patch that bisect found, the nfsd thread would exit with
>
>   svc_get();
>   svc_exit_thread();
>   nfsd_put();
>
> This also holds a ref across the svc_exit_thread(), and ensures the
> final 'put' happens from nfsD_put(), not svc_put() (in
> svc_exit_thread()).
>
> Chris: what was the context when the crash happened?  Could the nfsd
> threads have been signalled?  That hasn't been the standard way to stop
> nfsd threads for a long time, so I'm a little surprised that it is
> happening.

We use a hacked up version of shutdown from util-linux and which does a 
`kill (-1, SIGTERM);` then `kill (-1, SIGKILL);` (I don't think that 
particular behaviour is the hackery). I'm not sure if -1 will pick up 
kernel threads but based on the symptoms it appears to be doing so (or 
maybe something else is in it's SIGTERM handler). I don't think we were 
ever really intending to send the signals to nfsd so whether it actually 
terminates or not I don't think is an issue for us. I can confirm that 
applying 390390240145 resolves the symptom we were seeing.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-25 20:51           ` Chris Packham
@ 2024-04-25 21:05             ` Chuck Lever III
  2024-04-25 21:07               ` Chris Packham
  2024-04-25 23:18             ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Lever III @ 2024-04-25 21:05 UTC (permalink / raw)
  To: Chris Packham
  Cc: Neil Brown, Jeff Layton, Greg Kroah-Hartman,
	Linux NFS Mailing List, netdev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org



> On Apr 25, 2024, at 4:51 PM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> 
> 
> On 25/04/24 11:37, NeilBrown wrote:
>> On Thu, 25 Apr 2024, Chuck Lever III wrote:
>>>> On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>> 
>>>>> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
>>>>> 
>>>>> On 24/04/24 13:38, Chris Packham wrote:
>>>>>> On 24/04/24 12:54, Chris Packham wrote:
>>>>>>> Hi Jeff, Chuck, Greg,
>>>>>>> 
>>>>>>> After updating one of our builds along the 5.15.y LTS branch our
>>>>>>> testing caught a new kernel bug. Output below.
>>>>>>> 
>>>>>>> I haven't dug into it yet but wondered if it rang any bells.
>>>>>> A bit more info. This is happening at "reboot" for us. Our embedded
>>>>>> devices use a bit of a hacked up reboot process so that they come back
>>>>>> faster in the case of a failure.
>>>>>> 
>>>>>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
>>>>>> 
>>>>>> I can trigger it with `killall -9 nfsd` which I'm not sure is a
>>>>>> completely legit thing to do to kernel threads but it's probably close
>>>>>> to what our customized reboot does.
>>>>> I've bisected between v5.15.153 and v5.15.155 and identified commit
>>>>> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in
>>>>> nfsd()") as the first bad commit. Based on the context that seems to
>>>>> line up with my reproduction. I'm wondering if perhaps something got
>>>>> missed out of the stable track? Unfortunately I'm not able to run a more
>>>>> recent kernel with all of the nfs related setup that is being used on
>>>>> the system in question.
>>>> Thanks for bisecting, that would have been my first suggestion.
>>>> 
>>>> The backport included all of the NFSD patches up to v6.2, but
>>>> there might be a missing server-side SunRPC patch.
>>> So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
>>> call in  nfsd()") is from v6.6, so it was applied to v5.15.y
>>> only to get a subsequent NFSD fix to apply.
>>> 
>>> The immediately previous upstream commit is missing:
>>> 
>>>   390390240145 ("nfsd: don't allow nfsd threads to be signalled.")
>>> 
>>> For testing, I've applied this to my nfsd-5.15.y branch here:
>>> 
>>>   https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>>> 
>>> However even if that fixes the reported crash, this suggests
>>> that after v6.6, nfsd threads are not going to respond to
>>> "killall -9 nfsd".
>> I think this likely is the problem.  The nfsd threads must be being
>> killed by a signal.
>> One only other cause for an nfsd thread to exit is if
>> svc_set_num_threads() is called, and all places that call that hold a
>> ref on the serv structure so the final put won't happen when the thread
>> exits.
>> 
>> Before the patch that bisect found, the nfsd thread would exit with
>> 
>>  svc_get();
>>  svc_exit_thread();
>>  nfsd_put();
>> 
>> This also holds a ref across the svc_exit_thread(), and ensures the
>> final 'put' happens from nfsD_put(), not svc_put() (in
>> svc_exit_thread()).
>> 
>> Chris: what was the context when the crash happened?  Could the nfsd
>> threads have been signalled?  That hasn't been the standard way to stop
>> nfsd threads for a long time, so I'm a little surprised that it is
>> happening.
> 
> We use a hacked up version of shutdown from util-linux and which does a 
> `kill (-1, SIGTERM);` then `kill (-1, SIGKILL);` (I don't think that 
> particular behaviour is the hackery). I'm not sure if -1 will pick up 
> kernel threads but based on the symptoms it appears to be doing so (or 
> maybe something else is in it's SIGTERM handler). I don't think we were 
> ever really intending to send the signals to nfsd so whether it actually 
> terminates or not I don't think is an issue for us. I can confirm that 
> applying 390390240145 resolves the symptom we were seeing.

I'm 2/3 of the way through testing 5.15.156 with 390390240145
applied, so it would be just another day before I can send a
patch to stable@.

May I add Tested-by: Chris Packham <Chris.Packham@alliedtelesis.co.nz> ?

--
Chuck Lever



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-25 21:05             ` Chuck Lever III
@ 2024-04-25 21:07               ` Chris Packham
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Packham @ 2024-04-25 21:07 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Neil Brown, Jeff Layton, Greg Kroah-Hartman,
	Linux NFS Mailing List, netdev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org


On 26/04/24 09:05, Chuck Lever III wrote:
>
>> On Apr 25, 2024, at 4:51 PM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
>>
>>
>> On 25/04/24 11:37, NeilBrown wrote:
>>> On Thu, 25 Apr 2024, Chuck Lever III wrote:
>>>>> On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>>
>>>>>> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
>>>>>>
>>>>>> On 24/04/24 13:38, Chris Packham wrote:
>>>>>>> On 24/04/24 12:54, Chris Packham wrote:
>>>>>>>> Hi Jeff, Chuck, Greg,
>>>>>>>>
>>>>>>>> After updating one of our builds along the 5.15.y LTS branch our
>>>>>>>> testing caught a new kernel bug. Output below.
>>>>>>>>
>>>>>>>> I haven't dug into it yet but wondered if it rang any bells.
>>>>>>> A bit more info. This is happening at "reboot" for us. Our embedded
>>>>>>> devices use a bit of a hacked up reboot process so that they come back
>>>>>>> faster in the case of a failure.
>>>>>>>
>>>>>>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
>>>>>>>
>>>>>>> I can trigger it with `killall -9 nfsd` which I'm not sure is a
>>>>>>> completely legit thing to do to kernel threads but it's probably close
>>>>>>> to what our customized reboot does.
>>>>>> I've bisected between v5.15.153 and v5.15.155 and identified commit
>>>>>> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in
>>>>>> nfsd()") as the first bad commit. Based on the context that seems to
>>>>>> line up with my reproduction. I'm wondering if perhaps something got
>>>>>> missed out of the stable track? Unfortunately I'm not able to run a more
>>>>>> recent kernel with all of the nfs related setup that is being used on
>>>>>> the system in question.
>>>>> Thanks for bisecting, that would have been my first suggestion.
>>>>>
>>>>> The backport included all of the NFSD patches up to v6.2, but
>>>>> there might be a missing server-side SunRPC patch.
>>>> So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
>>>> call in  nfsd()") is from v6.6, so it was applied to v5.15.y
>>>> only to get a subsequent NFSD fix to apply.
>>>>
>>>> The immediately previous upstream commit is missing:
>>>>
>>>>    390390240145 ("nfsd: don't allow nfsd threads to be signalled.")
>>>>
>>>> For testing, I've applied this to my nfsd-5.15.y branch here:
>>>>
>>>>    https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>>>>
>>>> However even if that fixes the reported crash, this suggests
>>>> that after v6.6, nfsd threads are not going to respond to
>>>> "killall -9 nfsd".
>>> I think this likely is the problem.  The nfsd threads must be being
>>> killed by a signal.
>>> One only other cause for an nfsd thread to exit is if
>>> svc_set_num_threads() is called, and all places that call that hold a
>>> ref on the serv structure so the final put won't happen when the thread
>>> exits.
>>>
>>> Before the patch that bisect found, the nfsd thread would exit with
>>>
>>>   svc_get();
>>>   svc_exit_thread();
>>>   nfsd_put();
>>>
>>> This also holds a ref across the svc_exit_thread(), and ensures the
>>> final 'put' happens from nfsD_put(), not svc_put() (in
>>> svc_exit_thread()).
>>>
>>> Chris: what was the context when the crash happened?  Could the nfsd
>>> threads have been signalled?  That hasn't been the standard way to stop
>>> nfsd threads for a long time, so I'm a little surprised that it is
>>> happening.
>> We use a hacked up version of shutdown from util-linux and which does a
>> `kill (-1, SIGTERM);` then `kill (-1, SIGKILL);` (I don't think that
>> particular behaviour is the hackery). I'm not sure if -1 will pick up
>> kernel threads but based on the symptoms it appears to be doing so (or
>> maybe something else is in it's SIGTERM handler). I don't think we were
>> ever really intending to send the signals to nfsd so whether it actually
>> terminates or not I don't think is an issue for us. I can confirm that
>> applying 390390240145 resolves the symptom we were seeing.
> I'm 2/3 of the way through testing 5.15.156 with 390390240145
> applied, so it would be just another day before I can send a
> patch to stable@.
>
> May I add Tested-by: Chris Packham <Chris.Packham@alliedtelesis.co.nz> ?

Sure go ahead.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
  2024-04-25 20:51           ` Chris Packham
  2024-04-25 21:05             ` Chuck Lever III
@ 2024-04-25 23:18             ` NeilBrown
  1 sibling, 0 replies; 11+ messages in thread
From: NeilBrown @ 2024-04-25 23:18 UTC (permalink / raw)
  To: Chris Packham
  Cc: Chuck Lever III, Jeff Layton, Greg Kroah-Hartman,
	Linux NFS Mailing List, netdev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

On Fri, 26 Apr 2024, Chris Packham wrote:
> 
> On 25/04/24 11:37, NeilBrown wrote:
> > On Thu, 25 Apr 2024, Chuck Lever III wrote:
> >>> On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>>
> >>>> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> >>>>
> >>>> On 24/04/24 13:38, Chris Packham wrote:
> >>>>> On 24/04/24 12:54, Chris Packham wrote:
> >>>>>> Hi Jeff, Chuck, Greg,
> >>>>>>
> >>>>>> After updating one of our builds along the 5.15.y LTS branch our
> >>>>>> testing caught a new kernel bug. Output below.
> >>>>>>
> >>>>>> I haven't dug into it yet but wondered if it rang any bells.
> >>>>> A bit more info. This is happening at "reboot" for us. Our embedded
> >>>>> devices use a bit of a hacked up reboot process so that they come back
> >>>>> faster in the case of a failure.
> >>>>>
> >>>>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
> >>>>>
> >>>>> I can trigger it with `killall -9 nfsd` which I'm not sure is a
> >>>>> completely legit thing to do to kernel threads but it's probably close
> >>>>> to what our customized reboot does.
> >>>> I've bisected between v5.15.153 and v5.15.155 and identified commit
> >>>> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in
> >>>> nfsd()") as the first bad commit. Based on the context that seems to
> >>>> line up with my reproduction. I'm wondering if perhaps something got
> >>>> missed out of the stable track? Unfortunately I'm not able to run a more
> >>>> recent kernel with all of the nfs related setup that is being used on
> >>>> the system in question.
> >>> Thanks for bisecting, that would have been my first suggestion.
> >>>
> >>> The backport included all of the NFSD patches up to v6.2, but
> >>> there might be a missing server-side SunRPC patch.
> >> So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
> >> call in  nfsd()") is from v6.6, so it was applied to v5.15.y
> >> only to get a subsequent NFSD fix to apply.
> >>
> >> The immediately previous upstream commit is missing:
> >>
> >>    390390240145 ("nfsd: don't allow nfsd threads to be signalled.")
> >>
> >> For testing, I've applied this to my nfsd-5.15.y branch here:
> >>
> >>    https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
> >>
> >> However even if that fixes the reported crash, this suggests
> >> that after v6.6, nfsd threads are not going to respond to
> >> "killall -9 nfsd".
> > I think this likely is the problem.  The nfsd threads must be being
> > killed by a signal.
> > One only other cause for an nfsd thread to exit is if
> > svc_set_num_threads() is called, and all places that call that hold a
> > ref on the serv structure so the final put won't happen when the thread
> > exits.
> >
> > Before the patch that bisect found, the nfsd thread would exit with
> >
> >   svc_get();
> >   svc_exit_thread();
> >   nfsd_put();
> >
> > This also holds a ref across the svc_exit_thread(), and ensures the
> > final 'put' happens from nfsD_put(), not svc_put() (in
> > svc_exit_thread()).
> >
> > Chris: what was the context when the crash happened?  Could the nfsd
> > threads have been signalled?  That hasn't been the standard way to stop
> > nfsd threads for a long time, so I'm a little surprised that it is
> > happening.
> 
> We use a hacked up version of shutdown from util-linux and which does a 
> `kill (-1, SIGTERM);` then `kill (-1, SIGKILL);` (I don't think that 
> particular behaviour is the hackery). I'm not sure if -1 will pick up 
> kernel threads but based on the symptoms it appears to be doing so (or 
> maybe something else is in it's SIGTERM handler). I don't think we were 
> ever really intending to send the signals to nfsd so whether it actually 
> terminates or not I don't think is an issue for us. I can confirm that 
> applying 390390240145 resolves the symptom we were seeing.
> 
> 

Makes sense - thanks.
"kill -1 ..." does send the signal to *every* process including kernel
threads.  I'm glad you weren't depending on that to kill nfsd.
Hopefully no one else is.  I think the best way forward is to apply that
patch to 5.15-stable as Chuck plans to do.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-04-25 23:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-24  0:54 kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155 Chris Packham
2024-04-24  1:38 ` Chris Packham
2024-04-24  7:42   ` Chris Packham
2024-04-24 13:33     ` Chuck Lever III
2024-04-24 14:03       ` Chuck Lever III
2024-04-24 23:37         ` NeilBrown
2024-04-25 20:51           ` Chris Packham
2024-04-25 21:05             ` Chuck Lever III
2024-04-25 21:07               ` Chris Packham
2024-04-25 23:18             ` NeilBrown
2024-04-25 15:50         ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).