From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
To: NeilBrown <neilb@suse.de>, Chuck Lever III <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	netdev <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: kernel BUG at net/sunrpc/svc.c:570 after updating from v5.15.153 to v5.15.155
Date: Thu, 25 Apr 2024 20:51:09 +0000
Message-ID: <141fbaa0-f8fa-4bfe-8c2d-7749fcf78ab3@alliedtelesis.co.nz>
In-Reply-To: <171400185158.7600.16163546434537681088@noble.neil.brown.name>


On 25/04/24 11:37, NeilBrown wrote:
> On Thu, 25 Apr 2024, Chuck Lever III wrote:
>>> On Apr 24, 2024, at 9:33 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>
>>>> On Apr 24, 2024, at 3:42 AM, Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
>>>>
>>>> On 24/04/24 13:38, Chris Packham wrote:
>>>>> On 24/04/24 12:54, Chris Packham wrote:
>>>>>> Hi Jeff, Chuck, Greg,
>>>>>>
>>>>>> After updating one of our builds along the 5.15.y LTS branch our
>>>>>> testing caught a new kernel bug. Output below.
>>>>>>
>>>>>> I haven't dug into it yet but wondered if it rang any bells.
>>>>> A bit more info. This is happening at "reboot" for us. Our embedded
>>>>> devices use a bit of a hacked up reboot process so that they come back
>>>>> faster in the case of a failure.
>>>>>
>>>>> It doesn't happen with a proper `systemctl reboot` or with a SYSRQ+B
>>>>>
>>>>> I can trigger it with `killall -9 nfsd` which I'm not sure is a
>>>>> completely legit thing to do to kernel threads but it's probably close
>>>>> to what our customized reboot does.
>>>> I've bisected between v5.15.153 and v5.15.155 and identified commit
>>>> dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread() call in
>>>> nfsd()") as the first bad commit. Based on the context that seems to
>>>> line up with my reproduction. I'm wondering if perhaps something got
>>>> missed out of the stable track? Unfortunately I'm not able to run a more
>>>> recent kernel with all of the nfs related setup that is being used on
>>>> the system in question.
>>> Thanks for bisecting, that would have been my first suggestion.
>>>
>>> The backport included all of the NFSD patches up to v6.2, but
>>> there might be a missing server-side SunRPC patch.
>> So dec6b8bcac73 ("nfsd: Simplify code around svc_exit_thread()
>> call in nfsd()") is from v6.6, so it was applied to v5.15.y
>> only to get a subsequent NFSD fix to apply.
>>
>> The immediately previous upstream commit is missing:
>>
>>    390390240145 ("nfsd: don't allow nfsd threads to be signalled.")
>>
>> For testing, I've applied this to my nfsd-5.15.y branch here:
>>
>>    https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>>
>> However even if that fixes the reported crash, this suggests
>> that after v6.6, nfsd threads are not going to respond to
>> "killall -9 nfsd".
> I think this likely is the problem.  The nfsd threads must be being
> killed by a signal.
> The only other cause for an nfsd thread to exit is if
> svc_set_num_threads() is called, and all places that call that hold a
> ref on the serv structure so the final put won't happen when the thread
> exits.
>
> Before the patch that bisect found, the nfsd thread would exit with
>
>   svc_get();
>   svc_exit_thread();
>   nfsd_put();
>
> This also holds a ref across the svc_exit_thread(), and ensures the
> final 'put' happens from nfsd_put(), not svc_put() (in
> svc_exit_thread()).
>
> Chris: what was the context when the crash happened?  Could the nfsd
> threads have been signalled?  That hasn't been the standard way to stop
> nfsd threads for a long time, so I'm a little surprised that it is
> happening.

We use a hacked-up version of shutdown from util-linux, which does a
`kill (-1, SIGTERM);` followed by `kill (-1, SIGKILL);` (I don't think that
particular behaviour is part of our hackery). I'm not sure whether -1 picks
up kernel threads, but based on the symptoms it appears to (or maybe
something else is doing so from its SIGTERM handler). We never really
intended to send the signals to nfsd, so whether it actually terminates or
not isn't an issue for us. I can confirm that applying 390390240145
resolves the symptom we were seeing.
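
For anyone following along, here is a toy userspace model of the
refcount ordering Neil describes above. It is only a sketch under loud
assumptions: `struct toy_serv` and every `toy_*` function are
hypothetical stand-ins, not the kernel's svc_serv API, and the
"destroyed before teardown" flag merely stands in for whatever
condition the BUG in svc.c checks. It shows why holding an extra
reference across the thread exit moves the final put into the
nfsd-level path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the service structure: a refcount plus flags. */
struct toy_serv {
	int refs;
	bool cleaned_up;	/* nfsd-level teardown has run */
	bool bug;		/* destroyed without that teardown */
};

static void toy_get(struct toy_serv *s)
{
	s->refs++;
}

/* Generic put: destroys the service when the last reference goes away. */
static void toy_svc_put(struct toy_serv *s)
{
	assert(s->refs > 0);
	if (--s->refs == 0)
		s->bug = !s->cleaned_up;	/* model of the failing check */
}

/* nfsd-level put: runs the teardown first if this is the final reference. */
static void toy_nfsd_put(struct toy_serv *s)
{
	if (s->refs == 1)
		s->cleaned_up = true;
	toy_svc_put(s);
}

/* A thread exiting drops its own reference through the generic put. */
static void toy_exit_thread(struct toy_serv *s)
{
	toy_svc_put(s);
}

/*
 * Model the last thread exiting while nobody else holds a reference
 * (as when the thread is killed by a signal). Returns true if the
 * service was destroyed without the nfsd-level teardown.
 */
static bool last_thread_exit_hits_bug(bool hold_extra_ref)
{
	struct toy_serv s = { .refs = 1, .cleaned_up = false, .bug = false };

	if (hold_extra_ref) {
		/* Old ordering: get, exit, then nfsd-level put. */
		toy_get(&s);
		toy_exit_thread(&s);
		toy_nfsd_put(&s);
	} else {
		/* New ordering: the exit itself may drop the final ref. */
		toy_exit_thread(&s);
	}
	return s.bug;
}
```

Calling `last_thread_exit_hits_bug(false)` from a small `main()` models
the signalled-thread case after the simplification, and
`last_thread_exit_hits_bug(true)` models the older get/exit/put
sequence. In the real kernel the new ordering is safe as long as
threads cannot be signalled, which is what 390390240145 ensures.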

