All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Ian Kent <raven@themaw.net>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>, Al Viro <viro@zeniv.linux.org.uk>,
	Eric Sandeen <sandeen@sandeen.net>,
	Fox Chen <foxhlchen@gmail.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Rick Lindsley <ricklind@linux.vnet.ibm.com>,
	David Howells <dhowells@redhat.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v4 0/5] kernfs: proposed locking and concurrency improvement
Date: Fri, 14 May 2021 09:02:08 +0800	[thread overview]
Message-ID: <99eb90d96007017ae6cca5512b8c492bef44a5b9.camel@themaw.net> (raw)
In-Reply-To: <YJ1DeXnKTxuy6uc9@kroah.com>

On Thu, 2021-05-13 at 17:19 +0200, Greg Kroah-Hartman wrote:
> On Thu, May 13, 2021 at 09:50:19PM +0800, Ian Kent wrote:
> > On Wed, 2021-05-12 at 08:21 +0200, Greg Kroah-Hartman wrote:
> > > On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote:
> > > > There have been a few instances of contention on the
> > > > kernfs_mutex
> > > > during
> > > > path walks, a case on very large IBM systems seen by myself, a
> > > > report by
> > > > Brice Goglin and followed up by Fox Chen, and I've since seen a
> > > > couple
> > > > of other reports by CoreOS users.
> > > > 
> > > > The common thread is a large number of kernfs path walks
> > > > leading to
> > > > slowness of path walks due to kernfs_mutex contention.
> > > > 
> > > > The problem being that changes to the VFS over some time have
> > > > increased
> > > > it's concurrency capabilities to an extent that kernfs's use of
> > > > a
> > > > mutex
> > > > is no longer appropriate. There's also an issue of walks for
> > > > non-
> > > > existent
> > > > paths causing contention if there are quite a few of them which
> > > > is
> > > > a less
> > > > common problem.
> > > > 
> > > > This patch series is relatively straight forward.
> > > > 
> > > > All it does is add the ability to take advantage of VFS
> > > > negative
> > > > dentry
> > > > caching to avoid needless dentry alloc/free cycles for lookups
> > > > of
> > > > paths
> > > > that don't exit and change the kernfs_mutex to a read/write
> > > > semaphore.
> > > > 
> > > > The patch that tried to stay in VFS rcu-walk mode during path
> > > > walks
> > > > has
> > > > been dropped for two reasons. First, it doesn't actually give
> > > > very
> > > > much
> > > > improvement and, second, if there's a place where mistakes
> > > > could go
> > > > unnoticed it would be in that path. This makes the patch series
> > > > simpler
> > > > to review and reduces the likelihood of problems going
> > > > unnoticed
> > > > and
> > > > popping up later.
> > > > 
> > > > The patch to use a revision to identify if a directory has
> > > > changed
> > > > has
> > > > also been dropped. If the directory has changed the dentry
> > > > revision
> > > > needs to be updated to avoid subsequent rb tree searches and
> > > > after
> > > > changing to use a read/write semaphore the update also requires
> > > > a
> > > > lock.
> > > > But the d_lock is the only lock available at this point which
> > > > might
> > > > itself be contended.
> > > > 
> > > > Changes since v3:
> > > > - remove unneeded indirection when referencing the super block.
> > > > - check if inode attribute update is actually needed.
> > > > 
> > > > Changes since v2:
> > > > - actually fix the inode attribute update locking.
> > > > - drop the patch that tried to stay in rcu-walk mode.
> > > > - drop the use a revision to identify if a directory has
> > > > changed
> > > > patch.
> > > > 
> > > > Changes since v1:
> > > > - fix locking in .permission() and .getattr() by re-factoring
> > > > the
> > > > attribute
> > > >   handling code.
> > > > ---
> > > > 
> > > > Ian Kent (5):
> > > >       kernfs: move revalidate to be near lookup
> > > >       kernfs: use VFS negative dentry caching
> > > >       kernfs: switch kernfs to use an rwsem
> > > >       kernfs: use i_lock to protect concurrent inode updates
> > > >       kernfs: add kernfs_need_inode_refresh()
> > > > 
> > > > 
> > > >  fs/kernfs/dir.c             | 170 ++++++++++++++++++++--------
> > > > ----
> > > > ----
> > > >  fs/kernfs/file.c            |   4 +-
> > > >  fs/kernfs/inode.c           |  45 ++++++++--
> > > >  fs/kernfs/kernfs-internal.h |   5 +-
> > > >  fs/kernfs/mount.c           |  12 +--
> > > >  fs/kernfs/symlink.c         |   4 +-
> > > >  include/linux/kernfs.h      |   2 +-
> > > >  7 files changed, 147 insertions(+), 95 deletions(-)
> > > > 
> > > > --
> > > > Ian
> > > > 
> > > 
> > > Any benchmark numbers that you ran that are better/worse with
> > > this
> > > patch
> > > series?  That woul dbe good to know, otherwise you aren't
> > > changing
> > > functionality here, so why would we take these changes?  :)
> > 
> > Hi Greg,
> > 
> > I'm sorry, I don't have a benchmark.
> > 
> > My continued work on this has been driven by the report from
> > Brice Goglin and Fox Chen, and also because I've seen a couple
> > of other reports of kernfs_mutex contention that is resolved
> > by the series.
> > 
> > Unfortunately the two reports I've seen fairly recently are on
> > kernels that are about as far away from the upstream kernel
> > as you can get so probably aren't useful in making my case.
> > 
> > The report I've worked on most recently is on CoreOS/Kunbernetes
> > (based on RHEL-8.3) where the machine load goes to around 870
> > after loading what they call an OpenShift performance profile.
> > 
> > I looked at some sysreq dumps and they have a seemingly endless
> > number of processes waiting on the kernfs_mutex.
> > 
> > I tried to look at the Kubernetes source but it's written in
> > go so there would need to be a lot of time spent to work out
> > what's going on, I'm trying to find someone to help with that.
> > 
> > All I can say from looking at the Kubernetes source is it has
> > quite a few sysfs paths in it so I assume it uses sysfs fairly
> > heavily.
> > 
> > The other problem I saw was also on CoreOS/Kunernetes.
> > A vmcore analysis showed kernfs_mutex contention but with a
> > different set of processes and not as significant as the former
> > problem.
> > 
> > So, even though this isn't against the current upstream, there
> > isn't much difference in the kernfs/sysfs source between those
> > two kernels and given the results of Brice and Fox I fear I'll
> > be seeing more of this as time goes by.
> > 
> > I'm fairly confident that the user space applications aren't
> > optimal (although you may have stronger words for it than that)
> > I was hoping you would agree that it's sensible for the kernel
> > to protect itself to the extent that it can provided the change
> > is straight forward enough.
> > 
> > I have tried to make the patches as simple as possible without
> > loosing much of the improvement to minimize any potential ongoing
> > maintenance burden.
> > 
> > So, I'm sorry I can't offer you more incentive to consider the
> > series, but I remain hopeful you will.
> 
> At the very least, if you could test the series on those "older"
> systems
> and say "booting went from X seconds to Y seconds!".

The last test I did was done on the system showing high load and
it went from around 870 to around 3. It completely resolved the
reported problem.

I need to have the current patches re-tested and that can take a
while and I need to look at Fox's results results, I'm thinking
the additional patch in v4 is probably not needed.

Ian



      reply	other threads:[~2021-05-14  1:02 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-12  0:38 [PATCH v4 0/5] kernfs: proposed locking and concurrency improvement Ian Kent
2021-05-12  0:38 ` [PATCH v4 1/5] kernfs: move revalidate to be near lookup Ian Kent
2021-05-12  0:38 ` [PATCH v4 2/5] kernfs: use VFS negative dentry caching Ian Kent
2021-05-12  0:39 ` [PATCH v4 3/5] kernfs: switch kernfs to use an rwsem Ian Kent
2021-05-12  0:39 ` [PATCH v4 4/5] kernfs: use i_lock to protect concurrent inode updates Ian Kent
2021-05-12  0:39 ` [PATCH v4 5/5] kernfs: add kernfs_need_inode_refresh() Ian Kent
2021-05-12  6:21 ` [PATCH v4 0/5] kernfs: proposed locking and concurrency improvement Greg Kroah-Hartman
2021-05-12  7:16   ` Fox Chen
2021-05-12  8:47     ` Fox Chen
2021-05-12  8:54       ` Fox Chen
2021-05-13 14:10         ` Ian Kent
2021-05-13 15:37           ` Fox Chen
2021-05-14  1:34             ` Ian Kent
2021-05-14  2:34               ` Fox Chen
2021-05-17  1:32                 ` Ian Kent
2021-05-18  8:26                   ` Fox Chen
2021-05-27  1:23                   ` Ian Kent
2021-05-27  6:50                     ` Greg Kroah-Hartman
2021-05-28  5:45                       ` Ian Kent
2021-05-13 13:50   ` Ian Kent
2021-05-13 15:19     ` Greg Kroah-Hartman
2021-05-14  1:02       ` Ian Kent [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=99eb90d96007017ae6cca5512b8c492bef44a5b9.camel@themaw.net \
    --to=raven@themaw.net \
    --cc=brice.goglin@gmail.com \
    --cc=dhowells@redhat.com \
    --cc=foxhlchen@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=mtosatti@redhat.com \
    --cc=ricklind@linux.vnet.ibm.com \
    --cc=sandeen@sandeen.net \
    --cc=tj@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.