From: Dave Chinner <david@fromorbit.com>
To: John Groves <John@groves.net>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
	John Groves <jgroves@micron.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Matthew Wilcox <willy@infradead.org>,
	linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	nvdimm@lists.linux.dev, john@jagalactic.com,
	Christoph Hellwig <hch@infradead.org>,
	dave.hansen@linux.intel.com, gregory.price@memverge.com
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
Date: Mon, 11 Mar 2024 12:29:25 +1100
Message-ID: <Ze5edU3JbLEFwJOH@dread.disaster.area>
In-Reply-To: <5segby7xk6wbyblovpapdymiuvg63e5qarahc4pramhsqikx2x@y3zmih6mgs33>

On Thu, Feb 29, 2024 at 08:52:48AM -0600, John Groves wrote:
> On 24/02/29 01:15PM, Dave Chinner wrote:
> > On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > >    bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> > >    iops        : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> > >   lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> > >   lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> > >   lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
> > 
> > Most of the IO latencies are up round the 4-20ms marks. That seems
> > kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> > should only take a couple of hundred microseconds. For Famfs, the
> > latencies appear to be around 1-4ms.
> > 
> > So where's all that extra time coming from?
> 
> Below, you will see two runs with performance and latency distribution
> about the same as famfs (the answer for that was --fallocate=native).

Ah, that is exactly what I suspected, and it's why I was asking for
profiles - this sort of thing shows up in them clearly.

> > >   lat (msec)   : 100=0.08%
> > >   cpu          : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
> > 
> > And why is system time reporting at almost zero instead of almost
> > all the remaining cpu time (i.e. up at 80-90%)?
> 
> Something weird is going on with the cpu reporting. Sometimes sys=~0, but other times
> it's about what you would expect. I suspect some sort of measurement error,
> like maybe the method doesn't work with my cpu model? (I'm grasping, but with
> a somewhat rational basis...)
> 
> I pasted two xfs runs below. The first has the wonky cpu sys value, and
> the second looks about like what one would expect.
> 
> > 
> > Can you run call-graph kernel profiles for XFS and famfs whilst
> > running this workload so we have some insight into what is behaving
> > differently here?
> 
> Can you point me to an example of how to do that?

perf record --call-graph ...
perf report --call-graph ...
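
Something like this, for example (illustrative options - "dwarf"
unwinding usually gives the most complete kernel+user stacks, at the
cost of a much larger perf.data file):

	# record system-wide with call graphs while the fio job runs
	perf record --call-graph dwarf -a -- sleep 60

	# then summarise the recorded call chains
	perf report -g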


> I'd been thinking about the ~2x gap for a few days, and the most obvious
> difference is famfs files must be preallocated (like fallocate, but works
> a bit differently since allocation happens in user space). I just checked 
> one of the xfs files, and it had maybe 80 extents (whereas the famfs 
> files always have 1 extent here).

Which is about 4MB per extent. Extent size is not the problem for
zero-seek-latency storage hardware, though.

Essentially what you are seeing is interleaving extent allocation
between all the files because they are located in the same
directory. The locality algorithm is trying to place the data
extents close to the owner inode, but the inodes are also all close
together because they are located in the same AG as the parent
directory inode. Allocation concurrency is created by placing new
directories in different allocation groups, so we end up with
workloads in different directories being largely isolated from each
other.

However, that means when you are trying to write to many files in
the same directory at the same time, they are largely all competing
for the same AG lock to do block allocation during IO submission.
That creates interleaving of write()-sized extents between different
files. We use speculative preallocation for buffered IO to avoid
this, and for direct IO the application needs to use extent size
hints or preallocation to avoid this contention-based interleaving.
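
A directory-level extent size hint is one way to do that - new files
inherit the hint and allocate space in fixed sized chunks regardless
of the IO size. A sketch (the path is made up; xfs_io ships with
xfsprogs):

	# files created under this directory will allocate space
	# in aligned 16MB chunks instead of write()-sized chunks
	xfs_io -c "extsize 16m" /mnt/xfs/testdir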

IOWs, by using fallocate() to preallocate all the space there will
be no allocation during IO submission and so the serialisation that
occurs due to competing allocations just goes away...
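
That is what fio's --fallocate=native option is doing for you - fio
preallocates each file with fallocate() before the IO loop starts.
A sketch (the job parameters here are illustrative, not the ones
from your runs):

	fio --name=prealloc --directory=/mnt/xfs --rw=write \
	    --bs=2m --size=1g --numjobs=16 --fallocate=native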

> FWIW I ran xfs with and without io_uring, and there was no apparent
> difference (which makes sense to me because it's not block I/O).
> 
> The prior ~2x gap still seems like a lot of overhead for extent list 
> mapping to memory, but adding --fallocate=native to the xfs test brought 
> it into line with famfs:

As I suspected. :)

As for CPU usage accounting, the number of context switches says it
all.

"Bad":

>   cpu          : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801

"good":

>   cpu          : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708

I'd say that in the "bad" case most of the kernel work is being
shuffled off to kernel threads and so it doesn't get accounted to
the submission task.  In comparison, in the "good" case
the work is being done in the submission thread and hence there's a
lot fewer context switches and the system time is correctly
accounted to the submission task.

Perhaps an io_uring task accounting problem?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

