From: Luis Chamberlain <mcgrof@kernel.org>
To: Chris Mason <clm@meta.com>, Dave Chinner <david@fromorbit.com>,
	David Bueso <dave@stgolabs.net>,
	Kent Overstreet <kent.overstreet@linux.dev>,
	"Paul E. McKenney" <paulmck@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm <linux-mm@kvack.org>,
	Daniel Gomez <da.gomez@samsung.com>,
	Pankaj Raghav <p.raghav@samsung.com>,
	Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@lst.de>,
	Chris Mason <clm@fb.com>, Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Fri, 10 May 2024 16:57:07 -0700
Message-ID: <Zj60U5SdWepnmLzD@bombadil.infradead.org>
In-Reply-To: <bb2e87d7-a706-4dc8-9c09-9257b69ebd5c@meta.com>

On Sat, Feb 24, 2024 at 05:57:43PM -0500, Chris Mason wrote:
> Going back to Luis's original email, I'd echo Willy's suggestion for
> profiles.  Unless we're saturating memory bandwidth, buffered should be
> able to get much closer to O_DIRECT, just at a much higher overall cost.

I finally had some time to look beyond just "what locks" could be the
main culprit; David Bueso helped me review this, thanks!

Based on all the discussions in this insanely long thread, I do believe
the issue was the single-threaded write-behind cache flushing that
Chinner noted.
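
For anyone wanting to eyeball when the flusher and dirty throttling
kick in during a run, a trivial sketch using nothing but standard
procfs counters:

# Watch dirty page accounting while the job runs; the jump in
# Dirty/Writeback makes the throttling transition obvious.
watch -d 'grep -E "^(Dirty|Writeback):" /proc/meminfo'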

Lifting /proc/sys/vm/dirty_ratio from 20 to 90 keeps the profile nice
and perky, with most of the top penalties seen in userspace, as shown
in exhibit a) below. But as soon as we start throttling, we hit the
profile shown in exhibit b) below.
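
For reference, that tuning is just the usual procfs knob poke
(vm.dirty_ratio is the percentage of available memory that may be
dirty before a writing task is itself throttled into doing writeback;
it resets on reboot unless persisted via sysctl.conf):

# Default here was 20; lift it to 90 as root.
echo 90 > /proc/sys/vm/dirty_ratio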

a) without the throttling:

Samples: 1M of event 'cycles:P', Event count (approx.): 1061541571785
  Children      Self  Command          Shared Object               Symbol
+   17.05%    16.85%  fio              fio                         [.] get_io_u
+    3.04%     0.01%  fio              [kernel.vmlinux]            [k] entry_SYSCALL_64
+    3.03%     0.02%  fio              [kernel.vmlinux]            [k] do_syscall_64
+    1.39%     0.04%  fio              [kernel.vmlinux]            [k] __do_sys_io_uring_enter
+    1.33%     0.00%  fio              libc.so.6                   [.] __GI___libc_open
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] __x64_sys_openat
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] do_sys_openat2
+    1.33%     0.00%  fio              [unknown]                   [k] 0x312d6e65742f2f6d
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] do_filp_open
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] path_openat
+    1.29%     0.00%  fio              [kernel.vmlinux]            [k] down_write
+    1.29%     0.00%  fio              [kernel.vmlinux]            [k] rwsem_down_write_slowpath
+    1.26%     1.25%  fio              [kernel.vmlinux]            [k] osq_lock
+    1.14%     0.00%  fio              fio                         [.] 0x000055bbb94449fa
+    1.14%     1.14%  fio              fio                         [.] 0x000000000002a9f5
+    0.98%     0.00%  fio              [unknown]                   [k] 0x000055bbd6310520
+    0.93%     0.00%  fio              fio                         [.] 0x000055bbb94b197b
+    0.89%     0.00%  perf             libc.so.6                   [.] __GI___libc_write
+    0.89%     0.00%  perf             [kernel.vmlinux]            [k] entry_SYSCALL_64
+    0.88%     0.00%  perf             [kernel.vmlinux]            [k] do_syscall_64
+    0.86%     0.00%  perf             [kernel.vmlinux]            [k] ksys_write
+    0.85%     0.01%  perf             [kernel.vmlinux]            [k] vfs_write
+    0.83%     0.00%  perf             [ext4]                      [k] ext4_buffered_write_iter
+    0.81%     0.01%  perf             [kernel.vmlinux]            [k] generic_perform_write
+    0.77%     0.02%  fio              [kernel.vmlinux]            [k] io_submit_sqes
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] ret_from_fork_asm
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] ret_from_fork
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] kthread
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] worker_thread
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] process_one_work
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] wb_workfn
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] wb_writeback
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] __writeback_inodes_wb
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] writeback_sb_inodes
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] __writeback_single_inode
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] do_writepages
+    0.76%     0.00%  kworker/u513:26  [xfs]                       [k] xfs_vm_writepages
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] submit_bio_noacct_nocheck
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] iomap_submit_ioend

So we see *more* penalty from perf's own buffered IO writes of the
perf data than from any XFS writeback.
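
As an aside, one way to keep perf's own writes out of a profile like
this is to point its output at tmpfs; the flags below are
illustrative, not the exact invocation used for these profiles:

# perf.data lands in tmpfs, so recording itself doesn't add buffered
# IO writeback on the filesystem being measured.
perf record -o /dev/shm/perf.data -a -g -- sleep 30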

b) when we hit throttling:

Samples: 1M of event 'cycles:P', Event count (approx.): 816903693659
  Children      Self  Command          Shared Object               Symbol
+   14.24%    14.06%  fio              fio                         [.] get_io_u
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] ret_from_fork_asm
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] ret_from_fork
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] kthread
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] worker_thread
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] process_one_work
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] wb_workfn
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] wb_writeback
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] __writeback_inodes_wb
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] writeback_sb_inodes
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] __writeback_single_inode
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] do_writepages
+    4.87%     0.00%  kworker/u513:3-  [xfs]                       [k] xfs_vm_writepages
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] iomap_submit_ioend
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] submit_bio_noacct_nocheck
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] __submit_bio
+    4.82%     0.04%  kworker/u513:3-  [nd_pmem]                   [k] pmem_submit_bio
+    4.78%     0.05%  kworker/u513:3-  [nd_pmem]                   [k] pmem_do_write
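
For context on the shape of workload behind both profiles, a rough
fio sketch of a buffered write job driven through io_uring onto the
XFS test mount; the mount point and all sizes/thread counts are
placeholders, not the exact job options from these runs:

cat > buffered-writes.fio <<'EOF'
; buffered (page cache) writes via io_uring; every value below is an
; illustrative placeholder
[global]
directory=/mnt/xfs
ioengine=io_uring
direct=0
rw=write
bs=1M
size=4G
numjobs=16
group_reporting=1

[writers]
EOF
fio buffered-writes.fio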

Although my focus was on measuring the limits of the page cache, this
thread also produced a *slew* of ideas on how to improve the status
quo, pathological workloads or not. We have to accept that some
workloads are clearly pathological, but that's exactly the point of
coming up with limits and testing the page cache. Since there were a
slew of unexpected ideas for general improvements, even for common
use cases, spread throughout this entire thread, I've collected them
all as notes for review for this topic at LSFMM.

Thanks all for the feedback!

  Luis

