From: Benjamin Berg <benjamin@sipsolutions.net>
To: Anton Ivanov <anton.ivanov@cambridgegreys.com>,
Johannes Berg <johannes@sipsolutions.net>,
linux-um@lists.infradead.org
Subject: Re: [RFC PATCH 0/3] um: clean up mm creation - another attempt
Date: Wed, 17 Jan 2024 20:54:35 +0100 [thread overview]
Message-ID: <478ac27fd53fa20b4f735b1d792639cd61d5eda4.camel@sipsolutions.net> (raw)
In-Reply-To: <57c2ec52-29a6-4ce7-9334-e0ee436ba630@cambridgegreys.com>
On Wed, 2024-01-17 at 19:45 +0000, Anton Ivanov wrote:
> On 17/01/2024 17:17, Benjamin Berg wrote:
> > Hi,
> >
> > On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> > > [SNIP]
> > > Once we are there, we can look for optimizations. The fundamental
> > > problem is that page faults (even minor ones) are extremely expensive
> > > for us.
> > >
> > > Just throwing out ideas on what we could do:
> > > 1. SECCOMP as that reduces the amount of context switches.
> > > (Yes, I know I should resubmit the patchset)
> > > 2. Maybe we can disable/cripple page access tracking? If we assume
> > > initially mark all pages as accessed by userspace (i.e.
> > > pte_mkyoung), then we avoid a minor page fault on first access.
> > > Doing that will mess with page eviction though.
> > > 3. Do DAX (direct_access) for files. i.e. mmap files directly in the
> > > host kernel rather than through UM.
> > > With a hostfs like file system, one should be able to add an
> > > intermediate block device that maps host files to physical pages,
> > > then do DAX in the FS.
> > > For disk images, the existing iomem infrastructure should be
> > > usable, this should work with any DAX enabled filesystems (ext2,
> > > ext4, xfs, virtiofs, erofs).
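Idea 2. can be modelled in user space; this is just an illustrative sketch
(not the kernel pte API): a PTE installed with the accessed bit already set
never takes the first-access minor fault, at the cost of less precise
page-eviction information.

```c
#include <assert.h>
#include <stdint.h>

/* Toy PTE model: bit 0 = present, bit 1 = accessed ("young"). */
#define PTE_PRESENT  (1u << 0)
#define PTE_ACCESSED (1u << 1)

/* Install a PTE; optionally preset the accessed bit (the pte_mkyoung idea). */
static uint32_t install_pte(int preset_young)
{
    uint32_t pte = PTE_PRESENT;
    if (preset_young)
        pte |= PTE_ACCESSED;   /* first access will not fault */
    return pte;
}

/* First access: returns 1 if a minor fault was needed to set the bit. */
static int access_page(uint32_t *pte)
{
    if (*pte & PTE_ACCESSED)
        return 0;              /* no fault */
    *pte |= PTE_ACCESSED;      /* minor fault just to mark the page young */
    return 1;
}
```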
> >
> > So, I experimented quite a bit over Christmas (including getting DAX to
> > work with virtiofs). At the end of all this my conclusion is that
> > insufficient page table synchronization is our main problem.
> >
> > Basically, right now we rely on the flush_tlb_* functions from the
> > kernel, but these are only called when TLB entries are removed, *not*
> > when new PTEs are added (there is also update_mmu_cache, but it isn't
> > enough either). Effectively this means that new page table entries will
> > often only be synced because the userspace code runs into an
> > unnecessary segfault.
> >
> > Really, what we need is a set_pte_at() implementation that marks the
> > memory range for synchronization. Then we can make sure we sync it
> > before switching to the userspace process (the equivalent of running
> > flush_tlb_mm_range right now).
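A minimal user-space model of that set_pte_at() bookkeeping (struct and
function names here are made up for illustration): each PTE update only
widens a per-mm range, and the single sync before switching to the
userspace process covers exactly that range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-mm bookkeeping: the span of addresses whose PTEs
 * changed since the last sync with the host process. */
struct sync_range {
    uint64_t start;
    uint64_t end;    /* exclusive; start > end means "empty" */
};

static void sync_range_reset(struct sync_range *r)
{
    r->start = UINT64_MAX;
    r->end = 0;
}

/* What a set_pte_at() hook would do: no syscall, just widen the range. */
static void sync_range_track(struct sync_range *r, uint64_t addr, uint64_t len)
{
    if (addr < r->start)
        r->start = addr;
    if (addr + len > r->end)
        r->end = addr + len;
}

/* Before returning to userspace, sync [start, end) once -- the moral
 * equivalent of today's flush_tlb_mm_range -- and then reset. */
static int sync_range_pending(const struct sync_range *r)
{
    return r->start < r->end;
}
```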
> >
> > I think we should:
> > * Rewrite the userspace syscall code
> > - Support delaying the execution of syscalls
> > - Only support mmap/munmap/mprotect and LDT
> > - Do simple compression of consecutive syscalls here
> > - Drop the hand-written assembler
> > * Improve the tlb.c code
> > - remove the HVC abstraction
>
> Cool. That was not working particularly well. I tried to improve it a
> few times, but ripping it out and replacing it is probably a better idea.
Hm, now I realise that we still want mmap() syscall compression for the
kernel itself in tlb.c.
> > - never force immediate syscall execution
> > * Let set_pte_at() track which memory ranges that need syncing
> > * At that point we should be able to:
> > - drop copy_context_skas0
> > - make flush_tlb_* no-ops
> > - drop flush_tlb_page from handle_page_fault
> > - move unmap() from flush_thread to init_new_context
> > (or do it as part of start_userspace)
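The "simple compression of consecutive syscalls" step could look roughly
like this (an illustrative user-space sketch, not the actual stub
protocol): adjacent mmap requests with the same protection collapse into a
single queued host syscall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queued stub operation: map [addr, addr+len) with prot. */
struct stub_mmap {
    uint64_t addr;
    uint64_t len;
    int prot;
};

/* Push an op onto the queue, coalescing with the previous one when the
 * ranges are contiguous and the protection matches.  Returns the new
 * queue length, i.e. the number of host syscalls actually needed. */
static size_t stub_queue_push(struct stub_mmap *q, size_t n,
                              const struct stub_mmap *op)
{
    if (n > 0 &&
        q[n - 1].prot == op->prot &&
        q[n - 1].addr + q[n - 1].len == op->addr) {
        q[n - 1].len += op->len;   /* extend the previous mapping instead */
        return n;
    }
    q[n] = *op;
    return n + 1;
}
```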
> >
> > So, I did try this using nasty hacks and IIRC one of my runs was going
> > from 21s to 16s and another from 63s to 56s. Which seems like a nice
> > improvement.
>
> Excellent. I assume you were using hostfs as usual, right? If so, the
> difference is likely to be even more noticeable on ubd.
Yes, I was mostly testing hostfs. Initially also virtiofs with DAX, but
I went back as that didn't result in a pagefault count improvement once
I made some other adjustments.
Benjamin
>
> >
> > Benjamin
> >
> >
> > PS: As for DAX, it doesn't really seem to help performance. It didn't
> > seem to lower the number of page faults in UML. And, from my
> > perspective, it isn't really worth it just for the memory sharing.
> >
> > PPS: dirty/young tracking seemed to cause only a small number of
> > page faults in the grand scheme of things. So probably not something
> > worth following up on.
> >
>
Thread overview: 19+ messages
2023-09-22 22:37 [RFC PATCH 0/3] um: clean up mm creation - another attempt Johannes Berg
2023-09-22 22:37 ` [RFC PATCH 1/3] um/x86: remove ldt mutex and use mmap lock instead Johannes Berg
2023-09-22 22:37 ` [RFC PATCH 2/3] um: clean up init_new_context() Johannes Berg
2023-09-22 22:37 ` [RFC PATCH 3/3] um: don't force-flush in mm/userspace process start Johannes Berg
2023-09-25 13:29 ` [RFC PATCH 0/3] um: clean up mm creation - another attempt Anton Ivanov
2023-09-25 13:33 ` Johannes Berg
2023-09-25 13:34 ` Anton Ivanov
2023-09-25 14:27 ` Anton Ivanov
2023-09-25 14:44 ` Johannes Berg
2023-09-25 15:20 ` Anton Ivanov
2023-09-26 12:16 ` Anton Ivanov
2023-09-26 12:38 ` Johannes Berg
2023-09-26 13:04 ` Anton Ivanov
2023-09-27 9:52 ` Benjamin Berg
2023-09-27 9:59 ` Anton Ivanov
2023-09-27 10:42 ` Benjamin Berg
2024-01-17 17:17 ` Benjamin Berg
2024-01-17 19:45 ` Anton Ivanov
2024-01-17 19:54 ` Benjamin Berg [this message]