From: Prakash Sangappa <prakash.sangappa@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"muchun.song@linux.dev" <muchun.song@linux.dev>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"willy@infradead.org" <willy@infradead.org>
Subject: Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior
Date: Fri, 10 May 2024 16:28:31 +0000
Message-ID: <AE7FE90E-4799-4691-876D-84B03F97F1CE@oracle.com>
In-Reply-To: <c1b69c38-59c8-4d27-9ff5-bcca553da7c2@redhat.com>
> On May 8, 2024, at 10:28 AM, David Hildenbrand <david@redhat.com> wrote:
>
> On 08.05.24 19:00, Prakash Sangappa wrote:
>>> On May 7, 2024, at 5:00 AM, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 03.05.24 03:21, Prakash Sangappa wrote:
>>>> This patch proposes to fix hugetlbfs mmap behavior so that the
>>>> file size does not get updated in the mmap call.
>>>> The current behavior is that hugetlbfs file size will get extended by a
>>>> PROT_WRITE mmap(2) call if mmap size is greater than file size. This is
>>>> not normal filesystem behavior.
>>>> There seems to have been very little discussion about this. There was a
>>>> patch discussion[1] a while back, implying hugetlbfs file size needs
>>>> extending because of the hugetlb page reservations. Looks like this was
>>>> not merged.
>>>> It appears there is no correlation between file size and hugetlb page
>>>> reservations. Take the case of PROT_READ mmap, where the file size is
>>>> not extended even though hugetlb pages are reserved.
>>>> On the other hand ftruncate(2) to increase a file size does not reserve
>>>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
>>>> even though hugetlb pages are not reserved.
>>>> Hugetlb pages get reserved (if MAP_NORESERVE is not specified) when the
>>>> hugetlbfs file is mmapped, and the reservation only covers the file's
>>>> (offset, length) range specified in the mmap call.
>>>> Issue:
>>>> Some applications would prefer to manage hugetlb page allocations explicitly
>>>> using fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with the
>>>> MAP_NORESERVE flag and accessed only after allocating the necessary pages
>>>> with fallocate(2); the pages are released by truncating the file. Any stray
>>>> access beyond the file size is expected to generate a signal. This does not
>>>> work properly because the current behavior extends the file size in the mmap call.
>>>
>>> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)?
>> Another workaround could be to ftruncate(2) the file after mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But both would require hugetlbfs-specific application changes, which can be considered.
>
> I'd assume that most applications that mmap() hugetlb files need to
> special-case hugetlb because of the different logical page size
> granularity already. But yes, it's all unfortunate.
I will run this by our application/database team regarding implementing the workarounds.
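For reference, a rough sketch of the two workarounds discussed above. The
helper names are made up for illustration, and this has not been verified
on hugetlbfs; it only assumes the behavior described in this thread. On
hugetlbfs, len would have to be a multiple of the huge page size, and
MAP_NORESERVE can be passed in extra_flags.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Workaround 1 (hypothetical helper): map read-only first, so that the
 * mmap() call itself does not ask for write access, then upgrade the
 * protection. 'fd' must be open O_RDWR for mprotect() to be allowed to
 * add PROT_WRITE on a shared mapping.
 */
static void *map_then_mprotect(int fd, size_t len, int extra_flags)
{
	void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED | extra_flags, fd, 0);

	if (addr == MAP_FAILED)
		return MAP_FAILED;
	if (mprotect(addr, len, PROT_READ | PROT_WRITE) != 0) {
		munmap(addr, len);
		return MAP_FAILED;
	}
	return addr;
}

/*
 * Workaround 2 (hypothetical helper): map writable, then ftruncate()
 * back to the size the file had before the mmap() call.
 */
static void *map_then_restore_size(int fd, size_t len, int extra_flags)
{
	struct stat st;
	void *addr;

	if (fstat(fd, &st) != 0)
		return MAP_FAILED;
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | extra_flags, fd, 0);
	if (addr == MAP_FAILED)
		return MAP_FAILED;
	if (ftruncate(fd, st.st_size) != 0) {
		munmap(addr, len);
		return MAP_FAILED;
	}
	return addr;
}
```

Note that the second variant leaves a window between mmap() and
ftruncate() during which the file is extended, so the first variant may be
preferable where it works.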
>
>> However, should this mmap behavior be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification.
>
> The issue is, as you write, that it's existing behavior and changing it
> could cause harm to other apps that rely on that. But I do wonder if really
> anybody relies on that ...
>
> Let's explore the history:
>
> The current VM_WRITE check was added in:
>
> commit b6174df5eec9cdfd598c03d6d0807e344e109213
> Author: Zhang, Yanmin <yanmin.zhang@intel.com>
> Date: Mon Jul 10 04:44:49 2006 -0700
>
> [PATCH] mmap zero-length hugetlb file with PROT_NONE to protect a hugetlb virtual area
> Sometimes, applications need below call to be successful although
> "/mnt/hugepages/file1" doesn't exist.
> fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
> *addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0);
> As for regular pages (or files), above call does work, but as for huge
> pages, above call would fail because hugetlbfs_file_mmap would fail if
> (!(vma->vm_flags & VM_WRITE) && len > inode->i_size).
> This capability on huge page is useful on ia64 when the process wants to
> protect one area on region 4, so other threads couldn't read/write this
> area. A famous JVM (Java Virtual Machine) implementation on IA64 needs the
> capability.
>
> But it was only moved.
>
> Before that patch:
> * mmap(PROT_WRITE) would have failed if the file size would be exceeded
> * mmap(PROT_READ/PROT_NONE) would have extended the file
>
> After that patch
> * mmap(PROT_WRITE) will extend the file
> * mmap(PROT_READ/PROT_NONE) do not extend the file
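The difference from a regular file can be observed by checking st_size
right after the mmap() call. A minimal sketch; size_after_mmap() is a
hypothetical helper name, and for a file on a hugetlbfs mount len would
need to be a multiple of the huge page size.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create/open 'path', map 'len' bytes shared with
 * protection 'prot', and return the file size reported by fstat()
 * afterwards. Returns -1 on any failure.
 */
static off_t size_after_mmap(const char *path, size_t len, int prot)
{
	struct stat st;
	off_t size = -1;
	void *addr;
	int fd;

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1;

	addr = mmap(NULL, len, prot, MAP_SHARED, fd, 0);
	if (addr != MAP_FAILED) {
		if (fstat(fd, &st) == 0)
			size = st.st_size;
		munmap(addr, len);
	}
	close(fd);
	return size;
}
```

For a new, empty file on an ordinary filesystem this returns 0 whatever
the protection. On hugetlbfs, going by the behavior listed above, a
PROT_WRITE mapping would be expected to return len, while
PROT_READ/PROT_NONE would leave the size at 0.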
>
> The code before that predates git times.
>
> Having a mount option to change that really is suboptimal IMHO ... we shouldn't add mount options to work
> around all hugetlbfs quirks.
>
> I suggest either
>
> (a) Document it, along with the workaround
At a minimum, this needs to be documented.
> (b) Change it and cross fingers.
>
>
> In QEMU source code is a very interesting comment:
>
> * ftruncate is not supported by hugetlbfs in older
> * hosts, so don't bother bailing out on errors.
> * If anything goes wrong with it under other filesystems,
> * mmap will fail.
>
> So, was mmap() maybe the way to easily grow a hugetlbfs file before ftruncate() support
> was added?
>
> QEMU will only call ftruncate() if the file size is empty, though. So if you'd have a
> smaller file QEMU would not try growing it, and mmap() would succeed and grow it. That's
> a rare case to happen, though, and likely also undesired here: we want it to behave just
> like ordinary files!
Ideally yes.
Thanks for your feedback.
-Prakash.
>
> --
> Cheers,
>
> David / dhildenb