Linux-mm Archive mirror
From: Prakash Sangappa <prakash.sangappa@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"muchun.song@linux.dev" <muchun.song@linux.dev>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"willy@infradead.org" <willy@infradead.org>
Subject: Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior
Date: Fri, 10 May 2024 16:28:31 +0000
Message-ID: <AE7FE90E-4799-4691-876D-84B03F97F1CE@oracle.com>
In-Reply-To: <c1b69c38-59c8-4d27-9ff5-bcca553da7c2@redhat.com>



> On May 8, 2024, at 10:28 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 08.05.24 19:00, Prakash Sangappa wrote:
>>> On May 7, 2024, at 5:00 AM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 03.05.24 03:21, Prakash Sangappa wrote:
>>>> This patch proposes to fix hugetlbfs mmap behavior so that the
>>>> file size does not get updated in the mmap call.
>>>> The current behavior is that hugetlbfs file size will get extended by a
>>>> PROT_WRITE mmap(2) call if mmap size is greater than file size. This is
>>>> not normal filesystem behavior.
>>>> There seems to have been very little discussion about this. There was a
>>>> patch discussion[1] a while back, implying hugetlbfs file size needs
>>>> extending because of the hugetlb page reservations. Looks like this was
>>>> not merged.
>>>> It appears there is no correlation between file size and hugetlb page
>>>> reservations. Take the case of PROT_READ mmap, where the file size is
>>>> not extended even though hugetlb pages are reserved.
>>>> On the other hand ftruncate(2) to increase a file size does not reserve
>>>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
>>>> even though hugetlb pages are not reserved.
>>>> Hugetlb pages get reserved (if MAP_NORESERVE is not specified) when the
>>>> hugetlbfs file is mmapped, and the reservation only covers the file's
>>>> (offset, length) range specified in the mmap call.
>>>> Issue:
>>>> Some applications would prefer to manage hugetlb page allocations explicitly
>>>> with fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with the
>>>> MAP_NORESERVE flag and accessed only after allocating the necessary pages
>>>> using fallocate(2); the pages are released by truncating the file. Any stray
>>>> access beyond the file size is expected to generate a signal. This does not
>>>> work properly due to the current behavior, which extends the file size in the mmap call.
>>> 
>>> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)?
>> Another workaround could be to ftruncate(2) the file after mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But both workarounds require application changes that special-case hugetlbfs; that can be considered.
> 
> I'd assume that most applications that mmap() hugetlb files need to
> special-case hugetlb because of the different logical page size
> granularity already. But yes, it's all unfortunate.

Will run this by our application/Database team regarding implementing workarounds.

> 
>> However, should this mmap behavior be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification.
> 
> The issue is, as you write, that it's existing behavior and changing it
> could cause harm to other apps that rely on that. But I do wonder if really
> anybody relies on that ...
> 
> Let's explore the history:
> 
> The current VM_WRITE check was added in:
> 
> commit b6174df5eec9cdfd598c03d6d0807e344e109213
> Author: Zhang, Yanmin <yanmin.zhang@intel.com>
> Date:   Mon Jul 10 04:44:49 2006 -0700
> 
>    [PATCH] mmap zero-length hugetlb file with PROT_NONE to protect a hugetlb virtual area
> 
>    Sometimes, applications need below call to be successful although
>    "/mnt/hugepages/file1" doesn't exist.
> 
>    fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
>    *addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0);
> 
>    As for regular pages (or files), above call does work, but as for huge
>    pages, above call would fail because hugetlbfs_file_mmap would fail if
>    (!(vma->vm_flags & VM_WRITE) && len > inode->i_size).
> 
>    This capability on huge page is useful on ia64 when the process wants to
>    protect one area on region 4, so other threads couldn't read/write this
>    area.  A famous JVM (Java Virtual Machine) implementation on IA64 needs the
>    capability.
> 
> But it was only moved.
> 
> Before that patch:
> * mmap(PROT_WRITE) would have failed if the file size would be exceeded
> * mmap(PROT_READ/PROT_NONE) would have extended the file
> 
> After that patch:
> * mmap(PROT_WRITE) will extend the file
> * mmap(PROT_READ/PROT_NONE) does not extend the file
> 
> The code before that predates git times.
> 
> Having a mount option to change that really is suboptimal IMHO ... we shouldn't add mount options to work
> around all hugetlbfs quirks.
> 
> I suggest either
> 
> (a) Document it, along with the workaround

At a minimum, this needs documentation.

> (b) Change it and cross fingers.
> 
> 
> The QEMU source code contains a very interesting comment:
> 
>     * ftruncate is not supported by hugetlbfs in older
>     * hosts, so don't bother bailing out on errors.
>     * If anything goes wrong with it under other filesystems,
>     * mmap will fail.
> 
> So, was mmap() maybe the way to easily grow a hugetlbfs file before ftruncate() support
> was added?
> 
> QEMU will only call ftruncate() if the file is empty, though. So if you'd have a
> smaller file QEMU would not try growing it, and mmap() would succeed and grow it. That's
> a rare case to happen, though, and likely also undesired here: we want it to behave just
> like ordinary files!

Ideally yes. 

Thanks for your feedback. 
-Prakash.

> 
> -- 
> Cheers,
> 
> David / dhildenb



Thread overview: 6+ messages
2024-05-03  1:21 [RFC PATCH 0/1] Address hugetlbfs mmap behavior Prakash Sangappa
2024-05-03  1:21 ` [RFC PATCH 1/1] hugetlbfs: Add mount option to choose normal " Prakash Sangappa
2024-05-07 12:00 ` [RFC PATCH 0/1] Address hugetlbfs " David Hildenbrand
2024-05-08 17:00   ` Prakash Sangappa
2024-05-08 17:28     ` David Hildenbrand
2024-05-10 16:28       ` Prakash Sangappa [this message]
