* [PATCH v3 00/13] expand mmap_prepare functionality, port more users
@ 2025-09-16 14:11 Lorenzo Stoakes
  2025-09-16 14:11 ` [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
                   ` (12 more replies)
  0 siblings, 13 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap hook has been deprecated in favour of
f_op->mmap_prepare.

This was introduced in order to make it possible for us to eventually
eliminate the f_op->mmap hook, which is highly problematic as it allows
drivers and filesystems raw access to a VMA which is not yet correctly
initialised.

This hook also introduced complexity for the memory mapping operation, as
we must correctly unwind what we have done should an error arise.

Overall, this interface being so open has caused significant problems for
us, including security issues, so it is important that we simply eliminate
it as a source of problems.

This series therefore continues what was established, extending the
functionality further to permit more drivers and filesystems to use
mmap_prepare.

We start by updating some existing users that can use the mmap_prepare
functionality as-is.

We then introduce the concept of an mmap 'action', which a user, on
mmap_prepare, can request to be performed upon the VMA:

* Nothing - default, we're done
* Remap PFN - perform PFN remap with specified parameters
* I/O remap PFN - perform I/O PFN remap with specified parameters

By setting the action in mmap_prepare, we can dynamically decide what to do
next - if a driver/filesystem needs to determine whether to e.g. remap or
perform some other operation, it can do so and then select the action
accordingly.

This significantly expands the capabilities of the mmap_prepare hook, while
maintaining as much control as possible in the mm logic.
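
For illustration, a driver's mmap_prepare hook using this mechanism might
look roughly like the below (a minimal sketch - foo_dev and its base_pfn
field are invented for the example):

	static int foo_mmap_prepare(struct vm_area_desc *desc)
	{
		struct foo_dev *fdev = desc->file->private_data;

		desc->vm_flags |= VM_DONTEXPAND;
		/* Ask the mm core to PFN remap the whole VMA once it is
		 * fully established. */
		mmap_action_remap_full(desc, fdev->base_pfn);
		return 0;
	}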

We split the [io_]remap_pfn_range*() functions, which perform PFN remap (a
typical mapping prepopulation operation), into prepare/complete steps, and
add io_remap_pfn_range_[prepare, complete]() for a similar purpose.
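
Conceptually, a legacy call such as remap_pfn_range(vma, vma->vm_start,
pfn, size, vma->vm_page_prot) is then split across the two phases roughly
as follows (a sketch, not the exact mm internals):

	/* In .mmap_prepare - only the descriptor is updated (vm_pgoff for
	 * the CoW case, the VM_IO/VM_PFNMAP etc. flags): */
	remap_pfn_range_prepare(desc, pfn);

	/* Once the VMA is fully established - the page tables are
	 * populated: */
	err = remap_pfn_range_complete(vma, vma->vm_start, pfn, size,
				       vma->vm_page_prot);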

From there we update various mm-adjacent logic to use this functionality as
a first set of changes.

We also add success and error hooks for post-action processing, to
e.g. output a debug log on success or filter error codes.
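
For example, a driver wanting a debug log on success and error filtering
might set the hooks like so (a sketch - the foo_*() functions are invented
for the example):

	static int foo_mmap_success(const struct vm_area_struct *vma)
	{
		pr_debug("foo: mapped %#lx-%#lx\n",
			 vma->vm_start, vma->vm_end);
		return 0;
	}

	static int foo_mmap_error(int err)
	{
		/* Filter the error code - it is not valid to clear it. */
		return err == -ENOMEM ? -EAGAIN : err;
	}

and then, in the .mmap_prepare hook:

	desc->action.success_hook = foo_mmap_success;
	desc->action.error_hook = foo_mmap_error;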


v3:
* Squashed fix patches.
* Propagated tags (thanks everyone!)
* Dropped kcov as per Jason.
* Dropped vmcore as per Jason.
* Dropped procfs patch as per Jason.
* Dropped cramfs patch as per Jason.
* Dropped mmap_action_mixedmap() as per Jason.
* Dropped mmap_action_mixedmap_pages() as per Jason.
* Dropped all remaining mixedmap logic as per Jason.
* Dropped custom action as per Jason.
* Parameterise helpers by vm_area_desc * rather than mmap_action * as per
  discussion with Jason.
* Renamed addr to start for remap action as per discussion with Jason.
* Added kernel documentation tags for mmap_action_remap() as per Jason.
* Added mmap_action_remap_full() as per Jason.
* Removed pgprot parameter from mmap_action_remap() to tighten up the
  interface as per discussion with Jason.
* Added a warning if the caller tries to remap past the end or before the
  start of a VMA.
* const-ified vma_desc_size() and vma_desc_pages() as per David.
* Added a comment describing mmap_action.
* Updated mem char driver patch to utilise mmap_action_remap_full().
* Updated resctl patch to utilise mmap_action_remap_full().
* Fixed typo in mmap_action->success_hook comment as per Reinette.
* Const-ify VMA in success_hook so drivers which do odd things with the VMA
  at this point stand out.
* Fixed mistake in mmap_action_complete() not returning error on success
  hook failure.
* Fixed up comments for mmap_action_type enum values.
* Added ability to invoke I/O remap.
* Added mmap_action_ioremap() and mmap_action_ioremap_full() helpers for
  this.
* Added iommufd I/O remap implementation.

v2:
* Propagated tags, thanks everyone! :)
* Refactored resctl patch to avoid assigned-but-not-used variable.
* Updated resctl change to not use .mmap_abort as discussed with Jason.
* Removed .mmap_abort as discussed with Jason.
* Removed references to .mmap_abort from documentation.
* Fixed silly VM_WARN_ON_ONCE() mistake (asserting opposite of what we mean
  to) as per report from Alexander.
* Fixed relay kerneldoc error.
* Renamed __mmap_prelude to __mmap_setup, keep __mmap_complete the same as
  per David.
* Fixed docs typo in mmap_complete description + formatted bold rather than
  capitalised as per Randy.
* Eliminated mmap_complete and rework into actions specified in
  mmap_prepare (via vm_area_desc) which therefore eliminates the driver's
  ability to do anything crazy and allows us to control generic logic.
* Added helper functions for these - vma_desc_set_remap(),
  vma_desc_set_mixedmap().
* However unfortunately had to add post action hooks to vm_area_desc, as
  already hugetlbfs for instance needs to access the VMA to function
  correctly. It is at least the smallest possible means of doing this.
* Updated VMA test logic, the stacked filesystem compatibility layer and
  documentation to reflect this.
* Updated hugetlbfs implementation to use new approach, and refactored to
  accept desc where at all possible and to do as much as possible in
  .mmap_prepare, and the minimum required in the new post_hook callback.
* Updated /dev/mem and /dev/zero mmap logic to use the new mechanism.
* Updated cramfs, resctl to use the new mechanism.
* Updated proc_mmap hooks to only have proc_mmap_prepare.
* Updated the vmcore implementation to use the new hooks.
* Updated kcov to use the new hooks.
* Added hooks for success/failure for post-action handling.
* Added custom action hook for truly custom cases.
* Abstracted actions to separate type so we can use generic custom actions
  in custom handlers when necessary.
* Added callout re: lock issue raised in
  https://lore.kernel.org/linux-mm/20250801162930.GB184255@nvidia.com/ as
  per discussion with Jason.

https://lore.kernel.org/all/cover.1757534913.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (13):
  mm/shmem: update shmem to use mmap_prepare
  device/dax: update devdax to use mmap_prepare
  mm: add vma_desc_size(), vma_desc_pages() helpers
  relay: update relay to use mmap_prepare
  mm/vma: rename __mmap_prepare() function to avoid confusion
  mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  mm: introduce io_remap_pfn_range_[prepare, complete]()
  mm: add ability to take further action in vm_area_desc
  doc: update porting, vfs documentation for mmap_prepare actions
  mm/hugetlbfs: update hugetlbfs to use mmap_prepare
  mm: update mem char driver to use mmap_prepare
  mm: update resctl to use mmap_prepare
  iommufd: update to use mmap_prepare

 Documentation/filesystems/porting.rst |   5 +
 Documentation/filesystems/vfs.rst     |   4 +
 arch/csky/include/asm/pgtable.h       |   5 +
 arch/mips/alchemy/common/setup.c      |  28 ++++-
 arch/mips/include/asm/pgtable.h       |  10 ++
 arch/sparc/include/asm/pgtable_32.h   |  32 +++++-
 arch/sparc/include/asm/pgtable_64.h   |  32 +++++-
 drivers/char/mem.c                    |  76 ++++++++------
 drivers/dax/device.c                  |  32 ++++--
 drivers/iommu/iommufd/main.c          |  47 +++++----
 fs/hugetlbfs/inode.c                  |  36 ++++---
 fs/ntfs3/file.c                       |   2 +-
 fs/resctrl/pseudo_lock.c              |  20 ++--
 include/linux/hugetlb.h               |   9 +-
 include/linux/hugetlb_inline.h        |  15 ++-
 include/linux/mm.h                    | 127 +++++++++++++++++++++-
 include/linux/mm_types.h              |  46 ++++++++
 include/linux/shmem_fs.h              |   3 +-
 kernel/relay.c                        |  33 +++---
 mm/hugetlb.c                          |  77 ++++++++------
 mm/memory.c                           | 128 ++++++++++++++---------
 mm/secretmem.c                        |   2 +-
 mm/shmem.c                            |  49 ++++++---
 mm/util.c                             | 145 +++++++++++++++++++++++++-
 mm/vma.c                              |  74 ++++++++-----
 tools/testing/vma/vma_internal.h      |  83 ++++++++++++++-
 26 files changed, 868 insertions(+), 252 deletions(-)

--
2.51.0


* [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 16:42   ` Jason Gunthorpe
  2025-09-17 10:30   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 02/13] device/dax: update devdax " Lorenzo Stoakes
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

The shmem mmap hook simply assigns the vm_ops, so it is easily updated - do
so.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 mm/shmem.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 87005c086d5a..df02a2e0ebbb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2938,16 +2938,17 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
 	return retval;
 }
 
-static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int shmem_mmap_prepare(struct vm_area_desc *desc)
 {
+	struct file *file = desc->file;
 	struct inode *inode = file_inode(file);
 
 	file_accessed(file);
 	/* This is anonymous shared memory if it is unlinked at the time of mmap */
 	if (inode->i_nlink)
-		vma->vm_ops = &shmem_vm_ops;
+		desc->vm_ops = &shmem_vm_ops;
 	else
-		vma->vm_ops = &shmem_anon_vm_ops;
+		desc->vm_ops = &shmem_anon_vm_ops;
 	return 0;
 }
 
@@ -5217,7 +5218,7 @@ static const struct address_space_operations shmem_aops = {
 };
 
 static const struct file_operations shmem_file_operations = {
-	.mmap		= shmem_mmap,
+	.mmap_prepare	= shmem_mmap_prepare,
 	.open		= shmem_file_open,
 	.get_unmapped_area = shmem_get_unmapped_area,
 #ifdef CONFIG_TMPFS
-- 
2.51.0



* [PATCH v3 02/13] device/dax: update devdax to use mmap_prepare
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
  2025-09-16 14:11 ` [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 16:43   ` Jason Gunthorpe
  2025-09-17 10:37   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

The devdax driver does nothing special in its f_op->mmap hook, so
straightforwardly update it to use the mmap_prepare hook instead.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/dax/device.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2bb40a6060af..c2181439f925 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -13,8 +13,9 @@
 #include "dax-private.h"
 #include "bus.h"
 
-static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
-		const char *func)
+static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
+		       unsigned long start, unsigned long end, struct file *file,
+		       const char *func)
 {
 	struct device *dev = &dev_dax->dev;
 	unsigned long mask;
@@ -23,7 +24,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		return -ENXIO;
 
 	/* prevent private mappings from being established */
-	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+	if ((vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, attempted private mapping\n",
 				current->comm, func);
@@ -31,15 +32,15 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	}
 
 	mask = dev_dax->align - 1;
-	if (vma->vm_start & mask || vma->vm_end & mask) {
+	if (start & mask || end & mask) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
-				current->comm, func, vma->vm_start, vma->vm_end,
+				current->comm, func, start, end,
 				mask);
 		return -EINVAL;
 	}
 
-	if (!vma_is_dax(vma)) {
+	if (!file_is_dax(file)) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, vma is not DAX capable\n",
 				current->comm, func);
@@ -49,6 +50,13 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	return 0;
 }
 
+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+		     const char *func)
+{
+	return __check_vma(dev_dax, vma->vm_flags, vma->vm_start, vma->vm_end,
+			   vma->vm_file, func);
+}
+
 /* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
 __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 		unsigned long size)
@@ -285,8 +293,9 @@ static const struct vm_operations_struct dax_vm_ops = {
 	.pagesize = dev_dax_pagesize,
 };
 
-static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
+static int dax_mmap_prepare(struct vm_area_desc *desc)
 {
+	struct file *filp = desc->file;
 	struct dev_dax *dev_dax = filp->private_data;
 	int rc, id;
 
@@ -297,13 +306,14 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * fault time.
 	 */
 	id = dax_read_lock();
-	rc = check_vma(dev_dax, vma, __func__);
+	rc = __check_vma(dev_dax, desc->vm_flags, desc->start, desc->end, filp,
+			 __func__);
 	dax_read_unlock(id);
 	if (rc)
 		return rc;
 
-	vma->vm_ops = &dax_vm_ops;
-	vm_flags_set(vma, VM_HUGEPAGE);
+	desc->vm_ops = &dax_vm_ops;
+	desc->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
@@ -377,7 +387,7 @@ static const struct file_operations dax_fops = {
 	.open = dax_open,
 	.release = dax_release,
 	.get_unmapped_area = dax_get_unmapped_area,
-	.mmap = dax_mmap,
+	.mmap_prepare = dax_mmap_prepare,
 	.fop_flags = FOP_MMAP_SYNC,
 };
 
-- 
2.51.0



* [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
  2025-09-16 14:11 ` [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
  2025-09-16 14:11 ` [PATCH v3 02/13] device/dax: update devdax " Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 16:46   ` Jason Gunthorpe
  2025-09-17 10:39   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 04/13] relay: update relay to use mmap_prepare Lorenzo Stoakes
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

It's useful to be able to determine the size of a VMA descriptor range
used in f_op->mmap_prepare, expressed both in bytes and pages, so add
helpers for both and update code that can make use of them to do so.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
---
 fs/ntfs3/file.c    |  2 +-
 include/linux/mm.h | 10 ++++++++++
 mm/secretmem.c     |  2 +-
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index c1ece707b195..86eb88f62714 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -304,7 +304,7 @@ static int ntfs_file_mmap_prepare(struct vm_area_desc *desc)
 
 	if (rw) {
 		u64 to = min_t(loff_t, i_size_read(inode),
-			       from + desc->end - desc->start);
+			       from + vma_desc_size(desc));
 
 		if (is_sparsed(ni)) {
 			/* Allocate clusters for rw map. */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da6e0abad2cb..dd1fec5f028a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3571,6 +3571,16 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+static inline unsigned long vma_desc_size(const struct vm_area_desc *desc)
+{
+	return desc->end - desc->start;
+}
+
+static inline unsigned long vma_desc_pages(const struct vm_area_desc *desc)
+{
+	return vma_desc_size(desc) >> PAGE_SHIFT;
+}
+
 /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
 static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
 				unsigned long vm_start, unsigned long vm_end)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 60137305bc20..62066ddb1e9c 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -120,7 +120,7 @@ static int secretmem_release(struct inode *inode, struct file *file)
 
 static int secretmem_mmap_prepare(struct vm_area_desc *desc)
 {
-	const unsigned long len = desc->end - desc->start;
+	const unsigned long len = vma_desc_size(desc);
 
 	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
 		return -EINVAL;
-- 
2.51.0



* [PATCH v3 04/13] relay: update relay to use mmap_prepare
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 16:48   ` Jason Gunthorpe
  2025-09-17 10:41   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion Lorenzo Stoakes
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

It is relatively trivial to update this code to use the f_op->mmap_prepare
hook in favour of the deprecated f_op->mmap hook, so do so.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 kernel/relay.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index 8d915fe98198..e36f6b926f7f 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -72,17 +72,18 @@ static void relay_free_page_array(struct page **array)
 }
 
 /**
- *	relay_mmap_buf: - mmap channel buffer to process address space
- *	@buf: relay channel buffer
- *	@vma: vm_area_struct describing memory to be mapped
+ *	relay_mmap_prepare_buf: - mmap channel buffer to process address space
+ *	@buf: the relay channel buffer
+ *	@desc: describing what to map
  *
  *	Returns 0 if ok, negative on error
  *
  *	Caller should already have grabbed mmap_lock.
  */
-static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
+static int relay_mmap_prepare_buf(struct rchan_buf *buf,
+				  struct vm_area_desc *desc)
 {
-	unsigned long length = vma->vm_end - vma->vm_start;
+	unsigned long length = vma_desc_size(desc);
 
 	if (!buf)
 		return -EBADF;
@@ -90,9 +91,9 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
 	if (length != (unsigned long)buf->chan->alloc_size)
 		return -EINVAL;
 
-	vma->vm_ops = &relay_file_mmap_ops;
-	vm_flags_set(vma, VM_DONTEXPAND);
-	vma->vm_private_data = buf;
+	desc->vm_ops = &relay_file_mmap_ops;
+	desc->vm_flags |= VM_DONTEXPAND;
+	desc->private_data = buf;
 
 	return 0;
 }
@@ -749,16 +750,16 @@ static int relay_file_open(struct inode *inode, struct file *filp)
 }
 
 /**
- *	relay_file_mmap - mmap file op for relay files
- *	@filp: the file
- *	@vma: the vma describing what to map
+ *	relay_file_mmap_prepare - mmap file op for relay files
+ *	@desc: describing what to map
  *
- *	Calls upon relay_mmap_buf() to map the file into user space.
+ *	Calls upon relay_mmap_prepare_buf() to map the file into user space.
  */
-static int relay_file_mmap(struct file *filp, struct vm_area_struct *vma)
+static int relay_file_mmap_prepare(struct vm_area_desc *desc)
 {
-	struct rchan_buf *buf = filp->private_data;
-	return relay_mmap_buf(buf, vma);
+	struct rchan_buf *buf = desc->file->private_data;
+
+	return relay_mmap_prepare_buf(buf, desc);
 }
 
 /**
@@ -1006,7 +1007,7 @@ static ssize_t relay_file_read(struct file *filp,
 const struct file_operations relay_file_operations = {
 	.open		= relay_file_open,
 	.poll		= relay_file_poll,
-	.mmap		= relay_file_mmap,
+	.mmap_prepare	= relay_file_mmap_prepare,
 	.read		= relay_file_read,
 	.release	= relay_file_release,
 };
-- 
2.51.0



* [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 04/13] relay: update relay to use mmap_prepare Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 16:48   ` Jason Gunthorpe
  2025-09-17 10:49   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Now that we have the f_op->mmap_prepare() hook, having a static function
called __mmap_prepare() that has nothing to do with it is confusing, so
rename the function to __mmap_setup().

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 mm/vma.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index ac791ed8c92f..bdb070a62a2e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2329,7 +2329,7 @@ static void update_ksm_flags(struct mmap_state *map)
 }
 
 /*
- * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
+ * __mmap_setup() - Prepare to gather any overlapping VMAs that need to be
  * unmapped once the map operation is completed, check limits, account mapping
  * and clean up any pre-existing VMAs.
  *
@@ -2338,7 +2338,7 @@ static void update_ksm_flags(struct mmap_state *map)
  *
  * Returns: 0 on success, error code otherwise.
  */
-static int __mmap_prepare(struct mmap_state *map, struct list_head *uf)
+static int __mmap_setup(struct mmap_state *map, struct list_head *uf)
 {
 	int error;
 	struct vma_iterator *vmi = map->vmi;
@@ -2649,7 +2649,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 
 	map.check_ksm_early = can_set_ksm_flags_early(&map);
 
-	error = __mmap_prepare(&map, uf);
+	error = __mmap_setup(&map, uf);
 	if (!error && have_mmap_prepare)
 		error = call_mmap_prepare(&map);
 	if (error)
@@ -2679,7 +2679,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 
 	return addr;
 
-	/* Accounting was done by __mmap_prepare(). */
+	/* Accounting was done by __mmap_setup(). */
 unacct_error:
 	if (map.charged)
 		vm_unacct_memory(map.charged);
-- 
2.51.0



* [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 17:07   ` Jason Gunthorpe
  2025-09-17 11:07   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]() Lorenzo Stoakes
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

We need the ability to split PFN remap between updating the VMA and
performing the actual remap, in order to do away with the legacy
f_op->mmap hook.

To do so, update the PFN remap code to provide shared logic, and also make
remap_pfn_range_notrack() static, as its one user, io_mapping_map_user(),
was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").

Then, introduce remap_pfn_range_prepare(), which accepts a VMA descriptor
and PFN parameters, and remap_pfn_range_complete(), which accepts the same
parameters as remap_pfn_range().

remap_pfn_range_prepare() will set the COW vma->vm_pgoff if necessary, so
it must be supplied with a correct PFN to do so.  If the caller must hold
locks to be able to do this, those locks should be held across the
operation, and mmap_abort() should be provided to revoke the lock should
an error arise.

While we're here, also clean up the duplicated #ifdef
__HAVE_PFNMAP_TRACKING check and put it into a single #ifdef/#else block.

We would prefer to define these functions in mm/internal.h; however, we
will do the same for io_remap*(), and these have arch defines that require
access to the remap functions.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h |  25 +++++++--
 mm/memory.c        | 128 ++++++++++++++++++++++++++++-----------------
 2 files changed, 102 insertions(+), 51 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd1fec5f028a..3277e035006d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -489,6 +489,21 @@ extern unsigned int kobjsize(const void *objp);
  */
 #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
 
+/*
+ * Physically remapped pages are special. Tell the
+ * rest of the world about it:
+ *   VM_IO tells people not to look at these pages
+ *	(accesses can have side effects).
+ *   VM_PFNMAP tells the core MM that the base pages are just
+ *	raw PFN mappings, and do not have a "struct page" associated
+ *	with them.
+ *   VM_DONTEXPAND
+ *      Disable vma merging and expanding with mremap().
+ *   VM_DONTDUMP
+ *      Omit vma from core dump, even when VM_IO turned off.
+ */
+#define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
+
 /* This mask prevents VMA from being scanned with khugepaged */
 #define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
 
@@ -3622,10 +3637,12 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 
 struct vm_area_struct *find_extend_vma_locked(struct mm_struct *,
 		unsigned long addr);
-int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
-			unsigned long pfn, unsigned long size, pgprot_t);
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot);
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+		    unsigned long pfn, unsigned long size, pgprot_t pgprot);
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t pgprot);
+
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
 			struct page **pages, unsigned long *num);
diff --git a/mm/memory.c b/mm/memory.c
index 41e641823558..4be4a9dc0fd8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,8 +2900,27 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	return 0;
 }
 
+static int get_remap_pgoff(vm_flags_t vm_flags, unsigned long addr,
+		unsigned long end, unsigned long vm_start, unsigned long vm_end,
+		unsigned long pfn, pgoff_t *vm_pgoff_p)
+{
+	/*
+	 * There's a horrible special case to handle copy-on-write
+	 * behaviour that some programs depend on. We mark the "original"
+	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
+	 * See vm_normal_page() for details.
+	 */
+	if (is_cow_mapping(vm_flags)) {
+		if (addr != vm_start || end != vm_end)
+			return -EINVAL;
+		*vm_pgoff_p = pfn;
+	}
+
+	return 0;
+}
+
 static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -2912,32 +2931,17 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
 		return -EINVAL;
 
-	/*
-	 * Physically remapped pages are special. Tell the
-	 * rest of the world about it:
-	 *   VM_IO tells people not to look at these pages
-	 *	(accesses can have side effects).
-	 *   VM_PFNMAP tells the core MM that the base pages are just
-	 *	raw PFN mappings, and do not have a "struct page" associated
-	 *	with them.
-	 *   VM_DONTEXPAND
-	 *      Disable vma merging and expanding with mremap().
-	 *   VM_DONTDUMP
-	 *      Omit vma from core dump, even when VM_IO turned off.
-	 *
-	 * There's a horrible special case to handle copy-on-write
-	 * behaviour that some programs depend on. We mark the "original"
-	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
-	 * See vm_normal_page() for details.
-	 */
-	if (is_cow_mapping(vma->vm_flags)) {
-		if (addr != vma->vm_start || end != vma->vm_end)
-			return -EINVAL;
-		vma->vm_pgoff = pfn;
+	if (set_vma) {
+		err = get_remap_pgoff(vma->vm_flags, addr, end,
+				      vma->vm_start, vma->vm_end,
+				      pfn, &vma->vm_pgoff);
+		if (err)
+			return err;
+		vm_flags_set(vma, VM_REMAP_FLAGS);
+	} else {
+		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);
 	}
 
-	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
-
 	BUG_ON(addr >= end);
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
@@ -2957,11 +2961,10 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
  * Variant of remap_pfn_range that does not call track_pfn_remap.  The caller
  * must have pre-validated the caching bits of the pgprot_t.
  */
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
 {
-	int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
-
+	int error = remap_pfn_range_internal(vma, addr, pfn, size, prot, set_vma);
 	if (!error)
 		return 0;
 
@@ -2974,6 +2977,18 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
 	return error;
 }
 
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+{
+	/*
+	 * We set addr=VMA start, end=VMA end here, so this won't fail, but we
+	 * check it again on complete and will fail there if specified addr is
+	 * invalid.
+	 */
+	get_remap_pgoff(desc->vm_flags, desc->start, desc->end,
+			desc->start, desc->end, pfn, &desc->pgoff);
+	desc->vm_flags |= VM_REMAP_FLAGS;
+}
+
 #ifdef __HAVE_PFNMAP_TRACKING
 static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
 		unsigned long size, pgprot_t *prot)
@@ -3002,23 +3017,9 @@ void pfnmap_track_ctx_release(struct kref *ref)
 	pfnmap_untrack(ctx->pfn, ctx->size);
 	kfree(ctx);
 }
-#endif /* __HAVE_PFNMAP_TRACKING */
 
-/**
- * remap_pfn_range - remap kernel memory to userspace
- * @vma: user vma to map to
- * @addr: target page aligned user address to start at
- * @pfn: page frame number of kernel physical memory address
- * @size: size of mapping area
- * @prot: page protection flags for this mapping
- *
- * Note: this is only safe if the mm semaphore is held when called.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-#ifdef __HAVE_PFNMAP_TRACKING
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
-		    unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_track(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
 {
 	struct pfnmap_track_ctx *ctx = NULL;
 	int err;
@@ -3044,7 +3045,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		return -EINVAL;
 	}
 
-	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot, set_vma);
 	if (ctx) {
 		if (err)
 			kref_put(&ctx->kref, pfnmap_track_ctx_release);
@@ -3054,11 +3055,44 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	return err;
 }
 
+/**
+ * remap_pfn_range - remap kernel memory to userspace
+ * @vma: user vma to map to
+ * @addr: target page aligned user address to start at
+ * @pfn: page frame number of kernel physical memory address
+ * @size: size of mapping area
+ * @prot: page protection flags for this mapping
+ *
+ * Note: this is only safe if the mm semaphore is held when called.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+		    unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range_track(vma, addr, pfn, size, prot,
+				     /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	/* With set_vma = false, the VMA will not be modified. */
+	return remap_pfn_range_track(vma, addr, pfn, size, prot,
+				     /* set_vma = */false);
+}
 #else
 int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
-	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+	return remap_pfn_range_notrack(vma, addr, pfn, size, prot, /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+			     unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range_notrack(vma, addr, pfn, size, prot,
+				       /* set_vma = */false);
 }
 #endif
 EXPORT_SYMBOL(remap_pfn_range);
-- 
2.51.0



* [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]()
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (5 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 17:19   ` Jason Gunthorpe
  2025-09-17 11:12   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc Lorenzo Stoakes
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

We introduce the io_remap*() equivalents of remap_pfn_range_prepare() and
remap_pfn_range_complete() to allow for I/O remapping via mmap_prepare.

We have to make some architecture-specific changes for those architectures
which define customised handlers.

It doesn't really make sense to make these internal-only, as arches specify
their own versions of these functions, so we declare them in mm.h.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 arch/csky/include/asm/pgtable.h     |  5 +++++
 arch/mips/alchemy/common/setup.c    | 28 ++++++++++++++++++++++---
 arch/mips/include/asm/pgtable.h     | 10 +++++++++
 arch/sparc/include/asm/pgtable_32.h | 32 +++++++++++++++++++++++++----
 arch/sparc/include/asm/pgtable_64.h | 32 +++++++++++++++++++++++++----
 include/linux/mm.h                  | 18 ++++++++++++++++
 6 files changed, 114 insertions(+), 11 deletions(-)

diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index 5a394be09c35..c83505839a06 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -266,4 +266,9 @@ void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
 #define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
 	remap_pfn_range(vma, vaddr, pfn, size, prot)
 
+/* default io_remap_pfn_range_prepare can be used. */
+
+#define io_remap_pfn_range_complete(vma, addr, pfn, size, prot) \
+	remap_pfn_range_complete(vma, addr, pfn, size, prot)
+
 #endif /* __ASM_CSKY_PGTABLE_H */
diff --git a/arch/mips/alchemy/common/setup.c b/arch/mips/alchemy/common/setup.c
index a7a6d31a7a41..a4ab02776994 100644
--- a/arch/mips/alchemy/common/setup.c
+++ b/arch/mips/alchemy/common/setup.c
@@ -94,12 +94,34 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t phys_addr, phys_addr_t size)
 	return phys_addr;
 }
 
-int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+static unsigned long calc_pfn(unsigned long pfn, unsigned long size)
 {
 	phys_addr_t phys_addr = fixup_bigphys_addr(pfn << PAGE_SHIFT, size);
 
-	return remap_pfn_range(vma, vaddr, phys_addr >> PAGE_SHIFT, size, prot);
+	return phys_addr >> PAGE_SHIFT;
+}
+
+int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
+		unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, vaddr, calc_pfn(pfn, size), size, prot);
 }
 EXPORT_SYMBOL(io_remap_pfn_range);
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+			       unsigned long size)
+{
+	remap_pfn_range_prepare(desc, calc_pfn(pfn, size));
+}
+EXPORT_SYMBOL(io_remap_pfn_range_prepare);
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, calc_pfn(pfn, size),
+			size, prot);
+}
+EXPORT_SYMBOL(io_remap_pfn_range_complete);
+
 #endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index ae73ecf4c41a..6a8964f55a31 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -607,6 +607,16 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t addr, phys_addr_t size);
 int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
 		unsigned long pfn, unsigned long size, pgprot_t prot);
 #define io_remap_pfn_range io_remap_pfn_range
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+		unsigned long size);
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot);
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
 #else
 #define fixup_bigphys_addr(addr, size)	(addr)
 #endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 7c199c003ffe..30749c5ffe95 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -397,10 +397,11 @@ __get_iospace (unsigned long addr)
 
 int remap_pfn_range(struct vm_area_struct *, unsigned long, unsigned long,
 		    unsigned long, pgprot_t);
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t pgprot);
 
-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
-				     unsigned long from, unsigned long pfn,
-				     unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
 {
 	unsigned long long offset, space, phys_base;
 
@@ -408,10 +409,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 	space = GET_IOSPACE(pfn);
 	phys_base = offset | (space << 32ULL);
 
-	return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+	return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+				     unsigned long from, unsigned long pfn,
+				     unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
 }
 #define io_remap_pfn_range io_remap_pfn_range
 
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+		unsigned long size)
+{
+	remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+			size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
 ({									  \
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 669cd02469a1..b06f55915653 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1050,6 +1050,9 @@ int page_in_phys_avail(unsigned long paddr);
 
 int remap_pfn_range(struct vm_area_struct *, unsigned long, unsigned long,
 		    unsigned long, pgprot_t);
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t pgprot);
 
 void adi_restore_tags(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pte_t pte);
@@ -1084,9 +1087,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 	return 0;
 }
 
-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
-				     unsigned long from, unsigned long pfn,
-				     unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
 {
 	unsigned long offset = GET_PFN(pfn) << PAGE_SHIFT;
 	int space = GET_IOSPACE(pfn);
@@ -1094,10 +1095,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 
 	phys_base = offset | (((unsigned long) space) << 32UL);
 
-	return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+	return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+				     unsigned long from, unsigned long pfn,
+				     unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
 }
 #define io_remap_pfn_range io_remap_pfn_range
 
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+	unsigned long size)
+{
+	return remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+					size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
 static inline unsigned long __untagged_addr(unsigned long start)
 {
 	if (adi_capable()) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3277e035006d..6d4cc7cdf1e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3684,6 +3684,24 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef io_remap_pfn_range_prepare
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+	unsigned long size)
+{
+	return remap_pfn_range_prepare(desc, pfn);
+}
+#endif
+
+#ifndef io_remap_pfn_range_complete
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, pfn, size,
+			pgprot_decrypted(prot));
+}
+#endif
+
 static inline vm_fault_t vmf_error(int err)
 {
 	if (err == -ENOMEM)
-- 
2.51.0



* [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (6 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]() Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 17:28   ` Jason Gunthorpe
  2025-09-17 11:32   ` Pedro Falcato
  2025-09-16 14:11 ` [PATCH v3 09/13] doc: update porting, vfs documentation for mmap_prepare actions Lorenzo Stoakes
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Some drivers/filesystems need to perform additional tasks after the VMA is
set up.  This is typically in the form of pre-population.

The forms of pre-population most likely to be performed are a PFN remap
or the insertion of normal folios and PFNs into a mixed map.

We start by implementing the PFN remap functionality, ensuring that we
perform the appropriate actions at the appropriate time - that is setting
flags at the point of .mmap_prepare, and performing the actual remap at the
point at which the VMA is fully established.

This prevents the driver from doing anything too crazy with a VMA at any
stage, and we retain complete control over how the mm functionality is
applied.

Unfortunately callers still do often require some kind of custom action,
so we add optional success/error hooks to allow the caller to do something
after the action has succeeded or failed.

This is done at the point when the VMA has already been established, so
the harm that can be done is limited.

The error hook can be used to filter errors if necessary.

If any error arises on these final actions, we simply unmap the VMA
altogether.

Also update the stacked filesystem compatibility layer to utilise the
action behaviour, and update the VMA tests accordingly.
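
As an illustration of the intended usage, an mmap_prepare hook might select
an action dynamically like the below (a sketch - foo_dev, its fields and
foo_filter_error() are invented for the example):

	static int foo_mmap_prepare(struct vm_area_desc *desc)
	{
		struct foo_dev *fdev = desc->file->private_data;

		if (fdev->use_io_mem)
			mmap_action_ioremap_full(desc, fdev->io_pfn);
		else
			mmap_action_remap_full(desc, fdev->mem_pfn);

		/* Optionally filter error codes after a failed action. */
		desc->action.error_hook = foo_filter_error;
		return 0;
	}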

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h               |  74 ++++++++++++++++
 include/linux/mm_types.h         |  46 ++++++++++
 mm/util.c                        | 145 ++++++++++++++++++++++++++++++-
 mm/vma.c                         |  70 ++++++++++-----
 tools/testing/vma/vma_internal.h |  83 +++++++++++++++++-
 5 files changed, 390 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6d4cc7cdf1e1..ee4efb394ed3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3596,6 +3596,80 @@ static inline unsigned long vma_desc_pages(const struct vm_area_desc *desc)
 	return vma_desc_size(desc) >> PAGE_SHIFT;
 }
 
+/**
+ * mmap_action_remap - helper for mmap_prepare hook to specify that a pure PFN
+ * remap is required.
+ * @desc: The VMA descriptor for the VMA requiring remap.
+ * @start: The virtual address to start the remap from, must be within the VMA.
+ * @start_pfn: The first PFN in the range to remap.
+ * @size: The size of the range to remap, in bytes, at most spanning to the end
+ * of the VMA.
+ */
+static inline void mmap_action_remap(struct vm_area_desc *desc,
+				     unsigned long start,
+				     unsigned long start_pfn,
+				     unsigned long size)
+{
+	struct mmap_action *action = &desc->action;
+
+	/* [start, start + size) must be within the VMA. */
+	WARN_ON_ONCE(start < desc->start || start >= desc->end);
+	WARN_ON_ONCE(start + size > desc->end);
+
+	action->type = MMAP_REMAP_PFN;
+	action->remap.start = start;
+	action->remap.start_pfn = start_pfn;
+	action->remap.size = size;
+	action->remap.pgprot = desc->page_prot;
+}
+
+/**
+ * mmap_action_remap_full - helper for mmap_prepare hook to specify that the
+ * entirety of a VMA should be PFN remapped.
+ * @desc: The VMA descriptor for the VMA requiring remap.
+ * @start_pfn: The first PFN in the range to remap.
+ */
+static inline void mmap_action_remap_full(struct vm_area_desc *desc,
+					  unsigned long start_pfn)
+{
+	mmap_action_remap(desc, desc->start, start_pfn, vma_desc_size(desc));
+}
+
+/**
+ * mmap_action_ioremap - helper for mmap_prepare hook to specify that a pure PFN
+ * I/O remap is required.
+ * @desc: The VMA descriptor for the VMA requiring remap.
+ * @start: The virtual address to start the remap from, must be within the VMA.
+ * @start_pfn: The first PFN in the range to remap.
+ * @size: The size of the range to remap, in bytes, at most spanning to the end
+ * of the VMA.
+ */
+static inline void mmap_action_ioremap(struct vm_area_desc *desc,
+				       unsigned long start,
+				       unsigned long start_pfn,
+				       unsigned long size)
+{
+	mmap_action_remap(desc, start, start_pfn, size);
+	desc->action.remap.is_io_remap = true;
+}
+
+/**
+ * mmap_action_ioremap_full - helper for mmap_prepare hook to specify that the
+ * entirety of a VMA should be PFN I/O remapped.
+ * @desc: The VMA descriptor for the VMA requiring remap.
+ * @start_pfn: The first PFN in the range to remap.
+ */
+static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
+					  unsigned long start_pfn)
+{
+	mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
+}
+
+void mmap_action_prepare(struct mmap_action *action,
+			     struct vm_area_desc *desc);
+int mmap_action_complete(struct mmap_action *action,
+			     struct vm_area_struct *vma);
+
 /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
 static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
 				unsigned long vm_start, unsigned long vm_end)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31b27086586d..aa1e2003f366 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -775,6 +775,49 @@ struct pfnmap_track_ctx {
 };
 #endif
 
+/* What action should be taken after an .mmap_prepare call is complete? */
+enum mmap_action_type {
+	MMAP_NOTHING,		/* Mapping is complete, no further action. */
+	MMAP_REMAP_PFN,		/* Remap PFN range. */
+};
+
+/*
+ * Describes an action an mmap_prepare hook can request be taken to complete
+ * the mapping of a VMA. Specified in vm_area_desc.
+ */
+struct mmap_action {
+	union {
+		/* Remap range. */
+		struct {
+			unsigned long start;
+			unsigned long start_pfn;
+			unsigned long size;
+			pgprot_t pgprot;
+			bool is_io_remap;
+		} remap;
+	};
+	enum mmap_action_type type;
+
+	/*
+	 * If specified, this hook is invoked after the selected action has been
+	 * successfully completed. Note that the VMA write lock is still held.
+	 *
+	 * The absolute minimum ought to be done here.
+	 *
+	 * Returns 0 on success, or an error code.
+	 */
+	int (*success_hook)(const struct vm_area_struct *vma);
+
+	/*
+	 * If specified, this hook is invoked when an error occurs while
+	 * attempting the selected action.
+	 *
+	 * The hook can return an error code in order to filter the error, but
+	 * it is not valid to clear the error here.
+	 */
+	int (*error_hook)(int err);
+};
+
 /*
  * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
  * manipulate mutable fields which will cause those fields to be updated in the
@@ -798,6 +841,9 @@ struct vm_area_desc {
 	/* Write-only fields. */
 	const struct vm_operations_struct *vm_ops;
 	void *private_data;
+
+	/* Take further action? */
+	struct mmap_action action;
 };
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 6c1d64ed0221..64e0f28e251a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1155,15 +1155,18 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
 		.vm_file = vma->vm_file,
 		.vm_flags = vma->vm_flags,
 		.page_prot = vma->vm_page_prot,
+
+		.action.type = MMAP_NOTHING, /* Default */
 	};
 	int err;
 
 	err = f_op->mmap_prepare(&desc);
 	if (err)
 		return err;
-	set_vma_from_desc(vma, &desc);
 
-	return 0;
+	mmap_action_prepare(&desc.action, &desc);
+	set_vma_from_desc(vma, &desc);
+	return mmap_action_complete(&desc.action, vma);
 }
 EXPORT_SYMBOL(__compat_vma_mmap_prepare);
 
@@ -1279,6 +1282,144 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 	}
 }
 
+#ifdef CONFIG_MMU
+/**
+ * mmap_action_prepare - Perform preparatory setup for a VMA descriptor
+ * action which needs to be performed.
+ * @action: The action to perform.
+ * @desc: The VMA descriptor to prepare for @action.
+ */
+void mmap_action_prepare(struct mmap_action *action,
+			 struct vm_area_desc *desc)
+{
+	switch (action->type) {
+	case MMAP_NOTHING:
+		break;
+	case MMAP_REMAP_PFN:
+		if (action->remap.is_io_remap)
+			io_remap_pfn_range_prepare(desc, action->remap.start_pfn,
+				action->remap.size);
+		else
+			remap_pfn_range_prepare(desc, action->remap.start_pfn);
+		break;
+	}
+}
+EXPORT_SYMBOL(mmap_action_prepare);
+
+/**
+ * mmap_action_complete - Execute VMA descriptor action.
+ * @action: The action to perform.
+ * @vma: The VMA to perform the action upon.
+ *
+ * Counterpart to mmap_action_prepare(), executing the action it prepared.
+ *
+ * Return: 0 on success, or error, at which point the VMA will be unmapped.
+ */
+int mmap_action_complete(struct mmap_action *action,
+			 struct vm_area_struct *vma)
+{
+	int err = 0;
+
+	switch (action->type) {
+	case MMAP_NOTHING:
+		break;
+	case MMAP_REMAP_PFN:
+		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) !=
+				VM_REMAP_FLAGS);
+
+		if (action->remap.is_io_remap)
+			err = io_remap_pfn_range_complete(vma, action->remap.start,
+				action->remap.start_pfn, action->remap.size,
+				action->remap.pgprot);
+		else
+			err = remap_pfn_range_complete(vma, action->remap.start,
+				action->remap.start_pfn, action->remap.size,
+				action->remap.pgprot);
+		break;
+	}
+
+	/*
+	 * If an error occurs, unmap the VMA altogether and return an error. We
+	 * only clear the newly allocated VMA, since this function is only
+	 * invoked if we do NOT merge, so we only clean up the VMA we created.
+	 */
+	if (err) {
+		const size_t len = vma_pages(vma) << PAGE_SHIFT;
+
+		do_munmap(current->mm, vma->vm_start, len, NULL);
+
+		if (action->error_hook) {
+			/* We may want to filter the error. */
+			err = action->error_hook(err);
+
+			/* The caller should not clear the error. */
+			VM_WARN_ON_ONCE(!err);
+		}
+		return err;
+	}
+
+	if (action->success_hook)
+		err = action->success_hook(vma);
+
+	return err;
+}
+EXPORT_SYMBOL(mmap_action_complete);
+#else
+void mmap_action_prepare(struct mmap_action *action,
+			struct vm_area_desc *desc)
+{
+	switch (action->type) {
+	case MMAP_NOTHING:
+		break;
+	case MMAP_REMAP_PFN:
+		WARN_ON_ONCE(1); /* nommu cannot handle these. */
+		break;
+	}
+}
+EXPORT_SYMBOL(mmap_action_prepare);
+
+int mmap_action_complete(struct mmap_action *action,
+			struct vm_area_struct *vma)
+{
+	int err = 0;
+
+	switch (action->type) {
+	case MMAP_NOTHING:
+		break;
+	case MMAP_REMAP_PFN:
+		WARN_ON_ONCE(1); /* nommu cannot handle this. */
+
+		break;
+	}
+
+	/*
+	 * If an error occurs, unmap the VMA altogether and return an error. We
+	 * only clear the newly allocated VMA, since this function is only
+	 * invoked if we do NOT merge, so we only clean up the VMA we created.
+	 */
+	if (err) {
+		const size_t len = vma_pages(vma) << PAGE_SHIFT;
+
+		do_munmap(current->mm, vma->vm_start, len, NULL);
+
+		if (action->error_hook) {
+			/* We may want to filter the error. */
+			err = action->error_hook(err);
+
+			/* The caller should not clear the error. */
+			VM_WARN_ON_ONCE(!err);
+		}
+		return err;
+	}
+
+	if (action->success_hook)
+		err = action->success_hook(vma);
+
+	return err;
+}
+EXPORT_SYMBOL(mmap_action_complete);
+#endif
+
 #ifdef CONFIG_MMU
 /**
  * folio_pte_batch - detect a PTE batch for a large folio
diff --git a/mm/vma.c b/mm/vma.c
index bdb070a62a2e..1be297f7bb00 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2328,17 +2328,33 @@ static void update_ksm_flags(struct mmap_state *map)
 	map->vm_flags = ksm_vma_flags(map->mm, map->file, map->vm_flags);
 }
 
+static void set_desc_from_map(struct vm_area_desc *desc,
+		const struct mmap_state *map)
+{
+	desc->start = map->addr;
+	desc->end = map->end;
+
+	desc->pgoff = map->pgoff;
+	desc->vm_file = map->file;
+	desc->vm_flags = map->vm_flags;
+	desc->page_prot = map->page_prot;
+}
+
 /*
  * __mmap_setup() - Prepare to gather any overlapping VMAs that need to be
  * unmapped once the map operation is completed, check limits, account mapping
  * and clean up any pre-existing VMAs.
  *
+ * As a result it sets up the @map and @desc objects.
+ *
  * @map: Mapping state.
+ * @desc: VMA descriptor
  * @uf:  Userfaultfd context list.
  *
  * Returns: 0 on success, error code otherwise.
  */
-static int __mmap_setup(struct mmap_state *map, struct list_head *uf)
+static int __mmap_setup(struct mmap_state *map, struct vm_area_desc *desc,
+			struct list_head *uf)
 {
 	int error;
 	struct vma_iterator *vmi = map->vmi;
@@ -2395,6 +2411,7 @@ static int __mmap_setup(struct mmap_state *map, struct list_head *uf)
 	 */
 	vms_clean_up_area(vms, &map->mas_detach);
 
+	set_desc_from_map(desc, map);
 	return 0;
 }
 
@@ -2567,34 +2584,26 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
  *
  * Returns 0 on success, or an error code otherwise.
  */
-static int call_mmap_prepare(struct mmap_state *map)
+static int call_mmap_prepare(struct mmap_state *map,
+		struct vm_area_desc *desc)
 {
 	int err;
-	struct vm_area_desc desc = {
-		.mm = map->mm,
-		.file = map->file,
-		.start = map->addr,
-		.end = map->end,
-
-		.pgoff = map->pgoff,
-		.vm_file = map->file,
-		.vm_flags = map->vm_flags,
-		.page_prot = map->page_prot,
-	};
 
 	/* Invoke the hook. */
-	err = vfs_mmap_prepare(map->file, &desc);
+	err = vfs_mmap_prepare(map->file, desc);
 	if (err)
 		return err;
 
+	mmap_action_prepare(&desc->action, desc);
+
 	/* Update fields permitted to be changed. */
-	map->pgoff = desc.pgoff;
-	map->file = desc.vm_file;
-	map->vm_flags = desc.vm_flags;
-	map->page_prot = desc.page_prot;
+	map->pgoff = desc->pgoff;
+	map->file = desc->vm_file;
+	map->vm_flags = desc->vm_flags;
+	map->page_prot = desc->page_prot;
 	/* User-defined fields. */
-	map->vm_ops = desc.vm_ops;
-	map->vm_private_data = desc.private_data;
+	map->vm_ops = desc->vm_ops;
+	map->vm_private_data = desc->private_data;
 
 	return 0;
 }
@@ -2642,16 +2651,24 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
-	int error;
 	bool have_mmap_prepare = file && file->f_op->mmap_prepare;
 	VMA_ITERATOR(vmi, mm, addr);
 	MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
+	struct vm_area_desc desc = {
+		.mm = mm,
+		.file = file,
+		.action = {
+			.type = MMAP_NOTHING, /* Default to no further action. */
+		},
+	};
+	bool allocated_new = false;
+	int error;
 
 	map.check_ksm_early = can_set_ksm_flags_early(&map);
 
-	error = __mmap_setup(&map, uf);
+	error = __mmap_setup(&map, &desc, uf);
 	if (!error && have_mmap_prepare)
-		error = call_mmap_prepare(&map);
+		error = call_mmap_prepare(&map, &desc);
 	if (error)
 		goto abort_munmap;
 
@@ -2670,6 +2687,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 		error = __mmap_new_vma(&map, &vma);
 		if (error)
 			goto unacct_error;
+		allocated_new = true;
 	}
 
 	if (have_mmap_prepare)
@@ -2677,6 +2695,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 
 	__mmap_complete(&map, vma);
 
+	if (have_mmap_prepare && allocated_new) {
+		error = mmap_action_complete(&desc.action, vma);
+		if (error)
+			return error;
+	}
+
 	return addr;
 
 	/* Accounting was done by __mmap_setup(). */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 07167446dcf4..8c4722c2eced 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -274,6 +274,50 @@ struct mm_struct {
 
 struct vm_area_struct;
 
+
+/* What action should be taken after an .mmap_prepare call is complete? */
+enum mmap_action_type {
+	MMAP_NOTHING,		/* Mapping is complete, no further action. */
+	MMAP_REMAP_PFN,		/* Remap PFN range. */
+};
+
+/*
+ * Describes an action an mmap_prepare hook can request be taken to complete
+ * the mapping of a VMA. Specified in vm_area_desc.
+ */
+struct mmap_action {
+	union {
+		/* Remap range. */
+		struct {
+			unsigned long start;
+			unsigned long start_pfn;
+			unsigned long size;
+			pgprot_t pgprot;
+			bool is_io_remap;
+		} remap;
+	};
+	enum mmap_action_type type;
+
+	/*
+	 * If specified, this hook is invoked after the selected action has been
+	 * successfully completed. Note that the VMA write lock is still held.
+	 *
+	 * The absolute minimum ought to be done here.
+	 *
+	 * Returns 0 on success, or an error code.
+	 */
+	int (*success_hook)(const struct vm_area_struct *vma);
+
+	/*
+	 * If specified, this hook is invoked when an error occurs while
+	 * attempting the selected action.
+	 *
+	 * The hook can return an error code in order to filter the error, but
+	 * it is not valid to clear the error here.
+	 */
+	int (*error_hook)(int err);
+};
+
 /*
  * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
  * manipulate mutable fields which will cause those fields to be updated in the
@@ -297,6 +341,9 @@ struct vm_area_desc {
 	/* Write-only fields. */
 	const struct vm_operations_struct *vm_ops;
 	void *private_data;
+
+	/* Take further action? */
+	struct mmap_action action;
 };
 
 struct file_operations {
@@ -1466,12 +1513,23 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
 static inline void set_vma_from_desc(struct vm_area_struct *vma,
 		struct vm_area_desc *desc);
 
+static inline void mmap_action_prepare(struct mmap_action *action,
+					   struct vm_area_desc *desc)
+{
+}
+
+static inline int mmap_action_complete(struct mmap_action *action,
+					   struct vm_area_struct *vma)
+{
+	return 0;
+}
+
 static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
 		struct file *file, struct vm_area_struct *vma)
 {
 	struct vm_area_desc desc = {
 		.mm = vma->vm_mm,
-		.file = vma->vm_file,
+		.file = file,
 		.start = vma->vm_start,
 		.end = vma->vm_end,
 
@@ -1479,15 +1537,18 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
 		.vm_file = vma->vm_file,
 		.vm_flags = vma->vm_flags,
 		.page_prot = vma->vm_page_prot,
+
+		.action.type = MMAP_NOTHING, /* Default */
 	};
 	int err;
 
 	err = f_op->mmap_prepare(&desc);
 	if (err)
 		return err;
-	set_vma_from_desc(vma, &desc);
 
-	return 0;
+	mmap_action_prepare(&desc.action, &desc);
+	set_vma_from_desc(vma, &desc);
+	return mmap_action_complete(&desc.action, vma);
 }
 
 static inline int compat_vma_mmap_prepare(struct file *file,
@@ -1548,4 +1609,20 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
 	return vm_flags;
 }
 
+static inline void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+{
+}
+
+static inline int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t pgprot)
+{
+	return 0;
+}
+
+static inline int do_munmap(struct mm_struct *, unsigned long, size_t,
+		struct list_head *uf)
+{
+	return 0;
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 09/13] doc: update porting, vfs documentation for mmap_prepare actions
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (7 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 14:11 ` [PATCH v3 10/13] mm/hugetlbfs: update hugetlbfs to use mmap_prepare Lorenzo Stoakes
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Now we have introduced the ability to specify that actions should be taken
after a VMA is established via the vm_area_desc->action field as specified
in mmap_prepare, update both the VFS documentation and the porting guide
to describe this.
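
As a brief illustration of what the updated documentation describes, an
mmap_prepare hook mapping an MMIO region might configure the action as
follows (sketch only; bar and bar_vm_ops are hypothetical):

	desc->vm_ops = &bar_vm_ops;
	mmap_action_ioremap_full(desc, bar->mmio_addr >> PAGE_SHIFT);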

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 Documentation/filesystems/porting.rst | 5 +++++
 Documentation/filesystems/vfs.rst     | 4 ++++
 2 files changed, 9 insertions(+)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 85f590254f07..6743ed0b9112 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1285,3 +1285,8 @@ rather than a VMA, as the VMA at this stage is not yet valid.
 The vm_area_desc provides the minimum required information for a filesystem
 to initialise state upon memory mapping of a file-backed region, and output
 parameters for the file system to set this state.
+
+In nearly all cases, this is all that is required for a filesystem. However, if
+a filesystem needs to perform an operation such as pre-population of page tables,
+then that action can be specified in the vm_area_desc->action field, which can
+be configured using the mmap_action_*() helpers.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 486a91633474..9e96c46ee10e 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1236,6 +1236,10 @@ otherwise noted.
 	file-backed memory mapping, most notably establishing relevant
 	private state and VMA callbacks.
 
+	If further action such as pre-population of page tables is required,
+	this can be specified by the vm_area_desc->action field and related
+	parameters.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides.  When opening a device node
 (character or block special) most filesystems will call special
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 10/13] mm/hugetlbfs: update hugetlbfs to use mmap_prepare
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (8 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 09/13] doc: update porting, vfs documentation for mmap_prepare actions Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 17:30   ` Jason Gunthorpe
  2025-09-16 14:11 ` [PATCH v3 11/13] mm: update mem char driver " Lorenzo Stoakes
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Since we can now perform actions after the VMA is established via
mmap_prepare, use desc->action.success_hook to set up the hugetlb lock
once the VMA is set up.

We also make changes throughout hugetlbfs to make this possible.
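
Condensed from the diff below, the pattern is to register the hook on the
mmap_prepare success path:

	desc->action.success_hook = hugetlb_file_mmap_prepare_success;

The hook then runs once the VMA is established (with the VMA write lock
still held) and allocates the lock:

	static int hugetlb_file_mmap_prepare_success(const struct vm_area_struct *vma)
	{
		return hugetlb_vma_lock_alloc((struct vm_area_struct *)vma);
	}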

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 fs/hugetlbfs/inode.c           | 36 ++++++++++------
 include/linux/hugetlb.h        |  9 +++-
 include/linux/hugetlb_inline.h | 15 ++++---
 mm/hugetlb.c                   | 77 ++++++++++++++++++++--------------
 4 files changed, 85 insertions(+), 52 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index f42548ee9083..9e0625167517 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -96,8 +96,15 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
 #define PGOFF_LOFFT_MAX \
 	(((1UL << (PAGE_SHIFT + 1)) - 1) <<  (BITS_PER_LONG - (PAGE_SHIFT + 1)))
 
-static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+static int hugetlb_file_mmap_prepare_success(const struct vm_area_struct *vma)
 {
+	/* Unfortunately we have to reassign vma->vm_private_data. */
+	return hugetlb_vma_lock_alloc((struct vm_area_struct *)vma);
+}
+
+static int hugetlbfs_file_mmap_prepare(struct vm_area_desc *desc)
+{
+	struct file *file = desc->file;
 	struct inode *inode = file_inode(file);
 	loff_t len, vma_len;
 	int ret;
@@ -112,8 +119,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
-	vma->vm_ops = &hugetlb_vm_ops;
+	desc->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
+	desc->vm_ops = &hugetlb_vm_ops;
 
 	/*
 	 * page based offset in vm_pgoff could be sufficiently large to
@@ -122,16 +129,16 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * sizeof(unsigned long).  So, only check in those instances.
 	 */
 	if (sizeof(unsigned long) == sizeof(loff_t)) {
-		if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
+		if (desc->pgoff & PGOFF_LOFFT_MAX)
 			return -EINVAL;
 	}
 
 	/* must be huge page aligned */
-	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
+	if (desc->pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
 		return -EINVAL;
 
-	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
-	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+	vma_len = (loff_t)vma_desc_size(desc);
+	len = vma_len + ((loff_t)desc->pgoff << PAGE_SHIFT);
 	/* check for overflow */
 	if (len < vma_len)
 		return -EINVAL;
@@ -141,7 +148,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	ret = -ENOMEM;
 
-	vm_flags = vma->vm_flags;
+	vm_flags = desc->vm_flags;
 	/*
 	 * for SHM_HUGETLB, the pages are reserved in the shmget() call so skip
 	 * reserving here. Note: only for SHM hugetlbfs file, the inode
@@ -151,17 +158,20 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 		vm_flags |= VM_NORESERVE;
 
 	if (hugetlb_reserve_pages(inode,
-				vma->vm_pgoff >> huge_page_order(h),
-				len >> huge_page_shift(h), vma,
-				vm_flags) < 0)
+			desc->pgoff >> huge_page_order(h),
+			len >> huge_page_shift(h), desc,
+			vm_flags) < 0)
 		goto out;
 
 	ret = 0;
-	if (vma->vm_flags & VM_WRITE && inode->i_size < len)
+	if ((desc->vm_flags & VM_WRITE) && inode->i_size < len)
 		i_size_write(inode, len);
 out:
 	inode_unlock(inode);
 
+	/* Allocate the VMA lock once the VMA has been set up. */
+	if (!ret)
+		desc->action.success_hook = hugetlb_file_mmap_prepare_success;
 	return ret;
 }
 
@@ -1221,7 +1231,7 @@ static void init_once(void *foo)
 
 static const struct file_operations hugetlbfs_file_operations = {
 	.read_iter		= hugetlbfs_read_iter,
-	.mmap			= hugetlbfs_file_mmap,
+	.mmap_prepare		= hugetlbfs_file_mmap_prepare,
 	.fsync			= noop_fsync,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
 	.llseek			= default_llseek,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8e63e46b8e1f..2387513d6ae5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -150,8 +150,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 			     struct folio **foliop);
 #endif /* CONFIG_USERFAULTFD */
 long hugetlb_reserve_pages(struct inode *inode, long from, long to,
-						struct vm_area_struct *vma,
-						vm_flags_t vm_flags);
+			   struct vm_area_desc *desc, vm_flags_t vm_flags);
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
 bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list);
@@ -280,6 +279,7 @@ bool is_hugetlb_entry_hwpoisoned(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
 void fixup_hugetlb_reservations(struct vm_area_struct *vma);
 void hugetlb_split(struct vm_area_struct *vma, unsigned long addr);
+int hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
@@ -466,6 +466,11 @@ static inline void fixup_hugetlb_reservations(struct vm_area_struct *vma)
 
 static inline void hugetlb_split(struct vm_area_struct *vma, unsigned long addr) {}
 
+static inline int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #ifndef pgd_write
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..a27aa0162918 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -2,22 +2,27 @@
 #ifndef _LINUX_HUGETLB_INLINE_H
 #define _LINUX_HUGETLB_INLINE_H
 
-#ifdef CONFIG_HUGETLB_PAGE
-
 #include <linux/mm.h>
 
-static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
+#ifdef CONFIG_HUGETLB_PAGE
+
+static inline bool is_vm_hugetlb_flags(vm_flags_t vm_flags)
 {
-	return !!(vma->vm_flags & VM_HUGETLB);
+	return !!(vm_flags & VM_HUGETLB);
 }
 
 #else
 
-static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
+static inline bool is_vm_hugetlb_flags(vm_flags_t vm_flags)
 {
 	return false;
 }
 
 #endif
 
+static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
+{
+	return is_vm_hugetlb_flags(vma->vm_flags);
+}
+
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1806685ea326..af28f7fbabb8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -119,7 +119,6 @@ struct mutex *hugetlb_fault_mutex_table __ro_after_init;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
 static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, bool take_locks);
@@ -427,17 +426,21 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 	}
 }
 
-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+/*
+ * vma specific semaphore used for pmd sharing and fault/truncation
+ * synchronization
+ */
+int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
 {
 	struct hugetlb_vma_lock *vma_lock;
 
 	/* Only establish in (flags) sharable vmas */
 	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
-		return;
+		return 0;
 
 	/* Should never get here with non-NULL vm_private_data */
 	if (vma->vm_private_data)
-		return;
+		return -EINVAL;
 
 	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
 	if (!vma_lock) {
@@ -452,13 +455,15 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
 		 * allocation failure.
 		 */
 		pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
-		return;
+		return -EINVAL;
 	}
 
 	kref_init(&vma_lock->refs);
 	init_rwsem(&vma_lock->rw_sema);
 	vma_lock->vma = vma;
 	vma->vm_private_data = vma_lock;
+
+	return 0;
 }
 
 /* Helper that removes a struct file_region from the resv_map cache and returns
@@ -1190,20 +1195,28 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 	}
 }
 
-static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
+static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
 {
-	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
-	VM_BUG_ON_VMA(vma->vm_flags & VM_MAYSHARE, vma);
+	VM_WARN_ON_ONCE_VMA(!is_vm_hugetlb_page(vma), vma);
+	VM_WARN_ON_ONCE_VMA(vma->vm_flags & VM_MAYSHARE, vma);
 
-	set_vma_private_data(vma, (unsigned long)map);
+	set_vma_private_data(vma, get_vma_private_data(vma) | flags);
 }
 
-static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
+static void set_vma_desc_resv_map(struct vm_area_desc *desc, struct resv_map *map)
 {
-	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
-	VM_BUG_ON_VMA(vma->vm_flags & VM_MAYSHARE, vma);
+	VM_WARN_ON_ONCE(!is_vm_hugetlb_flags(desc->vm_flags));
+	VM_WARN_ON_ONCE(desc->vm_flags & VM_MAYSHARE);
 
-	set_vma_private_data(vma, get_vma_private_data(vma) | flags);
+	desc->private_data = map;
+}
+
+static void set_vma_desc_resv_flags(struct vm_area_desc *desc, unsigned long flags)
+{
+	VM_WARN_ON_ONCE(!is_vm_hugetlb_flags(desc->vm_flags));
+	VM_WARN_ON_ONCE(desc->vm_flags & VM_MAYSHARE);
+
+	desc->private_data = (void *)((unsigned long)desc->private_data | flags);
 }
 
 static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
@@ -1213,6 +1226,13 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
 	return (get_vma_private_data(vma) & flag) != 0;
 }
 
+static bool is_vma_desc_resv_set(struct vm_area_desc *desc, unsigned long flag)
+{
+	VM_WARN_ON_ONCE(!is_vm_hugetlb_flags(desc->vm_flags));
+
+	return ((unsigned long)desc->private_data) & flag;
+}
+
 bool __vma_private_lock(struct vm_area_struct *vma)
 {
 	return !(vma->vm_flags & VM_MAYSHARE) &&
@@ -7250,9 +7270,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
  */
 
 long hugetlb_reserve_pages(struct inode *inode,
-					long from, long to,
-					struct vm_area_struct *vma,
-					vm_flags_t vm_flags)
+		long from, long to,
+		struct vm_area_desc *desc,
+		vm_flags_t vm_flags)
 {
 	long chg = -1, add = -1, spool_resv, gbl_resv;
 	struct hstate *h = hstate_inode(inode);
@@ -7267,12 +7287,6 @@ long hugetlb_reserve_pages(struct inode *inode,
 		return -EINVAL;
 	}
 
-	/*
-	 * vma specific semaphore used for pmd sharing and fault/truncation
-	 * synchronization
-	 */
-	hugetlb_vma_lock_alloc(vma);
-
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
 	 * attempt will be made for VM_NORESERVE to allocate a page
@@ -7285,9 +7299,9 @@ long hugetlb_reserve_pages(struct inode *inode,
 	 * Shared mappings base their reservation on the number of pages that
 	 * are already allocated on behalf of the file. Private mappings need
 	 * to reserve the full area even if read-only as mprotect() may be
-	 * called to make the mapping read-write. Assume !vma is a shm mapping
+	 * called to make the mapping read-write. Assume !desc is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+	if (!desc || desc->vm_flags & VM_MAYSHARE) {
 		/*
 		 * resv_map can not be NULL as hugetlb_reserve_pages is only
 		 * called for inodes for which resv_maps were created (see
@@ -7304,8 +7318,8 @@ long hugetlb_reserve_pages(struct inode *inode,
 
 		chg = to - from;
 
-		set_vma_resv_map(vma, resv_map);
-		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
+		set_vma_desc_resv_map(desc, resv_map);
+		set_vma_desc_resv_flags(desc, HPAGE_RESV_OWNER);
 	}
 
 	if (chg < 0)
@@ -7315,7 +7329,7 @@ long hugetlb_reserve_pages(struct inode *inode,
 				chg * pages_per_huge_page(h), &h_cg) < 0)
 		goto out_err;
 
-	if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
+	if (desc && !(desc->vm_flags & VM_MAYSHARE) && h_cg) {
 		/* For private mappings, the hugetlb_cgroup uncharge info hangs
 		 * of the resv_map.
 		 */
@@ -7349,7 +7363,7 @@ long hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+	if (!desc || desc->vm_flags & VM_MAYSHARE) {
 		add = region_add(resv_map, from, to, regions_needed, h, h_cg);
 
 		if (unlikely(add < 0)) {
@@ -7403,16 +7417,15 @@ long hugetlb_reserve_pages(struct inode *inode,
 	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
 					    chg * pages_per_huge_page(h), h_cg);
 out_err:
-	hugetlb_vma_lock_free(vma);
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
+	if (!desc || desc->vm_flags & VM_MAYSHARE)
 		/* Only call region_abort if the region_chg succeeded but the
 		 * region_add failed or didn't run.
 		 */
 		if (chg >= 0 && add < 0)
 			region_abort(resv_map, from, to, regions_needed);
-	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+	if (desc && is_vma_desc_resv_set(desc, HPAGE_RESV_OWNER)) {
 		kref_put(&resv_map->refs, resv_map_release);
-		set_vma_resv_map(vma, NULL);
+		set_vma_desc_resv_map(desc, NULL);
 	}
 	return chg < 0 ? chg : add < 0 ? add : -EINVAL;
 }
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 11/13] mm: update mem char driver to use mmap_prepare
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (9 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 10/13] mm/hugetlbfs: update hugetlbfs to use mmap_prepare Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 17:40   ` Jason Gunthorpe
  2025-09-16 14:11 ` [PATCH v3 12/13] mm: update resctrl " Lorenzo Stoakes
  2025-09-16 14:11 ` [PATCH v3 13/13] iommufd: update " Lorenzo Stoakes
  12 siblings, 1 reply; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Update the mem char driver (backing /dev/mem and /dev/zero) to use the
f_op->mmap_prepare hook rather than the deprecated f_op->mmap hook.

The /dev/zero implementation has a unique and rather concerning
characteristic in that it marks MAP_PRIVATE mmap() mappings as anonymous
when they are, in fact, not.

The new f_op->mmap_prepare() can support this, but rather than introducing
a helper function to perform this hack (and risk introducing other users),
simply set desc->vm_ops to NULL here and add a comment describing what's
going on.

We also introduce shmem_zero_setup_desc() to allow for the shared mapping
case via an f_op->mmap_prepare() hook, and generalise the code between
this and shmem_zero_setup().

We also use the desc->action.error_hook to filter the remap error to
-EAGAIN to keep behaviour consistent.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 drivers/char/mem.c       | 76 ++++++++++++++++++++++------------------
 include/linux/shmem_fs.h |  3 +-
 mm/shmem.c               | 40 ++++++++++++++++-----
 3 files changed, 76 insertions(+), 43 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 34b815901b20..0136b82c2a29 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -304,13 +304,13 @@ static unsigned zero_mmap_capabilities(struct file *file)
 }
 
 /* can't do an in-place private mapping if there's no MMU */
-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
 {
-	return is_nommu_shared_mapping(vma->vm_flags);
+	return is_nommu_shared_mapping(desc->vm_flags);
 }
 #else
 
-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
 {
 	return 1;
 }
@@ -322,46 +322,49 @@ static const struct vm_operations_struct mmap_mem_ops = {
 #endif
 };
 
-static int mmap_mem(struct file *file, struct vm_area_struct *vma)
+static int mmap_filter_error(int err)
 {
-	size_t size = vma->vm_end - vma->vm_start;
-	phys_addr_t offset = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
+	return -EAGAIN;
+}
+
+static int mmap_mem_prepare(struct vm_area_desc *desc)
+{
+	struct file *file = desc->file;
+	const size_t size = vma_desc_size(desc);
+	const phys_addr_t offset = (phys_addr_t)desc->pgoff << PAGE_SHIFT;
 
 	/* Does it even fit in phys_addr_t? */
-	if (offset >> PAGE_SHIFT != vma->vm_pgoff)
+	if (offset >> PAGE_SHIFT != desc->pgoff)
 		return -EINVAL;
 
 	/* It's illegal to wrap around the end of the physical address space. */
 	if (offset + (phys_addr_t)size - 1 < offset)
 		return -EINVAL;
 
-	if (!valid_mmap_phys_addr_range(vma->vm_pgoff, size))
+	if (!valid_mmap_phys_addr_range(desc->pgoff, size))
 		return -EINVAL;
 
-	if (!private_mapping_ok(vma))
+	if (!private_mapping_ok(desc))
 		return -ENOSYS;
 
-	if (!range_is_allowed(vma->vm_pgoff, size))
+	if (!range_is_allowed(desc->pgoff, size))
 		return -EPERM;
 
-	if (!phys_mem_access_prot_allowed(file, vma->vm_pgoff, size,
-						&vma->vm_page_prot))
+	if (!phys_mem_access_prot_allowed(file, desc->pgoff, size,
+					  &desc->page_prot))
 		return -EINVAL;
 
-	vma->vm_page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
-						 size,
-						 vma->vm_page_prot);
+	desc->page_prot = phys_mem_access_prot(file, desc->pgoff,
+					       size,
+					       desc->page_prot);
 
-	vma->vm_ops = &mmap_mem_ops;
+	desc->vm_ops = &mmap_mem_ops;
+
+	/* Remap-pfn-range will mark the range VM_IO. */
+	mmap_action_remap_full(desc, desc->pgoff);
+	/* We filter remap errors to -EAGAIN. */
+	desc->action.error_hook = mmap_filter_error;
 
-	/* Remap-pfn-range will mark the range VM_IO */
-	if (remap_pfn_range(vma,
-			    vma->vm_start,
-			    vma->vm_pgoff,
-			    size,
-			    vma->vm_page_prot)) {
-		return -EAGAIN;
-	}
 	return 0;
 }
 
@@ -501,14 +504,18 @@ static ssize_t read_zero(struct file *file, char __user *buf,
 	return cleared;
 }
 
-static int mmap_zero(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_zero(struct vm_area_desc *desc)
 {
 #ifndef CONFIG_MMU
 	return -ENOSYS;
 #endif
-	if (vma->vm_flags & VM_SHARED)
-		return shmem_zero_setup(vma);
-	vma_set_anonymous(vma);
+	if (desc->vm_flags & VM_SHARED)
+		return shmem_zero_setup_desc(desc);
+	/*
+	 * This is a unique situation where we mark a MAP_PRIVATE mapping
+	 * of /dev/zero anonymous, despite it not being so.
+	 */
+	desc->vm_ops = NULL;
 	return 0;
 }
 
@@ -526,10 +533,11 @@ static unsigned long get_unmapped_area_zero(struct file *file,
 {
 	if (flags & MAP_SHARED) {
 		/*
-		 * mmap_zero() will call shmem_zero_setup() to create a file,
-		 * so use shmem's get_unmapped_area in case it can be huge;
-		 * and pass NULL for file as in mmap.c's get_unmapped_area(),
-		 * so as not to confuse shmem with our handle on "/dev/zero".
+		 * mmap_prepare_zero() will call shmem_zero_setup() to create a
+		 * file, so use shmem's get_unmapped_area in case it can be
+		 * huge; and pass NULL for file as in mmap.c's
+		 * get_unmapped_area(), so as not to confuse shmem with our
+		 * handle on "/dev/zero".
 		 */
 		return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
 	}
@@ -632,7 +640,7 @@ static const struct file_operations __maybe_unused mem_fops = {
 	.llseek		= memory_lseek,
 	.read		= read_mem,
 	.write		= write_mem,
-	.mmap		= mmap_mem,
+	.mmap_prepare	= mmap_mem_prepare,
 	.open		= open_mem,
 #ifndef CONFIG_MMU
 	.get_unmapped_area = get_unmapped_area_mem,
@@ -668,7 +676,7 @@ static const struct file_operations zero_fops = {
 	.write_iter	= write_iter_zero,
 	.splice_read	= copy_splice_read,
 	.splice_write	= splice_write_zero,
-	.mmap		= mmap_zero,
+	.mmap_prepare	= mmap_prepare_zero,
 	.get_unmapped_area = get_unmapped_area_zero,
 #ifndef CONFIG_MMU
 	.mmap_capabilities = zero_mmap_capabilities,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0e47465ef0fd..5b368f9549d6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -94,7 +94,8 @@ extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
 					    unsigned long flags);
 extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
 		const char *name, loff_t size, unsigned long flags);
-extern int shmem_zero_setup(struct vm_area_struct *);
+int shmem_zero_setup(struct vm_area_struct *vma);
+int shmem_zero_setup_desc(struct vm_area_desc *desc);
 extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
diff --git a/mm/shmem.c b/mm/shmem.c
index df02a2e0ebbb..c5744c711f6c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5893,14 +5893,9 @@ struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name,
 }
 EXPORT_SYMBOL_GPL(shmem_file_setup_with_mnt);
 
-/**
- * shmem_zero_setup - setup a shared anonymous mapping
- * @vma: the vma to be mmapped is prepared by do_mmap
- */
-int shmem_zero_setup(struct vm_area_struct *vma)
+static struct file *__shmem_zero_setup(unsigned long start, unsigned long end, vm_flags_t vm_flags)
 {
-	struct file *file;
-	loff_t size = vma->vm_end - vma->vm_start;
+	loff_t size = end - start;
 
 	/*
 	 * Cloning a new file under mmap_lock leads to a lock ordering conflict
@@ -5908,7 +5903,17 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	 * accessible to the user through its mapping, use S_PRIVATE flag to
 	 * bypass file security, in the same way as shmem_kernel_file_setup().
 	 */
-	file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags);
+	return shmem_kernel_file_setup("dev/zero", size, vm_flags);
+}
+
+/**
+ * shmem_zero_setup - setup a shared anonymous mapping
+ * @vma: the vma to be mmapped is prepared by do_mmap
+ */
+int shmem_zero_setup(struct vm_area_struct *vma)
+{
+	struct file *file = __shmem_zero_setup(vma->vm_start, vma->vm_end, vma->vm_flags);
+
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
@@ -5920,6 +5925,25 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	return 0;
 }
 
+/**
+ * shmem_zero_setup_desc - same as shmem_zero_setup(), but takes a VMA
+ * descriptor for convenience.
+ * @desc: Describes the VMA.
+ * Returns: 0 on success, or an error code.
+ */
+int shmem_zero_setup_desc(struct vm_area_desc *desc)
+{
+	struct file *file = __shmem_zero_setup(desc->start, desc->end, desc->vm_flags);
+
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	desc->vm_file = file;
+	desc->vm_ops = &shmem_anon_vm_ops;
+
+	return 0;
+}
+
 /**
  * shmem_read_folio_gfp - read into page cache, using specified page allocation flags.
  * @mapping:	the folio's address_space
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 12/13] mm: update resctrl to use mmap_prepare
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (10 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 11/13] mm: update mem char driver " Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 17:40   ` Jason Gunthorpe
  2025-09-16 14:11 ` [PATCH v3 13/13] iommufd: update " Lorenzo Stoakes
  12 siblings, 1 reply; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Make use of the ability to specify a remap action within mmap_prepare to
update the resctrl pseudo-lock to use mmap_prepare in favour of the
deprecated mmap hook.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
---
 fs/resctrl/pseudo_lock.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c
index 87bbc2605de1..0bfc13c5b96d 100644
--- a/fs/resctrl/pseudo_lock.c
+++ b/fs/resctrl/pseudo_lock.c
@@ -995,10 +995,11 @@ static const struct vm_operations_struct pseudo_mmap_ops = {
 	.mremap = pseudo_lock_dev_mremap,
 };
 
-static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
+static int pseudo_lock_dev_mmap_prepare(struct vm_area_desc *desc)
 {
-	unsigned long vsize = vma->vm_end - vma->vm_start;
-	unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long off = desc->pgoff << PAGE_SHIFT;
+	unsigned long vsize = vma_desc_size(desc);
+	struct file *filp = desc->file;
 	struct pseudo_lock_region *plr;
 	struct rdtgroup *rdtgrp;
 	unsigned long physical;
@@ -1043,7 +1044,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * Ensure changes are carried directly to the memory being mapped,
 	 * do not allow copy-on-write mapping.
 	 */
-	if (!(vma->vm_flags & VM_SHARED)) {
+	if (!(desc->vm_flags & VM_SHARED)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
@@ -1055,12 +1056,9 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	memset(plr->kmem + off, 0, vsize);
 
-	if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff,
-			    vsize, vma->vm_page_prot)) {
-		mutex_unlock(&rdtgroup_mutex);
-		return -EAGAIN;
-	}
-	vma->vm_ops = &pseudo_mmap_ops;
+	desc->vm_ops = &pseudo_mmap_ops;
+	mmap_action_remap_full(desc, physical + desc->pgoff);
+
 	mutex_unlock(&rdtgroup_mutex);
 	return 0;
 }
@@ -1071,7 +1069,7 @@ static const struct file_operations pseudo_lock_dev_fops = {
 	.write =	NULL,
 	.open =		pseudo_lock_dev_open,
 	.release =	pseudo_lock_dev_release,
-	.mmap =		pseudo_lock_dev_mmap,
+	.mmap_prepare =	pseudo_lock_dev_mmap_prepare,
 };
 
 int rdt_pseudo_lock_init(void)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 13/13] iommufd: update to use mmap_prepare
  2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (11 preceding siblings ...)
  2025-09-16 14:11 ` [PATCH v3 12/13] mm: update resctrl " Lorenzo Stoakes
@ 2025-09-16 14:11 ` Lorenzo Stoakes
  2025-09-16 15:40   ` Jason Gunthorpe
  12 siblings, 1 reply; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe, iommu, Kevin Tian,
	Will Deacon, Robin Murphy

Make use of the new mmap_prepare functionality to perform an I/O remap in
favour of the deprecated f_op->mmap hook, hooking the success path to
correctly update the owner's users refcount.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 drivers/iommu/iommufd/main.c | 47 ++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 15af7ced0501..b8b9c0e7520d 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -535,46 +535,51 @@ static const struct vm_operations_struct iommufd_vma_ops = {
 	.close = iommufd_fops_vma_close,
 };
 
+static int iommufd_fops_mmap_success(const struct vm_area_struct *vma)
+{
+	struct iommufd_mmap *immap = vma->vm_private_data;
+
+	/* vm_ops.open won't be called for mmap itself. */
+	refcount_inc(&immap->owner->users);
+
+	return 0;
+}
+
 /* The vm_pgoff must be pre-allocated from mt_mmap, and given to user space */
-static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+static int iommufd_fops_mmap_prepare(struct vm_area_desc *desc)
 {
+	struct file *filp = desc->file;
 	struct iommufd_ctx *ictx = filp->private_data;
-	size_t length = vma->vm_end - vma->vm_start;
+	const size_t length = vma_desc_size(desc);
 	struct iommufd_mmap *immap;
-	int rc;
 
 	if (!PAGE_ALIGNED(length))
 		return -EINVAL;
-	if (!(vma->vm_flags & VM_SHARED))
+	if (!(desc->vm_flags & VM_SHARED))
 		return -EINVAL;
-	if (vma->vm_flags & VM_EXEC)
+	if (desc->vm_flags & VM_EXEC)
 		return -EPERM;
 
-	/* vma->vm_pgoff carries a page-shifted start position to an immap */
-	immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff << PAGE_SHIFT);
+	/* desc->pgoff carries a page-shifted start position to an immap */
+	immap = mtree_load(&ictx->mt_mmap, desc->pgoff << PAGE_SHIFT);
 	if (!immap)
 		return -ENXIO;
 	/*
 	 * mtree_load() returns the immap for any contained mmio_addr, so only
 	 * allow the exact immap thing to be mapped
 	 */
-	if (vma->vm_pgoff != immap->vm_pgoff || length != immap->length)
+	if (desc->pgoff != immap->vm_pgoff || length != immap->length)
 		return -ENXIO;
 
-	vma->vm_pgoff = 0;
-	vma->vm_private_data = immap;
-	vma->vm_ops = &iommufd_vma_ops;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	desc->pgoff = 0;
+	desc->private_data = immap;
+	desc->vm_ops = &iommufd_vma_ops;
+	desc->page_prot = pgprot_noncached(desc->page_prot);
 
-	rc = io_remap_pfn_range(vma, vma->vm_start,
-				immap->mmio_addr >> PAGE_SHIFT, length,
-				vma->vm_page_prot);
-	if (rc)
-		return rc;
+	mmap_action_ioremap_full(desc, immap->mmio_addr >> PAGE_SHIFT);
+	desc->action.success_hook = iommufd_fops_mmap_success;
 
-	/* vm_ops.open won't be called for mmap itself. */
-	refcount_inc(&immap->owner->users);
-	return rc;
+	return 0;
 }
 
 static const struct file_operations iommufd_fops = {
@@ -582,7 +587,7 @@ static const struct file_operations iommufd_fops = {
 	.open = iommufd_fops_open,
 	.release = iommufd_fops_release,
 	.unlocked_ioctl = iommufd_fops_ioctl,
-	.mmap = iommufd_fops_mmap,
+	.mmap_prepare = iommufd_fops_mmap_prepare,
 };
 
 /**
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 13/13] iommufd: update to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 13/13] iommufd: update " Lorenzo Stoakes
@ 2025-09-16 15:40   ` Jason Gunthorpe
  2025-09-16 16:23     ` Lorenzo Stoakes
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 15:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:59PM +0100, Lorenzo Stoakes wrote:

> -static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
> +static int iommufd_fops_mmap_prepare(struct vm_area_desc *desc)
>  {
> +	struct file *filp = desc->file;
>  	struct iommufd_ctx *ictx = filp->private_data;
> -	size_t length = vma->vm_end - vma->vm_start;
> +	const size_t length = vma_desc_size(desc);
>  	struct iommufd_mmap *immap;
> -	int rc;
>  
>  	if (!PAGE_ALIGNED(length))
>  		return -EINVAL;

This is for sure redundant? I.e. vma_desc_size() is always a multiple of
the page size? Let's drop it

> -	if (!(vma->vm_flags & VM_SHARED))
> +	if (!(desc->vm_flags & VM_SHARED))
>  		return -EINVAL;

This should be that no COW helper David found

> -	/* vma->vm_pgoff carries a page-shifted start position to an immap */
> -	immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff << PAGE_SHIFT);
> +	/* desc->pgoff carries a page-shifted start position to an immap */
> +	immap = mtree_load(&ictx->mt_mmap, desc->pgoff << PAGE_SHIFT);
>  	if (!immap)
>  		return -ENXIO;
>  	/*
>  	 * mtree_load() returns the immap for any contained mmio_addr, so only
>  	 * allow the exact immap thing to be mapped
>  	 */
> -	if (vma->vm_pgoff != immap->vm_pgoff || length != immap->length)
> +	if (desc->pgoff != immap->vm_pgoff || length != immap->length)
>  		return -ENXIO;
>  
> -	vma->vm_pgoff = 0;

I think this is an existing bug, I must have missed it when I reviewed
this. If we drop it then the vma will naturally get pgoff right?

> -	vma->vm_private_data = immap;
> -	vma->vm_ops = &iommufd_vma_ops;
> -	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	desc->pgoff = 0;
> +	desc->private_data = immap;
> +	desc->vm_ops = &iommufd_vma_ops;
> +	desc->page_prot = pgprot_noncached(desc->page_prot);
>  
> -	rc = io_remap_pfn_range(vma, vma->vm_start,
> -				immap->mmio_addr >> PAGE_SHIFT, length,
> -				vma->vm_page_prot);
> -	if (rc)
> -		return rc;
> +	mmap_action_ioremap_full(desc, immap->mmio_addr >> PAGE_SHIFT);
> +	desc->action.success_hook = iommufd_fops_mmap_success;
>  
> -	/* vm_ops.open won't be called for mmap itself. */
> -	refcount_inc(&immap->owner->users);

Ooh, this is a racy existing bug, I'm going to send a patch for it
right now... So success_hook won't work here.

@@ -551,15 +551,24 @@ static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
                return -EPERM;
 
        /* vma->vm_pgoff carries a page-shifted start position to an immap */
+       mtree_lock(&ictx->mt_mmap);
        immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff << PAGE_SHIFT);
-       if (!immap)
+       if (!immap) {
+               mtree_unlock(&ictx->mt_mmap);
                return -ENXIO;
+       }
+       /* vm_ops.open won't be called for mmap itself. */
+       refcount_inc(&immap->owner->users);
+       mtree_unlock(&ictx->mt_mmap);
+
        /*
         * mtree_load() returns the immap for any contained mmio_addr, so only
         * allow the exact immap thing to be mapped
         */
-       if (vma->vm_pgoff != immap->vm_pgoff || length != immap->length)
-               return -ENXIO;
+       if (vma->vm_pgoff != immap->vm_pgoff || length != immap->length) {
+               rc = -ENXIO;
+               goto err_refcount;
+       }
 
        vma->vm_pgoff = 0;
        vma->vm_private_data = immap;
@@ -570,10 +579,11 @@ static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
                                immap->mmio_addr >> PAGE_SHIFT, length,
                                vma->vm_page_prot);
        if (rc)
-               return rc;
+               goto err_refcount;
+       return 0;
 
-       /* vm_ops.open won't be called for mmap itself. */
-       refcount_inc(&immap->owner->users);
+err_refcount:
+       refcount_dec(&immap->owner->users);
        return rc;
 }


* Re: [PATCH v3 13/13] iommufd: update to use mmap_prepare
  2025-09-16 15:40   ` Jason Gunthorpe
@ 2025-09-16 16:23     ` Lorenzo Stoakes
  2025-09-17  1:32       ` Andrew Morton
  0 siblings, 1 reply; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 16:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

Andrew - Jason has sent a conflicting patch against this file, so it's no
longer reasonable to include it in this series; please drop it.

Sigh.

Thanks, Lorenzo


* Re: [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
@ 2025-09-16 16:42   ` Jason Gunthorpe
  2025-09-17 10:30   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 16:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:47PM +0100, Lorenzo Stoakes wrote:
> This simply assigns the vm_ops so is easily updated - do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  mm/shmem.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 02/13] device/dax: update devdax to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 02/13] device/dax: update devdax " Lorenzo Stoakes
@ 2025-09-16 16:43   ` Jason Gunthorpe
  2025-09-17 10:37   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 16:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:48PM +0100, Lorenzo Stoakes wrote:
> The devdax driver does nothing special in its f_op->mmap hook, so
> straightforwardly update it to use the mmap_prepare hook instead.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  drivers/dax/device.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-16 14:11 ` [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
@ 2025-09-16 16:46   ` Jason Gunthorpe
  2025-09-17 10:39   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 16:46 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:49PM +0100, Lorenzo Stoakes wrote:
> It's useful to be able to determine the size of a VMA descriptor range
> used on f_op->mmap_prepare, expressed both in bytes and pages, so add
> helpers for both and update code that could make use of it to do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  fs/ntfs3/file.c    |  2 +-
>  include/linux/mm.h | 10 ++++++++++
>  mm/secretmem.c     |  2 +-
>  3 files changed, 12 insertions(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 04/13] relay: update relay to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 04/13] relay: update relay to use mmap_prepare Lorenzo Stoakes
@ 2025-09-16 16:48   ` Jason Gunthorpe
  2025-09-17 10:41   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 16:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:50PM +0100, Lorenzo Stoakes wrote:
> It is relatively trivial to update this code to use the f_op->mmap_prepare
> hook in favour of the deprecated f_op->mmap hook, so do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  kernel/relay.c | 33 +++++++++++++++++----------------
>  1 file changed, 17 insertions(+), 16 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion
  2025-09-16 14:11 ` [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion Lorenzo Stoakes
@ 2025-09-16 16:48   ` Jason Gunthorpe
  2025-09-17 10:49   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 16:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:51PM +0100, Lorenzo Stoakes wrote:
> Now we have the f_op->mmap_prepare() hook, having a static function called
> __mmap_prepare() that has nothing to do with it is confusing, so rename
> the function to __mmap_setup().
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/vma.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-16 14:11 ` [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
@ 2025-09-16 17:07   ` Jason Gunthorpe
  2025-09-16 17:37     ` Lorenzo Stoakes
  2025-09-17 11:07   ` Pedro Falcato
  1 sibling, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:07 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:52PM +0100, Lorenzo Stoakes wrote:
> We need the ability to split PFN remap between updating the VMA and
> performing the actual remap, in order to do away with the legacy
> f_op->mmap hook.
> 
> To do so, update the PFN remap code to provide shared logic, and also make
> remap_pfn_range_notrack() static, as its one user, io_mapping_map_user()
> was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").
> 
> Then, introduce remap_pfn_range_prepare(), which accepts VMA descriptor
> and PFN parameters, and remap_pfn_range_complete() which accepts the same
> parameters as remap_pfn_range().
> 
> remap_pfn_range_prepare() will set the cow vma->vm_pgoff if necessary, so
> it must be supplied with a correct PFN to do so.  If the caller must hold
> locks to be able to do this, those locks should be held across the
> operation, and mmap_abort() should be provided to revoke the lock should
> an error arise.

It looks like the patches have changed to remove mmap_abort so this
paragraph can probably be dropped.

>  static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr,
> -		unsigned long pfn, unsigned long size, pgprot_t prot)
> +		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
>  {
>  	pgd_t *pgd;
>  	unsigned long next;
> @@ -2912,32 +2931,17 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
>  	if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
>  		return -EINVAL;
>  
> -	/*
> -	 * Physically remapped pages are special. Tell the
> -	 * rest of the world about it:
> -	 *   VM_IO tells people not to look at these pages
> -	 *	(accesses can have side effects).
> -	 *   VM_PFNMAP tells the core MM that the base pages are just
> -	 *	raw PFN mappings, and do not have a "struct page" associated
> -	 *	with them.
> -	 *   VM_DONTEXPAND
> -	 *      Disable vma merging and expanding with mremap().
> -	 *   VM_DONTDUMP
> -	 *      Omit vma from core dump, even when VM_IO turned off.
> -	 *
> -	 * There's a horrible special case to handle copy-on-write
> -	 * behaviour that some programs depend on. We mark the "original"
> -	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
> -	 * See vm_normal_page() for details.
> -	 */
> -	if (is_cow_mapping(vma->vm_flags)) {
> -		if (addr != vma->vm_start || end != vma->vm_end)
> -			return -EINVAL;
> -		vma->vm_pgoff = pfn;
> +	if (set_vma) {
> +		err = get_remap_pgoff(vma->vm_flags, addr, end,
> +				      vma->vm_start, vma->vm_end,
> +				      pfn, &vma->vm_pgoff);
> +		if (err)
> +			return err;
> +		vm_flags_set(vma, VM_REMAP_FLAGS);
> +	} else {
> +		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);
>  	}

It looks like you can avoid the changes to add set_vma by making
remap_pfn_range_internal() only do the complete portion and giving
the legacy callers a helper to do the prepare inline:

int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, unsigned long addr,
				unsigned long pfn, unsigned long size)
{
	const unsigned long end = addr + PAGE_ALIGN(size);
	int err;

	err = get_remap_pgoff(vma->vm_flags, addr, end,
			      vma->vm_start, vma->vm_end,
			      pfn, &vma->vm_pgoff);
	if (err)
		return err;
	vm_flags_set(vma, VM_REMAP_FLAGS);
	return 0;
}

int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
		    unsigned long pfn, unsigned long size, pgprot_t prot)
{
	int err;

	err = remap_pfn_range_prepare_vma(vma, addr, pfn, size);
	if (err)
		return err;

	if (IS_ENABLED(__HAVE_PFNMAP_TRACKING))
		return remap_pfn_range_track(vma, addr, pfn, size, prot);
	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
}

(fix pgtable_types.h to #define it to 1 so IS_ENABLED() works)
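
I.e., assuming it is currently defined empty, something like:

-#define __HAVE_PFNMAP_TRACKING
+#define __HAVE_PFNMAP_TRACKING 1

since IS_ENABLED() only evaluates true for macros defined to 1.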

But the logic here is all fine

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]()
  2025-09-16 14:11 ` [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]() Lorenzo Stoakes
@ 2025-09-16 17:19   ` Jason Gunthorpe
  2025-09-16 17:34     ` Lorenzo Stoakes
  2025-09-17 11:12   ` Pedro Falcato
  1 sibling, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:53PM +0100, Lorenzo Stoakes wrote:
>  
> -int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
> -		unsigned long pfn, unsigned long size, pgprot_t prot)
> +static unsigned long calc_pfn(unsigned long pfn, unsigned long size)
>  {
>  	phys_addr_t phys_addr = fixup_bigphys_addr(pfn << PAGE_SHIFT, size);
>  
> -	return remap_pfn_range(vma, vaddr, phys_addr >> PAGE_SHIFT, size, prot);
> +	return phys_addr >> PAGE_SHIFT;
> +}

Given you changed all of these to add a calc_pfn why not make that
the arch abstraction?

static unsigned long arch_io_remap_remap_pfn(unsigned long pfn, unsigned long size)
{
..
}
#define arch_io_remap_remap_pfn arch_io_remap_remap_pfn

[..]

#ifndef arch_io_remap_remap_pfn
static inline unsigned long arch_io_remap_remap_pfn(unsigned long pfn, unsigned long size)
{
	return pfn;
}
#endif

static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
	unsigned long size)
{
	remap_pfn_range_prepare(desc, arch_io_remap_remap_pfn(pfn, size));
}

etc

Removes a lot of the maze here.

Jason


* Re: [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc
  2025-09-16 14:11 ` [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc Lorenzo Stoakes
@ 2025-09-16 17:28   ` Jason Gunthorpe
  2025-09-16 17:57     ` Lorenzo Stoakes
  2025-09-17 11:32   ` Pedro Falcato
  1 sibling, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:28 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:54PM +0100, Lorenzo Stoakes wrote:

> +/* What action should be taken after an .mmap_prepare call is complete? */
> +enum mmap_action_type {
> +	MMAP_NOTHING,		/* Mapping is complete, no further action. */
> +	MMAP_REMAP_PFN,		/* Remap PFN range. */

Seems like it would be a bit tidier to include MMAP_IO_REMAP_PFN here
instead of having the is_io_remap bool.
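
I.e. a sketch:

enum mmap_action_type {
	MMAP_NOTHING,		/* Mapping is complete, no further action. */
	MMAP_REMAP_PFN,		/* Remap PFN range. */
	MMAP_IO_REMAP_PFN,	/* I/O remap PFN range. */
};

Then the switch in mmap_action_complete() can pick
[io_]remap_pfn_range_complete() based on the action type alone.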

> @@ -1155,15 +1155,18 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
>  		.vm_file = vma->vm_file,
>  		.vm_flags = vma->vm_flags,
>  		.page_prot = vma->vm_page_prot,
> +
> +		.action.type = MMAP_NOTHING, /* Default */
>  	};
>  	int err;
>  
>  	err = f_op->mmap_prepare(&desc);
>  	if (err)
>  		return err;
> -	set_vma_from_desc(vma, &desc);
>  
> -	return 0;
> +	mmap_action_prepare(&desc.action, &desc);
> +	set_vma_from_desc(vma, &desc);
> +	return mmap_action_complete(&desc.action, vma);
>  }
>  EXPORT_SYMBOL(__compat_vma_mmap_prepare);

A function called prepare that now calls complete has become a bit oddly named??

> +int mmap_action_complete(struct mmap_action *action,
> +			 struct vm_area_struct *vma)
> +{
> +	int err = 0;
> +
> +	switch (action->type) {
> +	case MMAP_NOTHING:
> +		break;
> +	case MMAP_REMAP_PFN:
> +		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) !=
> +				VM_REMAP_FLAGS);

This is checked in remap_pfn_range_complete() IIRC? Probably not
needed here as well then.

> +		if (action->remap.is_io_remap)
> +			err = io_remap_pfn_range_complete(vma, action->remap.start,
> +				action->remap.start_pfn, action->remap.size,
> +				action->remap.pgprot);
> +		else
> +			err = remap_pfn_range_complete(vma, action->remap.start,
> +				action->remap.start_pfn, action->remap.size,
> +				action->remap.pgprot);
> +		break;
> +	}
> +
> +	/*
> +	 * If an error occurs, unmap the VMA altogether and return an error. We
> +	 * only clear the newly allocated VMA, since this function is only
> +	 * invoked if we do NOT merge, so we only clean up the VMA we created.
> +	 */
> +	if (err) {
> +		const size_t len = vma_pages(vma) << PAGE_SHIFT;
> +
> +		do_munmap(current->mm, vma->vm_start, len, NULL);
> +
> +		if (action->error_hook) {
> +			/* We may want to filter the error. */
> +			err = action->error_hook(err);
> +
> +			/* The caller should not clear the error. */
> +			VM_WARN_ON_ONCE(!err);
> +		}
> +		return err;
> +	}
> +
> +	if (action->success_hook)
> +		err = action->success_hook(vma);
> +
> +	return err;

I would write this as

	if (action->success_hook)
		return action->success_hook(vma);

	return 0;

Just for emphasis this is the success path.

> +int mmap_action_complete(struct mmap_action *action,
> +			struct vm_area_struct *vma)
> +{
> +	int err = 0;
> +
> +	switch (action->type) {
> +	case MMAP_NOTHING:
> +		break;
> +	case MMAP_REMAP_PFN:
> +		WARN_ON_ONCE(1); /* nommu cannot handle this. */
> +
> +		break;
> +	}
> +
> +	/*
> +	 * If an error occurs, unmap the VMA altogether and return an error. We
> +	 * only clear the newly allocated VMA, since this function is only
> +	 * invoked if we do NOT merge, so we only clean up the VMA we created.
> +	 */
> +	if (err) {
> +		const size_t len = vma_pages(vma) << PAGE_SHIFT;
> +
> +		do_munmap(current->mm, vma->vm_start, len, NULL);
> +
> +		if (action->error_hook) {
> +			/* We may want to filter the error. */
> +			err = action->error_hook(err);
> +
> +			/* The caller should not clear the error. */
> +			VM_WARN_ON_ONCE(!err);
> +		}
> +		return err;
> +	}

err is never !0 here, so this should go to a later patch/series.

Also seems like this cleanup wants to be in a function that is not
protected by #ifdef nommu since the code is identical on both branches.

> +	if (action->success_hook)
> +		err = action->success_hook(vma);
> +
> +	return 0;

return err, though prefer to match above, and probably this sequence
should be pulled into the same shared function as above too.

Jason


* Re: [PATCH v3 10/13] mm/hugetlbfs: update hugetlbfs to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 10/13] mm/hugetlbfs: update hugetlbfs to use mmap_prepare Lorenzo Stoakes
@ 2025-09-16 17:30   ` Jason Gunthorpe
  0 siblings, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:56PM +0100, Lorenzo Stoakes wrote:
> Since we can now perform actions after the VMA is established via
> mmap_prepare, use desc->action_success_hook to set up the hugetlb lock
> once the VMA is set up.
> 
> We also make changes throughout hugetlbfs to make this possible.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  fs/hugetlbfs/inode.c           | 36 ++++++++++------
>  include/linux/hugetlb.h        |  9 +++-
>  include/linux/hugetlb_inline.h | 15 ++++---
>  mm/hugetlb.c                   | 77 ++++++++++++++++++++--------------
>  4 files changed, 85 insertions(+), 52 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]()
  2025-09-16 17:19   ` Jason Gunthorpe
@ 2025-09-16 17:34     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 17:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 02:19:30PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 03:11:53PM +0100, Lorenzo Stoakes wrote:
> >
> > -int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
> > -		unsigned long pfn, unsigned long size, pgprot_t prot)
> > +static unsigned long calc_pfn(unsigned long pfn, unsigned long size)
> >  {
> >  	phys_addr_t phys_addr = fixup_bigphys_addr(pfn << PAGE_SHIFT, size);
> >
> > -	return remap_pfn_range(vma, vaddr, phys_addr >> PAGE_SHIFT, size, prot);
> > +	return phys_addr >> PAGE_SHIFT;
> > +}
>
> Given you changed all of these to add a calc_pfn why not make that
> the arch abstraction?

OK that's reasonable, will do.

>
> static unsigned long arch_io_remap_remap_pfn(unsigned long pfn, unsigned long size)
> {
> ..
> }
> #define arch_io_remap_remap_pfn arch_io_remap_remap_pfn
>
> [..]
>
> #ifndef arch_io_remap_remap_pfn
> static inline unsigned long arch_io_remap_remap_pfn(unsigned long pfn, unsigned long size)
> {
> 	return pfn;
> }
> #endif
>
> static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
> 	unsigned long size)
> {
> 	remap_pfn_range_prepare(desc, arch_io_remap_remap_pfn(pfn, size));
> }
>
> etc
>
> Removes a lot of the maze here.

Actually nice to restrict what arches can do here also... :)

>
> Jason


* Re: [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-16 17:07   ` Jason Gunthorpe
@ 2025-09-16 17:37     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 17:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 02:07:23PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 03:11:52PM +0100, Lorenzo Stoakes wrote:
> > We need the ability to split PFN remap between updating the VMA and
> > performing the actual remap, in order to do away with the legacy
> > f_op->mmap hook.
> >
> > To do so, update the PFN remap code to provide shared logic, and also make
> > remap_pfn_range_notrack() static, as its one user, io_mapping_map_user()
> > was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").
> >
> > Then, introduce remap_pfn_range_prepare(), which accepts VMA descriptor
> > and PFN parameters, and remap_pfn_range_complete() which accepts the same
> > parameters as remap_pfn_range().
> >
> > remap_pfn_range_prepare() will set the cow vma->vm_pgoff if necessary, so
> > it must be supplied with a correct PFN to do so.  If the caller must hold
> > locks to be able to do this, those locks should be held across the
> > operation, and mmap_abort() should be provided to revoke the lock should
> > an error arise.
>
> It looks like the patches have changed to remove mmap_abort so this
> paragraph can probably be dropped.

Ugh, thought I'd caught all of these, oops. Will fixup...

>
> >  static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr,
> > -		unsigned long pfn, unsigned long size, pgprot_t prot)
> > +		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
> >  {
> >  	pgd_t *pgd;
> >  	unsigned long next;
> > @@ -2912,32 +2931,17 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
> >  	if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
> >  		return -EINVAL;
> >
> > -	/*
> > -	 * Physically remapped pages are special. Tell the
> > -	 * rest of the world about it:
> > -	 *   VM_IO tells people not to look at these pages
> > -	 *	(accesses can have side effects).
> > -	 *   VM_PFNMAP tells the core MM that the base pages are just
> > -	 *	raw PFN mappings, and do not have a "struct page" associated
> > -	 *	with them.
> > -	 *   VM_DONTEXPAND
> > -	 *      Disable vma merging and expanding with mremap().
> > -	 *   VM_DONTDUMP
> > -	 *      Omit vma from core dump, even when VM_IO turned off.
> > -	 *
> > -	 * There's a horrible special case to handle copy-on-write
> > -	 * behaviour that some programs depend on. We mark the "original"
> > -	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
> > -	 * See vm_normal_page() for details.
> > -	 */
> > -	if (is_cow_mapping(vma->vm_flags)) {
> > -		if (addr != vma->vm_start || end != vma->vm_end)
> > -			return -EINVAL;
> > -		vma->vm_pgoff = pfn;
> > +	if (set_vma) {
> > +		err = get_remap_pgoff(vma->vm_flags, addr, end,
> > +				      vma->vm_start, vma->vm_end,
> > +				      pfn, &vma->vm_pgoff);
> > +		if (err)
> > +			return err;
> > +		vm_flags_set(vma, VM_REMAP_FLAGS);
> > +	} else {
> > +		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);
> >  	}
>
> It looks like you can avoid the changes to add set_vma by making
> remap_pfn_range_internal() only do the complete portion and giving
> the legacy callers a helper to do the prepare inline:

OK nice, yeah I would always prefer to avoid a boolean parameter if possible.

Will do something similar to below.

>
> int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, unsigned long addr,
> 				unsigned long pfn, unsigned long size)
> {
> 	const unsigned long end = addr + PAGE_ALIGN(size);
> 	int err;
>
> 	err = get_remap_pgoff(vma->vm_flags, addr, end,
> 			      vma->vm_start, vma->vm_end,
> 			      pfn, &vma->vm_pgoff);
> 	if (err)
> 		return err;
> 	vm_flags_set(vma, VM_REMAP_FLAGS);
> 	return 0;
> }
>
> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> 		    unsigned long pfn, unsigned long size, pgprot_t prot)
> {
> 	int err;
>
> 	err = remap_pfn_range_prepare_vma(vma, addr, pfn, size);
> 	if (err)
> 		return err;
>
> 	if (IS_ENABLED(__HAVE_PFNMAP_TRACKING))
> 		return remap_pfn_range_track(vma, addr, pfn, size, prot);
> 	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> }
>
> (fix pgtable_types.h to #define it to 1 so IS_ENABLED() works)
>
> But the logic here is all fine
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

>
> Jason

Cheers, Lorenzo


* Re: [PATCH v3 11/13] mm: update mem char driver to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 11/13] mm: update mem char driver " Lorenzo Stoakes
@ 2025-09-16 17:40   ` Jason Gunthorpe
  2025-09-16 18:02     ` Lorenzo Stoakes
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:57PM +0100, Lorenzo Stoakes wrote:
> Update the mem char driver (backing /dev/mem and /dev/zero) to use
> f_op->mmap_prepare hook rather than the deprecated f_op->mmap.
> 
> The /dev/zero implementation has a very unique and rather concerning
> characteristic in that it converts MAP_PRIVATE mmap() mappings to anonymous
> when they are, in fact, not.
> 
> The new f_op->mmap_prepare() can support this, but rather than introducing
> a helper function to perform this hack (and risk introducing other users),
> simply set desc->vm_op to NULL here and add a comment describing what's
> going on.
> 
> We also introduce shmem_zero_setup_desc() to allow for the shared mapping
> case via an f_op->mmap_prepare() hook, and generalise the code between
> this and shmem_zero_setup().
> 
> We also use the desc->action_error_hook to filter the remap error to
> -EAGAIN to keep behaviour consistent.

Hurm, in practice this converts reserve_pfn_range()/etc conflicts from
EINVAL into EAGAIN and converts all the unlikely OOM ENOMEM failures to
EAGAIN. Seems wrong/unnecessary to me; I wouldn't have preserved it.
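
For reference, the preserved behaviour amounts to an error hook along
these lines (the hook name is made up):

static int mem_mmap_filter_error(int err)
{
	/* Historic /dev/mem behaviour: any remap failure becomes -EAGAIN. */
	return -EAGAIN;
}

so the caller can no longer tell a PAT conflict from an OOM.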

But oh well

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 0e47465ef0fd..5b368f9549d6 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h

This little bit should probably be its own patch "Add
shmem_zero_setup_desc()", and I wonder if the caller from vma.c can
call the desc version now?

Too bad the usage in ppc is such an adventure through sysfs :\

But the code looks fine

Jason


* Re: [PATCH v3 12/13] mm: update resctl to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 12/13] mm: update resctl " Lorenzo Stoakes
@ 2025-09-16 17:40   ` Jason Gunthorpe
  2025-09-16 18:02     ` Lorenzo Stoakes
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 17:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:58PM +0100, Lorenzo Stoakes wrote:
> Make use of the ability to specify a remap action within mmap_prepare to
> update the resctl pseudo-lock to use mmap_prepare in favour of the
> deprecated mmap hook.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
>  fs/resctrl/pseudo_lock.c | 20 +++++++++-----------
>  1 file changed, 9 insertions(+), 11 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc
  2025-09-16 17:28   ` Jason Gunthorpe
@ 2025-09-16 17:57     ` Lorenzo Stoakes
  2025-09-16 18:08       ` Jason Gunthorpe
  0 siblings, 1 reply; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 17:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 02:28:36PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 03:11:54PM +0100, Lorenzo Stoakes wrote:
>
> > +/* What action should be taken after an .mmap_prepare call is complete? */
> > +enum mmap_action_type {
> > +	MMAP_NOTHING,		/* Mapping is complete, no further action. */
> > +	MMAP_REMAP_PFN,		/* Remap PFN range. */
>
> Seems like it would be a bit tidier to include MMAP_IO_REMAP_PFN here
> instead of having the is_io_remap bool.

Well, I did start with this, but it felt simpler to treat it as a remap; and
semantically it is more or less a remap, just maybe with a different PFN...

But you know, thinking about it, yeah that's probably nicer, will change.

Often with these things you are back and forth on 'hmm maybe this maybe that' :)

>
> > @@ -1155,15 +1155,18 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
> >  		.vm_file = vma->vm_file,
> >  		.vm_flags = vma->vm_flags,
> >  		.page_prot = vma->vm_page_prot,
> > +
> > +		.action.type = MMAP_NOTHING, /* Default */
> >  	};
> >  	int err;
> >
> >  	err = f_op->mmap_prepare(&desc);
> >  	if (err)
> >  		return err;
> > -	set_vma_from_desc(vma, &desc);
> >
> > -	return 0;
> > +	mmap_action_prepare(&desc.action, &desc);
> > +	set_vma_from_desc(vma, &desc);
> > +	return mmap_action_complete(&desc.action, vma);
> >  }
> >  EXPORT_SYMBOL(__compat_vma_mmap_prepare);
>
> A function called prepare that now calls complete has become a bit oddly named??

That's a very good point... :) I mean, it's sort of right, in that it is a
compatibility layer for mmap_prepare for stacked filesystems etc. that can only
(for now) call .mmap() yet are confronted with an underlying thing that has
.mmap_prepare. But it is confusing now.

Will rename.

>
> > +int mmap_action_complete(struct mmap_action *action,
> > +			 struct vm_area_struct *vma)
> > +{
> > +	int err = 0;
> > +
> > +	switch (action->type) {
> > +	case MMAP_NOTHING:
> > +		break;
> > +	case MMAP_REMAP_PFN:
> > +		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) !=
> > +				VM_REMAP_FLAGS);
>
> This is checked in remap_pfn_range_complete() IIRC? Probably not
> needed here as well then.

Ah ok will drop then.

>
> > +		if (action->remap.is_io_remap)
> > +			err = io_remap_pfn_range_complete(vma, action->remap.start,
> > +				action->remap.start_pfn, action->remap.size,
> > +				action->remap.pgprot);
> > +		else
> > +			err = remap_pfn_range_complete(vma, action->remap.start,
> > +				action->remap.start_pfn, action->remap.size,
> > +				action->remap.pgprot);
> > +		break;
> > +	}
> > +
> > +	/*
> > +	 * If an error occurs, unmap the VMA altogether and return an error. We
> > +	 * only clear the newly allocated VMA, since this function is only
> > +	 * invoked if we do NOT merge, so we only clean up the VMA we created.
> > +	 */
> > +	if (err) {
> > +		const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > +
> > +		do_munmap(current->mm, vma->vm_start, len, NULL);
> > +
> > +		if (action->error_hook) {
> > +			/* We may want to filter the error. */
> > +			err = action->error_hook(err);
> > +
> > +			/* The caller should not clear the error. */
> > +			VM_WARN_ON_ONCE(!err);
> > +		}
> > +		return err;
> > +	}
> > +
> > +	if (action->success_hook)
> > +		err = action->success_hook(vma);
> > +
> > +	return err;
>
> I would write this as
>
> 	if (action->success_hook)
> 		return action->success_hook(vma);
>
> 	return 0;
>
> Just for emphasis this is the success path.

Ack. That is nicer actually.

>
> > +int mmap_action_complete(struct mmap_action *action,
> > +			struct vm_area_struct *vma)
> > +{
> > +	int err = 0;
> > +
> > +	switch (action->type) {
> > +	case MMAP_NOTHING:
> > +		break;
> > +	case MMAP_REMAP_PFN:
> > +		WARN_ON_ONCE(1); /* nommu cannot handle this. */
> > +
> > +		break;
> > +	}
> > +
> > +	/*
> > +	 * If an error occurs, unmap the VMA altogether and return an error. We
> > +	 * only clear the newly allocated VMA, since this function is only
> > +	 * invoked if we do NOT merge, so we only clean up the VMA we created.
> > +	 */
> > +	if (err) {
> > +		const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > +
> > +		do_munmap(current->mm, vma->vm_start, len, NULL);
> > +
> > +		if (action->error_hook) {
> > +			/* We may want to filter the error. */
> > +			err = action->error_hook(err);
> > +
> > +			/* The caller should not clear the error. */
> > +			VM_WARN_ON_ONCE(!err);
> > +		}
> > +		return err;
> > +	}
>
> err is never !0 here, so this should go to a later patch/series.

Right yeah. Doh! Will drop.

>
> Also seems like this cleanup wants to be in a function that is not
> protected by #ifdef nommu since the code is identical on both branches.

Not sure which cleanup you mean, this is new code :)

I don't at all like functions that have #ifdef nommu embedded in them.

And I frankly resent that we support nommu so I'm not inclined to share code
between that and code for arches that people actually use.

Anyway, we can probably simplify this quite a bit.

	WARN_ON_ONCE(action->type != MMAP_NOTHING);
	return 0;

>
> > +	if (action->success_hook)
> > +		err = action->success_hook(vma);
> > +
> > +	return 0;
>
> return err, though prefer to match above, and probably this sequence
> should be pulled into the same shared function as above too.

Yeah I mean, you're not going to make me actually have to ack nommu properly are
you?..

I suppose we could be tasteful and have a separate 'handle hooks' function or
something here or something.

Let me put my bias aside and take a look at that.

>
> Jason

Cheers, Lorenzo


* Re: [PATCH v3 11/13] mm: update mem char driver to use mmap_prepare
  2025-09-16 17:40   ` Jason Gunthorpe
@ 2025-09-16 18:02     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 18:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 02:40:06PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 03:11:57PM +0100, Lorenzo Stoakes wrote:
> > Update the mem char driver (backing /dev/mem and /dev/zero) to use
> > f_op->mmap_prepare hook rather than the deprecated f_op->mmap.
> >
> > The /dev/zero implementation has a very unique and rather concerning
> > characteristic in that it converts MAP_PRIVATE mmap() mappings to anonymous
> > when they are, in fact, not.
> >
> > The new f_op->mmap_prepare() can support this, but rather than introducing
> > a helper function to perform this hack (and risk introducing other users),
> > simply set desc->vm_op to NULL here and add a comment describing what's
> > going on.
> >
> > We also introduce shmem_zero_setup_desc() to allow for the shared mapping
> > case via an f_op->mmap_prepare() hook, and generalise the code between
> > this and shmem_zero_setup().
> >
> > We also use the desc->action_error_hook to filter the remap error to
> > -EAGAIN to keep behaviour consistent.
>
> Hurm, in practice this converts reserve_pfn_range()/etc conflicts from
> EINVAL into EAGAIN and converts all the unlikely OOM ENOMEM failures to
> EAGAIN. Seems wrong/unnecessary to me; I wouldn't have preserved it.

Yeah, I don't love it; people sometimes get antsy about changing what the
error code is.

I'd rather pass through than filter but there we are.

>
> But oh well
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

>
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index 0e47465ef0fd..5b368f9549d6 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
>
> This little bit should probably be its own patch "Add
> shmem_zero_setup_desc()", and I wonder if the caller from vma.c can
> call the desc version now?

Ack! Let me look into whether vma.c caller can use that also.

>
> Too bad the usage in ppc is such an adventure through sysfs :\

The arch code sure can be fun!

>
> But the code looks fine

Cheers!

>
> Jason

Cheers, Lorenzo


* Re: [PATCH v3 12/13] mm: update resctl to use mmap_prepare
  2025-09-16 17:40   ` Jason Gunthorpe
@ 2025-09-16 18:02     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 18:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 02:40:27PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 03:11:58PM +0100, Lorenzo Stoakes wrote:
> > Make use of the ability to specify a remap action within mmap_prepare to
> > update the resctl pseudo-lock to use mmap_prepare in favour of the
> > deprecated mmap hook.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Acked-by: Reinette Chatre <reinette.chatre@intel.com>
> > ---
> >  fs/resctrl/pseudo_lock.c | 20 +++++++++-----------
> >  1 file changed, 9 insertions(+), 11 deletions(-)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks for this + all other tags, very much appreciated! :)

>
> Jason

Cheers, Lorenzo


* Re: [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc
  2025-09-16 17:57     ` Lorenzo Stoakes
@ 2025-09-16 18:08       ` Jason Gunthorpe
  2025-09-16 19:31         ` Lorenzo Stoakes
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 18:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 06:57:56PM +0100, Lorenzo Stoakes wrote:
> > > +	/*
> > > +	 * If an error occurs, unmap the VMA altogether and return an error. We
> > > +	 * only clear the newly allocated VMA, since this function is only
> > > +	 * invoked if we do NOT merge, so we only clean up the VMA we created.
> > > +	 */
> > > +	if (err) {
> > > +		const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > +
> > > +		do_munmap(current->mm, vma->vm_start, len, NULL);
> > > +
> > > +		if (action->error_hook) {
> > > +			/* We may want to filter the error. */
> > > +			err = action->error_hook(err);
> > > +
> > > +			/* The caller should not clear the error. */
> > > +			VM_WARN_ON_ONCE(!err);
> > > +		}
> > > +		return err;
> > > +	}
> > Also seems like this cleanup wants to be in a function that is not
> > protected by #ifdef nommu since the code is identical on both branches.
> 
> Not sure which cleanup you mean, this is new code :)

I mean the code I quoted right above that cleans up the VMA on
error. It is always the same finishing sequence; there is no nommu
dependency in it.

Just put it all in some "finish mmap complete" function and call it in
both mmu and nommu versions.
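
Roughly (helper name made up):

static int mmap_action_finish(struct mmap_action *action,
			      struct vm_area_struct *vma, int err)
{
	/*
	 * On error, unmap the newly created (never merged) VMA and let
	 * the caller filter the error code.
	 */
	if (err) {
		do_munmap(current->mm, vma->vm_start,
			  vma_pages(vma) << PAGE_SHIFT, NULL);
		if (action->error_hook) {
			err = action->error_hook(err);
			/* The caller should not clear the error. */
			VM_WARN_ON_ONCE(!err);
		}
		return err;
	}

	if (action->success_hook)
		return action->success_hook(vma);
	return 0;
}

with both the mmu and nommu mmap_action_complete() ending in
'return mmap_action_finish(action, vma, err);'.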

Jason



* Re: [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc
  2025-09-16 18:08       ` Jason Gunthorpe
@ 2025-09-16 19:31         ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 19:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:08:54PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 06:57:56PM +0100, Lorenzo Stoakes wrote:
> > > > +	/*
> > > > +	 * If an error occurs, unmap the VMA altogether and return an error. We
> > > > +	 * only clear the newly allocated VMA, since this function is only
> > > > +	 * invoked if we do NOT merge, so we only clean up the VMA we created.
> > > > +	 */
> > > > +	if (err) {
> > > > +		const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > +
> > > > +		do_munmap(current->mm, vma->vm_start, len, NULL);
> > > > +
> > > > +		if (action->error_hook) {
> > > > +			/* We may want to filter the error. */
> > > > +			err = action->error_hook(err);
> > > > +
> > > > +			/* The caller should not clear the error. */
> > > > +			VM_WARN_ON_ONCE(!err);
> > > > +		}
> > > > +		return err;
> > > > +	}
> > > Also seems like this cleanup wants to be in a function that is not
> > > protected by #ifdef nommu since the code is identical on both branches.
> >
> > Not sure which cleanup you mean, this is new code :)
>
> I mean the code I quoted right above that cleans up the VMA on
> error. It is always the same finishing sequence; there is no nommu
> dependency in it.

Ah right yeah.

I wonder if it's useful if err != 0 for nommu anyway.

>
> Just put it all in some "finish mmap complete" function and call it in
> both mmu and nommu versions.

Ack!

>
> Jason
>

Cheers, Lorenzo


* Re: [PATCH v3 13/13] iommufd: update to use mmap_prepare
  2025-09-16 16:23     ` Lorenzo Stoakes
@ 2025-09-17  1:32       ` Andrew Morton
  2025-09-17 10:13         ` Lorenzo Stoakes
  0 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2025-09-17  1:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jason Gunthorpe, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, 16 Sep 2025 17:23:31 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> Andrew - Jason has sent a conflicting patch against this file, so it's
> no longer reasonable to include it in this series; please drop it.

No probs.

All added to mm-new, thanks.  emails suppressed due to mercy.

* Re: [PATCH v3 13/13] iommufd: update to use mmap_prepare
  2025-09-17  1:32       ` Andrew Morton
@ 2025-09-17 10:13         ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17 10:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 06:32:53PM -0700, Andrew Morton wrote:
> On Tue, 16 Sep 2025 17:23:31 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > Andrew - Jason has sent a conflicting patch against this file, so it's
> > no longer reasonable to include it in this series; please drop it.
>
> No probs.
>
> All added to mm-new, thanks.  emails suppressed due to mercy.

Thanks, I should have a new respin based on Jason's feedback today (with
copious tags everywhere other than the bits I need to fix up, so we
should hopefully have this finalised very soon).

Cheers, Lorenzo

* Re: [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
  2025-09-16 16:42   ` Jason Gunthorpe
@ 2025-09-17 10:30   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 10:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:47PM +0100, Lorenzo Stoakes wrote:
> This simply assigns the vm_ops so is easily updated - do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

* Re: [PATCH v3 02/13] device/dax: update devdax to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 02/13] device/dax: update devdax " Lorenzo Stoakes
  2025-09-16 16:43   ` Jason Gunthorpe
@ 2025-09-17 10:37   ` Pedro Falcato
  2025-09-17 13:54     ` Lorenzo Stoakes
  1 sibling, 1 reply; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 10:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:48PM +0100, Lorenzo Stoakes wrote:
> The devdax driver does nothing special in its f_op->mmap hook, so
> straightforwardly update it to use the mmap_prepare hook instead.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Acked-by: Pedro Falcato <pfalcato@suse.de>

> ---
>  drivers/dax/device.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 2bb40a6060af..c2181439f925 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -13,8 +13,9 @@
>  #include "dax-private.h"
>  #include "bus.h"
>  
> -static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> -		const char *func)
> +static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
> +		       unsigned long start, unsigned long end, struct file *file,
> +		       const char *func)
>  {
>  	struct device *dev = &dev_dax->dev;
>  	unsigned long mask;
> @@ -23,7 +24,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
>  		return -ENXIO;
>  
>  	/* prevent private mappings from being established */
> -	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> +	if ((vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
>  		dev_info_ratelimited(dev,
>  				"%s: %s: fail, attempted private mapping\n",
>  				current->comm, func);
> @@ -31,15 +32,15 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
>  	}
>  
>  	mask = dev_dax->align - 1;
> -	if (vma->vm_start & mask || vma->vm_end & mask) {
> +	if (start & mask || end & mask) {
>  		dev_info_ratelimited(dev,
>  				"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
> -				current->comm, func, vma->vm_start, vma->vm_end,
> +				current->comm, func, start, end,
>  				mask);
>  		return -EINVAL;
>  	}
>  
> -	if (!vma_is_dax(vma)) {
> +	if (!file_is_dax(file)) {
>  		dev_info_ratelimited(dev,
>  				"%s: %s: fail, vma is not DAX capable\n",
>  				current->comm, func);
> @@ -49,6 +50,13 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> +		     const char *func)
> +{
> +	return __check_vma(dev_dax, vma->vm_flags, vma->vm_start, vma->vm_end,
> +			   vma->vm_file, func);
> +}
> +

Side comment: I'm no DAX expert at all, but this check_vma() thing looks... smelly?
Besides the !dax_alive() check, I don't see the need to recheck vma limits at
every ->huge_fault() call. Even taking mremap() into account,
->get_unmapped_area() should Do The Right Thing, no?

-- 
Pedro

* Re: [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-16 14:11 ` [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
  2025-09-16 16:46   ` Jason Gunthorpe
@ 2025-09-17 10:39   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 10:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:49PM +0100, Lorenzo Stoakes wrote:
> It's useful to be able to determine the size of a VMA descriptor range
> used on f_op->mmap_prepare, expressed both in bytes and pages, so add
> helpers for both and update code that could make use of it to do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Acked-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>
-- 
Pedro
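
A minimal sketch of the helpers being reviewed, assuming the descriptor
range fields are named 'start' and 'end':

	static inline unsigned long vma_desc_size(const struct vm_area_desc *desc)
	{
		return desc->end - desc->start;
	}

	static inline unsigned long vma_desc_pages(const struct vm_area_desc *desc)
	{
		return vma_desc_size(desc) >> PAGE_SHIFT;
	}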

* Re: [PATCH v3 04/13] relay: update relay to use mmap_prepare
  2025-09-16 14:11 ` [PATCH v3 04/13] relay: update relay to use mmap_prepare Lorenzo Stoakes
  2025-09-16 16:48   ` Jason Gunthorpe
@ 2025-09-17 10:41   ` Pedro Falcato
  1 sibling, 0 replies; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 10:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:50PM +0100, Lorenzo Stoakes wrote:
> It is relatively trivial to update this code to use the f_op->mmap_prepare
> hook in favour of the deprecated f_op->mmap hook, so do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

* Re: [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion
  2025-09-16 14:11 ` [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion Lorenzo Stoakes
  2025-09-16 16:48   ` Jason Gunthorpe
@ 2025-09-17 10:49   ` Pedro Falcato
  2025-09-17 13:24     ` Lorenzo Stoakes
  1 sibling, 1 reply; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 10:49 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:51PM +0100, Lorenzo Stoakes wrote:
> Now we have the f_op->mmap_prepare() hook, having a static function called
> __mmap_prepare() that has nothing to do with it is confusing, so rename
> the function to __mmap_setup().
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>

I would love to bikeshed on the new name (maybe something more descriptive?),
but I don't really mind.

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

* Re: [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-16 14:11 ` [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
  2025-09-16 17:07   ` Jason Gunthorpe
@ 2025-09-17 11:07   ` Pedro Falcato
  2025-09-17 11:16     ` Lorenzo Stoakes
  1 sibling, 1 reply; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 11:07 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:52PM +0100, Lorenzo Stoakes wrote:
> We need the ability to split PFN remap between updating the VMA and
> performing the actual remap, in order to do away with the legacy
> f_op->mmap hook.
> 
> To do so, update the PFN remap code to provide shared logic, and also make
> remap_pfn_range_notrack() static, as its one user, io_mapping_map_user()
> was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").
> 
> Then, introduce remap_pfn_range_prepare(), which accepts VMA descriptor
> and PFN parameters, and remap_pfn_range_complete() which accepts the same
> parameters as remap_pfn_rangte().
                remap_pfn_range

> 
> remap_pfn_range_prepare() will set the CoW vma->vm_pgoff if necessary, so
> it must be supplied with a correct PFN to do so.  If the caller must hold
> locks to be able to do this, those locks should be held across the
> operation, and mmap_abort() should be provided to revoke the lock should
> an error arise.
> 
> While we're here, also clean up the duplicated #ifdef
> __HAVE_PFNMAP_TRACKING check and put it into a single #ifdef/#else block.
> 
> We would prefer to define these functions in mm/internal.h, however we
> will do the same for io_remap*() and these have arch defines that require
> access to the remap functions.
>

I'm confused. What's stopping us from declaring these new functions in
internal.h? They're supposed to be used by core mm only anyway?


> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

The changes themselves look OK to me, but I'm not super familiar with these
bits anyway.

Acked-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro
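
A hedged sketch of how the split might be used, with signatures inferred
from the commit message above (mydrv_base_pfn() is a hypothetical helper):

	/* At .mmap_prepare time: sets the CoW vm_pgoff if necessary,
	 * so it must be given the correct PFN. */
	static int mydrv_mmap_prepare(struct vm_area_desc *desc)
	{
		remap_pfn_range_prepare(desc, mydrv_base_pfn(desc->file));
		return 0;
	}

	/* Later, once the VMA is fully established, the core (not the
	 * driver) performs the equivalent of: */
	err = remap_pfn_range_complete(vma, vma->vm_start,
				       mydrv_base_pfn(vma->vm_file),
				       vma->vm_end - vma->vm_start,
				       vma->vm_page_prot);

The io_remap_pfn_range_[prepare, complete]() pair introduced in the next
patch follows the same shape.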

* Re: [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]()
  2025-09-16 14:11 ` [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]() Lorenzo Stoakes
  2025-09-16 17:19   ` Jason Gunthorpe
@ 2025-09-17 11:12   ` Pedro Falcato
  2025-09-17 11:15     ` Lorenzo Stoakes
  1 sibling, 1 reply; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 11:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:53PM +0100, Lorenzo Stoakes wrote:
> We introduce the io_remap*() equivalents of remap_pfn_range_prepare() and
> remap_pfn_range_complete() to allow for I/O remapping via mmap_prepare.
> 
> We have to make some architecture-specific changes for those architectures
> which define customised handlers.
> 
> It doesn't really make sense to make this internal-only, as arches
> specify their own versions of these functions, so we declare these in mm.h.

Similar question to the remap_pfn_range patch.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Looks OK, but again, I'm no expert on this.

Acked-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

* Re: [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]()
  2025-09-17 11:12   ` Pedro Falcato
@ 2025-09-17 11:15     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17 11:15 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Wed, Sep 17, 2025 at 12:12:17PM +0100, Pedro Falcato wrote:
> On Tue, Sep 16, 2025 at 03:11:53PM +0100, Lorenzo Stoakes wrote:
> > We introduce the io_remap*() equivalents of remap_pfn_range_prepare() and
> > remap_pfn_range_complete() to allow for I/O remapping via mmap_prepare.
> >
> > We have to make some architecture-specific changes for those architectures
> > which define customised handlers.
> >
> > It doesn't really make sense to make this internal-only, as arches
> > specify their own versions of these functions, so we declare these in mm.h.
>
> Similar question to the remap_pfn_range patch.

There are arch-specific implementations, which in turn invoke the new
prepare/complete helpers.

(This answers your query here and on the remap_pfn_prepare/complete patch).

With the abstraction of the get-PFN function suggested by Jason, it may be
possible to move these over and just utilise that in internal.h/util.c.

I will look into that.

> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Looks ok, but again, i'm no expert on this.
>
> Acked-by: Pedro Falcato <pfalcato@suse.de>

Thanks!

>
> --
> Pedro

Cheers, Lorenzo

* Re: [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-17 11:07   ` Pedro Falcato
@ 2025-09-17 11:16     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17 11:16 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Wed, Sep 17, 2025 at 12:07:52PM +0100, Pedro Falcato wrote:
> On Tue, Sep 16, 2025 at 03:11:52PM +0100, Lorenzo Stoakes wrote:
> > We need the ability to split PFN remap between updating the VMA and
> > performing the actual remap, in order to do away with the legacy
> > f_op->mmap hook.
> >
> > To do so, update the PFN remap code to provide shared logic, and also make
> > remap_pfn_range_notrack() static, as its one user, io_mapping_map_user()
> > was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").
> >
> > Then, introduce remap_pfn_range_prepare(), which accepts VMA descriptor
> > and PFN parameters, and remap_pfn_range_complete() which accepts the same
> > parameters as remap_pfn_rangte().
>                 remap_pfn_range
>
> >
> > remap_pfn_range_prepare() will set the CoW vma->vm_pgoff if necessary, so
> > it must be supplied with a correct PFN to do so.  If the caller must hold
> > locks to be able to do this, those locks should be held across the
> > operation, and mmap_abort() should be provided to revoke the lock should
> > an error arise.
> >
> > While we're here, also clean up the duplicated #ifdef
> > __HAVE_PFNMAP_TRACKING check and put it into a single #ifdef/#else block.
> >
> > We would prefer to define these functions in mm/internal.h, however we
> > will do the same for io_remap*() and these have arch defines that require
> > access to the remap functions.
> >
>
> I'm confused. What's stopping us from declaring these new functions in
> internal.h? They're supposed to be used by core mm only anyway?

See reply to io_remap_pfn_range_*() patch :)

>
>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> The changes themselves look OK to me, but I'm not super familiar with these
> bits anyway.
>
> Acked-by: Pedro Falcato <pfalcato@suse.de>

Thanks! :)

>
> --
> Pedro

Cheers, Lorenzo

* Re: [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc
  2025-09-16 14:11 ` [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc Lorenzo Stoakes
  2025-09-16 17:28   ` Jason Gunthorpe
@ 2025-09-17 11:32   ` Pedro Falcato
  2025-09-17 15:34     ` Lorenzo Stoakes
  1 sibling, 1 reply; 49+ messages in thread
From: Pedro Falcato @ 2025-09-17 11:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Tue, Sep 16, 2025 at 03:11:54PM +0100, Lorenzo Stoakes wrote:
> Some drivers/filesystems need to perform additional tasks after the VMA is
> set up.  This is typically in the form of pre-population.
> 
> The forms of pre-population most likely to be performed are a PFN remap
> or the insertion of normal folios and PFNs into a mixed map.
> 
> We start by implementing the PFN remap functionality, ensuring that we
> perform the appropriate actions at the appropriate time - that is setting
> flags at the point of .mmap_prepare, and performing the actual remap at the
> point at which the VMA is fully established.
> 
> This prevents the driver from doing anything too crazy with a VMA at any
> stage, and we retain complete control over how the mm functionality is
> applied.
> 
> Unfortunately callers still do often require some kind of custom action,
> so we add an optional success/error _hook to allow the caller to do
> something after the action has succeeded or failed.

Do we have any rules in mind regarding ->mmap_prepare() and ->*_hook()?
It feels spooky to e.g. grab locks in mmap_prepare and hold them across
core mmap(). And I guess it might be needed?

> 
> This is done at the point when the VMA has already been established, so
> the harm that can be done is limited.
> 
> The error hook can be used to filter errors if necessary.
> 
> If any error arises on these final actions, we simply unmap the VMA
> altogether.
> 
> Also update the stacked filesystem compatibility layer to utilise the
> action behaviour, and update the VMA tests accordingly.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
<snip>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 31b27086586d..aa1e2003f366 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -775,6 +775,49 @@ struct pfnmap_track_ctx {
>  };
>  #endif
>  
> +/* What action should be taken after an .mmap_prepare call is complete? */
> +enum mmap_action_type {
> +	MMAP_NOTHING,		/* Mapping is complete, no further action. */
> +	MMAP_REMAP_PFN,		/* Remap PFN range. */
> +};
> +
> +/*
> + * Describes an action an mmap_prepare hook can instruct to be taken to complete
> + * the mapping of a VMA. Specified in vm_area_desc.
> + */
> +struct mmap_action {
> +	union {
> +		/* Remap range. */
> +		struct {
> +			unsigned long start;
> +			unsigned long start_pfn;
> +			unsigned long size;
> +			pgprot_t pgprot;
> +			bool is_io_remap;
> +		} remap;
> +	};
> +	enum mmap_action_type type;
> +
> +	/*
> +	 * If specified, this hook is invoked after the selected action has been
> +	 * successfully completed. Note that the VMA write lock still held.
> +	 *
> +	 * The absolute minimum ought to be done here.
> +	 *
> +	 * Returns 0 on success, or an error code.
> +	 */
> +	int (*success_hook)(const struct vm_area_struct *vma);
> +
> +	/*
> +	 * If specified, this hook is invoked when an error occurred when
> +	 * attempting the selection action.
> +	 *
> +	 * The hook can return an error code in order to filter the error, but
> +	 * it is not valid to clear the error here.
> +	 */
> +	int (*error_hook)(int err);

Do we need two hooks? It might be more ergonomic to simply have a:

	int (*finish)(int err);


	int random_driver_finish(int err)
	{
		if (err)
			pr_err("ahhhhhhhhh\n");
		mutex_unlock(&big_lock);
		return err;
	}

It's also unclear to me if/why we need the capability to switch error codes,
but I might've missed some discussion on this.


-- 
Pedro
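
For illustration, a hedged sketch of a driver requesting this action;
the 'action' embedding and the desc->start/end/page_prot field names are
assumptions, and mydrv_pfn is a hypothetical value:

	static int mydrv_mmap_prepare(struct vm_area_desc *desc)
	{
		struct mmap_action *action = &desc->action;

		action->type = MMAP_REMAP_PFN;
		action->remap.start = desc->start;
		action->remap.start_pfn = mydrv_pfn;
		action->remap.size = desc->end - desc->start;
		action->remap.pgprot = desc->page_prot;
		action->remap.is_io_remap = false;

		/* The core mm performs the remap once the VMA is
		 * fully established. */
		return 0;
	}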

* Re: [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion
  2025-09-17 10:49   ` Pedro Falcato
@ 2025-09-17 13:24     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17 13:24 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Wed, Sep 17, 2025 at 11:49:18AM +0100, Pedro Falcato wrote:
> On Tue, Sep 16, 2025 at 03:11:51PM +0100, Lorenzo Stoakes wrote:
> > Now we have the f_op->mmap_prepare() hook, having a static function called
> > __mmap_prepare() that has nothing to do with it is confusing, so rename
> > the function to __mmap_setup().
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
>
> I would love to bikeshed on the new name (maybe something more descriptive?),
> but I don't really mind.

Lol thanks, I think let's get this in :P

>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>

Cheers!

>
> --
> Pedro

* Re: [PATCH v3 02/13] device/dax: update devdax to use mmap_prepare
  2025-09-17 10:37   ` Pedro Falcato
@ 2025-09-17 13:54     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17 13:54 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Wed, Sep 17, 2025 at 11:37:07AM +0100, Pedro Falcato wrote:
> On Tue, Sep 16, 2025 at 03:11:48PM +0100, Lorenzo Stoakes wrote:
> > The devdax driver does nothing special in its f_op->mmap hook, so
> > straightforwardly update it to use the mmap_prepare hook instead.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
>
> Acked-by: Pedro Falcato <pfalcato@suse.de>

Thanks!

>
> > ---
> >  drivers/dax/device.c | 32 +++++++++++++++++++++-----------
> >  1 file changed, 21 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > index 2bb40a6060af..c2181439f925 100644
> > --- a/drivers/dax/device.c
> > +++ b/drivers/dax/device.c
> > @@ -13,8 +13,9 @@
> >  #include "dax-private.h"
> >  #include "bus.h"
> >
> > -static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> > -		const char *func)
> > +static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
> > +		       unsigned long start, unsigned long end, struct file *file,
> > +		       const char *func)
> >  {
> >  	struct device *dev = &dev_dax->dev;
> >  	unsigned long mask;
> > @@ -23,7 +24,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> >  		return -ENXIO;
> >
> >  	/* prevent private mappings from being established */
> > -	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> > +	if ((vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> >  		dev_info_ratelimited(dev,
> >  				"%s: %s: fail, attempted private mapping\n",
> >  				current->comm, func);
> > @@ -31,15 +32,15 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> >  	}
> >
> >  	mask = dev_dax->align - 1;
> > -	if (vma->vm_start & mask || vma->vm_end & mask) {
> > +	if (start & mask || end & mask) {
> >  		dev_info_ratelimited(dev,
> >  				"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
> > -				current->comm, func, vma->vm_start, vma->vm_end,
> > +				current->comm, func, start, end,
> >  				mask);
> >  		return -EINVAL;
> >  	}
> >
> > -	if (!vma_is_dax(vma)) {
> > +	if (!file_is_dax(file)) {
> >  		dev_info_ratelimited(dev,
> >  				"%s: %s: fail, vma is not DAX capable\n",
> >  				current->comm, func);
> > @@ -49,6 +50,13 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >
> > +static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> > +		     const char *func)
> > +{
> > +	return __check_vma(dev_dax, vma->vm_flags, vma->vm_start, vma->vm_end,
> > +			   vma->vm_file, func);
> > +}
> > +
>
> Side comment: I'm no DAX expert at all, but this check_vma() thing looks... smelly?
> Besides the !dax_alive() check, I don't see the need to recheck vma limits at
> every ->huge_fault() call. Even taking mremap() into account,
> ->get_unmapped_area() should Do The Right Thing, no?

Let's keep this out of the series though please, I'm only humbly converting
to ->mmap_prepare; this series isn't about solving the problems of every
mmap caller :)

>
> --
> Pedro

Thanks, Lorenzo

* Re: [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc
  2025-09-17 11:32   ` Pedro Falcato
@ 2025-09-17 15:34     ` Lorenzo Stoakes
  0 siblings, 0 replies; 49+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17 15:34 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, linux-doc, linux-kernel,
	linux-fsdevel, linux-csky, linux-mips, linux-s390, sparclinux,
	nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe, iommu, Kevin Tian, Will Deacon, Robin Murphy

On Wed, Sep 17, 2025 at 12:32:10PM +0100, Pedro Falcato wrote:
> On Tue, Sep 16, 2025 at 03:11:54PM +0100, Lorenzo Stoakes wrote:
> > Some drivers/filesystems need to perform additional tasks after the VMA is
> > set up.  This is typically in the form of pre-population.
> >
> > The forms of pre-population most likely to be performed are a PFN remap
> > or the insertion of normal folios and PFNs into a mixed map.
> >
> > We start by implementing the PFN remap functionality, ensuring that we
> > perform the appropriate actions at the appropriate time - that is setting
> > flags at the point of .mmap_prepare, and performing the actual remap at the
> > point at which the VMA is fully established.
> >
> > This prevents the driver from doing anything too crazy with a VMA at any
> > stage, and we retain complete control over how the mm functionality is
> > applied.
> >
> > Unfortunately callers still do often require some kind of custom action,
> > so we add an optional success/error _hook to allow the caller to do
> > something after the action has succeeded or failed.
>
> Do we have any rules in mind regarding ->mmap_prepare() and ->*_hook()?
> It feels spooky to e.g. grab locks in mmap_prepare and hold them across
> core mmap(). And I guess it might be needed?

I already did a bunch of logic around this, but several respins later we
don't currently support it; as Jason pointed out, we probably don't
actually need to, at least so far.

I don't think it's really worth saying 'do this, don't do that', as wayward
drivers will do whatever they like anyway.

Sadly we do need those hooks because of error filtering and e.g. debug output on
success.

However, on success I discourage anything too stupid by making the vma
parameter const, so you'd have to do a const cast there.

On error you only get the error code, so good luck with that.

Obviously there could be a static mutex, but I think that's unavoidable.
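
A minimal sketch of the kind of success hook this implies (signature
taken from the quoted struct), doing debug output only:

	static int mydrv_mmap_success(const struct vm_area_struct *vma)
	{
		pr_debug("mydrv: mapped [%#lx, %#lx)\n",
			 vma->vm_start, vma->vm_end);
		return 0;
	}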

>
> >
> > This is done at the point when the VMA has already been established, so
> > the harm that can be done is limited.
> >
> > The error hook can be used to filter errors if necessary.
> >
> > If any error arises on these final actions, we simply unmap the VMA
> > altogether.
> >
> > Also update the stacked filesystem compatibility layer to utilise the
> > action behaviour, and update the VMA tests accordingly.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> <snip>
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 31b27086586d..aa1e2003f366 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -775,6 +775,49 @@ struct pfnmap_track_ctx {
> >  };
> >  #endif
> >
> > +/* What action should be taken after an .mmap_prepare call is complete? */
> > +enum mmap_action_type {
> > +	MMAP_NOTHING,		/* Mapping is complete, no further action. */
> > +	MMAP_REMAP_PFN,		/* Remap PFN range. */
> > +};
> > +
> > +/*
> > + * Describes an action an mmap_prepare hook can instruct to be taken to complete
> > + * the mapping of a VMA. Specified in vm_area_desc.
> > + */
> > +struct mmap_action {
> > +	union {
> > +		/* Remap range. */
> > +		struct {
> > +			unsigned long start;
> > +			unsigned long start_pfn;
> > +			unsigned long size;
> > +			pgprot_t pgprot;
> > +			bool is_io_remap;
> > +		} remap;
> > +	};
> > +	enum mmap_action_type type;
> > +
> > +	/*
> > +	 * If specified, this hook is invoked after the selected action has been
> > +	 * successfully completed. Note that the VMA write lock still held.
> > +	 *
> > +	 * The absolute minimum ought to be done here.
> > +	 *
> > +	 * Returns 0 on success, or an error code.
> > +	 */
> > +	int (*success_hook)(const struct vm_area_struct *vma);
> > +
> > +	/*
> > +	 * If specified, this hook is invoked when an error occurred when
> > +	 * attempting the selection action.
> > +	 *
> > +	 * The hook can return an error code in order to filter the error, but
> > +	 * it is not valid to clear the error here.
> > +	 */
> > +	int (*error_hook)(int err);
>
> Do we need two hooks? It might be more ergonomic to simply have a:
>
> 	int (*finish)(int err);
>
>
> 	int random_driver_finish(int err)
> 	{
> 		if (err)
> 			pr_err("ahhhhhhhhh\n");
> 		mutex_unlock(&big_lock);
> 		return err;
> 	}

No, I think that's less clear. Better to spell it out.

>
> It's also unclear to me if/why we need the capability to switch error codes,
> but I might've missed some discussion on this.

There are drivers that do it. That's why.
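
For instance, a hedged sketch of such filtering (the -ENOMEM to -EAGAIN
mapping is purely illustrative):

	static int mydrv_mmap_error_hook(int err)
	{
		/* Preserve a legacy error code for userspace. */
		return err == -ENOMEM ? -EAGAIN : err;
	}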

>
>
> --
> Pedro

End of thread (newest message: 2025-09-17 15:35 UTC).

Thread overview: 49+ messages
2025-09-16 14:11 [PATCH v3 00/13] expand mmap_prepare functionality, port more users Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 01/13] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
2025-09-16 16:42   ` Jason Gunthorpe
2025-09-17 10:30   ` Pedro Falcato
2025-09-16 14:11 ` [PATCH v3 02/13] device/dax: update devdax " Lorenzo Stoakes
2025-09-16 16:43   ` Jason Gunthorpe
2025-09-17 10:37   ` Pedro Falcato
2025-09-17 13:54     ` Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 03/13] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
2025-09-16 16:46   ` Jason Gunthorpe
2025-09-17 10:39   ` Pedro Falcato
2025-09-16 14:11 ` [PATCH v3 04/13] relay: update relay to use mmap_prepare Lorenzo Stoakes
2025-09-16 16:48   ` Jason Gunthorpe
2025-09-17 10:41   ` Pedro Falcato
2025-09-16 14:11 ` [PATCH v3 05/13] mm/vma: rename __mmap_prepare() function to avoid confusion Lorenzo Stoakes
2025-09-16 16:48   ` Jason Gunthorpe
2025-09-17 10:49   ` Pedro Falcato
2025-09-17 13:24     ` Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 06/13] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
2025-09-16 17:07   ` Jason Gunthorpe
2025-09-16 17:37     ` Lorenzo Stoakes
2025-09-17 11:07   ` Pedro Falcato
2025-09-17 11:16     ` Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 07/13] mm: introduce io_remap_pfn_range_[prepare, complete]() Lorenzo Stoakes
2025-09-16 17:19   ` Jason Gunthorpe
2025-09-16 17:34     ` Lorenzo Stoakes
2025-09-17 11:12   ` Pedro Falcato
2025-09-17 11:15     ` Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 08/13] mm: add ability to take further action in vm_area_desc Lorenzo Stoakes
2025-09-16 17:28   ` Jason Gunthorpe
2025-09-16 17:57     ` Lorenzo Stoakes
2025-09-16 18:08       ` Jason Gunthorpe
2025-09-16 19:31         ` Lorenzo Stoakes
2025-09-17 11:32   ` Pedro Falcato
2025-09-17 15:34     ` Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 09/13] doc: update porting, vfs documentation for mmap_prepare actions Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 10/13] mm/hugetlbfs: update hugetlbfs to use mmap_prepare Lorenzo Stoakes
2025-09-16 17:30   ` Jason Gunthorpe
2025-09-16 14:11 ` [PATCH v3 11/13] mm: update mem char driver " Lorenzo Stoakes
2025-09-16 17:40   ` Jason Gunthorpe
2025-09-16 18:02     ` Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 12/13] mm: update resctl " Lorenzo Stoakes
2025-09-16 17:40   ` Jason Gunthorpe
2025-09-16 18:02     ` Lorenzo Stoakes
2025-09-16 14:11 ` [PATCH v3 13/13] iommufd: update " Lorenzo Stoakes
2025-09-16 15:40   ` Jason Gunthorpe
2025-09-16 16:23     ` Lorenzo Stoakes
2025-09-17  1:32       ` Andrew Morton
2025-09-17 10:13         ` Lorenzo Stoakes
