* [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout
@ 2023-08-01 12:48 David Hildenbrand
  2023-08-01 12:48 ` [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT David Hildenbrand
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

This is against mm/mm-unstable, but everything except patches #7 and #8
should apply on current master. Especially patches #1 and #2 should go
upstream first, so we can let the other stuff mature a bit longer.


Next attempt to handle the fallout of 474098edac26
("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()"), where I
accidentally missed that follow_page() and smaps implicitly kept the
FOLL_NUMA flag clear -- by not setting it if FOLL_FORCE is absent -- so
they would not trigger faults on PROT_NONE-mapped PTEs.
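
For reference, a simplified sketch of the problematic check (mirroring
the "-" lines that patch #1 removes again; the _old suffix is only for
illustration): without FOLL_FORCE, any pte_protnone()/pmd_protnone()
entry makes follow_page() and friends bail out, which is what broke the
smaps accounting and KSM on PROT_NONE mappings.

  static inline bool gup_can_follow_protnone_old(unsigned int flags)
  {
          return flags & FOLL_FORCE;
  }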

Patch #1 fixes the known issues by reintroducing FOLL_NUMA as
FOLL_HONOR_NUMA_FAULT and decoupling it from FOLL_FORCE.

Patch #2 is a cleanup that I think actually fixes some corner cases, so
I added a Fixes: tag.

Patch #3 makes KVM explicitly set FOLL_HONOR_NUMA_FAULT in the single
case where it is required, and documents the situation.

Patch #4 then stops implicitly setting FOLL_HONOR_NUMA_FAULT. But note that
for FOLL_WRITE we always implicitly honor NUMA hinting faults.

Patch #5 and patch #6 cleanup some comments.

Patch #7 improves the KSM functional tests such that patch #8 can
actually check for one of the known issues: KSM no longer working on
PROT_NONE mappings on x86-64 with CONFIG_NUMA_BALANCING.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>

David Hildenbrand (8):
  mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
  kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow()
  mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT
  pgtable: improve pte_protnone() comment
  mm/huge_memory: remove stale NUMA hinting comment from
    follow_trans_huge_pmd()
  selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if
    anything got merged
  selftest/mm: ksm_functional_tests: Add PROT_NONE test

 fs/proc/task_mmu.c                            |   3 +-
 include/linux/mm.h                            |  21 +++-
 include/linux/mm_types.h                      |   9 ++
 include/linux/pgtable.h                       |  16 ++-
 mm/gup.c                                      |  23 +++-
 mm/huge_memory.c                              |   3 +-
 .../selftests/mm/ksm_functional_tests.c       | 106 ++++++++++++++++--
 virt/kvm/kvm_main.c                           |  13 ++-
 8 files changed, 164 insertions(+), 30 deletions(-)

-- 
2.41.0



* [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  2023-08-01 15:48   ` Peter Xu
  2023-08-02 15:08   ` Mel Gorman
  2023-08-01 12:48 ` [PATCH v2 2/8] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() David Hildenbrand
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini, stable

Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
or due to inaccessible (PROT_NONE) VMAs.

As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page
faults from gup/gup_fast"): "Other follow_page callers like KSM should not
use FOLL_NUMA, or they would fail to get the pages if they use follow_page
instead of get_user_pages."

liubo reported [1] that smaps_rollup results are imprecise, because they
miss accounting of pages that are mapped PROT_NONE. Further, it's easy
to reproduce that KSM no longer works on inaccessible VMAs on x86-64,
because pte_protnone()/pmd_protnone() also indicate "true" in
inaccessible VMAs, and follow_page() refuses to return such pages right
now.

As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not really
an option.

To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
to restore the original behavior for now and add better comments.

Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
is_valid_gup_args(), to add that flag for all external GUP users.

Note that there are three GUP-internal __get_user_pages() users that don't
end up calling is_valid_gup_args() and consequently won't get
FOLL_HONOR_NUMA_FAULT set.

1) get_dump_page(): we really don't want to handle NUMA hinting
   faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
   hinting faults already.
2) populate_vma_page_range(): we really don't want to handle NUMA hinting
   faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
   honored NUMA hinting faults already.
3) faultin_vma_page_range(): we similarly don't want to handle NUMA
   hinting faults.

To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
inaccessible VMAs properly, we have to perform VMA accessibility checks in
gup_can_follow_protnone().

As GUP-fast should reject such pages either way in
pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
arm64 that both implement pte_protnone() -- let's just always fall back
to ordinary GUP when stumbling over pte_protnone()/pmd_protnone().

As Linus notes [2], honoring NUMA faults might only make sense for
selected GUP users.

So we should really see if we can instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP by default.
Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag
and adding appropriate documentation.
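
As a rough sketch of what such a manual opt-in could look like
(lookup_page_honor_numa() is a hypothetical helper, modeled on what
patch #3 does for KVM's hva_to_pfn_slow(), and not part of this patch):
an external GUP caller that maps pages into a secondary MMU would then
request NUMA hinting faults explicitly.

  static long lookup_page_honor_numa(unsigned long addr, struct page **page)
  {
          return get_user_pages_unlocked(addr, 1, page,
                                         FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT);
  }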

[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com

Reported-by: liubo <liubo254@huawei.com>
Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
Reported-by: Peter Xu <peterx@redhat.com>
Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h       | 21 +++++++++++++++------
 include/linux/mm_types.h |  9 +++++++++
 mm/gup.c                 | 29 +++++++++++++++++++++++------
 mm/huge_memory.c         |  2 +-
 4 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2fbc6c631764..165830a95641 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3455,15 +3455,24 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
  * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
  * a (NUMA hinting) fault is required.
  */
-static inline bool gup_can_follow_protnone(unsigned int flags)
+static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
+					   unsigned int flags)
 {
 	/*
-	 * FOLL_FORCE has to be able to make progress even if the VMA is
-	 * inaccessible. Further, FOLL_FORCE access usually does not represent
-	 * application behaviour and we should avoid triggering NUMA hinting
-	 * faults.
+	 * If callers don't want to honor NUMA hinting faults, no need to
+	 * determine if we would actually have to trigger a NUMA hinting fault.
 	 */
-	return flags & FOLL_FORCE;
+	if (!(flags & FOLL_HONOR_NUMA_FAULT))
+		return true;
+
+	/*
+	 * NUMA hinting faults don't apply in inaccessible (PROT_NONE) VMAs.
+	 *
+	 * Requiring a fault here even for inaccessible VMAs would mean that
+	 * FOLL_FORCE cannot make any progress, because handle_mm_fault()
+	 * refuses to process NUMA hinting faults in inaccessible VMAs.
+	 */
+	return !vma_is_accessible(vma);
 }
 
 typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index da538ff68953..18c8c3d793b0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1296,6 +1296,15 @@ enum {
 	FOLL_PCI_P2PDMA = 1 << 10,
 	/* allow interrupts from generic signals */
 	FOLL_INTERRUPTIBLE = 1 << 11,
+	/*
+	 * Always honor (trigger) NUMA hinting faults.
+	 *
+	 * FOLL_WRITE implicitly honors NUMA hinting faults because a
+	 * PROT_NONE-mapped page is not writable (exceptions with FOLL_FORCE
+	 * apply). get_user_pages_fast_only() always implicitly honors NUMA
+	 * hinting faults.
+	 */
+	FOLL_HONOR_NUMA_FAULT = 1 << 12,
 
 	/* See also internal only FOLL flags in mm/internal.h */
 };
diff --git a/mm/gup.c b/mm/gup.c
index 2493ffa10f4b..f463d3004ddc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	pte = ptep_get(ptep);
 	if (!pte_present(pte))
 		goto no_page;
-	if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
 		goto no_page;
 
 	page = vm_normal_page(vma, address, pte);
@@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	if (likely(!pmd_trans_huge(pmdval)))
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 
-	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
+	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
 		return no_page_table(vma, flags);
 
 	ptl = pmd_lock(mm, pmd);
@@ -844,6 +844,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
 		return NULL;
 
+	/*
+	 * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect
+	 * to fail on PROT_NONE-mapped pages.
+	 */
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 		gup_flags |= FOLL_UNLOCKABLE;
 	}
 
+	/*
+	 * For now, always trigger NUMA hinting faults. Some GUP users like
+	 * KVM really require it to benefit from autonuma.
+	 */
+	gup_flags |= FOLL_HONOR_NUMA_FAULT;
+
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
@@ -2564,7 +2574,14 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		struct page *page;
 		struct folio *folio;
 
-		if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+		/*
+		 * Always fallback to ordinary GUP on PROT_NONE-mapped pages:
+		 * pte_access_permitted() better should reject these pages
+		 * either way: otherwise, GUP-fast might succeed in
+		 * cases where ordinary GUP would fail due to VMA access
+		 * permissions.
+		 */
+		if (pte_protnone(pte))
 			goto pte_unmap;
 
 		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
@@ -2983,8 +3000,8 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
 			     pmd_devmap(pmd))) {
-			if (pmd_protnone(pmd) &&
-			    !gup_can_follow_protnone(flags))
+			/* See gup_pte_range() */
+			if (pmd_protnone(pmd))
 				return 0;
 
 			if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
@@ -3164,7 +3181,7 @@ static int internal_get_user_pages_fast(unsigned long start,
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
 				       FOLL_FAST_ONLY | FOLL_NOFAULT |
-				       FOLL_PCI_P2PDMA)))
+				       FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2e2e8a24cc71..2cd3e5502180 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1468,7 +1468,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		return ERR_PTR(-EFAULT);
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
-	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
+	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
 		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
-- 
2.41.0



* [PATCH v2 2/8] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
  2023-08-01 12:48 ` [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  2023-08-02 15:16   ` Mel Gorman
  2023-08-01 12:48 ` [PATCH v2 3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow() David Hildenbrand
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

We shouldn't be using a GUP-internal helper if it can be avoided.

Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
vm_normal_page_pmd() that similarly refuses to return the huge zeropage.

In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():

(1) Will always return the head page, not a tail page of a THP.

 If we'd ever call smaps_account with a tail page while setting "compound
 = true", we could be in trouble, because smaps_account() would look at
 the memmap of unrelated pages.

 If we're unlucky, that memmap does not exist at all. Before we removed
 PG_doublemap, we could have triggered something similar as in
 commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
 migration entry").

 This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
 smaps_rollup: do not stall write attempts on mmap_lock"):

  (a) We're in show_smaps_rollup() and processed a VMA
  (b) We release the mmap lock in show_smaps_rollup() because it is
      contended
  (c) We merged that VMA with another VMA
  (d) We collapsed a THP in that merged VMA at that position

 If the end address of the original VMA falls into the middle of a THP
 area, we would call smap_gather_stats() with a start address that falls
 into a PMD-mapped THP. It's probably very rare to trigger when not
 really forced.

(2) Will succeed on an is_pci_p2pdma_page() page, just like vm_normal_page()

 Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
 If such pages were anonymous, we would most certainly want to account
 them.

(3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()

 As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
 pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
 entries.

 Especially, follow_pmd_mask() never ends up calling
 follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
 follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
 is set.

 So skipping pmd_devmap() pages seems to be the right thing to do.

(4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()

 We won't be returning a memmap that should be ignored by core-mm, or
 worse, a memmap that does not even exist. Note that while
 walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.

 Most probably this case doesn't currently really happen on the PMD level,
 otherwise we'd already be able to trigger kernel crashes when reading
 smaps / smaps_rollup.

So most probably only (1) is relevant in practice as of now, and even
that should only cause trouble in extreme corner cases.
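
To illustrate the hazard from (1), here is a loop modeled loosely on the
compound case of smaps_account() (account_compound_sketch() and its
parameters are illustrative, not the actual code): the accounting walks
"nr" consecutive struct pages starting at "page", so being handed a tail
page would make it read the memmap of unrelated pages -- or of pages
whose memmap does not exist at all.

  static void account_compound_sketch(struct page *page, unsigned long nr)
  {
          unsigned long i;

          /* With a tail page, "page + i" soon leaves the THP's memmap. */
          for (i = 0; i < nr; i++, page++)
                  (void)page_mapcount(page);
  }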

Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/task_mmu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index bf25178ae66a..7a7d6e2e6a14 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -571,8 +571,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	bool migration = false;
 
 	if (pmd_present(*pmd)) {
-		/* FOLL_DUMP will return -EFAULT on huge zero page */
-		page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+		page = vm_normal_page_pmd(vma, addr, *pmd);
 	} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
-- 
2.41.0



* [PATCH v2 3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow()
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
  2023-08-01 12:48 ` [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT David Hildenbrand
  2023-08-01 12:48 ` [PATCH v2 2/8] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  2023-08-02 15:27   ` Mel Gorman
  2023-08-01 12:48 ` [PATCH v2 4/8] mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT David Hildenbrand
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

KVM is *the* case we know of that really wants to honor NUMA hinting faults.
As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
to map them into a secondary MMU, and add a comment why.

Do that unconditionally in hva_to_pfn_slow() when calling
get_user_pages_unlocked().

kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
gfn_to_page_many_atomic() are similarly used to map pages into a
secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
implicitly honor NUMA hinting faults -- as documented for
FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
for now.
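
As a minimal sketch of why FOLL_WRITE needs no special treatment
(write_lookup_possible() is a hypothetical helper, not actual GUP code,
and FOLL_FORCE corner cases are ignored): a PROT_NONE-mapped page is not
writable, so a FOLL_WRITE lookup cannot succeed on it and the caller
ends up in handle_mm_fault(), which processes the NUMA hinting fault.

  static inline bool write_lookup_possible(pte_t pte, unsigned int flags)
  {
          /* pte_write() is false for pte_protnone() PTEs. */
          return !(flags & FOLL_WRITE) || pte_write(pte);
  }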

Don't set it in check_user_page_hwpoison(), where we really only want to
check if the mapped page is HW-poisoned.

We won't set it for other KVM users of get_user_pages()/pin_user_pages():
* arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
  secondary MMU.
* arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
* arch/s390/kvm/*: s390x only supports a single NUMA node either way
* arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.

This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer
implicitly be set by get_user_pages() and friends.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 virt/kvm/kvm_main.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..6e4f2b81541e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2517,7 +2517,18 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
 {
-	unsigned int flags = FOLL_HWPOISON;
+	/*
+	 * When a VCPU accesses a page that is not mapped into the secondary
+	 * MMU, we lookup the page using GUP to map it, so the guest VCPU can
+	 * make progress. We always want to honor NUMA hinting faults in that
+	 * case, because GUP usage corresponds to memory accesses from the VCPU.
+	 * Otherwise, we'd not trigger NUMA hinting faults once a page is
+	 * mapped into the secondary MMU and gets accessed by a VCPU.
+	 *
+	 * Note that get_user_page_fast_only() and FOLL_WRITE for now
+	 * implicitly honor NUMA hinting faults and don't need this flag.
+	 */
+	unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;
 	struct page *page;
 	int npages;
 
-- 
2.41.0



* [PATCH v2 4/8] mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
                   ` (2 preceding siblings ...)
  2023-08-01 12:48 ` [PATCH v2 3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow() David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  2023-08-02 15:28   ` Mel Gorman
  2023-08-01 12:48 ` [PATCH v2 5/8] pgtable: improve pte_protnone() comment David Hildenbrand
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

Commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from
gup/gup_fast") from 2012 documented the primary reason why we would want
to handle NUMA hinting faults from GUP:

  KVM secondary MMU page faults will trigger the NUMA hinting page
  faults through gup_fast -> get_user_pages -> follow_page ->
  handle_mm_fault.

That is still the case today, and relevant KVM code has been converted to
manually set FOLL_HONOR_NUMA_FAULT. So let's stop setting
FOLL_HONOR_NUMA_FAULT for all GUP users and cross our fingers that not
many other users remain that really require such handling for NUMA
balancing.

Possible interaction with MMU notifiers:

 Assume a driver obtains a page using get_user_pages() to map it into
 a secondary MMU, and uses the MMU notifier framework to get notified on
 changes.

 Assume get_user_pages() succeeded on a PROT_NONE-mapped page (because
 FOLL_HONOR_NUMA_FAULT is not set) in an accessible VMA and the page is
 mapped into a secondary MMU. Once user space turns that mapping
 inaccessible using mprotect(PROT_NONE), the actual PTE in the page table
 might not change. If the MMU notifier were smart and optimized for that
 case ("why notify if the PTE didn't change?"), that could be problematic.

 At least change_pmd_range() with MMU_NOTIFY_PROTECTION_VMA for now does an
 unconditional mmu_notifier_invalidate_range_start() ->
 mmu_notifier_invalidate_range_end() and should be fine.

 Note that even if a PTE in an accessible VMA is pte_protnone(), the
 underlying page might be accessed by a secondary MMU that does not set
 FOLL_HONOR_NUMA_FAULT, and test_young() MMU notifiers would return "true".

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f463d3004ddc..ee4fc15ce88e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2244,12 +2244,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 		gup_flags |= FOLL_UNLOCKABLE;
 	}
 
-	/*
-	 * For now, always trigger NUMA hinting faults. Some GUP users like
-	 * KVM really require it to benefit from autonuma.
-	 */
-	gup_flags |= FOLL_HONOR_NUMA_FAULT;
-
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
-- 
2.41.0



* [PATCH v2 5/8] pgtable: improve pte_protnone() comment
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
                   ` (3 preceding siblings ...)
  2023-08-01 12:48 ` [PATCH v2 4/8] mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  2023-08-02 15:35   ` Mel Gorman
  2023-08-01 12:48 ` [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd() David Hildenbrand
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

In particular, the claim that "For PROT_NONE VMAs, the PTEs are not
marked _PAGE_PROTNONE" is wrong: doing an mprotect(PROT_NONE) will end
up marking all PTEs on x86 as _PAGE_PROTNONE, making pte_protnone()
indicate "yes".

So let's improve the comment so that it's easier to grasp which
semantics pte_protnone() actually has.
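
In short, the semantics boil down to the following
(pte_needs_numa_fault() is a hypothetical helper that merely mirrors the
reworded comment): only a pte_protnone() entry in an *accessible* VMA
denotes a pending NUMA hinting fault; in an inaccessible (PROT_NONE)
VMA, pte_protnone() may or may not report "yes", and no hinting fault
applies.

  static inline bool pte_needs_numa_fault(struct vm_area_struct *vma, pte_t pte)
  {
          return pte_protnone(pte) && vma_is_accessible(vma);
  }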

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pgtable.h | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f34e0f2cb4d8..6064f454c8e3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1333,12 +1333,16 @@ static inline int pud_trans_unstable(pud_t *pud)
 
 #ifndef CONFIG_NUMA_BALANCING
 /*
- * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
- * the only case the kernel cares is for NUMA balancing and is only ever set
- * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
- * _PAGE_PROTNONE so by default, implement the helper as "always no". It
- * is the responsibility of the caller to distinguish between PROT_NONE
- * protections and NUMA hinting fault protections.
+ * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
+ * perfectly valid to indicate "no" in that case, which is why our default
+ * implementation defaults to "always no".
+ *
+ * In an accessible VMA, however, pte_protnone() reliably indicates PROT_NONE
+ * page protection due to NUMA hinting. NUMA hinting faults only apply in
+ * accessible VMAs.
+ *
+ * So, to reliably identify PROT_NONE PTEs that require a NUMA hinting fault,
+ * looking at the VMA accessibility is sufficient.
  */
 static inline int pte_protnone(pte_t pte)
 {
-- 
2.41.0



* [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
                   ` (4 preceding siblings ...)
  2023-08-01 12:48 ` [PATCH v2 5/8] pgtable: improve pte_protnone() comment David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  2023-08-01 16:07   ` Peter Xu
  2023-08-02 15:34   ` Mel Gorman
  2023-08-01 12:48 ` [PATCH v2 7/8] selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if anything got merged David Hildenbrand
  2023-08-01 12:48 ` [PATCH v2 8/8] selftest/mm: ksm_functional_tests: Add PROT_NONE test David Hildenbrand
  7 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

That comment for pmd_protnone() was added in commit 2b4847e73004
("mm: numa: serialise parallel get_user_page against THP migration"), which
noted:

	THP does not unmap pages due to a lack of support for migration
	entries at a PMD level.  This allows races with get_user_pages

Nowadays, we do have PMD migration entries, so the comment no longer
applies. Let's drop it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2cd3e5502180..0b709d2c46c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1467,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
 		return ERR_PTR(-EFAULT);
 
-	/* Full NUMA hinting faults to serialise migration in fault paths */
 	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
 		return NULL;
 
-- 
2.41.0



* [PATCH v2 7/8] selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if anything got merged
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
                   ` (5 preceding siblings ...)
  2023-08-01 12:48 ` [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd() David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  2023-08-01 12:48 ` [PATCH v2 8/8] selftest/mm: ksm_functional_tests: Add PROT_NONE test David Hildenbrand
  7 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

Let's extend mmap_and_merge_range() to test if anything in the current
process was merged. range_maps_duplicates() is too unreliable for that
use case, so instead look at KSM stats.

Trigger a complete unmerge first, to clean up the stable tree and
stabilize accounting of merged pages.

Note that we're using /proc/self/ksm_merging_pages instead of
/proc/self/ksm_stat, because the former is available in more existing
kernels.

If /proc/self/ksm_merging_pages can't be opened, we can't perform any
checks and simply skip them.

We have to special-case the shared zeropage for now. But the only user
-- test_unmerge_zero_pages() -- performs its own merge checks.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 .../selftests/mm/ksm_functional_tests.c       | 47 +++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c
index 0de9d33cd565..cb63b600cb4f 100644
--- a/tools/testing/selftests/mm/ksm_functional_tests.c
+++ b/tools/testing/selftests/mm/ksm_functional_tests.c
@@ -30,6 +30,7 @@
 static int ksm_fd;
 static int ksm_full_scans_fd;
 static int proc_self_ksm_stat_fd;
+static int proc_self_ksm_merging_pages_fd;
 static int ksm_use_zero_pages_fd;
 static int pagemap_fd;
 static size_t pagesize;
@@ -88,6 +89,22 @@ static long get_my_ksm_zero_pages(void)
 	return my_ksm_zero_pages;
 }
 
+static long get_my_merging_pages(void)
+{
+	char buf[10];
+	ssize_t ret;
+
+	if (proc_self_ksm_merging_pages_fd < 0)
+		return proc_self_ksm_merging_pages_fd;
+
+	ret = pread(proc_self_ksm_merging_pages_fd, buf, sizeof(buf) - 1, 0);
+	if (ret <= 0)
+		return -errno;
+	buf[ret] = 0;
+
+	return strtol(buf, NULL, 10);
+}
+
 static long ksm_get_full_scans(void)
 {
 	char buf[10];
@@ -120,11 +137,29 @@ static int ksm_merge(void)
 	return 0;
 }
 
+static int ksm_unmerge(void)
+{
+	if (write(ksm_fd, "2", 1) != 1)
+		return -errno;
+	return 0;
+}
+
 static char *mmap_and_merge_range(char val, unsigned long size, bool use_prctl)
 {
 	char *map;
 	int ret;
 
+	/* Stabilize accounting by disabling KSM completely. */
+	if (ksm_unmerge()) {
+		ksft_test_result_fail("Disabling (unmerging) KSM failed\n");
+		goto unmap;
+	}
+
+	if (get_my_merging_pages() > 0) {
+		ksft_test_result_fail("Still pages merged\n");
+		goto unmap;
+	}
+
 	map = mmap(NULL, size, PROT_READ|PROT_WRITE,
 		   MAP_PRIVATE|MAP_ANON, -1, 0);
 	if (map == MAP_FAILED) {
@@ -160,6 +195,16 @@ static char *mmap_and_merge_range(char val, unsigned long size, bool use_prctl)
 		ksft_test_result_fail("Running KSM failed\n");
 		goto unmap;
 	}
+
+	/*
+	 * Check if anything was merged at all. Ignore the zero page that is
+	 * accounted differently (depending on kernel support).
+	 */
+	if (val && !get_my_merging_pages()) {
+		ksft_test_result_fail("No pages got merged\n");
+		goto unmap;
+	}
+
 	return map;
 unmap:
 	munmap(map, size);
@@ -473,6 +518,8 @@ int main(int argc, char **argv)
 	if (pagemap_fd < 0)
 		ksft_exit_skip("open(\"/proc/self/pagemap\") failed\n");
 	proc_self_ksm_stat_fd = open("/proc/self/ksm_stat", O_RDONLY);
+	proc_self_ksm_merging_pages_fd = open("/proc/self/ksm_merging_pages",
+					      O_RDONLY);
 	ksm_use_zero_pages_fd = open("/sys/kernel/mm/ksm/use_zero_pages", O_RDWR);
 
 	test_unmerge();
-- 
2.41.0



* [PATCH v2 8/8] selftest/mm: ksm_functional_tests: Add PROT_NONE test
  2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
                   ` (6 preceding siblings ...)
  2023-08-01 12:48 ` [PATCH v2 7/8] selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if anything got merged David Hildenbrand
@ 2023-08-01 12:48 ` David Hildenbrand
  7 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 12:48 UTC
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, kvm, linux-kselftest, David Hildenbrand,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

Let's test whether merging and unmerging in PROT_NONE areas works as
expected.

Pass a page protection to mmap_and_merge_range(), which will trigger
an mprotect() after writing to the pages, but before enabling merging.

Make sure that unsharing works as expected, by performing a ptrace write
(using /proc/self/mem) and by setting MADV_UNMERGEABLE.

Note that this implicitly tests that ptrace writes in an inaccessible
(PROT_NONE) mapping work as expected.
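
For reference, the "ptrace write" technique boils down to the following
(a simplified sketch: the test itself keeps /proc/self/mem open as
mem_fd and uses lseek() + write(); poke_via_proc_self_mem() and its
parameters are illustrative):

  #include <fcntl.h>
  #include <stdint.h>
  #include <unistd.h>

  static int poke_via_proc_self_mem(void *addr, char val)
  {
          int fd = open("/proc/self/mem", O_RDWR);

          if (fd < 0)
                  return -1;
          /*
           * Writes through /proc/<pid>/mem use FOLL_FORCE, so they work
           * even on PROT_NONE mappings and unshare (COW) merged KSM pages.
           */
          if (pwrite(fd, &val, 1, (uintptr_t)addr) != 1) {
                  close(fd);
                  return -1;
          }
          return close(fd);
  }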

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 .../selftests/mm/ksm_functional_tests.c       | 59 ++++++++++++++++---
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c
index cb63b600cb4f..8fa4889ab4f3 100644
--- a/tools/testing/selftests/mm/ksm_functional_tests.c
+++ b/tools/testing/selftests/mm/ksm_functional_tests.c
@@ -27,6 +27,7 @@
 #define KiB 1024u
 #define MiB (1024 * KiB)
 
+static int mem_fd;
 static int ksm_fd;
 static int ksm_full_scans_fd;
 static int proc_self_ksm_stat_fd;
@@ -144,7 +145,8 @@ static int ksm_unmerge(void)
 	return 0;
 }
 
-static char *mmap_and_merge_range(char val, unsigned long size, bool use_prctl)
+static char *mmap_and_merge_range(char val, unsigned long size, int prot,
+				  bool use_prctl)
 {
 	char *map;
 	int ret;
@@ -176,6 +178,11 @@ static char *mmap_and_merge_range(char val, unsigned long size, bool use_prctl)
 	/* Make sure each page contains the same values to merge them. */
 	memset(map, val, size);
 
+	if (mprotect(map, size, prot)) {
+		ksft_test_result_skip("mprotect() failed\n");
+		goto unmap;
+	}
+
 	if (use_prctl) {
 		ret = prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
 		if (ret < 0 && errno == EINVAL) {
@@ -218,7 +225,7 @@ static void test_unmerge(void)
 
 	ksft_print_msg("[RUN] %s\n", __func__);
 
-	map = mmap_and_merge_range(0xcf, size, false);
+	map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false);
 	if (map == MAP_FAILED)
 		return;
 
@@ -256,7 +263,7 @@ static void test_unmerge_zero_pages(void)
 	}
 
 	/* Let KSM deduplicate zero pages. */
-	map = mmap_and_merge_range(0x00, size, false);
+	map = mmap_and_merge_range(0x00, size, PROT_READ | PROT_WRITE, false);
 	if (map == MAP_FAILED)
 		return;
 
@@ -304,7 +311,7 @@ static void test_unmerge_discarded(void)
 
 	ksft_print_msg("[RUN] %s\n", __func__);
 
-	map = mmap_and_merge_range(0xcf, size, false);
+	map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false);
 	if (map == MAP_FAILED)
 		return;
 
@@ -336,7 +343,7 @@ static void test_unmerge_uffd_wp(void)
 
 	ksft_print_msg("[RUN] %s\n", __func__);
 
-	map = mmap_and_merge_range(0xcf, size, false);
+	map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false);
 	if (map == MAP_FAILED)
 		return;
 
@@ -479,7 +486,7 @@ static void test_prctl_unmerge(void)
 
 	ksft_print_msg("[RUN] %s\n", __func__);
 
-	map = mmap_and_merge_range(0xcf, size, true);
+	map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, true);
 	if (map == MAP_FAILED)
 		return;
 
@@ -494,9 +501,42 @@ static void test_prctl_unmerge(void)
 	munmap(map, size);
 }
 
+static void test_prot_none(void)
+{
+	const unsigned int size = 2 * MiB;
+	char *map;
+	int i;
+
+	ksft_print_msg("[RUN] %s\n", __func__);
+
+	map = mmap_and_merge_range(0x11, size, PROT_NONE, false);
+	if (map == MAP_FAILED)
+		goto unmap;
+
+	/* Store a unique value in each page on one half using ptrace */
+	for (i = 0; i < size / 2; i += pagesize) {
+		lseek(mem_fd, (uintptr_t) map + i, SEEK_SET);
+		if (write(mem_fd, &i, sizeof(size)) != sizeof(size)) {
+			ksft_test_result_fail("ptrace write failed\n");
+			goto unmap;
+		}
+	}
+
+	/* Trigger unsharing on the other half. */
+	if (madvise(map + size / 2, size / 2, MADV_UNMERGEABLE)) {
+		ksft_test_result_fail("MADV_UNMERGEABLE failed\n");
+		goto unmap;
+	}
+
+	ksft_test_result(!range_maps_duplicates(map, size),
+			 "Pages were unmerged\n");
+unmap:
+	munmap(map, size);
+}
+
 int main(int argc, char **argv)
 {
-	unsigned int tests = 6;
+	unsigned int tests = 7;
 	int err;
 
 #ifdef __NR_userfaultfd
@@ -508,6 +548,9 @@ int main(int argc, char **argv)
 
 	pagesize = getpagesize();
 
+	mem_fd = open("/proc/self/mem", O_RDWR);
+	if (mem_fd < 0)
+		ksft_exit_fail_msg("opening /proc/self/mem failed\n");
 	ksm_fd = open("/sys/kernel/mm/ksm/run", O_RDWR);
 	if (ksm_fd < 0)
 		ksft_exit_skip("open(\"/sys/kernel/mm/ksm/run\") failed\n");
@@ -529,6 +572,8 @@ int main(int argc, char **argv)
 	test_unmerge_uffd_wp();
 #endif
 
+	test_prot_none();
+
 	test_prctl();
 	test_prctl_fork();
 	test_prctl_unmerge();
-- 
2.41.0



* Re: [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  2023-08-01 12:48 ` [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT David Hildenbrand
@ 2023-08-01 15:48   ` Peter Xu
  2023-08-01 16:15     ` David Hildenbrand
  2023-08-02 15:08   ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2023-08-01 15:48 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini, stable

On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote:
> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
>  		gup_flags |= FOLL_UNLOCKABLE;
>  	}
>  
> +	/*
> +	 * For now, always trigger NUMA hinting faults. Some GUP users like
> +	 * KVM really require it to benefit from autonuma.
> +	 */
> +	gup_flags |= FOLL_HONOR_NUMA_FAULT;

While at it, do we want to not set it for FOLL_REMOTE, which still sounds
like a good thing to have?

Other than that, looks good here.

Side note: when I was looking at the flags again just to check the
interactions with numa balancing, I found FOLL_NOFAULT and I highly suspect
it's not needed; instead, it just wants to use follow_page[_mask]() with
some proper gup flags passed over... but that's off topic.

Thanks,

-- 
Peter Xu



* Re: [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()
  2023-08-01 12:48 ` [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd() David Hildenbrand
@ 2023-08-01 16:07   ` Peter Xu
  2023-08-01 16:16     ` David Hildenbrand
  2023-08-02 15:34   ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2023-08-01 16:07 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On Tue, Aug 01, 2023 at 02:48:42PM +0200, David Hildenbrand wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2cd3e5502180..0b709d2c46c6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1467,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
>  		return ERR_PTR(-EFAULT);
>  
> -	/* Full NUMA hinting faults to serialise migration in fault paths */
>  	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
>  		return NULL;

Perhaps squashing into patch 1?  Thanks,

-- 
Peter Xu



* Re: [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  2023-08-01 15:48   ` Peter Xu
@ 2023-08-01 16:15     ` David Hildenbrand
  2023-08-01 17:04       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 16:15 UTC
  To: Peter Xu
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini, stable

On 01.08.23 17:48, Peter Xu wrote:
> On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote:
>> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
>>   		gup_flags |= FOLL_UNLOCKABLE;
>>   	}
>>   
>> +	/*
>> +	 * For now, always trigger NUMA hinting faults. Some GUP users like
>> +	 * KVM really require it to benefit from autonuma.
>> +	 */
>> +	gup_flags |= FOLL_HONOR_NUMA_FAULT;
> 
> Since at it, do we want to not set it for FOLL_REMOTE, which still sounds
> like a good thing to have?

I thought about that, but decided against making the patch here more 
complicated, only to rip it all out again in #4.

I fully agree that FOLL_REMOTE does not make too much sense, but let's 
rather keep it simple for this patch.


Thanks!

> 
> Other than that, looks good here.
> 
> Side note: when I was looking at the flags again just to check the
> interactions over numa balancing, I found FOLL_NOFAULT and I highly suspect
> that's not needed, instead it just wants to use follow_page[_mask]() with
> some proper gup flags passed over.. but that's off topic.

Be prepared for my proposal of removing foll_flags from follow_page() ;)

(accompanied by a proper documentation)

Especially as we have FOLL_PIN users of FOLL_NOFAULT, follow_page() is a 
bad fit.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()
  2023-08-01 16:07   ` Peter Xu
@ 2023-08-01 16:16     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 16:16 UTC
  To: Peter Xu
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On 01.08.23 18:07, Peter Xu wrote:
> On Tue, Aug 01, 2023 at 02:48:42PM +0200, David Hildenbrand wrote:
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2cd3e5502180..0b709d2c46c6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1467,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>>   	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
>>   		return ERR_PTR(-EFAULT);
>>   
>> -	/* Full NUMA hinting faults to serialise migration in fault paths */
>>   	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
>>   		return NULL;
> 
> Perhaps squashing into patch 1?  Thanks,

I decided against it so I don't have to make the patch description of 
patch #1 even longer with something that's mostly unrelated to the core change.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  2023-08-01 16:15     ` David Hildenbrand
@ 2023-08-01 17:04       ` Peter Xu
  2023-08-01 17:09         ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2023-08-01 17:04 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini, stable

On Tue, Aug 01, 2023 at 06:15:48PM +0200, David Hildenbrand wrote:
> On 01.08.23 17:48, Peter Xu wrote:
> > On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote:
> > > @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
> > >   		gup_flags |= FOLL_UNLOCKABLE;
> > >   	}
> > > +	/*
> > > +	 * For now, always trigger NUMA hinting faults. Some GUP users like
> > > +	 * KVM really require it to benefit from autonuma.
> > > +	 */
> > > +	gup_flags |= FOLL_HONOR_NUMA_FAULT;
> > 
> > Since at it, do we want to not set it for FOLL_REMOTE, which still sounds
> > like a good thing to have?
> 
> I thought about that, but decided against making that patch here more
> complicated to eventually rip it again all out in #4.

I thought that was the whole point of having patch 4 separate, because we
should assume patch 4 may not exist in (at least) some trees, so I ignored
patch 4 when commenting here, and we should not assume it's destined to be
removed here.

> 
> I fully agree that FOLL_REMOTE does not make too much sense, but let's
> rather keep it simple for this patch.

It's fine - I suppose this patch fixes whatever we're aware of that's
broken with FOLL_NUMA's removal, so it can even be anything on top when
needed.  I assume I'm happy to ack this with/without that change, then:

Acked-by: Peter Xu <peterx@redhat.com>

PS: I still hope that the other oneliner can be squashed here directly; it
literally changes exactly the same line above, so reading this patch alone can
be affected.  You said there you didn't want the commit message to be too
long here, but this is definitely not long at all!  I bet you have similar
understanding to me on defining "long commit message". :) I'd never worry
that.  Your call.

Thanks,

-- 
Peter Xu



* Re: [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  2023-08-01 17:04       ` Peter Xu
@ 2023-08-01 17:09         ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-01 17:09 UTC
  To: Peter Xu
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini, stable

On 01.08.23 19:04, Peter Xu wrote:
> On Tue, Aug 01, 2023 at 06:15:48PM +0200, David Hildenbrand wrote:
>> On 01.08.23 17:48, Peter Xu wrote:
>>> On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote:
>>>> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
>>>>    		gup_flags |= FOLL_UNLOCKABLE;
>>>>    	}
>>>> +	/*
>>>> +	 * For now, always trigger NUMA hinting faults. Some GUP users like
>>>> +	 * KVM really require it to benefit from autonuma.
>>>> +	 */
>>>> +	gup_flags |= FOLL_HONOR_NUMA_FAULT;
>>>
>>> Since at it, do we want to not set it for FOLL_REMOTE, which still sounds
>>> like a good thing to have?
>>
>> I thought about that, but decided against making that patch here more
>> complicated to eventually rip it again all out in #4.
> 
> I thought that was the whole point of having patch 4 separate, because we
> should assume patch 4 may not exist in (at least) some trees, so I ignored
> patch 4 when commenting here, and we should not assume it's destined to be
> removed here.

For me, the goal of this patch is to bring it *as close as possible* to 
the previous state, so we can backport it to stable without too many 
surprises (effectively, only a handful of FOLL_FORCE/ptrace users will 
get a different behavior).

I could add a separate patch that does the FOLL_REMOTE thing, but then, 
maybe we should do that if patch #4 runs into real trouble :)

But no strong opinion if this is what everybody wants in this patch.

> 
>>
>> I fully agree that FOLL_REMOTE does not make too much sense, but let's
>> rather keep it simple for this patch.
> 
> It's fine - I suppose this patch fixes whatever we're aware of that's
> broken with FOLL_NUMA's removal, so it can even be anything on top when
> needed.  I assume I'm happy to ack this with/without that change, then:
> 
> Acked-by: Peter Xu <peterx@redhat.com>

Thanks!

> 
> PS: I still hope that the other oneliner can be squashed here directly; it
> literally changes exact the same line above so reading this patch alone can
> be affected.  You said there you didn't want the commit message to be too
> long here, but this is definitely not long at all!  I bet you have similar
> understanding to me on defining "long commit message". :) I'd never worry
> that.  Your call.

No strong opinion, it just felt cleaner not to start adding what I have 
in that separate patch's commit message in here.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  2023-08-01 12:48 ` [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT David Hildenbrand
  2023-08-01 15:48   ` Peter Xu
@ 2023-08-02 15:08   ` Mel Gorman
  2023-08-02 15:12     ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2023-08-02 15:08 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini, stable

On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote:
> Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
> gup_can_follow_protnone()") missed that follow_page() and
> follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
> don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
> or due to inaccessible (PROT_NONE) VMAs.
> 
> As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page
> faults from gup/gup_fast"): "Other follow_page callers like KSM should not
> use FOLL_NUMA, or they would fail to get the pages if they use follow_page
> instead of get_user_pages."
> 
> liubo reported [1] that smaps_rollup results are imprecise, because they
> miss accounting of pages that are mapped PROT_NONE. Further, it's easy
> to reproduce that KSM no longer works on inaccessible VMAs on x86-64,
> because pte_protnone()/pmd_protnone() also indictaes "true" in
> inaccessible VMAs, and follow_page() refuses to return such pages right
> now.
> 
> As KVM really depends on these NUMA hinting faults, removing the
> pte_protnone()/pmd_protnone() handling in GUP code completely is not really
> an option.
> 
> To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
> to restore the original behavior for now and add better comments.
> 
> Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
> is_valid_gup_args(), to add that flag for all external GUP users.
> 
> Note that there are three GUP-internal __get_user_pages() users that don't
> end up calling is_valid_gup_args() and consequently won't get
> FOLL_HONOR_NUMA_FAULT set.
> 
> 1) get_dump_page(): we really don't want to handle NUMA hinting
>    faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
>    hinting faults already.
> 2) populate_vma_page_range(): we really don't want to handle NUMA hinting
>    faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
>    honored NUMA hinting faults already.
> 3) faultin_vma_page_range(): we similarly don't want to handle NUMA
>    hinting faults.
> 
> To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
> inaccessible VMAs properly, we have to perform VMA accessibility checks in
> gup_can_follow_protnone().
> 
> As GUP-fast should reject such pages either way in
> pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
> arm64 that both implement pte_protnone() -- let's just always fallback
> to ordinary GUP when stumbling over pte_protnone()/pmd_protnone().
> 
> As Linus notes [2], honoring NUMA faults might only make sense for
> selected GUP users.
> 
> So we should really see if we can instead let relevant GUP callers specify
> it manually, and not trigger NUMA hinting faults from GUP as default.
> Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag
> and adding appropriate documenation.
> 
> [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
> [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com
> 
> Reported-by: liubo <liubo254@huawei.com>
> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
> Reported-by: Peter Xu <peterx@redhat.com>
> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
> Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I agree that FOLL_REMOTE probably needs separate treatment, but I also
agree that it's outside the context of this patch, particularly as a
-stable candidate, so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

I've a minor nit below that would be nice to get fixed up, but not
mandatory.

> ---
>  include/linux/mm.h       | 21 +++++++++++++++------
>  include/linux/mm_types.h |  9 +++++++++
>  mm/gup.c                 | 29 +++++++++++++++++++++++------
>  mm/huge_memory.c         |  2 +-
>  4 files changed, 48 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 2493ffa10f4b..f463d3004ddc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
>  		gup_flags |= FOLL_UNLOCKABLE;
>  	}
>  
> +	/*
> +	 * For now, always trigger NUMA hinting faults. Some GUP users like
> +	 * KVM really require it to benefit from autonuma.
> +	 */
> +	gup_flags |= FOLL_HONOR_NUMA_FAULT;
> +
>  	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>  	if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
>  			 (FOLL_PIN | FOLL_GET)))

Expand on *why* KVM requires it, even though I suspect this changes later
in the series. Maybe "Some GUP users like KVM require the hint to be set,
as the calling context of GUP is functionally similar to a memory
reference from task context"?

Also a minor nit -- s/autonuma/NUMA Balancing/ or numab. autonuma refers
to a specific implementation of automatic balancing that operated
similarly to khugepaged but was never merged. If you grep for it, you'll
find that autonuma is only referenced in powerpc-specific code. It's not
important these days, but very early on it was very confusing if AutoNUMA
was mentioned when NUMAB was intended.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
  2023-08-02 15:08   ` Mel Gorman
@ 2023-08-02 15:12     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-02 15:12 UTC (permalink / raw
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini, stable

>> Reported-by: liubo <liubo254@huawei.com>
>> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
>> Reported-by: Peter Xu <peterx@redhat.com>
>> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
>> Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I agree that FOLL_REMOTE probably needs separate treatment but also agree
> that it's outside the context of this patch, particularly as a -stable
> candidate so
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 
> I've a minor nit below that would be nice to get fixed up, but not
> mandatory.

Thanks Mel for taking a look, so I don't mess up once more :)

> 
>> ---
>>   include/linux/mm.h       | 21 +++++++++++++++------
>>   include/linux/mm_types.h |  9 +++++++++
>>   mm/gup.c                 | 29 +++++++++++++++++++++++------
>>   mm/huge_memory.c         |  2 +-
>>   4 files changed, 48 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 2493ffa10f4b..f463d3004ddc 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
>>   		gup_flags |= FOLL_UNLOCKABLE;
>>   	}
>>   
>> +	/*
>> +	 * For now, always trigger NUMA hinting faults. Some GUP users like
>> +	 * KVM really require it to benefit from autonuma.
>> +	 */
>> +	gup_flags |= FOLL_HONOR_NUMA_FAULT;
>> +
>>   	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>>   	if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
>>   			 (FOLL_PIN | FOLL_GET)))
> 
> Expand on *why* KVM requires it even though I suspect this changes later
> in the series. Maybe "Some GUP users like KVM require the hint to be as
> the calling context of GUP is functionally similar to a memory reference
> from task context"?

It's raised later in this series but it doesn't hurt to discuss it here 
in a bit more detail.

Sounds good to me.

> 
> Also minor nit -- s/autonuma/NUMA Balancing/ or numab. autonuma refers to
> a specific implementation of automatic balancing that operated similar to
> khugepaged but not merged. If you grep for it, you'll find that autonuma
> is only referenced in powerpc-specific code. It's not important these
> days but very early on, it was very confusing if AutoNUMA was mentioned
> when NUMAB was intended.

Ah, yes, thanks. That's one of the only places where that terminology 
accidentally slipped in.

I'll wait for more feedback and resend!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/8] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
  2023-08-01 12:48 ` [PATCH v2 2/8] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() David Hildenbrand
@ 2023-08-02 15:16   ` Mel Gorman
  2023-08-02 15:34     ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2023-08-02 15:16 UTC (permalink / raw
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On Tue, Aug 01, 2023 at 02:48:38PM +0200, David Hildenbrand wrote:
> We shouldn't be using a GUP-internal helper if it can be avoided.
> 
> Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
> vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
> 
> In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
> 
> (1) Will always return the head page, not a tail page of a THP.
> 
>  If we'd ever call smaps_account with a tail page while setting "compound
>  = true", we could be in trouble, because smaps_account() would look at
>  the memmap of unrelated pages.
> 
>  If we're unlucky, that memmap does not exist at all. Before we removed
>  PG_doublemap, we could have triggered something similar as in
>  commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
>  migration entry").
> 
>  This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
>  smaps_rollup: do not stall write attempts on mmap_lock"):
> 
>   (a) We're in show_smaps_rollup() and processed a VMA
>   (b) We release the mmap lock in show_smaps_rollup() because it is
>       contended
>   (c) We merged that VMA with another VMA
>   (d) We collapsed a THP in that merged VMA at that position
> 
>  If the end address of the original VMA falls into the middle of a THP
>  area, we would call smap_gather_stats() with a start address that falls
>  into a PMD-mapped THP. It's probably very rare to trigger when not
>  really forced.
> 
> (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()
> 
>  Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
>  If such pages would be anonymous, we most certainly would want to
>  account them.
> 
> (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
> 
>  As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
>  pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
>  entries.
> 
>  Especially, follow_pmd_mask() never ends up calling
>  follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
>  follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
>  is set.
> 
>  So skipping pmd_devmap() pages seems to be the right thing to do.
> 
> (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
> 
>  We won't be returning a memmap that should be ignored by core-mm, or
>  worse, a memmap that does not even exist. Note that while
>  walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
> 
>  Most probably this case doesn't currently really happen on the PMD level,
>  otherwise we'd already be able to trigger kernel crashes when reading
>  smaps / smaps_rollup.
> 
> So most probably only (1) is relevant in practice as of now, but could only
> cause trouble in extreme corner cases.
> 
> Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
> Signed-off-by: David Hildenbrand <david@redhat.com>

Maybe move the follow_trans_huge_pmd() declaration from linux/huge_mm.h
to mm/internal.h to discourage future mistakes? Otherwise

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow()
  2023-08-01 12:48 ` [PATCH v2 3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow() David Hildenbrand
@ 2023-08-02 15:27   ` Mel Gorman
  2023-08-02 15:29     ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2023-08-02 15:27 UTC (permalink / raw
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On Tue, Aug 01, 2023 at 02:48:39PM +0200, David Hildenbrand wrote:
> KVM is *the* case we know of that really wants to honor NUMA hinting faults.
> As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
> FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
> to map them into a secondary MMU, and add a comment why.
> 
> Do that unconditionally in hva_to_pfn_slow() when calling
> get_user_pages_unlocked().
> 
> kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
> gfn_to_page_many_atomic() are similarly used to map pages into a
> secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
> implicitly honor NUMA hinting faults -- as documented for
> FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
> for now.
> 
> Don't set it in check_user_page_hwpoison(), where we really only want to
> check if the mapped page is HW-poisoned.
> 
> We won't set it for other KVM users of get_user_pages()/pin_user_pages()
> * arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
>   secondary MMU.
> * arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
> * arch/s390/kvm/*: s390x only supports a single NUMA node either way
> * arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.
> 
> This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer
> implicitly be set by get_user_pages() and friends.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Seems sane but I don't know KVM well enough to know if this is the only
relevant case so didn't ack.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/8] mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT
  2023-08-01 12:48 ` [PATCH v2 4/8] mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT David Hildenbrand
@ 2023-08-02 15:28   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2023-08-02 15:28 UTC (permalink / raw
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On Tue, Aug 01, 2023 at 02:48:40PM +0200, David Hildenbrand wrote:
> Commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from
> gup/gup_fast") from 2012 documented as the primary reason why we would want
> to handle NUMA hinting faults from GUP:
> 
>   KVM secondary MMU page faults will trigger the NUMA hinting page
>   faults through gup_fast -> get_user_pages -> follow_page ->
>   handle_mm_fault.
> 
> That is still the case today, and relevant KVM code has been converted to
> manually set FOLL_HONOR_NUMA_FAULT. So let's stop setting
> FOLL_HONOR_NUMA_FAULT for all GUP users and cross fingers that not many
> other users remain that really require such handling for autonuma.
> 
> Possible interaction with MMU notifiers:
> 
>  Assume a driver obtains a page using get_user_pages() to map it into
>  a secondary MMU, and uses the MMU notifier framework to get notified on
>  changes.
> 
>  Assume get_user_pages() succeeded on a PROT_NONE-mapped page (because
>  FOLL_HONOR_NUMA_FAULT is not set) in an accessible VMA and the page is
>  mapped into a secondary MMU. Once user space would turn that mapping
>  inaccessible using mprotect(PROT_NONE), the actual PTE in the page table
>  might not change. If the MMU notifier would be smart and optimize for that
>  case "why notify if the PTE didn't change", that could be problematic.
> 
>  At least change_pmd_range() with MMU_NOTIFY_PROTECTION_VMA for now does an
>  unconditional mmu_notifier_invalidate_range_start() ->
>  mmu_notifier_invalidate_range_end() and should be fine.
> 
>  Note that even if a PTE in an accessible VMA is pte_protnone(), the
>  underlying page might be accessed by a secondary MMU that does not set
>  FOLL_HONOR_NUMA_FAULT, and test_young() MMU notifiers would return "true".
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Also seems sane but a large portion of its correctness also depends on
patch 3 being correct.
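
As I read it, on the slow GUP path this essentially boils down to dropping
the hunk that patch 1 added to is_valid_gup_args():

	-	/*
	-	 * For now, always trigger NUMA hinting faults. Some GUP users like
	-	 * KVM really require it to benefit from autonuma.
	-	 */
	-	gup_flags |= FOLL_HONOR_NUMA_FAULT;

so that callers have to opt in explicitly, as KVM now does in patch 3 -- but
I'm going from the changelog here rather than the diff.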

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow()
  2023-08-02 15:27   ` Mel Gorman
@ 2023-08-02 15:29     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-02 15:29 UTC (permalink / raw
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On 02.08.23 17:27, Mel Gorman wrote:
> On Tue, Aug 01, 2023 at 02:48:39PM +0200, David Hildenbrand wrote:
>> KVM is *the* case we know of that really wants to honor NUMA hinting faults.
>> As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
>> FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
>> to map them into a secondary MMU, and add a comment why.
>>
>> Do that unconditionally in hva_to_pfn_slow() when calling
>> get_user_pages_unlocked().
>>
>> kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
>> gfn_to_page_many_atomic() are similarly used to map pages into a
>> secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
>> implicitly honor NUMA hinting faults -- as documented for
>> FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
>> for now.
>>
>> Don't set it in check_user_page_hwpoison(), where we really only want to
>> check if the mapped page is HW-poisoned.
>>
>> We won't set it for other KVM users of get_user_pages()/pin_user_pages()
>> * arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
>>    secondary MMU.
>> * arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
>> * arch/s390/kvm/*: s390x only supports a single NUMA node either way
>> * arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.
>>
>> This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer
>> implicitly be set by get_user_pages() and friends.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Seems sane but I don't know KVM well enough to know if this is the only
> relevant case so didn't ack.

Makes sense, some careful eyes from KVM people would be appreciated.

At least from kvm_main.c POV, I'm pretty confident that that's it.
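
To spell out what that single location amounts to: the hunk essentially just
adds FOLL_HONOR_NUMA_FAULT to the gup flags used there, roughly (simplified,
not the literal diff):

	/* In hva_to_pfn_slow(): */
	unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;

	if (write_fault)
		flags |= FOLL_WRITE;
	...
	npages = get_user_pages_unlocked(addr, 1, &page, flags);

The other lookup paths mentioned in the changelog (hva_to_pfn_fast() etc.)
keep honoring NUMA hinting faults implicitly via FOLL_WRITE /
get_user_page_fast_only().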

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()
  2023-08-01 12:48 ` [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd() David Hildenbrand
  2023-08-01 16:07   ` Peter Xu
@ 2023-08-02 15:34   ` Mel Gorman
  1 sibling, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2023-08-02 15:34 UTC (permalink / raw
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On Tue, Aug 01, 2023 at 02:48:42PM +0200, David Hildenbrand wrote:
> That comment for pmd_protnone() was added in commit 2b4847e73004
> ("mm: numa: serialise parallel get_user_page against THP migration"), which
> noted:
> 
> 	THP does not unmap pages due to a lack of support for migration
> 	entries at a PMD level.  This allows races with get_user_pages
> 
> Nowadays, we do have PMD migration entries, so the comment no longer
> applies. Let's drop it.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/8] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
  2023-08-02 15:16   ` Mel Gorman
@ 2023-08-02 15:34     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2023-08-02 15:34 UTC (permalink / raw
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On 02.08.23 17:16, Mel Gorman wrote:
> On Tue, Aug 01, 2023 at 02:48:38PM +0200, David Hildenbrand wrote:
>> We shouldn't be using a GUP-internal helper if it can be avoided.
>>
>> Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
>> vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
>>
>> In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
>>
>> (1) Will always return the head page, not a tail page of a THP.
>>
>>   If we'd ever call smaps_account with a tail page while setting "compound
>>   = true", we could be in trouble, because smaps_account() would look at
>>   the memmap of unrelated pages.
>>
>>   If we're unlucky, that memmap does not exist at all. Before we removed
>>   PG_doublemap, we could have triggered something similar as in
>>   commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
>>   migration entry").
>>
>>   This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
>>   smaps_rollup: do not stall write attempts on mmap_lock"):
>>
>>    (a) We're in show_smaps_rollup() and processed a VMA
>>    (b) We release the mmap lock in show_smaps_rollup() because it is
>>        contended
>>    (c) We merged that VMA with another VMA
>>    (d) We collapsed a THP in that merged VMA at that position
>>
>>   If the end address of the original VMA falls into the middle of a THP
>>   area, we would call smap_gather_stats() with a start address that falls
>>   into a PMD-mapped THP. It's probably very rare to trigger when not
>>   really forced.
>>
>> (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()
>>
>>   Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
>>   If such pages would be anonymous, we most certainly would want to
>>   account them.
>>
>> (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
>>
>>   As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
>>   pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
>>   entries.
>>
>>   Especially, follow_pmd_mask() never ends up calling
>>   follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
>>   follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
>>   is set.
>>
>>   So skipping pmd_devmap() pages seems to be the right thing to do.
>>
>> (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
>>
>>   We won't be returning a memmap that should be ignored by core-mm, or
>>   worse, a memmap that does not even exist. Note that while
>>   walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
>>
>>   Most probably this case doesn't currently really happen on the PMD level,
>>   otherwise we'd already be able to trigger kernel crashes when reading
>>   smaps / smaps_rollup.
>>
>> So most probably only (1) is relevant in practice as of now, but could only
>> cause trouble in extreme corner cases.
>>
>> Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Maybe move the follow_trans_huge_pmd() declaration from linux/huge_mm.h
> to mm/internal.h to discourage future mistakes? Otherwise
> 

Makes sense.
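
Something like the below should do (untested sketch; the exact location of
the prototype in the headers may differ):

	--- a/include/linux/huge_mm.h
	+++ b/include/linux/huge_mm.h
	-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
	-				   unsigned long addr, pmd_t *pmd,
	-				   unsigned int flags);
	--- a/mm/internal.h
	+++ b/mm/internal.h
	+struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
	+				   unsigned long addr, pmd_t *pmd,
	+				   unsigned int flags);

Will fold that into the next version.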

> Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/8] pgtable: improve pte_protnone() comment
  2023-08-01 12:48 ` [PATCH v2 5/8] pgtable: improve pte_protnone() comment David Hildenbrand
@ 2023-08-02 15:35   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2023-08-02 15:35 UTC (permalink / raw
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, kvm, linux-kselftest,
	Andrew Morton, Linus Torvalds, liubo, Peter Xu, Matthew Wilcox,
	Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman,
	Shuah Khan, Paolo Bonzini

On Tue, Aug 01, 2023 at 02:48:41PM +0200, David Hildenbrand wrote:
> Especially the "For PROT_NONE VMAs, the PTEs are not marked
> _PAGE_PROTNONE" is wrong: doing an mprotect(PROT_NONE) will end up
> marking all PTEs on x86 as _PAGE_PROTNONE, making pte_protnone()
> indicate "yes".
> 
> So let's improve the comment, so it's easier to grasp which semantics
> pte_protnone() actually has.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
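
Right -- part of the confusion is that on x86-64 the same _PAGE_PROTNONE
encoding backs both NUMA hinting and mprotect(PROT_NONE) PTEs, so only the
VMA can tell the two apart. A purely hypothetical helper, not part of the
patch, just to illustrate the semantics:

	static inline bool pte_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
	{
		/*
		 * pte_protnone() only tells us "mapped, but not accessible".
		 * It indicates a pending NUMA hinting fault only if the VMA
		 * itself is accessible; in a PROT_NONE VMA it merely reflects
		 * the mprotect().
		 */
		return pte_protnone(pte) && vma_is_accessible(vma);
	}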

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-08-02 15:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2023-08-01 12:48 [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
2023-08-01 12:48 ` [PATCH v2 1/8] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT David Hildenbrand
2023-08-01 15:48   ` Peter Xu
2023-08-01 16:15     ` David Hildenbrand
2023-08-01 17:04       ` Peter Xu
2023-08-01 17:09         ` David Hildenbrand
2023-08-02 15:08   ` Mel Gorman
2023-08-02 15:12     ` David Hildenbrand
2023-08-01 12:48 ` [PATCH v2 2/8] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() David Hildenbrand
2023-08-02 15:16   ` Mel Gorman
2023-08-02 15:34     ` David Hildenbrand
2023-08-01 12:48 ` [PATCH v2 3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow() David Hildenbrand
2023-08-02 15:27   ` Mel Gorman
2023-08-02 15:29     ` David Hildenbrand
2023-08-01 12:48 ` [PATCH v2 4/8] mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT David Hildenbrand
2023-08-02 15:28   ` Mel Gorman
2023-08-01 12:48 ` [PATCH v2 5/8] pgtable: improve pte_protnone() comment David Hildenbrand
2023-08-02 15:35   ` Mel Gorman
2023-08-01 12:48 ` [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd() David Hildenbrand
2023-08-01 16:07   ` Peter Xu
2023-08-01 16:16     ` David Hildenbrand
2023-08-02 15:34   ` Mel Gorman
2023-08-01 12:48 ` [PATCH v2 7/8] selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if anything got merged David Hildenbrand
2023-08-01 12:48 ` [PATCH v2 8/8] selftest/mm: ksm_functional_tests: Add PROT_NONE test David Hildenbrand
