All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD
@ 2023-07-06 22:50 Axel Rasmussen
  2023-07-06 22:50 ` [PATCH v3 1/8] mm: make PTE_MARKER_SWAPIN_ERROR more general Axel Rasmussen
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.

The series is based on Linus master (partial 6.5 merge window), and
structured like this:

- Patches 1-3 are preparation / refactoring
- Patches 4-6 implement and advertise the new feature
- Patches 7-8 implement a unit test for the new feature

Changelog:

v2 -> v3:
 - Rebase onto current Linus master.
 - Don't overwrite existing PTE markers for non-hugetlb UFFDIO_POISON.
   Before, non-hugetlb would override them, but hugetlb would not. I don't
   think there's a use case where we *want* to override a UFFD_WP marker
   for example, so take the more conservative behavior for all kinds of
   memory.
 - [Peter] Drop hugetlb mfill atomic refactoring, since it isn't needed
   for this series (we don't touch that code directly anyway).
 - [Peter] Switch to re-using PTE_MARKER_SWAPIN_ERROR instead of defining
   new PTE_MARKER_UFFD_POISON.
 - [Peter] Extract start / len range overflow check into existing
   validate_range helper; this fixes the style issue of unnecessary braces
   in the UFFDIO_POISON implementation, because this code is just deleted.
 - [Peter] Extract file size check out into a new helper.
 - [Peter] Defer actually "enabling" the new feature until the last commit
   in the series; combine this with adding the documentation. As a
   consequence, move the selftest commits after this one.
 - [Randy] Fix typo in documentation.

v1 -> v2:
 - [Peter] Return VM_FAULT_HWPOISON not VM_FAULT_SIGBUS, to yield the
   correct behavior for KVM (guest MCE).
 - [Peter] Rename UFFDIO_SIGBUS to UFFDIO_POISON.
 - [Peter] Implement hugetlbfs support for UFFDIO_POISON.

Axel Rasmussen (8):
  mm: make PTE_MARKER_SWAPIN_ERROR more general
  mm: userfaultfd: check for start + len overflow in validate_range
  mm: userfaultfd: extract file size check out into a helper
  mm: userfaultfd: add new UFFDIO_POISON ioctl
  mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
  mm: userfaultfd: document and enable new UFFDIO_POISON feature
  selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
  selftests/mm: add uffd unit test for UFFDIO_POISON

 Documentation/admin-guide/mm/userfaultfd.rst |  15 +++
 fs/userfaultfd.c                             |  73 ++++++++++--
 include/linux/mm_inline.h                    |  19 +++
 include/linux/swapops.h                      |  10 +-
 include/linux/userfaultfd_k.h                |   4 +
 include/uapi/linux/userfaultfd.h             |  25 +++-
 mm/hugetlb.c                                 |  51 ++++++--
 mm/madvise.c                                 |   2 +-
 mm/memory.c                                  |  15 ++-
 mm/mprotect.c                                |   4 +-
 mm/shmem.c                                   |   4 +-
 mm/swapfile.c                                |   2 +-
 mm/userfaultfd.c                             |  83 ++++++++++---
 tools/testing/selftests/mm/uffd-common.c     |   5 +-
 tools/testing/selftests/mm/uffd-common.h     |   3 +
 tools/testing/selftests/mm/uffd-stress.c     |  12 +-
 tools/testing/selftests/mm/uffd-unit-tests.c | 117 +++++++++++++++++++
 17 files changed, 377 insertions(+), 67 deletions(-)

--
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v3 1/8] mm: make PTE_MARKER_SWAPIN_ERROR more general
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  2023-07-07 13:10   ` Peter Xu
  2023-07-06 22:50 ` [PATCH v3 2/8] mm: userfaultfd: check for start + len overflow in validate_range Axel Rasmussen
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

Future patches will re-use PTE_MARKER_SWAPIN_ERROR to implement
UFFDIO_POISON, so make some various preparations for that:

First, rename it to just PTE_MARKER_ERROR. The "SWAPIN" can be confusing
since we're going to re-use it for something not really related to swap.
This can be particularly confusing for things like hugetlbfs, which
doesn't support swap whatsoever. Also rename some various helper
functions.

Next, fix pte marker copying for hugetlbfs. Previously, it would WARN on
seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
But, since we're going to re-use it, we want it to go ahead and copy it
just like non-hugetlbfs memory does today. Since the code to do this is
more complicated now, pull it out into a helper which can be re-used in
both places. While we're at it, also make it slightly more explicit in
its handling of e.g. uffd wp markers.

For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for
an error entry, return VM_FAULT_HWPOISON. For most cases this change
doesn't matter, e.g. a userspace program would receive a SIGBUS either
way. But for UFFDIO_POISON, this change will let KVM guests get an MCE
out of the box, instead of giving a SIGBUS to the hypervisor and
requiring it to somehow inject an MCE.

Finally, for hugetlbfs faults, handle PTE_MARKER_ERROR, and return
VM_FAULT_HWPOISON_LARGE in such cases. Note that this can't happen today
because the lack of swap support means we'll never end up with such a
PTE anyway, but this behavior will be needed once such entries *can*
show up via UFFDIO_POISON.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/mm_inline.h | 19 +++++++++++++++++++
 include/linux/swapops.h   | 10 +++++-----
 mm/hugetlb.c              | 32 +++++++++++++++++++++-----------
 mm/madvise.c              |  2 +-
 mm/memory.c               | 15 +++++++++------
 mm/mprotect.c             |  4 ++--
 mm/shmem.c                |  4 ++--
 mm/swapfile.c             |  2 +-
 8 files changed, 60 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 21d6c72bcc71..329bd9370b49 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -523,6 +523,25 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }
 
+/*
+ * Computes the pte marker to copy from the given source entry into dst_vma.
+ * If no marker should be copied, returns 0.
+ * The caller should insert a new pte created with make_pte_marker().
+ */
+static inline pte_marker copy_pte_marker(
+		swp_entry_t entry, struct vm_area_struct *dst_vma)
+{
+	pte_marker srcm = pte_marker_get(entry);
+	/* Always copy error entries. */
+	pte_marker dstm = srcm & PTE_MARKER_ERROR;
+
+	/* Only copy PTE markers if UFFD register matches. */
+	if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma))
+		dstm |= PTE_MARKER_UFFD_WP;
+
+	return dstm;
+}
+
 /*
  * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
  * replace a none pte.  NOTE!  This should only be called when *pte is already
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4c932cb45e0b..5f1818d48dd6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -393,7 +393,7 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
 typedef unsigned long pte_marker;
 
 #define  PTE_MARKER_UFFD_WP			BIT(0)
-#define  PTE_MARKER_SWAPIN_ERROR		BIT(1)
+#define  PTE_MARKER_ERROR			BIT(1)
 #define  PTE_MARKER_MASK			(BIT(2) - 1)
 
 static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
@@ -421,15 +421,15 @@ static inline pte_t make_pte_marker(pte_marker marker)
 	return swp_entry_to_pte(make_pte_marker_entry(marker));
 }
 
-static inline swp_entry_t make_swapin_error_entry(void)
+static inline swp_entry_t make_error_swp_entry(void)
 {
-	return make_pte_marker_entry(PTE_MARKER_SWAPIN_ERROR);
+	return make_pte_marker_entry(PTE_MARKER_ERROR);
 }
 
-static inline int is_swapin_error_entry(swp_entry_t entry)
+static inline int is_error_swp_entry(swp_entry_t entry)
 {
 	return is_pte_marker_entry(entry) &&
-	    (pte_marker_get(entry) & PTE_MARKER_SWAPIN_ERROR);
+	    (pte_marker_get(entry) & PTE_MARKER_ERROR);
 }
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bce28cca73a1..934e129d9939 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
 #include <linux/nospec.h>
 #include <linux/delayacct.h>
 #include <linux/memory.h>
+#include <linux/mm_inline.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -5101,15 +5102,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				entry = huge_pte_clear_uffd_wp(entry);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		} else if (unlikely(is_pte_marker(entry))) {
-			/* No swap on hugetlb */
-			WARN_ON_ONCE(
-			    is_swapin_error_entry(pte_to_swp_entry(entry)));
-			/*
-			 * We copy the pte marker only if the dst vma has
-			 * uffd-wp enabled.
-			 */
-			if (userfaultfd_wp(dst_vma))
-				set_huge_pte_at(dst, addr, dst_pte, entry);
+			pte_marker marker = copy_pte_marker(
+				pte_to_swp_entry(entry), dst_vma);
+
+			if (marker)
+				set_huge_pte_at(dst, addr, dst_pte,
+						make_pte_marker(marker));
 		} else {
 			entry = huge_ptep_get(src_pte);
 			pte_folio = page_folio(pte_page(entry));
@@ -6090,14 +6088,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	entry = huge_ptep_get(ptep);
-	/* PTE markers should be handled the same way as none pte */
-	if (huge_pte_none_mostly(entry))
+	if (huge_pte_none_mostly(entry)) {
+		if (is_pte_marker(entry)) {
+			pte_marker marker =
+				pte_marker_get(pte_to_swp_entry(entry));
+
+			if (marker & PTE_MARKER_ERROR) {
+				ret = VM_FAULT_HWPOISON_LARGE;
+				goto out_mutex;
+			}
+		}
+
 		/*
+		 * Other PTE markers should be handled the same way as none PTE.
+		 *
 		 * hugetlb_no_page will drop vma lock and hugetlb fault
 		 * mutex internally, which make us return immediately.
 		 */
 		return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
 				      entry, flags);
+	}
 
 	ret = 0;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 886f06066622..59e954586e2a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -660,7 +660,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				free_swap_and_cache(entry);
 				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			} else if (is_hwpoison_entry(entry) ||
-				   is_swapin_error_entry(entry)) {
+				   is_error_swp_entry(entry)) {
 				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			}
 			continue;
diff --git a/mm/memory.c b/mm/memory.c
index 0ae594703021..c8b6de99d14c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -860,8 +860,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			return -EBUSY;
 		return -ENOENT;
 	} else if (is_pte_marker_entry(entry)) {
-		if (is_swapin_error_entry(entry) || userfaultfd_wp(dst_vma))
-			set_pte_at(dst_mm, addr, dst_pte, pte);
+		pte_marker marker = copy_pte_marker(entry, dst_vma);
+
+		if (marker)
+			set_pte_at(dst_mm, addr, dst_pte,
+				   make_pte_marker(marker));
 		return 0;
 	}
 	if (!userfaultfd_wp(dst_vma))
@@ -1500,7 +1503,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			    !zap_drop_file_uffd_wp(details))
 				continue;
 		} else if (is_hwpoison_entry(entry) ||
-			   is_swapin_error_entry(entry)) {
+			   is_error_swp_entry(entry)) {
 			if (!should_zap_cows(details))
 				continue;
 		} else {
@@ -3647,7 +3650,7 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 	 * none pte.  Otherwise it means the pte could have changed, so retry.
 	 *
 	 * This should also cover the case where e.g. the pte changed
-	 * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_SWAPIN_ERROR.
+	 * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_ERROR.
 	 * So is_pte_marker() check is not enough to safely drop the pte.
 	 */
 	if (pte_same(vmf->orig_pte, ptep_get(vmf->pte)))
@@ -3693,8 +3696,8 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 
 	/* Higher priority than uffd-wp when data corrupted */
-	if (marker & PTE_MARKER_SWAPIN_ERROR)
-		return VM_FAULT_SIGBUS;
+	if (marker & PTE_MARKER_ERROR)
+		return VM_FAULT_HWPOISON;
 
 	if (pte_marker_entry_uffd_wp(entry))
 		return pte_marker_handle_uffd_wp(vmf);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f658d483704..47d255c8c2f2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -230,10 +230,10 @@ static long change_pte_range(struct mmu_gather *tlb,
 					newpte = pte_swp_mkuffd_wp(newpte);
 			} else if (is_pte_marker_entry(entry)) {
 				/*
-				 * Ignore swapin errors unconditionally,
+				 * Ignore error swap entries unconditionally,
 				 * because any access should sigbus anyway.
 				 */
-				if (is_swapin_error_entry(entry))
+				if (is_error_swp_entry(entry))
 					continue;
 				/*
 				 * If this is uffd-wp pte marker and we'd like
diff --git a/mm/shmem.c b/mm/shmem.c
index 2f2e0e618072..c0f408c2c020 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1707,7 +1707,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	swp_entry_t swapin_error;
 	void *old;
 
-	swapin_error = make_swapin_error_entry();
+	swapin_error = make_error_swp_entry();
 	old = xa_cmpxchg_irq(&mapping->i_pages, index,
 			     swp_to_radix_entry(swap),
 			     swp_to_radix_entry(swapin_error), 0);
@@ -1752,7 +1752,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	swap = radix_to_swp_entry(*foliop);
 	*foliop = NULL;
 
-	if (is_swapin_error_entry(swap))
+	if (is_error_swp_entry(swap))
 		return -EIO;
 
 	si = get_swap_device(swap);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8e6dde68b389..72e110387e67 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1773,7 +1773,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 			swp_entry = make_hwpoison_entry(swapcache);
 			page = swapcache;
 		} else {
-			swp_entry = make_swapin_error_entry();
+			swp_entry = make_error_swp_entry();
 		}
 		new_pte = swp_entry_to_pte(swp_entry);
 		ret = 0;
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 2/8] mm: userfaultfd: check for start + len overflow in validate_range
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
  2023-07-06 22:50 ` [PATCH v3 1/8] mm: make PTE_MARKER_SWAPIN_ERROR more general Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  2023-07-07 13:14   ` Peter Xu
  2023-07-06 22:50 ` [PATCH v3 3/8] mm: userfaultfd: extract file size check out into a helper Axel Rasmussen
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

Most userfaultfd ioctls take a `start + len` range as an argument.
We have the validate_range helper to check that such ranges are valid.
However, some (but not all!) ioctls *also* check that `start + len`
doesn't wrap around (overflow).

Just check for this in validate_range. This saves some repetitive code,
and adds the check to some ioctls which weren't bothering to check for
it before.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/userfaultfd.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..2e84684c46f0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1306,6 +1306,8 @@ static __always_inline int validate_range(struct mm_struct *mm,
 		return -EINVAL;
 	if (len > task_size - start)
 		return -EINVAL;
+	if (start + len <= start)
+		return -EINVAL;
 	return 0;
 }
 
@@ -1760,14 +1762,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
 	if (ret)
 		goto out;
-	/*
-	 * double check for wraparound just in case. copy_from_user()
-	 * will later check uffdio_copy.src + uffdio_copy.len to fit
-	 * in the userland range.
-	 */
+
 	ret = -EINVAL;
-	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
-		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
 		goto out;
 	if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
@@ -1927,11 +1923,6 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 		goto out;
 
 	ret = -EINVAL;
-	/* double check for wraparound just in case. */
-	if (uffdio_continue.range.start + uffdio_continue.range.len <=
-	    uffdio_continue.range.start) {
-		goto out;
-	}
 	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE |
 				     UFFDIO_CONTINUE_MODE_WP))
 		goto out;
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 3/8] mm: userfaultfd: extract file size check out into a helper
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
  2023-07-06 22:50 ` [PATCH v3 1/8] mm: make PTE_MARKER_SWAPIN_ERROR more general Axel Rasmussen
  2023-07-06 22:50 ` [PATCH v3 2/8] mm: userfaultfd: check for start + len overflow in validate_range Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  2023-07-07 13:15   ` Peter Xu
  2023-07-06 22:50 ` [PATCH v3 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl Axel Rasmussen
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

This code is already duplicated twice, and UFFDIO_POISON will do the
same check a third time. So, it's worth extracting into a helper to save
repetitive lines of code.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 mm/userfaultfd.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a2bf37ee276d..4244ca7ee903 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -45,6 +45,22 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 	return dst_vma;
 }
 
+/* Check if dst_addr is outside of file's size. Must be called with ptl held. */
+static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
+				 unsigned long dst_addr)
+{
+	struct inode *inode;
+	pgoff_t offset, max_off;
+
+	if (!dst_vma->vm_file)
+		return false;
+
+	inode = dst_vma->vm_file->f_inode;
+	offset = linear_page_index(dst_vma, dst_addr);
+	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	return offset >= max_off;
+}
+
 /*
  * Install PTEs, to map dst_addr (within dst_vma) to page.
  *
@@ -64,8 +80,6 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
 	bool page_in_cache = page_mapping(page);
 	spinlock_t *ptl;
 	struct folio *folio;
-	struct inode *inode;
-	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
 	_dst_pte = pte_mkdirty(_dst_pte);
@@ -81,14 +95,9 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
 	if (!dst_pte)
 		goto out;
 
-	if (vma_is_shmem(dst_vma)) {
-		/* serialize against truncate with the page table lock */
-		inode = dst_vma->vm_file->f_inode;
-		offset = linear_page_index(dst_vma, dst_addr);
-		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	if (mfill_file_over_size(dst_vma, dst_addr)) {
 		ret = -EFAULT;
-		if (unlikely(offset >= max_off))
-			goto out_unlock;
+		goto out_unlock;
 	}
 
 	ret = -EEXIST;
@@ -211,8 +220,6 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
 	int ret;
-	pgoff_t offset, max_off;
-	struct inode *inode;
 
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
 					 dst_vma->vm_page_prot));
@@ -220,14 +227,9 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 	dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl);
 	if (!dst_pte)
 		goto out;
-	if (dst_vma->vm_file) {
-		/* the shmem MAP_PRIVATE case requires checking the i_size */
-		inode = dst_vma->vm_file->f_inode;
-		offset = linear_page_index(dst_vma, dst_addr);
-		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	if (mfill_file_over_size(dst_vma, dst_addr)) {
 		ret = -EFAULT;
-		if (unlikely(offset >= max_off))
-			goto out_unlock;
+		goto out_unlock;
 	}
 	ret = -EEXIST;
 	if (!pte_none(ptep_get(dst_pte)))
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
                   ` (2 preceding siblings ...)
  2023-07-06 22:50 ` [PATCH v3 3/8] mm: userfaultfd: extract file size check out into a helper Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  2023-07-07 13:37   ` Peter Xu
  2023-07-06 22:50 ` [PATCH v3 5/8] mm: userfaultfd: support UFFDIO_POISON for hugetlbfs Axel Rasmussen
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

The basic idea here is to "simulate" memory poisoning for VMs. A VM
running on some host might encounter a memory error, after which some
page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
once poisoned, pages can never become "un-poisoned". So, when we live
migrate the VM, we need to preserve the poisoned status of these pages.

When live migrating, we try to get the guest running on its new host as
quickly as possible. So, we start it running before all memory has been
copied, and before we're certain which pages should be poisoned or not.

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
  either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
  communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker
  so any future accesses will SIGBUS. Because the pte is now "present",
  future accesses won't generate more userfaultfd events, they'll just
  SIGBUS directly.

UFFDIO_POISON does not handle unmapping previously-present PTEs. This
isn't needed, because during live migration we want to intercept
all accesses with userfaultfd (not just writes, so WP mode isn't useful
for this). So whether minor or missing mode is being used (or both), the
PTE won't be present in any case, so handling that case isn't needed.

Similarly, UFFDIO_POISON won't replace existing PTE markers. This might
be okay to do, but it seems to be safer to just refuse to overwrite any
existing entry (like a UFFD_WP PTE marker).

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/userfaultfd.c                 | 58 ++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h    |  4 +++
 include/uapi/linux/userfaultfd.h | 16 +++++++++
 mm/userfaultfd.c                 | 48 +++++++++++++++++++++++++-
 4 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2e84684c46f0..53a7220c4679 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1956,6 +1956,61 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	return ret;
 }
 
+static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_poison uffdio_poison;
+	struct uffdio_poison __user *user_uffdio_poison;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_poison = (struct uffdio_poison __user *)arg;
+
+	ret = -EAGAIN;
+	if (atomic_read(&ctx->mmap_changing))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_poison, user_uffdio_poison,
+			   /* don't copy the output fields */
+			   sizeof(uffdio_poison) - (sizeof(__s64))))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_poison.range.start,
+			     uffdio_poison.range.len);
+	if (ret)
+		goto out;
+
+	ret = -EINVAL;
+	if (uffdio_poison.mode & ~UFFDIO_POISON_MODE_DONTWAKE)
+		goto out;
+
+	if (mmget_not_zero(ctx->mm)) {
+		ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start,
+					  uffdio_poison.range.len,
+					  &ctx->mmap_changing, 0);
+		mmput(ctx->mm);
+	} else {
+		return -ESRCH;
+	}
+
+	if (unlikely(put_user(ret, &user_uffdio_poison->updated)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_poison.mode & UFFDIO_POISON_MODE_DONTWAKE)) {
+		range.start = uffdio_poison.range.start;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_poison.range.len ? 0 : -EAGAIN;
+
+out:
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -2057,6 +2112,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_CONTINUE:
 		ret = userfaultfd_continue(ctx, arg);
 		break;
+	case UFFDIO_POISON:
+		ret = userfaultfd_poison(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac7b0c96d351..ac8c6854097c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -46,6 +46,7 @@ enum mfill_atomic_mode {
 	MFILL_ATOMIC_COPY,
 	MFILL_ATOMIC_ZEROPAGE,
 	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
 	NR_MFILL_ATOMIC_MODES,
 };
 
@@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
 extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
 				     unsigned long len, atomic_t *mmap_changing,
 				     uffd_flags_t flags);
+extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+				   unsigned long len, atomic_t *mmap_changing,
+				   uffd_flags_t flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 66dd4cd277bd..b5f07eacc697 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -71,6 +71,7 @@
 #define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_CONTINUE		(0x07)
+#define _UFFDIO_POISON			(0x08)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -91,6 +92,8 @@
 				      struct uffdio_writeprotect)
 #define UFFDIO_CONTINUE		_IOWR(UFFDIO, _UFFDIO_CONTINUE,	\
 				      struct uffdio_continue)
+#define UFFDIO_POISON		_IOWR(UFFDIO, _UFFDIO_POISON, \
+				      struct uffdio_poison)
 
 /* read() structure */
 struct uffd_msg {
@@ -225,6 +228,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
+#define UFFD_FEATURE_POISON			(1<<14)
 	__u64 features;
 
 	__u64 ioctls;
@@ -321,6 +325,18 @@ struct uffdio_continue {
 	__s64 mapped;
 };
 
+struct uffdio_poison {
+	struct uffdio_range range;
+#define UFFDIO_POISON_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * Fields below here are written by the ioctl and must be at the end:
+	 * the copy_from_user will not read past here.
+	 */
+	__s64 updated;
+};
+
 /*
  * Flags for the userfaultfd(2) system call itself.
  */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4244ca7ee903..899aa621d7c1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -288,6 +288,40 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	goto out;
 }
 
+/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
+static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
+				   struct vm_area_struct *dst_vma,
+				   unsigned long dst_addr,
+				   uffd_flags_t flags)
+{
+	int ret;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+
+	_dst_pte = make_pte_marker(PTE_MARKER_ERROR);
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (mfill_file_over_size(dst_vma, dst_addr)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	/* Refuse to overwrite any PTE, even a PTE marker (e.g. UFFD WP). */
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -339,7 +373,8 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * by THP.  Since we can not reliably insert a zero page, this
 	 * feature is not supported.
 	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
+	    uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
@@ -483,6 +518,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
 		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
 						 dst_addr, flags);
+	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
+		return mfill_atomic_pte_poison(dst_pmd, dst_vma,
+					       dst_addr, flags);
 	}
 
 	/*
@@ -704,6 +742,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
 			    uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
 }
 
+ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+			    unsigned long len, atomic_t *mmap_changing,
+			    uffd_flags_t flags)
+{
+	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+			    uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON));
+}
+
 long uffd_wp_range(struct vm_area_struct *dst_vma,
 		   unsigned long start, unsigned long len, bool enable_wp)
 {
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 5/8] mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
                   ` (3 preceding siblings ...)
  2023-07-06 22:50 ` [PATCH v3 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  2023-07-07 13:39   ` Peter Xu
  2023-07-06 22:50 ` [PATCH v3 6/8] mm: userfaultfd: document and enable new UFFDIO_POISON feature Axel Rasmussen
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

The behavior here is the same as it is for anon/shmem. This is done
separately because hugetlb pte marker handling is a bit different.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 mm/hugetlb.c     | 19 +++++++++++++++++++
 mm/userfaultfd.c |  3 +--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 934e129d9939..20c5f6a5420a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6263,6 +6263,25 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 	int writable;
 	bool folio_in_pagecache = false;
 
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
+		ptl = huge_pte_lock(h, dst_mm, dst_pte);
+
+		/* Don't overwrite any existing PTEs (even markers) */
+		if (!huge_pte_none(huge_ptep_get(dst_pte))) {
+			spin_unlock(ptl);
+			return -EEXIST;
+		}
+
+		_dst_pte = make_pte_marker(PTE_MARKER_ERROR);
+		set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+		spin_unlock(ptl);
+		return 0;
+	}
+
 	if (is_continue) {
 		ret = -EFAULT;
 		folio = filemap_lock_folio(mapping, idx);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 899aa621d7c1..9ce129fdd596 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -373,8 +373,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * by THP.  Since we can not reliably insert a zero page, this
 	 * feature is not supported.
 	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 6/8] mm: userfaultfd: document and enable new UFFDIO_POISON feature
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
                   ` (4 preceding siblings ...)
  2023-07-06 22:50 ` [PATCH v3 5/8] mm: userfaultfd: support UFFDIO_POISON for hugetlbfs Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  2023-07-06 22:50 ` [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers Axel Rasmussen
  2023-07-06 22:50 ` [PATCH v3 8/8] selftests/mm: add uffd unit test for UFFDIO_POISON Axel Rasmussen
  7 siblings, 0 replies; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

Update the userfaultfd API to advertise this feature as part of feature
flags and supported ioctls (returned upon registration).

Add basic documentation describing the new feature.

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 15 +++++++++++++++
 include/uapi/linux/userfaultfd.h             |  9 ++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 7c304e432205..4349a8c2b978 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -244,6 +244,21 @@ write-protected (so future writes will also result in a WP fault). These ioctls
 support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
 respectively) to configure the mapping this way.
 
+Memory Poisioning Emulation
+---------------------------
+
+In response to a fault (either missing or minor), an action userspace can
+take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
+future faulters to either get a SIGBUS, or in KVM's case the guest will
+receive an MCE as if there were hardware memory poisoning.
+
+This is used to emulate hardware memory poisoning. Imagine a VM running on a
+machine which experiences a real hardware memory error. Later, we live migrate
+the VM to another physical machine. Since we want the migration to be
+transparent to the guest, we want that same address range to act as if it was
+still poisoned, even though it's on a new physical host which ostensibly
+doesn't have a memory error in the exact same spot.
+
 QEMU/KVM
 ========
 
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index b5f07eacc697..62151706c5a3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -39,7 +39,8 @@
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
-			   UFFD_FEATURE_WP_UNPOPULATED)
+			   UFFD_FEATURE_WP_UNPOPULATED |	\
+			   UFFD_FEATURE_POISON)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -49,12 +50,14 @@
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
 	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
-	 (__u64)1 << _UFFDIO_CONTINUE)
+	 (__u64)1 << _UFFDIO_CONTINUE |		\
+	 (__u64)1 << _UFFDIO_POISON)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
 	 (__u64)1 << _UFFDIO_CONTINUE |		\
-	 (__u64)1 << _UFFDIO_WRITEPROTECT)
+	 (__u64)1 << _UFFDIO_POISON)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
                   ` (5 preceding siblings ...)
  2023-07-06 22:50 ` [PATCH v3 6/8] mm: userfaultfd: document and enable new UFFDIO_POISON feature Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  2023-07-07 13:42   ` Peter Xu
  2023-07-06 22:50 ` [PATCH v3 8/8] selftests/mm: add uffd unit test for UFFDIO_POISON Axel Rasmussen
  7 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

Previously, we had "one fault handler to rule them all", which used
several branches to deal with all of the scenarios required by all of
the various tests.

In upcoming patches, I plan to add a new test, which has its own
slightly different fault handling logic. Instead of continuing to add
cruft to the existing fault handler, let's allow tests to define custom
ones, separate from other tests.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/mm/uffd-common.c |  5 ++++-
 tools/testing/selftests/mm/uffd-common.h |  3 +++
 tools/testing/selftests/mm/uffd-stress.c | 12 +++++++-----
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/mm/uffd-common.c b/tools/testing/selftests/mm/uffd-common.c
index ba20d7504022..02b89860e193 100644
--- a/tools/testing/selftests/mm/uffd-common.c
+++ b/tools/testing/selftests/mm/uffd-common.c
@@ -499,6 +499,9 @@ void *uffd_poll_thread(void *arg)
 	int ret;
 	char tmp_chr;
 
+	if (!args->handle_fault)
+		args->handle_fault = uffd_handle_page_fault;
+
 	pollfd[0].fd = uffd;
 	pollfd[0].events = POLLIN;
 	pollfd[1].fd = pipefd[cpu*2];
@@ -527,7 +530,7 @@ void *uffd_poll_thread(void *arg)
 			err("unexpected msg event %u\n", msg.event);
 			break;
 		case UFFD_EVENT_PAGEFAULT:
-			uffd_handle_page_fault(&msg, args);
+			args->handle_fault(&msg, args);
 			break;
 		case UFFD_EVENT_FORK:
 			close(uffd);
diff --git a/tools/testing/selftests/mm/uffd-common.h b/tools/testing/selftests/mm/uffd-common.h
index 197f5262fe0d..7c4fa964c3b0 100644
--- a/tools/testing/selftests/mm/uffd-common.h
+++ b/tools/testing/selftests/mm/uffd-common.h
@@ -77,6 +77,9 @@ struct uffd_args {
 	unsigned long missing_faults;
 	unsigned long wp_faults;
 	unsigned long minor_faults;
+
+	/* A custom fault handler; defaults to uffd_handle_page_fault. */
+	void (*handle_fault)(struct uffd_msg *msg, struct uffd_args *args);
 };
 
 struct uffd_test_ops {
diff --git a/tools/testing/selftests/mm/uffd-stress.c b/tools/testing/selftests/mm/uffd-stress.c
index 995ff13e74c7..50b1224d72c7 100644
--- a/tools/testing/selftests/mm/uffd-stress.c
+++ b/tools/testing/selftests/mm/uffd-stress.c
@@ -189,10 +189,8 @@ static int stress(struct uffd_args *args)
 				   locking_thread, (void *)cpu))
 			return 1;
 		if (bounces & BOUNCE_POLL) {
-			if (pthread_create(&uffd_threads[cpu], &attr,
-					   uffd_poll_thread,
-					   (void *)&args[cpu]))
-				return 1;
+			if (pthread_create(&uffd_threads[cpu], &attr, uffd_poll_thread, &args[cpu]))
+				err("uffd_poll_thread create");
 		} else {
 			if (pthread_create(&uffd_threads[cpu], &attr,
 					   uffd_read_thread,
@@ -247,9 +245,13 @@ static int userfaultfd_stress(void)
 {
 	void *area;
 	unsigned long nr;
-	struct uffd_args args[nr_cpus];
+	struct uffd_args *args;
 	uint64_t mem_size = nr_pages * page_size;
 
+	args = calloc(nr_cpus, sizeof(struct uffd_args));
+	if (!args)
+		err("allocating args array failed");
+
 	if (uffd_test_ctx_init(UFFD_FEATURE_WP_UNPOPULATED, NULL))
 		err("context init failed");
 
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 8/8] selftests/mm: add uffd unit test for UFFDIO_POISON
  2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
                   ` (6 preceding siblings ...)
  2023-07-06 22:50 ` [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers Axel Rasmussen
@ 2023-07-06 22:50 ` Axel Rasmussen
  7 siblings, 0 replies; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-06 22:50 UTC (permalink / raw
  To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao,
	ZhangPeng
  Cc: linux-doc, linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
	Axel Rasmussen

The test is pretty basic, and exercises UFFDIO_POISON straightforwardly.
We register a region with userfaultfd, in missing fault mode. For each
fault, we either UFFDIO_COPY a zeroed page (odd pages) or UFFDIO_POISON
(even pages). We do this mix to test "something like a real use case",
where guest memory would be some mix of poisoned and non-poisoned pages.

We read each page in the region, and assert that the odd pages are
zeroed as expected, and the even pages yield a SIGBUS as expected.

Why UFFDIO_COPY instead of UFFDIO_ZEROPAGE? Because hugetlb doesn't
support UFFDIO_ZEROPAGE, and we don't want to have special case code.

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/mm/uffd-unit-tests.c | 117 +++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index 04d91f144d1c..2709a34a39c5 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -951,6 +951,117 @@ static void uffd_zeropage_test(uffd_test_args_t *args)
 	uffd_test_pass();
 }
 
+static void uffd_register_poison(int uffd, void *addr, uint64_t len)
+{
+	uint64_t ioctls = 0;
+	uint64_t expected = (1 << _UFFDIO_COPY) | (1 << _UFFDIO_POISON);
+
+	if (uffd_register_with_ioctls(uffd, addr, len, true,
+				      false, false, &ioctls))
+		err("poison register fail");
+
+	if ((ioctls & expected) != expected)
+		err("registered area doesn't support COPY and POISON ioctls");
+}
+
+static void do_uffdio_poison(int uffd, unsigned long offset)
+{
+	struct uffdio_poison uffdio_poison = { 0 };
+	int ret;
+	__s64 res;
+
+	uffdio_poison.range.start = (unsigned long) area_dst + offset;
+	uffdio_poison.range.len = page_size;
+	uffdio_poison.mode = 0;
+	ret = ioctl(uffd, UFFDIO_POISON, &uffdio_poison);
+	res = uffdio_poison.updated;
+
+	if (ret)
+		err("UFFDIO_POISON error: %"PRId64, (int64_t)res);
+	else if (res != page_size)
+		err("UFFDIO_POISON unexpected size: %"PRId64, (int64_t)res);
+}
+
+static void uffd_poison_handle_fault(
+	struct uffd_msg *msg, struct uffd_args *args)
+{
+	unsigned long offset;
+
+	if (msg->event != UFFD_EVENT_PAGEFAULT)
+		err("unexpected msg event %u", msg->event);
+
+	if (msg->arg.pagefault.flags &
+	    (UFFD_PAGEFAULT_FLAG_WP | UFFD_PAGEFAULT_FLAG_MINOR))
+		err("unexpected fault type %llu", msg->arg.pagefault.flags);
+
+	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+	offset &= ~(page_size-1);
+
+	/* Odd pages -> copy zeroed page; even pages -> poison. */
+	if (offset & page_size)
+		copy_page(uffd, offset, false);
+	else
+		do_uffdio_poison(uffd, offset);
+}
+
+static void uffd_poison_test(uffd_test_args_t *targs)
+{
+	pthread_t uffd_mon;
+	char c;
+	struct uffd_args args = { 0 };
+	struct sigaction act = { 0 };
+	unsigned long nr_sigbus = 0;
+	unsigned long nr;
+
+	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
+
+	uffd_register_poison(uffd, area_dst, nr_pages * page_size);
+	memset(area_src, 0, nr_pages * page_size);
+
+	args.handle_fault = uffd_poison_handle_fault;
+	if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &args))
+		err("uffd_poll_thread create");
+
+	sigbuf = &jbuf;
+	act.sa_sigaction = sighndl;
+	act.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGBUS, &act, 0))
+		err("sigaction");
+
+	for (nr = 0; nr < nr_pages; ++nr) {
+		unsigned long offset = nr * page_size;
+		const char *bytes = (const char *) area_dst + offset;
+		const char *i;
+
+		if (sigsetjmp(*sigbuf, 1)) {
+			/*
+			 * Access below triggered a SIGBUS, which was caught by
+			 * sighndl, which then jumped here. Count this SIGBUS,
+			 * and move on to next page.
+			 */
+			++nr_sigbus;
+			continue;
+		}
+
+		for (i = bytes; i < bytes + page_size; ++i) {
+			if (*i)
+				err("nonzero byte in area_dst (%p) at %p: %u",
+				    area_dst, i, *i);
+		}
+	}
+
+	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
+	if (pthread_join(uffd_mon, NULL))
+		err("pthread_join()");
+
+	if (nr_sigbus != nr_pages / 2)
+		err("expected to receive %lu SIGBUS, actually received %lu",
+		    nr_pages / 2, nr_sigbus);
+
+	uffd_test_pass();
+}
+
 /*
  * Test the returned uffdio_register.ioctls with different register modes.
  * Note that _UFFDIO_ZEROPAGE is tested separately in the zeropage test.
@@ -1126,6 +1237,12 @@ uffd_test_case_t uffd_tests[] = {
 		UFFD_FEATURE_PAGEFAULT_FLAG_WP |
 		UFFD_FEATURE_WP_HUGETLBFS_SHMEM,
 	},
+	{
+		.name = "poison",
+		.uffd_fn = uffd_poison_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required = UFFD_FEATURE_POISON,
+	},
 };
 
 static void usage(const char *prog)
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 1/8] mm: make PTE_MARKER_SWAPIN_ERROR more general
  2023-07-06 22:50 ` [PATCH v3 1/8] mm: make PTE_MARKER_SWAPIN_ERROR more general Axel Rasmussen
@ 2023-07-07 13:10   ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2023-07-07 13:10 UTC (permalink / raw
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Thu, Jul 06, 2023 at 03:50:29PM -0700, Axel Rasmussen wrote:
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 886f06066622..59e954586e2a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -660,7 +660,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				free_swap_and_cache(entry);
>  				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  			} else if (is_hwpoison_entry(entry) ||
> -				   is_swapin_error_entry(entry)) {
> +				   is_error_swp_entry(entry)) {

is_error_swp_entry() made me think whether we should call this marker as
ERROR at all - it's too generic, is_error_swp_entry() can be easily read as
"this swap entry is not legal".  Can we come up with a less generic one?

PTE_MARKER_PAGE_POISONED / PTE_MARKER_POISONED / PTE_MARKER_PAGE_ERR / ...?

We do use poisoned only in real bad physical pages before, but I think we
can still keep using it when applying it to a pte.  I think it's fine to
call a pte poisoned if it's not legal to access, just like a poisoned page.

>  				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  			}
>  			continue;

The code change across the whole patch look good to me otherwise, thanks.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 2/8] mm: userfaultfd: check for start + len overflow in validate_range
  2023-07-06 22:50 ` [PATCH v3 2/8] mm: userfaultfd: check for start + len overflow in validate_range Axel Rasmussen
@ 2023-07-07 13:14   ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2023-07-07 13:14 UTC (permalink / raw
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Thu, Jul 06, 2023 at 03:50:30PM -0700, Axel Rasmussen wrote:
> Most userfaultfd ioctls take a `start + len` range as an argument.
> We have the validate_range helper to check that such ranges are valid.
> However, some (but not all!) ioctls *also* check that `start + len`
> doesn't wrap around (overflow).
> 
> Just check for this in validate_range. This saves some repetitive code,
> and adds the check to some ioctls which weren't bothering to check for
> it before.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/8] mm: userfaultfd: extract file size check out into a helper
  2023-07-06 22:50 ` [PATCH v3 3/8] mm: userfaultfd: extract file size check out into a helper Axel Rasmussen
@ 2023-07-07 13:15   ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2023-07-07 13:15 UTC (permalink / raw
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Thu, Jul 06, 2023 at 03:50:31PM -0700, Axel Rasmussen wrote:
> This code is already duplicated twice, and UFFDIO_POISON will do the
> same check a third time. So, it's worth extracting into a helper to save
> repetitive lines of code.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl
  2023-07-06 22:50 ` [PATCH v3 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl Axel Rasmussen
@ 2023-07-07 13:37   ` Peter Xu
  2023-07-07 19:56     ` Axel Rasmussen
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2023-07-07 13:37 UTC (permalink / raw
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Thu, Jul 06, 2023 at 03:50:32PM -0700, Axel Rasmussen wrote:
> The basic idea here is to "simulate" memory poisoning for VMs. A VM
> running on some host might encounter a memory error, after which some
> page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> once poisoned, pages can never become "un-poisoned". So, when we live
> migrate the VM, we need to preserve the poisoned status of these pages.
> 
> When live migrating, we try to get the guest running on its new host as
> quickly as possible. So, we start it running before all memory has been
> copied, and before we're certain which pages should be poisoned or not.
> 
> So the basic way to use this new feature is:
> 
> - On the new host, the guest's memory is registered with userfaultfd, in
>   either MISSING or MINOR mode (doesn't really matter for this purpose).
> - On any first access, we get a userfaultfd event. At this point we can
>   communicate with the old host to find out if the page was poisoned.
> - If so, we can respond with a UFFDIO_POISON - this places a swap marker
>   so any future accesses will SIGBUS. Because the pte is now "present",
>   future accesses won't generate more userfaultfd events, they'll just
>   SIGBUS directly.
> 
> UFFDIO_POISON does not handle unmapping previously-present PTEs. This
> isn't needed, because during live migration we want to intercept
> all accesses with userfaultfd (not just writes, so WP mode isn't useful
> for this). So whether minor or missing mode is being used (or both), the
> PTE won't be present in any case, so handling that case isn't needed.
> 
> Similarly, UFFDIO_POISON won't replace existing PTE markers. This might
> be okay to do, but it seems to be safer to just refuse to overwrite any
> existing entry (like a UFFD_WP PTE marker).
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

I agree the current behavior is not as clear, especially after hwpoison
introduced.

uffdio-copy is special right now that it can overwrite a marker, so a buggy
userapp can also overwrite a poisoned entry, but it also means the userapp
is broken already, so may not really matter much.

While zeropage wasn't doing that. I think that was just overlooked - i
assume it has the same reasoning as uffdio-copy otherwise.. and no one just
used zeropage over a wp marker yet, or just got it work-arounded by
unprotect+zeropage.

Not yet sure whether it'll make sense to unify this a bit, but making the
new poison api to be strict look fine.  If you have any thoughts after
reading feel free to keep the discussion going, I can ack this one I think
(besides my rename request in 1st patch):

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 5/8] mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
  2023-07-06 22:50 ` [PATCH v3 5/8] mm: userfaultfd: support UFFDIO_POISON for hugetlbfs Axel Rasmussen
@ 2023-07-07 13:39   ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2023-07-07 13:39 UTC (permalink / raw
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Thu, Jul 06, 2023 at 03:50:33PM -0700, Axel Rasmussen wrote:
> The behavior here is the same as it is for anon/shmem. This is done
> separately because hugetlb pte marker handling is a bit different.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
  2023-07-06 22:50 ` [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers Axel Rasmussen
@ 2023-07-07 13:42   ` Peter Xu
  2023-07-07 17:03     ` Axel Rasmussen
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2023-07-07 13:42 UTC (permalink / raw
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Thu, Jul 06, 2023 at 03:50:35PM -0700, Axel Rasmussen wrote:
> @@ -247,9 +245,13 @@ static int userfaultfd_stress(void)
>  {
>  	void *area;
>  	unsigned long nr;
> -	struct uffd_args args[nr_cpus];
> +	struct uffd_args *args;
>  	uint64_t mem_size = nr_pages * page_size;
>  
> +	args = calloc(nr_cpus, sizeof(struct uffd_args));
> +	if (!args)
> +		err("allocating args array failed");

This is trivial, but I think I requested a "free" (or keep it allocate on
stack) in previous version but it didn't get a response on why we cannot
and it kept going..  could you help explain?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
  2023-07-07 13:42   ` Peter Xu
@ 2023-07-07 17:03     ` Axel Rasmussen
  2023-07-07 20:38       ` Axel Rasmussen
  0 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-07 17:03 UTC (permalink / raw
  To: Peter Xu
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 7, 2023 at 6:42 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jul 06, 2023 at 03:50:35PM -0700, Axel Rasmussen wrote:
> > @@ -247,9 +245,13 @@ static int userfaultfd_stress(void)
> >  {
> >       void *area;
> >       unsigned long nr;
> > -     struct uffd_args args[nr_cpus];
> > +     struct uffd_args *args;
> >       uint64_t mem_size = nr_pages * page_size;
> >
> > +     args = calloc(nr_cpus, sizeof(struct uffd_args));
> > +     if (!args)
> > +             err("allocating args array failed");
>
> This is trivial, but I think I requested a "free" (or keep it allocate on
> stack) in previous version but it didn't get a response on why we cannot
> and it kept going..  could you help explain?

Oh, sorry! I had meant to change this after our discussion, and simply
overlooked it while reworking the patches.

I'll include this change in a v4 which also addresses e.g. the
comments on commit 1.

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl
  2023-07-07 13:37   ` Peter Xu
@ 2023-07-07 19:56     ` Axel Rasmussen
  0 siblings, 0 replies; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-07 19:56 UTC (permalink / raw
  To: Peter Xu
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 7, 2023 at 6:37 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jul 06, 2023 at 03:50:32PM -0700, Axel Rasmussen wrote:
> > The basic idea here is to "simulate" memory poisoning for VMs. A VM
> > running on some host might encounter a memory error, after which some
> > page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> > once poisoned, pages can never become "un-poisoned". So, when we live
> > migrate the VM, we need to preserve the poisoned status of these pages.
> >
> > When live migrating, we try to get the guest running on its new host as
> > quickly as possible. So, we start it running before all memory has been
> > copied, and before we're certain which pages should be poisoned or not.
> >
> > So the basic way to use this new feature is:
> >
> > - On the new host, the guest's memory is registered with userfaultfd, in
> >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > - On any first access, we get a userfaultfd event. At this point we can
> >   communicate with the old host to find out if the page was poisoned.
> > - If so, we can respond with a UFFDIO_POISON - this places a swap marker
> >   so any future accesses will SIGBUS. Because the pte is now "present",
> >   future accesses won't generate more userfaultfd events, they'll just
> >   SIGBUS directly.
> >
> > UFFDIO_POISON does not handle unmapping previously-present PTEs. This
> > isn't needed, because during live migration we want to intercept
> > all accesses with userfaultfd (not just writes, so WP mode isn't useful
> > for this). So whether minor or missing mode is being used (or both), the
> > PTE won't be present in any case, so handling that case isn't needed.
> >
> > Similarly, UFFDIO_POISON won't replace existing PTE markers. This might
> > be okay to do, but it seems to be safer to just refuse to overwrite any
> > existing entry (like a UFFD_WP PTE marker).
> >
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
>
> I agree the current behavior is not as clear, especially after hwpoison
> introduced.
>
> uffdio-copy is special right now that it can overwrite a marker, so a buggy
> userapp can also overwrite a poisoned entry, but it also means the userapp
> is broken already, so may not really matter much.
>
> While zeropage wasn't doing that. I think that was just overlooked - i
> assume it has the same reasoning as uffdio-copy otherwise.. and no one just
> used zeropage over a wp marker yet, or just got it work-arounded by
> unprotect+zeropage.
>
> Not yet sure whether it'll make sense to unify this a bit, but making the
> new poison api to be strict look fine.  If you have any thoughts after
> reading feel free to keep the discussion going, I can ack this one I think
> (besides my rename request in 1st patch):

Agreed, it would be nice to unify things. In my v2 I had anon/shmem
and hugetlbfs behaving differently in this respect, for the same
reason - it was just overlooked / cargo culted from existing code. If
nothing else I think a single ioctl should be consistent across memory
types! Heh.

But I also think you're right and it's not exactly intentional that
copy / zeropage / etc are different in this respect. Some unification
would be nice, although I'm not 100% sure what that looks like
concretely.

My rule of thumb is, in cases where we can't imagine a real use case,
it's better to be too strict rather than too loose. And in the future,
it's less disruptive to loosen restrictions rather than tighten them
(potentially breaking something which used to work).

I'll leave untangling this to some future series, though.

>
> Acked-by: Peter Xu <peterx@redhat.com>
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
  2023-07-07 17:03     ` Axel Rasmussen
@ 2023-07-07 20:38       ` Axel Rasmussen
  2023-07-07 21:00         ` Peter Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Axel Rasmussen @ 2023-07-07 20:38 UTC (permalink / raw
  To: Peter Xu
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 7, 2023 at 10:03 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Fri, Jul 7, 2023 at 6:42 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Jul 06, 2023 at 03:50:35PM -0700, Axel Rasmussen wrote:
> > > @@ -247,9 +245,13 @@ static int userfaultfd_stress(void)
> > >  {
> > >       void *area;
> > >       unsigned long nr;
> > > -     struct uffd_args args[nr_cpus];
> > > +     struct uffd_args *args;
> > >       uint64_t mem_size = nr_pages * page_size;
> > >
> > > +     args = calloc(nr_cpus, sizeof(struct uffd_args));
> > > +     if (!args)
> > > +             err("allocating args array failed");
> >
> > This is trivial, but I think I requested a "free" (or keep it allocate on
> > stack) in previous version but it didn't get a response on why we cannot
> > and it kept going..  could you help explain?
>
> Oh, sorry! I had meant to change this after our discussion, and simply
> overlooked it while reworking the patches.
>
> I'll include this change in a v4 which also addresses e.g. the
> comments on commit 1.

Ah, so I tried switching back to the {0} initializer, and was reminded
why I didn't do that in v1. :) Ignoring the missing braces warning I
talked about before, using {0} here is actually an error
("variable-sized object may not be initialized") because this is a
variable sized array (nr_cpus isn't constant). So, that option is out.

I'm not a huge fan of adding the free() cleanup and dealing with all
of the err() calls this function has.

Originally I switched to calloc() because I'm not a big fan of VLAs
anyway. But, as a compromise in v4 I'll leave it a VLA, and switch to
memset() for initializing it.

>
> >
> > --
> > Peter Xu
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
  2023-07-07 20:38       ` Axel Rasmussen
@ 2023-07-07 21:00         ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2023-07-07 21:00 UTC (permalink / raw
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, Jan Alexander Steffens (heftig), Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, Liam R. Howlett, Miaohe Lin,
	Mike Kravetz, Mike Rapoport (IBM), Muchun Song, Nadav Amit,
	Naoya Horiguchi, Ryan Roberts, Shuah Khan, Suleiman Souhlal,
	Suren Baghdasaryan, T.J. Alumbaugh, Yu Zhao, ZhangPeng, linux-doc,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Fri, Jul 07, 2023 at 01:38:16PM -0700, Axel Rasmussen wrote:
> Ah, so I tried switching back to the {0} initializer, and was reminded
> why I didn't do that in v1. :) Ignoring the missing braces warning I
> talked about before, using {0} here is actually an error
> ("variable-sized object may not be initialized") because this is a
> variable sized array (nr_cpus isn't constant). So, that option is out.
> 
> I'm not a huge fan of adding the free() cleanup and dealing with all
> of the err() calls this function has.

Oh, that's definitely not needed - as long as we know we're going to quit,
we let kernel clean everything is fine.

I just worry in the future there can be a loop of userfaultfd_stress() so
it can OOM a host even if no err() hit but by looping.  I hope I explained
what I meant.. so it's still good we make sure things freed properly when
in success paths and when we're at it.

> 
> Originally I switched to calloc() because I'm not a big fan of VLAs
> anyway. But, as a compromise in v4 I'll leave it a VLA, and switch to
> memset() for initializing it.

That'll be good enough to me.  Thanks a lot,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-07-07 21:01 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-06 22:50 [PATCH v3 0/8] add UFFDIO_POISON to simulate memory poisoning with UFFD Axel Rasmussen
2023-07-06 22:50 ` [PATCH v3 1/8] mm: make PTE_MARKER_SWAPIN_ERROR more general Axel Rasmussen
2023-07-07 13:10   ` Peter Xu
2023-07-06 22:50 ` [PATCH v3 2/8] mm: userfaultfd: check for start + len overflow in validate_range Axel Rasmussen
2023-07-07 13:14   ` Peter Xu
2023-07-06 22:50 ` [PATCH v3 3/8] mm: userfaultfd: extract file size check out into a helper Axel Rasmussen
2023-07-07 13:15   ` Peter Xu
2023-07-06 22:50 ` [PATCH v3 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl Axel Rasmussen
2023-07-07 13:37   ` Peter Xu
2023-07-07 19:56     ` Axel Rasmussen
2023-07-06 22:50 ` [PATCH v3 5/8] mm: userfaultfd: support UFFDIO_POISON for hugetlbfs Axel Rasmussen
2023-07-07 13:39   ` Peter Xu
2023-07-06 22:50 ` [PATCH v3 6/8] mm: userfaultfd: document and enable new UFFDIO_POISON feature Axel Rasmussen
2023-07-06 22:50 ` [PATCH v3 7/8] selftests/mm: refactor uffd_poll_thread to allow custom fault handlers Axel Rasmussen
2023-07-07 13:42   ` Peter Xu
2023-07-07 17:03     ` Axel Rasmussen
2023-07-07 20:38       ` Axel Rasmussen
2023-07-07 21:00         ` Peter Xu
2023-07-06 22:50 ` [PATCH v3 8/8] selftests/mm: add uffd unit test for UFFDIO_POISON Axel Rasmussen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.