* [PATCH v3 0/6] large folios swap-in: handle refault cases first
@ 2024-05-03  0:50 Barry Song
  2024-05-03  0:50 ` [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
                   ` (5 more replies)
  0 siblings, 6 replies; 49+ messages in thread
From: Barry Song @ 2024-05-03  0:50 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

From: Barry Song <v-songbaohua@oppo.com>

This patchset is extracted from the large folio swap-in series[1] and primarily
addresses the handling of large folios that are already in the swap cache. In
particular, it focuses on the refault of an mTHP that is still undergoing
reclamation. Splitting this part out should streamline code review and expedite
the integration of this segment into the MM tree.

It relies on Ryan's swap-out series[2], leveraging the helper function
swap_pte_batch() introduced by that series.

Presently, do_swap_page() only encounters a large folio in the swap
cache when the folio has not yet been released by vmscan. However, the
code should remain equally useful once we support large folio swap-in
via swapin_readahead(). This approach can effectively reduce page faults
and eliminate most of the redundant checks and early exits for MTE
restoration added by the recent MTE patchset[3].

The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead()
will be split into separate patch sets and sent at a later time.

-v3:
 - optimize swap_free_nr() to use a bitmap with a single long, per "Huang,
   Ying";
 - drop swap_free() as suggested by "Huang, Ying"; hibernation can now get
   batched freeing;
 - lots of cleanup in do_swap_page() as commented by Ryan Roberts and "Huang,
   Ying";
 - handle arch_do_swap_page() with nr pages even though the only platform
   which needs it, sparc, doesn't support THP_SWAP, as suggested by "Huang,
   Ying";
 - introduce pte_move_swp_offset() as suggested by "Huang, Ying";
 - drop the "any_shared" output when checking swap entries, per David's
   comment;
 - drop the swapin_refault counter and keep it only for debug purposes, per
   Ying;
 - collect reviewed-by tags

-v2:
 - rebase on top of mm-unstable, in which Ryan's swap_pte_batch() has changed
   a lot.
 - remove folio_add_new_anon_rmap() for !folio_test_anon(), as currently
   large folios are always anon (refault).
 - add mTHP swpin refault counters
  Link: https://lore.kernel.org/linux-mm/20240409082631.187483-1-21cnbao@gmail.com/

-v1:
  Link: https://lore.kernel.org/linux-mm/20240402073237.240995-1-21cnbao@gmail.com/

Differences from the original large folio swap-in series:
 - collect Reviewed-by and Acked-by tags;
 - rename swap_nr_free() to swap_free_nr(), per Ryan;
 - limit the maximum kernel stack usage of swap_free_nr(), per Ryan;
 - add an output argument to swap_pte_batch() to expose whether all entries
   are exclusive;
 - many cleanup refinements; handle the corner case where a folio's virtual
   address might not be naturally aligned

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20240322114136.61386-1-21cnbao@gmail.com/

Barry Song (3):
  mm: remove swap_free() and always use swap_free_nr()
  mm: introduce pte_move_swp_offset() helper which can move offset
    bidirectionally
  mm: introduce arch_do_swap_page_nr() which allows restore metadata for
    nr pages

Chuanhua Han (3):
  mm: swap: introduce swap_free_nr() for batched swap_free()
  mm: swap: make should_try_to_free_swap() support large-folio
  mm: swap: entirely map large folios found in swapcache

 include/linux/pgtable.h | 26 ++++++++++++++----
 include/linux/swap.h    |  4 +--
 kernel/power/swap.c     |  7 ++---
 mm/internal.h           | 25 ++++++++++++++---
 mm/memory.c             | 61 +++++++++++++++++++++++++++++++++--------
 mm/rmap.c               |  4 +--
 mm/shmem.c              |  4 +--
 mm/swapfile.c           | 50 +++++++++++++++++++++++++++++----
 8 files changed, 143 insertions(+), 38 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-05-03  0:50 [PATCH v3 0/6] large folios swap-in: handle refault cases first Barry Song
@ 2024-05-03  0:50 ` Barry Song
  2024-05-03  9:26   ` Ryan Roberts
                     ` (2 more replies)
  2024-05-03  0:50 ` [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr() Barry Song
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 49+ messages in thread
From: Barry Song @ 2024-05-03  0:50 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

From: Chuanhua Han <hanchuanhua@oppo.com>

While swapping in a large folio, we need to free swaps related to the whole
folio. To avoid frequently acquiring and releasing swap locks, it is better
to introduce an API for batched free.
Furthermore, this new function, swap_free_nr(), is designed to efficiently
handle various scenarios for releasing a specified number, nr, of swap
entries.
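
For illustration, here is a usage sketch (not part of this patch; the caller
shown is hypothetical) of how a swap-in path that has just decided to map a
whole large folio could release all of its swap entries with a single call:

	/*
	 * Usage sketch only: free every swap entry backing a large folio.
	 * folio->swap is the swap entry of the folio's first subpage, so
	 * the batch may internally span several swap clusters.
	 */
	swp_entry_t entry = folio->swap;
	int nr = folio_nr_pages(folio);

	swap_free_nr(entry, nr);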

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/swap.h |  5 +++++
 mm/swapfile.c        | 47 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..d1d35e92d7e9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
+extern void swap_free_nr(swp_entry_t entry, int nr_pages);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
@@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
+static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+}
+
 static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
 {
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6ca215fb92f..ec12f2b9d229 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
 		__swap_entry_free(p, entry);
 }
 
+static void cluster_swap_free_nr(struct swap_info_struct *sis,
+		unsigned long offset, int nr_pages)
+{
+	struct swap_cluster_info *ci;
+	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
+	int i, nr;
+
+	ci = lock_cluster_or_swap_info(sis, offset);
+	while (nr_pages) {
+		nr = min(BITS_PER_LONG, nr_pages);
+		for (i = 0; i < nr; i++) {
+			if (!__swap_entry_free_locked(sis, offset + i, 1))
+				bitmap_set(to_free, i, 1);
+		}
+		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
+			unlock_cluster_or_swap_info(sis, ci);
+			for_each_set_bit(i, to_free, BITS_PER_LONG)
+				free_swap_slot(swp_entry(sis->type, offset + i));
+			if (nr == nr_pages)
+				return;
+			bitmap_clear(to_free, 0, BITS_PER_LONG);
+			ci = lock_cluster_or_swap_info(sis, offset);
+		}
+		offset += nr;
+		nr_pages -= nr;
+	}
+	unlock_cluster_or_swap_info(sis, ci);
+}
+
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+	int nr;
+	struct swap_info_struct *sis;
+	unsigned long offset = swp_offset(entry);
+
+	sis = _swap_info_get(entry);
+	if (!sis)
+		return;
+
+	while (nr_pages) {
+		nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+		cluster_swap_free_nr(sis, offset, nr);
+		offset += nr;
+		nr_pages -= nr;
+	}
+}
+
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-03  0:50 [PATCH v3 0/6] large folios swap-in: handle refault cases first Barry Song
  2024-05-03  0:50 ` [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
@ 2024-05-03  0:50 ` Barry Song
  2024-05-03  9:31   ` Ryan Roberts
  2024-05-03  0:50 ` [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally Barry Song
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-03  0:50 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy, Rafael J. Wysocki,
	Pavel Machek, Len Brown

From: Barry Song <v-songbaohua@oppo.com>

To streamline maintenance efforts, we propose discontinuing the use of
swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
to 1. This adjustment offers the advantage of enabling batch processing
within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
a bitmap consisting of only one long, resulting in overhead that can be
ignored for cases where nr equals 1.

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <len.brown@intel.com>
Cc: Hugh Dickins <hughd@google.com>
---
 include/linux/swap.h |  5 -----
 kernel/power/swap.c  |  7 +++----
 mm/memory.c          |  2 +-
 mm/rmap.c            |  4 ++--
 mm/shmem.c           |  4 ++--
 mm/swapfile.c        | 19 +++++--------------
 6 files changed, 13 insertions(+), 28 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1d35e92d7e9..f03cb446124e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
-extern void swap_free(swp_entry_t);
 extern void swap_free_nr(swp_entry_t entry, int nr_pages);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
 	return 0;
 }
 
-static inline void swap_free(swp_entry_t swp)
-{
-}
-
 static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
 {
 }
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 5bc04bfe2db1..6befaa88a342 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
 	offset = swp_offset(get_swap_page_of_type(swap));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free_nr(swp_entry(swap, offset), 1);
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
 
 	while ((node = swsusp_extents.rb_node)) {
 		struct swsusp_extent *ext;
-		unsigned long offset;
 
 		ext = rb_entry(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
-		for (offset = ext->start; offset <= ext->end; offset++)
-			swap_free(swp_entry(swap, offset));
+		swap_free_nr(swp_entry(swap, ext->start),
+			     ext->end - ext->start + 1);
 
 		kfree(ext);
 	}
diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eae..f033eb3528ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * We're already holding a reference on the page but haven't mapped it
 	 * yet.
 	 */
-	swap_free(entry);
+	swap_free_nr(entry, 1);
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 087a79f1f611..39ec7742acec 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto walk_done_err;
 			}
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				swap_free(entry);
+				swap_free_nr(entry, 1);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_done_err;
 			}
@@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			/* See folio_try_share_anon_rmap(): clear PTE first. */
 			if (anon_exclusive &&
 			    folio_try_share_anon_rmap_pte(folio, subpage)) {
-				swap_free(entry);
+				swap_free_nr(entry, 1);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_done_err;
 			}
diff --git a/mm/shmem.c b/mm/shmem.c
index fa2a0ed97507..bfc8a2beb24f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * in shmem_evict_inode().
 	 */
 	shmem_recalc_inode(inode, -1, -1);
-	swap_free(swap);
+	swap_free_nr(swap, 1);
 }
 
 /*
@@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 	delete_from_swap_cache(folio);
 	folio_mark_dirty(folio);
-	swap_free(swap);
+	swap_free_nr(swap, 1);
 	put_swap_device(si);
 
 	*foliop = folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ec12f2b9d229..ddcd0f24b9a1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	swap_range_free(p, offset, 1);
 }
 
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free(swp_entry_t entry)
-{
-	struct swap_info_struct *p;
-
-	p = _swap_info_get(entry);
-	if (p)
-		__swap_entry_free(p, entry);
-}
-
 static void cluster_swap_free_nr(struct swap_info_struct *sis,
 		unsigned long offset, int nr_pages)
 {
@@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
 	unlock_cluster_or_swap_info(sis, ci);
 }
 
+/*
+ * Caller has made sure that the swap device corresponding to entry
+ * is still around or has not been recycled.
+ */
 void swap_free_nr(swp_entry_t entry, int nr_pages)
 {
 	int nr;
@@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		new_pte = pte_mkuffd_wp(new_pte);
 setpte:
 	set_pte_at(vma->vm_mm, addr, pte, new_pte);
-	swap_free(entry);
+	swap_free_nr(entry, 1);
 out:
 	if (pte)
 		pte_unmap_unlock(pte, ptl);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-03  0:50 [PATCH v3 0/6] large folios swap-in: handle refault cases first Barry Song
  2024-05-03  0:50 ` [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
  2024-05-03  0:50 ` [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr() Barry Song
@ 2024-05-03  0:50 ` Barry Song
  2024-05-03  9:41   ` Ryan Roberts
                     ` (2 more replies)
  2024-05-03  0:50 ` [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages Barry Song
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 49+ messages in thread
From: Barry Song @ 2024-05-03  0:50 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

From: Barry Song <v-songbaohua@oppo.com>

There could arise a necessity to obtain the first pte_t from a swap
pte_t located in the middle. For instance, this may occur within the
context of do_swap_page(), where a page fault can potentially occur in
any PTE of a large folio. To address this, the following patch introduces
pte_move_swp_offset(), a function capable of bidirectional movement by
a specified delta argument. Consequently, pte_increment_swp_offset()
will directly invoke it with delta = 1.
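
As an illustration of the negative-delta case (a sketch only, not code from
this patch): when a fault hits subpage 'idx' of a large folio, the swap pte
of the folio's first subpage can be derived by moving the offset backwards:

	/* sketch: derive the folio's first swap pte from the faulting pte */
	pte_t first_pte = pte_move_swp_offset(vmf->orig_pte, -(long)idx);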

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/internal.h | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c5552d35d995..cfe4aed66a5c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 }
 
 /**
- * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * pte_move_swp_offset - Move the swap entry offset field of a swap pte
+ *	 forward or backward by delta
  * @pte: The initial pte state; is_swap_pte(pte) must be true and
  *	 non_swap_entry() must be false.
+ * @delta: The direction and the offset we are moving; forward if delta
+ *	 is positive; backward if delta is negative
  *
- * Increments the swap offset, while maintaining all other fields, including
+ * Moves the swap offset, while maintaining all other fields, including
  * swap type, and any swp pte bits. The resulting pte is returned.
  */
-static inline pte_t pte_next_swp_offset(pte_t pte)
+static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
 {
 	swp_entry_t entry = pte_to_swp_entry(pte);
 	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
-						   (swp_offset(entry) + 1)));
+						   (swp_offset(entry) + delta)));
 
 	if (pte_swp_soft_dirty(pte))
 		new = pte_swp_mksoft_dirty(new);
@@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
 	return new;
 }
 
+
+/**
+ * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * @pte: The initial pte state; is_swap_pte(pte) must be true and
+ *	 non_swap_entry() must be false.
+ *
+ * Increments the swap offset, while maintaining all other fields, including
+ * swap type, and any swp pte bits. The resulting pte is returned.
+ */
+static inline pte_t pte_next_swp_offset(pte_t pte)
+{
+	return pte_move_swp_offset(pte, 1);
+}
+
 /**
  * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
  * @start_ptep: Page table pointer for the first entry.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages
  2024-05-03  0:50 [PATCH v3 0/6] large folios swap-in: handle refault cases first Barry Song
                   ` (2 preceding siblings ...)
  2024-05-03  0:50 ` [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally Barry Song
@ 2024-05-03  0:50 ` Barry Song
  2024-05-03 10:02   ` Ryan Roberts
  2024-05-06 16:51   ` Khalid Aziz
  2024-05-03  0:50 ` [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
  2024-05-03  0:50 ` [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache Barry Song
  5 siblings, 2 replies; 49+ messages in thread
From: Barry Song @ 2024-05-03  0:50 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy, Khalid Aziz, David S. Miller,
	Andreas Larsson

From: Barry Song <v-songbaohua@oppo.com>

Should do_swap_page() have the capability to directly map a large folio,
metadata restoration becomes necessary for a specified number of pages
denoted as nr. It's important to highlight that metadata restoration is
solely required by the SPARC platform, which, however, does not enable
THP_SWAP. Consequently, in the present kernel configuration, there
exists no practical scenario where users necessitate the restoration of
nr metadata. Platforms implementing THP_SWAP might invoke this function
with nr values exceeding 1, subsequent to do_swap_page() successfully
mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr()
functions remain empty.
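
For illustration only (a sketch; 'address', 'first_pte', 'orig_first_pte' and
'nr_pages' are assumed to come from a large-folio mapping path such as the
last patch in this series), a platform enabling THP_SWAP would end up with a
single call covering the whole batch:

	/* sketch: restore metadata for all nr_pages PTEs in one call */
	arch_do_swap_page_nr(vma->vm_mm, vma, address,
			     first_pte, orig_first_pte, nr_pages);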

Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/pgtable.h | 26 ++++++++++++++++++++------
 mm/memory.c             |  3 ++-
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 18019f037bae..463e84c3de26 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1084,6 +1084,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
 })
 
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+				     struct vm_area_struct *vma,
+				     unsigned long addr,
+				     pte_t pte, pte_t oldpte,
+				     int nr)
+{
+
+}
+#else
 /*
  * Some architectures support metadata associated with a page. When a
  * page is being swapped out, this metadata must be saved so it can be
@@ -1092,12 +1101,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
  * page as metadata for the page. arch_do_swap_page() can restore this
  * metadata when a page is swapped back in.
  */
-static inline void arch_do_swap_page(struct mm_struct *mm,
-				     struct vm_area_struct *vma,
-				     unsigned long addr,
-				     pte_t pte, pte_t oldpte)
-{
-
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long addr,
+					pte_t pte, pte_t oldpte,
+					int nr)
+{
+	for (int i = 0; i < nr; i++) {
+		arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
+				pte_advance_pfn(pte, i),
+				pte_advance_pfn(oldpte, i));
+	}
 }
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index f033eb3528ba..74cdefd58f5f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4266,7 +4266,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	VM_BUG_ON(!folio_test_anon(folio) ||
 			(pte_write(pte) && !PageAnonExclusive(page)));
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
-	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+	arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
+			pte, vmf->orig_pte, 1);
 
 	folio_unlock(folio);
 	if (folio != swapcache && swapcache) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio
  2024-05-03  0:50 [PATCH v3 0/6] large folios swap-in: handle refault cases first Barry Song
                   ` (3 preceding siblings ...)
  2024-05-03  0:50 ` [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages Barry Song
@ 2024-05-03  0:50 ` Barry Song
  2024-05-06  8:14   ` David Hildenbrand
  2024-05-03  0:50 ` [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache Barry Song
  5 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-03  0:50 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

From: Chuanhua Han <hanchuanhua@oppo.com>

The function should_try_to_free_swap() operates under the assumption
that swap-in always occurs at the normal page granularity, i.e.,
folio_nr_pages() == 1. However, for large folios, add_to_swap_cache()
invokes folio_ref_add(folio, nr), so the folio holds one swapcache
reference per subpage and the expected reference count is
1 + folio_nr_pages() rather than 2. To accommodate large folio swap-in,
this patch eliminates this assumption.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 74cdefd58f5f..22e7c33cc747 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3877,7 +3877,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
 	 * reference only in case it's likely that we'll be the exlusive user.
 	 */
 	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-		folio_ref_count(folio) == 2;
+		folio_ref_count(folio) == (1 + folio_nr_pages(folio));
 }
 
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-03  0:50 [PATCH v3 0/6] large folios swap-in: handle refault cases first Barry Song
                   ` (4 preceding siblings ...)
  2024-05-03  0:50 ` [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
@ 2024-05-03  0:50 ` Barry Song
  2024-05-03 10:50   ` Ryan Roberts
  2024-05-06 12:05   ` David Hildenbrand
  5 siblings, 2 replies; 49+ messages in thread
From: Barry Song @ 2024-05-03  0:50 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

From: Chuanhua Han <hanchuanhua@oppo.com>

When a large folio is found in the swapcache, the current implementation
requires calling do_swap_page() nr_pages times, resulting in nr_pages
page faults. This patch opts to map the entire large folio at once to
minimize page faults. Additionally, redundant checks and early exits
for ARM64 MTE restoring are removed.
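
The core idea, reduced to a sketch (simplified from the hunk below; boundary
checks and error paths are omitted, and 'pte' is assumed to have been built
for the folio's first subpage): once all nr swap PTEs covering the folio are
verified to form one contiguous batch, the whole folio is mapped at once:

	/* sketch: batch-map the folio once its nr swap PTEs check out */
	if (swap_pte_batch(folio_ptep, nr, folio_pte) == nr) {
		folio_ref_add(folio, nr - 1);
		set_ptes(vma->vm_mm, folio_start, folio_ptep, pte, nr);
		update_mmu_cache_range(vmf, vma, folio_start, folio_ptep, nr);
	}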

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 22e7c33cc747..940fdbe69fa1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte_t pte;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
+	int nr_pages = 1;
+	unsigned long page_idx = 0;
+	unsigned long address = vmf->address;
+	pte_t *ptep;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	ptep = vmf->pte;
+	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+		int nr = folio_nr_pages(folio);
+		unsigned long idx = folio_page_idx(folio, page);
+		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
+		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
+		pte_t *folio_ptep;
+		pte_t folio_pte;
+
+		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
+			goto check_folio;
+		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
+			goto check_folio;
+
+		folio_ptep = vmf->pte - idx;
+		folio_pte = ptep_get(folio_ptep);
+		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
+		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+			goto check_folio;
+
+		page_idx = idx;
+		address = folio_start;
+		ptep = folio_ptep;
+		nr_pages = nr;
+		entry = folio->swap;
+		page = &folio->page;
+	}
+
+check_folio:
+
 	/*
 	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
 	 * must never point at an anonymous page in the swapcache that is
@@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * We're already holding a reference on the page but haven't mapped it
 	 * yet.
 	 */
-	swap_free_nr(entry, 1);
+	swap_free_nr(entry, nr_pages);
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
@@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * exclusivity.
 	 */
 	if (!folio_test_ksm(folio) &&
-	    (exclusive || folio_ref_count(folio) == 1)) {
+	    (exclusive || (folio_ref_count(folio) == nr_pages &&
+			   folio_nr_pages(folio) == nr_pages))) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 			vmf->flags &= ~FAULT_FLAG_WRITE;
 		}
 		rmap_flags |= RMAP_EXCLUSIVE;
 	}
-	flush_icache_page(vma, page);
+	flush_icache_pages(vma, page, nr_pages);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
 	if (pte_swp_uffd_wp(vmf->orig_pte))
 		pte = pte_mkuffd_wp(pte);
-	vmf->orig_pte = pte;
+	vmf->orig_pte = pte_advance_pfn(pte, page_idx);
 
 	/* ksm created a completely new copy */
 	if (unlikely(folio != swapcache && swapcache)) {
-		folio_add_new_anon_rmap(folio, vma, vmf->address);
+		folio_add_new_anon_rmap(folio, vma, address);
 		folio_add_lru_vma(folio, vma);
 	} else {
-		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
 					rmap_flags);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
 			(pte_write(pte) && !PageAnonExclusive(page)));
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
-	arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
-			pte, vmf->orig_pte, 1);
+	set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
+	arch_do_swap_page_nr(vma->vm_mm, vma, address,
+			pte, pte, nr_pages);
 
 	folio_unlock(folio);
 	if (folio != swapcache && swapcache) {
@@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-05-03  0:50 ` [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
@ 2024-05-03  9:26   ` Ryan Roberts
  2024-05-03 20:25   ` Chris Li
  2024-05-08  7:35   ` Huang, Ying
  2 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-05-03  9:26 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy

On 03/05/2024 01:50, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> Furthermore, this new function, swap_free_nr(), is designed to efficiently
> handle various scenarios for releasing a specified number, nr, of swap
> entries.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

This looks much better!

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>


> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 47 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..d1d35e92d7e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6ca215fb92f..ec12f2b9d229 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
>  		__swap_entry_free(p, entry);
>  }
>  
> +static void cluster_swap_free_nr(struct swap_info_struct *sis,
> +		unsigned long offset, int nr_pages)
> +{
> +	struct swap_cluster_info *ci;
> +	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
> +	int i, nr;
> +
> +	ci = lock_cluster_or_swap_info(sis, offset);
> +	while (nr_pages) {
> +		nr = min(BITS_PER_LONG, nr_pages);
> +		for (i = 0; i < nr; i++) {
> +			if (!__swap_entry_free_locked(sis, offset + i, 1))
> +				bitmap_set(to_free, i, 1);
> +		}
> +		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> +			unlock_cluster_or_swap_info(sis, ci);
> +			for_each_set_bit(i, to_free, BITS_PER_LONG)
> +				free_swap_slot(swp_entry(sis->type, offset + i));
> +			if (nr == nr_pages)
> +				return;
> +			bitmap_clear(to_free, 0, BITS_PER_LONG);
> +			ci = lock_cluster_or_swap_info(sis, offset);
> +		}
> +		offset += nr;
> +		nr_pages -= nr;
> +	}
> +	unlock_cluster_or_swap_info(sis, ci);
> +}
> +
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +	int nr;
> +	struct swap_info_struct *sis;
> +	unsigned long offset = swp_offset(entry);
> +
> +	sis = _swap_info_get(entry);
> +	if (!sis)
> +		return;
> +
> +	while (nr_pages) {
> +		nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> +		cluster_swap_free_nr(sis, offset, nr);
> +		offset += nr;
> +		nr_pages -= nr;
> +	}
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-03  0:50 ` [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr() Barry Song
@ 2024-05-03  9:31   ` Ryan Roberts
  2024-05-03 20:37     ` Chris Li
  2024-05-08  7:56     ` Huang, Ying
  0 siblings, 2 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-05-03  9:31 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, Rafael J. Wysocki, Pavel Machek,
	Len Brown

On 03/05/2024 01:50, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> To streamline maintenance efforts, we propose discontinuing the use of
> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
> to 1. This adjustment offers the advantage of enabling batch processing
> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
> a bitmap consisting of only one long, resulting in overhead that can be
> ignored for cases where nr equals 1.
> 
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> ---
>  include/linux/swap.h |  5 -----
>  kernel/power/swap.c  |  7 +++----
>  mm/memory.c          |  2 +-
>  mm/rmap.c            |  4 ++--
>  mm/shmem.c           |  4 ++--
>  mm/swapfile.c        | 19 +++++--------------
>  6 files changed, 13 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index d1d35e92d7e9..f03cb446124e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>  extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
> -extern void swap_free(swp_entry_t);

I wonder if it would be cleaner to:

#define swap_free(entry) swap_free_nr((entry), 1)

To save all the churn for the callsites that just want to pass a single entry?

>  extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
>  	return 0;
>  }
>  
> -static inline void swap_free(swp_entry_t swp)
> -{
> -}
> -
>  static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
>  {
>  }
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 5bc04bfe2db1..6befaa88a342 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
>  	offset = swp_offset(get_swap_page_of_type(swap));
>  	if (offset) {
>  		if (swsusp_extents_insert(offset))
> -			swap_free(swp_entry(swap, offset));
> +			swap_free_nr(swp_entry(swap, offset), 1);
>  		else
>  			return swapdev_block(swap, offset);
>  	}
> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
>  
>  	while ((node = swsusp_extents.rb_node)) {
>  		struct swsusp_extent *ext;
> -		unsigned long offset;
>  
>  		ext = rb_entry(node, struct swsusp_extent, node);
>  		rb_erase(node, &swsusp_extents);
> -		for (offset = ext->start; offset <= ext->end; offset++)
> -			swap_free(swp_entry(swap, offset));
> +		swap_free_nr(swp_entry(swap, ext->start),
> +			     ext->end - ext->start + 1);
>  
>  		kfree(ext);
>  	}
> diff --git a/mm/memory.c b/mm/memory.c
> index eea6e4984eae..f033eb3528ba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * We're already holding a reference on the page but haven't mapped it
>  	 * yet.
>  	 */
> -	swap_free(entry);
> +	swap_free_nr(entry, 1);
>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
>  		folio_free_swap(folio);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 087a79f1f611..39ec7742acec 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				goto walk_done_err;
>  			}
>  			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> -				swap_free(entry);
> +				swap_free_nr(entry, 1);
>  				set_pte_at(mm, address, pvmw.pte, pteval);
>  				goto walk_done_err;
>  			}
> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			/* See folio_try_share_anon_rmap(): clear PTE first. */
>  			if (anon_exclusive &&
>  			    folio_try_share_anon_rmap_pte(folio, subpage)) {
> -				swap_free(entry);
> +				swap_free_nr(entry, 1);
>  				set_pte_at(mm, address, pvmw.pte, pteval);
>  				goto walk_done_err;
>  			}
> diff --git a/mm/shmem.c b/mm/shmem.c
> index fa2a0ed97507..bfc8a2beb24f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>  	 * in shmem_evict_inode().
>  	 */
>  	shmem_recalc_inode(inode, -1, -1);
> -	swap_free(swap);
> +	swap_free_nr(swap, 1);
>  }
>  
>  /*
> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  
>  	delete_from_swap_cache(folio);
>  	folio_mark_dirty(folio);
> -	swap_free(swap);
> +	swap_free_nr(swap, 1);
>  	put_swap_device(si);
>  
>  	*foliop = folio;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ec12f2b9d229..ddcd0f24b9a1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>  	swap_range_free(p, offset, 1);
>  }
>  
> -/*
> - * Caller has made sure that the swap device corresponding to entry
> - * is still around or has not been recycled.
> - */
> -void swap_free(swp_entry_t entry)
> -{
> -	struct swap_info_struct *p;
> -
> -	p = _swap_info_get(entry);
> -	if (p)
> -		__swap_entry_free(p, entry);
> -}
> -
>  static void cluster_swap_free_nr(struct swap_info_struct *sis,
>  		unsigned long offset, int nr_pages)
>  {
> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
>  	unlock_cluster_or_swap_info(sis, ci);
>  }
>  
> +/*
> + * Caller has made sure that the swap device corresponding to entry
> + * is still around or has not been recycled.
> + */
>  void swap_free_nr(swp_entry_t entry, int nr_pages)
>  {
>  	int nr;
> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  		new_pte = pte_mkuffd_wp(new_pte);
>  setpte:
>  	set_pte_at(vma->vm_mm, addr, pte, new_pte);
> -	swap_free(entry);
> +	swap_free_nr(entry, 1);
>  out:
>  	if (pte)
>  		pte_unmap_unlock(pte, ptl);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-03  0:50 ` [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally Barry Song
@ 2024-05-03  9:41   ` Ryan Roberts
  2024-05-03 23:40     ` Barry Song
  2024-05-03 20:51   ` Chris Li
  2024-05-08  8:08   ` Huang, Ying
  2 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-05-03  9:41 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy

On 03/05/2024 01:50, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()

You mean pte_next_swp_offset()?

> will directly invoke it with delta = 1.
> 
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/internal.h | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  }
>  
>  /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + *	 forward or backward by delta
>   * @pte: The initial pte state; is_swap_pte(pte) must be true and
>   *	 non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + *	 is positive; backward if delta is negative
>   *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
>   * swap type, and any swp pte bits. The resulting pte is returned.
>   */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)

We have equivalent functions for pfn:

  pte_next_pfn()
  pte_advance_pfn()

Although the latter takes an unsigned long and only moves forward currently. I
wonder if it makes sense to have their naming and semantics match? i.e. change
pte_advance_pfn() to pte_move_pfn() and let it move backwards too.

I guess we don't have a need for that and it adds more churn.

Anyway:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>


>  {
>  	swp_entry_t entry = pte_to_swp_entry(pte);
>  	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> -						   (swp_offset(entry) + 1)));
> +						   (swp_offset(entry) + delta)));
>  
>  	if (pte_swp_soft_dirty(pte))
>  		new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>  	return new;
>  }
>  
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + *	 non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> +	return pte_move_swp_offset(pte, 1);
> +}
> +
>  /**
>   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>   * @start_ptep: Page table pointer for the first entry.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages
  2024-05-03  0:50 ` [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages Barry Song
@ 2024-05-03 10:02   ` Ryan Roberts
  2024-05-06 16:51   ` Khalid Aziz
  1 sibling, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-05-03 10:02 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, Khalid Aziz, David S. Miller,
	Andreas Larsson

On 03/05/2024 01:50, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Should do_swap_page() have the capability to directly map a large folio,
> metadata restoration becomes necessary for a specified number of pages
> denoted as nr. It's important to highlight that metadata restoration is
> solely required by the SPARC platform, which, however, does not enable
> THP_SWAP. Consequently, in the present kernel configuration, there
> exists no practical scenario where users necessitate the restoration of
> nr metadata. Platforms implementing THP_SWAP might invoke this function
> with nr values exceeding 1, subsequent to do_swap_page() successfully
> mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr()
> functions remain empty.
> 
> Cc: Khalid Aziz <khalid.aziz@oracle.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Andreas Larsson <andreas@gaisler.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/pgtable.h | 26 ++++++++++++++++++++------
>  mm/memory.c             |  3 ++-
>  2 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 18019f037bae..463e84c3de26 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1084,6 +1084,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
>  })
>  
>  #ifndef __HAVE_ARCH_DO_SWAP_PAGE
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> +				     struct vm_area_struct *vma,
> +				     unsigned long addr,
> +				     pte_t pte, pte_t oldpte,
> +				     int nr)
> +{
> +
> +}
> +#else
>  /*
>   * Some architectures support metadata associated with a page. When a
>   * page is being swapped out, this metadata must be saved so it can be
> @@ -1092,12 +1101,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
>   * page as metadata for the page. arch_do_swap_page() can restore this
>   * metadata when a page is swapped back in.
>   */
> -static inline void arch_do_swap_page(struct mm_struct *mm,
> -				     struct vm_area_struct *vma,
> -				     unsigned long addr,
> -				     pte_t pte, pte_t oldpte)

This hook seems to be very similar to arch_swap_restore(), I wonder if it makes
sense to merge them. Out of scope for this patch series though.


> -{
> -
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long addr,
> +					pte_t pte, pte_t oldpte,
> +					int nr)
> +{
> +	for (int i = 0; i < nr; i++) {
> +		arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
> +				pte_advance_pfn(pte, i),
> +				pte_advance_pfn(oldpte, i));

It seems a bit odd to create a batched version of this, but not allow arches to
take advantage. Although I guess your point is that only SPARC implements it and
on that platform nr will always be 1? So no point right now? So this is just a
convenience for do_swap_page()? Makes sense.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> +	}
>  }
>  #endif
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index f033eb3528ba..74cdefd58f5f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4266,7 +4266,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	VM_BUG_ON(!folio_test_anon(folio) ||
>  			(pte_write(pte) && !PageAnonExclusive(page)));
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> +	arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> +			pte, vmf->orig_pte, 1);
>  
>  	folio_unlock(folio);
>  	if (folio != swapcache && swapcache) {


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-03  0:50 ` [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache Barry Song
@ 2024-05-03 10:50   ` Ryan Roberts
  2024-05-03 23:23     ` Barry Song
  2024-05-06 12:05   ` David Hildenbrand
  1 sibling, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-05-03 10:50 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy

On 03/05/2024 01:50, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

With the suggested changes below:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 48 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 22e7c33cc747..940fdbe69fa1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte_t pte;
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
> +	int nr_pages = 1;
> +	unsigned long page_idx = 0;
> +	unsigned long address = vmf->address;
> +	pte_t *ptep;

nit: Personally I'd prefer all these to get initialised just before the "if
(folio_test_large()..." block below. That way it is clear they are fresh (in
case any logic between here and there makes an adjustment) and it's clear that
they are only to be used after that block (the compiler will warn if using an
uninitialized value).

>  
>  	if (!pte_unmap_same(vmf))
>  		goto out;
> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		goto out_nomap;
>  	}
>  
> +	ptep = vmf->pte;
> +	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> +		int nr = folio_nr_pages(folio);
> +		unsigned long idx = folio_page_idx(folio, page);
> +		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> +		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> +		pte_t *folio_ptep;
> +		pte_t folio_pte;
> +
> +		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> +			goto check_folio;
> +		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> +			goto check_folio;
> +
> +		folio_ptep = vmf->pte - idx;
> +		folio_pte = ptep_get(folio_ptep);
> +		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> +		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> +			goto check_folio;
> +
> +		page_idx = idx;
> +		address = folio_start;
> +		ptep = folio_ptep;
> +		nr_pages = nr;
> +		entry = folio->swap;
> +		page = &folio->page;
> +	}
> +
> +check_folio:

Is this still the correct label name, given the checks are now above the new
block? Perhaps "one_page" or something like that?

> +
>  	/*
>  	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>  	 * must never point at an anonymous page in the swapcache that is
> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * We're already holding a reference on the page but haven't mapped it
>  	 * yet.
>  	 */
> -	swap_free_nr(entry, 1);
> +	swap_free_nr(entry, nr_pages);
>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
>  		folio_free_swap(folio);
>  
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  
>  	/*
> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * exclusivity.
>  	 */
>  	if (!folio_test_ksm(folio) &&
> -	    (exclusive || folio_ref_count(folio) == 1)) {
> +	    (exclusive || (folio_ref_count(folio) == nr_pages &&
> +			   folio_nr_pages(folio) == nr_pages))) {

I think in practice there is no change here? If nr_pages > 1 then the folio is
in the swapcache, so there is an extra ref on it? I agree with the change for
robustness sake. Just checking my understanding.

>  		if (vmf->flags & FAULT_FLAG_WRITE) {
>  			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  			vmf->flags &= ~FAULT_FLAG_WRITE;
>  		}
>  		rmap_flags |= RMAP_EXCLUSIVE;
>  	}
> -	flush_icache_page(vma, page);
> +	flush_icache_pages(vma, page, nr_pages);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
>  	if (pte_swp_uffd_wp(vmf->orig_pte))
>  		pte = pte_mkuffd_wp(pte);
> -	vmf->orig_pte = pte;
> +	vmf->orig_pte = pte_advance_pfn(pte, page_idx);
>  
>  	/* ksm created a completely new copy */
>  	if (unlikely(folio != swapcache && swapcache)) {
> -		folio_add_new_anon_rmap(folio, vma, vmf->address);
> +		folio_add_new_anon_rmap(folio, vma, address);
>  		folio_add_lru_vma(folio, vma);
>  	} else {
> -		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> +		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
>  					rmap_flags);
>  	}
>  
>  	VM_BUG_ON(!folio_test_anon(folio) ||
>  			(pte_write(pte) && !PageAnonExclusive(page)));
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -	arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> -			pte, vmf->orig_pte, 1);
> +	set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
> +	arch_do_swap_page_nr(vma->vm_mm, vma, address,
> +			pte, pte, nr_pages);
>  
>  	folio_unlock(folio);
>  	if (folio != swapcache && swapcache) {
> @@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +	update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-05-03  0:50 ` [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
  2024-05-03  9:26   ` Ryan Roberts
@ 2024-05-03 20:25   ` Chris Li
  2024-05-08  7:35   ` Huang, Ying
  2 siblings, 0 replies; 49+ messages in thread
From: Chris Li @ 2024-05-03 20:25 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, david, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, ryan.roberts, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

Hi Barry,

Looks good. Looking forward to the change to batch free by skipping
the swap slot cache. All the entries are from the same swap device, so it
does not need the sort function. Currently it goes through a lot of locking
and unlocking inside the loop.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Thu, May 2, 2024 at 5:50 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> Furthermore, this new function, swap_free_nr(), is designed to efficiently
> handle various scenarios for releasing a specified number, nr, of swap
> entries.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 47 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..d1d35e92d7e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6ca215fb92f..ec12f2b9d229 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
>                 __swap_entry_free(p, entry);
>  }
>
> +static void cluster_swap_free_nr(struct swap_info_struct *sis,
> +               unsigned long offset, int nr_pages)
> +{
> +       struct swap_cluster_info *ci;
> +       DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
> +       int i, nr;
> +
> +       ci = lock_cluster_or_swap_info(sis, offset);
> +       while (nr_pages) {
> +               nr = min(BITS_PER_LONG, nr_pages);
> +               for (i = 0; i < nr; i++) {
> +                       if (!__swap_entry_free_locked(sis, offset + i, 1))
> +                               bitmap_set(to_free, i, 1);
> +               }
> +               if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> +                       unlock_cluster_or_swap_info(sis, ci);
> +                       for_each_set_bit(i, to_free, BITS_PER_LONG)
> +                               free_swap_slot(swp_entry(sis->type, offset + i));
> +                       if (nr == nr_pages)
> +                               return;
> +                       bitmap_clear(to_free, 0, BITS_PER_LONG);
> +                       ci = lock_cluster_or_swap_info(sis, offset);
> +               }
> +               offset += nr;
> +               nr_pages -= nr;
> +       }
> +       unlock_cluster_or_swap_info(sis, ci);
> +}
> +
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +       int nr;
> +       struct swap_info_struct *sis;
> +       unsigned long offset = swp_offset(entry);
> +
> +       sis = _swap_info_get(entry);
> +       if (!sis)
> +               return;
> +
> +       while (nr_pages) {
> +               nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> +               cluster_swap_free_nr(sis, offset, nr);
> +               offset += nr;
> +               nr_pages -= nr;
> +       }
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-03  9:31   ` Ryan Roberts
@ 2024-05-03 20:37     ` Chris Li
  2024-05-04  4:03       ` Christoph Hellwig
  2024-05-08  7:56     ` Huang, Ying
  1 sibling, 1 reply; 49+ messages in thread
From: Chris Li @ 2024-05-03 20:37 UTC (permalink / raw
  To: Ryan Roberts
  Cc: Barry Song, akpm, linux-mm, baolin.wang, david, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy, Rafael J. Wysocki,
	Pavel Machek, Len Brown

On Fri, May 3, 2024 at 2:31 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/05/2024 01:50, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > To streamline maintenance efforts, we propose discontinuing the use of
> > swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
> > to 1. This adjustment offers the advantage of enabling batch processing
> > within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
> > a bitmap consisting of only one long, resulting in overhead that can be
> > ignored for cases where nr equals 1.
> >
> > Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> > Cc: Pavel Machek <pavel@ucw.cz>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > ---
> >  include/linux/swap.h |  5 -----
> >  kernel/power/swap.c  |  7 +++----
> >  mm/memory.c          |  2 +-
> >  mm/rmap.c            |  4 ++--
> >  mm/shmem.c           |  4 ++--
> >  mm/swapfile.c        | 19 +++++--------------
> >  6 files changed, 13 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index d1d35e92d7e9..f03cb446124e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> >  extern void swap_shmem_alloc(swp_entry_t);
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> > -extern void swap_free(swp_entry_t);
>
> I wonder if it would be cleaner to:
>
> #define swap_free(entry) swap_free_nr((entry), 1)
>
> To save all the churn for the callsites that just want to pass a single entry?
>
Either way works. It will produce the same machine code. I have a
slight inclination to just drop the swap_free(entry) API so that it
discourages callers from doing a for loop over swap_free().
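
(For comparison, a minimal static-inline sketch of the wrapper Ryan
describes; purely illustrative, since this series removes swap_free()
instead:

	static inline void swap_free(swp_entry_t entry)
	{
		swap_free_nr(entry, 1);
	}

It would mirror how free_swap_and_cache() wraps free_swap_and_cache_nr(),
as quoted later in this thread.)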

Acked-by: Chris Li <chrisl@kernel.org>

Chris

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-03  0:50 ` [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally Barry Song
  2024-05-03  9:41   ` Ryan Roberts
@ 2024-05-03 20:51   ` Chris Li
  2024-05-03 23:07     ` Barry Song
  2024-05-08  8:08   ` Huang, Ying
  2 siblings, 1 reply; 49+ messages in thread
From: Chris Li @ 2024-05-03 20:51 UTC (permalink / raw
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, david, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, ryan.roberts, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Thu, May 2, 2024 at 5:51 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()
> will directly invoke it with delta = 1.

BTW, this patch has a conflict with the latest mm-unstable. You might
need to refresh it.

I do not see the delta = -1 use case here. Maybe merge this patch
with the follow-up patch that uses delta = -1?
Does delta only make sense as 1 and -1?

On its own, this patch doesn't seem strictly necessary to me.

Chris

>
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/internal.h | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  }
>
>  /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + *      forward or backward by delta
>   * @pte: The initial pte state; is_swap_pte(pte) must be true and
>   *      non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + *      is positive; backward if delta is negative
>   *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
>   * swap type, and any swp pte bits. The resulting pte is returned.
>   */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>  {
>         swp_entry_t entry = pte_to_swp_entry(pte);
>         pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> -                                                  (swp_offset(entry) + 1)));
> +                                                  (swp_offset(entry) + delta)));
>
>         if (pte_swp_soft_dirty(pte))
>                 new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>         return new;
>  }
>
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + *      non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> +       return pte_move_swp_offset(pte, 1);
> +}
> +
>  /**
>   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>   * @start_ptep: Page table pointer for the first entry.
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-03 20:51   ` Chris Li
@ 2024-05-03 23:07     ` Barry Song
  0 siblings, 0 replies; 49+ messages in thread
From: Barry Song @ 2024-05-03 23:07 UTC (permalink / raw
  To: Chris Li
  Cc: akpm, linux-mm, baolin.wang, david, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, ryan.roberts, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Sat, May 4, 2024 at 4:51 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, May 2, 2024 at 5:51 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > There could arise a necessity to obtain the first pte_t from a swap
> > pte_t located in the middle. For instance, this may occur within the
> > context of do_swap_page(), where a page fault can potentially occur in
> > any PTE of a large folio. To address this, the following patch introduces
> > pte_move_swp_offset(), a function capable of bidirectional movement by
> > a specified delta argument. Consequently, pte_increment_swp_offset()
> > will directly invoke it with delta = 1.
>
> BTW, this patch has conflict with the latest mm-unstable. You might
> need to refresh it.
>
> I do not see the delta = -1 usage case here. Maybe merge the patch
> with the follow up patch that uses the delta = -1?
> Does delta only make sense as 1 and -1?

Nope. delta can be any value from -1 to -(nr_pages - 1). This is used to
get the first pte from vmf->pte in patch 6.

It might be better to keep this as a separate patch since it is introducing
a new helper? do_swap_page() might not be the only user in the future.
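
For reference, patch 6 (quoted later in this thread) uses a negative delta
to recover the folio's first swap PTE from a fault in the middle of the
folio:

	folio_ptep = vmf->pte - idx;
	folio_pte = ptep_get(folio_ptep);
	if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
	    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
		goto check_folio;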

>
> This patch doesn't seem straightly necessary to me.
>
> Chris
>
>
> Chris
>
> >
> > Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/internal.h | 25 +++++++++++++++++++++----
> >  1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index c5552d35d995..cfe4aed66a5c 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >  }
> >
> >  /**
> > - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> > + *      forward or backward by delta
> >   * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >   *      non_swap_entry() must be false.
> > + * @delta: The direction and the offset we are moving; forward if delta
> > + *      is positive; backward if delta is negative
> >   *
> > - * Increments the swap offset, while maintaining all other fields, including
> > + * Moves the swap offset, while maintaining all other fields, including
> >   * swap type, and any swp pte bits. The resulting pte is returned.
> >   */
> > -static inline pte_t pte_next_swp_offset(pte_t pte)
> > +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >  {
> >         swp_entry_t entry = pte_to_swp_entry(pte);
> >         pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> > -                                                  (swp_offset(entry) + 1)));
> > +                                                  (swp_offset(entry) + delta)));
> >
> >         if (pte_swp_soft_dirty(pte))
> >                 new = pte_swp_mksoft_dirty(new);
> > @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> >         return new;
> >  }
> >
> > +
> > +/**
> > + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > + *      non_swap_entry() must be false.
> > + *
> > + * Increments the swap offset, while maintaining all other fields, including
> > + * swap type, and any swp pte bits. The resulting pte is returned.
> > + */
> > +static inline pte_t pte_next_swp_offset(pte_t pte)
> > +{
> > +       return pte_move_swp_offset(pte, 1);
> > +}
> > +
> >  /**
> >   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> >   * @start_ptep: Page table pointer for the first entry.
> > --
> > 2.34.1

Thanks
Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-03 10:50   ` Ryan Roberts
@ 2024-05-03 23:23     ` Barry Song
  2024-05-06 12:07       ` David Hildenbrand
  2024-05-07  8:17       ` Ryan Roberts
  0 siblings, 2 replies; 49+ messages in thread
From: Barry Song @ 2024-05-03 23:23 UTC (permalink / raw
  To: Ryan Roberts
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/05/2024 01:50, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>
> With the suggested changes below:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>
> > ---
> >  mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 48 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 22e7c33cc747..940fdbe69fa1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       pte_t pte;
> >       vm_fault_t ret = 0;
> >       void *shadow = NULL;
> > +     int nr_pages = 1;
> > +     unsigned long page_idx = 0;
> > +     unsigned long address = vmf->address;
> > +     pte_t *ptep;
>
> nit: Personally I'd prefer all these to get initialised just before the "if
> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> any logic between here and there makes an adjustment) and its clear that they
> are only to be used after that block (the compiler will warn if using an
> uninitialized value).

right. I agree this will make the code more readable.

>
> >
> >       if (!pte_unmap_same(vmf))
> >               goto out;
> > @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               goto out_nomap;
> >       }
> >
> > +     ptep = vmf->pte;
> > +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > +             int nr = folio_nr_pages(folio);
> > +             unsigned long idx = folio_page_idx(folio, page);
> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > +             pte_t *folio_ptep;
> > +             pte_t folio_pte;
> > +
> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > +                     goto check_folio;
> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > +                     goto check_folio;
> > +
> > +             folio_ptep = vmf->pte - idx;
> > +             folio_pte = ptep_get(folio_ptep);
> > +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> > +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> > +                     goto check_folio;
> > +
> > +             page_idx = idx;
> > +             address = folio_start;
> > +             ptep = folio_ptep;
> > +             nr_pages = nr;
> > +             entry = folio->swap;
> > +             page = &folio->page;
> > +     }
> > +
> > +check_folio:
>
> Is this still the correct label name, given the checks are now above the new
> block? Perhaps "one_page" or something like that?

Not quite sure about this, as the code after a "one_page" label could still
be handling multiple pages. On the other hand, it seems we really are
checking the folio after "check_folio" :-)


	BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
	BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));

	/*
	 * Check under PT lock (to protect against concurrent fork() sharing
	 * the swap entry concurrently) for certainly exclusive pages.
	 */
	if (!folio_test_ksm(folio)) {


>
> > +
> >       /*
> >        * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >        * must never point at an anonymous page in the swapcache that is
> > @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * We're already holding a reference on the page but haven't mapped it
> >        * yet.
> >        */
> > -     swap_free_nr(entry, 1);
> > +     swap_free_nr(entry, nr_pages);
> >       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >               folio_free_swap(folio);
> >
> > -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > +     folio_ref_add(folio, nr_pages - 1);
> > +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >       pte = mk_pte(page, vma->vm_page_prot);
> >
> >       /*
> > @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * exclusivity.
> >        */
> >       if (!folio_test_ksm(folio) &&
> > -         (exclusive || folio_ref_count(folio) == 1)) {
> > +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> > +                        folio_nr_pages(folio) == nr_pages))) {
>
> I think in practice there is no change here? If nr_pages > 1 then the folio is
> in the swapcache, so there is an extra ref on it? I agree with the change for
> robustness sake. Just checking my understanding.

This is the code showing we reuse (mkwrite) a folio only when either
1. we meet a small folio and we are the only one hitting that small folio, or
2. we meet a large folio and we are the only one hitting that large folio.

Any corner case besides the above two seems difficult. For example,
if we hit a large folio in the swapcache but can't entirely map it
(nr_pages == 1) due to a partial unmap, we will have folio_ref_count(folio)
== nr_pages == 1. In this case, lacking the folio_nr_pages(folio) == nr_pages
check might lead to mkwrite() on a single pte within a partially unmapped
large folio. I am not quite sure that is wrong, but it seems fragile and
error-prone.

>
> >               if (vmf->flags & FAULT_FLAG_WRITE) {
> >                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >                       vmf->flags &= ~FAULT_FLAG_WRITE;
> >               }
> >               rmap_flags |= RMAP_EXCLUSIVE;
> >       }
> > -     flush_icache_page(vma, page);
> > +     flush_icache_pages(vma, page, nr_pages);
> >       if (pte_swp_soft_dirty(vmf->orig_pte))
> >               pte = pte_mksoft_dirty(pte);
> >       if (pte_swp_uffd_wp(vmf->orig_pte))
> >               pte = pte_mkuffd_wp(pte);
> > -     vmf->orig_pte = pte;
> > +     vmf->orig_pte = pte_advance_pfn(pte, page_idx);
> >
> >       /* ksm created a completely new copy */
> >       if (unlikely(folio != swapcache && swapcache)) {
> > -             folio_add_new_anon_rmap(folio, vma, vmf->address);
> > +             folio_add_new_anon_rmap(folio, vma, address);
> >               folio_add_lru_vma(folio, vma);
> >       } else {
> > -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> >                                       rmap_flags);
> >       }
> >
> >       VM_BUG_ON(!folio_test_anon(folio) ||
> >                       (pte_write(pte) && !PageAnonExclusive(page)));
> > -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > -     arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> > -                     pte, vmf->orig_pte, 1);
> > +     set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
> > +     arch_do_swap_page_nr(vma->vm_mm, vma, address,
> > +                     pte, pte, nr_pages);
> >
> >       folio_unlock(folio);
> >       if (folio != swapcache && swapcache) {
> > @@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       }
> >
> >       /* No need to invalidate - it was non-present before */
> > -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > +     update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
> >  unlock:
> >       if (vmf->pte)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-03  9:41   ` Ryan Roberts
@ 2024-05-03 23:40     ` Barry Song
  2024-05-06  8:06       ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-03 23:40 UTC (permalink / raw
  To: Ryan Roberts
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/05/2024 01:50, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > There could arise a necessity to obtain the first pte_t from a swap
> > pte_t located in the middle. For instance, this may occur within the
> > context of do_swap_page(), where a page fault can potentially occur in
> > any PTE of a large folio. To address this, the following patch introduces
> > pte_move_swp_offset(), a function capable of bidirectional movement by
> > a specified delta argument. Consequently, pte_increment_swp_offset()
>
> You mean pte_next_swp_offset()?

yes.

>
> > will directly invoke it with delta = 1.
> >
> > Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/internal.h | 25 +++++++++++++++++++++----
> >  1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index c5552d35d995..cfe4aed66a5c 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >  }
> >
> >  /**
> > - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> > + *    forward or backward by delta
> >   * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >   *    non_swap_entry() must be false.
> > + * @delta: The direction and the offset we are moving; forward if delta
> > + *    is positive; backward if delta is negative
> >   *
> > - * Increments the swap offset, while maintaining all other fields, including
> > + * Moves the swap offset, while maintaining all other fields, including
> >   * swap type, and any swp pte bits. The resulting pte is returned.
> >   */
> > -static inline pte_t pte_next_swp_offset(pte_t pte)
> > +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>
> We have equivalent functions for pfn:
>
>   pte_next_pfn()
>   pte_advance_pfn()
>
> Although the latter takes an unsigned long and only moves forward currently. I
> wonder if it makes sense to have their naming and semantics match? i.e. change
> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>
> I guess we don't have a need for that and it adds more churn.

We might have a need in the case below.
A forks B, then A and B share large folios. B unmaps/exits, and the large
folios of process A become single-mapped.
Right now, when A writes to its folios, we CoW A's large folios into many
small folios. I believe we could reuse the entire large folio instead of
doing nr_pages CoWs and page faults.
In this case, we might want to get the first PTE from vmf->pte.

Another case might be: A forks B, and when either A or B writes, we could
CoW an entire large folio instead of CoWing nr_pages small folios.

Case 1 seems more useful. I might have a go after some days; then we might
see pte_move_pfn().

>
> Anyway:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

thanks!

>
>
> >  {
> >       swp_entry_t entry = pte_to_swp_entry(pte);
> >       pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> > -                                                (swp_offset(entry) + 1)));
> > +                                                (swp_offset(entry) + delta)));
> >
> >       if (pte_swp_soft_dirty(pte))
> >               new = pte_swp_mksoft_dirty(new);
> > @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> >       return new;
> >  }
> >
> > +
> > +/**
> > + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > + *    non_swap_entry() must be false.
> > + *
> > + * Increments the swap offset, while maintaining all other fields, including
> > + * swap type, and any swp pte bits. The resulting pte is returned.
> > + */
> > +static inline pte_t pte_next_swp_offset(pte_t pte)
> > +{
> > +     return pte_move_swp_offset(pte, 1);
> > +}
> > +
> >  /**
> >   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> >   * @start_ptep: Page table pointer for the first entry.
>

Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-03 20:37     ` Chris Li
@ 2024-05-04  4:03       ` Christoph Hellwig
  2024-05-04  4:27         ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2024-05-04  4:03 UTC (permalink / raw
  To: Chris Li
  Cc: Ryan Roberts, Barry Song, akpm, linux-mm, baolin.wang, david,
	hanchuanhua, hannes, hughd, kasong, linux-kernel, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy,
	Rafael J. Wysocki, Pavel Machek, Len Brown

On Fri, May 03, 2024 at 01:37:06PM -0700, Chris Li wrote:
> Either way works. It will produce the same machine code. I have a
> slight inclination to just drop swap_free(entry) API so that it
> discourages the caller to do a for loop over swap_free().

Then just add the number-of-entries parameter to swap_free and do away
with the separate swap_free_nr.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-04  4:03       ` Christoph Hellwig
@ 2024-05-04  4:27         ` Barry Song
  2024-05-04  4:28           ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-04  4:27 UTC (permalink / raw
  To: Christoph Hellwig
  Cc: Chris Li, Ryan Roberts, akpm, linux-mm, baolin.wang, david,
	hanchuanhua, hannes, hughd, kasong, linux-kernel, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy,
	Rafael J. Wysocki, Pavel Machek, Len Brown

On Sat, May 4, 2024 at 12:03 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, May 03, 2024 at 01:37:06PM -0700, Chris Li wrote:
> > Either way works. It will produce the same machine code. I have a
> > slight inclination to just drop swap_free(entry) API so that it
> > discourages the caller to do a for loop over swap_free().
>
> Then just ad the number of entries parameter to swap_free and do away
> with the separate swap_free_nr.

swap_free_nr() isn't separate; after this patch, it is the only one left
and there won't be a swap_free() any more. It seems you want to directly
"rename" it to swap_free()?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-04  4:27         ` Barry Song
@ 2024-05-04  4:28           ` Christoph Hellwig
  2024-05-04  4:47             ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2024-05-04  4:28 UTC (permalink / raw
  To: Barry Song
  Cc: Christoph Hellwig, Chris Li, Ryan Roberts, akpm, linux-mm,
	baolin.wang, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, Rafael J. Wysocki, Pavel Machek,
	Len Brown

On Sat, May 04, 2024 at 12:27:11PM +0800, Barry Song wrote:
> swap_free_nr() isn't separate, after this patch, it is the only one left.
> there won't be swap_free() any more. it seems you want to directly
> "rename" it to swap_free()?

Yes.  Avoid the pointless suffix if it is the only variant.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-04  4:28           ` Christoph Hellwig
@ 2024-05-04  4:47             ` Barry Song
  0 siblings, 0 replies; 49+ messages in thread
From: Barry Song @ 2024-05-04  4:47 UTC (permalink / raw
  To: Christoph Hellwig
  Cc: Chris Li, Ryan Roberts, akpm, linux-mm, baolin.wang, david,
	hanchuanhua, hannes, hughd, kasong, linux-kernel, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy,
	Rafael J. Wysocki, Pavel Machek, Len Brown

On Sat, May 4, 2024 at 12:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, May 04, 2024 at 12:27:11PM +0800, Barry Song wrote:
> > swap_free_nr() isn't separate, after this patch, it is the only one left.
> > there won't be swap_free() any more. it seems you want to directly
> > "rename" it to swap_free()?
>
> Yes.  Avoid the pointless suffix if it is the only variant.

Well, it seems you are right. We usually use a suffix to differentiate
two or more cases, but now that there is only one case left, the suffix
is no longer useful.

One more problem is that free_swap_and_cache_nr() and
swap_free_nr() are not quite aligned:

extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);

static inline void free_swap_and_cache(swp_entry_t entry)
{
           free_swap_and_cache_nr(entry, 1);
}

The problem space is the same. I feel like, in that case, we could also drop
free_swap_and_cache_nr() and simply add an nr parameter to free_swap_and_cache()?
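
A minimal sketch of that direction (hypothetical, not part of this series):

	extern void swap_free(swp_entry_t entry, int nr);
	extern void free_swap_and_cache(swp_entry_t entry, int nr);

	/* existing single-entry callers would simply pass nr = 1 */
	swap_free(entry, 1);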

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-03 23:40     ` Barry Song
@ 2024-05-06  8:06       ` David Hildenbrand
  2024-05-06  8:20         ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-06  8:06 UTC (permalink / raw
  To: Barry Song, Ryan Roberts
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On 04.05.24 01:40, Barry Song wrote:
> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/05/2024 01:50, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> There could arise a necessity to obtain the first pte_t from a swap
>>> pte_t located in the middle. For instance, this may occur within the
>>> context of do_swap_page(), where a page fault can potentially occur in
>>> any PTE of a large folio. To address this, the following patch introduces
>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>
>> You mean pte_next_swp_offset()?
> 
> yes.
> 
>>
>>> will directly invoke it with delta = 1.
>>>
>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>   mm/internal.h | 25 +++++++++++++++++++++----
>>>   1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index c5552d35d995..cfe4aed66a5c 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>   }
>>>
>>>   /**
>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>> + *    forward or backward by delta
>>>    * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>    *    non_swap_entry() must be false.
>>> + * @delta: The direction and the offset we are moving; forward if delta
>>> + *    is positive; backward if delta is negative
>>>    *
>>> - * Increments the swap offset, while maintaining all other fields, including
>>> + * Moves the swap offset, while maintaining all other fields, including
>>>    * swap type, and any swp pte bits. The resulting pte is returned.
>>>    */
>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>
>> We have equivalent functions for pfn:
>>
>>    pte_next_pfn()
>>    pte_advance_pfn()
>>
>> Although the latter takes an unsigned long and only moves forward currently. I
>> wonder if it makes sense to have their naming and semantics match? i.e. change
>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>
>> I guess we don't have a need for that and it adds more churn.
> 
> we might have a need in the below case.
> A forks B, then A and B share large folios. B unmap/exit, then large
> folios of process
> A become single-mapped.
> Right now, while writing A's folios, we are CoWing A's large folios
> into many small
> folios. I believe we can reuse the entire large folios instead of doing nr_pages
> CoW and page faults.
> In this case, we might want to get the first PTE from vmf->pte.

Once we have COW reuse for large folios in place (I think you know that 
I am working on that), it might make sense to "COW-reuse around", 
meaning we look if some neighboring PTEs map the same large folio and 
map them writable as well. But if it's really worth it, increasing page 
fault latency, is to be decided separately.


> 
> Another case, might be
> A forks B, and we write either A or B, we might CoW an entire large
> folios instead
> CoWing nr_pages small folios.
> 
> case 1 seems more useful, I might have a go after some days. then we might
> see pte_move_pfn().

pte_move_pfn() does sound odd to me. It might not be required to
implement the optimization described above. (it's easier to simply read 
another PTE, check if it maps the same large folio, and to batch from there)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio
  2024-05-03  0:50 ` [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
@ 2024-05-06  8:14   ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-05-06  8:14 UTC (permalink / raw
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On 03.05.24 02:50, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> The function should_try_to_free_swap() operates under the assumption
> that swap-in always occurs at the normal page granularity,
> i.e., folio_nr_pages() = 1. However, in reality, for large folios,
> add_to_swap_cache() will invoke folio_ref_add(folio, nr).  To accommodate
> large folio swap-in, this patch eliminates this assumption.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>   mm/memory.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 74cdefd58f5f..22e7c33cc747 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3877,7 +3877,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
>   	 * reference only in case it's likely that we'll be the exlusive user.
>   	 */
>   	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> -		folio_ref_count(folio) == 2;
> +		folio_ref_count(folio) == (1 + folio_nr_pages(folio));
>   }
>   
>   static vm_fault_t pte_marker_clear(struct vm_fault *vmf)


Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-06  8:06       ` David Hildenbrand
@ 2024-05-06  8:20         ` Barry Song
  2024-05-06  8:31           ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-06  8:20 UTC (permalink / raw
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.05.24 01:40, Barry Song wrote:
> > On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 03/05/2024 01:50, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> There could arise a necessity to obtain the first pte_t from a swap
> >>> pte_t located in the middle. For instance, this may occur within the
> >>> context of do_swap_page(), where a page fault can potentially occur in
> >>> any PTE of a large folio. To address this, the following patch introduces
> >>> pte_move_swp_offset(), a function capable of bidirectional movement by
> >>> a specified delta argument. Consequently, pte_increment_swp_offset()
> >>
> >> You mean pte_next_swp_offset()?
> >
> > yes.
> >
> >>
> >>> will directly invoke it with delta = 1.
> >>>
> >>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>   mm/internal.h | 25 +++++++++++++++++++++----
> >>>   1 file changed, 21 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/internal.h b/mm/internal.h
> >>> index c5552d35d995..cfe4aed66a5c 100644
> >>> --- a/mm/internal.h
> >>> +++ b/mm/internal.h
> >>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>>   }
> >>>
> >>>   /**
> >>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> >>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> >>> + *    forward or backward by delta
> >>>    * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >>>    *    non_swap_entry() must be false.
> >>> + * @delta: The direction and the offset we are moving; forward if delta
> >>> + *    is positive; backward if delta is negative
> >>>    *
> >>> - * Increments the swap offset, while maintaining all other fields, including
> >>> + * Moves the swap offset, while maintaining all other fields, including
> >>>    * swap type, and any swp pte bits. The resulting pte is returned.
> >>>    */
> >>> -static inline pte_t pte_next_swp_offset(pte_t pte)
> >>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >>
> >> We have equivalent functions for pfn:
> >>
> >>    pte_next_pfn()
> >>    pte_advance_pfn()
> >>
> >> Although the latter takes an unsigned long and only moves forward currently. I
> >> wonder if it makes sense to have their naming and semantics match? i.e. change
> >> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
> >>
> >> I guess we don't have a need for that and it adds more churn.
> >
> > we might have a need in the below case.
> > A forks B, then A and B share large folios. B unmap/exit, then large
> > folios of process
> > A become single-mapped.
> > Right now, while writing A's folios, we are CoWing A's large folios
> > into many small
> > folios. I believe we can reuse the entire large folios instead of doing nr_pages
> > CoW and page faults.
> > In this case, we might want to get the first PTE from vmf->pte.
>
> Once we have COW reuse for large folios in place (I think you know that
> I am working on that), it might make sense to "COW-reuse around",

TBH, I didn't know you were working on that. Please Cc me next time :-)

> meaning we look if some neighboring PTEs map the same large folio and
> map them writable as well. But if it's really worth it, increasing page
> fault latency, is to be decided separately.

On the other hand, we eliminate the page-fault latency for the remaining
nr_pages - 1 PTEs.
Perhaps we can find a more cost-effective way to signify that a large
folio is probably singly mapped, and only attempt "multi-PTE" reuse in the
page fault when that condition holds, so we don't always increase latency?

>
>
> >
> > Another case, might be
> > A forks B, and we write either A or B, we might CoW an entire large
> > folios instead
> > CoWing nr_pages small folios.
> >
> > case 1 seems more useful, I might have a go after some days. then we might
> > see pte_move_pfn().
> pte_move_pfn() does sound odd to me. It might not be required to
> implement the optimization described above. (it's easier to simply read
> another PTE, check if it maps the same large folio, and to batch from there)
>

It appears that your proposal suggests potential reuse as follows: if we
have a large folio containing 16 PTEs, you might consider reusing only 4 by
examining the PTEs "around" the fault but not necessarily all 16 PTEs.
Please correct me if my understanding is wrong.

Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
pass it to folio_pte_batch() to ensure consistency in nr_pages, thus
enabling complete reuse of the whole folio.

> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-06  8:20         ` Barry Song
@ 2024-05-06  8:31           ` David Hildenbrand
  2024-05-07  8:14             ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-06  8:31 UTC (permalink / raw
  To: Barry Song
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On 06.05.24 10:20, Barry Song wrote:
> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.05.24 01:40, Barry Song wrote:
>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>
>>>> You mean pte_next_swp_offset()?
>>>
>>> yes.
>>>
>>>>
>>>>> will directly invoke it with delta = 1.
>>>>>
>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> ---
>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>>    }
>>>>>
>>>>>    /**
>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>> + *    forward or backward by delta
>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>     *    non_swap_entry() must be false.
>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>> + *    is positive; backward if delta is negative
>>>>>     *
>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>     */
>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>
>>>> We have equivalent functions for pfn:
>>>>
>>>>     pte_next_pfn()
>>>>     pte_advance_pfn()
>>>>
>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>
>>>> I guess we don't have a need for that and it adds more churn.
>>>
>>> we might have a need in the below case.
>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>> folios of process
>>> A become single-mapped.
>>> Right now, while writing A's folios, we are CoWing A's large folios
>>> into many small
>>> folios. I believe we can reuse the entire large folios instead of doing nr_pages
>>> CoW and page faults.
>>> In this case, we might want to get the first PTE from vmf->pte.
>>
>> Once we have COW reuse for large folios in place (I think you know that
>> I am working on that), it might make sense to "COW-reuse around",
> 
> TBH, I don't know if you are working on that. please Cc me next time :-)

I could have sworn I mentioned it to you already :)

See

https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/

I'll follow up on that soonish (now that batching is upstream and the 
large mapcount is on its way upstream).

> 
>> meaning we look if some neighboring PTEs map the same large folio and
>> map them writable as well. But if it's really worth it, increasing page
>> fault latency, is to be decided separately.
> 
> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
> Perhaps we can discover a more cost-effective method to signify that a large
> folio is probably singly mapped?

Yes, precisely what I am up to!

> and only call "multi-PTEs" reuse while that
> condition is true in PF and avoid increasing latency always?

I'm thinking along those lines:

If we detect that it's exclusive, we can certainly map the current 
PTE writable. Then, we can decide how much (and if) we want to 
fault-around writable as an optimization.

For smallish large folios, it might make sense to try faulting around 
most of the folio.

For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might 
not want to fault around the whole thing -- especially if there is 
little benefit to be had from contig-pte bits.

> 
>>
>>
>>>
>>> Another case, might be
>>> A forks B, and we write either A or B, we might CoW an entire large
>>> folios instead
>>> CoWing nr_pages small folios.
>>>
>>> case 1 seems more useful, I might have a go after some days. then we might
>>> see pte_move_pfn().
>> pte_move_pfn() does sound odd to me. It might not be required to
>> implement the optimization described above. (it's easier to simply read
>> another PTE, check if it maps the same large folio, and to batch from there)
>>
> 
> It appears that your proposal suggests potential reusability as follows: if we
> have a large folio containing 16 PTEs, you might consider reusing only 4 by
> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
> if my understanding is wrong.
> 
> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
> in nr_pages, thus enabling complete reuse of the whole folio.

Simply doing a vm_normal_folio(pte - X) == folio check and then trying to 
batch from there might be easier and cleaner.
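
A rough sketch of that check (illustrative only; idx would be the faulting
page's index within the folio, and the batching step itself is elided):

	pte_t *first_ptep = vmf->pte - idx;
	pte_t first_pte = ptep_get(first_ptep);

	if (pte_present(first_pte) &&
	    vm_normal_folio(vma, vmf->address - idx * PAGE_SIZE, first_pte) == folio) {
		/* the same large folio starts here: try to batch and reuse */
	}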

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-03  0:50 ` [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache Barry Song
  2024-05-03 10:50   ` Ryan Roberts
@ 2024-05-06 12:05   ` David Hildenbrand
  2024-05-06 12:27     ` Barry Song
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-06 12:05 UTC (permalink / raw
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On 03.05.24 02:50, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>   1 file changed, 48 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 22e7c33cc747..940fdbe69fa1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	pte_t pte;
>   	vm_fault_t ret = 0;
>   	void *shadow = NULL;
> +	int nr_pages = 1;
> +	unsigned long page_idx = 0;
> +	unsigned long address = vmf->address;
> +	pte_t *ptep;
>   
>   	if (!pte_unmap_same(vmf))
>   		goto out;
> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   		goto out_nomap;
>   	}
>   
> +	ptep = vmf->pte;
> +	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> +		int nr = folio_nr_pages(folio);
> +		unsigned long idx = folio_page_idx(folio, page);
> +		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> +		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> +		pte_t *folio_ptep;
> +		pte_t folio_pte;
> +
> +		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> +			goto check_folio;
> +		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> +			goto check_folio;
> +
> +		folio_ptep = vmf->pte - idx;
> +		folio_pte = ptep_get(folio_ptep);
> +		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> +		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> +			goto check_folio;
> +
> +		page_idx = idx;
> +		address = folio_start;
> +		ptep = folio_ptep;
> +		nr_pages = nr;
> +		entry = folio->swap;
> +		page = &folio->page;
> +	}
> +
> +check_folio:
> +
>   	/*
>   	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>   	 * must never point at an anonymous page in the swapcache that is
> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	 * We're already holding a reference on the page but haven't mapped it
>   	 * yet.
>   	 */
> -	swap_free_nr(entry, 1);
> +	swap_free_nr(entry, nr_pages);
>   	if (should_try_to_free_swap(folio, vma, vmf->flags))
>   		folio_free_swap(folio);
>   
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>   	pte = mk_pte(page, vma->vm_page_prot);
>   
>   	/*
> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	 * exclusivity.
>   	 */
>   	if (!folio_test_ksm(folio) &&
> -	    (exclusive || folio_ref_count(folio) == 1)) {
> +	    (exclusive || (folio_ref_count(folio) == nr_pages &&
> +			   folio_nr_pages(folio) == nr_pages))) {
>   		if (vmf->flags & FAULT_FLAG_WRITE) {
>   			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>   			vmf->flags &= ~FAULT_FLAG_WRITE;

I fail to convince myself that this change is correct, and if it is 
correct, it's confusing (I think there is a dependency on 
folio_free_swap() having been called and succeeding, such that we don't 
have a folio that is in the swapcache at this point).

Why can't we move the folio_ref_add() after this check and just leave 
the check as it is?

"folio_ref_count(folio) == 1" is as clear as it gets: we hold the single 
reference, so we can do with this thing whatever we want: it's certainly 
exclusive. No swapcache, no other people mapping it.
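
Something like the following (a sketch of the reordering being suggested,
not the patch as posted):

	if (!folio_test_ksm(folio) &&
	    (exclusive || folio_ref_count(folio) == 1)) {
		if (vmf->flags & FAULT_FLAG_WRITE) {
			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
			vmf->flags &= ~FAULT_FLAG_WRITE;
		}
		rmap_flags |= RMAP_EXCLUSIVE;
	}
	folio_ref_add(folio, nr_pages - 1);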


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-03 23:23     ` Barry Song
@ 2024-05-06 12:07       ` David Hildenbrand
  2024-05-06 12:38         ` Barry Song
  2024-05-07  8:17       ` Ryan Roberts
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-06 12:07 UTC (permalink / raw
  To: Barry Song, Ryan Roberts
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On 04.05.24 01:23, Barry Song wrote:
> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/05/2024 01:50, Barry Song wrote:
>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>
>>> When a large folio is found in the swapcache, the current implementation
>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>> page faults. This patch opts to map the entire large folio at once to
>>> minimize page faults. Additionally, redundant checks and early exits
>>> for ARM64 MTE restoring are removed.
>>>
>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>
>> With the suggested changes below:
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>>> ---
>>>   mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>   1 file changed, 48 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 22e7c33cc747..940fdbe69fa1 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        pte_t pte;
>>>        vm_fault_t ret = 0;
>>>        void *shadow = NULL;
>>> +     int nr_pages = 1;
>>> +     unsigned long page_idx = 0;
>>> +     unsigned long address = vmf->address;
>>> +     pte_t *ptep;
>>
>> nit: Personally I'd prefer all these to get initialised just before the "if
>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>> any logic between here and there makes an adjustment) and its clear that they
>> are only to be used after that block (the compiler will warn if using an
>> uninitialized value).
> 
> right. I agree this will make the code more readable.
> 
>>
>>>
>>>        if (!pte_unmap_same(vmf))
>>>                goto out;
>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>                goto out_nomap;
>>>        }
>>>
>>> +     ptep = vmf->pte;
>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>> +             int nr = folio_nr_pages(folio);
>>> +             unsigned long idx = folio_page_idx(folio, page);
>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>> +             pte_t *folio_ptep;
>>> +             pte_t folio_pte;
>>> +
>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>> +                     goto check_folio;
>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>> +                     goto check_folio;
>>> +
>>> +             folio_ptep = vmf->pte - idx;
>>> +             folio_pte = ptep_get(folio_ptep);
>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>> +                     goto check_folio;
>>> +
>>> +             page_idx = idx;
>>> +             address = folio_start;
>>> +             ptep = folio_ptep;
>>> +             nr_pages = nr;
>>> +             entry = folio->swap;
>>> +             page = &folio->page;
>>> +     }
>>> +
>>> +check_folio:
>>
>> Is this still the correct label name, given the checks are now above the new
>> block? Perhaps "one_page" or something like that?
> 
> not quite sure about this, as the code after one_page can be multiple_pages.
> On the other hand, it seems we are really checking folio after "check_folio"
> :-)
> 
> 
> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> 
> /*
> * Check under PT lock (to protect against concurrent fork() sharing
> * the swap entry concurrently) for certainly exclusive pages.
> */
> if (!folio_test_ksm(folio)) {
> 
> 
>>
>>> +
>>>        /*
>>>         * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>         * must never point at an anonymous page in the swapcache that is
>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>         * We're already holding a reference on the page but haven't mapped it
>>>         * yet.
>>>         */
>>> -     swap_free_nr(entry, 1);
>>> +     swap_free_nr(entry, nr_pages);
>>>        if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>                folio_free_swap(folio);
>>>
>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> +     folio_ref_add(folio, nr_pages - 1);
>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>        pte = mk_pte(page, vma->vm_page_prot);
>>>
>>>        /*
>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>         * exclusivity.
>>>         */
>>>        if (!folio_test_ksm(folio) &&
>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>>> +                        folio_nr_pages(folio) == nr_pages))) {
>>
>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>> in the swapcache, so there is an extra ref on it? I agree with the change for
>> robustness sake. Just checking my understanding.
> 
> This is the code showing we are reusing/(mkwrite) a folio either
> 1. we meet a small folio and we are the only one hitting the small folio
> 2. we meet a large folio and we are the only one hitting the large folio
> 
> any corner cases besides the above two seems difficult. for example,
> 
> while we hit a large folio in swapcache but if we can't entirely map it
> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> == nr_pages == 1

No, there would be other references from the swapcache and 
folio_ref_count(folio) > 1. See my other reply.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-06 12:05   ` David Hildenbrand
@ 2024-05-06 12:27     ` Barry Song
  0 siblings, 0 replies; 49+ messages in thread
From: Barry Song @ 2024-05-06 12:27 UTC (permalink / raw
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, ryan.roberts, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 12:05 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.05.24 02:50, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >   mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >   1 file changed, 48 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 22e7c33cc747..940fdbe69fa1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       pte_t pte;
> >       vm_fault_t ret = 0;
> >       void *shadow = NULL;
> > +     int nr_pages = 1;
> > +     unsigned long page_idx = 0;
> > +     unsigned long address = vmf->address;
> > +     pte_t *ptep;
> >
> >       if (!pte_unmap_same(vmf))
> >               goto out;
> > @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               goto out_nomap;
> >       }
> >
> > +     ptep = vmf->pte;
> > +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > +             int nr = folio_nr_pages(folio);
> > +             unsigned long idx = folio_page_idx(folio, page);
> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > +             pte_t *folio_ptep;
> > +             pte_t folio_pte;
> > +
> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > +                     goto check_folio;
> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > +                     goto check_folio;
> > +
> > +             folio_ptep = vmf->pte - idx;
> > +             folio_pte = ptep_get(folio_ptep);
> > +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> > +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> > +                     goto check_folio;
> > +
> > +             page_idx = idx;
> > +             address = folio_start;
> > +             ptep = folio_ptep;
> > +             nr_pages = nr;
> > +             entry = folio->swap;
> > +             page = &folio->page;
> > +     }
> > +
> > +check_folio:
> > +
> >       /*
> >        * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >        * must never point at an anonymous page in the swapcache that is
> > @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * We're already holding a reference on the page but haven't mapped it
> >        * yet.
> >        */
> > -     swap_free_nr(entry, 1);
> > +     swap_free_nr(entry, nr_pages);
> >       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >               folio_free_swap(folio);
> >
> > -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > +     folio_ref_add(folio, nr_pages - 1);
> > +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >       pte = mk_pte(page, vma->vm_page_prot);
> >
> >       /*
> > @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * exclusivity.
> >        */
> >       if (!folio_test_ksm(folio) &&
> > -         (exclusive || folio_ref_count(folio) == 1)) {
> > +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> > +                        folio_nr_pages(folio) == nr_pages))) {
> >               if (vmf->flags & FAULT_FLAG_WRITE) {
> >                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >                       vmf->flags &= ~FAULT_FLAG_WRITE;
>
> I fail to convince myself that this change is correct, and if it is
> correct, it's confusing (I think there is a dependency on
> folio_free_swap() having been called and succeeding, such that we don't
> have a folio that is in the swapcache at this point).
>
> Why can't we move the folio_ref_add() after this check and just leave
> the check as it is?
>
> "folio_ref_count(folio) == 1" is as clear as it gets: we hold the single
> reference, so we can do with this thing whatever we want: it's certainly
> exclusive. No swapcache, no other people mapping it.

Right.
I believe the code works correctly but is a bit confusing. As you said,
we might move folio_ref_add() to after the folio_ref_count(folio) == 1 check.

>
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-06 12:07       ` David Hildenbrand
@ 2024-05-06 12:38         ` Barry Song
  2024-05-06 12:58           ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-06 12:38 UTC (permalink / raw
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.05.24 01:23, Barry Song wrote:
> > On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 03/05/2024 01:50, Barry Song wrote:
> >>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>
> >>> When a large folio is found in the swapcache, the current implementation
> >>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>> page faults. This patch opts to map the entire large folio at once to
> >>> minimize page faults. Additionally, redundant checks and early exits
> >>> for ARM64 MTE restoring are removed.
> >>>
> >>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>
> >> With the suggested changes below:
> >>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >>
> >>> ---
> >>>   mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>   1 file changed, 48 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 22e7c33cc747..940fdbe69fa1 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        pte_t pte;
> >>>        vm_fault_t ret = 0;
> >>>        void *shadow = NULL;
> >>> +     int nr_pages = 1;
> >>> +     unsigned long page_idx = 0;
> >>> +     unsigned long address = vmf->address;
> >>> +     pte_t *ptep;
> >>
> >> nit: Personally I'd prefer all these to get initialised just before the "if
> >> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> >> any logic between here and there makes an adjustment) and its clear that they
> >> are only to be used after that block (the compiler will warn if using an
> >> uninitialized value).
> >
> > right. I agree this will make the code more readable.
> >
> >>
> >>>
> >>>        if (!pte_unmap_same(vmf))
> >>>                goto out;
> >>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>                goto out_nomap;
> >>>        }
> >>>
> >>> +     ptep = vmf->pte;
> >>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>> +             int nr = folio_nr_pages(folio);
> >>> +             unsigned long idx = folio_page_idx(folio, page);
> >>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>> +             pte_t *folio_ptep;
> >>> +             pte_t folio_pte;
> >>> +
> >>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>> +                     goto check_folio;
> >>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>> +                     goto check_folio;
> >>> +
> >>> +             folio_ptep = vmf->pte - idx;
> >>> +             folio_pte = ptep_get(folio_ptep);
> >>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>> +                     goto check_folio;
> >>> +
> >>> +             page_idx = idx;
> >>> +             address = folio_start;
> >>> +             ptep = folio_ptep;
> >>> +             nr_pages = nr;
> >>> +             entry = folio->swap;
> >>> +             page = &folio->page;
> >>> +     }
> >>> +
> >>> +check_folio:
> >>
> >> Is this still the correct label name, given the checks are now above the new
> >> block? Perhaps "one_page" or something like that?
> >
> > not quite sure about this, as the code after one_page can be multiple_pages.
> > On the other hand, it seems we are really checking folio after "check_folio"
> > :-)
> >
> >
> > BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> > BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >
> > /*
> > * Check under PT lock (to protect against concurrent fork() sharing
> > * the swap entry concurrently) for certainly exclusive pages.
> > */
> > if (!folio_test_ksm(folio)) {
> >
> >
> >>
> >>> +
> >>>        /*
> >>>         * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>         * must never point at an anonymous page in the swapcache that is
> >>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>         * We're already holding a reference on the page but haven't mapped it
> >>>         * yet.
> >>>         */
> >>> -     swap_free_nr(entry, 1);
> >>> +     swap_free_nr(entry, nr_pages);
> >>>        if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>                folio_free_swap(folio);
> >>>
> >>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>> +     folio_ref_add(folio, nr_pages - 1);
> >>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>        pte = mk_pte(page, vma->vm_page_prot);
> >>>
> >>>        /*
> >>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>         * exclusivity.
> >>>         */
> >>>        if (!folio_test_ksm(folio) &&
> >>> -         (exclusive || folio_ref_count(folio) == 1)) {
> >>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>> +                        folio_nr_pages(folio) == nr_pages))) {
> >>
> >> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >> in the swapcache, so there is an extra ref on it? I agree with the change for
> >> robustness sake. Just checking my understanding.
> >
> > This is the code showing we are reusing/(mkwrite) a folio either
> > 1. we meet a small folio and we are the only one hitting the small folio
> > 2. we meet a large folio and we are the only one hitting the large folio
> >
> > any corner cases besides the above two seems difficult. for example,
> >
> > while we hit a large folio in swapcache but if we can't entirely map it
> > (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> > == nr_pages == 1
>
> No, there would be other references from the swapcache and
> folio_ref_count(folio) > 1. See my other reply.

Right. This can be made clearer by:

@@ -4263,7 +4264,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        if (should_try_to_free_swap(folio, vma, vmf->flags))
                folio_free_swap(folio);

-       folio_ref_add(folio, nr_pages - 1);
        add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
        add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
        pte = mk_pte(page, vma->vm_page_prot);
@@ -4275,14 +4275,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         * exclusivity.
         */
        if (!folio_test_ksm(folio) &&
-           (exclusive || (folio_ref_count(folio) == nr_pages &&
-                          folio_nr_pages(folio) == nr_pages))) {
+           (exclusive || folio_ref_count(folio) == 1)) {
                if (vmf->flags & FAULT_FLAG_WRITE) {
                        pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                        vmf->flags &= ~FAULT_FLAG_WRITE;
                }
                rmap_flags |= RMAP_EXCLUSIVE;
        }
+       folio_ref_add(folio, nr_pages - 1);
        flush_icache_pages(vma, page, nr_pages);
        if (pte_swp_soft_dirty(vmf->orig_pte))
                pte = pte_mksoft_dirty(pte);


>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-06 12:38         ` Barry Song
@ 2024-05-06 12:58           ` Barry Song
  2024-05-06 13:16             ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-06 12:58 UTC (permalink / raw
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 04.05.24 01:23, Barry Song wrote:
> > > On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >>
> > >> On 03/05/2024 01:50, Barry Song wrote:
> > >>> From: Chuanhua Han <hanchuanhua@oppo.com>
> > >>>
> > >>> When a large folio is found in the swapcache, the current implementation
> > >>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > >>> page faults. This patch opts to map the entire large folio at once to
> > >>> minimize page faults. Additionally, redundant checks and early exits
> > >>> for ARM64 MTE restoring are removed.
> > >>>
> > >>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > >>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > >>
> > >> With the suggested changes below:
> > >>
> > >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> > >>
> > >>> ---
> > >>>   mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> > >>>   1 file changed, 48 insertions(+), 12 deletions(-)
> > >>>
> > >>> diff --git a/mm/memory.c b/mm/memory.c
> > >>> index 22e7c33cc747..940fdbe69fa1 100644
> > >>> --- a/mm/memory.c
> > >>> +++ b/mm/memory.c
> > >>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>>        pte_t pte;
> > >>>        vm_fault_t ret = 0;
> > >>>        void *shadow = NULL;
> > >>> +     int nr_pages = 1;
> > >>> +     unsigned long page_idx = 0;
> > >>> +     unsigned long address = vmf->address;
> > >>> +     pte_t *ptep;
> > >>
> > >> nit: Personally I'd prefer all these to get initialised just before the "if
> > >> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> > >> any logic between here and there makes an adjustment) and its clear that they
> > >> are only to be used after that block (the compiler will warn if using an
> > >> uninitialized value).
> > >
> > > right. I agree this will make the code more readable.
> > >
> > >>
> > >>>
> > >>>        if (!pte_unmap_same(vmf))
> > >>>                goto out;
> > >>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>>                goto out_nomap;
> > >>>        }
> > >>>
> > >>> +     ptep = vmf->pte;
> > >>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > >>> +             int nr = folio_nr_pages(folio);
> > >>> +             unsigned long idx = folio_page_idx(folio, page);
> > >>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > >>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > >>> +             pte_t *folio_ptep;
> > >>> +             pte_t folio_pte;
> > >>> +
> > >>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > >>> +                     goto check_folio;
> > >>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > >>> +                     goto check_folio;
> > >>> +
> > >>> +             folio_ptep = vmf->pte - idx;
> > >>> +             folio_pte = ptep_get(folio_ptep);
> > >>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> > >>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> > >>> +                     goto check_folio;
> > >>> +
> > >>> +             page_idx = idx;
> > >>> +             address = folio_start;
> > >>> +             ptep = folio_ptep;
> > >>> +             nr_pages = nr;
> > >>> +             entry = folio->swap;
> > >>> +             page = &folio->page;
> > >>> +     }
> > >>> +
> > >>> +check_folio:
> > >>
> > >> Is this still the correct label name, given the checks are now above the new
> > >> block? Perhaps "one_page" or something like that?
> > >
> > > not quite sure about this, as the code after one_page can be multiple_pages.
> > > On the other hand, it seems we are really checking folio after "check_folio"
> > > :-)
> > >
> > >
> > > BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> > > BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> > >
> > > /*
> > > * Check under PT lock (to protect against concurrent fork() sharing
> > > * the swap entry concurrently) for certainly exclusive pages.
> > > */
> > > if (!folio_test_ksm(folio)) {
> > >
> > >
> > >>
> > >>> +
> > >>>        /*
> > >>>         * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> > >>>         * must never point at an anonymous page in the swapcache that is
> > >>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>>         * We're already holding a reference on the page but haven't mapped it
> > >>>         * yet.
> > >>>         */
> > >>> -     swap_free_nr(entry, 1);
> > >>> +     swap_free_nr(entry, nr_pages);
> > >>>        if (should_try_to_free_swap(folio, vma, vmf->flags))
> > >>>                folio_free_swap(folio);
> > >>>
> > >>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > >>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > >>> +     folio_ref_add(folio, nr_pages - 1);
> > >>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > >>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > >>>        pte = mk_pte(page, vma->vm_page_prot);
> > >>>
> > >>>        /*
> > >>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>>         * exclusivity.
> > >>>         */
> > >>>        if (!folio_test_ksm(folio) &&
> > >>> -         (exclusive || folio_ref_count(folio) == 1)) {
> > >>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> > >>> +                        folio_nr_pages(folio) == nr_pages))) {
> > >>
> > >> I think in practice there is no change here? If nr_pages > 1 then the folio is
> > >> in the swapcache, so there is an extra ref on it? I agree with the change for
> > >> robustness sake. Just checking my understanding.
> > >
> > > This is the code showing we are reusing/(mkwrite) a folio either
> > > 1. we meet a small folio and we are the only one hitting the small folio
> > > 2. we meet a large folio and we are the only one hitting the large folio
> > >
> > > any corner cases besides the above two seems difficult. for example,
> > >
> > > while we hit a large folio in swapcache but if we can't entirely map it
> > > (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> > > == nr_pages == 1
> >
> > No, there would be other references from the swapcache and
> > folio_ref_count(folio) > 1. See my other reply.
>
> right. can be clearer by:

Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
folio_ref_count(folio) == 1 and move folio_ref_add(folio, nr_pages - 1) after
the check?

One case is that we have a large folio with 16 PTEs and we unmap 15 of its
swap PTE entries, so only one swap entry is left. Then we hit the large folio
in the swapcache, but since only one PTE remains we will map only that one
PTE. Lacking folio_nr_pages(folio) == nr_pages, we reuse the large folio for
that single PTE; with it, do_wp_page() will migrate the large folio to a
small one?

It's 1 AM and I'm tired and sleepy, so I'm not quite sure I am correct.
I look forward to seeing your reply tomorrow morning :-)

>
> @@ -4263,7 +4264,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (should_try_to_free_swap(folio, vma, vmf->flags))
>                 folio_free_swap(folio);
>
> -       folio_ref_add(folio, nr_pages - 1);
>         add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>         add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>         pte = mk_pte(page, vma->vm_page_prot);
> @@ -4275,14 +4275,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * exclusivity.
>          */
>         if (!folio_test_ksm(folio) &&
> -           (exclusive || (folio_ref_count(folio) == nr_pages &&
> -                          folio_nr_pages(folio) == nr_pages))) {
> +           (exclusive || folio_ref_count(folio) == 1)) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>                         vmf->flags &= ~FAULT_FLAG_WRITE;
>                 }
>                 rmap_flags |= RMAP_EXCLUSIVE;
>         }
> +       folio_ref_add(folio, nr_pages - 1);
>         flush_icache_pages(vma, page, nr_pages);
>         if (pte_swp_soft_dirty(vmf->orig_pte))
>                 pte = pte_mksoft_dirty(pte);
>
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-06 12:58           ` Barry Song
@ 2024-05-06 13:16             ` David Hildenbrand
  2024-05-06 22:58               ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-06 13:16 UTC (permalink / raw
  To: Barry Song
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On 06.05.24 14:58, Barry Song wrote:
> On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 04.05.24 01:23, Barry Song wrote:
>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>>>>
>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>> for ARM64 MTE restoring are removed.
>>>>>>
>>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>>>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> With the suggested changes below:
>>>>>
>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>
>>>>>> ---
>>>>>>    mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>>    1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>         pte_t pte;
>>>>>>         vm_fault_t ret = 0;
>>>>>>         void *shadow = NULL;
>>>>>> +     int nr_pages = 1;
>>>>>> +     unsigned long page_idx = 0;
>>>>>> +     unsigned long address = vmf->address;
>>>>>> +     pte_t *ptep;
>>>>>
>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>>>>> any logic between here and there makes an adjustment) and its clear that they
>>>>> are only to be used after that block (the compiler will warn if using an
>>>>> uninitialized value).
>>>>
>>>> right. I agree this will make the code more readable.
>>>>
>>>>>
>>>>>>
>>>>>>         if (!pte_unmap_same(vmf))
>>>>>>                 goto out;
>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>                 goto out_nomap;
>>>>>>         }
>>>>>>
>>>>>> +     ptep = vmf->pte;
>>>>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>> +             int nr = folio_nr_pages(folio);
>>>>>> +             unsigned long idx = folio_page_idx(folio, page);
>>>>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>> +             pte_t *folio_ptep;
>>>>>> +             pte_t folio_pte;
>>>>>> +
>>>>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>> +                     goto check_folio;
>>>>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>> +                     goto check_folio;
>>>>>> +
>>>>>> +             folio_ptep = vmf->pte - idx;
>>>>>> +             folio_pte = ptep_get(folio_ptep);
>>>>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>> +                     goto check_folio;
>>>>>> +
>>>>>> +             page_idx = idx;
>>>>>> +             address = folio_start;
>>>>>> +             ptep = folio_ptep;
>>>>>> +             nr_pages = nr;
>>>>>> +             entry = folio->swap;
>>>>>> +             page = &folio->page;
>>>>>> +     }
>>>>>> +
>>>>>> +check_folio:
>>>>>
>>>>> Is this still the correct label name, given the checks are now above the new
>>>>> block? Perhaps "one_page" or something like that?
>>>>
>>>> not quite sure about this, as the code after one_page can be multiple_pages.
>>>> On the other hand, it seems we are really checking folio after "check_folio"
>>>> :-)
>>>>
>>>>
>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>
>>>> /*
>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>> */
>>>> if (!folio_test_ksm(folio)) {
>>>>
>>>>
>>>>>
>>>>>> +
>>>>>>         /*
>>>>>>          * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>>          * must never point at an anonymous page in the swapcache that is
>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>          * We're already holding a reference on the page but haven't mapped it
>>>>>>          * yet.
>>>>>>          */
>>>>>> -     swap_free_nr(entry, 1);
>>>>>> +     swap_free_nr(entry, nr_pages);
>>>>>>         if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>>                 folio_free_swap(folio);
>>>>>>
>>>>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>> +     folio_ref_add(folio, nr_pages - 1);
>>>>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>>         pte = mk_pte(page, vma->vm_page_prot);
>>>>>>
>>>>>>         /*
>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>          * exclusivity.
>>>>>>          */
>>>>>>         if (!folio_test_ksm(folio) &&
>>>>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>>>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>> +                        folio_nr_pages(folio) == nr_pages))) {
>>>>>
>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>> robustness sake. Just checking my understanding.
>>>>
>>>> This is the code showing we are reusing/(mkwrite) a folio either
>>>> 1. we meet a small folio and we are the only one hitting the small folio
>>>> 2. we meet a large folio and we are the only one hitting the large folio
>>>>
>>>> any corner cases besides the above two seems difficult. for example,
>>>>
>>>> while we hit a large folio in swapcache but if we can't entirely map it
>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
>>>> == nr_pages == 1
>>>
>>> No, there would be other references from the swapcache and
>>> folio_ref_count(folio) > 1. See my other reply.
>>
>> right. can be clearer by:
> 
> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?

I don't think that we will "need" it.

> 
> one case is that we have a large folio with 16 PTEs, and we unmap
> 15 swap PTE entries, thus we have only one swap entry left. Then
> we hit the large folio in swapcache.  but we have only one PTE thus we will
> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
> large folio for one PTE. with it, do_wp_page() will migrate the large
> folio to a small one?

We will set the PAE bit and do_wp_page() will unconditionally reuse that page.

Note that this is the same as if we had pte_swp_exclusive() set and 
would have run into "exclusive=true" here.

If we'd want a similar "optimization" to the one we have in
wp_can_reuse_anon_folio(), you'd want something like

exclusive || (folio_ref_count(folio) == 1 &&
	      (!folio_test_large(folio) || nr_pages > 1))

... but I am not sure if that is really worth the complexity here.
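
For concreteness, a minimal sketch (not part of the patch, purely illustrative)
of how that check could read in the hunk quoted above, with folio_ref_add()
moved below the reuse decision:

	if (!folio_test_ksm(folio) &&
	    (exclusive || (folio_ref_count(folio) == 1 &&
			   (!folio_test_large(folio) || nr_pages > 1)))) {
		if (vmf->flags & FAULT_FLAG_WRITE) {
			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
			vmf->flags &= ~FAULT_FLAG_WRITE;
		}
		rmap_flags |= RMAP_EXCLUSIVE;
	}
	/* take the extra references only after the reuse decision */
	folio_ref_add(folio, nr_pages - 1);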

> 
> 1AM, tired and sleepy. not quite sure I am correct.
> I look forward to seeing your reply tomorrow morning :-)

Heh, no need to dream about this ;)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages
  2024-05-03  0:50 ` [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages Barry Song
  2024-05-03 10:02   ` Ryan Roberts
@ 2024-05-06 16:51   ` Khalid Aziz
  1 sibling, 0 replies; 49+ messages in thread
From: Khalid Aziz @ 2024-05-06 16:51 UTC (permalink / raw
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	linux-kernel, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy, David S. Miller,
	Andreas Larsson

On 5/2/24 18:50, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Should do_swap_page() have the capability to directly map a large folio,
> metadata restoration becomes necessary for a specified number of pages
> denoted as nr. It's important to highlight that metadata restoration is
> solely required by the SPARC platform, which, however, does not enable
> THP_SWAP. Consequently, in the present kernel configuration, there
> exists no practical scenario where users necessitate the restoration of
> nr metadata. Platforms implementing THP_SWAP might invoke this function
> with nr values exceeding 1, subsequent to do_swap_page() successfully
> mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr()
> functions remain empty.
> 
> Cc: Khalid Aziz <khalid.aziz@oracle.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Andreas Larsson <andreas@gaisler.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Looks good to me.

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>


> ---
>   include/linux/pgtable.h | 26 ++++++++++++++++++++------
>   mm/memory.c             |  3 ++-
>   2 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 18019f037bae..463e84c3de26 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1084,6 +1084,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
>   })
>   
>   #ifndef __HAVE_ARCH_DO_SWAP_PAGE
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> +				     struct vm_area_struct *vma,
> +				     unsigned long addr,
> +				     pte_t pte, pte_t oldpte,
> +				     int nr)
> +{
> +
> +}
> +#else
>   /*
>    * Some architectures support metadata associated with a page. When a
>    * page is being swapped out, this metadata must be saved so it can be
> @@ -1092,12 +1101,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
>    * page as metadata for the page. arch_do_swap_page() can restore this
>    * metadata when a page is swapped back in.
>    */
> -static inline void arch_do_swap_page(struct mm_struct *mm,
> -				     struct vm_area_struct *vma,
> -				     unsigned long addr,
> -				     pte_t pte, pte_t oldpte)
> -{
> -
> +static inline void arch_do_swap_page_nr(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long addr,
> +					pte_t pte, pte_t oldpte,
> +					int nr)
> +{
> +	for (int i = 0; i < nr; i++) {
> +		arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
> +				pte_advance_pfn(pte, i),
> +				pte_advance_pfn(oldpte, i));
> +	}
>   }
>   #endif
>   
> diff --git a/mm/memory.c b/mm/memory.c
> index f033eb3528ba..74cdefd58f5f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4266,7 +4266,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	VM_BUG_ON(!folio_test_anon(folio) ||
>   			(pte_write(pte) && !PageAnonExclusive(page)));
>   	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> +	arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
> +			pte, vmf->orig_pte, 1);
>   
>   	folio_unlock(folio);
>   	if (folio != swapcache && swapcache) {
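
For context, a rough sketch of how this call site is expected to look once a
later patch maps the whole folio at once (the exact arguments below are an
assumption, they are not shown in this hunk):

	set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
	/* assumption: the same old/new pte values are used for the batch */
	arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages);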


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-06 13:16             ` David Hildenbrand
@ 2024-05-06 22:58               ` Barry Song
  2024-05-07  8:24                 ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-06 22:58 UTC (permalink / raw
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.05.24 14:58, Barry Song wrote:
> > On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 04.05.24 01:23, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>>>>
> >>>>>> When a large folio is found in the swapcache, the current implementation
> >>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>>>>> page faults. This patch opts to map the entire large folio at once to
> >>>>>> minimize page faults. Additionally, redundant checks and early exits
> >>>>>> for ARM64 MTE restoring are removed.
> >>>>>>
> >>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>>>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> With the suggested changes below:
> >>>>>
> >>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>>>
> >>>>>> ---
> >>>>>>    mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>>>>    1 file changed, 48 insertions(+), 12 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 22e7c33cc747..940fdbe69fa1 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>         pte_t pte;
> >>>>>>         vm_fault_t ret = 0;
> >>>>>>         void *shadow = NULL;
> >>>>>> +     int nr_pages = 1;
> >>>>>> +     unsigned long page_idx = 0;
> >>>>>> +     unsigned long address = vmf->address;
> >>>>>> +     pte_t *ptep;
> >>>>>
> >>>>> nit: Personally I'd prefer all these to get initialised just before the "if
> >>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> >>>>> any logic between here and there makes an adjustment) and its clear that they
> >>>>> are only to be used after that block (the compiler will warn if using an
> >>>>> uninitialized value).
> >>>>
> >>>> right. I agree this will make the code more readable.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>         if (!pte_unmap_same(vmf))
> >>>>>>                 goto out;
> >>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>                 goto out_nomap;
> >>>>>>         }
> >>>>>>
> >>>>>> +     ptep = vmf->pte;
> >>>>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>>>>> +             int nr = folio_nr_pages(folio);
> >>>>>> +             unsigned long idx = folio_page_idx(folio, page);
> >>>>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>>>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>>>>> +             pte_t *folio_ptep;
> >>>>>> +             pte_t folio_pte;
> >>>>>> +
> >>>>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>>>>> +                     goto check_folio;
> >>>>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>>>>> +                     goto check_folio;
> >>>>>> +
> >>>>>> +             folio_ptep = vmf->pte - idx;
> >>>>>> +             folio_pte = ptep_get(folio_ptep);
> >>>>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>>>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>>>>> +                     goto check_folio;
> >>>>>> +
> >>>>>> +             page_idx = idx;
> >>>>>> +             address = folio_start;
> >>>>>> +             ptep = folio_ptep;
> >>>>>> +             nr_pages = nr;
> >>>>>> +             entry = folio->swap;
> >>>>>> +             page = &folio->page;
> >>>>>> +     }
> >>>>>> +
> >>>>>> +check_folio:
> >>>>>
> >>>>> Is this still the correct label name, given the checks are now above the new
> >>>>> block? Perhaps "one_page" or something like that?
> >>>>
> >>>> not quite sure about this, as the code after one_page can be multiple_pages.
> >>>> On the other hand, it seems we are really checking folio after "check_folio"
> >>>> :-)
> >>>>
> >>>>
> >>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> >>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >>>>
> >>>> /*
> >>>> * Check under PT lock (to protect against concurrent fork() sharing
> >>>> * the swap entry concurrently) for certainly exclusive pages.
> >>>> */
> >>>> if (!folio_test_ksm(folio)) {
> >>>>
> >>>>
> >>>>>
> >>>>>> +
> >>>>>>         /*
> >>>>>>          * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>>>>          * must never point at an anonymous page in the swapcache that is
> >>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>          * We're already holding a reference on the page but haven't mapped it
> >>>>>>          * yet.
> >>>>>>          */
> >>>>>> -     swap_free_nr(entry, 1);
> >>>>>> +     swap_free_nr(entry, nr_pages);
> >>>>>>         if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>>>>                 folio_free_swap(folio);
> >>>>>>
> >>>>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>>>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>>>>> +     folio_ref_add(folio, nr_pages - 1);
> >>>>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>>>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>>>>         pte = mk_pte(page, vma->vm_page_prot);
> >>>>>>
> >>>>>>         /*
> >>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>          * exclusivity.
> >>>>>>          */
> >>>>>>         if (!folio_test_ksm(folio) &&
> >>>>>> -         (exclusive || folio_ref_count(folio) == 1)) {
> >>>>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>>>>> +                        folio_nr_pages(folio) == nr_pages))) {
> >>>>>
> >>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
> >>>>> robustness sake. Just checking my understanding.
> >>>>
> >>>> This is the code showing we are reusing/(mkwrite) a folio either
> >>>> 1. we meet a small folio and we are the only one hitting the small folio
> >>>> 2. we meet a large folio and we are the only one hitting the large folio
> >>>>
> >>>> any corner cases besides the above two seems difficult. for example,
> >>>>
> >>>> while we hit a large folio in swapcache but if we can't entirely map it
> >>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> >>>> == nr_pages == 1
> >>>
> >>> No, there would be other references from the swapcache and
> >>> folio_ref_count(folio) > 1. See my other reply.
> >>
> >> right. can be clearer by:
> >
> > Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> > folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?
>
> I don't think that we will "need" it.
>
> >
> > one case is that we have a large folio with 16 PTEs, and we unmap
> > 15 swap PTE entries, thus we have only one swap entry left. Then
> > we hit the large folio in swapcache.  but we have only one PTE thus we will
> > map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
> > large folio for one PTE. with it, do_wp_page() will migrate the large
> > folio to a small one?
>
> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
>
> Note that this is the same as if we had pte_swp_exclusive() set and
> would have run into "exclusive=true" here.
>
> If we'd want a similar "optimization" as we have in
> wp_can_reuse_anon_folio(), you'd want something like
>
> exclusive || (folio_ref_count(folio) == 1 &&
>               (!folio_test_large(folio) || nr_pages > 1))

I feel like

A:    !folio_test_large(folio) || nr_pages > 1

equals

B:    folio_nr_pages(folio) == nr_pages

If the folio is small, folio_test_large(folio) is false, so both A and B are true.
If the folio is large and we map the whole large folio, A is true because of
nr_pages > 1, and B is also true.
If the folio is large and we map only a single PTE, A is false, and B is also
false because nr_pages == 1 while folio_nr_pages(folio) > 1.

Right?

However, I agree that delving into this complexity might not be necessary
at the moment.

>
> ... but I am not sure if that is really worth the complexity here.
>
> >
> > 1AM, tired and sleepy. not quite sure I am correct.
> > I look forward to seeing your reply tomorrow morning :-)
>
> Heh, no need to dream about this ;)
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-06  8:31           ` David Hildenbrand
@ 2024-05-07  8:14             ` Ryan Roberts
  2024-05-07  8:24               ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-05-07  8:14 UTC (permalink / raw
  To: David Hildenbrand, Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On 06/05/2024 09:31, David Hildenbrand wrote:
> On 06.05.24 10:20, Barry Song wrote:
>> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 04.05.24 01:40, Barry Song wrote:
>>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>>
>>>>> You mean pte_next_swp_offset()?
>>>>
>>>> yes.
>>>>
>>>>>
>>>>>> will directly invoke it with delta = 1.
>>>>>>
>>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>> ---
>>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
>>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>>> --- a/mm/internal.h
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
>>>>>> *folio, unsigned long addr,
>>>>>>    }
>>>>>>
>>>>>>    /**
>>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
>>>>>> pte.
>>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>>> + *    forward or backward by delta
>>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>>     *    non_swap_entry() must be false.
>>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>>> + *    is positive; backward if delta is negative
>>>>>>     *
>>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>>     */
>>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>>
>>>>> We have equivalent functions for pfn:
>>>>>
>>>>>     pte_next_pfn()
>>>>>     pte_advance_pfn()
>>>>>
>>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>>
>>>>> I guess we don't have a need for that and it adds more churn.
>>>>
>>>> we might have a need in the below case.
>>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>>> folios of process
>>>> A become single-mapped.
>>>> Right now, while writing A's folios, we are CoWing A's large folios
>>>> into many small
>>>> folios. I believe we can reuse the entire large folios instead of doing
>>>> nr_pages
>>>> CoW and page faults.
>>>> In this case, we might want to get the first PTE from vmf->pte.
>>>
>>> Once we have COW reuse for large folios in place (I think you know that
>>> I am working on that), it might make sense to "COW-reuse around",
>>
>> TBH, I don't know if you are working on that. please Cc me next time :-)
> 
> I could have sworn I mentioned it to you already :)
> 
> See
> 
> https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
> 
> I'll follow-up on that soonish (now that batching is upstream and the large
> mapcount is on its way upstream).
> 
>>
>>> meaning we look if some neighboring PTEs map the same large folio and
>>> map them writable as well. But if it's really worth it, increasing page
>>> fault latency, is to be decided separately.
>>
>> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
>> Perhaps we can discover a more cost-effective method to signify that a large
>> folio is probably singly mapped?
> 
> Yes, precisely what I am up to!
> 
>> and only call "multi-PTEs" reuse while that
>> condition is true in PF and avoid increasing latency always?
> 
> I'm thinking along those lines:
> 
> If we detect that it's exclusive, we can certainly mapped the current PTE
> writable. Then, we can decide how much (and if) we want to fault-around writable
> as an optimization.
> 
> For smallish large folios, it might make sense to try faulting around most of
> the folio.
> 
> For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
> to fault around the whole thing -- especially if there is little benefit to be
> had from contig-pte bits.
> 
>>
>>>
>>>
>>>>
>>>> Another case, might be
>>>> A forks B, and we write either A or B, we might CoW an entire large
>>>> folios instead
>>>> CoWing nr_pages small folios.
>>>>
>>>> case 1 seems more useful, I might have a go after some days. then we might
>>>> see pte_move_pfn().
>>> pte_move_pfn() does sound odd to me. 

Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
Perhaps just pte_advance_swp_offset() with a negative value is clearer about
what it's doing?
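
For reference, the helper body is not shown in the quoted hunk; with either
name, a signed-delta variant could look roughly like the sketch below,
mirroring pte_next_swp_offset() (the swp pte-bit handling here is an
assumption):

static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
						   swp_offset(entry) + delta));

	/* preserve the software swp pte bits */
	if (pte_swp_soft_dirty(pte))
		new = pte_swp_mksoft_dirty(new);
	if (pte_swp_exclusive(pte))
		new = pte_swp_mkexclusive(new);
	if (pte_swp_uffd_wp(pte))
		new = pte_swp_mkuffd_wp(new);

	return new;
}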

>>> It might not be required to
>>> implement the optimization described above. (it's easier to simply read
>>> another PTE, check if it maps the same large folio, and to batch from there)

Yes agreed.

>>>
>>
>> It appears that your proposal suggests potential reusability as follows: if we
>> have a large folio containing 16 PTEs, you might consider reusing only 4 by
>> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
>> if my understanding is wrong.
>>
>> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
>> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
>> in nr_pages, thus enabling complete reuse of the whole folio.
> 
> Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
> there might be easier and cleaner.
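
Concretely, that probe might look roughly like the sketch below (illustrative
only; the variable names are made up and the VMA/page-table boundary checks
are omitted):

	int nr = folio_nr_pages(folio);
	unsigned long idx = folio_page_idx(folio, page);
	pte_t *folio_ptep = vmf->pte - idx;
	unsigned long folio_addr = vmf->address - idx * PAGE_SIZE;

	/* does the PTE that would map the folio's first page still map it? */
	if (vm_normal_folio(vma, folio_addr, ptep_get(folio_ptep)) == folio) {
		/*
		 * The folio is still mapped at its start; try to batch the
		 * nr PTEs from folio_ptep (e.g. via folio_pte_batch()) and
		 * make the reuse decision for the whole range instead of a
		 * single PTE.
		 */
	}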
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-03 23:23     ` Barry Song
  2024-05-06 12:07       ` David Hildenbrand
@ 2024-05-07  8:17       ` Ryan Roberts
  1 sibling, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-05-07  8:17 UTC (permalink / raw
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy

On 04/05/2024 00:23, Barry Song wrote:
> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/05/2024 01:50, Barry Song wrote:
>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>
>>> When a large folio is found in the swapcache, the current implementation
>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>> page faults. This patch opts to map the entire large folio at once to
>>> minimize page faults. Additionally, redundant checks and early exits
>>> for ARM64 MTE restoring are removed.
>>>
>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>
>> With the suggested changes below:
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>>> ---
>>>  mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>  1 file changed, 48 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 22e7c33cc747..940fdbe69fa1 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       pte_t pte;
>>>       vm_fault_t ret = 0;
>>>       void *shadow = NULL;
>>> +     int nr_pages = 1;
>>> +     unsigned long page_idx = 0;
>>> +     unsigned long address = vmf->address;
>>> +     pte_t *ptep;
>>
>> nit: Personally I'd prefer all these to get initialised just before the "if
>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>> any logic between here and there makes an adjustment) and its clear that they
>> are only to be used after that block (the compiler will warn if using an
>> uninitialized value).
> 
> right. I agree this will make the code more readable.
> 
>>
>>>
>>>       if (!pte_unmap_same(vmf))
>>>               goto out;
>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>               goto out_nomap;
>>>       }
>>>
>>> +     ptep = vmf->pte;
>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>> +             int nr = folio_nr_pages(folio);
>>> +             unsigned long idx = folio_page_idx(folio, page);
>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>> +             pte_t *folio_ptep;
>>> +             pte_t folio_pte;
>>> +
>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>> +                     goto check_folio;
>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>> +                     goto check_folio;
>>> +
>>> +             folio_ptep = vmf->pte - idx;
>>> +             folio_pte = ptep_get(folio_ptep);
>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>> +                     goto check_folio;
>>> +
>>> +             page_idx = idx;
>>> +             address = folio_start;
>>> +             ptep = folio_ptep;
>>> +             nr_pages = nr;
>>> +             entry = folio->swap;
>>> +             page = &folio->page;
>>> +     }
>>> +
>>> +check_folio:
>>
>> Is this still the correct label name, given the checks are now above the new
>> block? Perhaps "one_page" or something like that?
> 
> not quite sure about this, as the code after one_page can be multiple_pages.
> On the other hand, it seems we are really checking folio after "check_folio"
> :-)

Yeah fair enough. Ignore my comment.

> 
> 
> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> 
> /*
> * Check under PT lock (to protect against concurrent fork() sharing
> * the swap entry concurrently) for certainly exclusive pages.
> */
> if (!folio_test_ksm(folio)) {
> 
> 
>>
>>> +
>>>       /*
>>>        * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>        * must never point at an anonymous page in the swapcache that is
>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        * We're already holding a reference on the page but haven't mapped it
>>>        * yet.
>>>        */
>>> -     swap_free_nr(entry, 1);
>>> +     swap_free_nr(entry, nr_pages);
>>>       if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>               folio_free_swap(folio);
>>>
>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> +     folio_ref_add(folio, nr_pages - 1);
>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>       pte = mk_pte(page, vma->vm_page_prot);
>>>
>>>       /*
>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        * exclusivity.
>>>        */
>>>       if (!folio_test_ksm(folio) &&
>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>>> +                        folio_nr_pages(folio) == nr_pages))) {
>>
>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>> in the swapcache, so there is an extra ref on it? I agree with the change for
>> robustness sake. Just checking my understanding.
> 
> This is the code showing we are reusing/(mkwrite) a folio either
> 1. we meet a small folio and we are the only one hitting the small folio
> 2. we meet a large folio and we are the only one hitting the large folio
> 
> any corner cases besides the above two seems difficult. for example,
> 
> while we hit a large folio in swapcache but if we can't entirely map it
> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> == nr_pages == 1, in this case, lacking  folio_nr_pages(folio) == nr_pages
> might lead to mkwrite() on a single pte within a partially unmapped large
> folio. not quite sure this is wrong, but seems buggy and arduous.
> 
>>
>>>               if (vmf->flags & FAULT_FLAG_WRITE) {
>>>                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>>                       vmf->flags &= ~FAULT_FLAG_WRITE;
>>>               }
>>>               rmap_flags |= RMAP_EXCLUSIVE;
>>>       }
>>> -     flush_icache_page(vma, page);
>>> +     flush_icache_pages(vma, page, nr_pages);
>>>       if (pte_swp_soft_dirty(vmf->orig_pte))
>>>               pte = pte_mksoft_dirty(pte);
>>>       if (pte_swp_uffd_wp(vmf->orig_pte))
>>>               pte = pte_mkuffd_wp(pte);
>>> -     vmf->orig_pte = pte;
>>> +     vmf->orig_pte = pte_advance_pfn(pte, page_idx);
>>>
>>>       /* ksm created a completely new copy */
>>>       if (unlikely(folio != swapcache && swapcache)) {
>>> -             folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> +             folio_add_new_anon_rmap(folio, vma, address);
>>>               folio_add_lru_vma(folio, vma);
>>>       } else {
>>> -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>> +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
>>>                                       rmap_flags);
>>>       }
>>>
>>>       VM_BUG_ON(!folio_test_anon(folio) ||
>>>                       (pte_write(pte) && !PageAnonExclusive(page)));
>>> -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>> -     arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address,
>>> -                     pte, vmf->orig_pte, 1);
>>> +     set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
>>> +     arch_do_swap_page_nr(vma->vm_mm, vma, address,
>>> +                     pte, pte, nr_pages);
>>>
>>>       folio_unlock(folio);
>>>       if (folio != swapcache && swapcache) {
>>> @@ -4291,7 +4327,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       }
>>>
>>>       /* No need to invalidate - it was non-present before */
>>> -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> +     update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
>>>  unlock:
>>>       if (vmf->pte)
>>>               pte_unmap_unlock(vmf->pte, vmf->ptl);
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-06 22:58               ` Barry Song
@ 2024-05-07  8:24                 ` David Hildenbrand
  2024-05-07  8:43                   ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-07  8:24 UTC (permalink / raw
  To: Barry Song
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On 07.05.24 00:58, Barry Song wrote:
> On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 06.05.24 14:58, Barry Song wrote:
>>> On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 04.05.24 01:23, Barry Song wrote:
>>>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>>>>>>
>>>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>>>> for ARM64 MTE restoring are removed.
>>>>>>>>
>>>>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>>>>>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>>
>>>>>>> With the suggested changes below:
>>>>>>>
>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>
>>>>>>>> ---
>>>>>>>>     mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>>>>     1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>          pte_t pte;
>>>>>>>>          vm_fault_t ret = 0;
>>>>>>>>          void *shadow = NULL;
>>>>>>>> +     int nr_pages = 1;
>>>>>>>> +     unsigned long page_idx = 0;
>>>>>>>> +     unsigned long address = vmf->address;
>>>>>>>> +     pte_t *ptep;
>>>>>>>
>>>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>>>>>>> any logic between here and there makes an adjustment) and its clear that they
>>>>>>> are only to be used after that block (the compiler will warn if using an
>>>>>>> uninitialized value).
>>>>>>
>>>>>> right. I agree this will make the code more readable.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>          if (!pte_unmap_same(vmf))
>>>>>>>>                  goto out;
>>>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>                  goto out_nomap;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> +     ptep = vmf->pte;
>>>>>>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>>>> +             int nr = folio_nr_pages(folio);
>>>>>>>> +             unsigned long idx = folio_page_idx(folio, page);
>>>>>>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>>>> +             pte_t *folio_ptep;
>>>>>>>> +             pte_t folio_pte;
>>>>>>>> +
>>>>>>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>>>> +                     goto check_folio;
>>>>>>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>>>> +                     goto check_folio;
>>>>>>>> +
>>>>>>>> +             folio_ptep = vmf->pte - idx;
>>>>>>>> +             folio_pte = ptep_get(folio_ptep);
>>>>>>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>>>> +                     goto check_folio;
>>>>>>>> +
>>>>>>>> +             page_idx = idx;
>>>>>>>> +             address = folio_start;
>>>>>>>> +             ptep = folio_ptep;
>>>>>>>> +             nr_pages = nr;
>>>>>>>> +             entry = folio->swap;
>>>>>>>> +             page = &folio->page;
>>>>>>>> +     }
>>>>>>>> +
>>>>>>>> +check_folio:
>>>>>>>
>>>>>>> Is this still the correct label name, given the checks are now above the new
>>>>>>> block? Perhaps "one_page" or something like that?
>>>>>>
>>>>>> not quite sure about this, as the code after one_page can be multiple_pages.
>>>>>> On the other hand, it seems we are really checking folio after "check_folio"
>>>>>> :-)
>>>>>>
>>>>>>
>>>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>>>
>>>>>> /*
>>>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>>>> */
>>>>>> if (!folio_test_ksm(folio)) {
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>>          /*
>>>>>>>>           * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>>>>           * must never point at an anonymous page in the swapcache that is
>>>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>           * We're already holding a reference on the page but haven't mapped it
>>>>>>>>           * yet.
>>>>>>>>           */
>>>>>>>> -     swap_free_nr(entry, 1);
>>>>>>>> +     swap_free_nr(entry, nr_pages);
>>>>>>>>          if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>>>>                  folio_free_swap(folio);
>>>>>>>>
>>>>>>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>>>> +     folio_ref_add(folio, nr_pages - 1);
>>>>>>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>>>>          pte = mk_pte(page, vma->vm_page_prot);
>>>>>>>>
>>>>>>>>          /*
>>>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>           * exclusivity.
>>>>>>>>           */
>>>>>>>>          if (!folio_test_ksm(folio) &&
>>>>>>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>>>>>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>>>> +                        folio_nr_pages(folio) == nr_pages))) {
>>>>>>>
>>>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>>>> robustness sake. Just checking my understanding.
>>>>>>
>>>>>> This is the code showing we are reusing/(mkwrite) a folio either
>>>>>> 1. we meet a small folio and we are the only one hitting the small folio
>>>>>> 2. we meet a large folio and we are the only one hitting the large folio
>>>>>>
>>>>>> any corner cases besides the above two seems difficult. for example,
>>>>>>
>>>>>> while we hit a large folio in swapcache but if we can't entirely map it
>>>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
>>>>>> == nr_pages == 1
>>>>>
>>>>> No, there would be other references from the swapcache and
>>>>> folio_ref_count(folio) > 1. See my other reply.
>>>>
>>>> right. can be clearer by:
>>>
>>> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
>>> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?
>>
>> I don't think that we will "need" it.
>>
>>>
>>> one case is that we have a large folio with 16 PTEs, and we unmap
>>> 15 swap PTE entries, thus we have only one swap entry left. Then
>>> we hit the large folio in swapcache.  but we have only one PTE thus we will
>>> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
>>> large folio for one PTE. with it, do_wp_page() will migrate the large
>>> folio to a small one?
>>
>> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
>>
>> Note that this is the same as if we had pte_swp_exclusive() set and
>> would have run into "exclusive=true" here.
>>
>> If we'd want a similar "optimization" as we have in
>> wp_can_reuse_anon_folio(), you'd want something like
>>
>> exclusive || (folio_ref_count(folio) == 1 &&
>>                (!folio_test_large(folio) || nr_pages > 1)
> 
> I feel like
> 
> A :   !folio_test_large(folio) || nr_pages > 1
> 
> equals
> 
> B:    folio_nr_pages(folio) == nr_pages
> 
> if folio is small,  folio_test_large(folio) is false, both A and B will be true;
> if folio is large, and we map the whole large folio, A will be true
> because of nr_pages > 1;
> B is also true;
> if folio is large, and we map single one PTE, A will be false;
> B is also false, because nr_pages == 1 but  folio_nr_pages(folio) > 1;
> 
> right?

Let's assume a single subpage of a large folio is no longer mapped. 
Then, we'd have:

nr_pages == folio_nr_pages(folio) - 1.

You could simply map+reuse most of the folio without COWing.

Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's 
just easy to detect that the folio is exclusive (folio_ref_count(folio) 
== 1 before mapping anything).

If you really want to mimic what do_wp_page() currently does, you should 
have:

exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))

Personally, I think we should keep it simple here and use:

exclusive || folio_ref_count(folio) == 1

IMHO, that's as clear as it gets.
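
For easier comparison, the three variants discussed in this sub-thread side
by side (a sketch; nr_pages is the number of PTEs being mapped, and the last
variant assumes the reference count is checked before
folio_ref_add(folio, nr_pages - 1)):

        /* v3 patch as posted: */
        if (!folio_test_ksm(folio) &&
            (exclusive || (folio_ref_count(folio) == nr_pages &&
                           folio_nr_pages(folio) == nr_pages))) { ... }

        /* mimicking what do_wp_page() currently does: */
        if (!folio_test_ksm(folio) &&
            (exclusive || (folio_ref_count(folio) == 1 &&
                           !folio_test_large(folio)))) { ... }

        /* the simpler form suggested above: */
        if (!folio_test_ksm(folio) &&
            (exclusive || folio_ref_count(folio) == 1)) { ... }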

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-07  8:14             ` Ryan Roberts
@ 2024-05-07  8:24               ` Barry Song
  2024-05-07  9:39                 ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-07  8:24 UTC (permalink / raw
  To: Ryan Roberts
  Cc: David Hildenbrand, akpm, linux-mm, baolin.wang, chrisl,
	hanchuanhua, hannes, hughd, kasong, linux-kernel, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 8:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 06/05/2024 09:31, David Hildenbrand wrote:
> > On 06.05.24 10:20, Barry Song wrote:
> >> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 04.05.24 01:40, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>>
> >>>>>> There could arise a necessity to obtain the first pte_t from a swap
> >>>>>> pte_t located in the middle. For instance, this may occur within the
> >>>>>> context of do_swap_page(), where a page fault can potentially occur in
> >>>>>> any PTE of a large folio. To address this, the following patch introduces
> >>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
> >>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
> >>>>>
> >>>>> You mean pte_next_swp_offset()?
> >>>>
> >>>> yes.
> >>>>
> >>>>>
> >>>>>> will directly invoke it with delta = 1.
> >>>>>>
> >>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>> ---
> >>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
> >>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>>>> index c5552d35d995..cfe4aed66a5c 100644
> >>>>>> --- a/mm/internal.h
> >>>>>> +++ b/mm/internal.h
> >>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
> >>>>>> *folio, unsigned long addr,
> >>>>>>    }
> >>>>>>
> >>>>>>    /**
> >>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
> >>>>>> pte.
> >>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> >>>>>> + *    forward or backward by delta
> >>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >>>>>>     *    non_swap_entry() must be false.
> >>>>>> + * @delta: The direction and the offset we are moving; forward if delta
> >>>>>> + *    is positive; backward if delta is negative
> >>>>>>     *
> >>>>>> - * Increments the swap offset, while maintaining all other fields, including
> >>>>>> + * Moves the swap offset, while maintaining all other fields, including
> >>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
> >>>>>>     */
> >>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
> >>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >>>>>
> >>>>> We have equivalent functions for pfn:
> >>>>>
> >>>>>     pte_next_pfn()
> >>>>>     pte_advance_pfn()
> >>>>>
> >>>>> Although the latter takes an unsigned long and only moves forward currently. I
> >>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
> >>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
> >>>>>
> >>>>> I guess we don't have a need for that and it adds more churn.
> >>>>
> >>>> we might have a need in the below case.
> >>>> A forks B, then A and B share large folios. B unmap/exit, then large
> >>>> folios of process
> >>>> A become single-mapped.
> >>>> Right now, while writing A's folios, we are CoWing A's large folios
> >>>> into many small
> >>>> folios. I believe we can reuse the entire large folios instead of doing
> >>>> nr_pages
> >>>> CoW and page faults.
> >>>> In this case, we might want to get the first PTE from vmf->pte.
> >>>
> >>> Once we have COW reuse for large folios in place (I think you know that
> >>> I am working on that), it might make sense to "COW-reuse around",
> >>
> >> TBH, I don't know if you are working on that. please Cc me next time :-)
> >
> > I could have sworn I mentioned it to you already :)
> >
> > See
> >
> > https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
> >
> > I'll follow-up on that soonish (now that batching is upstream and the large
> > mapcount is on its way upstream).
> >
> >>
> >>> meaning we look if some neighboring PTEs map the same large folio and
> >>> map them writable as well. But if it's really worth it, increasing page
> >>> fault latency, is to be decided separately.
> >>
> >> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
> >> Perhaps we can discover a more cost-effective method to signify that a large
> >> folio is probably singly mapped?
> >
> > Yes, precisely what I am up to!
> >
> >> and only call "multi-PTEs" reuse while that
> >> condition is true in PF and avoid increasing latency always?
> >
> > I'm thinking along those lines:
> >
> > If we detect that it's exclusive, we can certainly mapped the current PTE
> > writable. Then, we can decide how much (and if) we want to fault-around writable
> > as an optimization.
> >
> > For smallish large folios, it might make sense to try faulting around most of
> > the folio.
> >
> > For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
> > to fault around the whole thing -- especially if there is little benefit to be
> > had from contig-pte bits.
> >
> >>
> >>>
> >>>
> >>>>
> >>>> Another case, might be
> >>>> A forks B, and we write either A or B, we might CoW an entire large
> >>>> folios instead
> >>>> CoWing nr_pages small folios.
> >>>>
> >>>> case 1 seems more useful, I might have a go after some days. then we might
> >>>> see pte_move_pfn().
> >>> pte_move_pfn() does sound odd to me.
>
> Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
> Perhaps just pte_advance_swp_offset() with a negative value is clearer about
> what its doing?
>

I am not a native speaker, but the dictionary says:

advance:
move forward in a purposeful way.
a forward movement.

Now we are moving backward or forward :-)

> >>> It might not be required to
> >>> implement the optimization described above. (it's easier to simply read
> >>> another PTE, check if it maps the same large folio, and to batch from there)
>
> Yes agreed.
>
> >>>
> >>
> >> It appears that your proposal suggests potential reusability as follows: if we
> >> have a large folio containing 16 PTEs, you might consider reusing only 4 by
> >> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
> >> if my understanding is wrong.
> >>
> >> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
> >> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
> >> in nr_pages, thus enabling complete reuse of the whole folio.
> >
> > Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
> > there might be easier and cleaner.
> >
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-07  8:24                 ` David Hildenbrand
@ 2024-05-07  8:43                   ` Barry Song
  2024-05-07  8:59                     ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-07  8:43 UTC (permalink / raw
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 8:24 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.05.24 00:58, Barry Song wrote:
> > On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 06.05.24 14:58, Barry Song wrote:
> >>> On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 04.05.24 01:23, Barry Song wrote:
> >>>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>
> >>>>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>>>>>>
> >>>>>>>> When a large folio is found in the swapcache, the current implementation
> >>>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>>>>>>> page faults. This patch opts to map the entire large folio at once to
> >>>>>>>> minimize page faults. Additionally, redundant checks and early exits
> >>>>>>>> for ARM64 MTE restoring are removed.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>>>>>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>>>
> >>>>>>> With the suggested changes below:
> >>>>>>>
> >>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>>>>>
> >>>>>>>> ---
> >>>>>>>>     mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>>>>>>     1 file changed, 48 insertions(+), 12 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
> >>>>>>>> --- a/mm/memory.c
> >>>>>>>> +++ b/mm/memory.c
> >>>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>>          pte_t pte;
> >>>>>>>>          vm_fault_t ret = 0;
> >>>>>>>>          void *shadow = NULL;
> >>>>>>>> +     int nr_pages = 1;
> >>>>>>>> +     unsigned long page_idx = 0;
> >>>>>>>> +     unsigned long address = vmf->address;
> >>>>>>>> +     pte_t *ptep;
> >>>>>>>
> >>>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
> >>>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
> >>>>>>> any logic between here and there makes an adjustment) and its clear that they
> >>>>>>> are only to be used after that block (the compiler will warn if using an
> >>>>>>> uninitialized value).
> >>>>>>
> >>>>>> right. I agree this will make the code more readable.
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>          if (!pte_unmap_same(vmf))
> >>>>>>>>                  goto out;
> >>>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>>                  goto out_nomap;
> >>>>>>>>          }
> >>>>>>>>
> >>>>>>>> +     ptep = vmf->pte;
> >>>>>>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>>>>>>> +             int nr = folio_nr_pages(folio);
> >>>>>>>> +             unsigned long idx = folio_page_idx(folio, page);
> >>>>>>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>>>>>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>>>>>>> +             pte_t *folio_ptep;
> >>>>>>>> +             pte_t folio_pte;
> >>>>>>>> +
> >>>>>>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>>>>>>> +                     goto check_folio;
> >>>>>>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>>>>>>> +                     goto check_folio;
> >>>>>>>> +
> >>>>>>>> +             folio_ptep = vmf->pte - idx;
> >>>>>>>> +             folio_pte = ptep_get(folio_ptep);
> >>>>>>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>>>>>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>>>>>>> +                     goto check_folio;
> >>>>>>>> +
> >>>>>>>> +             page_idx = idx;
> >>>>>>>> +             address = folio_start;
> >>>>>>>> +             ptep = folio_ptep;
> >>>>>>>> +             nr_pages = nr;
> >>>>>>>> +             entry = folio->swap;
> >>>>>>>> +             page = &folio->page;
> >>>>>>>> +     }
> >>>>>>>> +
> >>>>>>>> +check_folio:
> >>>>>>>
> >>>>>>> Is this still the correct label name, given the checks are now above the new
> >>>>>>> block? Perhaps "one_page" or something like that?
> >>>>>>
> >>>>>> not quite sure about this, as the code after one_page can be multiple_pages.
> >>>>>> On the other hand, it seems we are really checking folio after "check_folio"
> >>>>>> :-)
> >>>>>>
> >>>>>>
> >>>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> >>>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >>>>>>
> >>>>>> /*
> >>>>>> * Check under PT lock (to protect against concurrent fork() sharing
> >>>>>> * the swap entry concurrently) for certainly exclusive pages.
> >>>>>> */
> >>>>>> if (!folio_test_ksm(folio)) {
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> +
> >>>>>>>>          /*
> >>>>>>>>           * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>>>>>>           * must never point at an anonymous page in the swapcache that is
> >>>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>>           * We're already holding a reference on the page but haven't mapped it
> >>>>>>>>           * yet.
> >>>>>>>>           */
> >>>>>>>> -     swap_free_nr(entry, 1);
> >>>>>>>> +     swap_free_nr(entry, nr_pages);
> >>>>>>>>          if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>>>>>>                  folio_free_swap(folio);
> >>>>>>>>
> >>>>>>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>>>>>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>>>>>>> +     folio_ref_add(folio, nr_pages - 1);
> >>>>>>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>>>>>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>>>>>>          pte = mk_pte(page, vma->vm_page_prot);
> >>>>>>>>
> >>>>>>>>          /*
> >>>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>>>           * exclusivity.
> >>>>>>>>           */
> >>>>>>>>          if (!folio_test_ksm(folio) &&
> >>>>>>>> -         (exclusive || folio_ref_count(folio) == 1)) {
> >>>>>>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>>>>>>> +                        folio_nr_pages(folio) == nr_pages))) {
> >>>>>>>
> >>>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >>>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
> >>>>>>> robustness sake. Just checking my understanding.
> >>>>>>
> >>>>>> This is the code showing we are reusing/(mkwrite) a folio either
> >>>>>> 1. we meet a small folio and we are the only one hitting the small folio
> >>>>>> 2. we meet a large folio and we are the only one hitting the large folio
> >>>>>>
> >>>>>> any corner cases besides the above two seems difficult. for example,
> >>>>>>
> >>>>>> while we hit a large folio in swapcache but if we can't entirely map it
> >>>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
> >>>>>> == nr_pages == 1
> >>>>>
> >>>>> No, there would be other references from the swapcache and
> >>>>> folio_ref_count(folio) > 1. See my other reply.
> >>>>
> >>>> right. can be clearer by:
> >>>
> >>> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> >>> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?
> >>
> >> I don't think that we will "need" it.
> >>
> >>>
> >>> one case is that we have a large folio with 16 PTEs, and we unmap
> >>> 15 swap PTE entries, thus we have only one swap entry left. Then
> >>> we hit the large folio in swapcache.  but we have only one PTE thus we will
> >>> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
> >>> large folio for one PTE. with it, do_wp_page() will migrate the large
> >>> folio to a small one?
> >>
> >> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
> >>
> >> Note that this is the same as if we had pte_swp_exclusive() set and
> >> would have run into "exclusive=true" here.
> >>
> >> If we'd want a similar "optimization" as we have in
> >> wp_can_reuse_anon_folio(), you'd want something like
> >>
> >> exclusive || (folio_ref_count(folio) == 1 &&
> >>                (!folio_test_large(folio) || nr_pages > 1)
> >
> > I feel like
> >
> > A :   !folio_test_large(folio) || nr_pages > 1
> >
> > equals
> >
> > B:    folio_nr_pages(folio) == nr_pages
> >
> > if folio is small,  folio_test_large(folio) is false, both A and B will be true;
> > if folio is large, and we map the whole large folio, A will be true
> > because of nr_pages > 1;
> > B is also true;
> > if folio is large, and we map single one PTE, A will be false;
> > B is also false, because nr_pages == 1 but  folio_nr_pages(folio) > 1;
> >
> > right?
>
> Let's assume a single subpage of a large folio is no longer mapped.
> Then, we'd have:
>
> nr_pages == folio_nr_pages(folio) - 1.
>
> You could simply map+reuse most of the folio without COWing.

Yes. This is good, but the PTE which is no longer mapped could be any
one of the nr_pages PTEs, so it could be quite tricky for set_ptes().

>
> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
> just easy to detect that the folio is exclusive (folio_ref_count(folio)
> == 1 before mapping anything).
>
> If you really want to mimic what do_wp_page() currently does, you should
> have:
>
> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))

I actually dislike how do_wp_page() handles the reuse of a large folio
which is entirely mapped. For example, A forks B, B exits, then we write
A's large folio and get nr_pages CoW faults into small folios. Ideally, we
could reuse the whole large folio for writing.

>
> Personally, I think we should keep it simple here and use:
>
> exclusive || folio_ref_count(folio) == 1

I feel this is still better than
exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
as we reuse the whole large folio; the do_wp_page() behaviour
doesn't have this.

>
> IMHO, that's as clear as it gets.

I agree this is clear. But I wonder if there is a possibility to optimize.

Using your example,
"Let's assume a single subpage of a large folio is no longer mapped."

Take a large folio with 16 PTEs where we have unmapped one of them.
Thus, we have 15 swap entries left.

With only "exclusive || folio_ref_count(folio) == 1", the first PTE that
faults will reuse the whole large folio. The remaining 14 faults will each
allocate a small folio (the swapcache has been dropped), so we use
16 + 14 = 30 pages of memory.

With either
A :   !folio_test_large(folio) || nr_pages > 1
or
B:    folio_nr_pages(folio) == nr_pages

we consume 15 pages.

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-07  8:43                   ` Barry Song
@ 2024-05-07  8:59                     ` David Hildenbrand
  2024-05-07  9:24                       ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-07  8:59 UTC (permalink / raw
  To: Barry Song
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

>> Let's assume a single subpage of a large folio is no longer mapped.
>> Then, we'd have:
>>
>> nr_pages == folio_nr_pages(folio) - 1.
>>
>> You could simply map+reuse most of the folio without COWing.
> 
> yes. This is good but the pte which is no longer mapped could be
> anyone within the nr_pages PTEs. so it could be quite tricky for
> set_ptes.

The swap batching logic should take care of that, otherwise it would be 
buggy.

> 
>>
>> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
>> just easy to detect that the folio is exclusive (folio_ref_count(folio)
>> == 1 before mapping anything).
>>
>> If you really want to mimic what do_wp_page() currently does, you should
>> have:
>>
>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> 
> I actually dislike the part that do_wp_page() handles the reuse of a large
> folio which is entirely mapped. For example, A forks B, B exit, we write
> A's large folio, we get nr_pages CoW of small folios. Ideally, we can
> reuse the whole folios for writing.

Yes, see the link I shared to what I am working on. There isn't really a 
question whether what we do right now needs to be improved, and all these 
scenarios are pretty clear.

> 
>>
>> Personally, I think we should keep it simple here and use:
>>
>> exclusive || folio_ref_count(folio) == 1
> 
> I feel this is still better than
> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> as we reuse the whole large folio. the do_wp_page() behaviour
> doesn't have this.

Yes, but there is the comment "Same logic as in do_wp_page();". We 
already ran into issues having different COW reuse logic all over the 
place. For this case here, I don't care if we leave it as

"exclusive || folio_ref_count(folio) == 1"

But let's not try inventing new stuff here.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-07  8:59                     ` David Hildenbrand
@ 2024-05-07  9:24                       ` Barry Song
  2024-05-07 10:39                         ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-07  9:24 UTC (permalink / raw
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 8:59 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> Let's assume a single subpage of a large folio is no longer mapped.
> >> Then, we'd have:
> >>
> >> nr_pages == folio_nr_pages(folio) - 1.
> >>
> >> You could simply map+reuse most of the folio without COWing.
> >
> > yes. This is good but the pte which is no longer mapped could be
> > anyone within the nr_pages PTEs. so it could be quite tricky for
> > set_ptes.
>
> The swap batching logic should take care of that, otherwise it would be
> buggy.

When you mention "it would be buggy," are you also referring to the current
fallback approach, or only to a future patch which might be able to
map/reuse "nr_pages - 1" pages?

The current patch falls back to setting nr_pages = 1 without mapping or
reusing nr_pages - 1. I feel your concern doesn't refer to this fallback?

>
> >
> >>
> >> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
> >> just easy to detect that the folio is exclusive (folio_ref_count(folio)
> >> == 1 before mapping anything).
> >>
> >> If you really want to mimic what do_wp_page() currently does, you should
> >> have:
> >>
> >> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> >
> > I actually dislike the part that do_wp_page() handles the reuse of a large
> > folio which is entirely mapped. For example, A forks B, B exit, we write
> > A's large folio, we get nr_pages CoW of small folios. Ideally, we can
> > reuse the whole folios for writing.
>
> Yes, see the link I shared to what I am working on. There isn't really a
> question if what we do right now needs to be improved and all these
> scenarios are pretty obvious clear.

Great! I plan to dedicate more time to reviewing your work.

>
> >
> >>
> >> Personally, I think we should keep it simple here and use:
> >>
> >> exclusive || folio_ref_count(folio) == 1
> >
> > I feel this is still better than
> > exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> > as we reuse the whole large folio. the do_wp_page() behaviour
> > doesn't have this.
>
> Yes, but there is the comment "Same logic as in do_wp_page();". We
> already ran into issues having different COW reuse logic all over the
> place. For this case here, I don't care if we leave it as
>
> "exclusive || folio_ref_count(folio) == 1"

I'm perfectly fine with using this code for the patchset and maybe looking
for other optimization opportunities as an incremental patchset, for
example reusing the remaining PTEs as you suggested - "simply map+reuse
most of the folio without COWing."

>
> But let's not try inventing new stuff here.

It seems you ignored and snipped my "16 + 14" pages vs. "15" pages
example though. But once we support "simply map+reuse most of the
folio without COWing", the "16 + 14" problem can be resolved; instead,
we consume 16 pages.

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-07  8:24               ` Barry Song
@ 2024-05-07  9:39                 ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-05-07  9:39 UTC (permalink / raw
  To: Barry Song
  Cc: David Hildenbrand, akpm, linux-mm, baolin.wang, chrisl,
	hanchuanhua, hannes, hughd, kasong, linux-kernel, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy

On 07/05/2024 09:24, Barry Song wrote:
> On Tue, May 7, 2024 at 8:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 06/05/2024 09:31, David Hildenbrand wrote:
>>> On 06.05.24 10:20, Barry Song wrote:
>>>> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 04.05.24 01:40, Barry Song wrote:
>>>>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>>
>>>>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>>>>
>>>>>>> You mean pte_next_swp_offset()?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>
>>>>>>>> will directly invoke it with delta = 1.
>>>>>>>>
>>>>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>>> ---
>>>>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
>>>>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>>>>> --- a/mm/internal.h
>>>>>>>> +++ b/mm/internal.h
>>>>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
>>>>>>>> *folio, unsigned long addr,
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    /**
>>>>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
>>>>>>>> pte.
>>>>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>>>>> + *    forward or backward by delta
>>>>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>>>>     *    non_swap_entry() must be false.
>>>>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>>>>> + *    is positive; backward if delta is negative
>>>>>>>>     *
>>>>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>>>>     */
>>>>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>>>>
>>>>>>> We have equivalent functions for pfn:
>>>>>>>
>>>>>>>     pte_next_pfn()
>>>>>>>     pte_advance_pfn()
>>>>>>>
>>>>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>>>>
>>>>>>> I guess we don't have a need for that and it adds more churn.
>>>>>>
>>>>>> we might have a need in the below case.
>>>>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>>>>> folios of process
>>>>>> A become single-mapped.
>>>>>> Right now, while writing A's folios, we are CoWing A's large folios
>>>>>> into many small
>>>>>> folios. I believe we can reuse the entire large folios instead of doing
>>>>>> nr_pages
>>>>>> CoW and page faults.
>>>>>> In this case, we might want to get the first PTE from vmf->pte.
>>>>>
>>>>> Once we have COW reuse for large folios in place (I think you know that
>>>>> I am working on that), it might make sense to "COW-reuse around",
>>>>
>>>> TBH, I don't know if you are working on that. please Cc me next time :-)
>>>
>>> I could have sworn I mentioned it to you already :)
>>>
>>> See
>>>
>>> https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
>>>
>>> I'll follow-up on that soonish (now that batching is upstream and the large
>>> mapcount is on its way upstream).
>>>
>>>>
>>>>> meaning we look if some neighboring PTEs map the same large folio and
>>>>> map them writable as well. But if it's really worth it, increasing page
>>>>> fault latency, is to be decided separately.
>>>>
>>>> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
>>>> Perhaps we can discover a more cost-effective method to signify that a large
>>>> folio is probably singly mapped?
>>>
>>> Yes, precisely what I am up to!
>>>
>>>> and only call "multi-PTEs" reuse while that
>>>> condition is true in PF and avoid increasing latency always?
>>>
>>> I'm thinking along those lines:
>>>
>>> If we detect that it's exclusive, we can certainly mapped the current PTE
>>> writable. Then, we can decide how much (and if) we want to fault-around writable
>>> as an optimization.
>>>
>>> For smallish large folios, it might make sense to try faulting around most of
>>> the folio.
>>>
>>> For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
>>> to fault around the whole thing -- especially if there is little benefit to be
>>> had from contig-pte bits.
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Another case, might be
>>>>>> A forks B, and we write either A or B, we might CoW an entire large
>>>>>> folios instead
>>>>>> CoWing nr_pages small folios.
>>>>>>
>>>>>> case 1 seems more useful, I might have a go after some days. then we might
>>>>>> see pte_move_pfn().
>>>>> pte_move_pfn() does sound odd to me.
>>
>> Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
>> Perhaps just pte_advance_swp_offset() with a negative value is clearer about
>> what its doing?
>>
> 
> I am not a native speaker. but dictionary says
> 
> advance:
> move forward in a purposeful way.
> a forward movement.
> 
> Now we are moving backward or forward :-)

Sure, but if you pass a negative value then you are moving forwards by a
negative amount ;-)

Anyway, forget I said anything - it's not important.

> 
>>>>> It might not be required to
>>>>> implement the optimization described above. (it's easier to simply read
>>>>> another PTE, check if it maps the same large folio, and to batch from there)
>>
>> Yes agreed.
>>
>>>>>
>>>>
>>>> It appears that your proposal suggests potential reusability as follows: if we
>>>> have a large folio containing 16 PTEs, you might consider reusing only 4 by
>>>> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
>>>> if my understanding is wrong.
>>>>
>>>> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
>>>> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
>>>> in nr_pages, thus enabling complete reuse of the whole folio.
>>>
>>> Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
>>> there might be easier and cleaner.
>>>
>>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-07  9:24                       ` Barry Song
@ 2024-05-07 10:39                         ` David Hildenbrand
  2024-05-07 10:48                           ` Barry Song
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-05-07 10:39 UTC (permalink / raw
  To: Barry Song
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On 07.05.24 11:24, Barry Song wrote:
> On Tue, May 7, 2024 at 8:59 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> Let's assume a single subpage of a large folio is no longer mapped.
>>>> Then, we'd have:
>>>>
>>>> nr_pages == folio_nr_pages(folio) - 1.
>>>>
>>>> You could simply map+reuse most of the folio without COWing.
>>>
>>> yes. This is good but the pte which is no longer mapped could be
>>> anyone within the nr_pages PTEs. so it could be quite tricky for
>>> set_ptes.
>>
>> The swap batching logic should take care of that, otherwise it would be
>> buggy.
> 
> When you mention "it would be buggy," are you also referring to the current
> fallback approach? or only refer to the future patch which might be able
> to map/reuse "nr_pages - 1" pages?

swap_pte_batch() should not skip any holes, so consequently set_ptes() 
should do the right thing (regarding your comment "could be quite tricky 
for set_ptes").

So I think that should be working as expected.
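
A small illustration, assuming the usual swap_pte_batch() semantics of
stopping at the first PTE that does not continue the expected swap run (the
concrete numbers below are made up):

        /*
         * Example: a 16-page folio whose PTE for subpage 7 has been zapped.
         * Batching from the folio's first swap PTE stops at the hole:
         *
         *     nr = swap_pte_batch(folio_ptep, 16, folio_pte);  // nr == 7
         *
         * set_ptes(vma->vm_mm, folio_start, folio_ptep, pte, nr) would then
         * only write the 7 consecutive PTEs of that leading run and never
         * touch the hole, so mapping a partial batch stays safe.
         */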

> 
> The current patch falls back to setting nr_pages = 1 without mapping or
> reusing nr_pages - 1. I feel your concern doesn't refer to this fallback?
> 
>>
>>>
>>>>
>>>> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
>>>> just easy to detect that the folio is exclusive (folio_ref_count(folio)
>>>> == 1 before mapping anything).
>>>>
>>>> If you really want to mimic what do_wp_page() currently does, you should
>>>> have:
>>>>
>>>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
>>>
>>> I actually dislike the part that do_wp_page() handles the reuse of a large
>>> folio which is entirely mapped. For example, A forks B, B exit, we write
>>> A's large folio, we get nr_pages CoW of small folios. Ideally, we can
>>> reuse the whole folios for writing.
>>
>> Yes, see the link I shared to what I am working on. There isn't really a
>> question if what we do right now needs to be improved and all these
>> scenarios are pretty obvious clear.
> 
> Great! I plan to dedicate more time to reviewing your work.

Nice! And there will be a lot of follow-up optimization work I won't 
tackle immediately regarding COW (COW-reuse around, maybe sometimes we 
want to COW bigger chunks).

I still have making PageAnonExclusive a per-folio flag on my TODO list; 
that will help the COW-reuse-around case a lot.

> 
>>
>>>
>>>>
>>>> Personally, I think we should keep it simple here and use:
>>>>
>>>> exclusive || folio_ref_count(folio) == 1
>>>
>>> I feel this is still better than
>>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
>>> as we reuse the whole large folio. the do_wp_page() behaviour
>>> doesn't have this.
>>
>> Yes, but there is the comment "Same logic as in do_wp_page();". We
>> already ran into issues having different COW reuse logic all over the
>> place. For this case here, I don't care if we leave it as
>>
>> "exclusive || folio_ref_count(folio) == 1"
> 
> I'm perfectly fine with using the code for this patchset and maybe
> looking for other
> opportunities for potential optimization as an incremental patchset,
> for example,
> reusing the remaining PTEs as suggested by you -  "simply map+reuse most of
> the folio without COWing."
> 
>>
>> But let's not try inventing new stuff here.
> 
> It seems you ignored and snipped my "16 + 14" pages and "15" pages
> example though. but once we support "simply map+reuse most of the
> folio without COWing", the "16+14" problem can be resolved, instead,
> we consume 16 pages.


Oh, sorry for skipping that, for me it was rather clear: the partially 
mapped folios will be on the deferred split list and the excess memory 
can (and will be) reclaimed when needed. So this temporary memory 
consumption is usually not a problem in practice. But yes, something to 
optimize (just like COW reuse in general).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache
  2024-05-07 10:39                         ` David Hildenbrand
@ 2024-05-07 10:48                           ` Barry Song
  0 siblings, 0 replies; 49+ messages in thread
From: Barry Song @ 2024-05-07 10:48 UTC (permalink / raw
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, chrisl, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy

On Tue, May 7, 2024 at 6:39 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.05.24 11:24, Barry Song wrote:
> > On Tue, May 7, 2024 at 8:59 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>> Let's assume a single subpage of a large folio is no longer mapped.
> >>>> Then, we'd have:
> >>>>
> >>>> nr_pages == folio_nr_pages(folio) - 1.
> >>>>
> >>>> You could simply map+reuse most of the folio without COWing.
> >>>
> >>> yes. This is good but the pte which is no longer mapped could be
> >>> anyone within the nr_pages PTEs. so it could be quite tricky for
> >>> set_ptes.
> >>
> >> The swap batching logic should take care of that, otherwise it would be
> >> buggy.
> >
> > When you mention "it would be buggy," are you also referring to the current
> > fallback approach? or only refer to the future patch which might be able
> > to map/reuse "nr_pages - 1" pages?
>
> swap_pte_batch() should not skip any holes. So consequently, set_ptes()
> should do the right thing. (regarding your comment "could be quite ricky
> for set_ptes")
>
> So I think that should be working as expected.

maybe not. take a look at my current code, I am goto check_folio with
nr_pages = 1
if  swap_pte_batch(folio_ptep, nr, folio_pte) != folio_nr_pages(folio).

+       nr_pages = 1;
+       ...
+       if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+               int nr = folio_nr_pages(folio);
+               ...
+               if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
+                   swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+                       goto check_folio; /* here I fall back to nr_pages = 1 */
+
+               ...
+               nr_pages = nr;

The fallback (nr_pages = 1) works, but it seems you are proposing to set
nr_pages = swap_pte_batch(folio_ptep, nr, folio_pte)
if (swap_pte_batch(folio_ptep, nr, folio_pte) > 1 &&
    swap_pte_batch(folio_ptep, nr, folio_pte) < nr_pages)?
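
i.e., something like the below rough sketch, if I'm reading your
suggestion right (nr_mapped is just an illustrative local, it is not
in the patch):

		int nr_mapped = swap_pte_batch(folio_ptep, nr, folio_pte);

		if (nr_mapped > 1)
			nr_pages = nr_mapped;	/* map/reuse only the contiguous part */
		else
			goto check_folio;	/* keep the single-page fallback */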

>
> >
> > The current patch falls back to setting nr_pages = 1 without mapping or
> > reusing nr_pages - 1. I feel your concern doesn't refer to this fallback?
> >
> >>
> >>>
> >>>>
> >>>> Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
> >>>> just easy to detect that the folio is exclusive (folio_ref_count(folio)
> >>>> == 1 before mapping anything).
> >>>>
> >>>> If you really want to mimic what do_wp_page() currently does, you should
> >>>> have:
> >>>>
> >>>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> >>>
> >>> I actually dislike the part that do_wp_page() handles the reuse of a large
> >>> folio which is entirely mapped. For example, A forks B, B exit, we write
> >>> A's large folio, we get nr_pages CoW of small folios. Ideally, we can
> >>> reuse the whole folios for writing.
> >>
> >> Yes, see the link I shared to what I am working on. There isn't really a
> >> question if what we do right now needs to be improved and all these
> >> scenarios are pretty obvious clear.
> >
> > Great! I plan to dedicate more time to reviewing your work.
>
> Nice! And there will be a lot of follow-up optimization work I won't
> tackle immediately regarding COW (COW-reuse around, maybe sometimes we
> want to COW bigger chunks).
>
> I still have making PageAnonExclusive a per-folio flag on my TODO list,
> that will help the COW-reuse around case a lot.
>
> >
> >>
> >>>
> >>>>
> >>>> Personally, I think we should keep it simple here and use:
> >>>>
> >>>> exclusive || folio_ref_count(folio) == 1
> >>>
> >>> I feel this is still better than
> >>> exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))
> >>> as we reuse the whole large folio. the do_wp_page() behaviour
> >>> doesn't have this.
> >>
> >> Yes, but there is the comment "Same logic as in do_wp_page();". We
> >> already ran into issues having different COW reuse logic all over the
> >> place. For this case here, I don't care if we leave it as
> >>
> >> "exclusive || folio_ref_count(folio) == 1"
> >
> > I'm perfectly fine with using the code for this patchset and maybe
> > looking for other
> > opportunities for potential optimization as an incremental patchset,
> > for example,
> > reusing the remaining PTEs as suggested by you -  "simply map+reuse most of
> > the folio without COWing."
> >
> >>
> >> But let's not try inventing new stuff here.
> >
> > It seems you ignored and snipped my "16 + 14" pages and "15" pages
> > example though. but once we support "simply map+reuse most of the
> > folio without COWing", the "16+14" problem can be resolved, instead,
> > we consume 16 pages.
>
>
> Oh, sorry for skipping that, for me it was rather clear: the partially
> mapped folios will be on the deferred split list and the excess memory
> can (and will be) reclaimed when there is need. So this temporary memory
> consumption is usually not a problem in practice. But yes, something to
> optimize (just like COW reuse in general).
>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-05-03  0:50 ` [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
  2024-05-03  9:26   ` Ryan Roberts
  2024-05-03 20:25   ` Chris Li
@ 2024-05-08  7:35   ` Huang, Ying
  2 siblings, 0 replies; 49+ messages in thread
From: Huang, Ying @ 2024-05-08  7:35 UTC (permalink / raw
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, linux-kernel, ryan.roberts, surenb, v-songbaohua,
	willy, xiang, yosryahmed, yuzhao, ziy

Barry Song <21cnbao@gmail.com> writes:

> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> Furthermore, this new function, swap_free_nr(), is designed to efficiently
> handle various scenarios for releasing a specified number, nr, of swap
> entries.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 47 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..d1d35e92d7e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6ca215fb92f..ec12f2b9d229 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,53 @@ void swap_free(swp_entry_t entry)
>  		__swap_entry_free(p, entry);
>  }
>  
> +static void cluster_swap_free_nr(struct swap_info_struct *sis,
> +		unsigned long offset, int nr_pages)
> +{
> +	struct swap_cluster_info *ci;
> +	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
> +	int i, nr;
> +
> +	ci = lock_cluster_or_swap_info(sis, offset);
> +	while (nr_pages) {
> +		nr = min(BITS_PER_LONG, nr_pages);
> +		for (i = 0; i < nr; i++) {
> +			if (!__swap_entry_free_locked(sis, offset + i, 1))
> +				bitmap_set(to_free, i, 1);
> +		}
> +		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> +			unlock_cluster_or_swap_info(sis, ci);
> +			for_each_set_bit(i, to_free, BITS_PER_LONG)
> +				free_swap_slot(swp_entry(sis->type, offset + i));
> +			if (nr == nr_pages)
> +				return;
> +			bitmap_clear(to_free, 0, BITS_PER_LONG);
> +			ci = lock_cluster_or_swap_info(sis, offset);
> +		}
> +		offset += nr;
> +		nr_pages -= nr;
> +	}
> +	unlock_cluster_or_swap_info(sis, ci);
> +}
> +
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +	int nr;
> +	struct swap_info_struct *sis;
> +	unsigned long offset = swp_offset(entry);
> +
> +	sis = _swap_info_get(entry);
> +	if (!sis)
> +		return;
> +
> +	while (nr_pages) {
> +		nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> +		cluster_swap_free_nr(sis, offset, nr);
> +		offset += nr;
> +		nr_pages -= nr;
> +	}
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
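
As a usage note, a swap-in path that maps a whole large folio can then
batch-free all of its swap entries with one call, roughly like the
sketch below (illustrative only; this follows how patch 6/6 of this
series uses the helper):

	int nr_pages = folio_nr_pages(folio);

	/* entry is the swap entry of the folio's first page */
	swap_free_nr(entry, nr_pages);	/* instead of nr_pages swap_free() calls */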


* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-03  9:31   ` Ryan Roberts
  2024-05-03 20:37     ` Chris Li
@ 2024-05-08  7:56     ` Huang, Ying
  2024-05-08  8:30       ` Barry Song
  1 sibling, 1 reply; 49+ messages in thread
From: Huang, Ying @ 2024-05-08  7:56 UTC (permalink / raw
  To: Ryan Roberts, Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, Rafael J. Wysocki, Pavel Machek,
	Len Brown

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 03/05/2024 01:50, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>> 
>> To streamline maintenance efforts, we propose discontinuing the use of
>> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
>> to 1. This adjustment offers the advantage of enabling batch processing
>> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
>> a bitmap consisting of only one long, resulting in overhead that can be
>> ignored for cases where nr equals 1.
>> 
>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> ---
>>  include/linux/swap.h |  5 -----
>>  kernel/power/swap.c  |  7 +++----
>>  mm/memory.c          |  2 +-
>>  mm/rmap.c            |  4 ++--
>>  mm/shmem.c           |  4 ++--
>>  mm/swapfile.c        | 19 +++++--------------
>>  6 files changed, 13 insertions(+), 28 deletions(-)
>> 
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index d1d35e92d7e9..f03cb446124e 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>  extern void swap_shmem_alloc(swp_entry_t);
>>  extern int swap_duplicate(swp_entry_t);
>>  extern int swapcache_prepare(swp_entry_t);
>> -extern void swap_free(swp_entry_t);
>
> I wonder if it would be cleaner to:
>
> #define swap_free(entry) swap_free_nr((entry), 1)
>
> To save all the churn for the callsites that just want to pass a single entry?

I prefer this way.  Although I prefer inline functions.
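
I.e., something like the following (sketch):

static inline void swap_free(swp_entry_t entry)
{
	swap_free_nr(entry, 1);
}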

Otherwise, LGTM.  Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

in the future version.

>>  extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
>>  	return 0;
>>  }
>>  
>> -static inline void swap_free(swp_entry_t swp)
>> -{
>> -}
>> -
>>  static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
>>  {
>>  }
>> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
>> index 5bc04bfe2db1..6befaa88a342 100644
>> --- a/kernel/power/swap.c
>> +++ b/kernel/power/swap.c
>> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
>>  	offset = swp_offset(get_swap_page_of_type(swap));
>>  	if (offset) {
>>  		if (swsusp_extents_insert(offset))
>> -			swap_free(swp_entry(swap, offset));
>> +			swap_free_nr(swp_entry(swap, offset), 1);
>>  		else
>>  			return swapdev_block(swap, offset);
>>  	}
>> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
>>  
>>  	while ((node = swsusp_extents.rb_node)) {
>>  		struct swsusp_extent *ext;
>> -		unsigned long offset;
>>  
>>  		ext = rb_entry(node, struct swsusp_extent, node);
>>  		rb_erase(node, &swsusp_extents);
>> -		for (offset = ext->start; offset <= ext->end; offset++)
>> -			swap_free(swp_entry(swap, offset));
>> +		swap_free_nr(swp_entry(swap, ext->start),
>> +			     ext->end - ext->start + 1);
>>  
>>  		kfree(ext);
>>  	}
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eea6e4984eae..f033eb3528ba 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	 * We're already holding a reference on the page but haven't mapped it
>>  	 * yet.
>>  	 */
>> -	swap_free(entry);
>> +	swap_free_nr(entry, 1);
>>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
>>  		folio_free_swap(folio);
>>  
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 087a79f1f611..39ec7742acec 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>  				goto walk_done_err;
>>  			}
>>  			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>> -				swap_free(entry);
>> +				swap_free_nr(entry, 1);
>>  				set_pte_at(mm, address, pvmw.pte, pteval);
>>  				goto walk_done_err;
>>  			}
>> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>  			/* See folio_try_share_anon_rmap(): clear PTE first. */
>>  			if (anon_exclusive &&
>>  			    folio_try_share_anon_rmap_pte(folio, subpage)) {
>> -				swap_free(entry);
>> +				swap_free_nr(entry, 1);
>>  				set_pte_at(mm, address, pvmw.pte, pteval);
>>  				goto walk_done_err;
>>  			}
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index fa2a0ed97507..bfc8a2beb24f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>  	 * in shmem_evict_inode().
>>  	 */
>>  	shmem_recalc_inode(inode, -1, -1);
>> -	swap_free(swap);
>> +	swap_free_nr(swap, 1);
>>  }
>>  
>>  /*
>> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>  
>>  	delete_from_swap_cache(folio);
>>  	folio_mark_dirty(folio);
>> -	swap_free(swap);
>> +	swap_free_nr(swap, 1);
>>  	put_swap_device(si);
>>  
>>  	*foliop = folio;
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index ec12f2b9d229..ddcd0f24b9a1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>>  	swap_range_free(p, offset, 1);
>>  }
>>  
>> -/*
>> - * Caller has made sure that the swap device corresponding to entry
>> - * is still around or has not been recycled.
>> - */
>> -void swap_free(swp_entry_t entry)
>> -{
>> -	struct swap_info_struct *p;
>> -
>> -	p = _swap_info_get(entry);
>> -	if (p)
>> -		__swap_entry_free(p, entry);
>> -}
>> -
>>  static void cluster_swap_free_nr(struct swap_info_struct *sis,
>>  		unsigned long offset, int nr_pages)
>>  {
>> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
>>  	unlock_cluster_or_swap_info(sis, ci);
>>  }
>>  
>> +/*
>> + * Caller has made sure that the swap device corresponding to entry
>> + * is still around or has not been recycled.
>> + */
>>  void swap_free_nr(swp_entry_t entry, int nr_pages)
>>  {
>>  	int nr;
>> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>>  		new_pte = pte_mkuffd_wp(new_pte);
>>  setpte:
>>  	set_pte_at(vma->vm_mm, addr, pte, new_pte);
>> -	swap_free(entry);
>> +	swap_free_nr(entry, 1);
>>  out:
>>  	if (pte)
>>  		pte_unmap_unlock(pte, ptl);

--
Best Regards,
Huang, Ying


* Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
  2024-05-03  0:50 ` [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally Barry Song
  2024-05-03  9:41   ` Ryan Roberts
  2024-05-03 20:51   ` Chris Li
@ 2024-05-08  8:08   ` Huang, Ying
  2 siblings, 0 replies; 49+ messages in thread
From: Huang, Ying @ 2024-05-08  8:08 UTC (permalink / raw
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, linux-kernel, ryan.roberts, surenb, v-songbaohua,
	willy, xiang, yosryahmed, yuzhao, ziy

Barry Song <21cnbao@gmail.com> writes:

> From: Barry Song <v-songbaohua@oppo.com>
>
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()
> will directly invoke it with delta = 1.
>
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

LGTM, Thanks!  Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

in the future version.

> ---
>  mm/internal.h | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  }
>  
>  /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + *	 forward or backward by delta
>   * @pte: The initial pte state; is_swap_pte(pte) must be true and
>   *	 non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + *	 is positive; backward if delta is negative
>   *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
>   * swap type, and any swp pte bits. The resulting pte is returned.
>   */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>  {
>  	swp_entry_t entry = pte_to_swp_entry(pte);
>  	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> -						   (swp_offset(entry) + 1)));
> +						   (swp_offset(entry) + delta)));
>  
>  	if (pte_swp_soft_dirty(pte))
>  		new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>  	return new;
>  }
>  
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + *	 non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> +	return pte_move_swp_offset(pte, 1);
> +}
> +
>  /**
>   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>   * @start_ptep: Page table pointer for the first entry.
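
As a usage note, the bidirectional helper lets a fault on any PTE of a
large folio recover the folio's first swap pte from the faulting pte,
roughly (sketch; names follow the snippet discussed for patch 6/6):

	/* idx: index of the faulting page within the large folio */
	pte_t folio_pte = pte_move_swp_offset(vmf->orig_pte, -idx);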

--
Best Regards,
Huang, Ying


* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-08  7:56     ` Huang, Ying
@ 2024-05-08  8:30       ` Barry Song
  2024-05-08  9:10         ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Barry Song @ 2024-05-08  8:30 UTC (permalink / raw
  To: Huang, Ying, Christoph Hellwig, chrisl
  Cc: Ryan Roberts, akpm, linux-mm, baolin.wang, david, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, surenb, v-songbaohua, willy,
	xiang, yosryahmed, yuzhao, ziy, Rafael J. Wysocki, Pavel Machek,
	Len Brown

On Wed, May 8, 2024 at 7:58 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> > On 03/05/2024 01:50, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> To streamline maintenance efforts, we propose discontinuing the use of
> >> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
> >> to 1. This adjustment offers the advantage of enabling batch processing
> >> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
> >> a bitmap consisting of only one long, resulting in overhead that can be
> >> ignored for cases where nr equals 1.
> >>
> >> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> >> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> >> Cc: Pavel Machek <pavel@ucw.cz>
> >> Cc: Len Brown <len.brown@intel.com>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> ---
> >>  include/linux/swap.h |  5 -----
> >>  kernel/power/swap.c  |  7 +++----
> >>  mm/memory.c          |  2 +-
> >>  mm/rmap.c            |  4 ++--
> >>  mm/shmem.c           |  4 ++--
> >>  mm/swapfile.c        | 19 +++++--------------
> >>  6 files changed, 13 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> index d1d35e92d7e9..f03cb446124e 100644
> >> --- a/include/linux/swap.h
> >> +++ b/include/linux/swap.h
> >> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> >>  extern void swap_shmem_alloc(swp_entry_t);
> >>  extern int swap_duplicate(swp_entry_t);
> >>  extern int swapcache_prepare(swp_entry_t);
> >> -extern void swap_free(swp_entry_t);
> >
> > I wonder if it would be cleaner to:
> >
> > #define swap_free(entry) swap_free_nr((entry), 1)
> >
> > To save all the churn for the callsites that just want to pass a single entry?
>
> I prefer this way.  Although I prefer inline functions.

Yes, using static inline is preferable. I've recently submitted
checkpatch/codestyle patches for this, which can be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=39c58d5ed036
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=8379bf0b0e1f5

Using static inline aligns with the established rule.

>
> Otherwise, LGTM.  Feel free to add
>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Thanks!

>
> in the future version.

I believe Christoph's vote leans towards simply removing swap_free_nr()
and renaming it to swap_free() while adding a new parameter, i.e.:

void swap_free(swp_entry_t entry, int nr);

Now I see Ryan and you prefer:

static inline void swap_free(swp_entry_t entry)
{
        swap_free_nr(entry, 1);
}

Chris slightly favors discouraging the use of swap_free() without the
new parameter. Removing swap_free() can address this concern.

It seems that maintaining swap_free() and having it call swap_free_nr() with
a default value of 1 received the most support.

To align with free_swap_and_cache() and free_swap_and_cache_nr(),
I'll proceed with the "static inline" approach in the new version. Please
voice any objections you may have, Christoph, Chris.

>
> >>  extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
> >>      return 0;
> >>  }
> >>
> >> -static inline void swap_free(swp_entry_t swp)
> >> -{
> >> -}
> >> -
> >>  static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> >>  {
> >>  }
> >> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> >> index 5bc04bfe2db1..6befaa88a342 100644
> >> --- a/kernel/power/swap.c
> >> +++ b/kernel/power/swap.c
> >> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
> >>      offset = swp_offset(get_swap_page_of_type(swap));
> >>      if (offset) {
> >>              if (swsusp_extents_insert(offset))
> >> -                    swap_free(swp_entry(swap, offset));
> >> +                    swap_free_nr(swp_entry(swap, offset), 1);
> >>              else
> >>                      return swapdev_block(swap, offset);
> >>      }
> >> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
> >>
> >>      while ((node = swsusp_extents.rb_node)) {
> >>              struct swsusp_extent *ext;
> >> -            unsigned long offset;
> >>
> >>              ext = rb_entry(node, struct swsusp_extent, node);
> >>              rb_erase(node, &swsusp_extents);
> >> -            for (offset = ext->start; offset <= ext->end; offset++)
> >> -                    swap_free(swp_entry(swap, offset));
> >> +            swap_free_nr(swp_entry(swap, ext->start),
> >> +                         ext->end - ext->start + 1);
> >>
> >>              kfree(ext);
> >>      }
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index eea6e4984eae..f033eb3528ba 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>       * We're already holding a reference on the page but haven't mapped it
> >>       * yet.
> >>       */
> >> -    swap_free(entry);
> >> +    swap_free_nr(entry, 1);
> >>      if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>              folio_free_swap(folio);
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 087a79f1f611..39ec7742acec 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>                              goto walk_done_err;
> >>                      }
> >>                      if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> >> -                            swap_free(entry);
> >> +                            swap_free_nr(entry, 1);
> >>                              set_pte_at(mm, address, pvmw.pte, pteval);
> >>                              goto walk_done_err;
> >>                      }
> >> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>                      /* See folio_try_share_anon_rmap(): clear PTE first. */
> >>                      if (anon_exclusive &&
> >>                          folio_try_share_anon_rmap_pte(folio, subpage)) {
> >> -                            swap_free(entry);
> >> +                            swap_free_nr(entry, 1);
> >>                              set_pte_at(mm, address, pvmw.pte, pteval);
> >>                              goto walk_done_err;
> >>                      }
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index fa2a0ed97507..bfc8a2beb24f 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> >>       * in shmem_evict_inode().
> >>       */
> >>      shmem_recalc_inode(inode, -1, -1);
> >> -    swap_free(swap);
> >> +    swap_free_nr(swap, 1);
> >>  }
> >>
> >>  /*
> >> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>
> >>      delete_from_swap_cache(folio);
> >>      folio_mark_dirty(folio);
> >> -    swap_free(swap);
> >> +    swap_free_nr(swap, 1);
> >>      put_swap_device(si);
> >>
> >>      *foliop = folio;
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index ec12f2b9d229..ddcd0f24b9a1 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> >>      swap_range_free(p, offset, 1);
> >>  }
> >>
> >> -/*
> >> - * Caller has made sure that the swap device corresponding to entry
> >> - * is still around or has not been recycled.
> >> - */
> >> -void swap_free(swp_entry_t entry)
> >> -{
> >> -    struct swap_info_struct *p;
> >> -
> >> -    p = _swap_info_get(entry);
> >> -    if (p)
> >> -            __swap_entry_free(p, entry);
> >> -}
> >> -
> >>  static void cluster_swap_free_nr(struct swap_info_struct *sis,
> >>              unsigned long offset, int nr_pages)
> >>  {
> >> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
> >>      unlock_cluster_or_swap_info(sis, ci);
> >>  }
> >>
> >> +/*
> >> + * Caller has made sure that the swap device corresponding to entry
> >> + * is still around or has not been recycled.
> >> + */
> >>  void swap_free_nr(swp_entry_t entry, int nr_pages)
> >>  {
> >>      int nr;
> >> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> >>              new_pte = pte_mkuffd_wp(new_pte);
> >>  setpte:
> >>      set_pte_at(vma->vm_mm, addr, pte, new_pte);
> >> -    swap_free(entry);
> >> +    swap_free_nr(entry, 1);
> >>  out:
> >>      if (pte)
> >>              pte_unmap_unlock(pte, ptl);
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry


* Re: [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr()
  2024-05-08  8:30       ` Barry Song
@ 2024-05-08  9:10         ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-05-08  9:10 UTC (permalink / raw
  To: Barry Song, Huang, Ying, Christoph Hellwig, chrisl
  Cc: akpm, linux-mm, baolin.wang, david, hanchuanhua, hannes, hughd,
	kasong, linux-kernel, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, Rafael J. Wysocki, Pavel Machek,
	Len Brown

On 08/05/2024 09:30, Barry Song wrote:
> On Wed, May 8, 2024 at 7:58 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> On 03/05/2024 01:50, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> To streamline maintenance efforts, we propose discontinuing the use of
>>>> swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
>>>> to 1. This adjustment offers the advantage of enabling batch processing
>>>> within kernel/power/swap.c. Furthermore, swap_free_nr() is designed with
>>>> a bitmap consisting of only one long, resulting in overhead that can be
>>>> ignored for cases where nr equals 1.
>>>>
>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>>>> Cc: Pavel Machek <pavel@ucw.cz>
>>>> Cc: Len Brown <len.brown@intel.com>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> ---
>>>>  include/linux/swap.h |  5 -----
>>>>  kernel/power/swap.c  |  7 +++----
>>>>  mm/memory.c          |  2 +-
>>>>  mm/rmap.c            |  4 ++--
>>>>  mm/shmem.c           |  4 ++--
>>>>  mm/swapfile.c        | 19 +++++--------------
>>>>  6 files changed, 13 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index d1d35e92d7e9..f03cb446124e 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -482,7 +482,6 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>>>  extern void swap_shmem_alloc(swp_entry_t);
>>>>  extern int swap_duplicate(swp_entry_t);
>>>>  extern int swapcache_prepare(swp_entry_t);
>>>> -extern void swap_free(swp_entry_t);
>>>
>>> I wonder if it would be cleaner to:
>>>
>>> #define swap_free(entry) swap_free_nr((entry), 1)
>>>
>>> To save all the churn for the callsites that just want to pass a single entry?
>>
>> I prefer this way.  Although I prefer inline functions.

Yes, I agree an inline function is the better approach.

> 
> Yes, using static inline is preferable. I've recently submitted
> a checkpatch/codestyle for this, which can be found at:
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=39c58d5ed036
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=8379bf0b0e1f5
> 
> Using static inline aligns with the established rule.
> 
>>
>> Otherwise, LGTM.  Feel free to add
>>
>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> 
> Thanks!
> 
>>
>> in the future version.
> 
> I believe Christoph's vote leans towards simply removing swap_free_nr
> and renaming it to swap_free, while adding a new parameter as follows.
> 
> void swap_free(swp_entry_t entry, int nr);
> {
> }
> 
> now I see Ryan and you prefer
> 
> static inline swap_free()
> {
>         swap_free_nr(...., 1)
> }
> 
> Chris slightly favors discouraging the use of swap_free() without the
> new parameter. Removing swap_free() can address this concern.
> 
> It seems that maintaining swap_free() and having it call swap_free_nr() with
> a default value of 1 received the most support.
> 
> To align with free_swap_and_cache() and free_swap_and_cache_nr(),
> I'll proceed with the "static inline" approach in the new version. Please
> voice any objections you may have, Christoph, Chris.

I'm happy with either route. If you end up adding a nr param to swap_free() then
it would also be good to give free_swap_and_cache_nr() the same treatment.
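
i.e., roughly (sketch):

void swap_free(swp_entry_t entry, int nr);
void free_swap_and_cache(swp_entry_t entry, int nr);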

> 
>>
>>>>  extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>>> @@ -561,10 +560,6 @@ static inline int swapcache_prepare(swp_entry_t swp)
>>>>      return 0;
>>>>  }
>>>>
>>>> -static inline void swap_free(swp_entry_t swp)
>>>> -{
>>>> -}
>>>> -
>>>>  static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
>>>>  {
>>>>  }
>>>> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
>>>> index 5bc04bfe2db1..6befaa88a342 100644
>>>> --- a/kernel/power/swap.c
>>>> +++ b/kernel/power/swap.c
>>>> @@ -181,7 +181,7 @@ sector_t alloc_swapdev_block(int swap)
>>>>      offset = swp_offset(get_swap_page_of_type(swap));
>>>>      if (offset) {
>>>>              if (swsusp_extents_insert(offset))
>>>> -                    swap_free(swp_entry(swap, offset));
>>>> +                    swap_free_nr(swp_entry(swap, offset), 1);
>>>>              else
>>>>                      return swapdev_block(swap, offset);
>>>>      }
>>>> @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap)
>>>>
>>>>      while ((node = swsusp_extents.rb_node)) {
>>>>              struct swsusp_extent *ext;
>>>> -            unsigned long offset;
>>>>
>>>>              ext = rb_entry(node, struct swsusp_extent, node);
>>>>              rb_erase(node, &swsusp_extents);
>>>> -            for (offset = ext->start; offset <= ext->end; offset++)
>>>> -                    swap_free(swp_entry(swap, offset));
>>>> +            swap_free_nr(swp_entry(swap, ext->start),
>>>> +                         ext->end - ext->start + 1);
>>>>
>>>>              kfree(ext);
>>>>      }
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index eea6e4984eae..f033eb3528ba 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4225,7 +4225,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>       * We're already holding a reference on the page but haven't mapped it
>>>>       * yet.
>>>>       */
>>>> -    swap_free(entry);
>>>> +    swap_free_nr(entry, 1);
>>>>      if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>              folio_free_swap(folio);
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 087a79f1f611..39ec7742acec 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1865,7 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                              goto walk_done_err;
>>>>                      }
>>>>                      if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>>>> -                            swap_free(entry);
>>>> +                            swap_free_nr(entry, 1);
>>>>                              set_pte_at(mm, address, pvmw.pte, pteval);
>>>>                              goto walk_done_err;
>>>>                      }
>>>> @@ -1873,7 +1873,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                      /* See folio_try_share_anon_rmap(): clear PTE first. */
>>>>                      if (anon_exclusive &&
>>>>                          folio_try_share_anon_rmap_pte(folio, subpage)) {
>>>> -                            swap_free(entry);
>>>> +                            swap_free_nr(entry, 1);
>>>>                              set_pte_at(mm, address, pvmw.pte, pteval);
>>>>                              goto walk_done_err;
>>>>                      }
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index fa2a0ed97507..bfc8a2beb24f 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -1836,7 +1836,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>       * in shmem_evict_inode().
>>>>       */
>>>>      shmem_recalc_inode(inode, -1, -1);
>>>> -    swap_free(swap);
>>>> +    swap_free_nr(swap, 1);
>>>>  }
>>>>
>>>>  /*
>>>> @@ -1927,7 +1927,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>
>>>>      delete_from_swap_cache(folio);
>>>>      folio_mark_dirty(folio);
>>>> -    swap_free(swap);
>>>> +    swap_free_nr(swap, 1);
>>>>      put_swap_device(si);
>>>>
>>>>      *foliop = folio;
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index ec12f2b9d229..ddcd0f24b9a1 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -1343,19 +1343,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>>>>      swap_range_free(p, offset, 1);
>>>>  }
>>>>
>>>> -/*
>>>> - * Caller has made sure that the swap device corresponding to entry
>>>> - * is still around or has not been recycled.
>>>> - */
>>>> -void swap_free(swp_entry_t entry)
>>>> -{
>>>> -    struct swap_info_struct *p;
>>>> -
>>>> -    p = _swap_info_get(entry);
>>>> -    if (p)
>>>> -            __swap_entry_free(p, entry);
>>>> -}
>>>> -
>>>>  static void cluster_swap_free_nr(struct swap_info_struct *sis,
>>>>              unsigned long offset, int nr_pages)
>>>>  {
>>>> @@ -1385,6 +1372,10 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
>>>>      unlock_cluster_or_swap_info(sis, ci);
>>>>  }
>>>>
>>>> +/*
>>>> + * Caller has made sure that the swap device corresponding to entry
>>>> + * is still around or has not been recycled.
>>>> + */
>>>>  void swap_free_nr(swp_entry_t entry, int nr_pages)
>>>>  {
>>>>      int nr;
>>>> @@ -1930,7 +1921,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>>>>              new_pte = pte_mkuffd_wp(new_pte);
>>>>  setpte:
>>>>      set_pte_at(vma->vm_mm, addr, pte, new_pte);
>>>> -    swap_free(entry);
>>>> +    swap_free_nr(entry, 1);
>>>>  out:
>>>>      if (pte)
>>>>              pte_unmap_unlock(pte, ptl);
>>
>> --
>> Best Regards,
>> Huang, Ying
> 
> Thanks
> Barry



Thread overview: 49+ messages
2024-05-03  0:50 [PATCH v3 0/6] large folios swap-in: handle refault cases first Barry Song
2024-05-03  0:50 ` [PATCH v3 1/6] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
2024-05-03  9:26   ` Ryan Roberts
2024-05-03 20:25   ` Chris Li
2024-05-08  7:35   ` Huang, Ying
2024-05-03  0:50 ` [PATCH v3 2/6] mm: remove swap_free() and always use swap_free_nr() Barry Song
2024-05-03  9:31   ` Ryan Roberts
2024-05-03 20:37     ` Chris Li
2024-05-04  4:03       ` Christoph Hellwig
2024-05-04  4:27         ` Barry Song
2024-05-04  4:28           ` Christoph Hellwig
2024-05-04  4:47             ` Barry Song
2024-05-08  7:56     ` Huang, Ying
2024-05-08  8:30       ` Barry Song
2024-05-08  9:10         ` Ryan Roberts
2024-05-03  0:50 ` [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally Barry Song
2024-05-03  9:41   ` Ryan Roberts
2024-05-03 23:40     ` Barry Song
2024-05-06  8:06       ` David Hildenbrand
2024-05-06  8:20         ` Barry Song
2024-05-06  8:31           ` David Hildenbrand
2024-05-07  8:14             ` Ryan Roberts
2024-05-07  8:24               ` Barry Song
2024-05-07  9:39                 ` Ryan Roberts
2024-05-03 20:51   ` Chris Li
2024-05-03 23:07     ` Barry Song
2024-05-08  8:08   ` Huang, Ying
2024-05-03  0:50 ` [PATCH v3 4/6] mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages Barry Song
2024-05-03 10:02   ` Ryan Roberts
2024-05-06 16:51   ` Khalid Aziz
2024-05-03  0:50 ` [PATCH v3 5/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
2024-05-06  8:14   ` David Hildenbrand
2024-05-03  0:50 ` [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache Barry Song
2024-05-03 10:50   ` Ryan Roberts
2024-05-03 23:23     ` Barry Song
2024-05-06 12:07       ` David Hildenbrand
2024-05-06 12:38         ` Barry Song
2024-05-06 12:58           ` Barry Song
2024-05-06 13:16             ` David Hildenbrand
2024-05-06 22:58               ` Barry Song
2024-05-07  8:24                 ` David Hildenbrand
2024-05-07  8:43                   ` Barry Song
2024-05-07  8:59                     ` David Hildenbrand
2024-05-07  9:24                       ` Barry Song
2024-05-07 10:39                         ` David Hildenbrand
2024-05-07 10:48                           ` Barry Song
2024-05-07  8:17       ` Ryan Roberts
2024-05-06 12:05   ` David Hildenbrand
2024-05-06 12:27     ` Barry Song
