* [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
@ 2022-08-08 14:56 Aaron Lu
  2022-08-08 14:56 ` [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present Aaron Lu
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Aaron Lu @ 2022-08-08 14:56 UTC (permalink / raw)
  To: Dave Hansen, Rick Edgecombe; +Cc: Song Liu, linux-kernel, linux-mm

This is an early RFC. While all reviews are welcome, reviewing this code
now will be a waste of time for the x86 subsystem maintainers. I would,
however, appreciate a preliminary review from the folks on the to and cc
list. I'm posting it to the list in case anyone else is interested in
seeing this early version.

Dave Hansen: I need your ack before this goes to the maintainers.

Here it goes:

On x86_64, Linux has a direct mapping of almost all physical memory. For
performance reasons, this mapping is usually set up with large pages, 2M
or 1G depending on the hardware's capability, with read, write and
non-execute protection.

There are cases where some pages have to change their protection to RO
and eXecutable, like pages that host module code or bpf progs. When these
pages' protection is changed, the corresponding large mapping that
covers these pages has to be split into 4K first and then each
individual 4k page's protection is changed accordingly, i.e. unaffected
pages keep their original RW and NX protection while affected pages'
protection is changed to RO and X.
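
As an illustration (this is not code from the patchset, just the typical
pattern using the set_memory_*() APIs from <linux/set_memory.h>; patch 4's
mode 0 test does essentially the same thing):

	void *addr = vmalloc(PAGE_SIZE);	/* backed by direct-mapped memory */

	/* make the page RO+X: the covering large direct mapping is split to 4K */
	set_memory_ro((unsigned long)addr, 1);
	set_memory_x((unsigned long)addr, 1);

	/* ... the page is used as executable memory ... */

	/* restore RW+NX and free; without a merge, the split stays */
	set_memory_nx((unsigned long)addr, 1);
	set_memory_rw((unsigned long)addr, 1);
	vfree(addr);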

There is a problem due to this split: the large mapping remains
split even after the affected pages' protection is changed back to
RW and NX, like when the module is unloaded or the bpf progs are freed.
After the system runs for a long time, more and more large mappings
get split, causing more and more dTLB misses and hurting overall
system performance[1].

For this reason, people have tried techniques to reduce the harm of
large mappings being split, like bpf_prog_pack[2], which packs
multiple bpf progs into a single page instead of allocating a page and
changing its protection for each bpf prog. This approach makes large
mapping splits happen much less often.

This patchset addresses the problem in another way: it merges a
split mapping back into a large mapping when the protections of all
entries of the split small-mapping page table become the same again,
e.g. when the page whose protection was changed to RO+X has its
protection changed back to RW+NX due to reasons like module unload,
bpf prog free, etc. and all other entries' protection is also RW+NX.
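
To make this concrete, here is a simplified sketch of the bookkeeping (the
real fields live in struct page, and the real check also looks at _PAGE_PSE
for PMD level tables; see patch 2 for the actual code):

	struct split_pgt_info {
		pgprot_t same_prot;	/* protection shared by the counted entries */
		int	 nr_same_prot;	/* how many of the entries match it */
	};

	static bool can_merge(struct split_pgt_info *info)
	{
		/* all entries uniform and present again -> fold back to a large page */
		return info->nr_same_prot == PTRS_PER_PTE &&
		       (pgprot_val(info->same_prot) & _PAGE_PRESENT);
	}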

One final note: with features like bpf_prog_pack etc., there should
already be many fewer large mapping splits IIUC; also, this patchset
cannot help when the page whose protection was changed stays in use.
So my take on this large mapping split problem is: to get the most
value out of keeping large mappings intact, features like bpf_prog_pack
are important. This patchset can help further reduce large mapping
splits when an in-use page that had special protection set is finally
released.

[1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
[2]: https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/

Aaron Lu (4):
  x86/mm/cpa: restore global bit when page is present
  x86/mm/cpa: merge splitted direct mapping when possible
  x86/mm/cpa: add merge event counter
  x86/mm/cpa: add a test interface to split direct map

 arch/x86/mm/pat/set_memory.c  | 411 +++++++++++++++++++++++++++++++++-
 include/linux/mm_types.h      |   6 +
 include/linux/page-flags.h    |   6 +
 include/linux/vm_event_item.h |   2 +
 mm/vmstat.c                   |   2 +
 5 files changed, 420 insertions(+), 7 deletions(-)

-- 
2.37.1



* [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present
  2022-08-08 14:56 [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Aaron Lu
@ 2022-08-08 14:56 ` Aaron Lu
  2022-08-11  5:21   ` Hyeonggon Yoo
  2022-08-08 14:56 ` [RFC PATCH 2/4] x86/mm/cpa: merge splitted direct mapping when possible Aaron Lu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Aaron Lu @ 2022-08-08 14:56 UTC (permalink / raw)
  To: Dave Hansen, Rick Edgecombe; +Cc: Song Liu, linux-kernel, linux-mm

For configs that don't have PTI enabled or CPUs that don't need
meltdown mitigation, the current kernel can lose the GLOBAL bit after
a page goes through a cycle of present -> not present -> present.

It happens like this (__vunmap() does this in vm_remove_mappings()):
original page protection: 0x8000000000000163 (NX/G/D/A/RW/P)
set_memory_np(page, 1):   0x8000000000000062 (NX/D/A/RW) loses G and P
set_memory_p(page, 1):    0x8000000000000063 (NX/D/A/RW/P) restores P

In the end, this page's protection no longer has the Global bit set and
this would create a problem for this merge-small-mappings feature.

For this reason, restore the Global bit for systems that do not have
PTI enabled if the page is present.

(pgprot_clear_protnone_bits() deserves a better name if this patch is
acceptable, but first I would like to get some feedback on whether this
is the right way to solve the problem, so I didn't bother with the name
yet.)

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 arch/x86/mm/pat/set_memory.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..33657a54670a 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -758,6 +758,8 @@ static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
 	 */
 	if (!(pgprot_val(prot) & _PAGE_PRESENT))
 		pgprot_val(prot) &= ~_PAGE_GLOBAL;
+	else
+		pgprot_val(prot) |= _PAGE_GLOBAL & __default_kernel_pte_mask;
 
 	return prot;
 }
-- 
2.37.1



* [RFC PATCH 2/4] x86/mm/cpa: merge splitted direct mapping when possible
  2022-08-08 14:56 [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Aaron Lu
  2022-08-08 14:56 ` [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present Aaron Lu
@ 2022-08-08 14:56 ` Aaron Lu
  2022-08-08 14:56 ` [RFC PATCH 3/4] x86/mm/cpa: add merge event counter Aaron Lu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Aaron Lu @ 2022-08-08 14:56 UTC (permalink / raw)
  To: Dave Hansen, Rick Edgecombe; +Cc: Song Liu, linux-kernel, linux-mm

On x86_64, Linux has a direct mapping of almost all physical memory. For
performance reasons, this mapping is usually set up with large pages, 2M
or 1G depending on the hardware's capability, with read, write and
non-execute protection.

There are cases where some pages have to change their protection to RO
and eXecutable, like pages that host module code or bpf progs. When these
pages' protection is changed, the corresponding large mapping that
covers these pages has to be split into 4K first and then each individual
4k page's protection is changed accordingly, i.e. unaffected pages keep
their original RW and NX protection while affected pages' protection is
changed to RO and X.

There is a problem due to this split: the large mapping remains
split even after the affected pages' protection is changed back to
RW and NX, like when the module is unloaded or the bpf progs are freed.
After the system runs for a long time, more and more large mappings
get split, causing more and more dTLB misses and hurting overall
system performance.

This patch tries to restore a split large mapping by tracking how
many entries of the split small-mapping page table have the same
protection bits; once that number reaches PTRS_PER_PTE, the small
mapping page table can be released, with its upper level page table
entry pointing directly to a large page.

Testing: see patch 4 for details.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 arch/x86/mm/pat/set_memory.c | 184 +++++++++++++++++++++++++++++++++--
 include/linux/mm_types.h     |   6 ++
 include/linux/page-flags.h   |   6 ++
 3 files changed, 189 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 33657a54670a..fea2c70ff37f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -718,13 +718,89 @@ phys_addr_t slow_virt_to_phys(void *__virt_addr)
 }
 EXPORT_SYMBOL_GPL(slow_virt_to_phys);
 
+static void merge_splitted_mapping(struct page *pgt, int level);
+static void set_pte_adjust_nr_same_prot(pte_t *kpte, int level, pte_t pte)
+{
+	struct page *pgt = virt_to_page(kpte);
+	pgprot_t old_prot, new_prot;
+	int i;
+
+	/* The purpose of tracking entries with same_prot is to hopefully
+	 * merge split small mappings back into large ones. Since only 2M
+	 * and 1G mappings are supported, there is no need to track page
+	 * tables of level > 2M.
+	 */
+	if (!PageSplitpgt(pgt) || level > PG_LEVEL_2M) {
+		set_pte(kpte, pte);
+		return;
+	}
+
+	/* get old protection before kpte is updated */
+	if (level == PG_LEVEL_4K) {
+		old_prot = pte_pgprot(*kpte);
+		new_prot = pte_pgprot(pte);
+	} else {
+		old_prot = pmd_pgprot(*(pmd_t *)kpte);
+		new_prot = pmd_pgprot(*(pmd_t *)&pte);
+	}
+
+	set_pte(kpte, pte);
+
+	if (pgprot_val(pgt->same_prot) != pgprot_val(old_prot) &&
+	    pgprot_val(pgt->same_prot) == pgprot_val(new_prot))
+		pgt->nr_same_prot++;
+
+	if (pgprot_val(pgt->same_prot) == pgprot_val(old_prot) &&
+	    pgprot_val(pgt->same_prot) != pgprot_val(new_prot))
+		pgt->nr_same_prot--;
+
+	if (unlikely(pgt->nr_same_prot == 0)) {
+		pte_t *entry = page_address(pgt);
+
+		/*
+		 * Now all entries' prot have changed, check again
+		 * to see if all entries have the same new prot.
+		 * Use the 1st entry's prot as the new pgt->same_prot.
+		 */
+		if (level == PG_LEVEL_4K)
+			pgt->same_prot = pte_pgprot(*entry);
+		else
+			pgt->same_prot = pmd_pgprot(*(pmd_t *)entry);
+
+		for (i = 0; i < PTRS_PER_PTE; i++, entry++) {
+			pgprot_t prot;
+
+			if (level == PG_LEVEL_4K)
+				prot = pte_pgprot(*entry);
+			else
+				prot = pmd_pgprot(*(pmd_t *)entry);
+
+			if (pgprot_val(prot) == pgprot_val(pgt->same_prot))
+				pgt->nr_same_prot++;
+		}
+	}
+
+	/*
+	 * If this split page table's entries all have the same
+	 * protection now, try to merge it. Note that for a PMD level
+	 * page table, if all entries are pointing to PTE page tables,
+	 * no merge can be done.
+	 */
+	if (unlikely(pgt->nr_same_prot == PTRS_PER_PTE &&
+		     (pgprot_val(pgt->same_prot) & _PAGE_PRESENT) &&
+		     (level == PG_LEVEL_4K ||
+		      pgprot_val(pgt->same_prot) & _PAGE_PSE)))
+		merge_splitted_mapping(pgt, level);
+
+}
+
 /*
  * Set the new pmd in all the pgds we know about:
  */
-static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+static void __set_pmd_pte(pte_t *kpte, int level, unsigned long address, pte_t pte)
 {
 	/* change init_mm */
-	set_pte_atomic(kpte, pte);
+	set_pte_adjust_nr_same_prot(kpte, level, pte);
 #ifdef CONFIG_X86_32
 	if (!SHARED_KERNEL_PMD) {
 		struct page *page;
@@ -739,12 +815,68 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 			p4d = p4d_offset(pgd, address);
 			pud = pud_offset(p4d, address);
 			pmd = pmd_offset(pud, address);
-			set_pte_atomic((pte_t *)pmd, pte);
+			set_pte_adjust_nr_same_prot((pte_t *)pmd, level, pte);
 		}
 	}
 #endif
 }
 
+static void merge_splitted_mapping(struct page *pgt, int level)
+{
+	pte_t *kpte = page_address(pgt);
+	pgprot_t pte_prot, pmd_prot;
+	unsigned long address;
+	unsigned long pfn;
+	pte_t pte;
+	pud_t pud;
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		pte_prot = pte_pgprot(*kpte);
+		pmd_prot = pgprot_4k_2_large(pte_prot);
+		pgprot_val(pmd_prot) |= _PAGE_PSE;
+		pfn = pte_pfn(*kpte);
+		pte = pfn_pte(pfn, pmd_prot);
+
+		/*
+		 * update upper level kpte.
+		 * Note that further merge can happen if all PMD table's
+		 * entries have the same protection bits after this change.
+		 */
+		address = (unsigned long)page_address(pfn_to_page(pfn));
+		__set_pmd_pte(pgt->upper_kpte, level + 1, address, pte);
+		break;
+	case PG_LEVEL_2M:
+		pfn = pmd_pfn(*(pmd_t *)kpte);
+		pmd_prot = pmd_pgprot(*(pmd_t *)kpte);
+		pud = pfn_pud(pfn, pmd_prot);
+		set_pud(pgt->upper_kpte, pud);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	/*
+	 * The current kernel does flush_tlb_all() when splitting a large
+	 * page inside pgd_lock because of:
+	 * - an erratum of Atom AAH41; as well as
+	 * - avoiding another cpu simultaneously changing the just split
+	 *   large page's attr.
+	 * The first does not require a full tlb flush according to
+	 * commit 211b3d03c7400("x86: work around Fedora-11 x86-32 kernel
+	 * failures on Intel Atom CPUs") while the second is already
+	 * ensured by cpa_lock. commit c0a759abf5a68("x86/mm/cpa: Move
+	 * flush_tlb_all()") simplified the code by doing a full tlb flush
+	 * inside pgd_lock. For the same reason, also do a full tlb
+	 * flush inside pgd_lock after doing a merge.
+	 */
+	flush_tlb_all();
+
+	__ClearPageSplitpgt(pgt);
+	__free_page(pgt);
+}
+
 static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
 {
 	/*
@@ -901,9 +1033,10 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 
 	/* All checks passed. Update the large page mapping. */
 	new_pte = pfn_pte(old_pfn, new_prot);
-	__set_pmd_pte(kpte, address, new_pte);
+	__set_pmd_pte(kpte, level, address, new_pte);
 	cpa->flags |= CPA_FLUSHTLB;
 	cpa_inc_lp_preserved(level);
+
 	return 0;
 }
 
@@ -1023,6 +1156,11 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	for (i = 0; i < PTRS_PER_PTE; i++, pfn += pfninc, lpaddr += lpinc)
 		split_set_pte(cpa, pbase + i, pfn, ref_prot, lpaddr, lpinc);
 
+	__SetPageSplitpgt(base);
+	base->upper_kpte = kpte;
+	base->same_prot = ref_prot;
+	base->nr_same_prot = PTRS_PER_PTE;
+
 	if (virt_addr_valid(address)) {
 		unsigned long pfn = PFN_DOWN(__pa(address));
 
@@ -1037,7 +1175,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * pagetable protections, the actual ptes set above control the
 	 * primary protection behavior:
 	 */
-	__set_pmd_pte(kpte, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));
+	__set_pmd_pte(kpte, level, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));
 
 	/*
 	 * Do a global flush tlb after splitting the large page
@@ -1508,6 +1646,23 @@ static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr,
 	}
 }
 
+/*
+ * When debug_pagealloc_enabled():
+ * - the direct map will not use large page mappings;
+ * - the kernel highmap can still use large mappings.
+ * When !debug_pagealloc_enabled(): both the direct map and the kernel
+ * highmap can use large page mappings.
+ *
+ * When a large page mapping is used, it can be split due to reasons
+ * like protection changes and thus, it is also possible a merge can
+ * happen later for that split small mapping page table page.
+ */
+static bool subject_to_merge(unsigned long addr)
+{
+	return !debug_pagealloc_enabled() ||
+		within(addr, (unsigned long)_text, _brk_end);
+}
+
 static int __change_page_attr(struct cpa_data *cpa, int primary)
 {
 	unsigned long address;
@@ -1526,10 +1681,23 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		return __cpa_process_fault(cpa, address, primary);
 
 	if (level == PG_LEVEL_4K) {
-		pte_t new_pte;
+		pte_t new_pte, *tmp;
 		pgprot_t new_prot = pte_pgprot(old_pte);
 		unsigned long pfn = pte_pfn(old_pte);
 
+		if (subject_to_merge(address)) {
+			spin_lock(&pgd_lock);
+			/*
+			 * Check for races, another CPU might have merged
+			 * this page up already.
+			 */
+			tmp = _lookup_address_cpa(cpa, address, &level);
+			if (tmp != kpte) {
+				spin_unlock(&pgd_lock);
+				goto repeat;
+			}
+		}
+
 		pgprot_val(new_prot) &= ~pgprot_val(cpa->mask_clr);
 		pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
 
@@ -1551,10 +1719,12 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		 * Do we really change anything ?
 		 */
 		if (pte_val(old_pte) != pte_val(new_pte)) {
-			set_pte_atomic(kpte, new_pte);
+			set_pte_adjust_nr_same_prot(kpte, level, new_pte);
 			cpa->flags |= CPA_FLUSHTLB;
 		}
 		cpa->numpages = 1;
+		if (subject_to_merge(address))
+			spin_unlock(&pgd_lock);
 		return 0;
 	}
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c29ab4c0cd5c..6124c575fdad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -160,6 +160,12 @@ struct page {
 			spinlock_t ptl;
 #endif
 		};
+		struct {        /* split page table pages */
+			void *upper_kpte;		/* compound_head */
+			int nr_same_prot;
+			unsigned long _split_pt_pad;	/* mapping */
+			pgprot_t same_prot;
+		};
 		struct {	/* ZONE_DEVICE pages */
 			/** @pgmap: Points to the hosting device page map. */
 			struct dev_pagemap *pgmap;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..3fe395dd7dfc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -942,6 +942,7 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_splitpgt	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -1012,6 +1013,11 @@ PAGE_TYPE_OPS(Table, table)
  */
 PAGE_TYPE_OPS(Guard, guard)
 
+/*
+ * Marks pages in use as split page tables
+ */
+PAGE_TYPE_OPS(Splitpgt, splitpgt)
+
 extern bool is_free_buddy_page(struct page *page);
 
 PAGEFLAG(Isolated, isolated, PF_ANY);
-- 
2.37.1



* [RFC PATCH 3/4] x86/mm/cpa: add merge event counter
  2022-08-08 14:56 [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Aaron Lu
  2022-08-08 14:56 ` [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present Aaron Lu
  2022-08-08 14:56 ` [RFC PATCH 2/4] x86/mm/cpa: merge splitted direct mapping when possible Aaron Lu
@ 2022-08-08 14:56 ` Aaron Lu
  2022-08-08 14:56 ` [TEST NOT_FOR_MERGE 4/4] x86/mm/cpa: add a test interface to split direct map Aaron Lu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Aaron Lu @ 2022-08-08 14:56 UTC (permalink / raw)
  To: Dave Hansen, Rick Edgecombe; +Cc: Song Liu, linux-kernel, linux-mm

Like the split event counters, this patch adds counters for merge events.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 arch/x86/mm/pat/set_memory.c  | 19 +++++++++++++++++++
 include/linux/vm_event_item.h |  2 ++
 mm/vmstat.c                   |  2 ++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index fea2c70ff37f..1be9aab42c79 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -105,6 +105,23 @@ static void split_page_count(int level)
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
+static void merge_page_count(int level)
+{
+	if (direct_pages_count[level] < PTRS_PER_PTE) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	direct_pages_count[level] -= PTRS_PER_PTE;
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_4K)
+			count_vm_event(DIRECT_MAP_LEVEL1_MERGE);
+		else if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_MERGE);
+	}
+	direct_pages_count[level + 1]++;
+}
+
 void arch_report_meminfo(struct seq_file *m)
 {
 	seq_printf(m, "DirectMap4k:    %8lu kB\n",
@@ -875,6 +892,8 @@ static void merge_splitted_mapping(struct page *pgt, int level)
 
 	__ClearPageSplitpgt(pgt);
 	__free_page(pgt);
+
+	merge_page_count(level);
 }
 
 static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 404024486fa5..00a9a435af49 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -143,6 +143,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
 		DIRECT_MAP_LEVEL3_SPLIT,
+		DIRECT_MAP_LEVEL1_MERGE,
+		DIRECT_MAP_LEVEL2_MERGE,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..1a4287a4d614 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1403,6 +1403,8 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
 	"direct_map_level3_splits",
+	"direct_map_level1_merges",
+	"direct_map_level2_merges",
 #endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
-- 
2.37.1



* [TEST NOT_FOR_MERGE 4/4] x86/mm/cpa: add a test interface to split direct map
  2022-08-08 14:56 [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Aaron Lu
                   ` (2 preceding siblings ...)
  2022-08-08 14:56 ` [RFC PATCH 3/4] x86/mm/cpa: add merge event counter Aaron Lu
@ 2022-08-08 14:56 ` Aaron Lu
  2022-08-09 10:04 ` [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Kirill A. Shutemov
  2022-08-11  4:50 ` Hyeonggon Yoo
  5 siblings, 0 replies; 16+ messages in thread
From: Aaron Lu @ 2022-08-08 14:56 UTC (permalink / raw)
  To: Dave Hansen, Rick Edgecombe; +Cc: Song Liu, linux-kernel, linux-mm

To test this functionality, a debugfs interface is added:
/sys/kernel/debug/x86/split_mapping

There are three test modes.
mode 0: allocate $page_nr pages and set each page's protection first
to RO and X and then back to RW and NX. This is used to test multiple
CPUs dealing with different address ranges.
mode 1: allocate several pages and create $nr_cpu kthreads to
simultaneously change those pages' protection with a fixed pattern.
This is used to test multiple CPUs dealing with the same address range.
mode 2: same as mode 0 except it uses alloc_pages() instead of vmalloc()
because the vmalloc space is too small on x86_32/pae.

On an x86_64 VM, I started mode0.sh and mode1.sh at the same time:

mode0.sh:
mode=0
page_nr=200000
nr_cpu=16

function test_one()
{
	echo $mode $page_nr > /sys/kernel/debug/x86/split_mapping
}

while true; do
	for i in `seq $nr_cpu`; do
		test_one &
	done
	wait
done

mode1.sh:
mode=1
page_nr=1
echo $mode $page_nr > /sys/kernel/debug/x86/split_mapping

After 5 hours, no problem occurred, with some millions of splits and merges.

For x86_32 and x86_pae, the mode 2 test was used and also no problem was found.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 arch/x86/mm/pat/set_memory.c | 206 +++++++++++++++++++++++++++++++++++
 1 file changed, 206 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1be9aab42c79..4deea4de73e7 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -20,6 +20,9 @@
 #include <linux/kernel.h>
 #include <linux/cc_platform.h>
 #include <linux/set_memory.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/random.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -2556,6 +2559,209 @@ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
 	return retval;
 }
 
+static int split_mapping_mode0_test(int page_nr)
+{
+	void **addr_buff;
+	void *addr;
+	int i, j;
+
+	addr_buff = kvmalloc(sizeof(void *) * page_nr, GFP_KERNEL);
+	if (!addr_buff) {
+		pr_err("addr_buff: no memory\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < page_nr; i++) {
+		addr = vmalloc(PAGE_SIZE);
+		if (!addr) {
+			pr_err("no memory\n");
+			break;
+		}
+
+		set_memory_ro((unsigned long)addr, 1);
+		set_memory_x((unsigned long)addr, 1);
+
+		addr_buff[i] = addr;
+	}
+
+	for (j = 0; j < i; j++) {
+		set_memory_nx((unsigned long)addr_buff[j], 1);
+		set_memory_rw((unsigned long)addr_buff[j], 1);
+		vfree(addr_buff[j]);
+	}
+
+	kvfree(addr_buff);
+
+	return 0;
+}
+
+struct split_mapping_mode1_data {
+	unsigned long addr;
+	int page_nr;
+};
+
+static int split_mapping_set_prot(void *data)
+{
+	struct split_mapping_mode1_data *d = data;
+	unsigned long addr = d->addr;
+	int page_nr = d->page_nr;
+	int m;
+
+	m = get_random_int() % 100;
+	msleep(m);
+
+	while (!kthread_should_stop()) {
+		set_memory_ro(addr, page_nr);
+		set_memory_x(addr, page_nr);
+		set_memory_rw(addr, page_nr);
+		set_memory_nx(addr, page_nr);
+		cond_resched();
+	}
+
+	return 0;
+}
+
+static int split_mapping_mode1_test(int page_nr)
+{
+	int nr_kthreads = num_online_cpus();
+	struct split_mapping_mode1_data d;
+	struct task_struct **kthreads;
+	int i, j, ret;
+	void *addr;
+
+	addr = vmalloc(PAGE_SIZE * page_nr);
+	if (!addr)
+		return -ENOMEM;
+
+	kthreads = kmalloc(nr_kthreads * sizeof(struct task_struct *), GFP_KERNEL);
+	if (!kthreads) {
+		vfree(addr);
+		return -ENOMEM;
+	}
+
+	d.addr = (unsigned long)addr;
+	d.page_nr = page_nr;
+	for (i = 0; i < nr_kthreads; i++) {
+		kthreads[i] = kthread_run(split_mapping_set_prot, &d, "split_mappingd%d", i);
+		if (IS_ERR(kthreads[i])) {
+			for (j = 0; j < i; j++)
+				kthread_stop(kthreads[j]);
+			ret = PTR_ERR(kthreads[i]);
+			goto out;
+		}
+	}
+
+	while (1) {
+		if (signal_pending(current)) {
+			for (i = 0; i < nr_kthreads; i++)
+				kthread_stop(kthreads[i]);
+			ret = 0;
+			break;
+		}
+		msleep(1000);
+	}
+
+out:
+	kfree(kthreads);
+	vfree(addr);
+	return ret;
+}
+
+static int split_mapping_mode2_test(int page_nr)
+{
+	struct page *p, *t;
+	unsigned long addr;
+	int i;
+
+	LIST_HEAD(head);
+
+	for (i = 0; i < page_nr; i++) {
+		p = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);
+		if (!p) {
+			pr_err("no memory\n");
+			break;
+		}
+
+		addr = (unsigned long)page_address(p);
+		BUG_ON(!addr);
+
+		set_memory_ro(addr, 1);
+		set_memory_x(addr, 1);
+
+		list_add(&p->lru, &head);
+	}
+
+	list_for_each_entry_safe(p, t, &head, lru) {
+		addr = (unsigned long)page_address(p);
+		set_memory_nx(addr, 1);
+		set_memory_rw(addr, 1);
+
+		list_del(&p->lru);
+		__free_page(p);
+	}
+
+	return 0;
+}
+static ssize_t split_mapping_write_file(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	unsigned int mode = 0, page_nr = 0;
+	char buffer[64];
+	int ret;
+
+	if (count > 64)
+		return -EINVAL;
+
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	sscanf(buffer, "%u %u", &mode, &page_nr);
+
+	/*
+	 * There are 3 test modes.
+	 * mode 0: each thread allocates $page_nr pages and sets each page's
+	 *         protection first to RO and X and then back to RW and NX.
+	 *         This is used to test multiple CPUs dealing with different
+	 *         pages.
+	 * mode 1: allocate several pages and create $nr_cpu kthreads to
+	 *         simultaneously change those pages' protection with a fixed
+	 *         pattern. This is used to test multiple CPUs dealing with
+	 *         the same pages' protection.
+	 * mode 2: like mode 0 but directly use alloc_pages() because the
+	 *         vmalloc area on x86_32 is too small, only 128M.
+	 */
+	if (mode > 2)
+		return -EINVAL;
+
+	if (page_nr == 0)
+		return -EINVAL;
+
+	if (mode == 0)
+		ret = split_mapping_mode0_test(page_nr);
+	else if (mode == 1)
+		ret = split_mapping_mode1_test(page_nr);
+	else
+		ret = split_mapping_mode2_test(page_nr);
+
+	return ret ? ret : count;
+}
+
+static const struct file_operations split_mapping_fops = {
+	.write          = split_mapping_write_file,
+};
+
+static int __init split_mapping_init(void)
+{
+	struct dentry *d = debugfs_create_file("split_mapping", S_IWUSR, arch_debugfs_dir, NULL,
+			&split_mapping_fops);
+	if (IS_ERR(d)) {
+		pr_err("create split_mapping failed: %ld\n", PTR_ERR(d));
+		return PTR_ERR(d);
+	}
+
+	return 0;
+}
+late_initcall(split_mapping_init);
+
 /*
  * The testcases use internal knowledge of the implementation that shouldn't
  * be exposed to the rest of the kernel. Include these directly here.
-- 
2.37.1



* Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
  2022-08-08 14:56 [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Aaron Lu
                   ` (3 preceding siblings ...)
  2022-08-08 14:56 ` [TEST NOT_FOR_MERGE 4/4] x86/mm/cpa: add a test interface to split direct map Aaron Lu
@ 2022-08-09 10:04 ` Kirill A. Shutemov
  2022-08-09 14:58   ` Aaron Lu
  2022-08-11  4:50 ` Hyeonggon Yoo
  5 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2022-08-09 10:04 UTC (permalink / raw)
  To: Aaron Lu; +Cc: Dave Hansen, Rick Edgecombe, Song Liu, linux-kernel, linux-mm

On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> This is an early RFC. While all reviews are welcome, reviewing this code
> now will be a waste of time for the x86 subsystem maintainers. I would,
> however, appreciate a preliminary review from the folks on the to and cc
> list. I'm posting it to the list in case anyone else is interested in
> seeing this early version.

Last time[1] I tried to merge pages back in the direct mapping it led to
a substantial performance regression for some workloads. I cannot find the
report right now, but I remember it was something graphics related.

Have you done any performance evaluation?

My takeaway was that the merge has to be batched. Like logging where changes
to the direct mapping happen and coming back to them later to merge once the
number of changes crosses some limit.

Also I don't see you handling set_memory_4k(). Huh?

[1] https://lore.kernel.org/lkml/20200416213229.19174-1-kirill.shutemov@linux.intel.com/

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
  2022-08-09 10:04 ` [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Kirill A. Shutemov
@ 2022-08-09 14:58   ` Aaron Lu
  2022-08-09 17:56     ` Kirill A. Shutemov
  0 siblings, 1 reply; 16+ messages in thread
From: Aaron Lu @ 2022-08-09 14:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Rick Edgecombe, Song Liu, linux-kernel, linux-mm

Hi Kirill,

Thanks a lot for the feedback.

On 8/9/2022 6:04 PM, Kirill A. Shutemov wrote:
> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
>> This is an early RFC. While all reviews are welcome, reviewing this code
>> now will be a waste of time for the x86 subsystem maintainers. I would,
>> however, appreciate a preliminary review from the folks on the to and cc
>> list. I'm posting it to the list in case anyone else is interested in
>> seeing this early version.
> 
> Last time[1] I tried to merge pages back in direct mapping it lead to
> substantial performance regression for some workloads. I cannot find the
> report right now, but I remember it was something graphics related.
>

Do you happen to remember the workload name? I can try running it.

> Have you done any performance evaluation?
> 

Not yet, I was mostly concentrating on correctness. In addition to the
graphics workload, do you have anything else in mind that may be
sensitive to such a change?

I think maybe I can run patch 4's mode 0 test with and without this merge
functionality and see how performance changes, since mode 0 essentially
does various set_memory_X() calls on different CPUs simultaneously, which
can trigger a lot of splits and merges. Sounds good?

> My take away was that the merge has to be batched. Like log where changes
> to direct mapping happens and come back to then and merge when the number
> of changes cross some limit.
> 

Appreciate your suggestion.
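
If I understand the idea correctly, it would be something along the lines
of the rough sketch below (illustrative only, none of these names exist in
the current patchset):

static atomic_t nr_dmap_changes = ATOMIC_INIT(0);
#define MERGE_BATCH	256

static void maybe_merge_direct_map(void)
{
	if (atomic_inc_return(&nr_dmap_changes) % MERGE_BATCH)
		return;

	/*
	 * Only every MERGE_BATCH changes: walk the split page table pages
	 * recorded by cpa and merge the ones whose entries all share the
	 * same protection again, instead of checking on every cpa call.
	 */
}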

> Also I don't see you handling set_memory_4k(). Huh?
>

Ah right, I missed that. Currently set_memory_4k() is not specially
handled and such pages can be mistakenly merged. Will fix this in later
versions.

> [1] https://lore.kernel.org/lkml/20200416213229.19174-1-kirill.shutemov@linux.intel.com/
> 

Thanks!


* Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
  2022-08-09 14:58   ` Aaron Lu
@ 2022-08-09 17:56     ` Kirill A. Shutemov
  0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2022-08-09 17:56 UTC (permalink / raw)
  To: Aaron Lu; +Cc: Dave Hansen, Rick Edgecombe, Song Liu, linux-kernel, linux-mm

On Tue, Aug 09, 2022 at 10:58:18PM +0800, Aaron Lu wrote:
> Hi Kirill,
> 
> Thanks a lot for the feedback.
> 
> On 8/9/2022 6:04 PM, Kirill A. Shutemov wrote:
> > On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> >> This is an early RFC. While all reviews are welcome, reviewing this code
> >> now will be a waste of time for the x86 subsystem maintainers. I would,
> >> however, appreciate a preliminary review from the folks on the to and cc
> >> list. I'm posting it to the list in case anyone else is interested in
> >> seeing this early version.
> > 
> > Last time[1] I tried to merge pages back in direct mapping it lead to
> > substantial performance regression for some workloads. I cannot find the
> > report right now, but I remember it was something graphics related.
> >
> 
> Do you happen to remember the workload name? I can try running it.


No, sorry. As I said, I tried to find the report, but failed.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
  2022-08-08 14:56 [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Aaron Lu
                   ` (4 preceding siblings ...)
  2022-08-09 10:04 ` [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Kirill A. Shutemov
@ 2022-08-11  4:50 ` Hyeonggon Yoo
  2022-08-11  7:50   ` Lu, Aaron
  2022-08-13 16:05   ` Mike Rapoport
  5 siblings, 2 replies; 16+ messages in thread
From: Hyeonggon Yoo @ 2022-08-11  4:50 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Dave Hansen, Rick Edgecombe, Song Liu, linux-kernel, linux-mm,
	Mike Rapoport, Hyeonggon Yoo

On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> This is an early RFC. While all reviews are welcome, reviewing this code
> now will be a waste of time for the x86 subsystem maintainers. I would,
> however, appreciate a preliminary review from the folks on the to and cc
> list. I'm posting it to the list in case anyone else is interested in
> seeing this early version.
> 

Hello Aaron!

+Cc Mike Rapoport, who has been working on the same problem. [1]

There is also an LPC discussion (with a different approach to this problem)
[2], [4]

and a performance measurement when all pages are 4K/2M. [3]

[1] https://lore.kernel.org/linux-mm/20220127085608.306306-1-rppt@kernel.org/
[2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
[3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
[4] https://lwn.net/Articles/894557/

> Dave Hansen: I need your ack before this goes to the maintainers.
> 
> Here it goes:
> 
> On x86_64, Linux has direct mapping of almost all physical memory. For
> performance reasons, this mapping is usually set as large page like 2M
> or 1G per hardware's capability with read, write and non-execute
> protection.
> 
> There are cases where some pages have to change their protection to RO
> and eXecutable, like pages that host module code or bpf prog. When these
> pages' protection are changed, the corresponding large mapping that
> cover these pages will have to be splitted into 4K first and then
> individual 4k page's protection changed accordingly, i.e. unaffected
> pages keep their original protection as RW and NX while affected pages'
> protection changed to RO and X.
> 
> There is a problem due to this split: the large mapping will remain
> splitted even after the affected pages' protection are changed back to
> RW and NX, like when the module is unloaded or bpf progs are freed.
> After system runs a long time, there can be more and more large mapping
> being splitted, causing more and more dTLB misses and overall system
> performance getting hurt[1].
> 
> For this reason, people tried some techniques to reduce the harm of
> large mapping beling splitted, like bpf_prog_pack[2] which packs
> multiple bpf progs into a single page instead of allocating and changing
> one page's protection for each bpf prog. This approach made large
> mapping split happen much fewer.
> 
> This patchset addresses this problem in another way: it merges
> splitted mappings back to a large mapping when protections of all entries
> of the splitted small mapping page table become same again, e.g. when the
> page whose protection was changed to RO+X now has its protection changed
> back to RW+NX due to reasons like module unload, bpf prog free, etc. and
> all other entries' protection are also RW+NX.
>

I tried a very similar approach a few months ago (as a toy implementation) [5],
and the biggest obstacle to this approach was: you need to be extremely sure
that page->nr_same_prot is ALWAYS correct.

For example, arch/x86/include/asm/kfence.h [6] clears and sets
_PAGE_PRESENT without going through CPA, which can simply break the count.

[5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
[6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56

I think we may need to hook set_pte/set_pmd/etc and use proper
synchronization primitives when changing init_mm's page table to go
further on this approach.
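
Something like the following is what I have in mind (very rough sketch,
the helper name is made up):

static void set_kernel_pte_synced(pte_t *ptep, pte_t pte)
{
	struct page *pgt = virt_to_page(ptep);

	if (!PageSplitpgt(pgt)) {
		set_pte(ptep, pte);
		return;
	}

	spin_lock(&pgd_lock);
	/* adjust pgt->same_prot / pgt->nr_same_prot here, as patch 2 does */
	set_pte(ptep, pte);
	spin_unlock(&pgd_lock);
}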

> One final note is, with features like bpf_prog_pack etc., there can be
> much fewer large mapping split IIUC; also, this patchset can not help
> when the page which has its protection changed keeps in use. So my take
> on this large mapping split problem is: to get the most value of keeping
> large mapping intact, features like bpf_prog_pack is important. This
> patchset can help to further reduce large mapping split when in use page
> that has special protection set finally gets released.
> 
> [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
> [2]: https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/
> 
> Aaron Lu (4):
>   x86/mm/cpa: restore global bit when page is present
>   x86/mm/cpa: merge splitted direct mapping when possible
>   x86/mm/cpa: add merge event counter
>   x86/mm/cpa: add a test interface to split direct map
> 
>  arch/x86/mm/pat/set_memory.c  | 411 +++++++++++++++++++++++++++++++++-
>  include/linux/mm_types.h      |   6 +
>  include/linux/page-flags.h    |   6 +
>  include/linux/vm_event_item.h |   2 +
>  mm/vmstat.c                   |   2 +
>  5 files changed, 420 insertions(+), 7 deletions(-)
> 
> -- 
> 2.37.1
> 
> 


* Re: [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present
  2022-08-08 14:56 ` [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present Aaron Lu
@ 2022-08-11  5:21   ` Hyeonggon Yoo
  2022-08-11  8:16     ` Lu, Aaron
  0 siblings, 1 reply; 16+ messages in thread
From: Hyeonggon Yoo @ 2022-08-11  5:21 UTC (permalink / raw)
  To: Aaron Lu; +Cc: Dave Hansen, Rick Edgecombe, Song Liu, linux-kernel, linux-mm

On Mon, Aug 08, 2022 at 10:56:46PM +0800, Aaron Lu wrote:
> For configs that don't have PTI enabled or cpus that don't need
> meltdown mitigation, current kernel can lose GLOBAL bit after a page
> goes through a cycle of present -> not present -> present.
> 
> It happened like this(__vunmap() does this in vm_remove_mappings()):
> original page protection: 0x8000000000000163 (NX/G/D/A/RW/P)
> set_memory_np(page, 1):   0x8000000000000062 (NX/D/A/RW) lose G and P
> set_memory_p(pagem 1):    0x8000000000000063 (NX/D/A/RW/P) restored P
> 
> In the end, this page's protection no longer has Global bit set and this
> would create problem for this merge small mapping feature.
> 
> For this reason, restore Global bit for systems that do not have PTI
> enabled if page is present.
> 
> (pgprot_clear_protnone_bits() deserves a better name if this patch is
> acceptible but first, I would like to get some feedback if this is the
> right way to solve this so I didn't bother with the name yet)
> 
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  arch/x86/mm/pat/set_memory.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 1abd5438f126..33657a54670a 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -758,6 +758,8 @@ static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
>  	 */
>  	if (!(pgprot_val(prot) & _PAGE_PRESENT))
>  		pgprot_val(prot) &= ~_PAGE_GLOBAL;
> +	else
> +		pgprot_val(prot) |= _PAGE_GLOBAL & __default_kernel_pte_mask;
>  
>  	return prot;
>  }

IIUC it makes it impossible to set _PAGE_GLOBAL when PTI is on.

Maybe it would be less intrusive to make
set_direct_map_default_noflush() replace the protection bits
with PAGE_KERNEL, as it's only called for the direct map and the function
is meant to reset permissions to the default:

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..0dd4433c1382 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2250,7 +2250,16 @@ int set_direct_map_invalid_noflush(struct page *page)

 int set_direct_map_default_noflush(struct page *page)
 {
-       return __set_pages_p(page, 1);
+       unsigned long tempaddr = (unsigned long) page_address(page);
+       struct cpa_data cpa = {
+                       .vaddr = &tempaddr,
+                       .pgd = NULL,
+                       .numpages = 1,
+                       .mask_set = PAGE_KERNEL,
+                       .mask_clr = __pgprot(~0),
+                       .flags = 0};
+
+       return __change_page_attr_set_clr(&cpa, 0);
 }

set_direct_map_{invalid,default}_noflush() is the exact reason
why the direct map becomes split after vmalloc/vfree with special
permissions.

> -- 
> 2.37.1
> 
> 


* Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
  2022-08-11  4:50 ` Hyeonggon Yoo
@ 2022-08-11  7:50   ` Lu, Aaron
  2022-08-13 16:05   ` Mike Rapoport
  1 sibling, 0 replies; 16+ messages in thread
From: Lu, Aaron @ 2022-08-11  7:50 UTC (permalink / raw)
  To: 42.hyeyoo@gmail.com
  Cc: linux-mm@kvack.org, Hansen, Dave, linux-kernel@vger.kernel.org,
	Edgecombe, Rick P, song@kernel.org, rppt@kernel.org

On Thu, 2022-08-11 at 04:50 +0000, Hyeonggon Yoo wrote:
> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> > This is an early RFC. While all reviews are welcome, reviewing this code
> > now will be a waste of time for the x86 subsystem maintainers. I would,
> > however, appreciate a preliminary review from the folks on the to and cc
> > list. I'm posting it to the list in case anyone else is interested in
> > seeing this early version.
> > 
> 
> Hello Aaron!
> 

Hi Hyeonggon,

> +Cc Mike Rapoport, who has been same problem. [1]
> 
> There is also LPC discussion (with different approach on this problem)
> [2], [4]
> 
> and performance measurement when all pages are 4K/2M. [3]
> 
> [1] https://lore.kernel.org/linux-mm/20220127085608.306306-1-rppt@kernel.org/
> [2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
> [3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
> [4] https://lwn.net/Articles/894557/
> 

Thanks a lot for this info.

> > Dave Hansen: I need your ack before this goes to the maintainers.
> > 
> > Here it goes:
> > 
> > On x86_64, Linux has direct mapping of almost all physical memory. For
> > performance reasons, this mapping is usually set as large page like 2M
> > or 1G per hardware's capability with read, write and non-execute
> > protection.
> > 
> > There are cases where some pages have to change their protection to RO
> > and eXecutable, like pages that host module code or bpf prog. When these
> > pages' protection are changed, the corresponding large mapping that
> > cover these pages will have to be splitted into 4K first and then
> > individual 4k page's protection changed accordingly, i.e. unaffected
> > pages keep their original protection as RW and NX while affected pages'
> > protection changed to RO and X.
> > 
> > There is a problem due to this split: the large mapping will remain
> > splitted even after the affected pages' protection are changed back to
> > RW and NX, like when the module is unloaded or bpf progs are freed.
> > After system runs a long time, there can be more and more large mapping
> > being splitted, causing more and more dTLB misses and overall system
> > performance getting hurt[1].
> > 
> > For this reason, people tried some techniques to reduce the harm of
> > large mapping beling splitted, like bpf_prog_pack[2] which packs
> > multiple bpf progs into a single page instead of allocating and changing
> > one page's protection for each bpf prog. This approach made large
> > mapping split happen much fewer.
> > 
> > This patchset addresses this problem in another way: it merges
> > splitted mappings back to a large mapping when protections of all entries
> > of the splitted small mapping page table become same again, e.g. when the
> > page whose protection was changed to RO+X now has its protection changed
> > back to RW+NX due to reasons like module unload, bpf prog free, etc. and
> > all other entries' protection are also RW+NX.
> > 
> 
> I tried very similar approach few months ago (for toy implementation) [5],

Cool, glad we have tried a similar approach :-)

> and the biggest obstacle to this approach was: you need to be extremely sure
> that the page->nr_same_prot is ALWAYS correct.
> 

Yes indeed.

> For example, in arch/x86/include/asm/kfence.h [6], it clears and set
> _PAGE_PRESENT without going through CPA, which can simply break the count.
> 
> [5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
> [6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56
> 

For this specific case, it probably doesn't matter because kfence
intentionally uses set_memory_4k() for these pages and no merge shall
ever be done for them, according to commit 1dc0da6e9ec0("x86, kfence:
enable KFENCE for x86").
(Kirill pointed out that my current version has a problem dealing with
set_memory_4k(), but that is fixable.)

> I think we may need to hook set_pte/set_pmd/etc and use proper
> synchronization primitives when changing init_mm's page table to go
> further on this approach.

Thanks for the suggestion. I'll check how many callsites there are that
manipulate init_mm's page table outside of cpa() and then decide whether
it is possible to do the hook and sync for set_pte()/etc.

> 
> > One final note is, with features like bpf_prog_pack etc., there can be
> > much fewer large mapping split IIUC; also, this patchset can not help
> > when the page which has its protection changed keeps in use. So my take
> > on this large mapping split problem is: to get the most value of keeping
> > large mapping intact, features like bpf_prog_pack is important. This
> > patchset can help to further reduce large mapping split when in use page
> > that has special protection set finally gets released.
> > 
> > [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
> > [2]: https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/
> > 
> > Aaron Lu (4):
> >   x86/mm/cpa: restore global bit when page is present
> >   x86/mm/cpa: merge splitted direct mapping when possible
> >   x86/mm/cpa: add merge event counter
> >   x86/mm/cpa: add a test interface to split direct map
> > 
> >  arch/x86/mm/pat/set_memory.c  | 411 +++++++++++++++++++++++++++++++++-
> >  include/linux/mm_types.h      |   6 +
> >  include/linux/page-flags.h    |   6 +
> >  include/linux/vm_event_item.h |   2 +
> >  mm/vmstat.c                   |   2 +
> >  5 files changed, 420 insertions(+), 7 deletions(-)
> > 
> > -- 
> > 2.37.1
> > 
> > 



* Re: [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present
  2022-08-11  5:21   ` Hyeonggon Yoo
@ 2022-08-11  8:16     ` Lu, Aaron
  2022-08-11 11:30       ` Hyeonggon Yoo
  0 siblings, 1 reply; 16+ messages in thread
From: Lu, Aaron @ 2022-08-11  8:16 UTC (permalink / raw)
  To: 42.hyeyoo@gmail.com
  Cc: linux-mm@kvack.org, Hansen, Dave, linux-kernel@vger.kernel.org,
	Edgecombe, Rick P, song@kernel.org

On Thu, 2022-08-11 at 05:21 +0000, Hyeonggon Yoo wrote:
> On Mon, Aug 08, 2022 at 10:56:46PM +0800, Aaron Lu wrote:
> > For configs that don't have PTI enabled or cpus that don't need
> > meltdown mitigation, current kernel can lose GLOBAL bit after a page
> > goes through a cycle of present -> not present -> present.
> > 
> > It happened like this(__vunmap() does this in vm_remove_mappings()):
> > original page protection: 0x8000000000000163 (NX/G/D/A/RW/P)
> > set_memory_np(page, 1):   0x8000000000000062 (NX/D/A/RW) lose G and P
> > set_memory_p(pagem 1):    0x8000000000000063 (NX/D/A/RW/P) restored P
> > 
> > In the end, this page's protection no longer has Global bit set and this
> > would create problem for this merge small mapping feature.
> > 
> > For this reason, restore Global bit for systems that do not have PTI
> > enabled if page is present.
> > 
> > (pgprot_clear_protnone_bits() deserves a better name if this patch is
> > acceptible but first, I would like to get some feedback if this is the
> > right way to solve this so I didn't bother with the name yet)
> > 
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > ---
> >  arch/x86/mm/pat/set_memory.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 1abd5438f126..33657a54670a 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -758,6 +758,8 @@ static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
> >  	 */
> >  	if (!(pgprot_val(prot) & _PAGE_PRESENT))
> >  		pgprot_val(prot) &= ~_PAGE_GLOBAL;
> > +	else
> > +		pgprot_val(prot) |= _PAGE_GLOBAL & __default_kernel_pte_mask;
> >  
> >  	return prot;
> >  }
> 
> IIUC It makes it unable to set _PAGE_GLOBL when PTI is on.
> 

Yes. Is this a problem?
I think that is the intended behaviour when PTI is on: not to enable the
Global bit on kernel mappings.

> Maybe it would be less intrusive to make
> set_direct_map_default_noflush() replace protection bits
> with PAGE_KENREL as it's only called for direct map, and the function
> is to reset permission to default:
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 1abd5438f126..0dd4433c1382 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2250,7 +2250,16 @@ int set_direct_map_invalid_noflush(struct page *page)
> 
>  int set_direct_map_default_noflush(struct page *page)
>  {
> -       return __set_pages_p(page, 1);
> +       unsigned long tempaddr = (unsigned long) page_address(page);
> +       struct cpa_data cpa = {
> +                       .vaddr = &tempaddr,
> +                       .pgd = NULL,
> +                       .numpages = 1,
> +                       .mask_set = PAGE_KERNEL,
> +                       .mask_clr = __pgprot(~0),
> +                       .flags = 0};
> +
> +       return __change_page_attr_set_clr(&cpa, 0);
>  }

Looks reasonable to me and it is indeed less intrusive. I'm only
concerned there might be other paths that also go through present ->
not present -> present that this change cannot cover.

> 
> set_direct_map_{invalid,default}_noflush() is the exact reason
> why direct map become split after vmalloc/vfree with special
> permissions.

Yes I agree, because it can lose the G bit after the whole cycle when PTI
is not on. When PTI is on, there is no such problem because the G bit is
not there initially.

Thanks,
Aaron


* Re: [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present
  2022-08-11  8:16     ` Lu, Aaron
@ 2022-08-11 11:30       ` Hyeonggon Yoo
  2022-08-11 12:28         ` Aaron Lu
  0 siblings, 1 reply; 16+ messages in thread
From: Hyeonggon Yoo @ 2022-08-11 11:30 UTC (permalink / raw)
  To: Lu, Aaron
  Cc: linux-mm@kvack.org, Hansen, Dave, linux-kernel@vger.kernel.org,
	Edgecombe, Rick P, song@kernel.org

On Thu, Aug 11, 2022 at 08:16:08AM +0000, Lu, Aaron wrote:
> On Thu, 2022-08-11 at 05:21 +0000, Hyeonggon Yoo wrote:
> > On Mon, Aug 08, 2022 at 10:56:46PM +0800, Aaron Lu wrote:
> > > For configs that don't have PTI enabled or cpus that don't need
> > > meltdown mitigation, current kernel can lose GLOBAL bit after a page
> > > goes through a cycle of present -> not present -> present.
> > > 
> > > It happened like this(__vunmap() does this in vm_remove_mappings()):
> > > original page protection: 0x8000000000000163 (NX/G/D/A/RW/P)
> > > set_memory_np(page, 1):   0x8000000000000062 (NX/D/A/RW) lose G and P
> > > set_memory_p(pagem 1):    0x8000000000000063 (NX/D/A/RW/P) restored P
> > > 
> > > In the end, this page's protection no longer has Global bit set and this
> > > would create problem for this merge small mapping feature.
> > > 
> > > For this reason, restore Global bit for systems that do not have PTI
> > > enabled if page is present.
> > > 
> > > (pgprot_clear_protnone_bits() deserves a better name if this patch is
> > > acceptible but first, I would like to get some feedback if this is the
> > > right way to solve this so I didn't bother with the name yet)
> > > 
> > > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > > ---
> > >  arch/x86/mm/pat/set_memory.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > > index 1abd5438f126..33657a54670a 100644
> > > --- a/arch/x86/mm/pat/set_memory.c
> > > +++ b/arch/x86/mm/pat/set_memory.c
> > > @@ -758,6 +758,8 @@ static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
> > >  	 */
> > >  	if (!(pgprot_val(prot) & _PAGE_PRESENT))
> > >  		pgprot_val(prot) &= ~_PAGE_GLOBAL;
> > > +	else
> > > +		pgprot_val(prot) |= _PAGE_GLOBAL & __default_kernel_pte_mask;
> > >  
> > >  	return prot;
> > >  }
> > 
> > IIUC It makes it unable to set _PAGE_GLOBL when PTI is on.
> > 
> 
> Yes. Is this a problem?
> I think that is the intended behaviour when PTI is on: not to enable
> Gloabl bit on kernel mappings.

Please note that I'm not an expert on PTI.

But AFAIK with PTI, at least the kernel part mapped into the user page
table is mapped as global when PGE is supported.

Not sure "the Global bit is never used for the kernel part when PTI is
enabled" is true.

Also, commit d1440b23c922d ("x86/mm: Factor out pageattr _PAGE_GLOBAL setting") that introduced
pgprot_clear_protnone_bits() says:
	
	This unconditional setting of _PAGE_GLOBAL is a problem when we have
	PTI and non-PTI and we want some areas to have _PAGE_GLOBAL and some
	not.

	This updated version of the code says:
	1. Clear _PAGE_GLOBAL when !_PAGE_PRESENT
	2. Never set _PAGE_GLOBAL implicitly
	3. Allow _PAGE_GLOBAL to be in cpa.set_mask
	4. Allow _PAGE_GLOBAL to be inherited from previous PTE

> > Maybe it would be less intrusive to make
> > set_direct_map_default_noflush() replace protection bits
> > with PAGE_KENREL as it's only called for direct map, and the function
> > is to reset permission to default:
> > 
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 1abd5438f126..0dd4433c1382 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2250,7 +2250,16 @@ int set_direct_map_invalid_noflush(struct page *page)
> > 
> >  int set_direct_map_default_noflush(struct page *page)
> >  {
> > -       return __set_pages_p(page, 1);
> > +       unsigned long tempaddr = (unsigned long) page_address(page);
> > +       struct cpa_data cpa = {
> > +                       .vaddr = &tempaddr,
> > +                       .pgd = NULL,
> > +                       .numpages = 1,
> > +                       .mask_set = PAGE_KERNEL,
> > +                       .mask_clr = __pgprot(~0),

Nah, this sets _PAGE_ENC unconditionally, which should be evaluated.
Maybe a less intrusive way would be:
		       .mask_set = __pgprot(_PAGE_PRESENT |
					   (_PAGE_GLOBAL & __default_kernel_pte_mask)),
                       .mask_clr = __pgprot(0),

> > +                       .flags = 0};
> > +
> > +       return __change_page_attr_set_clr(&cpa, 0);
> >  }
> 
> Looks reasonable to me and it is indeed less intrusive. I'm only
> concerned there might be other paths that also go through present ->
> not present -> present and this change can not cover them.
>

AFAIK the only other paths going through present -> not present -> present
(using CPA) are when DEBUG_PAGEALLOC is used.

Do we care about direct map fragmentation when using DEBUG_PAGEALLOC?

> > 
> > set_direct_map_{invalid,default}_noflush() is the exact reason
> > why the direct map becomes split after a vmalloc/vfree cycle with special
> > permissions.
> 
> Yes I agree, because it can lose the G bit after the whole cycle when PTI
> is not on. When PTI is on, there is no such problem because the G bit is
> not there initially.
> 
> Thanks,
> Aaron

-- 
Thanks,
Hyeonggon


* Re: [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present
  2022-08-11 11:30       ` Hyeonggon Yoo
@ 2022-08-11 12:28         ` Aaron Lu
  0 siblings, 0 replies; 16+ messages in thread
From: Aaron Lu @ 2022-08-11 12:28 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm@kvack.org, Hansen, Dave, linux-kernel@vger.kernel.org,
	Edgecombe, Rick P, song@kernel.org

On 8/11/2022 7:30 PM, Hyeonggon Yoo wrote:
> On Thu, Aug 11, 2022 at 08:16:08AM +0000, Lu, Aaron wrote:
>> On Thu, 2022-08-11 at 05:21 +0000, Hyeonggon Yoo wrote:
>>> On Mon, Aug 08, 2022 at 10:56:46PM +0800, Aaron Lu wrote:
>>>> For configs that don't have PTI enabled or CPUs that don't need
>>>> meltdown mitigation, the current kernel can lose the GLOBAL bit after a
>>>> page goes through a cycle of present -> not present -> present.
>>>>
>>>> It happened like this (__vunmap() does this in vm_remove_mappings()):
>>>> original page protection: 0x8000000000000163 (NX/G/D/A/RW/P)
>>>> set_memory_np(page, 1):   0x8000000000000062 (NX/D/A/RW) loses G and P
>>>> set_memory_p(page, 1):    0x8000000000000063 (NX/D/A/RW/P) restores P
>>>>
>>>> In the end, this page's protection no longer has the Global bit set and
>>>> this would create a problem for this small-mapping merge feature.
>>>>
>>>> For this reason, restore the Global bit on systems that do not have PTI
>>>> enabled if the page is present.
>>>>
>>>> (pgprot_clear_protnone_bits() deserves a better name if this patch is
>>>> acceptable but first, I would like to get some feedback on whether this is
>>>> the right way to solve this, so I didn't bother with the name yet)
>>>>
>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>>> ---
>>>>  arch/x86/mm/pat/set_memory.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>>>> index 1abd5438f126..33657a54670a 100644
>>>> --- a/arch/x86/mm/pat/set_memory.c
>>>> +++ b/arch/x86/mm/pat/set_memory.c
>>>> @@ -758,6 +758,8 @@ static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
>>>>  	 */
>>>>  	if (!(pgprot_val(prot) & _PAGE_PRESENT))
>>>>  		pgprot_val(prot) &= ~_PAGE_GLOBAL;
>>>> +	else
>>>> +		pgprot_val(prot) |= _PAGE_GLOBAL & __default_kernel_pte_mask;
>>>>  
>>>>  	return prot;
>>>>  }
>>>
>>> IIUC it makes it impossible to set _PAGE_GLOBAL when PTI is on.
>>>
>>
>> Yes. Is this a problem?
>> I think that is the intended behaviour when PTI is on: not to enable
>> the Global bit on kernel mappings.
> 
> Please note that I'm not an expert on PTI.
> 
> But AFAIK with PTI, at least the kernel part that is mapped into the user page
> table is mapped as global when PGE is supported.
> 
> So I'm not sure "the Global bit is never used for the kernel part when PTI is
> enabled" is actually true.
>
> Also, commit d1440b23c922d ("x86/mm: Factor out pageattr _PAGE_GLOBAL setting") that introduced
> pgprot_clear_protnone_bits() says:
> 	
> 	This unconditional setting of _PAGE_GLOBAL is a problem when we have
> 	PTI and non-PTI and we want some areas to have _PAGE_GLOBAL and some
> 	not.
> 
> 	This updated version of the code says:
> 	1. Clear _PAGE_GLOBAL when !_PAGE_PRESENT
> 	2. Never set _PAGE_GLOBAL implicitly
> 	3. Allow _PAGE_GLOBAL to be in cpa.set_mask
> 	4. Allow _PAGE_GLOBAL to be inherited from previous PTE
>

Thanks for the info, I'll need to take a closer look at PTI.

>>> Maybe it would be less intrusive to make
>>> set_direct_map_default_noflush() replace the protection bits
>>> with PAGE_KERNEL as it's only called for the direct map, and the function
>>> is to reset permissions to the default:
>>>
>>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>>> index 1abd5438f126..0dd4433c1382 100644
>>> --- a/arch/x86/mm/pat/set_memory.c
>>> +++ b/arch/x86/mm/pat/set_memory.c
>>> @@ -2250,7 +2250,16 @@ int set_direct_map_invalid_noflush(struct page *page)
>>>
>>>  int set_direct_map_default_noflush(struct page *page)
>>>  {
>>> -       return __set_pages_p(page, 1);
>>> +       unsigned long tempaddr = (unsigned long) page_address(page);
>>> +       struct cpa_data cpa = {
>>> +                       .vaddr = &tempaddr,
>>> +                       .pgd = NULL,
>>> +                       .numpages = 1,
>>> +                       .mask_set = PAGE_KERNEL,
>>> +                       .mask_clr = __pgprot(~0),
> 
> Nah, that sets _PAGE_ENC unconditionally, which would need to be evaluated
> separately. Maybe a less intrusive way would be:
> 		       .mask_set = __pgprot(_PAGE_PRESENT |
> 					   (_PAGE_GLOBAL & __default_kernel_pte_mask)),
>                        .mask_clr = __pgprot(0),
> 
>>> +                       .flags = 0};
>>> +
>>> +       return __change_page_attr_set_clr(&cpa, 0);
>>>  }
>>
>> Looks reasonable to me and it is indeed less intrusive. I'm only
>> concerned there might be other paths that also go through present ->
>> not present -> present and this change can not cover them.
>>
> 
> AFAIK the only other path that goes through present -> not present -> present
> (using CPA) is when DEBUG_PAGEALLOC is used.
> 
> Do we care about direct map fragmentation when DEBUG_PAGEALLOC is in use?
> 

No, the direct mapping does not use large page mappings when DEBUG_PAGEALLOC
is enabled.

>>>
>>> set_direct_map_{invalid,default}_noflush() is the exact reason
>>> why the direct map becomes split after a vmalloc/vfree cycle with special
>>> permissions.
>>
>> Yes I agree, because it can lose the G bit after the whole cycle when PTI
>> is not on. When PTI is on, there is no such problem because the G bit is
>> not there initially.
>>
>> Thanks,
>> Aaron
> 


* Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
  2022-08-11  4:50 ` Hyeonggon Yoo
  2022-08-11  7:50   ` Lu, Aaron
@ 2022-08-13 16:05   ` Mike Rapoport
  2022-08-16  6:33     ` Aaron Lu
  1 sibling, 1 reply; 16+ messages in thread
From: Mike Rapoport @ 2022-08-13 16:05 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Aaron Lu, Dave Hansen, Rick Edgecombe, Song Liu, linux-kernel,
	linux-mm

Hi Aaron,

On Thu, Aug 11, 2022 at 04:50:44AM +0000, Hyeonggon Yoo wrote:
> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> > This is an early RFC. While all reviews are welcome, reviewing this code
> > now will be a waste of time for the x86 subsystem maintainers. I would,
> > however, appreciate a preliminary review from the folks on the to and cc
> > list. I'm posting it to the list in case anyone else is interested in
> > seeing this early version.
> > 
> 
> Hello Aaron!
> 
> +Cc Mike Rapoport, who has been working on the same problem. [1]

Thanks Hyeonggon!
 
> There is also an LPC discussion (with a different approach to this problem)
> [2], [4]
> 
> and a performance measurement when all pages are 4K/2M. [3]
> 
> [1] https://lore.kernel.org/linux-mm/20220127085608.306306-1-rppt@kernel.org/
> [2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
> [3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
> [4] https://lwn.net/Articles/894557/
> 
> > Dave Hansen: I need your ack before this goes to the maintainers.
> > 
> > Here it goes:
> > 
> > On x86_64, Linux has direct mapping of almost all physical memory. For
> > performance reasons, this mapping is usually set as large page like 2M
> > or 1G per hardware's capability with read, write and non-execute
> > protection.
> > 
> > There are cases where some pages have to change their protection to RO
> > and eXecutable, like pages that host module code or bpf prog. When these
> > pages' protection are changed, the corresponding large mapping that
> > cover these pages will have to be splitted into 4K first and then
> > individual 4k page's protection changed accordingly, i.e. unaffected
> > pages keep their original protection as RW and NX while affected pages'
> > protection changed to RO and X.
> > 
> > There is a problem due to this split: the large mapping will remain
> > splitted even after the affected pages' protection are changed back to
> > RW and NX, like when the module is unloaded or bpf progs are freed.
> > After system runs a long time, there can be more and more large mapping
> > being splitted, causing more and more dTLB misses and overall system
> > performance getting hurt[1].
> > 
> > For this reason, people tried some techniques to reduce the harm of
> > large mapping beling splitted, like bpf_prog_pack[2] which packs
> > multiple bpf progs into a single page instead of allocating and changing
> > one page's protection for each bpf prog. This approach made large
> > mapping split happen much fewer.
> > 
> > This patchset addresses this problem in another way: it merges
> > splitted mappings back to a large mapping when protections of all entries
> > of the splitted small mapping page table become same again, e.g. when the
> > page whose protection was changed to RO+X now has its protection changed
> > back to RW+NX due to reasons like module unload, bpf prog free, etc. and
> > all other entries' protection are also RW+NX.
> >
> 
> I tried a very similar approach a few months ago (as a toy implementation) [5],
> and the biggest obstacle to this approach was: you need to be extremely sure
> that the page->nr_same_prot is ALWAYS correct.
> 
> For example, in arch/x86/include/asm/kfence.h [6], it clears and sets
> _PAGE_PRESENT without going through CPA, which can simply break the count.
> 
> [5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
> [6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56
> 
> I think we may need to hook set_pte/set_pmd/etc and use proper
> synchronization primitives when changing init_mm's page table to go
> further on this approach.
> 
> > One final note is, with features like bpf_prog_pack etc., there can be
> > much fewer large mapping split IIUC; also, this patchset can not help
> > when the page which has its protection changed keeps in use. So my take
> > on this large mapping split problem is: to get the most value of keeping
> > large mapping intact, features like bpf_prog_pack is important. This
> > patchset can help to further reduce large mapping split when in use page
> > that has special protection set finally gets released.

I'm not sure automatic collapse of large pages in the direct map will
actually trigger frequently. 

Consider, for example, pages allocated for modules that have adjusted
protection bits. These pages could be scattered all over and even when they
are freed, the chances that they form a contiguous free 2M chunk are quite low...

I believe that to reduce the fragmentation of the direct map, the 4K pages
with changed protection should be allocated from a cache of large pages, as
I did in an older version of secretmem or as Rick implemented in his vmalloc
and PKS series.

Then CPA may provide a method for explicitly collapsing a large page, so
that such cache can call this method when an entire large page becomes
free.
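
To make that a bit more concrete, the shape I have in mind is roughly the
following (all names here are made up purely for illustration, nothing like
this exists today):

	/*
	 * Hypothetical CPA helper: re-scan the 512 PTEs covering @addr and
	 * merge them back into a single 2M mapping if they all carry the
	 * default protections again.
	 */
	int cpa_collapse_large_page(unsigned long addr);

	/* Hypothetical callback in the large-page cache */
	static void lp_cache_chunk_free(unsigned long addr)
	{
		/*
		 * Every 4K page of this 2M chunk has been returned and its
		 * protections restored to the default, so ask CPA to collapse
		 * the mapping.
		 */
		cpa_collapse_large_page(addr);
	}

That way the policy of when to collapse stays with the owner of the pages and
CPA only provides the mechanism.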

> > [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
> > [2]: https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/
> > 
> > Aaron Lu (4):
> >   x86/mm/cpa: restore global bit when page is present
> >   x86/mm/cpa: merge splitted direct mapping when possible
> >   x86/mm/cpa: add merge event counter
> >   x86/mm/cpa: add a test interface to split direct map
> > 
> >  arch/x86/mm/pat/set_memory.c  | 411 +++++++++++++++++++++++++++++++++-
> >  include/linux/mm_types.h      |   6 +
> >  include/linux/page-flags.h    |   6 +
> >  include/linux/vm_event_item.h |   2 +
> >  mm/vmstat.c                   |   2 +
> >  5 files changed, 420 insertions(+), 7 deletions(-)
> > 
> > -- 
> > 2.37.1
> > 
> > 

-- 
Sincerely yours,
Mike.


* Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible
  2022-08-13 16:05   ` Mike Rapoport
@ 2022-08-16  6:33     ` Aaron Lu
  0 siblings, 0 replies; 16+ messages in thread
From: Aaron Lu @ 2022-08-16  6:33 UTC (permalink / raw)
  To: Mike Rapoport, Hyeonggon Yoo
  Cc: Dave Hansen, Rick Edgecombe, Song Liu, linux-kernel, linux-mm

Hi Mike,

Thanks for the feedback. See below for my comments.

On 8/14/2022 12:05 AM, Mike Rapoport wrote:
> Hi Aaron,
> 
> On Thu, Aug 11, 2022 at 04:50:44AM +0000, Hyeonggon Yoo wrote:
>> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
>>> This is an early RFC. While all reviews are welcome, reviewing this code
>>> now will be a waste of time for the x86 subsystem maintainers. I would,
>>> however, appreciate a preliminary review from the folks on the to and cc
>>> list. I'm posting it to the list in case anyone else is interested in
>>> seeing this early version.
>>>
>>
>> Hello Aaron!
>>
>> +Cc Mike Rapoport, who has been working on the same problem. [1]
> 
> Thanks Hyeonggon!
>  
>> There is also an LPC discussion (with a different approach to this problem)
>> [2], [4]
>>
>> and a performance measurement when all pages are 4K/2M. [3]
>>
>> [1] https://lore.kernel.org/linux-mm/20220127085608.306306-1-rppt@kernel.org/
>> [2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
>> [3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
>> [4] https://lwn.net/Articles/894557/
>>
>>> Dave Hansen: I need your ack before this goes to the maintainers.
>>>
>>> Here it goes:
>>>
>>> On x86_64, Linux has direct mapping of almost all physical memory. For
>>> performance reasons, this mapping is usually set as large page like 2M
>>> or 1G per hardware's capability with read, write and non-execute
>>> protection.
>>>
>>> There are cases where some pages have to change their protection to RO
>>> and eXecutable, like pages that host module code or bpf prog. When these
>>> pages' protection are changed, the corresponding large mapping that
>>> cover these pages will have to be splitted into 4K first and then
>>> individual 4k page's protection changed accordingly, i.e. unaffected
>>> pages keep their original protection as RW and NX while affected pages'
>>> protection changed to RO and X.
>>>
>>> There is a problem due to this split: the large mapping will remain
>>> splitted even after the affected pages' protection are changed back to
>>> RW and NX, like when the module is unloaded or bpf progs are freed.
>>> After system runs a long time, there can be more and more large mapping
>>> being splitted, causing more and more dTLB misses and overall system
>>> performance getting hurt[1].
>>>
>>> For this reason, people tried some techniques to reduce the harm of
>>> large mapping beling splitted, like bpf_prog_pack[2] which packs
>>> multiple bpf progs into a single page instead of allocating and changing
>>> one page's protection for each bpf prog. This approach made large
>>> mapping split happen much fewer.
>>>
>>> This patchset addresses this problem in another way: it merges
>>> splitted mappings back to a large mapping when protections of all entries
>>> of the splitted small mapping page table become same again, e.g. when the
>>> page whose protection was changed to RO+X now has its protection changed
>>> back to RW+NX due to reasons like module unload, bpf prog free, etc. and
>>> all other entries' protection are also RW+NX.
>>>
>>
>> I tried a very similar approach a few months ago (as a toy implementation) [5],
>> and the biggest obstacle to this approach was: you need to be extremely sure
>> that the page->nr_same_prot is ALWAYS correct.
>>
>> For example, in arch/x86/include/asm/kfence.h [6], it clears and sets
>> _PAGE_PRESENT without going through CPA, which can simply break the count.
>>
>> [5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
>> [6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56
>>
>> I think we may need to hook set_pte/set_pmd/etc and use proper
>> synchronization primitives when changing init_mm's page table to go
>> further on this approach.
>>
>>> One final note is, with features like bpf_prog_pack etc., there can be
>>> much fewer large mapping split IIUC; also, this patchset can not help
>>> when the page which has its protection changed keeps in use. So my take
>>> on this large mapping split problem is: to get the most value of keeping
>>> large mapping intact, features like bpf_prog_pack is important. This
>>> patchset can help to further reduce large mapping split when in use page
>>> that has special protection set finally gets released.
> 
> I'm not sure automatic collapse of large pages in the direct map will
> actually trigger frequently. 
> 
> Consider, for example, pages allocated for modules that have adjusted
> protection bits. These pages could be scattered all over and even when they
> are freed, the chances that they form a contiguous free 2M chunk are quite low...
> 

When the pages with special protection bits that are scattered over a 2M
chunk are all freed, we can merge the mappings for that chunk. I suppose you
mean it's not easy to get all of these special pages freed?

> I believe that to reduce the fragmentation of the direct map, the 4K pages
> with changed protection should be allocated from a cache of large pages, as
> I did in an older version of secretmem or as Rick implemented in his vmalloc
> and PKS series.

I agree that the allocation side is important for reducing direct map
fragmentation. The approach here doesn't help while these special pages are
still in use, whereas the approaches you mentioned can help with that.

> 
> Then CPA may provide a method for explicitly collapsing a large page, so
> that such cache can call this method when an entire large page becomes
> free.

I think this is a good idea. With things like your Unmap migratetype
patchset, when an order-9 page becomes free, the page allocator can somehow
notify CPA about it and then arch code like CPA can manipulate the direct
mapping as it sees fit, e.g. merging lower level page tables back into a
higher level one.

This also saves the trouble of tracking pgt->same_prot and nr_same_prot for
the kernel page tables in this patchset. CPA would only need to get notified
and then do a page table scan to make sure such a merge is correct.

I suppose this should work as long as all pages that will have their
protection bits changed are allocated from the page allocator (so that your
approach can track such pages).
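
In other words, something along these lines (the names are entirely
hypothetical, this is only meant to sketch the flow, not an actual API):

	/* Hypothetical CPA entry point (does not exist today) */
	void cpa_try_merge_direct_map(unsigned long addr);

	/*
	 * Hypothetical hook, called by the page allocator once a free 2M
	 * (order-9) chunk of direct-mapped memory has been assembled.
	 */
	void arch_direct_map_order9_free(struct page *page)
	{
		unsigned long addr = (unsigned long)page_address(page);

		/*
		 * CPA re-scans the 512 PTEs covering this chunk and merges
		 * them back into a 2M mapping if they all have the same
		 * (default) protections.
		 */
		cpa_try_merge_direct_map(addr);
	}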

> 
>>> [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
>>> [2]: https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/
>>>
>>> Aaron Lu (4):
>>>   x86/mm/cpa: restore global bit when page is present
>>>   x86/mm/cpa: merge splitted direct mapping when possible
>>>   x86/mm/cpa: add merge event counter
>>>   x86/mm/cpa: add a test interface to split direct map
>>>
>>>  arch/x86/mm/pat/set_memory.c  | 411 +++++++++++++++++++++++++++++++++-
>>>  include/linux/mm_types.h      |   6 +
>>>  include/linux/page-flags.h    |   6 +
>>>  include/linux/vm_event_item.h |   2 +
>>>  mm/vmstat.c                   |   2 +
>>>  5 files changed, 420 insertions(+), 7 deletions(-)
>>>
>>> -- 
>>> 2.37.1
>>>
>>>
> 



Thread overview: 16+ messages
2022-08-08 14:56 [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Aaron Lu
2022-08-08 14:56 ` [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present Aaron Lu
2022-08-11  5:21   ` Hyeonggon Yoo
2022-08-11  8:16     ` Lu, Aaron
2022-08-11 11:30       ` Hyeonggon Yoo
2022-08-11 12:28         ` Aaron Lu
2022-08-08 14:56 ` [RFC PATCH 2/4] x86/mm/cpa: merge splitted direct mapping when possible Aaron Lu
2022-08-08 14:56 ` [RFC PATCH 3/4] x86/mm/cpa: add merge event counter Aaron Lu
2022-08-08 14:56 ` [TEST NOT_FOR_MERGE 4/4] x86/mm/cpa: add a test interface to split direct map Aaron Lu
2022-08-09 10:04 ` [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible Kirill A. Shutemov
2022-08-09 14:58   ` Aaron Lu
2022-08-09 17:56     ` Kirill A. Shutemov
2022-08-11  4:50 ` Hyeonggon Yoo
2022-08-11  7:50   ` Lu, Aaron
2022-08-13 16:05   ` Mike Rapoport
2022-08-16  6:33     ` Aaron Lu
