[RFC 0/3] Group pages on the direct map for permissioned vmallocs

LKML Archive mirror
 help / color / mirror / Atom feed

* [RFC 0/3] Group pages on the direct map for permissioned vmallocs
@ 2021-04-05 20:37 Rick Edgecombe
  2021-04-05 20:37 ` [RFC 1/3] list: Support getting most recent element in list_lru Rick Edgecombe
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Rick Edgecombe @ 2021-04-05 20:37 UTC (permalink / raw)
  To: akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86, Rick Edgecombe

Hi,

This is a followup to the previous attempt to overhaul how vmalloc permissions
are done:
https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/

In working on a next version it dawned on me that we can significantly reduce
direct map breakage on x86 with a much less invasive change, so I thought maybe
I would start there in the meantime.

In a test of booting fedora and running the BPF unit tests, this reduced 4k
direct map pages by 98%.

It simply uses pages for x86 module_alloc() mappings from a cache created out of
2MB pages. This results in all the later breakage clustering in 2MB blocks on
the direct map. The trade-off is colder pages are used for these allocations.
All module_alloc() users (modules, ebpf jit, ftrace, kprobes) get this behavior.

Potentially this behavior should be enabled for eBPF byte code allocations as
well in the case of !CONFIG_BPF_JIT_ALWAYS_ON.

The new APIs and invasive changes in the callers can happen after vmalloc huge
pages bring more benefits. Although, I can post shootdown reduction changes with
previous comments integrated if anyone disagrees.

Based on v5.11.

Thanks,

Rick

Rick Edgecombe (3):
  list: Support getting most recent element in list_lru
  vmalloc: Support grouped page allocations
  x86/module: Use VM_GROUP_PAGES flag

 arch/x86/Kconfig         |   1 +
 arch/x86/kernel/module.c |   2 +-
 include/linux/list_lru.h |  13 +++
 include/linux/vmalloc.h  |   1 +
 mm/Kconfig               |   9 ++
 mm/list_lru.c            |  28 +++++
 mm/vmalloc.c             | 215 +++++++++++++++++++++++++++++++++++++--
 7 files changed, 257 insertions(+), 12 deletions(-)

-- 
2.29.2

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC 1/3] list: Support getting most recent element in list_lru
  2021-04-05 20:37 [RFC 0/3] Group pages on the direct map for permissioned vmallocs Rick Edgecombe
@ 2021-04-05 20:37 ` Rick Edgecombe
  2021-04-05 20:37 ` [RFC 2/3] vmalloc: Support grouped page allocations Rick Edgecombe
  2021-04-05 20:37 ` [RFC 3/3] x86/module: Use VM_GROUP_PAGES flag Rick Edgecombe
  2 siblings, 0 replies; 8+ messages in thread
From: Rick Edgecombe @ 2021-04-05 20:37 UTC (permalink / raw)
  To: akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86, Rick Edgecombe

In future patches, some functionality will use list_lru that also needs
to keep track of the most recently used element on a node. Since this
information is already contained within list_lru, add a function to get
it so that an additional list is not needed in the caller.

Do not support memcg aware list_lru's since it is not needed by the
intended caller.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/list_lru.h | 13 +++++++++++++
 mm/list_lru.c            | 28 ++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 9dcaa3e582c9..4bde44a5024b 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -103,6 +103,19 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
  */
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_get_mru: gets and removes the tail from one of the node lists
+ * @list_lru: the lru pointer
+ * @nid: the node id
+ *
+ * This function removes the most recently added item from one of the node
+ * id specified. This function should not be used if the list_lru is memcg
+ * aware.
+ *
+ * Return value: The element removed
+ */
+struct list_head *list_lru_get_mru(struct list_lru *lru, int nid);
+
 /**
  * list_lru_count_one: return the number of objects currently held by @lru
  * @lru: the lru pointer.
diff --git a/mm/list_lru.c b/mm/list_lru.c
index fe230081690b..a18d9ed8abd5 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -156,6 +156,34 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
+struct list_head *list_lru_get_mru(struct list_lru *lru, int nid)
+{
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l = &nlru->lru;
+	struct list_head *ret;
+
+	/* This function does not attempt to search through the memcg lists */
+	if (list_lru_memcg_aware(lru)) {
+		WARN_ONCE(1, "list_lru: %s not supported on memcg aware list_lrus", __func__);
+		return NULL;
+	}
+
+	spin_lock(&nlru->lock);
+	if (list_empty(&l->list)) {
+		ret = NULL;
+	} else {
+		/* Get tail */
+		ret = l->list.prev;
+		list_del_init(ret);
+
+		l->nr_items--;
+		nlru->nr_items--;
+	}
+	spin_unlock(&nlru->lock);
+
+	return ret;
+}
+
 void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
 {
 	list_del_init(item);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC 2/3] vmalloc: Support grouped page allocations
  2021-04-05 20:37 [RFC 0/3] Group pages on the direct map for permissioned vmallocs Rick Edgecombe
  2021-04-05 20:37 ` [RFC 1/3] list: Support getting most recent element in list_lru Rick Edgecombe
@ 2021-04-05 20:37 ` Rick Edgecombe
  2021-04-05 21:01   ` Dave Hansen
  2021-04-05 20:37 ` [RFC 3/3] x86/module: Use VM_GROUP_PAGES flag Rick Edgecombe
  2 siblings, 1 reply; 8+ messages in thread
From: Rick Edgecombe @ 2021-04-05 20:37 UTC (permalink / raw)
  To: akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86, Rick Edgecombe

For x86, setting memory permissions on vmalloc allocations results in
fracturing large pages on the direct map. Direct map fracturing can be
reduced by locating pages that will have their permissions set close
together.

Create a simple page cache in vmalloc that allocates pages from huge
page size blocks. Define a new flag, VM_GROUP_PAGES, that vmalloc
callers can use to ask vmalloc to try to use pages from this new cache.
Don't guarantee that a page will come from a huge page grouping,
instead fallback to non-grouped pages to fulfill the allocation if
needed. Also, register a shrinker such that the system can ask for the
pages back if needed.

Since this is only needed for cases where there are direct map
permissions and there might be large pages on the direct map, compile
it out for non-x86 and highmem using a new arch CONFIG.

Free pages in the cache are kept track of in per-node list inside a
list_lru. NUMA_NO_NODE requests are serviced by checking each per-node
list in a round robin fashion. If pages are requested for a certain node
but the cache is empty for that node, a whole additional huge page size
page is allocated.

When replenishing the cache, use a specific GFP flag that matches the
intended user of this vm_flag (module_alloc()). In the case of the vm
and GFP flags mismatching, fail the page allocation. In the case of a
huge page size page not being available, fallback to the normal page
allocator logic and use non-grouped pages.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/Kconfig        |   1 +
 include/linux/vmalloc.h |   1 +
 mm/Kconfig              |   9 ++
 mm/vmalloc.c            | 215 ++++++++++++++++++++++++++++++++++++++--
 4 files changed, 215 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..c0dafa7e6e73 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -80,6 +80,7 @@ config X86
 	select ARCH_HAS_COPY_MC			if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SET_DIRECT_MAP
+	select ARCH_HAS_HPAGE_DIRECT_MAP_PERMS  if !HIGHMEM
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index cedcda6593f6..6a27af32f3a1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -26,6 +26,7 @@ struct notifier_block;		/* in notifier.h */
 #define VM_KASAN		0x00000080      /* has allocated kasan shadow memory */
 #define VM_FLUSH_RESET_PERMS	0x00000100	/* reset direct map and flush TLB on unmap, can't be freed in atomic context */
 #define VM_MAP_PUT_PAGES	0x00000200	/* put pages and free array in vfree */
+#define VM_GROUP_PAGES		0x00000400	/* try to group the pages on the direct map */
 
 /*
  * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
diff --git a/mm/Kconfig b/mm/Kconfig
index f730605b8dcf..ecb04d9b6227 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -869,6 +869,15 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+#
+# Select if arch tries to preserve large pages on the direct map while also
+# setting permissions on the direct map. When set, generic mm code will make
+# an effort to centralize the breakage of the direct map that results from
+# setting kernel memory permissions.
+#
+config ARCH_HAS_HPAGE_DIRECT_MAP_PERMS
+	bool
+
 config MAPPING_DIRTY_HELPERS
         bool
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e6f352bf0498..d478dccda3a7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2194,6 +2194,203 @@ struct vm_struct *remove_vm_area(const void *addr)
 	return NULL;
 }
 
+static struct page *__vm_alloc_page_order(int node, gfp_t gfp_mask, int order)
+{
+	if (node == NUMA_NO_NODE)
+		return alloc_pages(gfp_mask, order);
+
+	return alloc_pages_node(node, gfp_mask, order);
+}
+
+#ifdef CONFIG_ARCH_HAS_HPAGE_DIRECT_MAP_PERMS
+#define GFP_GROUPED (__GFP_NOWARN | __GFP_HIGHMEM | GFP_KERNEL)
+
+/*
+ * The list_lru contains per-node lists of the cache of grouped pages. Huge
+ * page size blocks of pages are added whenever a request cannot be fulfilled
+ * by pages in the cache. For NUMA_NO_NODE, nid_round_robin is used as the
+ * start of a search through each node looking for a page. The shrinker will
+ * lock the per-node list.
+ */
+static struct list_lru grouped_pages_lru;
+static bool grouped_page_en;
+static atomic_t nid_round_robin;
+
+static unsigned long grouped_shrink_count(struct shrinker *shrinker,
+					  struct shrink_control *sc)
+{
+	unsigned long page_cnt = list_lru_shrink_count(&grouped_pages_lru, sc);
+
+	return page_cnt ? page_cnt : SHRINK_EMPTY;
+}
+
+static enum lru_status grouped_isolate(struct list_head *item,
+				       struct list_lru_one *list,
+				       spinlock_t *lock, void *cb_arg)
+{
+	struct list_head *dispose = cb_arg;
+
+	list_lru_isolate_move(list, item, dispose);
+
+	return LRU_REMOVED;
+}
+
+static void __dispose_pages(struct list_head *head)
+{
+	struct list_head *cur, *next;
+
+	list_for_each_safe(cur, next, head) {
+		list_del(cur);
+
+		/* The list head is stored at the start of the page */
+		free_page((unsigned long)cur);
+	}
+}
+
+static unsigned long grouped_shrink_scan(struct shrinker *shrinker,
+					 struct shrink_control *sc)
+{
+	unsigned long isolated;
+	LIST_HEAD(freeable);
+
+	if (!(sc->gfp_mask & GFP_GROUPED))
+		return SHRINK_STOP;
+
+	isolated = list_lru_shrink_walk(&grouped_pages_lru, sc, grouped_isolate,
+					&freeable);
+	__dispose_pages(&freeable);
+
+	/* Every item walked gets isolated */
+	sc->nr_scanned += isolated;
+
+	return isolated;
+}
+
+static struct shrinker grouped_page_shrinker = {
+	.count_objects = grouped_shrink_count,
+	.scan_objects = grouped_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+	.flags = SHRINKER_NUMA_AWARE
+};
+
+static int __init grouped_page_init(void)
+{
+	if (list_lru_init(&grouped_pages_lru))
+		goto out;
+
+	grouped_page_en = !register_shrinker(&grouped_page_shrinker);
+	if (!grouped_page_en)
+		list_lru_destroy(&grouped_pages_lru);
+
+out:
+	return !grouped_page_en;
+}
+
+device_initcall(grouped_page_init);
+
+static struct page *__remove_first_page(int node)
+{
+	unsigned int start_nid, i;
+	struct list_head *head;
+
+	if (node != NUMA_NO_NODE) {
+		head = list_lru_get_mru(&grouped_pages_lru, node);
+		if (head)
+			return virt_to_page(head);
+		return NULL;
+	}
+
+	/* If NUMA_NO_NODE, search the nodes in round robin for a page */
+	start_nid = (unsigned int)atomic_fetch_inc(&nid_round_robin) % nr_node_ids;
+	for (i = 0; i < nr_node_ids; i++) {
+		int cur_nid = (start_nid + i) % nr_node_ids;
+
+		head = list_lru_get_mru(&grouped_pages_lru, cur_nid);
+		if (head)
+			return virt_to_page(head);
+	}
+
+	return NULL;
+}
+
+static inline bool vmalloc_grouped_page_enabled(unsigned long vm_flags)
+{
+	return grouped_page_en && vm_flags & VM_GROUP_PAGES;
+}
+
+/* Return a page to the cache used by VM_GROUP_PAGES */
+static void __free_grouped_page(struct page *page)
+{
+	struct list_head *item;
+
+	item = (struct list_head *)page_address(page);
+	INIT_LIST_HEAD(item);
+	list_lru_add(&grouped_pages_lru, item);
+}
+
+/* Get and add some new pages to the cache to be used by VM_GROUP_PAGES */
+static struct page *__replenish_grouped_pages(int node)
+{
+	const unsigned int hpage_cnt = HPAGE_SIZE >> PAGE_SHIFT;
+	struct page *page;
+	int i;
+
+	page = __vm_alloc_page_order(node, GFP_GROUPED, HUGETLB_PAGE_ORDER);
+	if (!page)
+		return __vm_alloc_page_order(node, GFP_GROUPED, 0);
+
+	split_page(page, HUGETLB_PAGE_ORDER);
+
+	for (i = 1; i < hpage_cnt; i++)
+		__free_grouped_page(&page[i]);
+
+	return &page[0];
+}
+
+/* Get a page from the cache for VM_GROUP_PAGES */
+static struct page *__get_grouped_page(int node, gfp_t gfp_mask)
+{
+	struct page *page;
+
+	if (gfp_mask != GFP_GROUPED) {
+		WARN(1, "Unsupported GFP flags for VM_GROUP_PAGES\n");
+		return NULL;
+	}
+
+	page = __remove_first_page(node);
+
+	if (page)
+		return page;
+
+	return __replenish_grouped_pages(node);
+}
+#else /* CONFIG_ARCH_HAS_HPAGE_DIRECT_MAP_PERMS */
+static inline bool vmalloc_grouped_page_enabled(unsigned long vm_flags)
+{
+	return false;
+}
+
+extern void __free_grouped_page(struct page *page);
+extern struct page *__get_grouped_page(int node, gfp_t gfp_mask);
+#endif /* CONFIG_ARCH_HAS_HPAGE_DIRECT_MAP_PERMS */
+
+static struct page *vm_alloc_page(int node, unsigned long vm_flags,
+				  gfp_t gfp_mask)
+{
+	if (vmalloc_grouped_page_enabled(vm_flags))
+		return __get_grouped_page(node, gfp_mask);
+
+	return __vm_alloc_page_order(node, gfp_mask, 0);
+}
+
+static void vm_free_page(struct page *page, unsigned long vm_flags)
+{
+	if (vmalloc_grouped_page_enabled(vm_flags))
+		__free_grouped_page(page);
+	else
+		__free_pages(page, 0);
+}
+
 static inline void set_area_direct_map(const struct vm_struct *area,
 				       int (*set_direct_map)(struct page *page))
 {
@@ -2283,7 +2480,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
 			struct page *page = area->pages[i];
 
 			BUG_ON(!page);
-			__free_pages(page, 0);
+			vm_free_page(page, area->flags);
 		}
 		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
 
@@ -2474,7 +2671,7 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
 #endif /* CONFIG_VMAP_PFN */
 
 static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
-				 pgprot_t prot, int node)
+				 pgprot_t prot, int node, unsigned long vm_flags)
 {
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
@@ -2504,12 +2701,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	area->nr_pages = nr_pages;
 
 	for (i = 0; i < area->nr_pages; i++) {
-		struct page *page;
-
-		if (node == NUMA_NO_NODE)
-			page = alloc_page(gfp_mask);
-		else
-			page = alloc_pages_node(node, gfp_mask, 0);
+		struct page *page = vm_alloc_page(node, vm_flags, gfp_mask);
 
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vfree() */
@@ -2560,6 +2752,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller)
 {
+	unsigned long flags = VM_ALLOC | VM_UNINITIALIZED | vm_flags;
 	struct vm_struct *area;
 	void *addr;
 	unsigned long real_size = size;
@@ -2568,12 +2761,12 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	if (!size || (size >> PAGE_SHIFT) > totalram_pages())
 		goto fail;
 
-	area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED |
-				vm_flags, start, end, node, gfp_mask, caller);
+	area = __get_vm_area_node(real_size, align, flags, start, end, node,
+				  gfp_mask, caller);
 	if (!area)
 		goto fail;
 
-	addr = __vmalloc_area_node(area, gfp_mask, prot, node);
+	addr = __vmalloc_area_node(area, gfp_mask, prot, node, flags);
 	if (!addr)
 		return NULL;
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC 3/3] x86/module: Use VM_GROUP_PAGES flag
  2021-04-05 20:37 [RFC 0/3] Group pages on the direct map for permissioned vmallocs Rick Edgecombe
  2021-04-05 20:37 ` [RFC 1/3] list: Support getting most recent element in list_lru Rick Edgecombe
  2021-04-05 20:37 ` [RFC 2/3] vmalloc: Support grouped page allocations Rick Edgecombe
@ 2021-04-05 20:37 ` Rick Edgecombe
  2 siblings, 0 replies; 8+ messages in thread
From: Rick Edgecombe @ 2021-04-05 20:37 UTC (permalink / raw)
  To: akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86, Rick Edgecombe

Callers of module_alloc() will set permissions on the allocation. Use
the VM_GROUP_PAGES to reduce direct map breakage.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/module.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 34b153cbd4ac..9161ce0e987f 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -75,7 +75,7 @@ void *module_alloc(unsigned long size)
 	p = __vmalloc_node_range(size, MODULE_ALIGN,
 				    MODULES_VADDR + get_module_load_offset(),
 				    MODULES_END, GFP_KERNEL,
-				    PAGE_KERNEL, 0, NUMA_NO_NODE,
+				    PAGE_KERNEL, VM_GROUP_PAGES, NUMA_NO_NODE,
 				    __builtin_return_address(0));
 	if (p && (kasan_module_alloc(p, size) < 0)) {
 		vfree(p);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC 2/3] vmalloc: Support grouped page allocations
  2021-04-05 20:37 ` [RFC 2/3] vmalloc: Support grouped page allocations Rick Edgecombe
@ 2021-04-05 21:01   ` Dave Hansen
  2021-04-05 21:32     ` Matthew Wilcox
  2021-04-05 21:38     ` Edgecombe, Rick P
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Hansen @ 2021-04-05 21:01 UTC (permalink / raw)
  To: Rick Edgecombe, akpm, linux-mm, bpf, dave.hansen, peterz, luto,
	jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86

On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> +static void __dispose_pages(struct list_head *head)
> +{
> +	struct list_head *cur, *next;
> +
> +	list_for_each_safe(cur, next, head) {
> +		list_del(cur);
> +
> +		/* The list head is stored at the start of the page */
> +		free_page((unsigned long)cur);
> +	}
> +}

This is interesting.

While the page is in the allocator, you're using the page contents
themselves to store the list_head.  It took me a minute to figure out
what you were doing here because: "start of the page" is a bit
ambiguous.  It could mean:

 * the first 16 bytes in 'struct page'
or
 * the first 16 bytes in the page itself, aka *page_address(page)

The fact that this doesn't work on higmem systems makes this an OK thing
to do, but it is a bit weird.  It's also doubly susceptible to bugs
where there's a page_to_virt() or virt_to_page() screwup.

I was *hoping* there was still sufficient space in 'struct page' for
this second list_head in addition to page->lru.  I think there *should*
be.  That would at least make this allocator a bit more "normal" in not
caring about page contents while the page is free in the allocator.  If
you were able to do that you could do things like kmemcheck or page
alloc debugging while the page is in the allocator.

Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
But, if that doesn't work out, please comment the snot out of this thing
because it _is_ weird.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC 2/3] vmalloc: Support grouped page allocations
  2021-04-05 21:01   ` Dave Hansen
@ 2021-04-05 21:32     ` Matthew Wilcox
  2021-04-05 21:49       ` Edgecombe, Rick P
  2021-04-05 21:38     ` Edgecombe, Rick P
  1 sibling, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2021-04-05 21:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rick Edgecombe, akpm, linux-mm, bpf, dave.hansen, peterz, luto,
	jeyu, linux-kernel, ast, daniel, andrii, hch, x86

On Mon, Apr 05, 2021 at 02:01:58PM -0700, Dave Hansen wrote:
> On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > +static void __dispose_pages(struct list_head *head)
> > +{
> > +	struct list_head *cur, *next;
> > +
> > +	list_for_each_safe(cur, next, head) {
> > +		list_del(cur);
> > +
> > +		/* The list head is stored at the start of the page */
> > +		free_page((unsigned long)cur);
> > +	}
> > +}
> 
> This is interesting.
> 
> While the page is in the allocator, you're using the page contents
> themselves to store the list_head.  It took me a minute to figure out
> what you were doing here because: "start of the page" is a bit
> ambiguous.  It could mean:
> 
>  * the first 16 bytes in 'struct page'
> or
>  * the first 16 bytes in the page itself, aka *page_address(page)
> 
> The fact that this doesn't work on higmem systems makes this an OK thing
> to do, but it is a bit weird.  It's also doubly susceptible to bugs
> where there's a page_to_virt() or virt_to_page() screwup.
> 
> I was *hoping* there was still sufficient space in 'struct page' for
> this second list_head in addition to page->lru.  I think there *should*
> be.  That would at least make this allocator a bit more "normal" in not
> caring about page contents while the page is free in the allocator.  If
> you were able to do that you could do things like kmemcheck or page
> alloc debugging while the page is in the allocator.
> 
> Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
> But, if that doesn't work out, please comment the snot out of this thing
> because it _is_ weird.

Hi!  Current closest-thing-we-have-to-an-expert-on-struct-page here!

I haven't read over these patches yet.  If these pages are in use by
vmalloc, they can't use mapping+index because get_user_pages() will call
page_mapping() and the list_head will confuse it.  I think it could use
index+private for a list_head.

If the pages are in the buddy, I _think_ mapping+index are free.  private
is in use for buddy order.  But I haven't read through the buddy code
in a while.

Does it need to be a doubly linked list?  Can it be an hlist?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC 2/3] vmalloc: Support grouped page allocations
  2021-04-05 21:01   ` Dave Hansen
  2021-04-05 21:32     ` Matthew Wilcox
@ 2021-04-05 21:38     ` Edgecombe, Rick P
  1 sibling, 0 replies; 8+ messages in thread
From: Edgecombe, Rick P @ 2021-04-05 21:38 UTC (permalink / raw)
  To: luto@kernel.org, Hansen, Dave, dave.hansen@linux.intel.com,
	linux-mm@kvack.org, jeyu@kernel.org, peterz@infradead.org,
	akpm@linux-foundation.org, bpf@vger.kernel.org
  Cc: daniel@iogearbox.net, ast@kernel.org,
	linux-kernel@vger.kernel.org, andrii@kernel.org, x86@kernel.org,
	hch@infradead.org

On Mon, 2021-04-05 at 14:01 -0700, Dave Hansen wrote:
> On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > +static void __dispose_pages(struct list_head *head)
> > +{
> > +       struct list_head *cur, *next;
> > +
> > +       list_for_each_safe(cur, next, head) {
> > +               list_del(cur);
> > +
> > +               /* The list head is stored at the start of the page
> > */
> > +               free_page((unsigned long)cur);
> > +       }
> > +}
> 
> This is interesting.
> 
> While the page is in the allocator, you're using the page contents
> themselves to store the list_head.  It took me a minute to figure out
> what you were doing here because: "start of the page" is a bit
> ambiguous.  It could mean:
> 
>  * the first 16 bytes in 'struct page'
> or
>  * the first 16 bytes in the page itself, aka *page_address(page)
> 
> The fact that this doesn't work on higmem systems makes this an OK
> thing
> to do, but it is a bit weird.  It's also doubly susceptible to bugs
> where there's a page_to_virt() or virt_to_page() screwup.
> 
> I was *hoping* there was still sufficient space in 'struct page' for
> this second list_head in addition to page->lru.  I think there
> *should*
> be.  That would at least make this allocator a bit more "normal" in
> not
> caring about page contents while the page is free in the allocator. 
> If
> you were able to do that you could do things like kmemcheck or page
> alloc debugging while the page is in the allocator.
> 
> Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
> But, if that doesn't work out, please comment the snot out of this
> thing
> because it _is_ weird.

Yes sorry, that deserved more explanation. I tried putting it in struct
page actually. The problem was list_lru automatically determines the
node id from the list_head provided to it via
page_to_nid(virt_to_page(head)). I guess it assumes the list_head is on
the actual item. I started adding another list_lru function that let
the node id be passed in separately, but I remembered this trick from
the deferred free list in vmalloc.

If this ever expands to handle direct map unmapped pages (which would
probably be the next step), the list_head will have to be moved out of
the actual page anyway. But in the meantime it resulted in the smallest
change.

I can try the other way if it's still too weird.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC 2/3] vmalloc: Support grouped page allocations
  2021-04-05 21:32     ` Matthew Wilcox
@ 2021-04-05 21:49       ` Edgecombe, Rick P
  0 siblings, 0 replies; 8+ messages in thread
From: Edgecombe, Rick P @ 2021-04-05 21:49 UTC (permalink / raw)
  To: willy@infradead.org, Hansen, Dave
  Cc: luto@kernel.org, x86@kernel.org, dave.hansen@linux.intel.com,
	linux-mm@kvack.org, jeyu@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, hch@infradead.org,
	akpm@linux-foundation.org, ast@kernel.org, bpf@vger.kernel.org,
	daniel@iogearbox.net, andrii@kernel.org

On Mon, 2021-04-05 at 22:32 +0100, Matthew Wilcox wrote:
> On Mon, Apr 05, 2021 at 02:01:58PM -0700, Dave Hansen wrote:
> > On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > > +static void __dispose_pages(struct list_head *head)
> > > +{
> > > +       struct list_head *cur, *next;
> > > +
> > > +       list_for_each_safe(cur, next, head) {
> > > +               list_del(cur);
> > > +
> > > +               /* The list head is stored at the start of the
> > > page */
> > > +               free_page((unsigned long)cur);
> > > +       }
> > > +}
> > 
> > This is interesting.
> > 
> > While the page is in the allocator, you're using the page contents
> > themselves to store the list_head.  It took me a minute to figure
> > out
> > what you were doing here because: "start of the page" is a bit
> > ambiguous.  It could mean:
> > 
> >  * the first 16 bytes in 'struct page'
> > or
> >  * the first 16 bytes in the page itself, aka *page_address(page)
> > 
> > The fact that this doesn't work on higmem systems makes this an OK
> > thing
> > to do, but it is a bit weird.  It's also doubly susceptible to bugs
> > where there's a page_to_virt() or virt_to_page() screwup.
> > 
> > I was *hoping* there was still sufficient space in 'struct page'
> > for
> > this second list_head in addition to page->lru.  I think there
> > *should*
> > be.  That would at least make this allocator a bit more "normal" in
> > not
> > caring about page contents while the page is free in the
> > allocator.  If
> > you were able to do that you could do things like kmemcheck or page
> > alloc debugging while the page is in the allocator.
> > 
> > Anyway, I think I'd prefer that you *try* to use 'struct page'
> > alone.
> > But, if that doesn't work out, please comment the snot out of this
> > thing
> > because it _is_ weird.
> 
> Hi!  Current closest-thing-we-have-to-an-expert-on-struct-page here!
> 
> I haven't read over these patches yet.  If these pages are in use by
> vmalloc, they can't use mapping+index because get_user_pages() will
> call
> page_mapping() and the list_head will confuse it.  I think it could
> use
> index+private for a list_head.
> 
> If the pages are in the buddy, I _think_ mapping+index are free. 
> private
> is in use for buddy order.  But I haven't read through the buddy code
> in a while.
> 
> Does it need to be a doubly linked list?  Can it be an hlist?

It does need to be a doubly linked list. I think they should never be
mapped to userspace. As far as the page allocator is concerned these
pages are not free. And they are not compound.

Originally I was just using the lru member. Would it be ok in that
case?

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-04-05 21:49 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-05 20:37 [RFC 0/3] Group pages on the direct map for permissioned vmallocs Rick Edgecombe
2021-04-05 20:37 ` [RFC 1/3] list: Support getting most recent element in list_lru Rick Edgecombe
2021-04-05 20:37 ` [RFC 2/3] vmalloc: Support grouped page allocations Rick Edgecombe
2021-04-05 21:01   ` Dave Hansen
2021-04-05 21:32     ` Matthew Wilcox
2021-04-05 21:49       ` Edgecombe, Rick P
2021-04-05 21:38     ` Edgecombe, Rick P
2021-04-05 20:37 ` [RFC 3/3] x86/module: Use VM_GROUP_PAGES flag Rick Edgecombe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).