* [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts
@ 2021-04-05 23:00 Mike Kravetz
  2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
                   ` (8 more replies)
  0 siblings, 9 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

IMPORTANT NOTE FOR REVIEWERS:  v3 of this series is in Andrew's mmotm
tree v5.12-rc5-mmotm-2021-03-31-21-27.  Muchun Song noticed that Oscar
Salvador's series "Make alloc_contig_range handle Hugetlb pages" was also
added to that mmotm version.  v3 of this series did not take Oscar's
changes into account.  v4 addresses those omissions.

v4 is based on the following:
- Start with v5.12-rc5-mmotm-2021-03-31-21-27
- Revert v3 of this series

Patch changes from v3:
1	- Trivial context fixup due to cma changes.  No functional changes
2, 3	- No change
4, 5	- Changes required due to "Make alloc_contig_range handle
	  Hugetlb pages".  Specifically, alloc_and_dissolve_huge_page
	  changes are in need of review.
6	- No change
7	- Fairly straightforward changes of spin_*lock calls to
	  spin_*lock_irq* in code from the "Make alloc_contig_range handle
	  Hugetlb pages" series.
8	- Trivial change due to context changes.  No functional change.

If easier to review, I could also provide delta patches on top of
patches 4, 5, and 7 of v3.

Original cover letter follows:
This effort is the result of a recent bug report [1].  Syzbot found a
potential deadlock in the hugetlb put_page/free_huge_page path:
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
Since the free_huge_page path already has code to 'hand off' page
free requests to a workqueue, a suggestion was proposed to make
the in_irq() detection accurate by always enabling PREEMPT_COUNT [2].
The outcome of that discussion was that the hugetlb put_page
(free_huge_page) path should be properly fixed and made safe for all
calling contexts.
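
As a purely illustrative sketch (simplified from patch 7 of this
series, not code to be applied), the end result is that the freeing
path takes hugetlb_lock with the irqsave variant so it is safe no
matter which context drops the last reference:

    void free_huge_page(struct page *page)
    {
            unsigned long flags;

            /* may be called from task, softirq or hardirq context */
            spin_lock_irqsave(&hugetlb_lock, flags);
            /* remove the page from hugetlb lists, adjust counters */
            spin_unlock_irqrestore(&hugetlb_lock, flags);
            /* free the underlying pages outside the lock */
    }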

This series is based on v5.12-rc3-mmotm-2021-03-17-22-24.  At a high
level, the series provides:
- Patches 1 & 2 change CMA bitmap mutex to an irq safe spinlock
- Patch 3 adds a mutex for proc/sysfs interfaces changing hugetlb counts
- Patches 4, 5 & 6 are aimed at reducing lock hold times.  To be clear,
  the goal is to eliminate single lock hold times of a long duration.
  Overall lock hold time is not addressed.
- Patch 7 makes hugetlb_lock and subpool lock IRQ safe.  It also reverts
  the code which defers calls to a workqueue if !in_task.
- Patch 8 adds some lockdep_assert_held() calls

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.kravetz@oracle.com

v3 -> v4
- Add changes needed for the series "Make alloc_contig_range handle
  Hugetlb pages"

v2 -> v3
- Update commit message in patch 1 as suggested by Michal
- Do not use spin_lock_irqsave/spin_unlock_irqrestore when we know we
  are in task context as suggested by Michal
- Remove unnecessary INIT_LIST_HEAD() as suggested by Muchun

v1 -> v2
- Drop Roman's cma_release_nowait() patches and just change CMA mutex
  to an IRQ safe spinlock.
- Cleanups to variable names, comments and commit messages as suggested
  by Michal, Oscar, Miaohe and Muchun.
- Dropped unnecessary INIT_LIST_HEAD as suggested by Michal and list_del
  as suggested by Muchun.
- Created update_and_free_pages_bulk helper as suggested by Michal.
- Rebased on v5.12-rc4-mmotm-2021-03-28-16-37
- Added Acked-by: and Reviewed-by: from v1

RFC -> v1
- Add Roman's cma_release_nowait() patches.  This eliminated the need
  to do a workqueue handoff in hugetlb code.
- Use Michal's suggestion to batch pages for freeing.  This eliminated
  the need to recalculate loop control variables when dropping the lock.
- Added lockdep_assert_held() calls
- Rebased to v5.12-rc3-mmotm-2021-03-17-22-24


Mike Kravetz (8):
  mm/cma: change cma mutex to irq safe spinlock
  hugetlb: no need to drop hugetlb_lock to call cma_release
  hugetlb: add per-hstate mutex to synchronize user adjustments
  hugetlb: create remove_hugetlb_page() to separate functionality
  hugetlb: call update_and_free_page without hugetlb_lock
  hugetlb: change free_pool_huge_page to remove_pool_huge_page
  hugetlb: make free_huge_page irq safe
  hugetlb: add lockdep_assert_held() calls for hugetlb_lock

 include/linux/hugetlb.h |   1 +
 mm/cma.c                |  18 +-
 mm/cma.h                |   2 +-
 mm/cma_debug.c          |   8 +-
 mm/hugetlb.c            | 384 +++++++++++++++++++++-------------------
 mm/hugetlb_cgroup.c     |   8 +-
 6 files changed, 218 insertions(+), 203 deletions(-)

-- 
2.30.2



* [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-06  9:17   ` Peter Zijlstra
                     ` (3 more replies)
  2021-04-05 23:00 ` [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
                   ` (7 subsequent siblings)
  8 siblings, 4 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

cma_release is currently a sleepable operation because the bitmap
manipulation is protected by the cma->lock mutex.  Hugetlb code, which
relies on cma_release for CMA backed (giga) hugetlb pages, however,
needs to be irq safe.

The lock doesn't protect any sleepable operation, so it can be changed
to an (irq aware) spin lock.  The bitmap processing should be quite
fast in the typical case, but if cma sizes grow to TB then we will
likely need to replace the lock with a more optimized bitmap
implementation.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/cma.c       | 18 +++++++++---------
 mm/cma.h       |  2 +-
 mm/cma_debug.c |  8 ++++----
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index f3bca4178c7f..995e15480937 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -24,7 +24,6 @@
 #include <linux/memblock.h>
 #include <linux/err.h>
 #include <linux/mm.h>
-#include <linux/mutex.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/log2.h>
@@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
 			     unsigned long count)
 {
 	unsigned long bitmap_no, bitmap_count;
+	unsigned long flags;
 
 	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
 	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
 
-	mutex_lock(&cma->lock);
+	spin_lock_irqsave(&cma->lock, flags);
 	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-	mutex_unlock(&cma->lock);
+	spin_unlock_irqrestore(&cma->lock, flags);
 }
 
 static void __init cma_activate_area(struct cma *cma)
@@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
 	     pfn += pageblock_nr_pages)
 		init_cma_reserved_pageblock(pfn_to_page(pfn));
 
-	mutex_init(&cma->lock);
+	spin_lock_init(&cma->lock);
 
 #ifdef CONFIG_CMA_DEBUGFS
 	INIT_HLIST_HEAD(&cma->mem_head);
@@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
 	unsigned long nr_part, nr_total = 0;
 	unsigned long nbits = cma_bitmap_maxno(cma);
 
-	mutex_lock(&cma->lock);
+	spin_lock_irq(&cma->lock);
 	pr_info("number of available pages: ");
 	for (;;) {
 		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
@@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
 		start = next_zero_bit + nr_zero;
 	}
 	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
-	mutex_unlock(&cma->lock);
+	spin_unlock_irq(&cma->lock);
 }
 #else
 static inline void cma_debug_show_areas(struct cma *cma) { }
@@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 		goto out;
 
 	for (;;) {
-		mutex_lock(&cma->lock);
+		spin_lock_irq(&cma->lock);
 		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
 				bitmap_maxno, start, bitmap_count, mask,
 				offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(&cma->lock);
+			spin_unlock_irq(&cma->lock);
 			break;
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
@@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 		 * our exclusive use. If the migration fails we will take the
 		 * lock again and unmark it.
 		 */
-		mutex_unlock(&cma->lock);
+		spin_unlock_irq(&cma->lock);
 
 		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
 		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
diff --git a/mm/cma.h b/mm/cma.h
index 68ffad4e430d..2c775877eae2 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -15,7 +15,7 @@ struct cma {
 	unsigned long   count;
 	unsigned long   *bitmap;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
-	struct mutex    lock;
+	spinlock_t	lock;
 #ifdef CONFIG_CMA_DEBUGFS
 	struct hlist_head mem_head;
 	spinlock_t mem_head_lock;
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa34fdc..2e7704955f4f 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
 	struct cma *cma = data;
 	unsigned long used;
 
-	mutex_lock(&cma->lock);
+	spin_lock_irq(&cma->lock);
 	/* pages counter is smaller than sizeof(int) */
 	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
-	mutex_unlock(&cma->lock);
+	spin_unlock_irq(&cma->lock);
 	*val = (u64)used << cma->order_per_bit;
 
 	return 0;
@@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
 	unsigned long start, end = 0;
 	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
 
-	mutex_lock(&cma->lock);
+	spin_lock_irq(&cma->lock);
 	for (;;) {
 		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
 		if (start >= bitmap_maxno)
@@ -61,7 +61,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
 		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
 		maxchunk = max(end - start, maxchunk);
 	}
-	mutex_unlock(&cma->lock);
+	spin_unlock_irq(&cma->lock);
 	*val = (u64)maxchunk << cma->order_per_bit;
 
 	return 0;
-- 
2.30.2



* [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-06  7:18   ` Oscar Salvador
  2021-04-07  9:26   ` David Hildenbrand
  2021-04-05 23:00 ` [PATCH v4 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

Now that cma_release is non-blocking and irq safe, there is no need to
drop hugetlb_lock before calling it.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/hugetlb.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c3e4baa4156..1d62f0492e7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		/*
-		 * Temporarily drop the hugetlb_lock, because
-		 * we might block in free_gigantic_page().
-		 */
-		spin_unlock(&hugetlb_lock);
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
-		spin_lock(&hugetlb_lock);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
-- 
2.30.2



* [PATCH v4 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-04-05 23:00 ` [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-07  9:28   ` David Hildenbrand
  2021-04-05 23:00 ` [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

The helper routine hstate_next_node_to_alloc accesses and modifies the
hstate variable next_nid_to_alloc.  The helper is used by the routines
alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
called with hugetlb_lock held.  However, alloc_pool_huge_page cannot
be called with the hugetlb lock held as it will call the page
allocator.  Two instances of alloc_pool_huge_page could run in
parallel, or alloc_pool_huge_page could run in parallel with
adjust_pool_surplus, which may result in the variable next_nid_to_alloc
becoming invalid for the caller and pages being allocated on the wrong
node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from
the routine set_max_huge_pages after boot.  set_max_huge_pages is only
called as the result of a user writing to the proc/sysfs nr_hugepages,
or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of
hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
allow one hugetlb page adjustment at a time.  This will synchronize
modifications to the next_nid_to_alloc variable.
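
The lock nesting this introduces, as used in set_max_huge_pages()
below (sketch only):

    mutex_lock(&h->resize_lock);    /* serialize pool size adjustments  */
    spin_lock(&hugetlb_lock);       /* protect counters and page lists  */
    /* ... adjust pool; hugetlb_lock may be dropped to call allocator ... */
    spin_unlock(&hugetlb_lock);
    mutex_unlock(&h->resize_lock);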

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c            | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9b78e82652f..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+	struct mutex resize_lock;
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d62f0492e7b..8497a3598c86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	else
 		return -ENOMEM;
 
+	/*
+	 * resize_lock mutex prevents concurrent adjustments to number of
+	 * pages in hstate via the proc/sysfs interfaces.
+	 */
+	mutex_lock(&h->resize_lock);
 	spin_lock(&hugetlb_lock);
 
 	/*
@@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			mutex_unlock(&h->resize_lock);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
@@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
 
 	NODEMASK_FREE(node_alloc_noretry);
 
@@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order == 0);
 	h = &hstates[hugetlb_max_hstate++];
+	mutex_init(&h->resize_lock);
 	h->order = order;
 	h->mask = ~(huge_page_size(h) - 1);
 	for (i = 0; i < MAX_NUMNODES; ++i)
-- 
2.30.2



* [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-04-05 23:00 ` [PATCH v4 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-06  9:56   ` Michal Hocko
                     ` (2 more replies)
  2021-04-05 23:00 ` [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.

update_and_free_page no longer decrements h->nr_huge_pages{_node} as
this is performed in remove_hugetlb_page.  The only functionality
performed by update_and_free_page is to free the base pages to the lower
level allocators.

update_and_free_page is typically called after remove_hugetlb_page.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.
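
The split enables the calling pattern that the following patches move
to (sketch only; this patch itself still calls update_and_free_page
with the lock held):

    spin_lock(&hugetlb_lock);
    remove_hugetlb_page(h, page, false);    /* unlink page, fix counters */
    spin_unlock(&hugetlb_lock);
    update_and_free_page(h, page);          /* free to low level allocators */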

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 88 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8497a3598c86..df2a3d1f632b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1055,18 +1055,13 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
-static void __enqueue_huge_page(struct list_head *list, struct page *page)
-{
-	list_move(&page->lru, list);
-	SetHPageFreed(page);
-}
-
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	__enqueue_huge_page(&h->hugepage_freelists[nid], page);
+	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
+	SetHPageFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1331,6 +1326,43 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
+/*
+ * Remove hugetlb page from lists, and update dtor so that page appears
+ * as just a compound page.  A reference is held on the page.
+ *
+ * Must be called with hugetlb lock held.
+ */
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	int nid = page_to_nid(page);
+
+	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+		return;
+
+	list_del(&page->lru);
+
+	if (HPageFreed(page)) {
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ClearHPageFreed(page);
+	}
+	if (adjust_surplus) {
+		h->surplus_huge_pages--;
+		h->surplus_huge_pages_node[nid]--;
+	}
+
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+
+	ClearHPageTemporary(page);
+	set_page_refcounted(page);
+	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+
+	h->nr_huge_pages--;
+	h->nr_huge_pages_node[nid]--;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1339,8 +1371,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
-	h->nr_huge_pages--;
-	h->nr_huge_pages_node[page_to_nid(page)]--;
 	for (i = 0; i < pages_per_huge_page(h);
 	     i++, subpage = mem_map_next(subpage, page, i)) {
 		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1348,10 +1378,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
-	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
-	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
@@ -1419,15 +1445,12 @@ static void __free_huge_page(struct page *page)
 		h->resv_huge_pages++;
 
 	if (HPageTemporary(page)) {
-		list_del(&page->lru);
-		ClearHPageTemporary(page);
+		remove_hugetlb_page(h, page, false);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
-		list_del(&page->lru);
+		remove_hugetlb_page(h, page, true);
 		update_and_free_page(h, page);
-		h->surplus_huge_pages--;
-		h->surplus_huge_pages_node[nid]--;
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
@@ -1712,13 +1735,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 			struct page *page =
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[node]--;
-			if (acct_surplus) {
-				h->surplus_huge_pages--;
-				h->surplus_huge_pages_node[node]--;
-			}
+			remove_hugetlb_page(h, page, acct_surplus);
 			update_and_free_page(h, page);
 			ret = 1;
 			break;
@@ -1756,7 +1773,6 @@ int dissolve_free_huge_page(struct page *page)
 	if (!page_count(page)) {
 		struct page *head = compound_head(page);
 		struct hstate *h = page_hstate(head);
-		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
 
@@ -1787,9 +1803,7 @@ int dissolve_free_huge_page(struct page *page)
 			SetPageHWPoison(page);
 			ClearPageHWPoison(head);
 		}
-		list_del(&head->lru);
-		h->free_huge_pages--;
-		h->free_huge_pages_node[nid]--;
+		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
 		update_and_free_page(h, head);
 		rc = 0;
@@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		/*
 		 * Freed from under us. Drop new_page too.
 		 */
+		remove_hugetlb_page(h, new_page, false);
 		update_and_free_page(h, new_page);
 		goto unlock;
 	} else if (page_count(old_page)) {
@@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * Someone has grabbed the page, try to isolate it here.
 		 * Fail with -EBUSY if not possible.
 		 */
+		remove_hugetlb_page(h, new_page, false);
 		update_and_free_page(h, new_page);
 		spin_unlock(&hugetlb_lock);
 		if (!isolate_huge_page(old_page, list))
@@ -2323,13 +2339,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		/*
 		 * Ok, old_page is still a genuine free hugepage. Replace it
 		 * with the new one.
+		 * Note: h->free_huge_pages{_node} counters are decremented
+		 * in remove_hugetlb_page for old_page and incremented in
+		 * enqueue_huge_page for new page.  Net result is no change.
 		 */
-		list_del(&old_page->lru);
+		remove_hugetlb_page(h, old_page, false);
 		update_and_free_page(h, old_page);
-		/*
-		 * h->free_huge_pages{_node} counters do not need to be updated.
-		 */
-		__enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
+		enqueue_huge_page(h, new_page);
 	}
 unlock:
 	spin_unlock(&hugetlb_lock);
@@ -2667,10 +2683,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 				return;
 			if (PageHighMem(page))
 				continue;
-			list_del(&page->lru);
+			remove_hugetlb_page(h, page, false);
 			update_and_free_page(h, page);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[page_to_nid(page)]--;
 		}
 	}
 }
-- 
2.30.2



* [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (3 preceding siblings ...)
  2021-04-05 23:00 ` [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-06  9:57   ` Michal Hocko
  2021-04-07  8:27   ` Oscar Salvador
  2021-04-05 23:00 ` [PATCH v4 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock.  Change all callers to
drop the lock before calling it.

With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df2a3d1f632b..be6031a8e2a9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1446,16 +1446,18 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
+		spin_unlock(&hugetlb_lock);
 	}
-	spin_unlock(&hugetlb_lock);
 }
 
 /*
@@ -1736,7 +1738,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 				list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
+			/*
+			 * unlock/lock around update_and_free_page is temporary
+			 * and will be removed with subsequent patch.
+			 */
+			spin_unlock(&hugetlb_lock);
 			update_and_free_page(h, page);
+			spin_lock(&hugetlb_lock);
 			ret = 1;
 			break;
 		}
@@ -1805,8 +1813,9 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
+		spin_unlock(&hugetlb_lock);
 		update_and_free_page(h, head);
-		rc = 0;
+		return 0;
 	}
 out:
 	spin_unlock(&hugetlb_lock);
@@ -2291,6 +2300,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nid = page_to_nid(old_page);
 	struct page *new_page;
+	struct page *page_to_free;
 	int ret = 0;
 
 	/*
@@ -2313,16 +2323,16 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * Freed from under us. Drop new_page too.
 		 */
 		remove_hugetlb_page(h, new_page, false);
-		update_and_free_page(h, new_page);
-		goto unlock;
+		page_to_free = new_page;
+		goto unlock_free;
 	} else if (page_count(old_page)) {
 		/*
 		 * Someone has grabbed the page, try to isolate it here.
 		 * Fail with -EBUSY if not possible.
 		 */
 		remove_hugetlb_page(h, new_page, false);
-		update_and_free_page(h, new_page);
 		spin_unlock(&hugetlb_lock);
+		update_and_free_page(h, new_page);
 		if (!isolate_huge_page(old_page, list))
 			ret = -EBUSY;
 		return ret;
@@ -2344,11 +2354,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * enqueue_huge_page for new page.  Net result is no change.
 		 */
 		remove_hugetlb_page(h, old_page, false);
-		update_and_free_page(h, old_page);
 		enqueue_huge_page(h, new_page);
+		page_to_free = old_page;
 	}
-unlock:
+unlock_free:
 	spin_unlock(&hugetlb_lock);
+	update_and_free_page(h, page_to_free);
 
 	return ret;
 }
@@ -2671,22 +2682,34 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 						nodemask_t *nodes_allowed)
 {
 	int i;
+	struct page *page, *next;
+	LIST_HEAD(page_list);
 
 	if (hstate_is_gigantic(h))
 		return;
 
+	/*
+	 * Collect pages to be freed on a list, and free after dropping lock
+	 */
 	for_each_node_mask(i, *nodes_allowed) {
-		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
-				return;
+				goto out;
 			if (PageHighMem(page))
 				continue;
 			remove_hugetlb_page(h, page, false);
-			update_and_free_page(h, page);
+			list_add(&page->lru, &page_list);
 		}
 	}
+
+out:
+	spin_unlock(&hugetlb_lock);
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+	spin_lock(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
-- 
2.30.2



* [PATCH v4 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (4 preceding siblings ...)
  2021-04-05 23:00 ` [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-07  8:44   ` Oscar Salvador
  2021-04-05 23:00 ` [PATCH v4 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page,
and it is the responsibility of the caller to free the page to the
lower level allocators.  The hugetlb_lock is dropped before freeing to
these allocators, which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.

Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
race when freeing surplus pages") modified this routine to address a
race which could occur when dropping the hugetlb_lock in the loop that
removes pool pages.  Accounting changes introduced in that commit were
subtle and took some thought to understand.  This commit removes the
cond_resched_lock() and the potential race.  Therefore, remove the
subtle code and restore the more straightforward accounting,
effectively reverting that commit.
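
The resulting caller shape (sketch, simplified from the
set_max_huge_pages() hunk below; need_to_shrink_pool() is only a
placeholder for the real loop condition):

    LIST_HEAD(page_list);

    /* with hugetlb_lock held */
    while (need_to_shrink_pool(h)) {        /* placeholder condition */
            page = remove_pool_huge_page(h, nodes_allowed, 0);
            if (!page)
                    break;
            list_add(&page->lru, &page_list);
    }

    /* free the pages after dropping lock */
    spin_unlock(&hugetlb_lock);
    update_and_free_pages_bulk(h, &page_list);
    spin_lock(&hugetlb_lock);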

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/hugetlb.c | 93 ++++++++++++++++++++++++++++------------------------
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index be6031a8e2a9..93a2a11b9376 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1204,7 +1204,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page.  Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1386,6 +1386,16 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	}
 }
 
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+{
+	struct page *page, *t_page;
+
+	list_for_each_entry_safe(page, t_page, list, lru) {
+		update_and_free_page(h, page);
+		cond_resched();
+	}
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
 	struct hstate *h;
@@ -1716,16 +1726,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free.  Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page.  The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-							 bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+						nodemask_t *nodes_allowed,
+						 bool acct_surplus)
 {
 	int nr_nodes, node;
-	int ret = 0;
+	struct page *page = NULL;
 
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
@@ -1734,23 +1746,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 		 */
 		if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
 		    !list_empty(&h->hugepage_freelists[node])) {
-			struct page *page =
-				list_entry(h->hugepage_freelists[node].next,
+			page = list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
-			/*
-			 * unlock/lock around update_and_free_page is temporary
-			 * and will be removed with subsequent patch.
-			 */
-			spin_unlock(&hugetlb_lock);
-			update_and_free_page(h, page);
-			spin_lock(&hugetlb_lock);
-			ret = 1;
 			break;
 		}
 	}
 
-	return ret;
+	return page;
 }
 
 /*
@@ -2070,17 +2073,16 @@ static int gather_surplus_pages(struct hstate *h, long delta)
  *    to the associated reservation map.
  * 2) Free any unused surplus pages that may have been allocated to satisfy
  *    the reservation.  As many as unused_resv_pages may be freed.
- *
- * Called with hugetlb_lock held.  However, the lock could be dropped (and
- * reacquired) during calls to cond_resched_lock.  Whenever dropping the lock,
- * we must make sure nobody else can claim pages we are in the process of
- * freeing.  Do this by ensuring resv_huge_page always is greater than the
- * number of huge pages we plan to free when dropping the lock.
  */
 static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	struct page *page;
+	LIST_HEAD(page_list);
+
+	/* Uncommit the reservation */
+	h->resv_huge_pages -= unused_resv_pages;
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -2097,24 +2099,21 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * evenly across all nodes with memory. Iterate across these nodes
 	 * until we can no longer free unreserved surplus pages. This occurs
 	 * when the nodes with surplus pages have no free pages.
-	 * free_pool_huge_page() will balance the freed pages across the
+	 * remove_pool_huge_page() will balance the freed pages across the
 	 * on-line nodes with memory and will handle the hstate accounting.
-	 *
-	 * Note that we decrement resv_huge_pages as we free the pages.  If
-	 * we drop the lock, resv_huge_pages will still be sufficiently large
-	 * to cover subsequent pages we may free.
 	 */
 	while (nr_pages--) {
-		h->resv_huge_pages--;
-		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
+		if (!page)
 			goto out;
-		cond_resched_lock(&hugetlb_lock);
+
+		list_add(&page->lru, &page_list);
 	}
 
 out:
-	/* Fully uncommit the reservation */
-	h->resv_huge_pages -= unused_resv_pages;
+	spin_unlock(&hugetlb_lock);
+	update_and_free_pages_bulk(h, &page_list);
+	spin_lock(&hugetlb_lock);
 }
 
 
@@ -2682,7 +2681,6 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 						nodemask_t *nodes_allowed)
 {
 	int i;
-	struct page *page, *next;
 	LIST_HEAD(page_list);
 
 	if (hstate_is_gigantic(h))
@@ -2692,6 +2690,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	 * Collect pages to be freed on a list, and free after dropping lock
 	 */
 	for_each_node_mask(i, *nodes_allowed) {
+		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
@@ -2705,10 +2704,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 
 out:
 	spin_unlock(&hugetlb_lock);
-	list_for_each_entry_safe(page, next, &page_list, lru) {
-		update_and_free_page(h, page);
-		cond_resched();
-	}
+	update_and_free_pages_bulk(h, &page_list);
 	spin_lock(&hugetlb_lock);
 }
 #else
@@ -2755,6 +2751,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	struct page *page;
+	LIST_HEAD(page_list);
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
 	/*
@@ -2867,11 +2865,22 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count, nodes_allowed);
+
+	/*
+	 * Collect pages to be removed on list without dropping lock
+	 */
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, nodes_allowed, 0))
+		page = remove_pool_huge_page(h, nodes_allowed, 0);
+		if (!page)
 			break;
-		cond_resched_lock(&hugetlb_lock);
+
+		list_add(&page->lru, &page_list);
 	}
+	/* free the pages after dropping lock */
+	spin_unlock(&hugetlb_lock);
+	update_and_free_pages_bulk(h, &page_list);
+	spin_lock(&hugetlb_lock);
+
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
-- 
2.30.2



* [PATCH v4 7/8] hugetlb: make free_huge_page irq safe
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (5 preceding siblings ...)
  2021-04-05 23:00 ` [PATCH v4 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-07  9:12   ` Oscar Salvador
  2021-04-05 23:00 ` [PATCH v4 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
  2021-04-08  0:56 ` [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  8 siblings, 1 reply; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context.  That commit hands off free_huge_page
processing to a workqueue if !in_task.  However, this doesn't cover
all the cases, as pointed out by the 0day bot lockdep report [1]:

:  Possible interrupt unsafe locking scenario:
:
:        CPU0                    CPU1
:        ----                    ----
:   lock(hugetlb_lock);
:                                local_irq_disable();
:                                lock(slock-AF_INET);
:                                lock(hugetlb_lock);
:   <Interrupt>
:     lock(slock-AF_INET);

Shakeel later explained that this is very likely the TCP TX zerocopy
from hugetlb pages scenario, where the networking code drops the last
reference to a hugetlb page while having IRQs disabled.  The hugetlb
freeing path doesn't disable IRQs while holding hugetlb_lock, so a lock
dependency chain can lead to a deadlock.

This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe.  This is mostly a simple process of
  changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
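
The conversion follows one simple rule (sketch only): paths that can
run with IRQs already disabled, such as free_huge_page() and the
subpool put path, use the irqsave variants; paths known to execute in
task context use spin_lock_irq():

    /* reachable from any context, e.g. put_page() with IRQs disabled */
    spin_lock_irqsave(&hugetlb_lock, flags);
    /* ... */
    spin_unlock_irqrestore(&hugetlb_lock, flags);

    /* proc/sysfs and other task-context paths */
    spin_lock_irq(&hugetlb_lock);
    /* ... */
    spin_unlock_irq(&hugetlb_lock);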

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c        | 183 +++++++++++++++++---------------------------
 mm/hugetlb_cgroup.c |   8 +-
 2 files changed, 74 insertions(+), 117 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 93a2a11b9376..15b6e7aadb52 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
 	return true;
 }
 
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+						unsigned long irq_flags)
 {
-	spin_unlock(&spool->lock);
+	spin_unlock_irqrestore(&spool->lock, irq_flags);
 
 	/* If no pages are used, and no other handles to the subpool
 	 * remain, give up any reservations based on minimum size and
@@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 
 void hugepage_put_subpool(struct hugepage_subpool *spool)
 {
-	spin_lock(&spool->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&spool->lock, flags);
 	BUG_ON(!spool->count);
 	spool->count--;
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 }
 
 /*
@@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	if (!spool)
 		return ret;
 
-	spin_lock(&spool->lock);
+	spin_lock_irq(&spool->lock);
 
 	if (spool->max_hpages != -1) {		/* maximum size accounting */
 		if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
 	}
 
 unlock_ret:
-	spin_unlock(&spool->lock);
+	spin_unlock_irq(&spool->lock);
 	return ret;
 }
 
@@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 				       long delta)
 {
 	long ret = delta;
+	unsigned long flags;
 
 	if (!spool)
 		return delta;
 
-	spin_lock(&spool->lock);
+	spin_lock_irqsave(&spool->lock, flags);
 
 	if (spool->max_hpages != -1)		/* maximum size accounting */
 		spool->used_hpages -= delta;
@@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
 	 * If hugetlbfs_put_super couldn't free spool due to an outstanding
 	 * quota reference, free it now.
 	 */
-	unlock_or_release_subpool(spool);
+	unlock_or_release_subpool(spool, flags);
 
 	return ret;
 }
@@ -1407,7 +1411,7 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1417,6 +1421,7 @@ static void __free_huge_page(struct page *page)
 	int nid = page_to_nid(page);
 	struct hugepage_subpool *spool = hugetlb_page_subpool(page);
 	bool restore_reserve;
+	unsigned long flags;
 
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1445,7 +1450,7 @@ static void __free_huge_page(struct page *page)
 			restore_reserve = true;
 	}
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	ClearHPageMigratable(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
@@ -1456,66 +1461,18 @@ static void __free_huge_page(struct page *page)
 
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_page(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
-		spin_unlock(&hugetlb_lock);
-	}
-}
-
-/*
- * As free_huge_page() can be called from a non-task context, we have
- * to defer the actual freeing in a workqueue to prevent potential
- * hugetlb_lock deadlock.
- *
- * free_hpage_workfn() locklessly retrieves the linked list of pages to
- * be freed and frees them one-by-one. As the page->mapping pointer is
- * going to be cleared in __free_huge_page() anyway, it is reused as the
- * llist_node structure of a lockless linked list of huge pages to be freed.
- */
-static LLIST_HEAD(hpage_freelist);
-
-static void free_hpage_workfn(struct work_struct *work)
-{
-	struct llist_node *node;
-	struct page *page;
-
-	node = llist_del_all(&hpage_freelist);
-
-	while (node) {
-		page = container_of((struct address_space **)node,
-				     struct page, mapping);
-		node = node->next;
-		__free_huge_page(page);
-	}
-}
-static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
-
-void free_huge_page(struct page *page)
-{
-	/*
-	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
-	 */
-	if (!in_task()) {
-		/*
-		 * Only call schedule_work() if hpage_freelist is previously
-		 * empty. Otherwise, schedule_work() had been called but the
-		 * workfn hasn't retrieved the list yet.
-		 */
-		if (llist_add((struct llist_node *)&page->mapping,
-			      &hpage_freelist))
-			schedule_work(&free_hpage_work);
-		return;
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
-
-	__free_huge_page(page);
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -1525,11 +1482,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	hugetlb_set_page_subpool(page, NULL);
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	ClearHPageFreed(page);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 }
 
 static void prep_compound_gigantic_page(struct page *page, unsigned int order)
@@ -1775,7 +1732,7 @@ int dissolve_free_huge_page(struct page *page)
 	if (!PageHuge(page))
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHuge(page)) {
 		rc = 0;
 		goto out;
@@ -1792,7 +1749,7 @@ int dissolve_free_huge_page(struct page *page)
 		 * when it is dissolved.
 		 */
 		if (unlikely(!HPageFreed(head))) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			cond_resched();
 
 			/*
@@ -1816,12 +1773,12 @@ int dissolve_free_huge_page(struct page *page)
 		}
 		remove_hugetlb_page(h, page, false);
 		h->max_huge_pages--;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		update_and_free_page(h, head);
 		return 0;
 	}
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return rc;
 }
 
@@ -1863,16 +1820,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
 		goto out_unlock;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * We could have raced with the pool size change.
 	 * Double check that and simply deallocate the new page
@@ -1882,7 +1839,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	 */
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		SetHPageTemporary(page);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		put_page(page);
 		return NULL;
 	} else {
@@ -1891,7 +1848,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	}
 
 out_unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return page;
 }
@@ -1941,17 +1898,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 		nodemask_t *nmask, gfp_t gfp_mask)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
 		struct page *page;
 
 		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
 		if (page) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			return page;
 		}
 	}
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
 }
@@ -1999,7 +1956,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	ret = -ENOMEM;
 retry:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
 				NUMA_NO_NODE, NULL);
@@ -2016,7 +1973,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * After retaking hugetlb_lock, we need to recalculate 'needed'
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) -
 			(h->free_huge_pages + allocated);
 	if (needed > 0) {
@@ -2056,12 +2013,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		enqueue_huge_page(h, page);
 	}
 free:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	/* Free unnecessary surplus pages to the buddy allocator */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
 		put_page(page);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	return ret;
 }
@@ -2111,9 +2068,9 @@ static void return_unused_surplus_pages(struct hstate *h,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 
 
@@ -2316,7 +2273,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 	 */
 	page_ref_dec(new_page);
 retry:
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHuge(old_page)) {
 		/*
 		 * Freed from under us. Drop new_page too.
@@ -2330,7 +2287,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * Fail with -EBUSY if not possible.
 		 */
 		remove_hugetlb_page(h, new_page, false);
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		update_and_free_page(h, new_page);
 		if (!isolate_huge_page(old_page, list))
 			ret = -EBUSY;
@@ -2341,7 +2298,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * freelist yet. Race window is small, so we can succed here if
 		 * we retry.
 		 */
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		cond_resched();
 		goto retry;
 	} else {
@@ -2357,7 +2314,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		page_to_free = old_page;
 	}
 unlock_free:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_page(h, page_to_free);
 
 	return ret;
@@ -2374,15 +2331,15 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 	 * to carefully check the state under the lock.
 	 * Return success when racing as if we dissolved the page ourselves.
 	 */
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (PageHuge(page)) {
 		head = compound_head(page);
 		h = page_hstate(head);
 	} else {
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		return 0;
 	}
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	/*
 	 * Fence off gigantic pages as there is a cyclic dependency between
@@ -2462,7 +2419,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (ret)
 		goto out_uncharge_cgroup_reservation;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * glb_chg is passed to indicate whether or not a page must be taken
 	 * from the global free pool (global change).  gbl_chg == 0 indicates
@@ -2470,7 +2427,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	 */
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
 	if (!page) {
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 		page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
 		if (!page)
 			goto out_uncharge_cgroup;
@@ -2478,7 +2435,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			SetHPageRestoreReserve(page);
 			h->resv_huge_pages--;
 		}
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		list_add(&page->lru, &h->hugepage_activelist);
 		/* Fall through */
 	}
@@ -2491,7 +2448,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 						  h_cg, page);
 	}
 
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	hugetlb_set_page_subpool(page, spool);
 
@@ -2703,9 +2660,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	}
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 }
 #else
 static inline void try_to_free_low(struct hstate *h, unsigned long count,
@@ -2770,7 +2727,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * pages in hstate via the proc/sysfs interfaces.
 	 */
 	mutex_lock(&h->resize_lock);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	/*
 	 * Check for a node specific request.
@@ -2801,7 +2758,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 */
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			mutex_unlock(&h->resize_lock);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
@@ -2831,14 +2788,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
 		ret = alloc_pool_huge_page(h, nodes_allowed,
 						node_alloc_noretry);
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (!ret)
 			goto out;
 
@@ -2877,9 +2834,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		list_add(&page->lru, &page_list);
 	}
 	/* free the pages after dropping lock */
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, 1))
@@ -2887,7 +2844,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	}
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	mutex_unlock(&h->resize_lock);
 
 	NODEMASK_FREE(node_alloc_noretry);
@@ -3043,9 +3000,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 	if (err)
 		return err;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h->nr_overcommit_huge_pages = input;
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 
 	return count;
 }
@@ -3632,9 +3589,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 		goto out;
 
 	if (write) {
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		h->nr_overcommit_huge_pages = tmp;
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 out:
 	return ret;
@@ -3730,7 +3687,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	if (!delta)
 		return 0;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * When cpuset is configured, it breaks the strict hugetlb page
 	 * reservation as the accounting is done on a global variable. Such
@@ -3769,7 +3726,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 		return_unused_surplus_pages(h, (unsigned long) -delta);
 
 out:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
@@ -5833,7 +5790,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 {
 	bool ret = true;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	if (!PageHeadHuge(page) ||
 	    !HPageMigratable(page) ||
 	    !get_page_unless_zero(page)) {
@@ -5843,16 +5800,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 	ClearHPageMigratable(page);
 	list_move_tail(&page->lru, list);
 unlock:
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return ret;
 }
 
 void putback_active_hugepage(struct page *page)
 {
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	SetHPageMigratable(page);
 	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	put_page(page);
 }
 
@@ -5886,12 +5843,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		 */
 		if (new_nid == old_nid)
 			return;
-		spin_lock(&hugetlb_lock);
+		spin_lock_irq(&hugetlb_lock);
 		if (h->surplus_huge_pages_node[old_nid]) {
 			h->surplus_huge_pages_node[old_nid]--;
 			h->surplus_huge_pages_node[new_nid]++;
 		}
-		spin_unlock(&hugetlb_lock);
+		spin_unlock_irq(&hugetlb_lock);
 	}
 }
 
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 726b85f4f303..5383023d0cca 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	do {
 		idx = 0;
 		for_each_hstate(h) {
-			spin_lock(&hugetlb_lock);
+			spin_lock_irq(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
 				hugetlb_cgroup_move_parent(idx, h_cg, page);
 
-			spin_unlock(&hugetlb_lock);
+			spin_unlock_irq(&hugetlb_lock);
 			idx++;
 		}
 		cond_resched();
@@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	if (hugetlb_cgroup_disabled())
 		return;
 
-	spin_lock(&hugetlb_lock);
+	spin_lock_irq(&hugetlb_lock);
 	h_cg = hugetlb_cgroup_from_page(oldhpage);
 	h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
 	set_hugetlb_cgroup(oldhpage, NULL);
@@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	set_hugetlb_cgroup(newhpage, h_cg);
 	set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
 	list_move(&newhpage->lru, &h->hugepage_activelist);
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_irq(&hugetlb_lock);
 	return;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (6 preceding siblings ...)
  2021-04-05 23:00 ` [PATCH v4 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
@ 2021-04-05 23:00 ` Mike Kravetz
  2021-04-07  9:13   ` Oscar Salvador
  2021-04-08  0:56 ` [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
  8 siblings, 1 reply; 37+ messages in thread
From: Mike Kravetz @ 2021-04-05 23:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	David Hildenbrand, Muchun Song, David Rientjes, Miaohe Lin,
	Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton,
	Mike Kravetz

After making the hugetlb lock irq safe and separating some functionality
done under the lock, add some lockdep_assert_held() calls to help verify
locking.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 15b6e7aadb52..5d9f74e2b963 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1062,6 +1062,8 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
+
+	lockdep_assert_held(&hugetlb_lock);
 	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
@@ -1073,6 +1075,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 	struct page *page;
 	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
+	lockdep_assert_held(&hugetlb_lock);
 	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
 		if (pin && !is_pinnable_page(page))
 			continue;
@@ -1341,6 +1344,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 {
 	int nid = page_to_nid(page);
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
@@ -1696,6 +1700,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
 	int nr_nodes, node;
 	struct page *page = NULL;
 
+	lockdep_assert_held(&hugetlb_lock);
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
 		 * If we're returning unused surplus pages, only examine
@@ -1945,6 +1950,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	long needed, allocated;
 	bool alloc_ok = true;
 
+	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
@@ -2038,6 +2044,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	struct page *page;
 	LIST_HEAD(page_list);
 
+	lockdep_assert_held(&hugetlb_lock);
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
@@ -2640,6 +2647,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 	int i;
 	LIST_HEAD(page_list);
 
+	lockdep_assert_held(&hugetlb_lock);
 	if (hstate_is_gigantic(h))
 		return;
 
@@ -2681,6 +2689,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 {
 	int nr_nodes, node;
 
+	lockdep_assert_held(&hugetlb_lock);
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release
  2021-04-05 23:00 ` [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
@ 2021-04-06  7:18   ` Oscar Salvador
  2021-04-07  9:26   ` David Hildenbrand
  1 sibling, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-06  7:18 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:37PM -0700, Mike Kravetz wrote:
> Now that cma_release is non-blocking and irq safe, there is no need to
> drop hugetlb_lock before calling.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
>  mm/hugetlb.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c3e4baa4156..1d62f0492e7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>  	set_page_refcounted(page);
>  	if (hstate_is_gigantic(h)) {
> -		/*
> -		 * Temporarily drop the hugetlb_lock, because
> -		 * we might block in free_gigantic_page().
> -		 */
> -		spin_unlock(&hugetlb_lock);
>  		destroy_compound_gigantic_page(page, huge_page_order(h));
>  		free_gigantic_page(page, huge_page_order(h));
> -		spin_lock(&hugetlb_lock);
>  	} else {
>  		__free_pages(page, huge_page_order(h));
>  	}
> -- 
> 2.30.2
> 

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
@ 2021-04-06  9:17   ` Peter Zijlstra
  2021-04-06  9:35   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2021-04-06  9:17 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, Oscar Salvador, David Hildenbrand, Muchun Song,
	David Rientjes, Miaohe Lin, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:36PM -0700, Mike Kravetz wrote:
> The lock doesn't protect any sleepable operation so it can be changed to
> a (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.

Or an rb-tree that stores ranges, which should also be much cheaper
storage wise, for those sizes.
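
Very roughly, something like the below (a hypothetical sketch, nothing
from this series; the struct and function names are made up):

#include <linux/rbtree.h>

/* one node per free range, keyed by the starting pfn */
struct cma_free_range {
	struct rb_node node;
	unsigned long start_pfn;
	unsigned long nr_pages;
};

/* caller holds cma->lock (the irq safe spinlock from patch 1) */
static void cma_insert_free_range(struct rb_root *root,
				  struct cma_free_range *range)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct cma_free_range *r;

		parent = *link;
		r = rb_entry(parent, struct cma_free_range, node);
		if (range->start_pfn < r->start_pfn)
			link = &parent->rb_left;
		else
			link = &parent->rb_right;
	}
	rb_link_node(&range->node, parent, link);
	rb_insert_color(&range->node, root);
}

Allocation would then walk and merge ranges instead of scanning a bitmap,
so storage scales with the number of free ranges rather than the size of
the area.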

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-04-06  9:17   ` Peter Zijlstra
@ 2021-04-06  9:35   ` Michal Hocko
  2021-04-07  9:23   ` David Hildenbrand
  2021-04-07 18:24   ` Roman Gushchin
  3 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2021-04-06  9:35 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon 05-04-21 16:00:36, Mike Kravetz wrote:
> cma_release is currently a sleepable operatation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> a (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

I believe I have acked the previous version already. Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/cma.c       | 18 +++++++++---------
>  mm/cma.h       |  2 +-
>  mm/cma_debug.c |  8 ++++----
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index f3bca4178c7f..995e15480937 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include <linux/memblock.h>
>  #include <linux/err.h>
>  #include <linux/mm.h>
> -#include <linux/mutex.h>
>  #include <linux/sizes.h>
>  #include <linux/slab.h>
>  #include <linux/log2.h>
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
>  			     unsigned long count)
>  {
>  	unsigned long bitmap_no, bitmap_count;
> +	unsigned long flags;
>  
>  	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>  	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>  	     pfn += pageblock_nr_pages)
>  		init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>  	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  	unsigned long nr_part, nr_total = 0;
>  	unsigned long nbits = cma_bitmap_maxno(cma);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	pr_info("number of available pages: ");
>  	for (;;) {
>  		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  		start = next_zero_bit + nr_zero;
>  	}
>  	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  		goto out;
>  
>  	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irq(&cma->lock);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>  				bitmap_maxno, start, bitmap_count, mask,
>  				offset);
>  		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irq(&cma->lock);
>  			break;
>  		}
>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  		 * our exclusive use. If the migration fails we will take the
>  		 * lock again and unmark it.
>  		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irq(&cma->lock);
>  
>  		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>  		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>  	unsigned long   count;
>  	unsigned long   *bitmap;
>  	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>  	struct hlist_head mem_head;
>  	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>  	struct cma *cma = data;
>  	unsigned long used;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	/* pages counter is smaller than sizeof(int) */
>  	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  	*val = (u64)used << cma->order_per_bit;
>  
>  	return 0;
> @@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  	unsigned long start, end = 0;
>  	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	for (;;) {
>  		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>  		if (start >= bitmap_maxno)
> @@ -61,7 +61,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>  		maxchunk = max(end - start, maxchunk);
>  	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  	*val = (u64)maxchunk << cma->order_per_bit;
>  
>  	return 0;
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-05 23:00 ` [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
@ 2021-04-06  9:56   ` Michal Hocko
  2021-04-06 12:50     ` Oscar Salvador
  2021-04-06 16:49     ` Mike Kravetz
  2021-04-06 13:41   ` Oscar Salvador
  2021-04-06 13:44   ` Oscar Salvador
  2 siblings, 2 replies; 37+ messages in thread
From: Michal Hocko @ 2021-04-06  9:56 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon 05-04-21 16:00:39, Mike Kravetz wrote:
> The new remove_hugetlb_page() routine is designed to remove a hugetlb
> page from hugetlbfs processing.  It will remove the page from the active
> or free list, update global counters and set the compound page
> destructor to NULL so that PageHuge() will return false for the 'page'.
> After this call, the 'page' can be treated as a normal compound page or
> a collection of base size pages.
> 
> update_and_free_page no longer decrements h->nr_huge_pages{_node} as
> this is performed in remove_hugetlb_page.  The only functionality
> performed by update_and_free_page is to free the base pages to the lower
> level allocators.
> 
> update_and_free_page is typically called after remove_hugetlb_page.
> 
> remove_hugetlb_page is to be called with the hugetlb_lock held.
> 
> Creating this routine and separating functionality is in preparation for
> restructuring code to reduce lock hold times.  This commit should not
> introduce any changes to functionality.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Btw. I would prefer to reverse the ordering of this and Oscar's
patchset. This one is a bug fix which might be interesting for stable
backports while Oscar's work can be seen as a general functionality
improvement.

> @@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		/*
>  		 * Freed from under us. Drop new_page too.
>  		 */
> +		remove_hugetlb_page(h, new_page, false);
>  		update_and_free_page(h, new_page);
>  		goto unlock;
>  	} else if (page_count(old_page)) {
> @@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		 * Someone has grabbed the page, try to isolate it here.
>  		 * Fail with -EBUSY if not possible.
>  		 */
> +		remove_hugetlb_page(h, new_page, false);
>  		update_and_free_page(h, new_page);
>  		spin_unlock(&hugetlb_lock);
>  		if (!isolate_huge_page(old_page, list))

the page is not enqueued anywhere here, so remove_hugetlb_page would blow
up when linked list debugging is enabled.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-04-05 23:00 ` [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
@ 2021-04-06  9:57   ` Michal Hocko
  2021-04-07  8:27   ` Oscar Salvador
  1 sibling, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2021-04-06  9:57 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon 05-04-21 16:00:40, Mike Kravetz wrote:
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock.  Change all callers to
> drop the lock before calling.
> 
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
> 
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Still looks good.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index df2a3d1f632b..be6031a8e2a9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1446,16 +1446,18 @@ static void __free_huge_page(struct page *page)
>  
>  	if (HPageTemporary(page)) {
>  		remove_hugetlb_page(h, page, false);
> +		spin_unlock(&hugetlb_lock);
>  		update_and_free_page(h, page);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		remove_hugetlb_page(h, page, true);
> +		spin_unlock(&hugetlb_lock);
>  		update_and_free_page(h, page);
>  	} else {
>  		arch_clear_hugepage_flags(page);
>  		enqueue_huge_page(h, page);
> +		spin_unlock(&hugetlb_lock);
>  	}
> -	spin_unlock(&hugetlb_lock);
>  }
>  
>  /*
> @@ -1736,7 +1738,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  				list_entry(h->hugepage_freelists[node].next,
>  					  struct page, lru);
>  			remove_hugetlb_page(h, page, acct_surplus);
> +			/*
> +			 * unlock/lock around update_and_free_page is temporary
> +			 * and will be removed with subsequent patch.
> +			 */
> +			spin_unlock(&hugetlb_lock);
>  			update_and_free_page(h, page);
> +			spin_lock(&hugetlb_lock);
>  			ret = 1;
>  			break;
>  		}
> @@ -1805,8 +1813,9 @@ int dissolve_free_huge_page(struct page *page)
>  		}
>  		remove_hugetlb_page(h, page, false);
>  		h->max_huge_pages--;
> +		spin_unlock(&hugetlb_lock);
>  		update_and_free_page(h, head);
> -		rc = 0;
> +		return 0;
>  	}
>  out:
>  	spin_unlock(&hugetlb_lock);
> @@ -2291,6 +2300,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  	int nid = page_to_nid(old_page);
>  	struct page *new_page;
> +	struct page *page_to_free;
>  	int ret = 0;
>  
>  	/*
> @@ -2313,16 +2323,16 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		 * Freed from under us. Drop new_page too.
>  		 */
>  		remove_hugetlb_page(h, new_page, false);
> -		update_and_free_page(h, new_page);
> -		goto unlock;
> +		page_to_free = new_page;
> +		goto unlock_free;
>  	} else if (page_count(old_page)) {
>  		/*
>  		 * Someone has grabbed the page, try to isolate it here.
>  		 * Fail with -EBUSY if not possible.
>  		 */
>  		remove_hugetlb_page(h, new_page, false);
> -		update_and_free_page(h, new_page);
>  		spin_unlock(&hugetlb_lock);
> +		update_and_free_page(h, new_page);
>  		if (!isolate_huge_page(old_page, list))
>  			ret = -EBUSY;
>  		return ret;
> @@ -2344,11 +2354,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		 * enqueue_huge_page for new page.  Net result is no change.
>  		 */
>  		remove_hugetlb_page(h, old_page, false);
> -		update_and_free_page(h, old_page);
>  		enqueue_huge_page(h, new_page);
> +		page_to_free = old_page;
>  	}
> -unlock:
> +unlock_free:
>  	spin_unlock(&hugetlb_lock);
> +	update_and_free_page(h, page_to_free);
>  
>  	return ret;
>  }
> @@ -2671,22 +2682,34 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  						nodemask_t *nodes_allowed)
>  {
>  	int i;
> +	struct page *page, *next;
> +	LIST_HEAD(page_list);
>  
>  	if (hstate_is_gigantic(h))
>  		return;
>  
> +	/*
> +	 * Collect pages to be freed on a list, and free after dropping lock
> +	 */
>  	for_each_node_mask(i, *nodes_allowed) {
> -		struct page *page, *next;
>  		struct list_head *freel = &h->hugepage_freelists[i];
>  		list_for_each_entry_safe(page, next, freel, lru) {
>  			if (count >= h->nr_huge_pages)
> -				return;
> +				goto out;
>  			if (PageHighMem(page))
>  				continue;
>  			remove_hugetlb_page(h, page, false);
> -			update_and_free_page(h, page);
> +			list_add(&page->lru, &page_list);
>  		}
>  	}
> +
> +out:
> +	spin_unlock(&hugetlb_lock);
> +	list_for_each_entry_safe(page, next, &page_list, lru) {
> +		update_and_free_page(h, page);
> +		cond_resched();
> +	}
> +	spin_lock(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-06  9:56   ` Michal Hocko
@ 2021-04-06 12:50     ` Oscar Salvador
  2021-04-06 16:49     ` Mike Kravetz
  1 sibling, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-06 12:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Tue, Apr 06, 2021 at 11:56:11AM +0200, Michal Hocko wrote:
> Btw. I would prefer to reverse the ordering of this and Oscar's
> patchset. This one is a bug fix which might be interesting for stable
> backports while Oscar's work can be seen as a general functionality
> improvement.

If it makes things easier, I do not mind if this work goes first and then I
work on top of that.
That said, I will try to have a look at this series.


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-05 23:00 ` [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
  2021-04-06  9:56   ` Michal Hocko
@ 2021-04-06 13:41   ` Oscar Salvador
  2021-04-06 20:52     ` Mike Kravetz
  2021-04-06 13:44   ` Oscar Salvador
  2 siblings, 1 reply; 37+ messages in thread
From: Oscar Salvador @ 2021-04-06 13:41 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:39PM -0700, Mike Kravetz wrote:
> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
> +							bool adjust_surplus)
> +{
> +	int nid = page_to_nid(page);
> +
> +	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> +		return;
> +
> +	list_del(&page->lru);
> +
> +	if (HPageFreed(page)) {
> +		h->free_huge_pages--;
> +		h->free_huge_pages_node[nid]--;
> +		ClearHPageFreed(page);
> +	}
> +	if (adjust_surplus) {
> +		h->surplus_huge_pages--;
> +		h->surplus_huge_pages_node[nid]--;
> +	}
> +
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);

These checks feel a bit odd here.
I would move them above, before we start messing with the counters?

> +
> +	ClearHPageTemporary(page);

Why clear it unconditionally? I guess we do not really care, but
someone might wonder why when reading the code.
So I would either do as we used to do and only clear it in case of
HPageTemporary(), or drop a comment.

> +	set_page_refcounted(page);
> +	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> +
> +	h->nr_huge_pages--;
> +	h->nr_huge_pages_node[nid]--;
> +}

As Michal pointed out, remove_hugetlb_page() from
alloc_and_dissolve_huge_page() might be problematic in some cases because
the new_page has not been enqueued yet.

Now, I guess this might be easily fixed by checking list_empty()
before going ahead with the list_del() call, or with another bool parameter
'delete', plus a fat comment explaining why we can get to call remove_huge_page()
on a page that is not on any list.

Another solution that comes to my mind is to split remove_huge_page()'s functionality
into 1) delete from list + unaccount free and surplus pages, and 2) reset the page's
state + unaccount nr_huge_pages.

But that might just be biased as it would fit my use case.
So maybe a list_empty() check or the bool parameter might just do.
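
For reference, the list_empty() variant really is tiny; something like
this against the hunk quoted above (sketch only):

	/*
	 * Sketch: a freshly prepped page that was never enqueued still has
	 * a self-pointing ->lru from INIT_LIST_HEAD(), so only unlink it
	 * when it is actually on a list.
	 */
	if (!list_empty(&page->lru))
		list_del(&page->lru);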
 

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-05 23:00 ` [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
  2021-04-06  9:56   ` Michal Hocko
  2021-04-06 13:41   ` Oscar Salvador
@ 2021-04-06 13:44   ` Oscar Salvador
  2 siblings, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-06 13:44 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:39PM -0700, Mike Kravetz wrote:
> The new remove_hugetlb_page() routine is designed to remove a hugetlb
> page from hugetlbfs processing.  It will remove the page from the active
> or free list, update global counters and set the compound page
> destructor to NULL so that PageHuge() will return false for the 'page'.
> After this call, the 'page' can be treated as a normal compound page or
> a collection of base size pages.
> 
> update_and_free_page no longer decrements h->nr_huge_pages{_node} as
> this is performed in remove_hugetlb_page.  The only functionality
> performed by update_and_free_page is to free the base pages to the lower
> level allocators.
> 
> update_and_free_page is typically called after remove_hugetlb_page.
> 
> remove_hugetlb_page is to be called with the hugetlb_lock held.
> 
> Creating this routine and separating functionality is in preparation for
> restructuring code to reduce lock hold times.  This commit should not
> introduce any changes to functionality.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Btw, it seems you were doing just fine before realizing that my series
went in.
So, as this seems a rather urgent matter to move forward (for obvious
reasons and also because it holds up the hotplug-vmemmap stuff), I wonder if
it would make your life easier to just ask Andrew to remove my series
for the time being and give yours priority.

I can later work on top of that.

> ---
>  mm/hugetlb.c | 88 ++++++++++++++++++++++++++++++----------------------
>  1 file changed, 51 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8497a3598c86..df2a3d1f632b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1055,18 +1055,13 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>  	return false;
>  }
>  
> -static void __enqueue_huge_page(struct list_head *list, struct page *page)
> -{
> -	list_move(&page->lru, list);
> -	SetHPageFreed(page);
> -}
> -
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>  	int nid = page_to_nid(page);
> -	__enqueue_huge_page(&h->hugepage_freelists[nid], page);
> +	list_move(&page->lru, &h->hugepage_freelists[nid]);
>  	h->free_huge_pages++;
>  	h->free_huge_pages_node[nid]++;
> +	SetHPageFreed(page);
>  }
>  
>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> @@ -1331,6 +1326,43 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>  						unsigned int order) { }
>  #endif
>  
> +/*
> + * Remove hugetlb page from lists, and update dtor so that page appears
> + * as just a compound page.  A reference is held on the page.
> + *
> + * Must be called with hugetlb lock held.
> + */
> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
> +							bool adjust_surplus)
> +{
> +	int nid = page_to_nid(page);
> +
> +	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> +		return;
> +
> +	list_del(&page->lru);
> +
> +	if (HPageFreed(page)) {
> +		h->free_huge_pages--;
> +		h->free_huge_pages_node[nid]--;
> +		ClearHPageFreed(page);
> +	}
> +	if (adjust_surplus) {
> +		h->surplus_huge_pages--;
> +		h->surplus_huge_pages_node[nid]--;
> +	}
> +
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> +
> +	ClearHPageTemporary(page);
> +	set_page_refcounted(page);
> +	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> +
> +	h->nr_huge_pages--;
> +	h->nr_huge_pages_node[nid]--;
> +}
> +
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> @@ -1339,8 +1371,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
>  
> -	h->nr_huge_pages--;
> -	h->nr_huge_pages_node[page_to_nid(page)]--;
>  	for (i = 0; i < pages_per_huge_page(h);
>  	     i++, subpage = mem_map_next(subpage, page, i)) {
>  		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> @@ -1348,10 +1378,6 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  				1 << PG_active | 1 << PG_private |
>  				1 << PG_writeback);
>  	}
> -	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> -	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> -	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> -	set_page_refcounted(page);
>  	if (hstate_is_gigantic(h)) {
>  		destroy_compound_gigantic_page(page, huge_page_order(h));
>  		free_gigantic_page(page, huge_page_order(h));
> @@ -1419,15 +1445,12 @@ static void __free_huge_page(struct page *page)
>  		h->resv_huge_pages++;
>  
>  	if (HPageTemporary(page)) {
> -		list_del(&page->lru);
> -		ClearHPageTemporary(page);
> +		remove_hugetlb_page(h, page, false);
>  		update_and_free_page(h, page);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
> -		list_del(&page->lru);
> +		remove_hugetlb_page(h, page, true);
>  		update_and_free_page(h, page);
> -		h->surplus_huge_pages--;
> -		h->surplus_huge_pages_node[nid]--;
>  	} else {
>  		arch_clear_hugepage_flags(page);
>  		enqueue_huge_page(h, page);
> @@ -1712,13 +1735,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  			struct page *page =
>  				list_entry(h->hugepage_freelists[node].next,
>  					  struct page, lru);
> -			list_del(&page->lru);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[node]--;
> -			if (acct_surplus) {
> -				h->surplus_huge_pages--;
> -				h->surplus_huge_pages_node[node]--;
> -			}
> +			remove_hugetlb_page(h, page, acct_surplus);
>  			update_and_free_page(h, page);
>  			ret = 1;
>  			break;
> @@ -1756,7 +1773,6 @@ int dissolve_free_huge_page(struct page *page)
>  	if (!page_count(page)) {
>  		struct page *head = compound_head(page);
>  		struct hstate *h = page_hstate(head);
> -		int nid = page_to_nid(head);
>  		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
>  
> @@ -1787,9 +1803,7 @@ int dissolve_free_huge_page(struct page *page)
>  			SetPageHWPoison(page);
>  			ClearPageHWPoison(head);
>  		}
> -		list_del(&head->lru);
> -		h->free_huge_pages--;
> -		h->free_huge_pages_node[nid]--;
> +		remove_hugetlb_page(h, page, false);
>  		h->max_huge_pages--;
>  		update_and_free_page(h, head);
>  		rc = 0;
> @@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		/*
>  		 * Freed from under us. Drop new_page too.
>  		 */
> +		remove_hugetlb_page(h, new_page, false);
>  		update_and_free_page(h, new_page);
>  		goto unlock;
>  	} else if (page_count(old_page)) {
> @@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		 * Someone has grabbed the page, try to isolate it here.
>  		 * Fail with -EBUSY if not possible.
>  		 */
> +		remove_hugetlb_page(h, new_page, false);
>  		update_and_free_page(h, new_page);
>  		spin_unlock(&hugetlb_lock);
>  		if (!isolate_huge_page(old_page, list))
> @@ -2323,13 +2339,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		/*
>  		 * Ok, old_page is still a genuine free hugepage. Replace it
>  		 * with the new one.
> +		 * Note: h->free_huge_pages{_node} counters are decremented
> +		 * in remove_hugetlb_page for old_page and incremented in
> +		 * enqueue_huge_page for new page.  Net result is no change.
>  		 */
> -		list_del(&old_page->lru);
> +		remove_hugetlb_page(h, old_page, false);
>  		update_and_free_page(h, old_page);
> -		/*
> -		 * h->free_huge_pages{_node} counters do not need to be updated.
> -		 */
> -		__enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
> +		enqueue_huge_page(h, new_page);
>  	}
>  unlock:
>  	spin_unlock(&hugetlb_lock);
> @@ -2667,10 +2683,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  				return;
>  			if (PageHighMem(page))
>  				continue;
> -			list_del(&page->lru);
> +			remove_hugetlb_page(h, page, false);
>  			update_and_free_page(h, page);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[page_to_nid(page)]--;
>  		}
>  	}
>  }
> -- 
> 2.30.2
> 

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-06  9:56   ` Michal Hocko
  2021-04-06 12:50     ` Oscar Salvador
@ 2021-04-06 16:49     ` Mike Kravetz
  2021-04-06 17:57       ` Oscar Salvador
  2021-04-07  8:21       ` Michal Hocko
  1 sibling, 2 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-06 16:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On 4/6/21 2:56 AM, Michal Hocko wrote:
> On Mon 05-04-21 16:00:39, Mike Kravetz wrote:
>> The new remove_hugetlb_page() routine is designed to remove a hugetlb
>> page from hugetlbfs processing.  It will remove the page from the active
>> or free list, update global counters and set the compound page
>> destructor to NULL so that PageHuge() will return false for the 'page'.
>> After this call, the 'page' can be treated as a normal compound page or
>> a collection of base size pages.
>>
>> update_and_free_page no longer decrements h->nr_huge_pages{_node} as
>> this is performed in remove_hugetlb_page.  The only functionality
>> performed by update_and_free_page is to free the base pages to the lower
>> level allocators.
>>
>> update_and_free_page is typically called after remove_hugetlb_page.
>>
>> remove_hugetlb_page is to be called with the hugetlb_lock held.
>>
>> Creating this routine and separating functionality is in preparation for
>> restructuring code to reduce lock hold times.  This commit should not
>> introduce any changes to functionality.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Btw. I would prefer to reverse the ordering of this and Oscar's
> patchset. This one is a bug fix which might be interesting for stable
> > backports while Oscar's work can be seen as a general functionality
> improvement.

Ok, that makes sense.

Andrew, can we make this happen?  It would require removing Oscar's
series until it can be modified to work on top of this.
There is only one small issue with this series as it originally went
into mmotm.  There is a missing conversion of spin_lock to spin_lock_irq
in patch 7.  In addition, there are some suggested changes from Oscar to
this patch.  I do not think they are necessary, but I could make those
as well.  Let me know what I can do to help make this happen.

>> @@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>>  		/*
>>  		 * Freed from under us. Drop new_page too.
>>  		 */
>> +		remove_hugetlb_page(h, new_page, false);
>>  		update_and_free_page(h, new_page);
>>  		goto unlock;
>>  	} else if (page_count(old_page)) {
>> @@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>>  		 * Someone has grabbed the page, try to isolate it here.
>>  		 * Fail with -EBUSY if not possible.
>>  		 */
>> +		remove_hugetlb_page(h, new_page, false);
>>  		update_and_free_page(h, new_page);
>>  		spin_unlock(&hugetlb_lock);
>>  		if (!isolate_huge_page(old_page, list))
> 
> the page is not enqueued anywhere here, so remove_hugetlb_page would blow
> up when linked list debugging is enabled.

I also thought this would be an issue.  However, INIT_LIST_HEAD would
have been called for the page so,

static inline void INIT_LIST_HEAD(struct list_head *list)
{
        WRITE_ONCE(list->next, list);
        list->prev = list;
}

The debug checks of concern in __list_del_entry_valid are:

            CHECK_DATA_CORRUPTION(prev->next != entry,
                        "list_del corruption. prev->next should be %px, but was %px\n",
                        entry, prev->next) ||
            CHECK_DATA_CORRUPTION(next->prev != entry,
                        "list_del corruption. next->prev should be %px, but was %px\n",
                        entry, next->prev))

Since all pointers point back to the list (head), the check passes.  My
track record with the list routines is not so good, so I actually
forced a list_del after INIT_LIST_HEAD with list debugging enabled and did
not encounter any issues.
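
A stripped-down userspace rendition of that experiment (the list helpers
below are hand-rolled stand-ins, not the kernel's) shows the same thing:

#include <assert.h>
#include <stdio.h>

/* minimal stand-in for the kernel's struct list_head */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* the two __list_del_entry_valid() conditions quoted above */
static int list_del_would_warn(struct list_head *entry)
{
	return entry->prev->next != entry || entry->next->prev != entry;
}

int main(void)
{
	struct list_head lru;

	INIT_LIST_HEAD(&lru);
	/* prev == next == &lru, so neither corruption check fires */
	assert(!list_del_would_warn(&lru));
	printf("list_del on a self-pointing entry passes the checks\n");
	return 0;
}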

Going forward, I agree it would be better to perhaps add a list_empty
check so that things do not blow up if the debugging code is changed.

At one time I also thought of splitting the functionality in
alloc_fresh_huge_page and prep_new_huge_page so that it would only
allocate/prep the page but not increment nr_huge_pages.  A new routine
would be used to increment the counter when it was actually put into use.
I thought this could be used when doing bulk adjustments in set_max_huge_pages
but the benefit would be minimal.  This seems like something that would
be useful in Oscar's alloc_and_dissolve_huge_page routine.
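
As a very rough sketch of that split (the helper name below is
hypothetical, nothing like it exists in the series): prep would only set
up the compound page, and the accounting would happen separately.

/*
 * Hypothetical helper: prep_new_huge_page() would only set up the
 * compound page, and the hstate accounting would happen here, once the
 * page is actually committed to the pool.
 */
static void account_new_huge_page(struct hstate *h, struct page *page)
{
	int nid = page_to_nid(page);

	lockdep_assert_held(&hugetlb_lock);
	h->nr_huge_pages++;
	h->nr_huge_pages_node[nid]++;
}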
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-06 16:49     ` Mike Kravetz
@ 2021-04-06 17:57       ` Oscar Salvador
  2021-04-07  8:21       ` Michal Hocko
  1 sibling, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-06 17:57 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On 2021-04-06 18:49, Mike Kravetz wrote:
> 
> Andrew, can we make this happen?  It would require removing Oscar's
> series until it can be modified to work on top of this.
> There is only one small issue with this series as it originally went
> into mmotm.  There is a missing conversion of spin_lock to spin_lock_irq
> in patch 7.  In addition, there are some suggested changes from Oscar to
> this patch.  I do not think they are necessary, but I could make those
> as well.  Let me know what I can do to help make this happen.

I agree that it might not be necessary to make such changes, but I still
would like to see an explanation on the points I raised (excluding the
list_del(), as you already proved that it is not necessary), just to be sure
I am not missing anything.


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-06 13:41   ` Oscar Salvador
@ 2021-04-06 20:52     ` Mike Kravetz
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-06 20:52 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On 4/6/21 6:41 AM, Oscar Salvador wrote:
> On Mon, Apr 05, 2021 at 04:00:39PM -0700, Mike Kravetz wrote:
>> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
>> +							bool adjust_surplus)
>> +{
>> +	int nid = page_to_nid(page);
>> +
>> +	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>> +		return;
>> +
>> +	list_del(&page->lru);
>> +
>> +	if (HPageFreed(page)) {
>> +		h->free_huge_pages--;
>> +		h->free_huge_pages_node[nid]--;
>> +		ClearHPageFreed(page);
>> +	}
>> +	if (adjust_surplus) {
>> +		h->surplus_huge_pages--;
>> +		h->surplus_huge_pages_node[nid]--;
>> +	}
>> +
>> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
>> +	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
> 
> These checks feel a bit odd here.
> I would move them above, before we start messing with the counters?
> 

This routine is comprised of code that was previously in update_and_free_page
and __free_huge_page.   In those routines, the VM_BUG_ON_PAGE came after the
counter adjustments.  That is the only reason they are positioned as they are.

I agree that it makes more sense to add them to the beginning of the routine.

>> +
>> +	ClearHPageTemporary(page);
> 
> Why clear it unconditionally? I guess we do not really care, but
> someone might wonder why when reading the code.
> So I would either do as we used to do and only clear it in case of
> HPageTemporary(), or drop a comment.
> 

Technically, the HPage* flags are meaningless after calling this
routine.  So, there really is no need to modify them at all.  The
flag clearing code is left over from the routines in which they
previously existed.

Any clearing of HPage* flags in this routine is unnecessary and should
be removed to avoid any questions.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
  2021-04-06 16:49     ` Mike Kravetz
  2021-04-06 17:57       ` Oscar Salvador
@ 2021-04-07  8:21       ` Michal Hocko
  1 sibling, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2021-04-07  8:21 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Tue 06-04-21 09:49:13, Mike Kravetz wrote:
> On 4/6/21 2:56 AM, Michal Hocko wrote:
> > On Mon 05-04-21 16:00:39, Mike Kravetz wrote:
[...]
> >> @@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> >>  		/*
> >>  		 * Freed from under us. Drop new_page too.
> >>  		 */
> >> +		remove_hugetlb_page(h, new_page, false);
> >>  		update_and_free_page(h, new_page);
> >>  		goto unlock;
> >>  	} else if (page_count(old_page)) {
> >> @@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> >>  		 * Someone has grabbed the page, try to isolate it here.
> >>  		 * Fail with -EBUSY if not possible.
> >>  		 */
> >> +		remove_hugetlb_page(h, new_page, false);
> >>  		update_and_free_page(h, new_page);
> >>  		spin_unlock(&hugetlb_lock);
> >>  		if (!isolate_huge_page(old_page, list))
> > 
> > the page is not enqueued anywhere here, so remove_hugetlb_page would blow
> > up when linked list debugging is enabled.
> 
> I also thought this would be an issue.  However, INIT_LIST_HEAD would
> have been called for the page so,

OK, this is true for a freshly allocated hugetlb page (prep_new_huge_page).
It's a very subtle dependency though. In case somebody ever wants to
fortify linked lists and decides to check list_del on an empty list, then
this would be silently waiting to blow up.

> Going forward, I agree it would be better to perhaps add a list_empty
> check so that things do not blow up if the debugging code is changed.

Yes, this is less tricky than a bool flag or making more stages of the
tear down. 2 stages are more than enough IMHO.

> At one time I also thought of splitting the functionality in
> alloc_fresh_huge_page and prep_new_huge_page so that it would only
> allocate/prep the page but not increment nr_huge_pages.

We already have that distinction. alloc_buddy_huge_page is there to
allocate a fresh huge page without any hstate accounting. Considering
that giga pages are not supported for migration anyway, maybe this
would make Oscar's work slightly less tricky?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-04-05 23:00 ` [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
  2021-04-06  9:57   ` Michal Hocko
@ 2021-04-07  8:27   ` Oscar Salvador
  2021-04-07  9:28     ` Michal Hocko
  1 sibling, 1 reply; 37+ messages in thread
From: Oscar Salvador @ 2021-04-07  8:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:40PM -0700, Mike Kravetz wrote:
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock.  Change all callers to
> drop the lock before calling.
> 
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
> 
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Without looking too closely at the changes made to alloc_and_dissolve_huge_page():

Reviewed-by: Oscar Salvador <osalvador@suse.de>

One question below:

> @@ -2671,22 +2682,34 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  						nodemask_t *nodes_allowed)
>  {
>  	int i;
> +	struct page *page, *next;
> +	LIST_HEAD(page_list);
>  
>  	if (hstate_is_gigantic(h))
>  		return;
>  
> +	/*
> +	 * Collect pages to be freed on a list, and free after dropping lock
> +	 */
>  	for_each_node_mask(i, *nodes_allowed) {
> -		struct page *page, *next;
>  		struct list_head *freel = &h->hugepage_freelists[i];
>  		list_for_each_entry_safe(page, next, freel, lru) {
>  			if (count >= h->nr_huge_pages)
> -				return;
> +				goto out;
>  			if (PageHighMem(page))
>  				continue;
>  			remove_hugetlb_page(h, page, false);
> -			update_and_free_page(h, page);
> +			list_add(&page->lru, &page_list);
>  		}
>  	}
> +
> +out:
> +	spin_unlock(&hugetlb_lock);
> +	list_for_each_entry_safe(page, next, &page_list, lru) {
> +		update_and_free_page(h, page);
> +		cond_resched();
> +	}
> +	spin_lock(&hugetlb_lock);

Can we get here with an empty list? Maybe if someone raced with us manipulating
nr_huge_pages? AFAICS, this gets called under the lock, and the adjusting in
remove_hugetlb_page() is also done under the lock, so I guess this cannot
happen.
The reason I am asking is whether we want to check whether the list is empty
before we do the unlock/lock dance.
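
Something like this tiny guard before the dance is what I have in mind
(sketch against the hunk above):

out:
	/* sketch: nothing was collected, so keep holding the lock */
	if (list_empty(&page_list))
		return;

	spin_unlock(&hugetlb_lock);
	list_for_each_entry_safe(page, next, &page_list, lru) {
		update_and_free_page(h, page);
		cond_resched();
	}
	spin_lock(&hugetlb_lock);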


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
  2021-04-05 23:00 ` [PATCH v4 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
@ 2021-04-07  8:44   ` Oscar Salvador
  0 siblings, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-07  8:44 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:41PM -0700, Mike Kravetz wrote:
> free_pool_huge_page was called with hugetlb_lock held.  It would remove
> a hugetlb page, and then free the corresponding pages to the lower level
> allocators such as buddy.  free_pool_huge_page was called in a loop to
> remove hugetlb pages and these loops could hold the hugetlb_lock for a
> considerable time.
> 
> Create new routine remove_pool_huge_page to replace free_pool_huge_page.
> remove_pool_huge_page will remove the hugetlb page, and it must be
> called with the hugetlb_lock held.  It will return the removed page and
> it is the responsibility of the caller to free the page to the lower
> level allocators.  The hugetlb_lock is dropped before freeing to these
> allocators which results in shorter lock hold times.
> 
> Add new helper routine to call update_and_free_page for a list of pages.
> 
> Note: Some changes to the routine return_unused_surplus_pages are in
> need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
> race when freeing surplus pages") modified this routine to address a
> race which could occur when dropping the hugetlb_lock in the loop that
> removes pool pages.  Accounting changes introduced in that commit were
> subtle and took some thought to understand.  This commit removes the
> cond_resched_lock() and the potential race.  Therefore, remove the
> subtle code and restore the more straight forward accounting effectively
> reverting the commit.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

It crossed my mind that we could use update_and_free_pages_bulk() from
dissolve_free_huge_pages() <-> dissolve_free_huge_page(), but the latter
already looks too hairy and it is probably not worth it.
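
(For context, the bulk helper referenced here boils down to roughly the
following; this is a sketch based on the pattern the series introduces, with
guessed local variable names, and callers drop hugetlb_lock before calling:)

static void update_and_free_pages_bulk(struct hstate *h,
					struct list_head *list)
{
	struct page *page, *t_page;

	list_for_each_entry_safe(page, t_page, list, lru) {
		update_and_free_page(h, page);
		cond_resched();
	}
}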


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 7/8] hugetlb: make free_huge_page irq safe
  2021-04-05 23:00 ` [PATCH v4 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
@ 2021-04-07  9:12   ` Oscar Salvador
  2021-04-07  9:33     ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Oscar Salvador @ 2021-04-07  9:12 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:42PM -0700, Mike Kravetz wrote:
> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> non-task context") was added to address the issue of free_huge_page
> being called from irq context.  That commit hands off free_huge_page
> processing to a workqueue if !in_task.  However, this doesn't cover
> all the cases as pointed out by 0day bot lockdep report [1].
> 
> :  Possible interrupt unsafe locking scenario:
> :
> :        CPU0                    CPU1
> :        ----                    ----
> :   lock(hugetlb_lock);
> :                                local_irq_disable();
> :                                lock(slock-AF_INET);
> :                                lock(hugetlb_lock);
> :   <Interrupt>
> :     lock(slock-AF_INET);
> 
> Shakeel later explained that this is very likely the TCP TX zerocopy
> from hugetlb pages scenario, where the networking code drops the last
> reference to a hugetlb page while IRQs are disabled. The hugetlb freeing
> path doesn't disable IRQs while holding hugetlb_lock, so a lock dependency
> chain can lead to a deadlock.
> 
> This commit addresses the issue by doing the following:
> - Make hugetlb_lock irq safe.  This is mostly a simple process of
>   changing spin_*lock calls to spin_*lock_irq* calls.
> - Make subpool lock irq safe in a similar manner.
> - Revert the !in_task check and workqueue handoff.
> 
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>

So, spin_lock_irqsave/spin_unlock_irqrestore is to be used in places
that might be called from an IRQ context?

Looks good to me:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
>  mm/hugetlb.c        | 183 +++++++++++++++++---------------------------
>  mm/hugetlb_cgroup.c |   8 +-
>  2 files changed, 74 insertions(+), 117 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 93a2a11b9376..15b6e7aadb52 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -93,9 +93,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
>  	return true;
>  }
>  
> -static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
> +						unsigned long irq_flags)
>  {
> -	spin_unlock(&spool->lock);
> +	spin_unlock_irqrestore(&spool->lock, irq_flags);
>  
>  	/* If no pages are used, and no other handles to the subpool
>  	 * remain, give up any reservations based on minimum size and
> @@ -134,10 +135,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
>  
>  void hugepage_put_subpool(struct hugepage_subpool *spool)
>  {
> -	spin_lock(&spool->lock);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&spool->lock, flags);
>  	BUG_ON(!spool->count);
>  	spool->count--;
> -	unlock_or_release_subpool(spool);
> +	unlock_or_release_subpool(spool, flags);
>  }
>  
>  /*
> @@ -156,7 +159,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>  	if (!spool)
>  		return ret;
>  
> -	spin_lock(&spool->lock);
> +	spin_lock_irq(&spool->lock);
>  
>  	if (spool->max_hpages != -1) {		/* maximum size accounting */
>  		if ((spool->used_hpages + delta) <= spool->max_hpages)
> @@ -183,7 +186,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
>  	}
>  
>  unlock_ret:
> -	spin_unlock(&spool->lock);
> +	spin_unlock_irq(&spool->lock);
>  	return ret;
>  }
>  
> @@ -197,11 +200,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>  				       long delta)
>  {
>  	long ret = delta;
> +	unsigned long flags;
>  
>  	if (!spool)
>  		return delta;
>  
> -	spin_lock(&spool->lock);
> +	spin_lock_irqsave(&spool->lock, flags);
>  
>  	if (spool->max_hpages != -1)		/* maximum size accounting */
>  		spool->used_hpages -= delta;
> @@ -222,7 +226,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
>  	 * If hugetlbfs_put_super couldn't free spool due to an outstanding
>  	 * quota reference, free it now.
>  	 */
> -	unlock_or_release_subpool(spool);
> +	unlock_or_release_subpool(spool, flags);
>  
>  	return ret;
>  }
> @@ -1407,7 +1411,7 @@ struct hstate *size_to_hstate(unsigned long size)
>  	return NULL;
>  }
>  
> -static void __free_huge_page(struct page *page)
> +void free_huge_page(struct page *page)
>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> @@ -1417,6 +1421,7 @@ static void __free_huge_page(struct page *page)
>  	int nid = page_to_nid(page);
>  	struct hugepage_subpool *spool = hugetlb_page_subpool(page);
>  	bool restore_reserve;
> +	unsigned long flags;
>  
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  	VM_BUG_ON_PAGE(page_mapcount(page), page);
> @@ -1445,7 +1450,7 @@ static void __free_huge_page(struct page *page)
>  			restore_reserve = true;
>  	}
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irqsave(&hugetlb_lock, flags);
>  	ClearHPageMigratable(page);
>  	hugetlb_cgroup_uncharge_page(hstate_index(h),
>  				     pages_per_huge_page(h), page);
> @@ -1456,66 +1461,18 @@ static void __free_huge_page(struct page *page)
>  
>  	if (HPageTemporary(page)) {
>  		remove_hugetlb_page(h, page, false);
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irqrestore(&hugetlb_lock, flags);
>  		update_and_free_page(h, page);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		remove_hugetlb_page(h, page, true);
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irqrestore(&hugetlb_lock, flags);
>  		update_and_free_page(h, page);
>  	} else {
>  		arch_clear_hugepage_flags(page);
>  		enqueue_huge_page(h, page);
> -		spin_unlock(&hugetlb_lock);
> -	}
> -}
> -
> -/*
> - * As free_huge_page() can be called from a non-task context, we have
> - * to defer the actual freeing in a workqueue to prevent potential
> - * hugetlb_lock deadlock.
> - *
> - * free_hpage_workfn() locklessly retrieves the linked list of pages to
> - * be freed and frees them one-by-one. As the page->mapping pointer is
> - * going to be cleared in __free_huge_page() anyway, it is reused as the
> - * llist_node structure of a lockless linked list of huge pages to be freed.
> - */
> -static LLIST_HEAD(hpage_freelist);
> -
> -static void free_hpage_workfn(struct work_struct *work)
> -{
> -	struct llist_node *node;
> -	struct page *page;
> -
> -	node = llist_del_all(&hpage_freelist);
> -
> -	while (node) {
> -		page = container_of((struct address_space **)node,
> -				     struct page, mapping);
> -		node = node->next;
> -		__free_huge_page(page);
> -	}
> -}
> -static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> -
> -void free_huge_page(struct page *page)
> -{
> -	/*
> -	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> -	 */
> -	if (!in_task()) {
> -		/*
> -		 * Only call schedule_work() if hpage_freelist is previously
> -		 * empty. Otherwise, schedule_work() had been called but the
> -		 * workfn hasn't retrieved the list yet.
> -		 */
> -		if (llist_add((struct llist_node *)&page->mapping,
> -			      &hpage_freelist))
> -			schedule_work(&free_hpage_work);
> -		return;
> +		spin_unlock_irqrestore(&hugetlb_lock, flags);
>  	}
> -
> -	__free_huge_page(page);
>  }
>  
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -1525,11 +1482,11 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	hugetlb_set_page_subpool(page, NULL);
>  	set_hugetlb_cgroup(page, NULL);
>  	set_hugetlb_cgroup_rsvd(page, NULL);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	h->nr_huge_pages++;
>  	h->nr_huge_pages_node[nid]++;
>  	ClearHPageFreed(page);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  }
>  
>  static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> @@ -1775,7 +1732,7 @@ int dissolve_free_huge_page(struct page *page)
>  	if (!PageHuge(page))
>  		return 0;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (!PageHuge(page)) {
>  		rc = 0;
>  		goto out;
> @@ -1792,7 +1749,7 @@ int dissolve_free_huge_page(struct page *page)
>  		 * when it is dissolved.
>  		 */
>  		if (unlikely(!HPageFreed(head))) {
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			cond_resched();
>  
>  			/*
> @@ -1816,12 +1773,12 @@ int dissolve_free_huge_page(struct page *page)
>  		}
>  		remove_hugetlb_page(h, page, false);
>  		h->max_huge_pages--;
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		update_and_free_page(h, head);
>  		return 0;
>  	}
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return rc;
>  }
>  
> @@ -1863,16 +1820,16 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	if (hstate_is_gigantic(h))
>  		return NULL;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
>  		goto out_unlock;
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
>  	if (!page)
>  		return NULL;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * We could have raced with the pool size change.
>  	 * Double check that and simply deallocate the new page
> @@ -1882,7 +1839,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	 */
>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>  		SetHPageTemporary(page);
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		put_page(page);
>  		return NULL;
>  	} else {
> @@ -1891,7 +1848,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	}
>  
>  out_unlock:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	return page;
>  }
> @@ -1941,17 +1898,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>  		nodemask_t *nmask, gfp_t gfp_mask)
>  {
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
>  		struct page *page;
>  
>  		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
>  		if (page) {
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			return page;
>  		}
>  	}
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
>  }
> @@ -1999,7 +1956,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  
>  	ret = -ENOMEM;
>  retry:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	for (i = 0; i < needed; i++) {
>  		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
>  				NUMA_NO_NODE, NULL);
> @@ -2016,7 +1973,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  	 * After retaking hugetlb_lock, we need to recalculate 'needed'
>  	 * because either resv_huge_pages or free_huge_pages may have changed.
>  	 */
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	needed = (h->resv_huge_pages + delta) -
>  			(h->free_huge_pages + allocated);
>  	if (needed > 0) {
> @@ -2056,12 +2013,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  		enqueue_huge_page(h, page);
>  	}
>  free:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	/* Free unnecessary surplus pages to the buddy allocator */
>  	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
>  		put_page(page);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  
>  	return ret;
>  }
> @@ -2111,9 +2068,9 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	}
>  
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	update_and_free_pages_bulk(h, &page_list);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  }
>  
>  
> @@ -2316,7 +2273,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  	 */
>  	page_ref_dec(new_page);
>  retry:
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (!PageHuge(old_page)) {
>  		/*
>  		 * Freed from under us. Drop new_page too.
> @@ -2330,7 +2287,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		 * Fail with -EBUSY if not possible.
>  		 */
>  		remove_hugetlb_page(h, new_page, false);
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		update_and_free_page(h, new_page);
>  		if (!isolate_huge_page(old_page, list))
>  			ret = -EBUSY;
> @@ -2341,7 +2298,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		 * freelist yet. Race window is small, so we can succed here if
>  		 * we retry.
>  		 */
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		cond_resched();
>  		goto retry;
>  	} else {
> @@ -2357,7 +2314,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  		page_to_free = old_page;
>  	}
>  unlock_free:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	update_and_free_page(h, page_to_free);
>  
>  	return ret;
> @@ -2374,15 +2331,15 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>  	 * to carefully check the state under the lock.
>  	 * Return success when racing as if we dissolved the page ourselves.
>  	 */
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (PageHuge(page)) {
>  		head = compound_head(page);
>  		h = page_hstate(head);
>  	} else {
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		return 0;
>  	}
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	/*
>  	 * Fence off gigantic pages as there is a cyclic dependency between
> @@ -2462,7 +2419,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	if (ret)
>  		goto out_uncharge_cgroup_reservation;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * glb_chg is passed to indicate whether or not a page must be taken
>  	 * from the global free pool (global change).  gbl_chg == 0 indicates
> @@ -2470,7 +2427,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	 */
>  	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>  	if (!page) {
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  		page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
>  		if (!page)
>  			goto out_uncharge_cgroup;
> @@ -2478,7 +2435,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  			SetHPageRestoreReserve(page);
>  			h->resv_huge_pages--;
>  		}
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		list_add(&page->lru, &h->hugepage_activelist);
>  		/* Fall through */
>  	}
> @@ -2491,7 +2448,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  						  h_cg, page);
>  	}
>  
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	hugetlb_set_page_subpool(page, spool);
>  
> @@ -2703,9 +2660,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
>  	}
>  
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	update_and_free_pages_bulk(h, &page_list);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  }
>  #else
>  static inline void try_to_free_low(struct hstate *h, unsigned long count,
> @@ -2770,7 +2727,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	 * pages in hstate via the proc/sysfs interfaces.
>  	 */
>  	mutex_lock(&h->resize_lock);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  
>  	/*
>  	 * Check for a node specific request.
> @@ -2801,7 +2758,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	 */
>  	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>  		if (count > persistent_huge_pages(h)) {
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			mutex_unlock(&h->resize_lock);
>  			NODEMASK_FREE(node_alloc_noretry);
>  			return -EINVAL;
> @@ -2831,14 +2788,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  		 * page, free_huge_page will handle it by freeing the page
>  		 * and reducing the surplus.
>  		 */
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  
>  		/* yield cpu to avoid soft lockup */
>  		cond_resched();
>  
>  		ret = alloc_pool_huge_page(h, nodes_allowed,
>  						node_alloc_noretry);
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		if (!ret)
>  			goto out;
>  
> @@ -2877,9 +2834,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  		list_add(&page->lru, &page_list);
>  	}
>  	/* free the pages after dropping lock */
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	update_and_free_pages_bulk(h, &page_list);
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  
>  	while (count < persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, nodes_allowed, 1))
> @@ -2887,7 +2844,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	}
>  out:
>  	h->max_huge_pages = persistent_huge_pages(h);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	mutex_unlock(&h->resize_lock);
>  
>  	NODEMASK_FREE(node_alloc_noretry);
> @@ -3043,9 +3000,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>  	if (err)
>  		return err;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	h->nr_overcommit_huge_pages = input;
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  
>  	return count;
>  }
> @@ -3632,9 +3589,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>  		goto out;
>  
>  	if (write) {
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		h->nr_overcommit_huge_pages = tmp;
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  	}
>  out:
>  	return ret;
> @@ -3730,7 +3687,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>  	if (!delta)
>  		return 0;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * When cpuset is configured, it breaks the strict hugetlb page
>  	 * reservation as the accounting is done on a global variable. Such
> @@ -3769,7 +3726,7 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>  		return_unused_surplus_pages(h, (unsigned long) -delta);
>  
>  out:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return ret;
>  }
>  
> @@ -5833,7 +5790,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>  {
>  	bool ret = true;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	if (!PageHeadHuge(page) ||
>  	    !HPageMigratable(page) ||
>  	    !get_page_unless_zero(page)) {
> @@ -5843,16 +5800,16 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>  	ClearHPageMigratable(page);
>  	list_move_tail(&page->lru, list);
>  unlock:
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return ret;
>  }
>  
>  void putback_active_hugepage(struct page *page)
>  {
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	SetHPageMigratable(page);
>  	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	put_page(page);
>  }
>  
> @@ -5886,12 +5843,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>  		 */
>  		if (new_nid == old_nid)
>  			return;
> -		spin_lock(&hugetlb_lock);
> +		spin_lock_irq(&hugetlb_lock);
>  		if (h->surplus_huge_pages_node[old_nid]) {
>  			h->surplus_huge_pages_node[old_nid]--;
>  			h->surplus_huge_pages_node[new_nid]++;
>  		}
> -		spin_unlock(&hugetlb_lock);
> +		spin_unlock_irq(&hugetlb_lock);
>  	}
>  }
>  
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 726b85f4f303..5383023d0cca 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -204,11 +204,11 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	do {
>  		idx = 0;
>  		for_each_hstate(h) {
> -			spin_lock(&hugetlb_lock);
> +			spin_lock_irq(&hugetlb_lock);
>  			list_for_each_entry(page, &h->hugepage_activelist, lru)
>  				hugetlb_cgroup_move_parent(idx, h_cg, page);
>  
> -			spin_unlock(&hugetlb_lock);
> +			spin_unlock_irq(&hugetlb_lock);
>  			idx++;
>  		}
>  		cond_resched();
> @@ -784,7 +784,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>  	if (hugetlb_cgroup_disabled())
>  		return;
>  
> -	spin_lock(&hugetlb_lock);
> +	spin_lock_irq(&hugetlb_lock);
>  	h_cg = hugetlb_cgroup_from_page(oldhpage);
>  	h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
>  	set_hugetlb_cgroup(oldhpage, NULL);
> @@ -794,7 +794,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>  	set_hugetlb_cgroup(newhpage, h_cg);
>  	set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
>  	list_move(&newhpage->lru, &h->hugepage_activelist);
> -	spin_unlock(&hugetlb_lock);
> +	spin_unlock_irq(&hugetlb_lock);
>  	return;
>  }
>  
> -- 
> 2.30.2
> 

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock
  2021-04-05 23:00 ` [PATCH v4 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
@ 2021-04-07  9:13   ` Oscar Salvador
  0 siblings, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-07  9:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:43PM -0700, Mike Kravetz wrote:
> After making hugetlb lock irq safe and separating some functionality
> done under the lock, add some lockdep_assert_held to help verify
> locking.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
  2021-04-06  9:17   ` Peter Zijlstra
  2021-04-06  9:35   ` Michal Hocko
@ 2021-04-07  9:23   ` David Hildenbrand
  2021-04-07 18:24   ` Roman Gushchin
  3 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-04-07  9:23 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	Muchun Song, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon, Andrew Morton

On 06.04.21 01:00, Mike Kravetz wrote:
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by the cma->lock mutex. Hugetlb code, which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> an (irq aware) spin lock. The bitmap processing should be quite fast in
> the typical case, but if cma sizes grow to TB then we will likely need to
> replace the lock with a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/cma.c       | 18 +++++++++---------
>   mm/cma.h       |  2 +-
>   mm/cma_debug.c |  8 ++++----
>   3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index f3bca4178c7f..995e15480937 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>   #include <linux/memblock.h>
>   #include <linux/err.h>
>   #include <linux/mm.h>
> -#include <linux/mutex.h>
>   #include <linux/sizes.h>
>   #include <linux/slab.h>
>   #include <linux/log2.h>
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
>   			     unsigned long count)
>   {
>   	unsigned long bitmap_no, bitmap_count;
> +	unsigned long flags;
>   
>   	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>   	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>   }
>   
>   static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>   	     pfn += pageblock_nr_pages)
>   		init_cma_reserved_pageblock(pfn_to_page(pfn));
>   
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
>   
>   #ifdef CONFIG_CMA_DEBUGFS
>   	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   	unsigned long nr_part, nr_total = 0;
>   	unsigned long nbits = cma_bitmap_maxno(cma);
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   	pr_info("number of available pages: ");
>   	for (;;) {
>   		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   		start = next_zero_bit + nr_zero;
>   	}
>   	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>   }
>   #else
>   static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>   		goto out;
>   
>   	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irq(&cma->lock);
>   		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   				bitmap_maxno, start, bitmap_count, mask,
>   				offset);
>   		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irq(&cma->lock);
>   			break;
>   		}
>   		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>   		 * our exclusive use. If the migration fails we will take the
>   		 * lock again and unmark it.
>   		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irq(&cma->lock);
>   
>   		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>   		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>   	unsigned long   count;
>   	unsigned long   *bitmap;
>   	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>   #ifdef CONFIG_CMA_DEBUGFS
>   	struct hlist_head mem_head;
>   	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>   	struct cma *cma = data;
>   	unsigned long used;
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   	/* pages counter is smaller than sizeof(int) */
>   	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>   	*val = (u64)used << cma->order_per_bit;
>   
>   	return 0;
> @@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>   	unsigned long start, end = 0;
>   	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
>   
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   	for (;;) {
>   		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>   		if (start >= bitmap_maxno)
> @@ -61,7 +61,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>   		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>   		maxchunk = max(end - start, maxchunk);
>   	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>   	*val = (u64)maxchunk << cma->order_per_bit;
>   
>   	return 0;
> 

You seem to have dropped my

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release
  2021-04-05 23:00 ` [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
  2021-04-06  7:18   ` Oscar Salvador
@ 2021-04-07  9:26   ` David Hildenbrand
  1 sibling, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-04-07  9:26 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	Muchun Song, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon, Andrew Morton

On 06.04.21 01:00, Mike Kravetz wrote:
> Now that cma_release is non-blocking and irq safe, there is no need to
> drop hugetlb_lock before calling.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>   mm/hugetlb.c | 6 ------
>   1 file changed, 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c3e4baa4156..1d62f0492e7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>   	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>   	set_page_refcounted(page);
>   	if (hstate_is_gigantic(h)) {
> -		/*
> -		 * Temporarily drop the hugetlb_lock, because
> -		 * we might block in free_gigantic_page().
> -		 */
> -		spin_unlock(&hugetlb_lock);
>   		destroy_compound_gigantic_page(page, huge_page_order(h));
>   		free_gigantic_page(page, huge_page_order(h));
> -		spin_lock(&hugetlb_lock);
>   	} else {
>   		__free_pages(page, huge_page_order(h));
>   	}
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments
  2021-04-05 23:00 ` [PATCH v4 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
@ 2021-04-07  9:28   ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-04-07  9:28 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, Oscar Salvador,
	Muchun Song, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon, Andrew Morton

On 06.04.21 01:00, Mike Kravetz wrote:
> The helper routine hstate_next_node_to_alloc accesses and modifies the
> hstate variable next_nid_to_alloc.  The helper is used by the routines
> alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
> called with hugetlb_lock held.  However, alloc_pool_huge_page can not
> be called with the hugetlb lock held as it will call the page allocator.
> Two instances of alloc_pool_huge_page could be run in parallel or
> alloc_pool_huge_page could run in parallel with adjust_pool_surplus
> which may result in the variable next_nid_to_alloc becoming invalid
> for the caller and pages being allocated on the wrong node.
> 
> Both alloc_pool_huge_page and adjust_pool_surplus are only called from
> the routine set_max_huge_pages after boot.  set_max_huge_pages is only
> called as the result of a user writing to the proc/sysfs nr_hugepages,
> or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
> 
> It makes little sense to allow multiple adjustments to the number of
> hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
> allow one hugetlb page adjustment at a time.  This will synchronize
> modifications to the next_nid_to_alloc variable.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> ---
>   include/linux/hugetlb.h | 1 +
>   mm/hugetlb.c            | 8 ++++++++
>   2 files changed, 9 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d9b78e82652f..b92f25ccef58 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -566,6 +566,7 @@ HPAGEFLAG(Freed, freed)
>   #define HSTATE_NAME_LEN 32
>   /* Defines one hugetlb page size */
>   struct hstate {
> +	struct mutex resize_lock;
>   	int next_nid_to_alloc;
>   	int next_nid_to_free;
>   	unsigned int order;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1d62f0492e7b..8497a3598c86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2730,6 +2730,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   	else
>   		return -ENOMEM;
>   
> +	/*
> +	 * resize_lock mutex prevents concurrent adjustments to number of
> +	 * pages in hstate via the proc/sysfs interfaces.
> +	 */
> +	mutex_lock(&h->resize_lock);
>   	spin_lock(&hugetlb_lock);
>   
>   	/*
> @@ -2762,6 +2767,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>   		if (count > persistent_huge_pages(h)) {
>   			spin_unlock(&hugetlb_lock);
> +			mutex_unlock(&h->resize_lock);
>   			NODEMASK_FREE(node_alloc_noretry);
>   			return -EINVAL;
>   		}
> @@ -2836,6 +2842,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   out:
>   	h->max_huge_pages = persistent_huge_pages(h);
>   	spin_unlock(&hugetlb_lock);
> +	mutex_unlock(&h->resize_lock);
>   
>   	NODEMASK_FREE(node_alloc_noretry);
>   
> @@ -3323,6 +3330,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>   	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>   	BUG_ON(order == 0);
>   	h = &hstates[hugetlb_max_hstate++];
> +	mutex_init(&h->resize_lock);
>   	h->order = order;
>   	h->mask = ~(huge_page_size(h) - 1);
>   	for (i = 0; i < MAX_NUMNODES; ++i)
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-04-07  8:27   ` Oscar Salvador
@ 2021-04-07  9:28     ` Michal Hocko
  2021-04-07  9:37       ` Oscar Salvador
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-04-07  9:28 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Wed 07-04-21 10:27:49, Oscar Salvador wrote:
> On Mon, Apr 05, 2021 at 04:00:40PM -0700, Mike Kravetz wrote:
[...]
> > @@ -2671,22 +2682,34 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
> >  						nodemask_t *nodes_allowed)
> >  {
> >  	int i;
> > +	struct page *page, *next;
> > +	LIST_HEAD(page_list);
> >  
> >  	if (hstate_is_gigantic(h))
> >  		return;
> >  
> > +	/*
> > +	 * Collect pages to be freed on a list, and free after dropping lock
> > +	 */
> >  	for_each_node_mask(i, *nodes_allowed) {
> > -		struct page *page, *next;
> >  		struct list_head *freel = &h->hugepage_freelists[i];
> >  		list_for_each_entry_safe(page, next, freel, lru) {
> >  			if (count >= h->nr_huge_pages)
> > -				return;
> > +				goto out;
> >  			if (PageHighMem(page))
> >  				continue;
> >  			remove_hugetlb_page(h, page, false);
> > -			update_and_free_page(h, page);
> > +			list_add(&page->lru, &page_list);
> >  		}
> >  	}
> > +
> > +out:
> > +	spin_unlock(&hugetlb_lock);
> > +	list_for_each_entry_safe(page, next, &page_list, lru) {
> > +		update_and_free_page(h, page);
> > +		cond_resched();
> > +	}
> > +	spin_lock(&hugetlb_lock);
> 
> Can we get here with an empty list?

An empty page_list? If yes then sure, this can happen, but
list_for_each_entry_safe will simply not iterate. Or what do you mean?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 7/8] hugetlb: make free_huge_page irq safe
  2021-04-07  9:12   ` Oscar Salvador
@ 2021-04-07  9:33     ` Michal Hocko
  2021-04-07  9:38       ` Oscar Salvador
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-04-07  9:33 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Wed 07-04-21 11:12:37, Oscar Salvador wrote:
> On Mon, Apr 05, 2021 at 04:00:42PM -0700, Mike Kravetz wrote:
> > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> > non-task context") was added to address the issue of free_huge_page
> > being called from irq context.  That commit hands off free_huge_page
> > processing to a workqueue if !in_task.  However, this doesn't cover
> > all the cases as pointed out by 0day bot lockdep report [1].
> > 
> > :  Possible interrupt unsafe locking scenario:
> > :
> > :        CPU0                    CPU1
> > :        ----                    ----
> > :   lock(hugetlb_lock);
> > :                                local_irq_disable();
> > :                                lock(slock-AF_INET);
> > :                                lock(hugetlb_lock);
> > :   <Interrupt>
> > :     lock(slock-AF_INET);
> > 
> > Shakeel later explained that this is very likely the TCP TX zerocopy
> > from hugetlb pages scenario, where the networking code drops the last
> > reference to a hugetlb page while IRQs are disabled. The hugetlb freeing
> > path doesn't disable IRQs while holding hugetlb_lock, so a lock dependency
> > chain can lead to a deadlock.
> > 
> > This commit addresses the issue by doing the following:
> > - Make hugetlb_lock irq safe.  This is mostly a simple process of
> >   changing spin_*lock calls to spin_*lock_irq* calls.
> > - Make subpool lock irq safe in a similar manner.
> > - Revert the !in_task check and workqueue handoff.
> > 
> > [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
> So, spin_lock_irqsave/spin_unlock_irqrestore is to be used in places
> that might be called from an IRQ context?

Yes. spin_unlock_irq will enable interrupts unconditionally which is
certainly not what we want if the path is called with IRQ disabled by
the caller.
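
(To illustrate the distinction, a sketch only, not code taken from the patch:)

	unsigned long flags;

	/* Path only ever entered from task context, IRQs known enabled: */
	spin_lock_irq(&hugetlb_lock);
	/* ... */
	spin_unlock_irq(&hugetlb_lock);	/* unconditionally re-enables IRQs */

	/*
	 * Path that may be reached with IRQs already disabled, e.g. a final
	 * put_page() from networking code: save and restore the prior state.
	 */
	spin_lock_irqsave(&hugetlb_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&hugetlb_lock, flags);	/* restores saved IRQ state */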
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock
  2021-04-07  9:28     ` Michal Hocko
@ 2021-04-07  9:37       ` Oscar Salvador
  0 siblings, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-07  9:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Wed, Apr 07, 2021 at 11:28:51AM +0200, Michal Hocko wrote:
> An empty page_list? If yes then sure, this can happen, but
> list_for_each_entry_safe will simply not iterate. Or what do you mean?

Yes, I meant page_list.
Yeah, I figured list_for_each_entry_safe() would simply not iterate, but I
wondered whether we still want the spin_unlock()/spin_lock() in that case.

But it is probably not worth adding more code for that, so it is fine.

Thanks
 
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 7/8] hugetlb: make free_huge_page irq safe
  2021-04-07  9:33     ` Michal Hocko
@ 2021-04-07  9:38       ` Oscar Salvador
  0 siblings, 0 replies; 37+ messages in thread
From: Oscar Salvador @ 2021-04-07  9:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Wed, Apr 07, 2021 at 11:33:18AM +0200, Michal Hocko wrote:
> Yes. spin_unlock_irq will enable interrupts unconditionally which is
> certainly not what we want if the path is called with IRQ disabled by
> the caller.

I see, thanks for confirming.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock
  2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
                     ` (2 preceding siblings ...)
  2021-04-07  9:23   ` David Hildenbrand
@ 2021-04-07 18:24   ` Roman Gushchin
  3 siblings, 0 replies; 37+ messages in thread
From: Roman Gushchin @ 2021-04-07 18:24 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Shakeel Butt,
	Oscar Salvador, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon, Andrew Morton

On Mon, Apr 05, 2021 at 04:00:36PM -0700, Mike Kravetz wrote:
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by the cma->lock mutex. Hugetlb code, which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> an (irq aware) spin lock. The bitmap processing should be quite fast in
> the typical case, but if cma sizes grow to TB then we will likely need to
> replace the lock with a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Roman Gushchin <guro@fb.com>

> ---
>  mm/cma.c       | 18 +++++++++---------
>  mm/cma.h       |  2 +-
>  mm/cma_debug.c |  8 ++++----
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index f3bca4178c7f..995e15480937 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include <linux/memblock.h>
>  #include <linux/err.h>
>  #include <linux/mm.h>
> -#include <linux/mutex.h>
>  #include <linux/sizes.h>
>  #include <linux/slab.h>
>  #include <linux/log2.h>
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
>  			     unsigned long count)
>  {
>  	unsigned long bitmap_no, bitmap_count;
> +	unsigned long flags;
>  
>  	bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>  	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>  	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>  	     pfn += pageblock_nr_pages)
>  		init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>  	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  	unsigned long nr_part, nr_total = 0;
>  	unsigned long nbits = cma_bitmap_maxno(cma);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	pr_info("number of available pages: ");
>  	for (;;) {
>  		next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>  		start = next_zero_bit + nr_zero;
>  	}
>  	pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  		goto out;
>  
>  	for (;;) {
> -		mutex_lock(&cma->lock);
> +		spin_lock_irq(&cma->lock);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>  				bitmap_maxno, start, bitmap_count, mask,
>  				offset);
>  		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> +			spin_unlock_irq(&cma->lock);
>  			break;
>  		}
>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  		 * our exclusive use. If the migration fails we will take the
>  		 * lock again and unmark it.
>  		 */
> -		mutex_unlock(&cma->lock);
> +		spin_unlock_irq(&cma->lock);
>  
>  		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>  		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>  	unsigned long   count;
>  	unsigned long   *bitmap;
>  	unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex    lock;
> +	spinlock_t	lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>  	struct hlist_head mem_head;
>  	spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>  	struct cma *cma = data;
>  	unsigned long used;
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	/* pages counter is smaller than sizeof(int) */
>  	used = bitmap_weight(cma->bitmap, (int)cma_bitmap_maxno(cma));
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  	*val = (u64)used << cma->order_per_bit;
>  
>  	return 0;
> @@ -53,7 +53,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  	unsigned long start, end = 0;
>  	unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>  	for (;;) {
>  		start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>  		if (start >= bitmap_maxno)
> @@ -61,7 +61,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>  		end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>  		maxchunk = max(end - start, maxchunk);
>  	}
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  	*val = (u64)maxchunk << cma->order_per_bit;
>  
>  	return 0;
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts
  2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
                   ` (7 preceding siblings ...)
  2021-04-05 23:00 ` [PATCH v4 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
@ 2021-04-08  0:56 ` Mike Kravetz
  2021-04-08  7:11   ` Oscar Salvador
  8 siblings, 1 reply; 37+ messages in thread
From: Mike Kravetz @ 2021-04-08  0:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Oscar Salvador, Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Shakeel Butt, David Hildenbrand,
	Muchun Song, David Rientjes, Miaohe Lin, Peter Zijlstra,
	Matthew Wilcox, HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long,
	Peter Xu, Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song,
	Will Deacon

Hello Andrew,

It has been suggested that this series be included before Oscar Salvador's
series "Make alloc_contig_range handle Hugetlb pages".  At a logical
level, here is what I think needs to happen.  However, I am not sure how
you do tree management and I am open to anything you suggest.  Please do
not start until we get an Ack from Oscar as he will need to participate.

Remove patches for this series in your tree from Mike Kravetz:
- hugetlb: add lockdep_assert_held() calls for hugetlb_lock
- hugetlb: fix irq locking omissions
- hugetlb: make free_huge_page irq safe
- hugetlb: change free_pool_huge_page to remove_pool_huge_page
- hugetlb: call update_and_free_page without hugetlb_lock
- hugetlb: create remove_hugetlb_page() to separate functionality
  /*
   * Technically, the following patches do not need to be removed as
   * they do not interact with Oscar's changes.  However, they do
   * contain 'cover letter comments' in the commit messages which may
   * not make sense out of context.
   */
- hugetlb: add per-hstate mutex to synchronize user adjustments
- hugetlb: no need to drop hugetlb_lock to call cma_release
- mm/cma: change cma mutex to irq safe spinlock

Remove patches for the series "Make alloc_contig_range handle Hugetlb pages"
from Oscar Salvador.
- mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig
- mm: make alloc_contig_range handle in-use hugetlb pages
- mm: make alloc_contig_range handle free hugetlb pages
  /*
   * Technically, the following patches do not need to be removed as
   * they do not interact with Mike's changes.  Again, they do
   * contain 'cover letter comments' in the commit messages which may
   * not make sense out of context.
   */
- mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes-fix
- mm,compaction: let isolate_migratepages_{range,block} return error codes
- mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range

After removing patches above, Mike will provide updated versions of:
/* If removed above */
- mm/cma: change cma mutex to irq safe spinlock
- hugetlb: no need to drop hugetlb_lock to call cma_release
- hugetlb: add per-hstate mutex to synchronize user adjustments
/* end of If removed above */
- hugetlb: create remove_hugetlb_page() to separate functionality
- hugetlb: call update_and_free_page without hugetlb_lock
- hugetlb: change free_pool_huge_page to remove_pool_huge_page
- hugetlb: make free_huge_page irq safe
- hugetlb: add lockdep_assert_held() calls for hugetlb_lock

With these patches in place, Oscar will provide updated versions of:
/* If removed above */
- mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range
- mm,compaction: let isolate_migratepages_{range,block} return error codes
/* end of If removed above */
- mm: make alloc_contig_range handle free hugetlb pages
- mm: make alloc_contig_range handle in-use hugetlb pages
- mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig

Sorry that things ended up in their current state as it will cause more
work for you.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts
  2021-04-08  0:56 ` [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
@ 2021-04-08  7:11   ` Oscar Salvador
  2021-04-09  5:05     ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Oscar Salvador @ 2021-04-08  7:11 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Andrew Morton, Roman Gushchin,
	Michal Hocko, Shakeel Butt, David Hildenbrand, Muchun Song,
	David Rientjes, Miaohe Lin, Peter Zijlstra, Matthew Wilcox,
	HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long, Peter Xu,
	Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon

On Wed, Apr 07, 2021 at 05:56:55PM -0700, Mike Kravetz wrote:
> Hello Andrew,
> 
> It has been suggested that this series be included before Oscar Salvador's
> series "Make alloc_contig_range handle Hugetlb pages".  At a logical
> level, here is what I think needs to happen.  However, I am not sure how
> you do tree management and I am open to anything you suggest.  Please do
> not start until we get an Ack from Oscar as he will need to participate.

As I said, this is fine by me.
I think it is the most straightforward way to proceed with this series,
as this is a problem that has been bugging us for quite some time now.

See below:

 
> Remove patches for the series "Make alloc_contig_range handle Hugetlb pages"
> from Oscar Salvador.
> - mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig
> - mm: make alloc_contig_range handle in-use hugetlb pages
> - mm: make alloc_contig_range handle free hugetlb pages

Yes, those need to be removed

>   /*
>    * Technically, the following patches do not need to be removed as
>    * they do not interact with Mike's changes.  Again, they do
>    * contain 'cover letter comments' in the commit messages which may
>    * not make sense out of context.
>    */
> - mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes-fix
> - mm,compaction: let isolate_migratepages_{range,block} return error codes

Those could stay as well, but they mention a change in
alloc_contig_range() and without the context of the whole patchset they
might be misleading, so I would pull them out as well.

> - mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range

I think this one can stay.

But if it is going to be easier for Andrew, just pull them all out and I
will resend the whole series once this work goes in.

Thanks!

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts
  2021-04-08  7:11   ` Oscar Salvador
@ 2021-04-09  5:05     ` Andrew Morton
  2021-04-09 20:43       ` Mike Kravetz
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2021-04-09  5:05 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, linux-mm, linux-kernel, Roman Gushchin,
	Michal Hocko, Shakeel Butt, David Hildenbrand, Muchun Song,
	David Rientjes, Miaohe Lin, Peter Zijlstra, Matthew Wilcox,
	HORIGUCHI NAOYA, Aneesh Kumar K . V, Waiman Long, Peter Xu,
	Mina Almasry, Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon

On Thu, 8 Apr 2021 09:11:30 +0200 Oscar Salvador <osalvador@suse.de> wrote:

> But if it is going to be easier for Andrew, just pull them all out and I
> will resend the whole series once this work goes in.

I think so.

I shall drop these:

mmpage_alloc-bail-out-earlier-on-enomem-in-alloc_contig_migrate_range.patch
mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes.patch
mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes-fix.patch
mm-make-alloc_contig_range-handle-free-hugetlb-pages.patch
mm-make-alloc_contig_range-handle-in-use-hugetlb-pages.patch
mmpage_alloc-drop-unnecessary-checks-from-pfn_range_valid_contig.patch

and these:

mm-cma-change-cma-mutex-to-irq-safe-spinlock.patch
hugetlb-no-need-to-drop-hugetlb_lock-to-call-cma_release.patch
hugetlb-add-per-hstate-mutex-to-synchronize-user-adjustments.patch
hugetlb-create-remove_hugetlb_page-to-separate-functionality.patch
hugetlb-call-update_and_free_page-without-hugetlb_lock.patch
hugetlb-change-free_pool_huge_page-to-remove_pool_huge_page.patch
hugetlb-make-free_huge_page-irq-safe.patch
hugetlb-make-free_huge_page-irq-safe-fix.patch
hugetlb-add-lockdep_assert_held-calls-for-hugetlb_lock.patch

Along with notes-to-self that this:

  https://lkml.kernel.org/r/YGwnPCPaq1xKh/88@hirez.programming.kicks-ass.net

might need attention and that this:

  hugetlb-make-free_huge_page-irq-safe.patch

might need updating.
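
[Editorial aside: the "irq safe" patches named above (for example
hugetlb-make-free_huge_page-irq-safe.patch and
mm-cma-change-cma-mutex-to-irq-safe-spinlock.patch) are about making
hugetlb page freeing callable from any context.  The sketch below is a
minimal, generic illustration of the usual kernel pattern for a lock
shared between task context and an interrupt-driven free path; every
name in it (example_lock, example_free_count, example_free_path,
example_task_path) is hypothetical and it is not the actual
mm/hugetlb.c change.]

	/*
	 * Generic illustration only -- not the real hugetlb code.
	 * A lock that can be taken from both task context and an
	 * interrupt-driven free path must disable interrupts while
	 * held; otherwise lockdep reports a SOFTIRQ-safe ->
	 * SOFTIRQ-unsafe lock ordering problem.
	 */
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_lock);		/* hypothetical lock */
	static unsigned long example_free_count;	/* hypothetical counter */

	/* May be called from any context, including the free path. */
	static void example_free_path(void)
	{
		unsigned long flags;

		spin_lock_irqsave(&example_lock, flags);
		example_free_count++;
		spin_unlock_irqrestore(&example_lock, flags);
	}

	/* Called only from task context, e.g. a sysfs handler. */
	static void example_task_path(void)
	{
		spin_lock_irq(&example_lock);
		example_free_count--;
		spin_unlock_irq(&example_lock);
	}
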

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts
  2021-04-09  5:05     ` Andrew Morton
@ 2021-04-09 20:43       ` Mike Kravetz
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Kravetz @ 2021-04-09 20:43 UTC (permalink / raw)
  To: Andrew Morton, Oscar Salvador
  Cc: linux-mm, linux-kernel, Roman Gushchin, Michal Hocko,
	Shakeel Butt, David Hildenbrand, Muchun Song, David Rientjes,
	Miaohe Lin, Peter Zijlstra, Matthew Wilcox, HORIGUCHI NAOYA,
	Aneesh Kumar K . V, Waiman Long, Peter Xu, Mina Almasry,
	Hillf Danton, Joonsoo Kim, Barry Song, Will Deacon

On 4/8/21 10:05 PM, Andrew Morton wrote:
> On Thu, 8 Apr 2021 09:11:30 +0200 Oscar Salvador <osalvador@suse.de> wrote:
> 
>> But if it is going to be easier for Andrew, just pull them all out and I
>> will resend the whole series once this work goes in.
> 
> I think so.
> 
> I shall drop these:
> 
> mmpage_alloc-bail-out-earlier-on-enomem-in-alloc_contig_migrate_range.patch
> mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes.patch
> mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes-fix.patch
> mm-make-alloc_contig_range-handle-free-hugetlb-pages.patch
> mm-make-alloc_contig_range-handle-in-use-hugetlb-pages.patch
> mmpage_alloc-drop-unnecessary-checks-from-pfn_range_valid_contig.patch
> 
> and these:
> 
> mm-cma-change-cma-mutex-to-irq-safe-spinlock.patch
> hugetlb-no-need-to-drop-hugetlb_lock-to-call-cma_release.patch
> hugetlb-add-per-hstate-mutex-to-synchronize-user-adjustments.patch
> hugetlb-create-remove_hugetlb_page-to-separate-functionality.patch
> hugetlb-call-update_and_free_page-without-hugetlb_lock.patch
> hugetlb-change-free_pool_huge_page-to-remove_pool_huge_page.patch
> hugetlb-make-free_huge_page-irq-safe.patch
> hugetlb-make-free_huge_page-irq-safe-fix.patch
> hugetlb-add-lockdep_assert_held-calls-for-hugetlb_lock.patch
> 
> Along with notes-to-self that this:
> 
>   https://lkml.kernel.org/r/YGwnPCPaq1xKh/88@hirez.programming.kicks-ass.net
> 
> might need attention and that this:
> 
>   hugetlb-make-free_huge_page-irq-safe.patch
> 
> might need updating.
> 

Thank you Andrew!

I will send a v5 shortly based on dropping the above patches.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2021-04-09 20:44 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-05 23:00 [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
2021-04-05 23:00 ` [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock Mike Kravetz
2021-04-06  9:17   ` Peter Zijlstra
2021-04-06  9:35   ` Michal Hocko
2021-04-07  9:23   ` David Hildenbrand
2021-04-07 18:24   ` Roman Gushchin
2021-04-05 23:00 ` [PATCH v4 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release Mike Kravetz
2021-04-06  7:18   ` Oscar Salvador
2021-04-07  9:26   ` David Hildenbrand
2021-04-05 23:00 ` [PATCH v4 3/8] hugetlb: add per-hstate mutex to synchronize user adjustments Mike Kravetz
2021-04-07  9:28   ` David Hildenbrand
2021-04-05 23:00 ` [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality Mike Kravetz
2021-04-06  9:56   ` Michal Hocko
2021-04-06 12:50     ` Oscar Salvador
2021-04-06 16:49     ` Mike Kravetz
2021-04-06 17:57       ` Oscar Salvador
2021-04-07  8:21       ` Michal Hocko
2021-04-06 13:41   ` Oscar Salvador
2021-04-06 20:52     ` Mike Kravetz
2021-04-06 13:44   ` Oscar Salvador
2021-04-05 23:00 ` [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock Mike Kravetz
2021-04-06  9:57   ` Michal Hocko
2021-04-07  8:27   ` Oscar Salvador
2021-04-07  9:28     ` Michal Hocko
2021-04-07  9:37       ` Oscar Salvador
2021-04-05 23:00 ` [PATCH v4 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page Mike Kravetz
2021-04-07  8:44   ` Oscar Salvador
2021-04-05 23:00 ` [PATCH v4 7/8] hugetlb: make free_huge_page irq safe Mike Kravetz
2021-04-07  9:12   ` Oscar Salvador
2021-04-07  9:33     ` Michal Hocko
2021-04-07  9:38       ` Oscar Salvador
2021-04-05 23:00 ` [PATCH v4 8/8] hugetlb: add lockdep_assert_held() calls for hugetlb_lock Mike Kravetz
2021-04-07  9:13   ` Oscar Salvador
2021-04-08  0:56 ` [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts Mike Kravetz
2021-04-08  7:11   ` Oscar Salvador
2021-04-09  5:05     ` Andrew Morton
2021-04-09 20:43       ` Mike Kravetz
