From: John Hubbard <jhubbard@nvidia.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-rdma@vger.kernel.org, linux-mm@kvack.org,
John Hubbard <jhubbard@nvidia.com>,
Mike Marciniszyn <mike.marciniszyn@intel.com>,
Leon Romanovsky <leon@kernel.org>,
Artemy Kovalyov <artemyko@nvidia.com>,
Michael Guralnik <michaelgur@nvidia.com>,
Alistair Popple <apopple@nvidia.com>,
Pak Markthub <pmarkthub@nvidia.com>
Subject: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
Date: Tue, 30 Apr 2024 17:31:17 -0700 [thread overview]
Message-ID: <20240501003117.257735-1-jhubbard@nvidia.com> (raw)

pin_user_pages*() occasionally fails because migrate_pages() fails when it
encounters a temporarily elevated folio refcount. This happens because, a
few years ago, the pin_user_pages*() APIs were upgraded to automatically
migrate pages away from ZONE_MOVABLE, but the callers were never upgraded
to handle migration failures. In fact, they cannot easily do so anyway,
because the migration return code is filtered out: -EAGAIN failures from
migration are squashed, along with every other failure, into -ENOMEM,
hiding the details from upper layer callers.
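
To illustrate that filtering, here is a small, hypothetical userspace
sketch (not kernel code): old_filter() models the pre-patch behavior of
collapsing every migrate_pages() failure into -ENOMEM, while new_filter()
models the proposed behavior of letting -EAGAIN through:

```c
#include <assert.h>
#include <errno.h>

/* Pre-patch behavior: every nonzero migration result collapses into -ENOMEM. */
static int old_filter(int migrate_rc)
{
        return migrate_rc ? -ENOMEM : 0;
}

/* Proposed behavior: -EAGAIN survives, so callers can choose to retry. */
static int new_filter(int migrate_rc)
{
        if (migrate_rc == 0 || migrate_rc == -EAGAIN)
                return migrate_rc;
        return -ENOMEM;
}
```

With old_filter(), a caller cannot distinguish a transient -EAGAIN from a
genuine out-of-memory condition; with new_filter(), it can.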

One failure case that we ran into recently looks like this:

pin_user_pages_fast()
  internal_get_user_pages_fast()
    __gup_longterm_locked()
      check_and_migrate_movable_pages()
        migrate_longterm_unpinnable_pages()
          migrate_pages()
            migrate_pages_batch(..., NR_MAX_MIGRATE_PAGES_RETRY == 10)
              migrate_folio_move()
                move_to_new_folio()
                  migrate_folio()
                    migrate_folio_extra()
                      folio_migrate_mapping()
                        if (folio_ref_count(folio) != expected_count)
                                return -EAGAIN;

                      // ...and then migrate_longterm_unpinnable_pages()
                      // translates -EAGAIN into -ENOMEM
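
The decisive check at the bottom of that call stack can be modeled in a
standalone sketch (hypothetical names and types, not the kernel's actual
struct folio): migration bails out with -EAGAIN whenever the folio's
refcount does not exactly match the expected count, which is precisely
what a transient extra reference causes:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct folio: only the refcount matters here. */
struct folio_sketch {
        int refcount;
};

/*
 * Models the compare in folio_migrate_mapping(): if anyone holds an
 * unexpected extra reference, this migration attempt fails with -EAGAIN.
 */
static int folio_migrate_mapping_sketch(const struct folio_sketch *folio,
                                        int expected_count)
{
        if (folio->refcount != expected_count)
                return -EAGAIN;
        return 0;       /* the mapping would be moved here */
}
```

A refcount elevated even momentarily at the wrong time is enough to make
the whole migrate_pages() attempt fail.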

Although I have not yet pinpointed the root cause of these transient
refcount increases, they are common enough (and expected by the overall
design) that I believe we have enough information to proceed directly to
a fix. This patch shows my preferred solution, which does the following:
a) Restore the -EAGAIN return code: pass it, unchanged, all the way back
   to the pin_user_pages*() callers.

b) Retry pin_user_pages_fast() from ib_umem_get(), because that IB driver
   is exhibiting real failures in the field, and we have confirmed that a
   retry at this location fixes them. Retrying at this higher level (as
   compared to the pre-existing, low-level retry in migrate_pages_batch())
   allows more things to happen, and more time to elapse, between retries.
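
The retry policy in (b) can be sketched outside the kernel like this
(fake_pin_user_pages_fast() and the failure counts are illustrative
stand-ins, not the real API): retry only on -EAGAIN, a bounded number of
times, and pass every other result straight through:

```c
#include <assert.h>
#include <errno.h>

/* Number of times the fake pin call fails with -EAGAIN before succeeding. */
static int transient_failures;

/*
 * Hypothetical stand-in for pin_user_pages_fast(): models a transient
 * migration failure that clears up after a few attempts.
 */
static int fake_pin_user_pages_fast(int npages)
{
        if (transient_failures > 0) {
                transient_failures--;
                return -EAGAIN;
        }
        return npages;  /* all requested pages "pinned" */
}

/*
 * The retry shape proposed for ib_umem_get(): retry only on -EAGAIN,
 * up to MAX_ATTEMPTS times, and return every other result unchanged.
 */
static int pin_with_retry(int npages)
{
        static const int MAX_ATTEMPTS = 3;
        int attempts = 0;
        int pinned = -EAGAIN;

        while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
                pinned = fake_pin_user_pages_fast(npages);
                attempts++;
        }
        return pinned;
}
```

If the failure really is transient, a later attempt succeeds; if it
persists past MAX_ATTEMPTS, -EAGAIN is finally reported to the caller.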

Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Artemy Kovalyov <artemyko@nvidia.com>
Cc: Michael Guralnik <michaelgur@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Pak Markthub <pmarkthub@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
drivers/infiniband/core/umem.c | 33 ++++++++++++++++++++++++++-------
mm/gup.c | 14 +++++++++++---
2 files changed, 37 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..7c691faacc8a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -210,15 +210,34 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 
         while (npages) {
                 cond_resched();
-                pinned = pin_user_pages_fast(cur_base,
-                                             min_t(unsigned long, npages,
-                                                   PAGE_SIZE /
-                                                   sizeof(struct page *)),
-                                             gup_flags, page_list);
-                if (pinned < 0) {
+                pinned = -EAGAIN;
+                int attempts = 0;
+                /*
+                 * pin_user_pages_fast() can return -EAGAIN, due to falling back
+                 * to gup-slow and then failing to migrate pages out of
+                 * ZONE_MOVABLE due to a transient elevated page refcount.
+                 *
+                 * One retry is enough to avoid this problem, so far, but let's
+                 * use a slightly higher retry count just in case even larger
+                 * systems have a longer-lasting transient refcount problem.
+                 *
+                 */
+                static const int MAX_ATTEMPTS = 3;
+
+                while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
+                        pinned = pin_user_pages_fast(cur_base,
+                                                     min_t(unsigned long,
+                                                           npages, PAGE_SIZE /
+                                                           sizeof(struct page *)),
+                                                     gup_flags, page_list);
                         ret = pinned;
+                        attempts++;
+
+                        if (pinned == -EAGAIN)
+                                continue;
                 }
+                if (pinned < 0)
+                        goto umem_release;
 
                 cur_base += pinned * PAGE_SIZE;
                 npages -= pinned;
diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..edb069f937cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2141,15 +2141,23 @@ static int migrate_longterm_unpinnable_pages(
         }
 
         if (!list_empty(movable_page_list)) {
+                int rc;
                 struct migration_target_control mtc = {
                         .nid = NUMA_NO_NODE,
                         .gfp_mask = GFP_USER | __GFP_NOWARN,
                 };
 
-                if (migrate_pages(movable_page_list, alloc_migration_target,
+                rc = migrate_pages(movable_page_list, alloc_migration_target,
                                   NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-                                  MR_LONGTERM_PIN, NULL)) {
-                        ret = -ENOMEM;
+                                  MR_LONGTERM_PIN, NULL);
+                if (rc) {
+                        /*
+                         * Report any failure *except* -EAGAIN as "not enough
+                         * memory". -EAGAIN is valuable because callers further
+                         * up the call stack can benefit from a retry.
+                         */
+                        if (rc != -EAGAIN)
+                                ret = -ENOMEM;
                         goto err;
                 }
         }
base-commit: 18daea77cca626f590fb140fc11e3a43c5d41354
--
2.45.0