From: John Hubbard <jhubbard@nvidia.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-rdma@vger.kernel.org, linux-mm@kvack.org,
	John Hubbard <jhubbard@nvidia.com>,
	Mike Marciniszyn <mike.marciniszyn@intel.com>,
	Leon Romanovsky <leon@kernel.org>,
	Artemy Kovalyov <artemyko@nvidia.com>,
	Michael Guralnik <michaelgur@nvidia.com>,
	Alistair Popple <apopple@nvidia.com>,
	Pak Markthub <pmarkthub@nvidia.com>
Subject: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches
Date: Tue, 30 Apr 2024 17:31:17 -0700
Message-ID: <20240501003117.257735-1-jhubbard@nvidia.com>

pin_user_pages*() occasionally fails due to migrate_pages() failures
that, in turn, are due to temporarily elevated folio refcounts. This
happens because a few years ago, pin_user_pages*() APIs were upgraded to
automatically migrate pages away from ZONE_MOVABLE, but the callers were
not upgraded to handle any migration failures. And in fact, they can't
easily do so anyway, because the migration return code was filtered out:
-EAGAIN failures from migration are squashed, along with any other
failure, into -ENOMEM, thus hiding details from the upper layer callers.
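
To make that information loss concrete, here is a small userspace toy
model (all names here are invented for illustration; this is not kernel
code): a "migration" step can fail with the retryable -EAGAIN, but a
squashing wrapper folds every failure into -ENOMEM, so the caller can no
longer tell a retryable condition from a real allocation failure.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a migration step that can fail transiently. */
int toy_migrate(int transiently_busy)
{
	return transiently_busy ? -EAGAIN : 0;
}

/* Pre-patch behavior: every migration failure becomes -ENOMEM. */
int toy_pin_squashed(int transiently_busy)
{
	int rc = toy_migrate(transiently_busy);

	return rc ? -ENOMEM : 0;
}

/* Patched behavior: -EAGAIN survives, so callers can choose to retry. */
int toy_pin_propagated(int transiently_busy)
{
	int rc = toy_migrate(transiently_busy);

	if (rc && rc != -EAGAIN)
		return -ENOMEM;
	return rc;
}
```

With the squashing wrapper, a caller seeing -ENOMEM has no way to know
that a simple retry would very likely have succeeded.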

One failure case that we ran into recently looks like this:

pin_user_pages_fast()
  internal_get_user_pages_fast()
    __gup_longterm_locked()
      check_and_migrate_movable_pages()
        migrate_longterm_unpinnable_pages()
          migrate_pages()
            migrate_pages_batch(..., NR_MAX_MIGRATE_PAGES_RETRY==10)
              migrate_folio_move()
                  move_to_new_folio()
                    migrate_folio()
                      migrate_folio_extra()
                        folio_migrate_mapping()
                          if (folio_ref_count(folio) != expected_count)
                            return -EAGAIN;
                              // ...and migrate_longterm_unpinnable_pages()
                              // translates -EAGAIN to -ENOMEM
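
The refcount gate at the bottom of that call chain can be modeled with a
few lines of illustrative userspace code (the struct and names here are
invented; this is not the kernel implementation): migration only proceeds
when the folio's refcount is exactly what the migration path expects, and
any transient extra reference produces -EAGAIN.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for a folio: only the refcount matters here. */
struct toy_folio {
	int refcount;
};

/* Modeled on the refcount check in folio_migrate_mapping(). */
int toy_folio_migrate_mapping(struct toy_folio *folio, int expected_count)
{
	if (folio->refcount != expected_count)
		return -EAGAIN;	/* transient elevated refcount: retry later */

	/* ...the real function would transfer the mapping here... */
	return 0;
}
```

Because any other code path can briefly take a reference, this check is
inherently racy by design, which is why -EAGAIN is the natural result.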

Although I have not yet pinpointed the cause of these transient refcount
increases, they are sufficiently common (and anticipated by the overall
design) that I think we have enough information to proceed directly to a
fix. This patch shows my preferred solution, which does the following:

a) Restore the -EAGAIN return code: pass it unchanged all the way back
to pin_user_pages*() callers.

b) Then, retry pin_user_pages_fast() from ib_umem_get(), because that IB
driver is hitting real failures in the field, and we've confirmed that a
retry at this location fixes them. Retrying at this higher level (as
compared to the pre-existing, low-level retry in migrate_pages_batch())
allows more things to happen, and more time to elapse, between retries.
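
The shape of that caller-level retry can be sketched in userspace as
follows (illustrative names only; toy_pin() stands in for
pin_user_pages_fast(), and the real patch lives in ib_umem_get()): retry
only on -EAGAIN, and give up after a small bounded number of attempts.

```c
#include <assert.h>
#include <errno.h>

/* Fails transiently a caller-chosen number of times, then succeeds. */
int toy_pin(int *transient_failures_left)
{
	if (*transient_failures_left > 0) {
		(*transient_failures_left)--;
		return -EAGAIN;	/* transient migration failure */
	}
	return 32;		/* pretend we pinned 32 pages */
}

/* Bounded retry loop, mirroring the logic the patch adds. */
int toy_pin_with_retries(int transient_failures, int max_attempts)
{
	int pinned = -EAGAIN;
	int attempts = 0;

	while (pinned == -EAGAIN && attempts < max_attempts) {
		pinned = toy_pin(&transient_failures);
		attempts++;
	}
	return pinned;
}
```

Note that the loop variable must start out as -EAGAIN so that the first
attempt runs at all; if the failure outlasts the retry budget, -EAGAIN
is returned to the caller rather than being disguised as -ENOMEM.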

Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Artemy Kovalyov <artemyko@nvidia.com>
Cc: Michael Guralnik <michaelgur@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Pak Markthub <pmarkthub@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c | 33 ++++++++++++++++++++++++++-------
 mm/gup.c                       | 14 +++++++++++---
 2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..7c691faacc8a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -210,15 +210,34 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 
 	while (npages) {
 		cond_resched();
-		pinned = pin_user_pages_fast(cur_base,
-					  min_t(unsigned long, npages,
-						PAGE_SIZE /
-						sizeof(struct page *)),
-					  gup_flags, page_list);
-		if (pinned < 0) {
+		pinned = -EAGAIN;
+		int attempts = 0;
+		/*
+		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
+		 * to gup-slow and then failing to migrate pages out of
+		 * ZONE_MOVABLE due to a transient elevated page refcount.
+		 *
+		 * One retry is enough to avoid this problem, so far, but let's
+		 * use a slightly higher retry count just in case even larger
+		 * systems have a longer-lasting transient refcount problem.
+		 *
+		 */
+		static const int MAX_ATTEMPTS = 3;
+
+		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
+			pinned = pin_user_pages_fast(cur_base,
+						     min_t(unsigned long,
+							npages, PAGE_SIZE /
+							sizeof(struct page *)),
+						     gup_flags, page_list);
 			ret = pinned;
-			goto umem_release;
+			attempts++;
+
+			if (pinned == -EAGAIN)
+				continue;
 		}
+		if (pinned < 0)
+			goto umem_release;
 
 		cur_base += pinned * PAGE_SIZE;
 		npages -= pinned;
diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..edb069f937cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2141,15 +2141,23 @@ static int migrate_longterm_unpinnable_pages(
 	}
 
 	if (!list_empty(movable_page_list)) {
+		int rc;
 		struct migration_target_control mtc = {
 			.nid = NUMA_NO_NODE,
 			.gfp_mask = GFP_USER | __GFP_NOWARN,
 		};
 
-		if (migrate_pages(movable_page_list, alloc_migration_target,
+		rc = migrate_pages(movable_page_list, alloc_migration_target,
 				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-				  MR_LONGTERM_PIN, NULL)) {
-			ret = -ENOMEM;
+				  MR_LONGTERM_PIN, NULL);
+		if (rc) {
+			/*
+			 * Report any failure *except* -EAGAIN as "not enough
+			 * memory". -EAGAIN is valuable because callers further
+			 * up the call stack can benefit from a retry.
+			 */
+			ret = (rc == -EAGAIN) ? -EAGAIN : -ENOMEM;
+
 			goto err;
 		}
 	}

base-commit: 18daea77cca626f590fb140fc11e3a43c5d41354
-- 
2.45.0



Thread overview: 12+ messages
2024-05-01  0:31 John Hubbard [this message]
2024-05-01  5:10 ` [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches Christoph Hellwig
2024-05-01 12:10   ` Jason Gunthorpe
2024-05-01 17:32     ` John Hubbard
2024-05-02  1:05     ` Alistair Popple
2024-05-02  6:49       ` John Hubbard
2024-05-02  6:56       ` David Hildenbrand
2024-05-02 18:10         ` John Hubbard
2024-05-02 18:34           ` Jason Gunthorpe
2024-05-02 18:44             ` Matthew Wilcox
2024-05-10  7:54           ` David Hildenbrand
2024-05-11  0:32             ` Kasireddy, Vivek
