Linux-NFS Archive mirror
 help / color / mirror / Atom feed
From: Jon Curley <jcurley@purestorage.com>
To: Trond Myklebust <trondmy@kernel.org>
Cc: linux-nfs@vger.kernel.org
Subject: Re: NFS close does not block when server is unavailable
Date: Mon, 6 Apr 2026 15:49:50 -0700	[thread overview]
Message-ID: <CAHeb9+=3d+yErB+a0rhW0v+n1kV348E+8veznYszXzkkbOMVJg@mail.gmail.com> (raw)
In-Reply-To: <5b6c25a9f1d8d0f8c9a1fe8f7bf83f749aeacc20.camel@kernel.org>

On Mon, Apr 6, 2026 at 1:58 PM Trond Myklebust <trondmy@kernel.org> wrote:
>
> Hi Jon,
>
> On Mon, 2026-04-06 at 08:36 -0700, Jon Curley wrote:
> > Hi Trond,
> >
> > We've recently been investigating test failures in our automation
> > linked to server failures. What we've noticed is that failures during
> > WRITE do not block close(). Close simply returns success even though
> > the writes are in a failure loop.
> >
> > This is easy to reproduce using flexfiles. For example, the flexfile
> > RFC states that returning EACCES is a mechanism for fencing files. If
> > you create a server where writes always return that error, commands
> > like dd will return success while the writes are in a RESET_TO_PNFS
> > loop.
> >
> > I'm not trying to get into an O_PONIES argument here but I always
> > thought NFS close-to-open semantics meant that NFS close makes a
> > strict effort to flush writes to the server before finishing.
> >
> > I've been digging into how the NFS client guarantees writes get
> > flushed. I noticed that nfs/file.c:nfs_file_fsync goes through a much
> > more rigorous algorithm to flush writes in the face of server
> > unavailability when compared to nfs/file.c:nfs_file_flush. The
> > redirtied_pages interlock is especially critical for ensuring
> > RESET_TO_PNFS events block the syscall from returning.
> >
> > At first I thought this was intentional. But after looking at the
> > history, I found this commit:
> >
> > "NFS: Don't inadvertently clear writeback errors" -
> > https://github.com/torvalds/linux/commit/aded8d7b54f250af6deb72fde475291cfba513d1
> >
> > That diff replaced calls to "vfs_fsync" with "nfs_wb_all" in several
> > places including nfs_file_flush. Since the diff makes no mention of
> > reducing the guarantees of nfs_file_flush & etc I'm wondering if this
> > effect was unintentional?
> >
> > Does it make sense to modify nfs_wb_all to have a loop like
> > nfs_file_fsync does? Or introduce a new function?
> >
>
> We've always had the rule that errors that are classified as "fatal" on
> the server should be reported as such through the same mechanisms as
> they would in POSIX, and that the effects should be the same. While
> "EACCES" is not technically allowed by POSIX as a response to WRITE, it
> was necessary to allow it in NFS because NFS servers do check
> permissions on I/O. Even in NFSv4, the presence of a stateid that
> passed OPEN checks does not guarantee that the client is allowed to
> write out its data to the file.
>
> Now that said, the pNFS flexfiles protocol introduces a subtlety here
> in that the client should not be considering any error that was
> returned by a data server as being fatal. Instead, it should be
> reporting the error back to the metadata server, and either failing
> over to another mirror (for the case of a READ) or returning the layout
> and asking for a new layout (for the case of a WRITE or if there are no
> more mirrors to READ from). The LAYOUTGET may then return the fatal
> error, if the metadata server is out of options for fixing the problem.
>
> So the above governs how hard mounts handle write errors.
>
> Now the difference between how fsync() and close() work is really down
> to whether or not we need to give a hard guarantee that the data is
> really on disk or not when the syscall completes.
>
> fsync() definitely needs to guarantee that data is really on storage in
> the case where it doesn't flag an error. The problem is that a system
> that relies on resends, and that doesn't block new writes from
> occurring is inherently prone to live locks. We try to handle that in
> nfs_file_fsync() by monitoring a 'redirtied_pages' counter that tells
> us how many resends occurred in the last attempt to flush, however that
> doesn't suffice to completely avoid live locks.
>
> close() does not need to guarantee that the data is on storage, because
> POSIX doesn't require it to. It's just a convenience that ensures
> close-to-open semantics can be made to work. For that reason, it just
> calls nfs_wb_all(), which does not wait for resends to finish, and thus
> avoids live locks.
>
> Now we can definitely discuss whether or not that close() behaviour is
> correct: it does not lose data (so there are no hard mount violations),
> but it also doesn't offer the same guarantees that fsync() does w.r.t.
> whether the data is on disk on exit. I just wanted to first make sure
> that we're on the same page concerning hard mount semantics.

Right, I agree that POSIX doesn't require persistence to storage after
close. If this were simply a matter of persistence I think I might be
satisfied with that answer. My concern is incorrect application
behavior & compatibility.

NFS introduces additional complications because of its distributed
nature. Here is an example from our automated testing. The test
involved writing to a file and then calling truncate from a single
client:

1. Write to the file, extending it to size N. (dd write to file)
2. Assert the file size equals N. (stat file)
3. Truncate the file to size M. M < N (ftruncate file)
4. Assert that the file size equals M. (stat file)

This test would occasionally fail at step 4 in our automation. The
file size will equal the written size N. We discovered that because
the close flush did not finish the writes in step 1, the client
reordered writeback after the setattr in step 3. This resulted in an
application failure regardless of data persistence. Reordering
operations to storage caused incorrect behavior. I suspect this
violates POSIX semantics?

NFS client as currently implemented seems to require strict close
semantics to achieve correct application behavior. Other cases can be
imagined with applications that require close-to-open semantics when
coordinating updates among multiple clients. I already hear plenty of
complaints that NFS with default mount settings isn't POSIX compatible
because writes aren't visible to other clients after the write syscall
completes. Would we start telling people that the rule is really
fsync-close-to-open?

      reply	other threads:[~2026-04-06 22:50 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-06 15:36 NFS close does not block when server is unavailable Jon Curley
2026-04-06 20:58 ` Trond Myklebust
2026-04-06 22:49   ` Jon Curley [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHeb9+=3d+yErB+a0rhW0v+n1kV348E+8veznYszXzkkbOMVJg@mail.gmail.com' \
    --to=jcurley@purestorage.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).