Linux-NFS Archive mirror
From: Nikolay Amiantov <ab@fmap.me>
To: linux-nfs@vger.kernel.org
Subject: Help tracking down a possible race
Date: Thu, 16 Apr 2026 00:24:09 +0700
Message-ID: <42bcb43b-5782-4353-b3f7-6e68d919f3c6@fmap.me>

Hi all,

tl;dr: weird test results in 
https://github.com/abbradar/nfs_stale_cache_test possibly showing a race 
in NFS, or that I'm stupid :)

Disclaimer: I don't normally use NFS at all; I ended up here while
chasing a bug in a FUSE-based network FS (JuiceFS [1]). I'm in way over
my head (this is my first time with the VFS/FUSE/NFS kernel subsystems,
though I have some prior kernelspace experience), so I may be
misunderstanding things. Feel free to send me away if I'm missing
something obvious.

The race I'm talking about goes like this:
* On host A, a writer appends to a file. In the MRE it just writes
0xAA one byte at a time;
* On host B, we concurrently:
   + Read the file, also one byte at a time, possibly from multiple
processes/threads;
   + Hammer the same file with stat() calls.

In this setup the reader on host B occasionally gets a zero byte
instead of the expected 0xAA.
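
For illustration, the host B side could be sketched roughly like this.
This is not the code from the repository; the function names
scan_for_stale_byte() and hammer_stat() are made up for this mail, and
the only thing taken from the description above is the 0xAA pattern and
the byte-at-a-time reads:

```c
#define _XOPEN_SOURCE 700
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define EXPECTED_BYTE 0xAA /* the pattern the writer on host A appends */

/*
 * Read fd one byte at a time, starting at *offset.
 * Returns  1 if a byte other than EXPECTED_BYTE was read (*offset points at it),
 *          0 if EOF was reached without a mismatch (*offset is the EOF position),
 *         -1 on I/O error.
 */
int scan_for_stale_byte(int fd, off_t *offset)
{
    unsigned char byte;

    for (;;) {
        ssize_t n = pread(fd, &byte, 1, *offset);
        if (n < 0)
            return -1;
        if (n == 0)
            return 0;
        if (byte != EXPECTED_BYTE)
            return 1;
        (*offset)++;
    }
}

/*
 * Call stat() on path the given number of times, so that attribute
 * refreshes can run concurrently with the reads above.
 * Returns the number of calls that succeeded.
 */
unsigned long hammer_stat(const char *path, unsigned long iterations)
{
    struct stat st;
    unsigned long ok = 0;

    for (unsigned long i = 0; i < iterations; i++) {
        if (stat(path, &st) == 0)
            ok++;
    }
    return ok;
}
```

In a real run, the scan would be called again after each EOF while the
file keeps growing, and hammer_stat() would run in a separate thread or
process on the same file.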

After hunting this bug in JuiceFS, I went down to the FUSE level and 
managed to implement an MRE [2]. The setup is similar, only instead of a 
writer there is a FUSE FS presenting a slowly growing file.

After much (disclosure: LLM-assisted) research of the kernel code, my
understanding is that the race is relevant to *any* network FS where
updates can bypass the local cache layer. I checked whether it happens
with NFS, and indeed, I can randomly observe zero bytes:
https://github.com/abbradar/nfs_stale_cache_test . I'm repeating my
understanding of the issue here for convenience:

------

When a file grows remotely, the cached page containing the old EOF is
zero-filled beyond the old size. Those zeroes are harmless as long as
the inode size stays at or below the old size (they lie beyond EOF),
but become stale once the inode size is updated to reflect the remote
growth: the remote host wrote real data there, but the local cache
still has the old zero-fill.

In filemap_read() (mm/filemap.c) we have:

```
do {
     ...
     error = filemap_get_pages(iocb, ...);  // (1) get cached folios
     ...
     isize = i_size_read(inode);            // (2) get file size
     ...
     // (3) copy from folio to user, capped at isize
} while (...);
```

If the inode size grows between (1) and (2), the race triggers: the
copy from the old page is capped at the new size, so userspace reads
zeroes where there should be actual data.
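
To make the effect concrete, here is a small userspace model of steps
(1)-(3). It is only a sketch, not kernel code: struct model_page,
model_fill_page() and model_copy_capped() are hypothetical names, and a
single 4096-byte page is assumed.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096 /* assumed page size for this model */

/* Stand-in for one cached folio. */
struct model_page {
    unsigned char data[MODEL_PAGE_SIZE];
};

/*
 * Model of step (1): the page as it looks after being read in while the
 * file was old_size bytes long: real data (0xAA) up to old_size, then
 * zero-fill up to the end of the page.
 */
void model_fill_page(struct model_page *page, size_t old_size)
{
    if (old_size > MODEL_PAGE_SIZE)
        old_size = MODEL_PAGE_SIZE;
    memset(page->data, 0, sizeof(page->data));
    memset(page->data, 0xAA, old_size);
}

/*
 * Model of step (3): copy from the cached page to the user buffer,
 * capped at isize, where isize is whatever step (2) observed.
 * Returns the number of bytes copied.
 */
size_t model_copy_capped(const struct model_page *page, size_t pos,
                         size_t isize, unsigned char *dst, size_t len)
{
    size_t limit = isize < MODEL_PAGE_SIZE ? isize : MODEL_PAGE_SIZE;
    size_t n;

    if (pos >= limit)
        return 0;
    n = limit - pos;
    if (n > len)
        n = len;
    memcpy(dst, page->data + pos, n);
    return n;
}
```

If the page is filled with old_size = 100 and the copy then runs with
isize = 110 (the size grew between the two steps), the copy returns 110
bytes and bytes 100..109 are the stale zero-fill rather than the data
the remote writer put there.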

To trigger this bug, the inode size must change in parallel with a
read, and the change must not come from a local `write()`, since local
writes are coherent with reads via the cache layer. In a network FS
this may happen on getattr, when we discover that the remote file has
grown and update the inode's size. When this happens we need to mark
the cached pages as stale, but there is no way to "lock" the page and
the inode size simultaneously, so the race cannot be fixed just by
invalidating the cache in getattr.

NFS already marks the cache as stale: it sets NFS_INO_INVALID_DATA and
then invalidates the cache as needed before the next read. However, the
window between (1) and (2) is still there.

------

Apart from the patches introducing NFS_INO_INVALIDATING I can't find any 
prior discussion of this issue for FUSE-based, NFS, CIFS or any other 
network FS.

I'd be glad for any help refuting or confirming my findings, or
generally pointing me in the right direction.

Cheers,
Nikolay.

1: https://github.com/juicedata/juicefs/issues/5038
2: https://github.com/abbradar/fuse_growtest