From: Dai Ngo <dai.ngo@oracle.com>
To: Trond Myklebust <trondmy@kernel.org>, anna@kernel.org
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH 1/1] Don't always trust server inode size while client holds write delegation
Date: Wed, 15 Apr 2026 18:43:41 -0700 [thread overview]
Message-ID: <32195c82-24f6-4e83-9818-b3d981b4c3ce@oracle.com> (raw)
In-Reply-To: <42bbd9bd83e0887c469a555bc5c76ec06aee722c.camel@kernel.org>
Hi Trond,
Thank you for your feedback.
On 4/13/26 3:19 PM, Trond Myklebust wrote:
> Hi Dai,
>
> On Wed, 2026-04-01 at 11:52 -0700, Dai Ngo wrote:
>> With a write delegation, the server permits the client to buffer
>> writes
>> and associated metadata updates (including file size changes) locally
>> and
>> defer propagating them to the server. As a result, the server-
>> reported
>> GETATTR size may legitimately lag behind the client’s cached size for
>> the
>> duration of the delegation, so it must not be treated as
>> authoritative.
>>
>> This patch modifies nfs_wcc_update_inode to update the cached inode
>> size
>> only when the client does not hold a write delegation or the
>> cache_validity
>> of the nfs_inode has the NFS_INO_INVALID_SIZE bit set.
>>
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>> fs/nfs/inode.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 4786343eeee0..21161ebbd953 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -1649,8 +1649,11 @@ static void nfs_wcc_update_inode(struct inode
>> *inode, struct nfs_fattr *fattr)
>> && (fattr->valid & NFS_ATTR_FATTR_SIZE)
>> && i_size_read(inode) ==
>> nfs_size_to_loff_t(fattr->pre_size)
>> && !nfs_have_writebacks(inode)) {
>> - trace_nfs_size_wcc(inode, fattr->size);
>> - i_size_write(inode, nfs_size_to_loff_t(fattr-
>>> size));
>> + if ((!nfs_have_write_delegation(inode)) ||
>> + (NFS_I(inode)->cache_validity &
>> NFS_INO_INVALID_SIZE)) {
>> + trace_nfs_size_wcc(inode, fattr->size);
>> + i_size_write(inode,
>> nfs_size_to_loff_t(fattr->size));
>> + }
>> }
>> }
>>
> Under what circumstances are you seeing this making a difference?
I see file corruption when running the git test with v4.2 and pNFS
with SCSI layout:
# GIT_TEST_EXT_CHAIN_LINT=0 && export GIT_TEST_EXT_CHAIN_LINT && time make -j100
rm -f -r 'test-results'
diff --git a/chainlinttmp/expect b/chainlinttmp/actual
index 178e6e4..8370f45 100644
Binary files a/chainlinttmp/expect and b/chainlinttmp/actual differ
make: *** [Makefile:83: check-chainlint] Error 1
real 0m10.935s
user 0m10.977s
sys 0m10.425s
#
> Please see the call to nfs_writeback_check_extend() in
> nfs_writeback_update_inode() which is supposed to prevent the issue
> you're talking about.
in nfs_writeback_update_inode(), if the inode has delegated mtime
then nfs_writeback_check_extend is not called.
This is the sequence that leads to the file corruption:
. thread 1 writes 2 bytes to offset 4094 of a file.
. inode size grows to 4096.
. thread1 writes 32 bytes to offset 4096 of the same file.
. inode size grows to 4128.
. thread2 calls nfs_getattr on the same file.
. nfs_getattr checks if it needs up-to-date ctime/mtime/change.
If client has nfs_have_delegated_mtime it calls filemap_fdatawrite
to flush dirty page-cache data to the server asynchronously.
nfs_writepages breaks dirty pages into 2 folios, one for each page.
. thread2 calls bl_write_pagelist to write dirty page 0 to the DS.
. pNFS FLUSH/WRITE of page 0 to DS done.
kworker updates pnfs_layout_hdr.plh_lwb to 4096.
. thread2 calls bl_write_pagelist to write dirty page 1 to the DS.
. thread2 then sends LAYOUTCOMMIT/GETATTR and uses pnfs_layout_hdr.plh_lwb
(4096) to set lastbytewritten which is used by the server to update
the remote inode size which is then returned in the GETATTR reply.
. pNFS FLUSH/WRITE of page 1 to DS done.
kworker updates pnfs_layout_hdr.plh_lwb to 4128.
. thread2 calls nfs_wcc_update_inode to update inode size to 4096 as
returned by the GETATTR while the local inode size is 4128.
. thread2 then writes 77 bytes to offset 4128. Since the inode size is
now 4096 nfs_truncate_last_folio is called to zero out the 1st 32
bytes of the page. This causes the file to be corrupted.
-Dai
>
prev parent reply other threads:[~2026-04-16 1:43 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-01 18:52 [PATCH 1/1] Don't always trust server inode size while client holds write delegation Dai Ngo
2026-04-13 22:19 ` Trond Myklebust
2026-04-16 1:43 ` Dai Ngo [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=32195c82-24f6-4e83-9818-b3d981b4c3ce@oracle.com \
--to=dai.ngo@oracle.com \
--cc=anna@kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=trondmy@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).