Linux-NFS Archive mirror
 help / color / mirror / Atom feed
From: Dai Ngo <dai.ngo@oracle.com>
To: Trond Myklebust <trondmy@kernel.org>, anna@kernel.org
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH 1/1] Don't always trust server inode size while client holds write delegation
Date: Wed, 15 Apr 2026 18:43:41 -0700	[thread overview]
Message-ID: <32195c82-24f6-4e83-9818-b3d981b4c3ce@oracle.com> (raw)
In-Reply-To: <42bbd9bd83e0887c469a555bc5c76ec06aee722c.camel@kernel.org>

Hi Trond,

Thank you for your feedback.

On 4/13/26 3:19 PM, Trond Myklebust wrote:
> Hi Dai,
>
> On Wed, 2026-04-01 at 11:52 -0700, Dai Ngo wrote:
>> With a write delegation, the server permits the client to buffer
>> writes
>> and associated metadata updates (including file size changes) locally
>> and
>> defer propagating them to the server. As a result, the server-
>> reported
>> GETATTR size may legitimately lag behind the client’s cached size for
>> the
>> duration of the delegation, so it must not be treated as
>> authoritative.
>>
>> This patch modifies nfs_wcc_update_inode to update the cached inode
>> size
>> only when the client does not hold a write delegation or the
>> cache_validity
>> of the nfs_inode has the NFS_INO_INVALID_SIZE bit set.
>>
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/nfs/inode.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 4786343eeee0..21161ebbd953 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -1649,8 +1649,11 @@ static void nfs_wcc_update_inode(struct inode
>> *inode, struct nfs_fattr *fattr)
>>   			&& (fattr->valid & NFS_ATTR_FATTR_SIZE)
>>   			&& i_size_read(inode) ==
>> nfs_size_to_loff_t(fattr->pre_size)
>>   			&& !nfs_have_writebacks(inode)) {
>> -		trace_nfs_size_wcc(inode, fattr->size);
>> -		i_size_write(inode, nfs_size_to_loff_t(fattr-
>>> size));
>> +		if ((!nfs_have_write_delegation(inode)) ||
>> +			(NFS_I(inode)->cache_validity &
>> NFS_INO_INVALID_SIZE)) {
>> +			trace_nfs_size_wcc(inode, fattr->size);
>> +			i_size_write(inode,
>> nfs_size_to_loff_t(fattr->size));
>> +		}
>>   	}
>>   }
>>   
> Under what circumstances are you seeing this making a difference?

I see file corruption when running the git test with v4.2 and pNFS
with SCSI layout:

# GIT_TEST_EXT_CHAIN_LINT=0 && export GIT_TEST_EXT_CHAIN_LINT && time make -j100
rm -f -r 'test-results'
diff --git a/chainlinttmp/expect b/chainlinttmp/actual
index 178e6e4..8370f45 100644
Binary files a/chainlinttmp/expect and b/chainlinttmp/actual differ
make: *** [Makefile:83: check-chainlint] Error 1
                                                  
real    0m10.935s
user    0m10.977s
sys     0m10.425s
#

> Please see the call to nfs_writeback_check_extend() in
> nfs_writeback_update_inode() which is supposed to prevent the issue
> you're talking about.

in nfs_writeback_update_inode(), if the inode has delegated mtime
then nfs_writeback_check_extend is not called.

This is the sequence that leads to the file corruption:

. thread 1 writes 2 bytes to offset 4094 of a file.
. inode size grows to 4096.

. thread1  writes 32 bytes to offset 4096 of the same file.
. inode size grows to 4128.

. thread2 calls nfs_getattr on the same file.
. nfs_getattr checks if it needs up-to-date ctime/mtime/change.
   If client has nfs_have_delegated_mtime it calls filemap_fdatawrite
   to flush dirty page-cache data to the server asynchronously.

   nfs_writepages breaks dirty pages into 2 folios, one for each page.

. thread2 calls bl_write_pagelist to write dirty page 0 to the DS.

. pNFS FLUSH/WRITE of page 0 to DS done.
   kworker updates pnfs_layout_hdr.plh_lwb to 4096.

. thread2 calls bl_write_pagelist to write dirty page 1 to the DS.

. thread2 then sends LAYOUTCOMMIT/GETATTR and uses pnfs_layout_hdr.plh_lwb
   (4096) to set lastbytewritten which is used by the server to update
   the remote inode size which is then returned in the GETATTR reply.
                                 
. pNFS FLUSH/WRITE of page 1 to DS done.
   kworker updates pnfs_layout_hdr.plh_lwb to 4128.

. thread2 calls nfs_wcc_update_inode to update inode size to 4096 as
   returned by the GETATTR while the local inode size is 4128.

. thread2 then writes 77 bytes to offset 4128. Since the inode size is
   now 4096 nfs_truncate_last_folio is called to zero out the 1st 32
   bytes of the page. This causes the file to be corrupted.

-Dai

>

      reply	other threads:[~2026-04-16  1:43 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-01 18:52 [PATCH 1/1] Don't always trust server inode size while client holds write delegation Dai Ngo
2026-04-13 22:19 ` Trond Myklebust
2026-04-16  1:43   ` Dai Ngo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=32195c82-24f6-4e83-9818-b3d981b4c3ce@oracle.com \
    --to=dai.ngo@oracle.com \
    --cc=anna@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).