Linux-Fsdevel Archive mirror
From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com,
	hch@infradead.org, djwong@kernel.org, willy@infradead.org,
	zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com,
	yukuai3@huawei.com, wangkefeng.wang@huawei.com
Subject: Re: [RFC PATCH v4 24/34] ext4: implement buffered write iomap path
Date: Tue, 7 May 2024 13:10:53 +0800
Message-ID: <998cae29-61c3-bb10-b05b-853edfd176b0@huaweicloud.com>
In-Reply-To: <ZjllkHuyOedA/Tzg@dread.disaster.area>

On 2024/5/7 7:19, Dave Chinner wrote:
> On Mon, May 06, 2024 at 07:44:44PM +0800, Zhang Yi wrote:
>> On 2024/5/1 16:33, Dave Chinner wrote:
>>> On Wed, May 01, 2024 at 06:11:13PM +1000, Dave Chinner wrote:
>>>> On Wed, Apr 10, 2024 at 10:29:38PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> Implement buffered write iomap path, use ext4_da_map_blocks() to map
>>>>> delalloc extents and add ext4_iomap_get_blocks() to allocate blocks if
>>>>> delalloc is disabled or free space is about to run out.
>>>>>
>>>>> Note that we always allocate unwritten extents for new blocks in the
>>>>> iomap write path, this means that the allocation type is no longer
>>>>> controlled by the dioread_nolock mount option. After that, we could
>>>>> postpone the i_disksize updating to the writeback path, and drop journal
>>>>> handle in the buffered delalloc write path completely.
>>> .....
>>>>> +/*
>>>>> + * Drop the stale delayed allocation range left behind by a write
>>>>> + * failure, including both the start and end blocks. Otherwise, we could
>>>>> + * leave a range of delayed extents covered by a clean folio, which could
>>>>> + * lead to inaccurate space reservation.
>>>>> + */
>>>>> +static int ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
>>>>> +				     loff_t length)
>>>>> +{
>>>>> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
>>>>> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>>>>>  	return 0;
>>>>>  }
>>>>>  
>>>>> +static int ext4_iomap_buffered_write_end(struct inode *inode, loff_t offset,
>>>>> +					 loff_t length, ssize_t written,
>>>>> +					 unsigned int flags,
>>>>> +					 struct iomap *iomap)
>>>>> +{
>>>>> +	handle_t *handle;
>>>>> +	loff_t end;
>>>>> +	int ret = 0, ret2;
>>>>> +
>>>>> +	/* delalloc */
>>>>> +	if (iomap->flags & IOMAP_F_EXT4_DELALLOC) {
>>>>> +		ret = iomap_file_buffered_write_punch_delalloc(inode, iomap,
>>>>> +			offset, length, written, ext4_iomap_punch_delalloc);
>>>>> +		if (ret)
>>>>> +			ext4_warning(inode->i_sb,
>>>>> +			     "Failed to clean up delalloc for inode %lu, %d",
>>>>> +			     inode->i_ino, ret);
>>>>> +		return ret;
>>>>> +	}
>>>>
>>>> Why are you creating a delalloc extent for the write operation and
>>>> then immediately deleting it from the extent tree once the write
>>>> operation is done?
>>>
>>> Ignore this, I mixed up the ext4_iomap_punch_delalloc() code
>>> directly above with iomap_file_buffered_write_punch_delalloc().
>>>
>>> In hindsight, iomap_file_buffered_write_punch_delalloc() is poorly
>>> named, as it is handling a short write situation which requires
>>> newly allocated delalloc blocks to be punched.
>>> iomap_file_buffered_write_finish() would probably be a better name
>>> for it....
>>>
>>>> Also, why do you need IOMAP_F_EXT4_DELALLOC? Isn't a delalloc iomap
>>>> set up with iomap->type = IOMAP_DELALLOC? Why can't that be used?
>>>
>>> But this still stands - the first thing
>>> iomap_file_buffered_write_punch_delalloc() does is:
>>>
>>> 	if (iomap->type != IOMAP_DELALLOC)
>>> 		return 0;
>>>
>>
>> Thanks for the suggestion, the delalloc and non-delalloc write paths
>> share the same ->iomap_end() now (i.e. ext4_iomap_buffered_write_end()),
>> I use the IOMAP_F_EXT4_DELALLOC to identify the write path.
> 
> Again, you don't need that. iomap tracks newly allocated
> IOMAP_DELALLOC extents via the IOMAP_F_NEW flag that should be
> getting set in the ->iomap_begin() call when it creates a new
> delalloc extent.
> 
> Please look at the second check in
> iomap_file_buffered_write_punch_delalloc():
> 
> 	if (iomap->type != IOMAP_DELALLOC)
> 		return 0;
> 
> 	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> 	if (!(iomap->flags & IOMAP_F_NEW))
> 		return 0;
> 
>> For the non-delalloc path, if we have allocated more blocks and copied
>> less, we should truncate the extra blocks that were newly allocated by
>> ->iomap_begin().
> 
> Why? If they were allocated as unwritten, then you can just leave
> them there as unwritten extents, same as XFS. Keep in mind that if
> we get a short write, it is extremely likely the application is
> going to rewrite the remaining data immediately, so if we allocated
> blocks they are likely to still be needed, anyway....
> 

Makes sense. We don't need to free the extra blocks beyond EOF since they
are unwritten, so we can drop this handle for the non-delalloc path on ext4
now.
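
For my own reference, here is a rough userspace model of how I read the
start of iomap_file_buffered_write_punch_delalloc(): the two checks quoted
above, followed by the computation of the byte range left unused by a short
write. All names below (model_iomap, MODEL_IOMAP_*, punch_range,
short_write_punch_range) are hypothetical stand-ins and not the kernel's
definitions; the rounding reflects my understanding of the helper and should
be checked against fs/iomap/buffered-io.c.

```c
#include <stdint.h>

/* Hypothetical userspace stand-ins; values are illustrative only. */
enum model_iomap_type {
	MODEL_IOMAP_HOLE,
	MODEL_IOMAP_DELALLOC,
	MODEL_IOMAP_MAPPED,
	MODEL_IOMAP_UNWRITTEN,
};
#define MODEL_IOMAP_F_NEW 0x1u

struct model_iomap {
	enum model_iomap_type type;
	unsigned int flags;
};

/* Half-open byte range [start, end); start == end means "nothing to punch". */
struct punch_range {
	int64_t start;
	int64_t end;
};

static int64_t round_down_i64(int64_t x, int64_t bs)
{
	return x - x % bs;
}

static int64_t round_up_i64(int64_t x, int64_t bs)
{
	return round_down_i64(x + bs - 1, bs);
}

static struct punch_range
short_write_punch_range(const struct model_iomap *iomap, int64_t pos,
			int64_t length, int64_t written, int64_t blocksize)
{
	struct punch_range none = { 0, 0 };
	struct punch_range r;

	/* Only delalloc mappings are candidates for punching. */
	if (iomap->type != MODEL_IOMAP_DELALLOC)
		return none;
	/* Blocks this write did not reserve must not be punched. */
	if (!(iomap->flags & MODEL_IOMAP_F_NEW))
		return none;

	/* First unused block after the data that was actually copied. */
	if (written <= 0)
		r.start = round_down_i64(pos, blocksize);
	else
		r.start = round_up_i64(pos + written, blocksize);
	r.end = round_up_i64(pos + length, blocksize);

	/* The whole delalloc range was written: nothing to do. */
	if (r.start >= r.end)
		return none;
	return r;
}
```

In this model no filesystem-private flag is needed: the mapping type plus
the "new" flag are enough to decide whether anything may be punched.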

>> If we use IOMAP_DELALLOC, we can't tell if the blocks are
>> pre-existing or newly allocated, we can't truncate the
>> pre-existing blocks, so I have to introduce IOMAP_F_EXT4_DELALLOC.
>> But if we split the delalloc and non-delalloc handler, we could
>> drop IOMAP_F_EXT4_DELALLOC.
> 
> As per above: IOMAP_F_NEW tells us -exactly- this.
> 
> IOMAP_F_NEW should be set on any newly allocated block - delalloc or
> real - because that's the flag that tells the iomap infrastructure
> whether zero-around is needed for partial block writes. If ext4 is
> not setting this flag on delalloc regions allocated by
> ->iomap_begin(), then that's a serious bug.
> 
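
To check that I read the zero-around point correctly, here is a small
userspace sketch. It assumes that for a newly allocated block, the parts of
the block outside the written bytes are zeroed instead of being read from
disk. The names (zero_ranges, zero_around) are hypothetical and the model
ignores folio state, i_size and everything else the real code looks at.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two half-open byte ranges to zero; start == end means an empty range. */
struct zero_ranges {
	int64_t head_start, head_end;	/* block start up to the write */
	int64_t tail_start, tail_end;	/* write end up to the block end */
};

static struct zero_ranges zero_around(int64_t pos, int64_t len,
				      int64_t blocksize, bool is_new)
{
	struct zero_ranges z = { 0, 0, 0, 0 };
	int64_t end;

	/* Pre-existing blocks hold valid data and must not be zeroed. */
	if (!is_new || len <= 0)
		return z;

	end = pos + len;
	z.head_start = pos - pos % blocksize;
	z.head_end = pos;
	z.tail_start = end;
	z.tail_end = (end + blocksize - 1) / blocksize * blocksize;
	return z;
}
```

For example, a 100-byte write at offset 1000 into a new 4096-byte block
gives a head range of [0, 1000) and a tail range of [1100, 4096), while a
block-aligned write gives two empty ranges.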
>> I also checked xfs, IIUC, xfs doesn't free the extra blocks beyond EOF
>> in xfs_buffered_write_iomap_end() for non-delalloc case since they will
>> be freed by xfs_free_eofblocks in some other inactive paths, like
>> xfs_release()/xfs_inactive()/..., is that right?
> 
> XFS doesn't care about real blocks beyond EOF existing -
> xfs_free_eofblocks() is an optimistic operation that does not
> guarantee that it will remove blocks beyond EOF. Similarly, we don't
> care about real blocks within EOF because we always allocate data
> extents as unwritten, so we don't have any stale data exposure
> issues to worry about on short writes leaving allocated blocks
> behind.
> 
> OTOH, delalloc extents without dirty page cache pages over them
> cannot be allowed to exist. Without dirty pages, there is no trigger
> to convert those to real extents (i.e. nothing to write back). Hence
> the only sane thing that can be done with them on a write error or
> short write is remove them in the context where they were created.
> 
> This is the only reason that the
> iomap_file_buffered_write_punch_delalloc() exists - it abstracts
> this nasty corner case away from filesystems that support delalloc
> so they don't have to worry about getting this right. That's the whole
> point of having delalloc aware infrastructure - individual
> filesystems don't need to handle all these weird corner cases
> themselves because the infrastructure takes care of them...
> 

Yeah, thanks for the explanation. iomap_file_buffered_write_punch_delalloc()
is very useful: it finds the pages that still have dirty data pending in the
page cache and punches out all the delalloc blocks except those under that
dirty data. I realized that it was introduced to fix a race with either
writeback or mmap page faults that xfs encountered [1].
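
A simplified userspace sketch of that idea, as I understand it: walk the
range left over by the short write and punch every gap that is not covered
by dirty data. The dirty ranges are passed in as a sorted array here; the
real helper scans the page cache instead. All names (byte_range, punch_fn,
punch_log, record_punch, punch_around_dirty) are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct byte_range {
	int64_t start;	/* inclusive */
	int64_t end;	/* exclusive */
};

typedef void (*punch_fn)(int64_t start, int64_t end, void *ctx);

/* Small recorder so callers can inspect which ranges were punched. */
struct punch_log {
	size_t count;
	struct byte_range r[8];
};

static void record_punch(int64_t start, int64_t end, void *ctx)
{
	struct punch_log *log = ctx;

	if (log->count < 8) {
		log->r[log->count].start = start;
		log->r[log->count].end = end;
	}
	log->count++;
}

/*
 * Punch every part of [start, end) that is not covered by dirty data.
 * 'dirty' must be sorted by start offset and must not overlap.
 * Returns the number of punch calls issued.
 */
static size_t punch_around_dirty(int64_t start, int64_t end,
				 const struct byte_range *dirty, size_t ndirty,
				 punch_fn punch, void *ctx)
{
	size_t punched = 0;
	int64_t pos = start;
	size_t i;

	for (i = 0; i < ndirty && pos < end; i++) {
		if (dirty[i].end <= pos)
			continue;	/* entirely before the cursor */
		if (dirty[i].start >= end)
			break;		/* entirely after the range */
		if (dirty[i].start > pos) {
			punch(pos, dirty[i].start, ctx);
			punched++;
		}
		pos = dirty[i].end;	/* skip over the dirty data */
	}
	if (pos < end) {
		punch(pos, end, ctx);
		punched++;
	}
	return punched;
}
```

With one dirty range [4096, 8192) inside [0, 16384), this punches
[0, 4096) and [8192, 16384) and leaves the blocks under the dirty data
alone, so writeback can still convert them.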

We will meet the same problem for ext3 and ext2, which are not extent based.
Their newly allocated blocks are written rather than unwritten, so we need to
free them if we get a short write. But we can't simply do that through
ext2_write_failed() and ext4_truncate_failed_write(); we still need to use
iomap_file_buffered_write_punch_delalloc().

[1] https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/

Thanks,
Yi.

