* [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes @ 2024-05-14 14:23 fdmanana 2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: fdmanana @ 2024-05-14 14:23 UTC (permalink / raw To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> Fix a bug in an error path for direct IO writes in COW mode, which can make a subsequent fsync log invalid extent items (pointing to unwritten data). The second patch is just a cleanup. Details in the change logs. Filipe Manana (2): btrfs: drop extent maps after failed COW dio write btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fs/btrfs/inode.c | 37 ++++++++++++++++++++++++++----------- 1 file changed, 26 insertions(+), 11 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] btrfs: drop extent maps after failed COW dio write 2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana @ 2024-05-14 14:23 ` fdmanana 2024-05-14 22:15 ` Qu Wenruo 2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana 2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2 siblings, 1 reply; 13+ messages in thread From: fdmanana @ 2024-05-14 14:23 UTC (permalink / raw To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> During a COW direct IO write if the call to btrfs_extract_ordered_extent() fails, we don't submit any bios, cleanup the ordered extent but we leave the extent maps we created for the write around. These extent maps point to locations on disk that were not written to, since we haven't submitted a bio for them, and they are new an in the list of modified extents. This means that if we fsync the file after, we log file extent items based on those extent maps, which are invalid since they point to unwritten locations. If a power failure happens after the fsync, on log replay we get a corrupted file range. Fix this by dropping the extent maps if btrfs_extract_ordered_extent() failed. Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit") Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/inode.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 5a1014122088..f04852e44123 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, * remaining pages is blocked on the outstanding ordered extent. */ if (iter->flags & IOMAP_WRITE) { + struct btrfs_ordered_extent *oe = dio_data->ordered; int ret; - ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered); + ret = btrfs_extract_ordered_extent(bbio, oe); if (ret) { - btrfs_finish_ordered_extent(dio_data->ordered, NULL, + /* + * If this is a COW write it means we created new extent + * maps for the range and they point to an unwritten + * location since we got an error and we don't submit + * a bio. We must drop any extent maps within the range, + * otherwise a fast fsync would log them and after a + * crash and log replay we would have file extent items + * that point to unwritten locations (garbage). + */ + if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) { + const u64 start = oe->file_offset; + const u64 end = start + oe->num_bytes - 1; + + btrfs_drop_extent_map_range(bbio->inode, start, end, false); + } + + btrfs_finish_ordered_extent(oe, NULL, file_offset, dip->bytes, !ret); bio->bi_status = errno_to_blk_status(ret); -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] btrfs: drop extent maps after failed COW dio write 2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana @ 2024-05-14 22:15 ` Qu Wenruo 2024-05-15 9:47 ` Filipe Manana 0 siblings, 1 reply; 13+ messages in thread From: Qu Wenruo @ 2024-05-14 22:15 UTC (permalink / raw To: fdmanana, linux-btrfs 在 2024/5/14 23:53, fdmanana@kernel.org 写道: > From: Filipe Manana <fdmanana@suse.com> > > During a COW direct IO write if the call to btrfs_extract_ordered_extent() > fails, we don't submit any bios, cleanup the ordered extent but we leave > the extent maps we created for the write around. > > These extent maps point to locations on disk that were not written to, > since we haven't submitted a bio for them, and they are new an in the list > of modified extents. > > This means that if we fsync the file after, we log file extent items based > on those extent maps, which are invalid since they point to unwritten > locations. If a power failure happens after the fsync, on log replay we > get a corrupted file range. > > Fix this by dropping the extent maps if btrfs_extract_ordered_extent() > failed. > > Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Just a small question inlined below. > --- > fs/btrfs/inode.c | 21 +++++++++++++++++++-- > 1 file changed, 19 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 5a1014122088..f04852e44123 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, > * remaining pages is blocked on the outstanding ordered extent. > */ > if (iter->flags & IOMAP_WRITE) { > + struct btrfs_ordered_extent *oe = dio_data->ordered; > int ret; > > - ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered); > + ret = btrfs_extract_ordered_extent(bbio, oe); > if (ret) { > - btrfs_finish_ordered_extent(dio_data->ordered, NULL, > + /* > + * If this is a COW write it means we created new extent > + * maps for the range and they point to an unwritten > + * location since we got an error and we don't submit > + * a bio. We must drop any extent maps within the range, > + * otherwise a fast fsync would log them and after a > + * crash and log replay we would have file extent items > + * that point to unwritten locations (garbage). > + */ Is regular read also being affected? Since the em is already inserted, and we're doing DIO, there should be no page cache for the whole dio range (as long as we didn't fallback to buffered IO). In that case, the problem is not only limited to fsync, but also any read. Thanks, Qu > + if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) { > + const u64 start = oe->file_offset; > + const u64 end = start + oe->num_bytes - 1; > + > + btrfs_drop_extent_map_range(bbio->inode, start, end, false); > + } > + > + btrfs_finish_ordered_extent(oe, NULL, > file_offset, dip->bytes, > !ret); > bio->bi_status = errno_to_blk_status(ret); ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] btrfs: drop extent maps after failed COW dio write 2024-05-14 22:15 ` Qu Wenruo @ 2024-05-15 9:47 ` Filipe Manana 0 siblings, 0 replies; 13+ messages in thread From: Filipe Manana @ 2024-05-15 9:47 UTC (permalink / raw To: Qu Wenruo; +Cc: linux-btrfs On Tue, May 14, 2024 at 11:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > 在 2024/5/14 23:53, fdmanana@kernel.org 写道: > > From: Filipe Manana <fdmanana@suse.com> > > > > During a COW direct IO write if the call to btrfs_extract_ordered_extent() > > fails, we don't submit any bios, cleanup the ordered extent but we leave > > the extent maps we created for the write around. > > > > These extent maps point to locations on disk that were not written to, > > since we haven't submitted a bio for them, and they are new an in the list > > of modified extents. > > > > This means that if we fsync the file after, we log file extent items based > > on those extent maps, which are invalid since they point to unwritten > > locations. If a power failure happens after the fsync, on log replay we > > get a corrupted file range. > > > > Fix this by dropping the extent maps if btrfs_extract_ordered_extent() > > failed. > > > > Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > Reviewed-by: Qu Wenruo <wqu@suse.com> > > Just a small question inlined below. > > --- > > fs/btrfs/inode.c | 21 +++++++++++++++++++-- > > 1 file changed, 19 insertions(+), 2 deletions(-) > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 5a1014122088..f04852e44123 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, > > * remaining pages is blocked on the outstanding ordered extent. > > */ > > if (iter->flags & IOMAP_WRITE) { > > + struct btrfs_ordered_extent *oe = dio_data->ordered; > > int ret; > > > > - ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered); > > + ret = btrfs_extract_ordered_extent(bbio, oe); > > if (ret) { > > - btrfs_finish_ordered_extent(dio_data->ordered, NULL, > > + /* > > + * If this is a COW write it means we created new extent > > + * maps for the range and they point to an unwritten > > + * location since we got an error and we don't submit > > + * a bio. We must drop any extent maps within the range, > > + * otherwise a fast fsync would log them and after a > > + * crash and log replay we would have file extent items > > + * that point to unwritten locations (garbage). > > + */ > > Is regular read also being affected? No, it isn't, I'll explain below. > > Since the em is already inserted, and we're doing DIO, there should be > no page cache for the whole dio range (as long as we didn't fallback to > buffered IO). > > In that case, the problem is not only limited to fsync, but also any read. Nop, it has nothing to do with not being buffered IO. So what happens in this particular error path is that on error we call btrfs_finish_ordered_extent() with "uptodate" false. This sets the IO_ERR flag on the ordered extent and then queues the ordered extent completion, which results in executing btrfs_finish_one_ordered() in a work queue. There we check the error flag on the ordered extent, and in fact we drop extent maps in range - exactly what this patch is doing. The problem is that once we unlock the inode in the dio write path, another task can call fsync. If the fsync is a "fast fsync", it does not wait for ordered extents to complete before going over new extent maps. So if the work queue job only finishes after the fsync logs extent maps, we get the corruption - a log tree with file extent items pointing to unwritten extents. For the read path this doesn't happen. Because for reads, when they start we lock the extent range and wait for any ordered extents in the range to complete (we call btrfs_lock_and_flush_ordered_range()), which will result in dropping the extent maps in the range when the ordered extents complete. These are probably a lot of non-obvious details. So I'll send a v2 with these details in the changelog and comment. I also noticed there's at least one more error path with the same problem, so I'll move the call to drop extent maps to btrfs_finish_ordered_extent(). I'm giving a full fstests run with that updated version and sending it out later. Thanks. > > Thanks, > Qu > > + if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) { > > + const u64 start = oe->file_offset; > > + const u64 end = start + oe->num_bytes - 1; > > + > > + btrfs_drop_extent_map_range(bbio->inode, start, end, false); > > + } > > + > > + btrfs_finish_ordered_extent(oe, NULL, > > file_offset, dip->bytes, > > !ret); > > bio->bi_status = errno_to_blk_status(ret); ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation 2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana @ 2024-05-14 14:23 ` fdmanana 2024-05-14 22:23 ` Qu Wenruo 2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2 siblings, 1 reply; 13+ messages in thread From: fdmanana @ 2024-05-14 14:23 UTC (permalink / raw To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> Refactor btrfs_dio_submit_io() to avoid so much nesting and the need to split long lines, making it a bit easier to read. Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/inode.c | 52 +++++++++++++++++++++++------------------------- 1 file changed, 25 insertions(+), 27 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f04852e44123..c7f0239c3b68 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7869,6 +7869,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, struct btrfs_dio_private *dip = container_of(bbio, struct btrfs_dio_private, bbio); struct btrfs_dio_data *dio_data = iter->private; + struct btrfs_ordered_extent *oe = dio_data->ordered; + int ret; btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info, btrfs_dio_end_io, bio->bi_private); @@ -7880,6 +7882,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, dio_data->submitted += bio->bi_iter.bi_size; + if (!(iter->flags & IOMAP_WRITE)) + goto submit_bio; /* * Check if we are doing a partial write. If we are, we need to split * the ordered extent to match the submitted bio. Hang on to the @@ -7887,37 +7891,31 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, * cancelled in iomap_end to avoid a deadlock wherein faulting the * remaining pages is blocked on the outstanding ordered extent. */ - if (iter->flags & IOMAP_WRITE) { - struct btrfs_ordered_extent *oe = dio_data->ordered; - int ret; - - ret = btrfs_extract_ordered_extent(bbio, oe); - if (ret) { - /* - * If this is a COW write it means we created new extent - * maps for the range and they point to an unwritten - * location since we got an error and we don't submit - * a bio. We must drop any extent maps within the range, - * otherwise a fast fsync would log them and after a - * crash and log replay we would have file extent items - * that point to unwritten locations (garbage). - */ - if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) { - const u64 start = oe->file_offset; - const u64 end = start + oe->num_bytes - 1; + ret = btrfs_extract_ordered_extent(bbio, oe); + if (!ret) + goto submit_bio; - btrfs_drop_extent_map_range(bbio->inode, start, end, false); - } + /* + * If this is a COW write it means we created new extent maps for the + * range and they point to an unwritten location since we got an error + * and we don't submit a bio. We must drop any extent maps within the + * range, otherwise a fast fsync would log them and after a crash and + * log replay we would have file extent items that point to unwritten + * locations (garbage). + */ + if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) { + const u64 start = oe->file_offset; + const u64 end = start + oe->num_bytes - 1; - btrfs_finish_ordered_extent(oe, NULL, - file_offset, dip->bytes, - !ret); - bio->bi_status = errno_to_blk_status(ret); - iomap_dio_bio_end_io(bio); - return; - } + btrfs_drop_extent_map_range(bbio->inode, start, end, false); } + btrfs_finish_ordered_extent(oe, NULL, file_offset, dip->bytes, false); + bio->bi_status = errno_to_blk_status(ret); + iomap_dio_bio_end_io(bio); + return; + +submit_bio: btrfs_submit_bio(bbio, 0); } -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation 2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana @ 2024-05-14 22:23 ` Qu Wenruo 0 siblings, 0 replies; 13+ messages in thread From: Qu Wenruo @ 2024-05-14 22:23 UTC (permalink / raw To: fdmanana, linux-btrfs 在 2024/5/14 23:53, fdmanana@kernel.org 写道: > From: Filipe Manana <fdmanana@suse.com> > > Refactor btrfs_dio_submit_io() to avoid so much nesting and the need to > split long lines, making it a bit easier to read. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > --- > fs/btrfs/inode.c | 52 +++++++++++++++++++++++------------------------- > 1 file changed, 25 insertions(+), 27 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index f04852e44123..c7f0239c3b68 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7869,6 +7869,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, > struct btrfs_dio_private *dip = > container_of(bbio, struct btrfs_dio_private, bbio); > struct btrfs_dio_data *dio_data = iter->private; > + struct btrfs_ordered_extent *oe = dio_data->ordered; > + int ret; > > btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info, > btrfs_dio_end_io, bio->bi_private); > @@ -7880,6 +7882,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, > > dio_data->submitted += bio->bi_iter.bi_size; > > + if (!(iter->flags & IOMAP_WRITE)) > + goto submit_bio; > /* > * Check if we are doing a partial write. If we are, we need to split > * the ordered extent to match the submitted bio. Hang on to the > @@ -7887,37 +7891,31 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, > * cancelled in iomap_end to avoid a deadlock wherein faulting the > * remaining pages is blocked on the outstanding ordered extent. > */ > - if (iter->flags & IOMAP_WRITE) { > - struct btrfs_ordered_extent *oe = dio_data->ordered; > - int ret; > - > - ret = btrfs_extract_ordered_extent(bbio, oe); > - if (ret) { > - /* > - * If this is a COW write it means we created new extent > - * maps for the range and they point to an unwritten > - * location since we got an error and we don't submit > - * a bio. We must drop any extent maps within the range, > - * otherwise a fast fsync would log them and after a > - * crash and log replay we would have file extent items > - * that point to unwritten locations (garbage). > - */ > - if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) { > - const u64 start = oe->file_offset; > - const u64 end = start + oe->num_bytes - 1; > + ret = btrfs_extract_ordered_extent(bbio, oe); > + if (!ret) > + goto submit_bio; > > - btrfs_drop_extent_map_range(bbio->inode, start, end, false); > - } > + /* > + * If this is a COW write it means we created new extent maps for the > + * range and they point to an unwritten location since we got an error > + * and we don't submit a bio. We must drop any extent maps within the > + * range, otherwise a fast fsync would log them and after a crash and > + * log replay we would have file extent items that point to unwritten > + * locations (garbage). > + */ > + if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) { > + const u64 start = oe->file_offset; > + const u64 end = start + oe->num_bytes - 1; > > - btrfs_finish_ordered_extent(oe, NULL, > - file_offset, dip->bytes, > - !ret); > - bio->bi_status = errno_to_blk_status(ret); > - iomap_dio_bio_end_io(bio); > - return; > - } > + btrfs_drop_extent_map_range(bbio->inode, start, end, false); > } > > + btrfs_finish_ordered_extent(oe, NULL, file_offset, dip->bytes, false); > + bio->bi_status = errno_to_blk_status(ret); > + iomap_dio_bio_end_io(bio); > + return; > + > +submit_bio: > btrfs_submit_bio(bbio, 0); > } > ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes 2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana 2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana @ 2024-05-15 18:51 ` fdmanana 2024-05-15 18:51 ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana ` (2 more replies) 2 siblings, 3 replies; 13+ messages in thread From: fdmanana @ 2024-05-15 18:51 UTC (permalink / raw To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> Fix a bug in an error path for direct IO writes in COW mode, which can make a subsequent fsync log invalid extent items (pointing to unwritten data). The second patch is just a cleanup. Details in the change logs. V2: Rework solution since other error paths caused the same problem, make it more generic. Added more details to change log and comment about what's going on, and why reads aren't affected. Filipe Manana (2): btrfs: immediately drop extent maps after failed COW write btrfs: make btrfs_finish_ordered_extent() return void fs/btrfs/ordered-data.c | 30 ++++++++++++++++++++++++++++-- fs/btrfs/ordered-data.h | 2 +- 2 files changed, 29 insertions(+), 3 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write 2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana @ 2024-05-15 18:51 ` fdmanana 2024-05-15 22:02 ` Qu Wenruo 2024-05-15 18:51 ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana 2024-05-17 16:28 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba 2 siblings, 1 reply; 13+ messages in thread From: fdmanana @ 2024-05-15 18:51 UTC (permalink / raw To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> If a write path in COW mode fails, either before submitting a bio for the new extents or an actual IO error happens, we can end up allowing a fast fsync to log file extent items that point to unwritten extents. This is because the ordered extent completion for a failed write is executed in a work queue. This means that once the write path unlocks the inode, a fast fsync can come and log the extent maps created by the write attempt before the work queue completes the ordered extent. For example consider a direct IO write, in COW mode, that fails at btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an error: 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter set to false, meaning an error happened; 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR flag; 3) btrfs_finish_ordered_extent() queues the completion of the ordered extent - so that btrfs_finish_one_ordered() will be executed later in a work queue. That function will drop extents maps in the range when it's executed, since the extent maps point to unwritten locations (signaled by the BTRFS_ORDERED_IOERR flag); 4) After calling btrfs_finish_ordered_extent() we keep going down the write path and unlock the inode; 5) After that a fast fsync starts and locks the inode; 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync task sees the extent maps that point to the unwritten locations and logs file extent items based on them - it does not know they are unwritten, and the fast fsync path does not wait for ordered extents to complete in order to reduce latency. So to fix this make btrfs_finish_ordered_extent() drop the extent maps in the range if an error happened for a COW write. Note that this issues of using extent maps that point to unwritten locations can not happen for reads, because in read paths we start by locking the extent range and wait for any ordered extents in the range to complete before looking for extent maps. Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/ordered-data.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 304d94f6d29b..3a3f21da6eb7 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate); spin_unlock_irqrestore(&inode->ordered_tree_lock, flags); + /* + * If this is a COW write it means we created new extent maps for the + * range and they point to an unwritten location if we got an error + * either before submitting a bio or during IO. + * + * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we + * are queuing its completion below. During completion, at + * btrfs_finish_one_ordered(), we will drop the extent maps for the + * unwritten extents. + * + * However because completion runs in a work queue we can end up + * unlocking the inode before the ordered extent is completed. + * + * That means that a fast fsync can happen before the work queue + * executes the completion of the ordered extent, and in that case + * the fsync will use the extent maps that point to unwritten extents, + * resulting in logging file extent items that point to unwritten + * locations. Unlike read paths, a fast fsync doesn't wait for ordered + * extent completion before proceeding (intentional to reduce latency). + * + * To be safe drop the new extent maps in the range (if are doing COW) + * right here before we unlock the inode and allow a fsync to run. + */ + if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) + btrfs_drop_extent_map_range(inode, file_offset, + file_offset + len - 1, false); + if (ret) btrfs_queue_ordered_fn(ordered); return ret; -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write 2024-05-15 18:51 ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana @ 2024-05-15 22:02 ` Qu Wenruo 0 siblings, 0 replies; 13+ messages in thread From: Qu Wenruo @ 2024-05-15 22:02 UTC (permalink / raw To: fdmanana, linux-btrfs 在 2024/5/16 04:21, fdmanana@kernel.org 写道: > From: Filipe Manana <fdmanana@suse.com> > > If a write path in COW mode fails, either before submitting a bio for the > new extents or an actual IO error happens, we can end up allowing a fast > fsync to log file extent items that point to unwritten extents. > > This is because the ordered extent completion for a failed write is > executed in a work queue. This means that once the write path unlocks the > inode, a fast fsync can come and log the extent maps created by the write > attempt before the work queue completes the ordered extent. > > For example consider a direct IO write, in COW mode, that fails at > btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an > error: > > 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter > set to false, meaning an error happened; > > 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR > flag; > > 3) btrfs_finish_ordered_extent() queues the completion of the ordered > extent - so that btrfs_finish_one_ordered() will be executed later in > a work queue. That function will drop extents maps in the range when > it's executed, since the extent maps point to unwritten locations > (signaled by the BTRFS_ORDERED_IOERR flag); > > 4) After calling btrfs_finish_ordered_extent() we keep going down the > write path and unlock the inode; > > 5) After that a fast fsync starts and locks the inode; > > 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync > task sees the extent maps that point to the unwritten locations and > logs file extent items based on them - it does not know they are > unwritten, and the fast fsync path does not wait for ordered extents > to complete in order to reduce latency. > > So to fix this make btrfs_finish_ordered_extent() drop the extent maps > in the range if an error happened for a COW write. > > Note that this issues of using extent maps that point to unwritten > locations can not happen for reads, because in read paths we start by > locking the extent range and wait for any ordered extents in the range > to complete before looking for extent maps. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks for the detailed explanation on the issue. Thanks, Qu > --- > fs/btrfs/ordered-data.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c > index 304d94f6d29b..3a3f21da6eb7 100644 > --- a/fs/btrfs/ordered-data.c > +++ b/fs/btrfs/ordered-data.c > @@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, > ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate); > spin_unlock_irqrestore(&inode->ordered_tree_lock, flags); > > + /* > + * If this is a COW write it means we created new extent maps for the > + * range and they point to an unwritten location if we got an error > + * either before submitting a bio or during IO. > + * > + * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we > + * are queuing its completion below. During completion, at > + * btrfs_finish_one_ordered(), we will drop the extent maps for the > + * unwritten extents. > + * > + * However because completion runs in a work queue we can end up > + * unlocking the inode before the ordered extent is completed. > + * > + * That means that a fast fsync can happen before the work queue > + * executes the completion of the ordered extent, and in that case > + * the fsync will use the extent maps that point to unwritten extents, > + * resulting in logging file extent items that point to unwritten > + * locations. Unlike read paths, a fast fsync doesn't wait for ordered > + * extent completion before proceeding (intentional to reduce latency). > + * > + * To be safe drop the new extent maps in the range (if are doing COW) > + * right here before we unlock the inode and allow a fsync to run. > + */ > + if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) > + btrfs_drop_extent_map_range(inode, file_offset, > + file_offset + len - 1, false); > + > if (ret) > btrfs_queue_ordered_fn(ordered); > return ret; ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void 2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2024-05-15 18:51 ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana @ 2024-05-15 18:51 ` fdmanana 2024-05-15 22:03 ` Qu Wenruo 2024-05-17 16:28 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba 2 siblings, 1 reply; 13+ messages in thread From: fdmanana @ 2024-05-15 18:51 UTC (permalink / raw To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> Currently btrfs_finish_ordered_extent() returns a boolean indicating if the ordered extent was added to the work queue for completion, but none of its callers cares about it, so make it return void. Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/ordered-data.c | 3 +-- fs/btrfs/ordered-data.h | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 3a3f21da6eb7..3766804decb8 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -374,7 +374,7 @@ static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered) btrfs_queue_work(wq, &ordered->work); } -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, struct page *page, u64 file_offset, u64 len, bool uptodate) { @@ -417,7 +417,6 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, if (ret) btrfs_queue_ordered_fn(ordered); - return ret; } /* diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index b6f6c6b91732..bef22179e7c5 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -162,7 +162,7 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent); void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry); void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode, struct btrfs_ordered_extent *entry); -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, struct page *page, u64 file_offset, u64 len, bool uptodate); void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void 2024-05-15 18:51 ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana @ 2024-05-15 22:03 ` Qu Wenruo 0 siblings, 0 replies; 13+ messages in thread From: Qu Wenruo @ 2024-05-15 22:03 UTC (permalink / raw To: fdmanana, linux-btrfs 在 2024/5/16 04:21, fdmanana@kernel.org 写道: > From: Filipe Manana <fdmanana@suse.com> > > Currently btrfs_finish_ordered_extent() returns a boolean indicating if > the ordered extent was added to the work queue for completion, but none > of its callers cares about it, so make it return void. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > --- > fs/btrfs/ordered-data.c | 3 +-- > fs/btrfs/ordered-data.h | 2 +- > 2 files changed, 2 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c > index 3a3f21da6eb7..3766804decb8 100644 > --- a/fs/btrfs/ordered-data.c > +++ b/fs/btrfs/ordered-data.c > @@ -374,7 +374,7 @@ static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered) > btrfs_queue_work(wq, &ordered->work); > } > > -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, > +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, > struct page *page, u64 file_offset, u64 len, > bool uptodate) > { > @@ -417,7 +417,6 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, > > if (ret) > btrfs_queue_ordered_fn(ordered); > - return ret; > } > > /* > diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h > index b6f6c6b91732..bef22179e7c5 100644 > --- a/fs/btrfs/ordered-data.h > +++ b/fs/btrfs/ordered-data.h > @@ -162,7 +162,7 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent); > void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry); > void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode, > struct btrfs_ordered_extent *entry); > -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, > +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, > struct page *page, u64 file_offset, u64 len, > bool uptodate); > void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes 2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2024-05-15 18:51 ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana 2024-05-15 18:51 ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana @ 2024-05-17 16:28 ` David Sterba 2024-05-17 16:54 ` Filipe Manana 2 siblings, 1 reply; 13+ messages in thread From: David Sterba @ 2024-05-17 16:28 UTC (permalink / raw To: fdmanana; +Cc: linux-btrfs On Wed, May 15, 2024 at 07:51:45PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Fix a bug in an error path for direct IO writes in COW mode, which can make > a subsequent fsync log invalid extent items (pointing to unwritten data). > The second patch is just a cleanup. Details in the change logs. > > V2: Rework solution since other error paths caused the same problem, make > it more generic. > Added more details to change log and comment about what's going on, > and why reads aren't affected. > > Filipe Manana (2): > btrfs: immediately drop extent maps after failed COW write > btrfs: make btrfs_finish_ordered_extent() return void For the record, patches have been removed from for-next as the new code does NOFS allocation in irq context (reproduced by btrfs/146 and btrfs/160). ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes 2024-05-17 16:28 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba @ 2024-05-17 16:54 ` Filipe Manana 0 siblings, 0 replies; 13+ messages in thread From: Filipe Manana @ 2024-05-17 16:54 UTC (permalink / raw To: dsterba; +Cc: linux-btrfs On Fri, May 17, 2024 at 5:29 PM David Sterba <dsterba@suse.cz> wrote: > > On Wed, May 15, 2024 at 07:51:45PM +0100, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > Fix a bug in an error path for direct IO writes in COW mode, which can make > > a subsequent fsync log invalid extent items (pointing to unwritten data). > > The second patch is just a cleanup. Details in the change logs. > > > > V2: Rework solution since other error paths caused the same problem, make > > it more generic. > > Added more details to change log and comment about what's going on, > > and why reads aren't affected. > > > > Filipe Manana (2): > > btrfs: immediately drop extent maps after failed COW write > > btrfs: make btrfs_finish_ordered_extent() return void > > For the record, patches have been removed from for-next as the new code > does NOFS allocation in irq context (reproduced by btrfs/146 and > btrfs/160). Superseded by: https://lore.kernel.org/linux-btrfs/cover.1715964570.git.fdmanana@suse.com/ > ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2024-05-17 16:55 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana 2024-05-14 22:15 ` Qu Wenruo 2024-05-15 9:47 ` Filipe Manana 2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana 2024-05-14 22:23 ` Qu Wenruo 2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana 2024-05-15 18:51 ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana 2024-05-15 22:02 ` Qu Wenruo 2024-05-15 18:51 ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana 2024-05-15 22:03 ` Qu Wenruo 2024-05-17 16:28 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba 2024-05-17 16:54 ` Filipe Manana
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.