[PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes
@ 2024-05-14 14:23 fdmanana
  2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: fdmanana @ 2024-05-14 14:23 UTC (permalink / raw
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Fix a bug in an error path for direct IO writes in COW mode, which can make
a subsequent fsync log invalid extent items (pointing to unwritten data).
The second patch is just a cleanup. Details in the change logs.

Filipe Manana (2):
  btrfs: drop extent maps after failed COW dio write
  btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation

 fs/btrfs/inode.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] btrfs: drop extent maps after failed COW dio write
  2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
@ 2024-05-14 14:23 ` fdmanana
  2024-05-14 22:15   ` Qu Wenruo
  2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana
  2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
  2 siblings, 1 reply; 13+ messages in thread
From: fdmanana @ 2024-05-14 14:23 UTC (permalink / raw
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During a COW direct IO write if the call to btrfs_extract_ordered_extent()
fails, we don't submit any bios, cleanup the ordered extent but we leave
the extent maps we created for the write around.

These extent maps point to locations on disk that were not written to,
since we haven't submitted a bio for them, and they are new an in the list
of modified extents.

This means that if we fsync the file after, we log file extent items based
on those extent maps, which are invalid since they point to unwritten
locations. If a power failure happens after the fsync, on log replay we
get a corrupted file range.

Fix this by dropping the extent maps if btrfs_extract_ordered_extent()
failed.

Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5a1014122088..f04852e44123 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	 * remaining pages is blocked on the outstanding ordered extent.
 	 */
 	if (iter->flags & IOMAP_WRITE) {
+		struct btrfs_ordered_extent *oe = dio_data->ordered;
 		int ret;
 
-		ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
+		ret = btrfs_extract_ordered_extent(bbio, oe);
 		if (ret) {
-			btrfs_finish_ordered_extent(dio_data->ordered, NULL,
+			/*
+			 * If this is a COW write it means we created new extent
+			 * maps for the range and they point to an unwritten
+			 * location since we got an error and we don't submit
+			 * a bio. We must drop any extent maps within the range,
+			 * otherwise a fast fsync would log them and after a
+			 * crash and log replay we would have file extent items
+			 * that point to unwritten locations (garbage).
+			 */
+			if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
+				const u64 start = oe->file_offset;
+				const u64 end = start + oe->num_bytes - 1;
+
+				btrfs_drop_extent_map_range(bbio->inode, start, end, false);
+			}
+
+			btrfs_finish_ordered_extent(oe, NULL,
 						    file_offset, dip->bytes,
 						    !ret);
 			bio->bi_status = errno_to_blk_status(ret);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] btrfs: drop extent maps after failed COW dio write
  2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana
@ 2024-05-14 22:15   ` Qu Wenruo
  2024-05-15  9:47     ` Filipe Manana
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2024-05-14 22:15 UTC (permalink / raw
  To: fdmanana, linux-btrfs



在 2024/5/14 23:53, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> During a COW direct IO write if the call to btrfs_extract_ordered_extent()
> fails, we don't submit any bios, cleanup the ordered extent but we leave
> the extent maps we created for the write around.
>
> These extent maps point to locations on disk that were not written to,
> since we haven't submitted a bio for them, and they are new an in the list
> of modified extents.
>
> This means that if we fsync the file after, we log file extent items based
> on those extent maps, which are invalid since they point to unwritten
> locations. If a power failure happens after the fsync, on log replay we
> get a corrupted file range.
>
> Fix this by dropping the extent maps if btrfs_extract_ordered_extent()
> failed.
>
> Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Just a small question inlined below.
> ---
>   fs/btrfs/inode.c | 21 +++++++++++++++++++--
>   1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5a1014122088..f04852e44123 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
>   	 * remaining pages is blocked on the outstanding ordered extent.
>   	 */
>   	if (iter->flags & IOMAP_WRITE) {
> +		struct btrfs_ordered_extent *oe = dio_data->ordered;
>   		int ret;
>
> -		ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
> +		ret = btrfs_extract_ordered_extent(bbio, oe);
>   		if (ret) {
> -			btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> +			/*
> +			 * If this is a COW write it means we created new extent
> +			 * maps for the range and they point to an unwritten
> +			 * location since we got an error and we don't submit
> +			 * a bio. We must drop any extent maps within the range,
> +			 * otherwise a fast fsync would log them and after a
> +			 * crash and log replay we would have file extent items
> +			 * that point to unwritten locations (garbage).
> +			 */

Is regular read also being affected?

Since the em is already inserted, and we're doing DIO, there should be
no page cache for the whole dio range (as long as we didn't fallback to
buffered IO).

In that case, the problem is not only limited to fsync, but also any read.

Thanks,
Qu
> +			if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
> +				const u64 start = oe->file_offset;
> +				const u64 end = start + oe->num_bytes - 1;
> +
> +				btrfs_drop_extent_map_range(bbio->inode, start, end, false);
> +			}
> +
> +			btrfs_finish_ordered_extent(oe, NULL,
>   						    file_offset, dip->bytes,
>   						    !ret);
>   			bio->bi_status = errno_to_blk_status(ret);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] btrfs: drop extent maps after failed COW dio write
  2024-05-14 22:15   ` Qu Wenruo
@ 2024-05-15  9:47     ` Filipe Manana
  0 siblings, 0 replies; 13+ messages in thread
From: Filipe Manana @ 2024-05-15  9:47 UTC (permalink / raw
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 14, 2024 at 11:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/5/14 23:53, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > During a COW direct IO write if the call to btrfs_extract_ordered_extent()
> > fails, we don't submit any bios, cleanup the ordered extent but we leave
> > the extent maps we created for the write around.
> >
> > These extent maps point to locations on disk that were not written to,
> > since we haven't submitted a bio for them, and they are new an in the list
> > of modified extents.
> >
> > This means that if we fsync the file after, we log file extent items based
> > on those extent maps, which are invalid since they point to unwritten
> > locations. If a power failure happens after the fsync, on log replay we
> > get a corrupted file range.
> >
> > Fix this by dropping the extent maps if btrfs_extract_ordered_extent()
> > failed.
> >
> > Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Just a small question inlined below.
> > ---
> >   fs/btrfs/inode.c | 21 +++++++++++++++++++--
> >   1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 5a1014122088..f04852e44123 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
> >        * remaining pages is blocked on the outstanding ordered extent.
> >        */
> >       if (iter->flags & IOMAP_WRITE) {
> > +             struct btrfs_ordered_extent *oe = dio_data->ordered;
> >               int ret;
> >
> > -             ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
> > +             ret = btrfs_extract_ordered_extent(bbio, oe);
> >               if (ret) {
> > -                     btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> > +                     /*
> > +                      * If this is a COW write it means we created new extent
> > +                      * maps for the range and they point to an unwritten
> > +                      * location since we got an error and we don't submit
> > +                      * a bio. We must drop any extent maps within the range,
> > +                      * otherwise a fast fsync would log them and after a
> > +                      * crash and log replay we would have file extent items
> > +                      * that point to unwritten locations (garbage).
> > +                      */
>
> Is regular read also being affected?

No, it isn't, I'll explain below.

>
> Since the em is already inserted, and we're doing DIO, there should be
> no page cache for the whole dio range (as long as we didn't fallback to
> buffered IO).
>
> In that case, the problem is not only limited to fsync, but also any read.

Nop, it has nothing to do with not being buffered IO.

So what happens in this particular error path is that on error we call
btrfs_finish_ordered_extent() with "uptodate" false.

This sets the IO_ERR flag on the ordered extent and then queues the
ordered extent completion,
which results in executing btrfs_finish_one_ordered() in a work queue.

There we check the error flag on the ordered extent, and in fact we
drop extent maps in range - exactly what this patch is doing.

The problem is that once we unlock the inode in the dio write path,
another task can call fsync.

If the fsync is a "fast fsync", it does not wait for ordered extents
to complete before going over new extent maps.
So if the work queue job only finishes after the fsync logs extent
maps, we get the corruption - a log tree with file extent items
pointing to unwritten extents.

For the read path this doesn't happen.

Because for reads, when they start we lock the extent range and wait
for any ordered extents in the range to complete (we call
btrfs_lock_and_flush_ordered_range()),
which will result in dropping the extent maps in the range when the
ordered extents complete.

These are probably a lot of non-obvious details.
So I'll send a v2 with these details in the changelog and comment.

I also noticed there's at least one more error path with the same
problem, so I'll move the call to drop extent maps to
btrfs_finish_ordered_extent().

I'm giving a full fstests run with that updated version and sending it
out later.

Thanks.


>
> Thanks,
> Qu
> > +                     if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
> > +                             const u64 start = oe->file_offset;
> > +                             const u64 end = start + oe->num_bytes - 1;
> > +
> > +                             btrfs_drop_extent_map_range(bbio->inode, start, end, false);
> > +                     }
> > +
> > +                     btrfs_finish_ordered_extent(oe, NULL,
> >                                                   file_offset, dip->bytes,
> >                                                   !ret);
> >                       bio->bi_status = errno_to_blk_status(ret);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation
  2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
  2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana
@ 2024-05-14 14:23 ` fdmanana
  2024-05-14 22:23   ` Qu Wenruo
  2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
  2 siblings, 1 reply; 13+ messages in thread
From: fdmanana @ 2024-05-14 14:23 UTC (permalink / raw
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Refactor btrfs_dio_submit_io() to avoid so much nesting and the need to
split long lines, making it a bit easier to read.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 52 +++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f04852e44123..c7f0239c3b68 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7869,6 +7869,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	struct btrfs_dio_private *dip =
 		container_of(bbio, struct btrfs_dio_private, bbio);
 	struct btrfs_dio_data *dio_data = iter->private;
+	struct btrfs_ordered_extent *oe = dio_data->ordered;
+	int ret;
 
 	btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info,
 		       btrfs_dio_end_io, bio->bi_private);
@@ -7880,6 +7882,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 
 	dio_data->submitted += bio->bi_iter.bi_size;
 
+	if (!(iter->flags & IOMAP_WRITE))
+		goto submit_bio;
 	/*
 	 * Check if we are doing a partial write.  If we are, we need to split
 	 * the ordered extent to match the submitted bio.  Hang on to the
@@ -7887,37 +7891,31 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	 * cancelled in iomap_end to avoid a deadlock wherein faulting the
 	 * remaining pages is blocked on the outstanding ordered extent.
 	 */
-	if (iter->flags & IOMAP_WRITE) {
-		struct btrfs_ordered_extent *oe = dio_data->ordered;
-		int ret;
-
-		ret = btrfs_extract_ordered_extent(bbio, oe);
-		if (ret) {
-			/*
-			 * If this is a COW write it means we created new extent
-			 * maps for the range and they point to an unwritten
-			 * location since we got an error and we don't submit
-			 * a bio. We must drop any extent maps within the range,
-			 * otherwise a fast fsync would log them and after a
-			 * crash and log replay we would have file extent items
-			 * that point to unwritten locations (garbage).
-			 */
-			if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
-				const u64 start = oe->file_offset;
-				const u64 end = start + oe->num_bytes - 1;
+	ret = btrfs_extract_ordered_extent(bbio, oe);
+	if (!ret)
+		goto submit_bio;
 
-				btrfs_drop_extent_map_range(bbio->inode, start, end, false);
-			}
+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to an unwritten location since we got an error
+	 * and we don't submit a bio. We must drop any extent maps within the
+	 * range, otherwise a fast fsync would log them and after a crash and
+	 * log replay we would have file extent items that point to unwritten
+	 * locations (garbage).
+	 */
+	if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
+		const u64 start = oe->file_offset;
+		const u64 end = start + oe->num_bytes - 1;
 
-			btrfs_finish_ordered_extent(oe, NULL,
-						    file_offset, dip->bytes,
-						    !ret);
-			bio->bi_status = errno_to_blk_status(ret);
-			iomap_dio_bio_end_io(bio);
-			return;
-		}
+		btrfs_drop_extent_map_range(bbio->inode, start, end, false);
 	}
 
+	btrfs_finish_ordered_extent(oe, NULL, file_offset, dip->bytes, false);
+	bio->bi_status = errno_to_blk_status(ret);
+	iomap_dio_bio_end_io(bio);
+	return;
+
+submit_bio:
 	btrfs_submit_bio(bbio, 0);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation
  2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana
@ 2024-05-14 22:23   ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2024-05-14 22:23 UTC (permalink / raw
  To: fdmanana, linux-btrfs



在 2024/5/14 23:53, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> Refactor btrfs_dio_submit_io() to avoid so much nesting and the need to
> split long lines, making it a bit easier to read.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/inode.c | 52 +++++++++++++++++++++++-------------------------
>   1 file changed, 25 insertions(+), 27 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index f04852e44123..c7f0239c3b68 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7869,6 +7869,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
>   	struct btrfs_dio_private *dip =
>   		container_of(bbio, struct btrfs_dio_private, bbio);
>   	struct btrfs_dio_data *dio_data = iter->private;
> +	struct btrfs_ordered_extent *oe = dio_data->ordered;
> +	int ret;
>
>   	btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info,
>   		       btrfs_dio_end_io, bio->bi_private);
> @@ -7880,6 +7882,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
>
>   	dio_data->submitted += bio->bi_iter.bi_size;
>
> +	if (!(iter->flags & IOMAP_WRITE))
> +		goto submit_bio;
>   	/*
>   	 * Check if we are doing a partial write.  If we are, we need to split
>   	 * the ordered extent to match the submitted bio.  Hang on to the
> @@ -7887,37 +7891,31 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
>   	 * cancelled in iomap_end to avoid a deadlock wherein faulting the
>   	 * remaining pages is blocked on the outstanding ordered extent.
>   	 */
> -	if (iter->flags & IOMAP_WRITE) {
> -		struct btrfs_ordered_extent *oe = dio_data->ordered;
> -		int ret;
> -
> -		ret = btrfs_extract_ordered_extent(bbio, oe);
> -		if (ret) {
> -			/*
> -			 * If this is a COW write it means we created new extent
> -			 * maps for the range and they point to an unwritten
> -			 * location since we got an error and we don't submit
> -			 * a bio. We must drop any extent maps within the range,
> -			 * otherwise a fast fsync would log them and after a
> -			 * crash and log replay we would have file extent items
> -			 * that point to unwritten locations (garbage).
> -			 */
> -			if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
> -				const u64 start = oe->file_offset;
> -				const u64 end = start + oe->num_bytes - 1;
> +	ret = btrfs_extract_ordered_extent(bbio, oe);
> +	if (!ret)
> +		goto submit_bio;
>
> -				btrfs_drop_extent_map_range(bbio->inode, start, end, false);
> -			}
> +	/*
> +	 * If this is a COW write it means we created new extent maps for the
> +	 * range and they point to an unwritten location since we got an error
> +	 * and we don't submit a bio. We must drop any extent maps within the
> +	 * range, otherwise a fast fsync would log them and after a crash and
> +	 * log replay we would have file extent items that point to unwritten
> +	 * locations (garbage).
> +	 */
> +	if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
> +		const u64 start = oe->file_offset;
> +		const u64 end = start + oe->num_bytes - 1;
>
> -			btrfs_finish_ordered_extent(oe, NULL,
> -						    file_offset, dip->bytes,
> -						    !ret);
> -			bio->bi_status = errno_to_blk_status(ret);
> -			iomap_dio_bio_end_io(bio);
> -			return;
> -		}
> +		btrfs_drop_extent_map_range(bbio->inode, start, end, false);
>   	}
>
> +	btrfs_finish_ordered_extent(oe, NULL, file_offset, dip->bytes, false);
> +	bio->bi_status = errno_to_blk_status(ret);
> +	iomap_dio_bio_end_io(bio);
> +	return;
> +
> +submit_bio:
>   	btrfs_submit_bio(bbio, 0);
>   }
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes
  2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
  2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana
  2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana
@ 2024-05-15 18:51 ` fdmanana
  2024-05-15 18:51   ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana
                     ` (2 more replies)
  2 siblings, 3 replies; 13+ messages in thread
From: fdmanana @ 2024-05-15 18:51 UTC (permalink / raw
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Fix a bug in an error path for direct IO writes in COW mode, which can make
a subsequent fsync log invalid extent items (pointing to unwritten data).
The second patch is just a cleanup. Details in the change logs.

V2: Rework solution since other error paths caused the same problem, make
    it more generic.
    Added more details to change log and comment about what's going on,
    and why reads aren't affected.

Filipe Manana (2):
  btrfs: immediately drop extent maps after failed COW write
  btrfs: make btrfs_finish_ordered_extent() return void

 fs/btrfs/ordered-data.c | 30 ++++++++++++++++++++++++++++--
 fs/btrfs/ordered-data.h |  2 +-
 2 files changed, 29 insertions(+), 3 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write
  2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
@ 2024-05-15 18:51   ` fdmanana
  2024-05-15 22:02     ` Qu Wenruo
  2024-05-15 18:51   ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
  2024-05-17 16:28   ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba
  2 siblings, 1 reply; 13+ messages in thread
From: fdmanana @ 2024-05-15 18:51 UTC (permalink / raw
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

If a write path in COW mode fails, either before submitting a bio for the
new extents or an actual IO error happens, we can end up allowing a fast
fsync to log file extent items that point to unwritten extents.

This is because the ordered extent completion for a failed write is
executed in a work queue. This means that once the write path unlocks the
inode, a fast fsync can come and log the extent maps created by the write
attempt before the work queue completes the ordered extent.

For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
   set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
   flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered
   extent - so that btrfs_finish_one_ordered() will be executed later in
   a work queue. That function will drop extents maps in the range when
   it's executed, since the extent maps point to unwritten locations
   (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the
   write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
   task sees the extent maps that point to the unwritten locations and
   logs file extent items based on them - it does not know they are
   unwritten, and the fast fsync path does not wait for ordered extents
   to complete in order to reduce latency.

So to fix this make btrfs_finish_ordered_extent() drop the extent maps
in the range if an error happened for a COW write.

Note that this issues of using extent maps that point to unwritten
locations can not happen for reads, because in read paths we start by
locking the extent range and wait for any ordered extents in the range
to complete before looking for extent maps.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 304d94f6d29b..3a3f21da6eb7 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
 	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);

+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to an unwritten location if we got an error
+	 * either before submitting a bio or during IO.
+	 *
+	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
+	 * are queuing its completion below. During completion, at
+	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
+	 * unwritten extents.
+	 *
+	 * However because completion runs in a work queue we can end up
+	 * unlocking the inode before the ordered extent is completed.
+	 *
+	 * That means that a fast fsync can happen before the work queue
+	 * executes the completion of the ordered extent, and in that case
+	 * the fsync will use the extent maps that point to unwritten extents,
+	 * resulting in logging file extent items that point to unwritten
+	 * locations. Unlike read paths, a fast fsync doesn't wait for ordered
+	 * extent completion before proceeding (intentional to reduce latency).
+	 *
+	 * To be safe drop the new extent maps in the range (if are doing COW)
+	 * right here before we unlock the inode and allow a fsync to run.
+	 */
+	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		btrfs_drop_extent_map_range(inode, file_offset,
+					    file_offset + len - 1, false);
+
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
 	return ret;
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write
  2024-05-15 18:51   ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana
@ 2024-05-15 22:02     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2024-05-15 22:02 UTC (permalink / raw
  To: fdmanana, linux-btrfs



在 2024/5/16 04:21, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> If a write path in COW mode fails, either before submitting a bio for the
> new extents or an actual IO error happens, we can end up allowing a fast
> fsync to log file extent items that point to unwritten extents.
>
> This is because the ordered extent completion for a failed write is
> executed in a work queue. This means that once the write path unlocks the
> inode, a fast fsync can come and log the extent maps created by the write
> attempt before the work queue completes the ordered extent.
>
> For example consider a direct IO write, in COW mode, that fails at
> btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
> error:
>
> 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
>     set to false, meaning an error happened;
>
> 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
>     flag;
>
> 3) btrfs_finish_ordered_extent() queues the completion of the ordered
>     extent - so that btrfs_finish_one_ordered() will be executed later in
>     a work queue. That function will drop extents maps in the range when
>     it's executed, since the extent maps point to unwritten locations
>     (signaled by the BTRFS_ORDERED_IOERR flag);
>
> 4) After calling btrfs_finish_ordered_extent() we keep going down the
>     write path and unlock the inode;
>
> 5) After that a fast fsync starts and locks the inode;
>
> 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
>     task sees the extent maps that point to the unwritten locations and
>     logs file extent items based on them - it does not know they are
>     unwritten, and the fast fsync path does not wait for ordered extents
>     to complete in order to reduce latency.
>
> So to fix this make btrfs_finish_ordered_extent() drop the extent maps
> in the range if an error happened for a COW write.
>
> Note that this issues of using extent maps that point to unwritten
> locations can not happen for reads, because in read paths we start by
> locking the extent range and wait for any ordered extents in the range
> to complete before looking for extent maps.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks for the detailed explanation on the issue.

Thanks,
Qu
> ---
>   fs/btrfs/ordered-data.c | 27 +++++++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 304d94f6d29b..3a3f21da6eb7 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
>   	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
>   	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
>
> +	/*
> +	 * If this is a COW write it means we created new extent maps for the
> +	 * range and they point to an unwritten location if we got an error
> +	 * either before submitting a bio or during IO.
> +	 *
> +	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
> +	 * are queuing its completion below. During completion, at
> +	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
> +	 * unwritten extents.
> +	 *
> +	 * However because completion runs in a work queue we can end up
> +	 * unlocking the inode before the ordered extent is completed.
> +	 *
> +	 * That means that a fast fsync can happen before the work queue
> +	 * executes the completion of the ordered extent, and in that case
> +	 * the fsync will use the extent maps that point to unwritten extents,
> +	 * resulting in logging file extent items that point to unwritten
> +	 * locations. Unlike read paths, a fast fsync doesn't wait for ordered
> +	 * extent completion before proceeding (intentional to reduce latency).
> +	 *
> +	 * To be safe drop the new extent maps in the range (if are doing COW)
> +	 * right here before we unlock the inode and allow a fsync to run.
> +	 */
> +	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
> +		btrfs_drop_extent_map_range(inode, file_offset,
> +					    file_offset + len - 1, false);
> +
>   	if (ret)
>   		btrfs_queue_ordered_fn(ordered);
>   	return ret;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void
  2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
  2024-05-15 18:51   ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana
@ 2024-05-15 18:51   ` fdmanana
  2024-05-15 22:03     ` Qu Wenruo
  2024-05-17 16:28   ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba
  2 siblings, 1 reply; 13+ messages in thread
From: fdmanana @ 2024-05-15 18:51 UTC (permalink / raw
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently btrfs_finish_ordered_extent() returns a boolean indicating if
the ordered extent was added to the work queue for completion, but none
of its callers cares about it, so make it return void.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c | 3 +--
 fs/btrfs/ordered-data.h | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 3a3f21da6eb7..3766804decb8 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -374,7 +374,7 @@ static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered)
 	btrfs_queue_work(wq, &ordered->work);
 }
 
-bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
+void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 				 struct page *page, u64 file_offset, u64 len,
 				 bool uptodate)
 {
@@ -417,7 +417,6 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
-	return ret;
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index b6f6c6b91732..bef22179e7c5 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -162,7 +162,7 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
 void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 				struct btrfs_ordered_extent *entry);
-bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
+void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 				 struct page *page, u64 file_offset, u64 len,
 				 bool uptodate);
 void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void
  2024-05-15 18:51   ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
@ 2024-05-15 22:03     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2024-05-15 22:03 UTC (permalink / raw
  To: fdmanana, linux-btrfs



在 2024/5/16 04:21, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently btrfs_finish_ordered_extent() returns a boolean indicating if
> the ordered extent was added to the work queue for completion, but none
> of its callers cares about it, so make it return void.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/ordered-data.c | 3 +--
>   fs/btrfs/ordered-data.h | 2 +-
>   2 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 3a3f21da6eb7..3766804decb8 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -374,7 +374,7 @@ static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered)
>   	btrfs_queue_work(wq, &ordered->work);
>   }
>
> -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
> +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
>   				 struct page *page, u64 file_offset, u64 len,
>   				 bool uptodate)
>   {
> @@ -417,7 +417,6 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
>
>   	if (ret)
>   		btrfs_queue_ordered_fn(ordered);
> -	return ret;
>   }
>
>   /*
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b6f6c6b91732..bef22179e7c5 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -162,7 +162,7 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
>   void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
>   void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
>   				struct btrfs_ordered_extent *entry);
> -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
> +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
>   				 struct page *page, u64 file_offset, u64 len,
>   				 bool uptodate);
>   void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes
  2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
  2024-05-15 18:51   ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana
  2024-05-15 18:51   ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
@ 2024-05-17 16:28   ` David Sterba
  2024-05-17 16:54     ` Filipe Manana
  2 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2024-05-17 16:28 UTC (permalink / raw
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 15, 2024 at 07:51:45PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Fix a bug in an error path for direct IO writes in COW mode, which can make
> a subsequent fsync log invalid extent items (pointing to unwritten data).
> The second patch is just a cleanup. Details in the change logs.
> 
> V2: Rework solution since other error paths caused the same problem, make
>     it more generic.
>     Added more details to change log and comment about what's going on,
>     and why reads aren't affected.
> 
> Filipe Manana (2):
>   btrfs: immediately drop extent maps after failed COW write
>   btrfs: make btrfs_finish_ordered_extent() return void

For the record, patches have been removed from for-next as the new code
does NOFS allocation in irq context (reproduced by btrfs/146 and
btrfs/160).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes
  2024-05-17 16:28   ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba
@ 2024-05-17 16:54     ` Filipe Manana
  0 siblings, 0 replies; 13+ messages in thread
From: Filipe Manana @ 2024-05-17 16:54 UTC (permalink / raw
  To: dsterba; +Cc: linux-btrfs

On Fri, May 17, 2024 at 5:29 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, May 15, 2024 at 07:51:45PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Fix a bug in an error path for direct IO writes in COW mode, which can make
> > a subsequent fsync log invalid extent items (pointing to unwritten data).
> > The second patch is just a cleanup. Details in the change logs.
> >
> > V2: Rework solution since other error paths caused the same problem, make
> >     it more generic.
> >     Added more details to change log and comment about what's going on,
> >     and why reads aren't affected.
> >
> > Filipe Manana (2):
> >   btrfs: immediately drop extent maps after failed COW write
> >   btrfs: make btrfs_finish_ordered_extent() return void
>
> For the record, patches have been removed from for-next as the new code
> does NOFS allocation in irq context (reproduced by btrfs/146 and
> btrfs/160).

Superseded by:
https://lore.kernel.org/linux-btrfs/cover.1715964570.git.fdmanana@suse.com/

>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-05-17 16:55 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-14 14:23 [PATCH 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
2024-05-14 14:23 ` [PATCH 1/2] btrfs: drop extent maps after failed COW dio write fdmanana
2024-05-14 22:15   ` Qu Wenruo
2024-05-15  9:47     ` Filipe Manana
2024-05-14 14:23 ` [PATCH 2/2] btrfs: refactor btrfs_dio_submit_io() for less nesting and indentation fdmanana
2024-05-14 22:23   ` Qu Wenruo
2024-05-15 18:51 ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes fdmanana
2024-05-15 18:51   ` [PATCH v2 1/2] btrfs: immediately drop extent maps after failed COW write fdmanana
2024-05-15 22:02     ` Qu Wenruo
2024-05-15 18:51   ` [PATCH v2 2/2] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
2024-05-15 22:03     ` Qu Wenruo
2024-05-17 16:28   ` [PATCH v2 0/2] btrfs: fix a bug in the direct IO write path for COW writes David Sterba
2024-05-17 16:54     ` Filipe Manana

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.