[PATCH v3 00/21] block atomic writes for XFS

* [PATCH v3 00/21] block atomic writes for XFS
@ 2024-04-29 17:47 John Garry
  2024-04-29 17:47 ` [PATCH v3 01/21] fs: Add generic_atomic_write_valid_size() John Garry
                   ` (20 more replies)
  0 siblings, 21 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

This series expands atomic write support to filesystems, specifically
XFS. Extent alignment is based on the new forcealign feature.

Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.

XFS can be formatted for atomic writes as follows:
mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1  /dev/sda

atomic-writes=1 just enables atomic writes in the SB, but does not auto-
enable atomic writes for each file.

Support can be enabled per file through the xfs_io command:
$xfs_io -c "lsattr -v" filename
[extsize, force-align]
$xfs_io -c "extsize" filename
[16384] filename
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[extsize, force-align, atomic-writes] filename
$xfs_io -c statx filename
...
stat.stx_atomic_write_unit_min = 4096
stat.stx_atomic_write_unit_max = 16384
stat.stx_atomic_write_segments_max = 1
...

A couple of patches are marked as RFC, as I am not fully confident in
them. Indeed, testing shows an issue where I get ENOSPC at what appears
to be moderate FS usage, like ~60%, even though spaceman reports
appropriately-sized free extents. When I disable all fallocate punch
calls in my testing, the issue goes away. More details are available
upon request.

Stripe alignment testing for the forcealign changes is noted at
https://lore.kernel.org/linux-xfs/083f3d88-cd39-41ef-9ee1-cafe04a96cf9@oracle.com/

Baseline is the following series (which is based on v6.9-rc1):
https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/

Basic xfsprogs support at:
https://github.com/johnpgarry/xfsprogs-dev/tree/forcealign_and_atomicwrites_for_v3_xfs_block_atomic_writes

Patches for this series can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.9-v6-fs-v3/

Changes since v2:
https://lore.kernel.org/linux-xfs/20240304130428.13026-1-john.g.garry@oracle.com/
- Incorporate forcealign patches from
  https://lore.kernel.org/linux-xfs/20240402233006.1210262-1-david@fromorbit.com/
- Put bdev awu min and max in buftarg
- Extra forcealign patches to deal with truncate and fallocate punch,
  insert, collapse
- Add generic_atomic_write_valid_size()
- Change iomap.extent_shift -> .extent_size

Changes since v1:
https://lore.kernel.org/linux-xfs/20240124142645.9334-1-john.g.garry@oracle.com/
- Add blk_validate_atomic_write_op_size() (Darrick suggested idea)
- Swap forcealign for rtvol support (Dave requested forcealign)
- Sub-extent DIO zeroing (Dave wanted rid of XFS_BMAPI_ZERO usage)
- Improve coding for XFS statx support (Darrick, Ojaswin)
- Improve conditions for setting FMODE_CAN_ATOMIC_WRITE (Darrick)
- Improve commit message for FS_XFLAG_ATOMICWRITES flag (Darrick)
- Validate atomic writes in xfs_file_dio_write()
- Drop IOMAP_ATOMIC

Darrick J. Wong (2):
  xfs: Introduce FORCEALIGN inode flag
  xfs: Enable file data forcealign feature

Dave Chinner (6):
  xfs: only allow minlen allocations when near ENOSPC
  xfs: always tail align maxlen allocations
  xfs: simplify extent allocation alignment
  xfs: make EOF allocation simpler
  xfs: introduce forced allocation alignment
  fs: xfs: align args->minlen for forced allocation alignment

John Garry (13):
  fs: Add generic_atomic_write_valid_size()
  xfs: Do not free EOF blocks for forcealign
  xfs: Update xfs_is_falloc_aligned() mask for forcealign
  xfs: Unmap blocks according to forcealign
  xfs: Only free full extents for forcealign
  iomap: Sub-extent zeroing
  fs: xfs: iomap: Sub-extent zeroing
  fs: Add FS_XFLAG_ATOMICWRITES flag
  iomap: Atomic write support
  xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
  xfs: Support atomic write for statx
  xfs: Validate atomic writes
  xfs: Support setting FMODE_CAN_ATOMIC_WRITE

 fs/iomap/direct-io.c          |  27 ++-
 fs/xfs/libxfs/xfs_alloc.c     |  33 ++--
 fs/xfs/libxfs/xfs_alloc.h     |   3 +-
 fs/xfs/libxfs/xfs_bmap.c      | 302 ++++++++++++++++++----------------
 fs/xfs/libxfs/xfs_format.h    |  16 +-
 fs/xfs/libxfs/xfs_ialloc.c    |  12 +-
 fs/xfs/libxfs/xfs_inode_buf.c |  86 ++++++++++
 fs/xfs/libxfs/xfs_inode_buf.h |   3 +
 fs/xfs/libxfs/xfs_sb.c        |   4 +
 fs/xfs/xfs_bmap_util.c        |  14 +-
 fs/xfs/xfs_buf.c              |  15 +-
 fs/xfs/xfs_buf.h              |   4 +-
 fs/xfs/xfs_file.c             |  64 +++++--
 fs/xfs/xfs_inode.c            |  14 ++
 fs/xfs/xfs_inode.h            |  15 ++
 fs/xfs/xfs_ioctl.c            |  55 ++++++-
 fs/xfs/xfs_iomap.c            |  13 +-
 fs/xfs/xfs_iops.c             |  28 ++++
 fs/xfs/xfs_mount.h            |   4 +
 fs/xfs/xfs_super.c            |   8 +
 fs/xfs/xfs_trace.h            |   8 +-
 include/linux/fs.h            |  12 ++
 include/linux/iomap.h         |   1 +
 include/uapi/linux/fs.h       |   3 +
 24 files changed, 549 insertions(+), 195 deletions(-)

-- 
2.31.1



* [PATCH v3 01/21] fs: Add generic_atomic_write_valid_size()
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 02/21] xfs: only allow minlen allocations when near ENOSPC John Garry
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Add a generic helper for FSes to validate that an atomic write is
appropriately sized, in addition to the checks already done by
generic_atomic_write_valid().

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/fs.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6ebefb079740..9bfa9b68d800 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3648,4 +3648,16 @@ bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)
 	return true;
 }
 
+static inline
+bool generic_atomic_write_valid_size(loff_t pos, struct iov_iter *iter,
+				unsigned int unit_min, unsigned int unit_max)
+{
+	size_t len = iov_iter_count(iter);
+
+	if (len < unit_min || len > unit_max)
+		return false;
+
+	return generic_atomic_write_valid(pos, iter);
+}
+
 #endif /* _LINUX_FS_H */
-- 
2.31.1
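
As a rough userspace illustration of the check in patch 01: the diff only
adds the unit_min/unit_max range test and defers everything else to
generic_atomic_write_valid(), whose body is not shown in this thread. The
sketch below assumes that the baseline helper requires a power-of-two
length and an offset naturally aligned to that length. The model_* names
and the plain integer parameters are hypothetical stand-ins for the
kernel's iov_iter-based interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed model of the baseline generic_atomic_write_valid(): the length
 * must be a power of two and the offset must be aligned to the length.
 * This is an assumption; the real body is not part of this patch.
 */
static bool model_atomic_write_valid(uint64_t pos, size_t len)
{
	if (len == 0 || (len & (len - 1)) != 0)
		return false;
	return (pos % len) == 0;
}

/* Mirrors the range test added by the patch, then defers to the base check. */
static bool model_atomic_write_valid_size(uint64_t pos, size_t len,
					  unsigned int unit_min,
					  unsigned int unit_max)
{
	assert(unit_min <= unit_max);

	if (len < unit_min || len > unit_max)
		return false;

	return model_atomic_write_valid(pos, len);
}
```

With the limits from the cover letter (unit_min 4096, unit_max 16384), an
8192-byte write at offset 8192 would pass under these assumptions, while
a 2048-byte write or an 8192-byte write at offset 4096 would not.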



* [PATCH v3 02/21] xfs: only allow minlen allocations when near ENOSPC
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
  2024-04-29 17:47 ` [PATCH v3 01/21] fs: Add generic_atomic_write_valid_size() John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 03/21] xfs: always tail align maxlen allocations John Garry
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner, John Garry

From: Dave Chinner <dchinner@redhat.com>

When we are near ENOSPC and don't have enough free
space for an args->maxlen allocation, xfs_alloc_space_available()
will trim args->maxlen to equal the available space. However, this
function has only checked that there is enough contiguous free space
for an aligned args->minlen allocation to succeed. Hence there is no
guarantee that an args->maxlen allocation will succeed, nor that the
available space will allow for correct alignment of an args->maxlen
allocation.

Further, by trimming args->maxlen arbitrarily, it breaks an
assumption made in xfs_alloc_fix_len() that if the caller wants
aligned allocation, then args->maxlen will be set to an aligned
value. It then skips the tail alignment and so we end up with
extents that aren't aligned to extent size hint boundaries as we
approach ENOSPC.

To avoid this problem, don't reduce args->maxlen by some random,
arbitrary amount. If args->maxlen is too large for the available
space, reduce the allocation to a minlen allocation as we know we
have contiguous free space available for this to succeed and always
be correctly aligned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 9da52e92172a..215265e0f68f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2411,14 +2411,23 @@ xfs_alloc_space_available(
 	if (available < (int)max(args->total, alloc_len))
 		return false;

+	if (flags & XFS_ALLOC_FLAG_CHECK)
+		return true;
+
 	/*
-	 * Clamp maxlen to the amount of free space available for the actual
-	 * extent allocation.
+	 * If we can't do a maxlen allocation, then we must reduce the size of
+	 * the allocation to match the available free space. We know how big
+	 * the largest contiguous free space we can allocate is, so that's our
+	 * upper bound. However, we don't exactly know what alignment/size
+	 * constraints have been placed on the allocation, so we can't
+	 * arbitrarily select some new max size. Hence make this a minlen
+	 * allocation as we know that will definitely succeed and match the
+	 * callers alignment constraints.
 	 */
-	if (available < (int)args->maxlen && !(flags & XFS_ALLOC_FLAG_CHECK)) {
-		args->maxlen = available;
+	alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+	if (longest < alloc_len) {
+		args->maxlen = args->minlen;
 		ASSERT(args->maxlen > 0);
-		ASSERT(args->maxlen >= args->minlen);
 	}

 	return true;
-- 
2.31.1
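
The clamping rule in patch 02 can be modelled in isolation. This is a
simplified sketch, not the kernel function: it leaves out the AG
reservation, the args->total check and XFS_ALLOC_FLAG_CHECK, and the
struct and function names are hypothetical. It keeps only the two
comparisons against the longest free extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_alloc_args {
	uint32_t minlen;	/* smallest acceptable extent, in blocks */
	uint32_t maxlen;	/* largest wanted extent, in blocks */
	uint32_t alignment;	/* required start alignment, >= 1 */
	uint32_t slop;		/* extra blocks reserved for alignment */
};

/* Contiguous blocks needed to guarantee an aligned extent of len blocks. */
static uint32_t model_alloc_len(const struct model_alloc_args *args,
				uint32_t len)
{
	return len + (args->alignment - 1) + args->slop;
}

/*
 * Returns false when even an aligned minlen extent cannot fit in the
 * longest free extent. Otherwise, if an aligned maxlen extent cannot fit,
 * drop to a minlen allocation rather than picking an arbitrary size.
 */
static bool model_space_available(struct model_alloc_args *args,
				  uint32_t longest)
{
	assert(args->alignment >= 1);
	assert(args->minlen > 0 && args->minlen <= args->maxlen);

	if (longest < model_alloc_len(args, args->minlen))
		return false;

	if (longest < model_alloc_len(args, args->maxlen))
		args->maxlen = args->minlen;

	return true;
}
```

For minlen 4, maxlen 16 and alignment 4, a longest free extent of 19
blocks leaves maxlen untouched, while 18 blocks reduces the request to a
4-block minlen allocation.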


* [PATCH v3 03/21] xfs: always tail align maxlen allocations
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
  2024-04-29 17:47 ` [PATCH v3 01/21] fs: Add generic_atomic_write_valid_size() John Garry
  2024-04-29 17:47 ` [PATCH v3 02/21] xfs: only allow minlen allocations when near ENOSPC John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 04/21] xfs: simplify extent allocation alignment John Garry
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner, John Garry

From: Dave Chinner <dchinner@redhat.com>

When we do a large allocation, the core free space allocation code
assumes that args->maxlen is aligned to args->prod/args->mod. Hence
if we get a maximum sized extent allocated, it does not do tail
alignment of the extent.

However, this assumes that nothing modifies args->maxlen between the
original allocation context setup and trimming the selected free
space extent to size. This assumption has recently been found to be
invalid - xfs_alloc_space_available() modifies args->maxlen in low
space situations - and there may be more situations we haven't yet
found like this.

Force aligned allocation introduces the requirement that extents are
correctly tail aligned, resulting in this occasional latent
alignment failure being reclassified from an unimportant curiosity
to a must-fix bug.

Removing the assumption about args->maxlen allocations always being
tail aligned is trivial, and should not impact anything because
args->maxlen for inodes with extent size hints configured is
already aligned. Hence all this change does is avoid weird corner
cases that would have resulted in unaligned extent sizes by always
trimming the extent down to an aligned size.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 215265e0f68f..e21fd5c1f802 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -432,20 +432,18 @@ xfs_alloc_compute_diff(
  * Fix up the length, based on mod and prod.
  * len should be k * prod + mod for some k.
  * If len is too small it is returned unchanged.
- * If len hits maxlen it is left alone.
  */
-STATIC void
+static void
 xfs_alloc_fix_len(
-	xfs_alloc_arg_t	*args)		/* allocation argument structure */
+	struct xfs_alloc_arg	*args)
 {
-	xfs_extlen_t	k;
-	xfs_extlen_t	rlen;
+	xfs_extlen_t		k;
+	xfs_extlen_t		rlen = args->len;

 	ASSERT(args->mod < args->prod);
-	rlen = args->len;
 	ASSERT(rlen >= args->minlen);
 	ASSERT(rlen <= args->maxlen);
-	if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
+	if (args->prod <= 1 || rlen < args->mod ||
 	    (args->mod == 0 && rlen < args->prod))
 		return;
 	k = rlen % args->prod;
-- 
2.31.1
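
To make the tail-alignment rule in patch 03 concrete, here is a small
userspace model of the "len should be k * prod + mod" fixup. The early
return matches the condition in the hunk once the rlen == args->maxlen
test is dropped. The trimming arithmetic after it is not visible in the
hunk, so that part is an assumption about how the length is rounded
down, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Trim len down so that len % prod == mod, unless that would take it
 * below minlen, in which case len is returned unchanged.
 */
static uint32_t model_fix_len(uint32_t len, uint32_t minlen, uint32_t maxlen,
			      uint32_t prod, uint32_t mod)
{
	uint32_t k, trimmed;

	assert(prod >= 1 && mod < prod);
	assert(len >= minlen && len <= maxlen);

	/* Same early exit as the patched code: no "len == maxlen" escape. */
	if (prod <= 1 || len < mod || (mod == 0 && len < prod))
		return len;

	k = len % prod;
	if (k == mod)
		return len;

	/* Assumed rounding step: move down to the nearest k * prod + mod. */
	if (k > mod)
		trimmed = len - (k - mod);
	else
		trimmed = len - prod + (mod - k);

	if (trimmed < minlen)
		return len;

	return trimmed;
}
```

With prod 4 and mod 0, a 10-block extent that happens to equal maxlen is
now trimmed to 8 blocks. Before the patch, the len == maxlen test would
have returned it unaligned.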


* [PATCH v3 04/21] xfs: simplify extent allocation alignment
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (2 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 03/21] xfs: always tail align maxlen allocations John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 05/21] xfs: make EOF allocation simpler John Garry
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner, John Garry

From: Dave Chinner <dchinner@redhat.com>

We currently align extent allocation to stripe unit or stripe width.
That is specified by an external parameter to the allocation code,
which then manipulates the xfs_alloc_args alignment configuration in
interesting ways.

The args->alignment field specifies extent start alignment, but
because we may be attempting non-aligned allocation first, there are
also slop variables that allow those allocation attempts to
account for aligned allocation if they fail.

This gets much more complex as we introduce forced allocation
alignment, where extent size hints are used to generate the extent
start alignment. Extent size hints currently only affect extent
lengths (via args->prod and args->mod) and so with this change we
will have two different start alignment conditions.

Avoid this complexity by always using args->alignment to indicate
extent start alignment, and always using args->prod/mod to indicate
extent length adjustment.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
jpg: fixup alignslop references in xfs_trace.h and xfs_ialloc.c
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c  |  4 +-
 fs/xfs/libxfs/xfs_alloc.h  |  2 +-
 fs/xfs/libxfs/xfs_bmap.c   | 96 +++++++++++++++++---------------------
 fs/xfs/libxfs/xfs_ialloc.c | 10 ++--
 fs/xfs/xfs_trace.h         |  8 ++--
 5 files changed, 54 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e21fd5c1f802..563599e956a6 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2393,7 +2393,7 @@ xfs_alloc_space_available(
 	reservation = xfs_ag_resv_needed(pag, args->resv);
 
 	/* do we have enough contiguous free space for the allocation? */
-	alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop;
+	alloc_len = args->minlen + (args->alignment - 1) + args->alignslop;
 	longest = xfs_alloc_longest_free_extent(pag, min_free, reservation);
 	if (longest < alloc_len)
 		return false;
@@ -2422,7 +2422,7 @@ xfs_alloc_space_available(
 	 * allocation as we know that will definitely succeed and match the
 	 * callers alignment constraints.
 	 */
-	alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+	alloc_len = args->maxlen + (args->alignment - 1) + args->alignslop;
 	if (longest < alloc_len) {
 		args->maxlen = args->minlen;
 		ASSERT(args->maxlen > 0);
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 0b956f8b9d5a..aa2c103d98f0 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
 	xfs_extlen_t	minleft;	/* min blocks must be left after us */
 	xfs_extlen_t	total;		/* total blocks needed in xaction */
 	xfs_extlen_t	alignment;	/* align answer to multiple of this */
-	xfs_extlen_t	minalignslop;	/* slop for minlen+alignment calcs */
+	xfs_extlen_t	alignslop;	/* slop for alignment calcs */
 	xfs_agblock_t	min_agbno;	/* set an agbno range for NEAR allocs */
 	xfs_agblock_t	max_agbno;	/* ... */
 	xfs_extlen_t	len;		/* output: actual size of extent */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..d56c82c07505 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3295,6 +3295,10 @@ xfs_bmap_select_minlen(
 	xfs_extlen_t		blen)
 {
 
+	/* Adjust best length for extent start alignment. */
+	if (blen > args->alignment)
+		blen -= args->alignment;
+
 	/*
 	 * Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
 	 * possible that there is enough contiguous free space for this request.
@@ -3310,6 +3314,7 @@ xfs_bmap_select_minlen(
 	if (blen < args->maxlen)
 		return blen;
 	return args->maxlen;
+
 }
 
 static int
@@ -3403,35 +3408,43 @@ xfs_bmap_alloc_account(
 	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, fld, ap->length);
 }
 
-static int
+/*
+ * Calculate the extent start alignment and the extent length adjustments that
+ * constrain this allocation.
+ *
+ * Extent start alignment is currently determined by stripe configuration and is
+ * carried in args->alignment, whilst extent length adjustment is determined by
+ * extent size hints and is carried by args->prod and args->mod.
+ *
+ * Low level allocation code is free to either ignore or override these values
+ * as required.
+ */
+static void
 xfs_bmap_compute_alignments(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args)
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_extlen_t		align = 0; /* minimum allocation alignment */
-	int			stripe_align = 0;
 
 	/* stripe alignment for allocation is determined by mount parameters */
 	if (mp->m_swidth && xfs_has_swalloc(mp))
-		stripe_align = mp->m_swidth;
+		args->alignment = mp->m_swidth;
 	else if (mp->m_dalign)
-		stripe_align = mp->m_dalign;
+		args->alignment = mp->m_dalign;
 
 	if (ap->flags & XFS_BMAPI_COWFORK)
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
 					&ap->length))
 			ASSERT(0);
 		ASSERT(ap->length);
-	}
 
-	/* apply extent size hints if obtained earlier */
-	if (align) {
 		args->prod = align;
 		div_u64_rem(ap->offset, args->prod, &args->mod);
 		if (args->mod)
@@ -3446,7 +3459,6 @@ xfs_bmap_compute_alignments(
 			args->mod = args->prod - args->mod;
 	}
 
-	return stripe_align;
 }
 
 static void
@@ -3518,7 +3530,7 @@ xfs_bmap_exact_minlen_extent_alloc(
 	args.total = ap->total;
 
 	args.alignment = 1;
-	args.minalignslop = 0;
+	args.alignslop = 0;
 
 	args.minleft = ap->minleft;
 	args.wasdel = ap->wasdel;
@@ -3558,7 +3570,6 @@ xfs_bmap_btalloc_at_eof(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args,
 	xfs_extlen_t		blen,
-	int			stripe_align,
 	bool			ag_only)
 {
 	struct xfs_mount	*mp = args->mp;
@@ -3572,23 +3583,15 @@ xfs_bmap_btalloc_at_eof(
 	 * allocation.
 	 */
 	if (ap->offset) {
-		xfs_extlen_t	nextminlen = 0;
+		xfs_extlen_t	alignment = args->alignment;
 
 		/*
-		 * Compute the minlen+alignment for the next case.  Set slop so
-		 * that the value of minlen+alignment+slop doesn't go up between
-		 * the calls.
+		 * Compute the alignment slop for the fallback path so we ensure
+		 * we account for the potential alignemnt space required by the
+		 * fallback paths before we modify the AGF and AGFL here.
 		 */
 		args->alignment = 1;
-		if (blen > stripe_align && blen <= args->maxlen)
-			nextminlen = blen - stripe_align;
-		else
-			nextminlen = args->minlen;
-		if (nextminlen + stripe_align > args->minlen + 1)
-			args->minalignslop = nextminlen + stripe_align -
-					args->minlen - 1;
-		else
-			args->minalignslop = 0;
+		args->alignslop = alignment - args->alignment;
 
 		if (!caller_pag)
 			args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
@@ -3606,19 +3609,8 @@ xfs_bmap_btalloc_at_eof(
 		 * Exact allocation failed. Reset to try an aligned allocation
 		 * according to the original allocation specification.
 		 */
-		args->alignment = stripe_align;
-		args->minlen = nextminlen;
-		args->minalignslop = 0;
-	} else {
-		/*
-		 * Adjust minlen to try and preserve alignment if we
-		 * can't guarantee an aligned maxlen extent.
-		 */
-		args->alignment = stripe_align;
-		if (blen > args->alignment &&
-		    blen <= args->maxlen + args->alignment)
-			args->minlen = blen - args->alignment;
-		args->minalignslop = 0;
+		args->alignment = alignment;
+		args->alignslop = 0;
 	}
 
 	if (ag_only) {
@@ -3636,9 +3628,8 @@ xfs_bmap_btalloc_at_eof(
 		return 0;
 
 	/*
-	 * Allocation failed, so turn return the allocation args to their
-	 * original non-aligned state so the caller can proceed on allocation
-	 * failure as if this function was never called.
+	 * Aligned allocation failed, so all fallback paths from here drop the
+	 * start alignment requirement as we know it will not succeed.
 	 */
 	args->alignment = 1;
 	return 0;
@@ -3646,7 +3637,9 @@ xfs_bmap_btalloc_at_eof(
 
 /*
  * We have failed multiple allocation attempts so now are in a low space
- * allocation situation. Try a locality first full filesystem minimum length
+ * allocation situation. We give up on any attempt at aligned allocation here.
+ *
+ * Try a locality first full filesystem minimum length
  * allocation whilst still maintaining necessary total block reservation
  * requirements.
  *
@@ -3663,6 +3656,7 @@ xfs_bmap_btalloc_low_space(
 {
 	int			error;
 
+	args->alignment = 1;
 	if (args->minlen > ap->minlen) {
 		args->minlen = ap->minlen;
 		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
@@ -3682,13 +3676,11 @@ xfs_bmap_btalloc_low_space(
 static int
 xfs_bmap_btalloc_filestreams(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	int			stripe_align)
+	struct xfs_alloc_arg	*args)
 {
 	xfs_extlen_t		blen = 0;
 	int			error = 0;
 
-
 	error = xfs_filestream_select_ag(ap, args, &blen);
 	if (error)
 		return error;
@@ -3707,8 +3699,7 @@ xfs_bmap_btalloc_filestreams(
 
 	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
 	if (ap->aeof)
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
-				true);
+		error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
 
 	if (!error && args->fsbno == NULLFSBLOCK)
 		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
@@ -3732,8 +3723,7 @@ xfs_bmap_btalloc_filestreams(
 static int
 xfs_bmap_btalloc_best_length(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	int			stripe_align)
+	struct xfs_alloc_arg	*args)
 {
 	xfs_extlen_t		blen = 0;
 	int			error;
@@ -3757,8 +3747,7 @@ xfs_bmap_btalloc_best_length(
 	 * trying.
 	 */
 	if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
-				false);
+		error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
 		if (error || args->fsbno != NULLFSBLOCK)
 			return error;
 	}
@@ -3785,27 +3774,26 @@ xfs_bmap_btalloc(
 		.resv		= XFS_AG_RESV_NONE,
 		.datatype	= ap->datatype,
 		.alignment	= 1,
-		.minalignslop	= 0,
+		.alignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
 	xfs_extlen_t		orig_length;
 	int			error;
-	int			stripe_align;
 
 	ASSERT(ap->length);
 	orig_offset = ap->offset;
 	orig_length = ap->length;
 
-	stripe_align = xfs_bmap_compute_alignments(ap, &args);
+	xfs_bmap_compute_alignments(ap, &args);
 
 	/* Trim the allocation back to the maximum an AG can fit. */
 	args.maxlen = min(ap->length, mp->m_ag_max_usable);
 
 	if ((ap->datatype & XFS_ALLOC_USERDATA) &&
 	    xfs_inode_is_filestream(ap->ip))
-		error = xfs_bmap_btalloc_filestreams(ap, &args, stripe_align);
+		error = xfs_bmap_btalloc_filestreams(ap, &args);
 	else
-		error = xfs_bmap_btalloc_best_length(ap, &args, stripe_align);
+		error = xfs_bmap_btalloc_best_length(ap, &args);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e5ac3e5430c4..164b6dcdbb44 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -758,12 +758,12 @@ xfs_ialloc_ag_alloc(
 		 *
 		 * For an exact allocation, alignment must be 1,
 		 * however we need to take cluster alignment into account when
-		 * fixing up the freelist. Use the minalignslop field to
-		 * indicate that extra blocks might be required for alignment,
-		 * but not to use them in the actual exact allocation.
+		 * fixing up the freelist. Use the alignslop field to indicate
+		 * that extra blocks might be required for alignment, but not
+		 * to use them in the actual exact allocation.
 		 */
 		args.alignment = 1;
-		args.minalignslop = igeo->cluster_align - 1;
+		args.alignslop = igeo->cluster_align - 1;
 
 		/* Allow space for the inode btree to split. */
 		args.minleft = igeo->inobt_maxlevels;
@@ -783,7 +783,7 @@ xfs_ialloc_ag_alloc(
 		 * on, so reset minalignslop to ensure it is not included in
 		 * subsequent requests.
 		 */
-		args.minalignslop = 0;
+		args.alignslop = 0;
 	}
 
 	if (unlikely(args.fsbno == NULLFSBLOCK)) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index aea97fc074f8..14679d64558a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1800,7 +1800,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__field(xfs_extlen_t, minleft)
 		__field(xfs_extlen_t, total)
 		__field(xfs_extlen_t, alignment)
-		__field(xfs_extlen_t, minalignslop)
+		__field(xfs_extlen_t, alignslop)
 		__field(xfs_extlen_t, len)
 		__field(char, wasdel)
 		__field(char, wasfromfl)
@@ -1819,7 +1819,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__entry->minleft = args->minleft;
 		__entry->total = args->total;
 		__entry->alignment = args->alignment;
-		__entry->minalignslop = args->minalignslop;
+		__entry->alignslop = args->alignslop;
 		__entry->len = args->len;
 		__entry->wasdel = args->wasdel;
 		__entry->wasfromfl = args->wasfromfl;
@@ -1828,7 +1828,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__entry->highest_agno = args->tp->t_highest_agno;
 	),
 	TP_printk("dev %d:%d agno 0x%x agbno 0x%x minlen %u maxlen %u mod %u "
-		  "prod %u minleft %u total %u alignment %u minalignslop %u "
+		  "prod %u minleft %u total %u alignment %u alignslop %u "
 		  "len %u wasdel %d wasfromfl %d resv %d "
 		  "datatype 0x%x highest_agno 0x%x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
@@ -1841,7 +1841,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		  __entry->minleft,
 		  __entry->total,
 		  __entry->alignment,
-		  __entry->minalignslop,
+		  __entry->alignslop,
 		  __entry->len,
 		  __entry->wasdel,
 		  __entry->wasfromfl,
-- 
2.31.1
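
The role of alignslop in patch 04 can be sketched outside the kernel.
The model below follows the assignments in the xfs_bmap_btalloc_at_eof()
and xfs_bmap_select_minlen() hunks. The struct and the model_* helpers
are hypothetical simplifications that track only the fields involved.

```c
#include <assert.h>
#include <stdint.h>

struct model_args {
	uint32_t minlen;	/* minimum extent length, in blocks */
	uint32_t alignment;	/* extent start alignment, >= 1 */
	uint32_t alignslop;	/* blocks held back for a later aligned try */
};

/* Contiguous space the free space check demands for a minlen extent. */
static uint32_t model_required_len(const struct model_args *args)
{
	return args->minlen + (args->alignment - 1) + args->alignslop;
}

/* Switch to an exact-block attempt; returns the alignment to restore. */
static uint32_t model_begin_exact(struct model_args *args)
{
	uint32_t saved = args->alignment;

	assert(saved >= 1);
	args->alignment = 1;
	args->alignslop = saved - args->alignment;
	return saved;
}

/* Exact attempt failed: go back to the original aligned request. */
static void model_fallback_aligned(struct model_args *args, uint32_t saved)
{
	args->alignment = saved;
	args->alignslop = 0;
}

/* Best-length adjustment for start alignment, as in the minlen hunk. */
static uint32_t model_adjust_best_len(uint32_t blen, uint32_t alignment)
{
	if (blen > alignment)
		blen -= alignment;
	return blen;
}
```

In this model the space requirement is the same for the exact attempt
and for the aligned fallback: with minlen 4 and alignment 8 it is 11
blocks in both states, so the fallback does not need more contiguous
space than was already checked.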



* [PATCH v3 05/21] xfs: make EOF allocation simpler
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (3 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 04/21] xfs: simplify extent allocation alignment John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 06/21] xfs: introduce forced allocation alignment John Garry
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner, John Garry

From: Dave Chinner <dchinner@redhat.com>

Currently the allocation at EOF is broken into two cases - when the
offset is zero and when the offset is non-zero. When the offset is
non-zero, we try to do exact block allocation for contiguous
extent allocation. When the offset is zero, the allocation is simply
an aligned allocation.

We want aligned allocation as the fallback when exact block
allocation fails, but that complicates the EOF allocation in that it
now has to handle two different allocation cases. The
caller also has to handle allocation when not at EOF, and for the
upcoming forced alignment changes we need that to also be aligned
allocation.

To simplify all this, pull the aligned allocation cases back into
the callers and leave the EOF allocation path for exact block
allocation only. This means that the EOF exact block allocation
fallback path is the normal aligned allocation path and that ends up
making things a lot simpler when forced alignment is introduced.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c   | 129 +++++++++++++++----------------------
 fs/xfs/libxfs/xfs_ialloc.c |   2 +-
 2 files changed, 54 insertions(+), 77 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index d56c82c07505..c2ddf1875e52 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3320,12 +3320,12 @@ xfs_bmap_select_minlen(
 static int
 xfs_bmap_btalloc_select_lengths(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	xfs_extlen_t		*blen)
+	struct xfs_alloc_arg	*args)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno, startag;
+	xfs_extlen_t		blen = 0;
 	int			error = 0;
 
 	if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
@@ -3339,19 +3339,18 @@ xfs_bmap_btalloc_select_lengths(
 	if (startag == NULLAGNUMBER)
 		startag = 0;
 
-	*blen = 0;
 	for_each_perag_wrap(mp, startag, agno, pag) {
-		error = xfs_bmap_longest_free_extent(pag, args->tp, blen);
+		error = xfs_bmap_longest_free_extent(pag, args->tp, &blen);
 		if (error && error != -EAGAIN)
 			break;
 		error = 0;
-		if (*blen >= args->maxlen)
+		if (blen >= args->maxlen)
 			break;
 	}
 	if (pag)
 		xfs_perag_rele(pag);
 
-	args->minlen = xfs_bmap_select_minlen(ap, args, *blen);
+	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
 	return error;
 }
 
@@ -3561,78 +3560,40 @@ xfs_bmap_exact_minlen_extent_alloc(
  * If we are not low on available data blocks and we are allocating at
  * EOF, optimise allocation for contiguous file extension and/or stripe
  * alignment of the new extent.
- *
- * NOTE: ap->aeof is only set if the allocation length is >= the
- * stripe unit and the allocation offset is at the end of file.
  */
 static int
 xfs_bmap_btalloc_at_eof(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args,
-	xfs_extlen_t		blen,
-	bool			ag_only)
+	struct xfs_alloc_arg	*args)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*caller_pag = args->pag;
+	xfs_extlen_t		alignment = args->alignment;
 	int			error;
 
+	ASSERT(ap->aeof && ap->offset);
+	ASSERT(args->alignment >= 1);
+
 	/*
-	 * If there are already extents in the file, try an exact EOF block
-	 * allocation to extend the file as a contiguous extent. If that fails,
-	 * or it's the first allocation in a file, just try for a stripe aligned
-	 * allocation.
+	 * Compute the alignment slop for the fallback path so we ensure
+	 * we account for the potential alignment space required by the
+	 * fallback paths before we modify the AGF and AGFL here.
 	 */
-	if (ap->offset) {
-		xfs_extlen_t	alignment = args->alignment;
-
-		/*
-		 * Compute the alignment slop for the fallback path so we ensure
-		 * we account for the potential alignemnt space required by the
-		 * fallback paths before we modify the AGF and AGFL here.
-		 */
-		args->alignment = 1;
-		args->alignslop = alignment - args->alignment;
-
-		if (!caller_pag)
-			args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
-		error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
-		if (!caller_pag) {
-			xfs_perag_put(args->pag);
-			args->pag = NULL;
-		}
-		if (error)
-			return error;
-
-		if (args->fsbno != NULLFSBLOCK)
-			return 0;
-		/*
-		 * Exact allocation failed. Reset to try an aligned allocation
-		 * according to the original allocation specification.
-		 */
-		args->alignment = alignment;
-		args->alignslop = 0;
-	}
+	args->alignment = 1;
+	args->alignslop = alignment - args->alignment;
 
-	if (ag_only) {
-		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
-	} else {
+	if (!caller_pag)
+		args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
+	error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
+	if (!caller_pag) {
+		xfs_perag_put(args->pag);
 		args->pag = NULL;
-		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
-		ASSERT(args->pag == NULL);
-		args->pag = caller_pag;
 	}
-	if (error)
-		return error;
 
-	if (args->fsbno != NULLFSBLOCK)
-		return 0;
-
-	/*
-	 * Aligned allocation failed, so all fallback paths from here drop the
-	 * start alignment requirement as we know it will not succeed.
-	 */
-	args->alignment = 1;
-	return 0;
+	/* Reset alignment to original specifications.  */
+	args->alignment = alignment;
+	args->alignslop = 0;
+	return error;
 }
 
 /*
@@ -3698,12 +3659,19 @@ xfs_bmap_btalloc_filestreams(
 	}
 
 	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
-	if (ap->aeof)
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
+	if (ap->aeof && ap->offset)
+		error = xfs_bmap_btalloc_at_eof(ap, args);
 
+	/* This may be an aligned allocation attempt. */
 	if (!error && args->fsbno == NULLFSBLOCK)
 		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
 
+	/* Attempt non-aligned allocation if we haven't already. */
+	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		args->alignment = 1;
+		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
+	}
+
 out_low_space:
 	/*
 	 * We are now done with the perag reference for the filestreams
@@ -3725,7 +3693,6 @@ xfs_bmap_btalloc_best_length(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args)
 {
-	xfs_extlen_t		blen = 0;
 	int			error;
 
 	ap->blkno = XFS_INO_TO_FSB(args->mp, ap->ip->i_ino);
@@ -3736,23 +3703,33 @@ xfs_bmap_btalloc_best_length(
 	 * the request.  If one isn't found, then adjust the minimum allocation
 	 * size to the largest space found.
 	 */
-	error = xfs_bmap_btalloc_select_lengths(ap, args, &blen);
+	error = xfs_bmap_btalloc_select_lengths(ap, args);
 	if (error)
 		return error;
 
 	/*
-	 * Don't attempt optimal EOF allocation if previous allocations barely
-	 * succeeded due to being near ENOSPC. It is highly unlikely we'll get
-	 * optimal or even aligned allocations in this case, so don't waste time
-	 * trying.
+	 * If we are in low space mode, then optimal allocation will fail so
+	 * prepare for minimal allocation and run the low space algorithm
+	 * immediately.
 	 */
-	if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
-		error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
-		if (error || args->fsbno != NULLFSBLOCK)
-			return error;
+	if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
+		ASSERT(args->fsbno == NULLFSBLOCK);
+		return xfs_bmap_btalloc_low_space(ap, args);
+	}
+
+	if (ap->aeof && ap->offset)
+		error = xfs_bmap_btalloc_at_eof(ap, args);
+
+	/* This may be an aligned allocation attempt. */
+	if (!error && args->fsbno == NULLFSBLOCK)
+		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
+
+	/* Attempt non-aligned allocation if we haven't already. */
+	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		args->alignment = 1;
+		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
 	}
 
-	error = xfs_alloc_vextent_start_ag(args, ap->blkno);
 	if (error || args->fsbno != NULLFSBLOCK)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 164b6dcdbb44..592ee9c2ae40 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -780,7 +780,7 @@ xfs_ialloc_ag_alloc(
 		 * the exact agbno requirement and increase the alignment
 		 * instead. It is critical that the total size of the request
 		 * (len + alignment + slop) does not increase from this point
-		 * on, so reset minalignslop to ensure it is not included in
+		 * on, so reset alignslop to ensure it is not included in
 		 * subsequent requests.
 		 */
 		args.alignslop = 0;
-- 
2.31.1


* [PATCH v3 06/21] xfs: introduce forced allocation alignment
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (4 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 05/21] xfs: make EOF allocation simpler John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 07/21] fs: xfs: align args->minlen for " John Garry
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner, John Garry

From: Dave Chinner <dchinner@redhat.com>

When forced allocation alignment is specified, the extent will
be aligned to the extent size hint size rather than stripe
alignment. If an aligned allocation cannot be done, the allocation
fails rather than attempting non-aligned fallbacks.
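
As a rough illustration of the precedence and the no-fallback rule described
above, here is a minimal userspace sketch. The struct, field, and function
names are hypothetical stand-ins rather than kernel API, and returning 1 for
"no alignment constraint" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the inode and mount fields consulted. */
struct align_inputs {
	bool     forcealign;    /* inode requests forced alignment */
	uint32_t extsize_hint;  /* extent size hint, in filesystem blocks */
	uint32_t stripe_width;  /* 0 if unset */
	bool     swalloc;       /* stripe-width allocation enabled */
	uint32_t stripe_unit;   /* 0 if unset */
};

/*
 * Pick the extent start alignment. Forced alignment wins over stripe
 * configuration; *forced reports whether unaligned fallbacks are forbidden.
 */
static uint32_t pick_alignment(const struct align_inputs *in, bool *forced)
{
	*forced = false;
	if (in->forcealign) {
		/* Assumption: a forced inode always carries a nonzero hint. */
		assert(in->extsize_hint != 0);
		*forced = true;
		return in->extsize_hint;
	}
	if (in->stripe_width && in->swalloc)
		return in->stripe_width;
	if (in->stripe_unit)
		return in->stripe_unit;
	return 1; /* assumption: 1 means "no alignment constraint" */
}

/* After an aligned attempt found nothing: may we retry unaligned? */
static bool may_retry_unaligned(uint32_t alignment, bool forced)
{
	return alignment > 1 && !forced;
}
```

With forced alignment the caller would report "no extent found" instead of
dropping the alignment to 1 and retrying.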

Note: none of the per-inode force align configuration is present
yet, so this just triggers off an "always false" wrapper function
for the moment.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.h |  1 +
 fs/xfs/libxfs/xfs_bmap.c  | 29 +++++++++++++++++++++++------
 fs/xfs/xfs_inode.h        |  5 +++++
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index aa2c103d98f0..7de2e6f64882 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -66,6 +66,7 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_USERDATA		(1 << 0)/* allocation is for user data*/
 #define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
 #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
+#define XFS_ALLOC_FORCEALIGN		(1 << 3)/* forced extent alignment */
 
 /* freespace limit calculations */
 unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c2ddf1875e52..7a0ef0900097 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3411,9 +3411,10 @@ xfs_bmap_alloc_account(
  * Calculate the extent start alignment and the extent length adjustments that
  * constrain this allocation.
  *
- * Extent start alignment is currently determined by stripe configuration and is
- * carried in args->alignment, whilst extent length adjustment is determined by
- * extent size hints and is carried by args->prod and args->mod.
+ * Extent start alignment is currently determined by forced inode alignment or
+ * stripe configuration and is carried in args->alignment, whilst extent length
+ * adjustment is determined by extent size hints and is carried by args->prod
+ * and args->mod.
  *
  * Low level allocation code is free to either ignore or override these values
  * as required.
@@ -3426,11 +3427,18 @@ xfs_bmap_compute_alignments(
 	struct xfs_mount	*mp = args->mp;
 	xfs_extlen_t		align = 0; /* minimum allocation alignment */
 
-	/* stripe alignment for allocation is determined by mount parameters */
-	if (mp->m_swidth && xfs_has_swalloc(mp))
+	/*
+	 * Forced inode alignment takes precedence over stripe alignment.
+	 * Stripe alignment for allocation is determined by mount parameters.
+	 */
+	if (xfs_inode_has_forcealign(ap->ip)) {
+		args->alignment = xfs_get_extsz_hint(ap->ip);
+		args->datatype |= XFS_ALLOC_FORCEALIGN;
+	} else if (mp->m_swidth && xfs_has_swalloc(mp)) {
 		args->alignment = mp->m_swidth;
-	else if (mp->m_dalign)
+	} else if (mp->m_dalign) {
 		args->alignment = mp->m_dalign;
+	}
 
 	if (ap->flags & XFS_BMAPI_COWFORK)
 		align = xfs_get_cowextsz_hint(ap->ip);
@@ -3617,6 +3625,11 @@ xfs_bmap_btalloc_low_space(
 {
 	int			error;
 
+	if (args->alignment > 1 && (args->datatype & XFS_ALLOC_FORCEALIGN)) {
+		args->fsbno = NULLFSBLOCK;
+		return 0;
+	}
+
 	args->alignment = 1;
 	if (args->minlen > ap->minlen) {
 		args->minlen = ap->minlen;
@@ -3668,6 +3681,8 @@ xfs_bmap_btalloc_filestreams(
 
 	/* Attempt non-aligned allocation if we haven't already. */
 	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		if (args->datatype & XFS_ALLOC_FORCEALIGN)
+			return error;
 		args->alignment = 1;
 		error = xfs_alloc_vextent_near_bno(args, ap->blkno);
 	}
@@ -3726,6 +3741,8 @@ xfs_bmap_btalloc_best_length(
 
 	/* Attempt non-aligned allocation if we haven't already. */
 	if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1)  {
+		if (args->datatype & XFS_ALLOC_FORCEALIGN)
+			return error;
 		args->alignment = 1;
 		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
 	}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ab46ffb3ac19..67f10349a6ed 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -311,6 +311,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
 }
 
+static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
+{
+	return false;
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
-- 
2.31.1


* [PATCH v3 07/21] fs: xfs: align args->minlen for forced allocation alignment
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (5 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 06/21] xfs: introduce forced allocation alignment John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-06-05 14:26   ` John Garry
  2024-04-29 17:47 ` [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag John Garry
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner, John Garry

From: Dave Chinner <dchinner@redhat.com>

If args->minlen is not aligned to the constraints of forced
alignment, we may do minlen allocations that are not aligned when we
approach ENOSPC. Avoid this by always aligning args->minlen
appropriately. If alignment of minlen results in a value smaller
than the alignment constraint, fail the allocation immediately.
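
The selection logic can be modelled in userspace roughly as follows. This is
a sketch with hypothetical names, not the kernel function; it follows the
diff below, which compares the rounded-down length against the caller's
minimum length. All lengths are in filesystem blocks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of minlen selection. Returns 0 and stores the chosen
 * minimum length in *minlen_out, or -ENOSPC when forced alignment cannot be
 * honoured with the free space seen.
 */
static int select_minlen(uint32_t best_len, uint32_t req_minlen,
			 uint32_t req_maxlen, uint32_t alignment,
			 bool forcealign, uint32_t *minlen_out)
{
	assert(alignment >= 1);

	/* Leave room for start alignment inside the best free extent. */
	if (best_len > alignment)
		best_len -= alignment;

	/* Clamp the candidate into [req_minlen, req_maxlen]. */
	if (best_len < req_minlen)
		best_len = req_minlen;
	else if (best_len > req_maxlen)
		best_len = req_maxlen;

	if (alignment > 1) {
		/* Round down to a multiple of the alignment. */
		best_len -= best_len % alignment;
		if (best_len < req_minlen) {
			if (forcealign)
				return -ENOSPC;
			/* Unforced: allow an unaligned minimum-length attempt. */
			best_len = req_minlen;
		}
	}
	*minlen_out = best_len;
	return 0;
}
```

For example, with an alignment of 4 and a best free extent of 11 blocks, the
candidate becomes 7 after the alignment reservation and 4 after rounding
down.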

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 45 +++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7a0ef0900097..4f39a43d78a7 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3288,33 +3288,48 @@ xfs_bmap_longest_free_extent(
 	return 0;
 }
 
-static xfs_extlen_t
+static int
 xfs_bmap_select_minlen(
 	struct xfs_bmalloca	*ap,
 	struct xfs_alloc_arg	*args,
 	xfs_extlen_t		blen)
 {
-
 	/* Adjust best length for extent start alignment. */
 	if (blen > args->alignment)
 		blen -= args->alignment;
 
 	/*
 	 * Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
-	 * possible that there is enough contiguous free space for this request.
+	 * possible that there is enough contiguous free space for this request
+	 * even if best length is less than the minimum length we need.
+	 *
+	 * If the best length won't satisfy the maximum length we requested,
+	 * then use it as the minimum length so we get as large an allocation
+	 * as possible.
 	 */
 	if (blen < ap->minlen)
-		return ap->minlen;
+		blen = ap->minlen;
+	else if (blen > args->maxlen)
+		blen = args->maxlen;
 
 	/*
-	 * If the best seen length is less than the request length,
-	 * use the best as the minimum, otherwise we've got the maxlen we
-	 * were asked for.
+	 * If we have alignment constraints, round the minlen down to match the
+	 * constraint so that alignment will be attempted. This may reduce the
+	 * allocation to smaller than was requested, so clamp the minimum to
+	 * ap->minlen to allow unaligned allocation to succeed. If we are forced
+	 * to align the allocation, return ENOSPC at this point because we don't
+	 * have enough contiguous free space to guarantee aligned allocation.
 	 */
-	if (blen < args->maxlen)
-		return blen;
-	return args->maxlen;
-
+	if (args->alignment > 1) {
+		blen = rounddown(blen, args->alignment);
+		if (blen < ap->minlen) {
+			if (args->datatype & XFS_ALLOC_FORCEALIGN)
+				return -ENOSPC;
+			blen = ap->minlen;
+		}
+	}
+	args->minlen = blen;
+	return 0;
 }
 
 static int
@@ -3350,8 +3365,7 @@ xfs_bmap_btalloc_select_lengths(
 	if (pag)
 		xfs_perag_rele(pag);
 
-	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
-	return error;
+	return xfs_bmap_select_minlen(ap, args, blen);
 }
 
 /* Update all inode and quota accounting for the allocation we just did. */
@@ -3671,7 +3685,10 @@ xfs_bmap_btalloc_filestreams(
 		goto out_low_space;
 	}
 
-	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
+	error = xfs_bmap_select_minlen(ap, args, blen);
+	if (error)
+		goto out_low_space;
+
 	if (ap->aeof && ap->offset)
 		error = xfs_bmap_btalloc_at_eof(ap, args);
 
-- 
2.31.1


* [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (6 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 07/21] fs: xfs: align args->minlen for " John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-30 23:22   ` Dave Chinner
  2024-06-12  2:10   ` Long Li
  2024-04-29 17:47 ` [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign John Garry
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Add a new inode flag to require that all file data extent mappings must
be aligned (both the file offset range and the allocated space itself)
to the extent size hint.  Having a separate COW extent size hint is no
longer allowed.

The goal here is to enable sysadmins and users to mandate that all space
mappings in a file must have a startoff/blockcount that are aligned to
(say) a 2MB alignment and that the startblock/blockcount will follow the
same alignment.
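
To make the constraints concrete, here is a minimal userspace sketch of the
geometry checks and of what "aligned" means for a single mapping. The names
are hypothetical; the file type, realtime, and superblock feature checks from
the patch are omitted, and start blocks are treated as AG-relative for
simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry inputs, all in filesystem blocks. */
struct fs_geom {
	uint32_t agblocks;      /* allocation group size */
	uint32_t stripe_unit;   /* 0 if unset */
	uint32_t stripe_width;  /* 0 if unset */
};

static bool is_pow2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Would this extent size hint be acceptable for forced alignment? */
static bool forcealign_geometry_ok(const struct fs_geom *g,
				   uint32_t extsize, uint32_t cowextsize)
{
	if (!is_pow2(extsize))
		return false;           /* nonzero power of 2 required */
	if (g->agblocks % extsize)
		return false;           /* AG size must be a multiple */
	if (g->stripe_unit && (g->stripe_unit % extsize))
		return false;
	if (g->stripe_width && (g->stripe_width % extsize))
		return false;
	return cowextsize == 0;         /* no separate CoW hint allowed */
}

/* A mapping is aligned when its offset, start block, and length all are. */
static bool mapping_aligned(uint64_t startoff, uint64_t startblock,
			    uint64_t blockcount, uint32_t extsize)
{
	assert(is_pow2(extsize));
	return ((startoff | startblock | blockcount) & (extsize - 1)) == 0;
}
```

With a 4-block hint, a mapping at file offset 8, start block 16, and length 4
is aligned, whereas one starting at block 18 is not.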

jpg: Enforce that extsize is a power-of-2 and aligned with agsize + stripe
     alignment for forcealign
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Co-developed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h    |  6 ++++-
 fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_inode_buf.h |  3 +++
 fs/xfs/libxfs/xfs_sb.c        |  2 ++
 fs/xfs/xfs_inode.c            | 12 +++++++++
 fs/xfs/xfs_inode.h            |  2 +-
 fs/xfs/xfs_ioctl.c            | 34 +++++++++++++++++++++++-
 fs/xfs/xfs_mount.h            |  2 ++
 fs/xfs/xfs_super.c            |  4 +++
 include/uapi/linux/fs.h       |  2 ++
 10 files changed, 114 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 2b2f9050fbfb..4dd295b047f8 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -1084,16 +1085,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
 #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define XFS_DIFLAG2_FORCEALIGN_BIT 5
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_FORCEALIGN	(1 << XFS_DIFLAG2_FORCEALIGN_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index d0dcce462bf4..12f128f12824 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -616,6 +616,14 @@ xfs_dinode_verify(
 	    !xfs_has_bigtime(mp))
 		return __this_address;
 
+	if (flags2 & XFS_DIFLAG2_FORCEALIGN) {
+		fa = xfs_inode_validate_forcealign(mp, mode, flags,
+				be32_to_cpu(dip->di_extsize),
+				be32_to_cpu(dip->di_cowextsize));
+		if (fa)
+			return fa;
+	}
+
 	return NULL;
 }
 
@@ -783,3 +791,45 @@ xfs_inode_validate_cowextsize(
 
 	return NULL;
 }
+
+/* Validate the forcealign inode flag */
+xfs_failaddr_t
+xfs_inode_validate_forcealign(
+	struct xfs_mount	*mp,
+	uint16_t		mode,
+	uint16_t		flags,
+	uint32_t		extsize,
+	uint32_t		cowextsize)
+{
+	/* superblock rocompat feature flag */
+	if (!xfs_has_forcealign(mp))
+		return __this_address;
+
+	/* Only regular files and directories */
+	if (!S_ISDIR(mode) && !S_ISREG(mode))
+		return __this_address;
+
+	/* Doesn't apply to realtime files */
+	if (flags & XFS_DIFLAG_REALTIME)
+		return __this_address;
+
+	/* Requires a non-zero power-of-2 extent size hint */
+	if (extsize == 0 || !is_power_of_2(extsize) ||
+	    (mp->m_sb.sb_agblocks % extsize))
+		return __this_address;
+
+	/* Requires agsize be a multiple of extsize */
+	if (mp->m_sb.sb_agblocks % extsize)
+		return __this_address;
+
+	/* Requires stripe unit+width (if set) be a multiple of extsize */
+	if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
+	    (mp->m_swidth && (mp->m_swidth % extsize)))
+		return __this_address;
+
+	/* Requires no cow extent size hint */
+	if (cowextsize != 0)
+		return __this_address;
+
+	return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af..50db17d22b68 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -33,6 +33,9 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
 xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
 		uint32_t cowextsize, uint16_t mode, uint16_t flags,
 		uint64_t flags2);
+xfs_failaddr_t xfs_inode_validate_forcealign(struct xfs_mount *mp,
+		uint16_t mode, uint16_t flags, uint32_t extsize,
+		uint32_t cowextsize);
 
 static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
 {
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d991eec05436..e746c57c4cc4 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -163,6 +163,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_REFLINK;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
 		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+		features |= XFS_FEAT_FORCEALIGN;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ea48774f6b76..db5a0f66a121 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -607,6 +607,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+			flags |= FS_XFLAG_FORCEALIGN;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
@@ -736,6 +738,8 @@ xfs_inode_inherit_flags2(
 	}
 	if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
 		ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+	if (pip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+		ip->i_diflags2 |= XFS_DIFLAG2_FORCEALIGN;
 
 	/* Don't let invalid cowextsize hints propagate. */
 	failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
@@ -744,6 +748,14 @@ xfs_inode_inherit_flags2(
 		ip->i_diflags2 &= ~XFS_DIFLAG2_COWEXTSIZE;
 		ip->i_cowextsize = 0;
 	}
+
+	if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+		failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+				VFS_I(ip)->i_mode, ip->i_diflags, ip->i_extsize,
+				ip->i_cowextsize);
+		if (failaddr)
+			ip->i_diflags2 &= ~XFS_DIFLAG2_FORCEALIGN;
+	}
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 67f10349a6ed..065028789473 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -313,7 +313,7 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 
 static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
 {
-	return false;
+	return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
 }
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d0e2cec6210d..d1126509ceb9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
 		di_flags2 |= XFS_DIFLAG2_DAX;
 	if (xflags & FS_XFLAG_COWEXTSIZE)
 		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	if (xflags & FS_XFLAG_FORCEALIGN)
+		di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
 
 	return di_flags2;
 }
@@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
 	if (i_flags2 && !xfs_has_v3inodes(mp))
 		return -EINVAL;
 
+	/*
+	 * Force-align requires a nonzero extent size hint and a zero cow
+	 * extent size hint.  It doesn't apply to realtime files.
+	 */
+	if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
+		if (!xfs_has_forcealign(mp))
+			return -EINVAL;
+		if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+			return -EINVAL;
+		if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+					FS_XFLAG_EXTSZINHERIT)))
+			return -EINVAL;
+		if (fa->fsx_xflags & FS_XFLAG_REALTIME)
+			return -EINVAL;
+	}
+
 	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_diflags2 = i_flags2;
 
@@ -1232,6 +1250,7 @@ xfs_ioctl_setattr_check_extsize(
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_failaddr_t		failaddr;
 	uint16_t		new_diflags;
+	uint16_t		new_diflags2;
 
 	if (!fa->fsx_valid)
 		return 0;
@@ -1244,6 +1263,7 @@ xfs_ioctl_setattr_check_extsize(
 		return -EINVAL;
 
 	new_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
+	new_diflags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);
 
 	/*
 	 * Inode verifiers do not check that the extent size hint is an integer
@@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
 	failaddr = xfs_inode_validate_extsize(ip->i_mount,
 			XFS_B_TO_FSB(mp, fa->fsx_extsize),
 			VFS_I(ip)->i_mode, new_diflags);
-	return failaddr != NULL ? -EINVAL : 0;
+	if (failaddr)
+		return -EINVAL;
+
+	if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+		failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+				VFS_I(ip)->i_mode, new_diflags,
+				XFS_B_TO_FSB(mp, fa->fsx_extsize),
+				XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
+		if (failaddr)
+			return -EINVAL;
+	}
+
+	return 0;
 }
 
 static int
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e880aa48de68..a8266cf654c4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -292,6 +292,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_BIGTIME	(1ULL << 24)	/* large timestamps */
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
+#define XFS_FEAT_FORCEALIGN	(1ULL << 27)	/* aligned file data extents */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -355,6 +356,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
 __XFS_HAS_FEAT(bigtime, BIGTIME)
 __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(forcealign, FORCEALIGN)
 
 /*
  * Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c21f10ab0f5d..63d4312785ef 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1706,6 +1706,10 @@ xfs_fs_fill_super(
 		mp->m_features &= ~XFS_FEAT_DISCARD;
 	}
 
+	if (xfs_has_forcealign(mp))
+		xfs_warn(mp,
+"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+
 	if (xfs_has_reflink(mp)) {
 		if (mp->m_sb.sb_rblocks) {
 			xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 191a7e88a8ab..6a6bcb53594a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -158,6 +158,8 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN	0x00020000
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1


* [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (7 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-30 22:54   ` Dave Chinner
  2024-04-29 17:47 ` [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask " John Garry
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

When forcealign is enabled, we want the EOF to be aligned as well, so
do not free post-EOF blocks that fall within the last aligned extent.
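
The rounding involved can be sketched in userspace as below. The helper name
is hypothetical and block numbers are plain integers; the patch itself uses
the kernel's roundup_64().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round the EOF block number up to the next multiple
 * of the extent size hint (both in filesystem blocks).
 */
static uint64_t eof_keep_boundary(uint64_t end_fsb, uint32_t extsize)
{
	assert(extsize >= 1);
	return (end_fsb + extsize - 1) / extsize * extsize;
}
```

With a 4-block hint and EOF at block 5, the boundary moves to block 8, so
only blocks at or beyond block 8 would remain candidates for freeing.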

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_bmap_util.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 19e11d1da660..f26d1570b9bd 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
 	 * forever.
 	 */
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
-	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
+
+	/* Do not free blocks when forcing extent sizes */
+	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
+		end_fsb = roundup_64(end_fsb, ip->i_extsize);
+	else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
 		end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
+
 	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
 	if (last_fsb <= end_fsb)
 		return false;
-- 
2.31.1


* [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask for forcealign
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (8 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-30 23:35   ` Dave Chinner
  2024-04-29 17:47 ` [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign John Garry
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

When forcealign is enabled, we want the alignment mask to cover an
aligned extent, similar to rtvol.
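
A minimal userspace sketch of the resulting check follows. The function name
is hypothetical, and it assumes the block size and the extent size hint (in
blocks) are both powers of 2, so that the mask trick is valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: a byte range is acceptable when both its offset and
 * its length are multiples of the aligned extent size in bytes.
 */
static bool falloc_range_aligned(uint64_t pos, uint64_t len,
				 uint32_t blocksize, uint32_t extsize_blocks)
{
	uint64_t unit = (uint64_t)blocksize * extsize_blocks;

	/* The mask test only works for power-of-2 units. */
	assert(unit != 0 && (unit & (unit - 1)) == 0);
	return ((pos | len) & (unit - 1)) == 0;
}
```

With 4096-byte blocks and a 4-block hint, the unit is 16384 bytes, so a
range at offset 16384 of length 32768 passes while one at offset 4096 does
not.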

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 632653e00906..e81e01e6b22b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -61,7 +61,10 @@ xfs_is_falloc_aligned(
 		}
 		mask = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) - 1;
 	} else {
-		mask = mp->m_sb.sb_blocksize - 1;
+		if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
+			mask = (mp->m_sb.sb_blocksize * ip->i_extsize) - 1;
+		else
+			mask = mp->m_sb.sb_blocksize - 1;
 	}
 
 	return !((pos | len) & mask);
-- 
2.31.1


* [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (9 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask " John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-05-01  0:10   ` Dave Chinner
  2024-04-29 17:47 ` [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign John Garry
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

When forcealign is enabled, blocks in an inode need to be unmapped
according to extent alignment, as is already done for rtvol.
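
The offset computation this relies on can be sketched in userspace as
follows. The names are hypothetical and the extent size hint is assumed to
be a power of 2, which the forcealign validation is meant to guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of a block number within its extsize-block aligned unit. */
static uint32_t forcealign_offset(uint64_t bno, uint32_t extsize)
{
	assert(extsize != 0 && (extsize & (extsize - 1)) == 0);
	return (uint32_t)(bno & (extsize - 1));
}

/*
 * Nonzero when the range [startblock, startblock + blockcount) ends off an
 * extent boundary; the value is how far past the boundary it ends.
 */
static uint32_t unaligned_tail(uint64_t startblock, uint64_t blockcount,
			       uint32_t extsize)
{
	return forcealign_offset(startblock + blockcount, extsize);
}
```

A nonzero result plays the same role as the "mod" value in the realtime
case: the range to unmap does not line up with the end of an aligned extent.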

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 39 +++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_inode.h       |  5 +++++
 2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4f39a43d78a7..4a78ab193753 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5339,6 +5339,15 @@ xfs_bmap_del_extent_real(
 	return 0;
 }
 
+/* Return the offset of a block number within an extent for forcealign. */
+static xfs_extlen_t
+xfs_forcealign_extent_offset(
+	struct xfs_inode	*ip,
+	xfs_fsblock_t		bno)
+{
+	return bno & (ip->i_extsize - 1);
+}
+
 /*
  * Unmap (remove) blocks from a file.
  * If nexts is nonzero then the number of extents to remove is limited to
@@ -5361,6 +5370,7 @@ __xfs_bunmapi(
 	struct xfs_bmbt_irec	got;		/* current extent record */
 	struct xfs_ifork	*ifp;		/* inode fork pointer */
 	int			isrt;		/* freeing in rt area */
+	int			isforcealign;	/* freeing for file inode with forcealign */
 	int			logflags;	/* transaction logging flags */
 	xfs_extlen_t		mod;		/* rt extent offset */
 	struct xfs_mount	*mp = ip->i_mount;
@@ -5397,7 +5407,10 @@ __xfs_bunmapi(
 		return 0;
 	}
 	XFS_STATS_INC(mp, xs_blk_unmap);
-	isrt = xfs_ifork_is_realtime(ip, whichfork);
+	isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);
+	isforcealign = (whichfork == XFS_DATA_FORK) &&
+			xfs_inode_has_forcealign(ip) &&
+			xfs_inode_has_extsize(ip) && ip->i_extsize > 1;
 	end = start + len;
 
 	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
@@ -5459,11 +5472,15 @@ __xfs_bunmapi(
 		if (del.br_startoff + del.br_blockcount > end + 1)
 			del.br_blockcount = end + 1 - del.br_startoff;
 
-		if (!isrt || (flags & XFS_BMAPI_REMAP))
+		if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
 			goto delete;
 
-		mod = xfs_rtb_to_rtxoff(mp,
-				del.br_startblock + del.br_blockcount);
+		if (isrt)
+			mod = xfs_rtb_to_rtxoff(mp,
+					del.br_startblock + del.br_blockcount);
+		else if (isforcealign)
+			mod = xfs_forcealign_extent_offset(ip,
+					del.br_startblock + del.br_blockcount);
 		if (mod) {
 			/*
 			 * Realtime extent not lined up at the end.
@@ -5511,9 +5528,19 @@ __xfs_bunmapi(
 			goto nodelete;
 		}
 
-		mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
+		if (isrt)
+			mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
+		else if (isforcealign)
+			mod = xfs_forcealign_extent_offset(ip,
+					del.br_startblock);
+
 		if (mod) {
-			xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
+			xfs_extlen_t off;
+
+			if (isrt)
+				off = mp->m_sb.sb_rextsize - mod;
+			else if (isforcealign)
+				off = ip->i_extsize - mod;
 
 			/*
 			 * Realtime extent is lined up at the end but not
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 065028789473..3f13943ab3a3 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -316,6 +316,11 @@ static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
 }
 
+static inline bool xfs_inode_has_extsize(struct xfs_inode *ip)
+{
+	return ip->i_diflags & XFS_DIFLAG_EXTSIZE;
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
-- 
2.31.1



* [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (10 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-05-01  0:53   ` Dave Chinner
  2024-04-29 17:47 ` [PATCH v3 13/21] xfs: Enable file data forcealign feature John Garry
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

As we already do for rtvol, only free full extents for forcealign in
xfs_free_file_space().
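
The rounding can be sketched in userspace as follows. This is an
illustrative model only: struct fsb_range and both helper names are
hypothetical, and plain division stands in for the kernel's roundup_64()
and rounddown_64().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of file offsets, in filesystem blocks. */
struct fsb_range {
	uint64_t start;
	uint64_t end;
};

/* Shrink [start_fsb, end_fsb) inwards to whole extsize-block extents. */
static inline struct fsb_range free_range_full_extents(uint64_t start_fsb,
							uint64_t end_fsb,
							uint32_t extsize)
{
	struct fsb_range r;

	assert(extsize > 0);
	r.start = ((start_fsb + extsize - 1) / extsize) * extsize; /* round up */
	r.end = (end_fsb / extsize) * extsize;                     /* round down */
	return r;
}

/* True if at least one complete extent remains after rounding. */
static inline bool range_has_full_extent(struct fsb_range r)
{
	return r.end > r.start;
}
```

With an extsize of 4 blocks, a request covering blocks 5 to 13 shrinks to
blocks 8 to 12, while a request covering blocks 5 to 7 contains no
complete extent at all.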

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_bmap_util.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f26d1570b9bd..1dd45dfb2811 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -847,8 +847,11 @@ xfs_free_file_space(
 	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
 	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
 
-	/* We can only free complete realtime extents. */
-	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
+	/* Free only complete extents. */
+	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
+		startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
+		endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
+	} else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
 		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
 		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
 	}
-- 
2.31.1



* [PATCH v3 13/21] xfs: Enable file data forcealign feature
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (11 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 14/21] iomap: Sub-extent zeroing John Garry
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Enable the forcealign feature by adding XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
to XFS_SB_FEAT_RO_COMPAT_ALL.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 4dd295b047f8..0c73b96dbefc 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -358,7 +358,8 @@ xfs_sb_has_compat_feature(
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
 		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
2.31.1



* [PATCH v3 14/21] iomap: Sub-extent zeroing
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (12 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 13/21] xfs: Enable file data forcealign feature John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-05-01  1:07   ` Dave Chinner
  2024-06-11  3:10   ` Long Li
  2024-04-29 17:47 ` [PATCH v3 15/21] fs: xfs: " John Garry
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
unwritten.

This will be important for atomic writes support, in that atomically
writing over a partially written extent would mean that we would need to
do the unwritten extent conversion write separately, and the write could
no longer be atomic.

It is the task of the FS to set iomap.extent_size per iter to indicate
that sub-extent zeroing is required.
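
A small userspace sketch of the padding arithmetic may help here. The
helper names are hypothetical and the region size is assumed to be a
power of two. It models only how the head and tail pads are derived, not
the bio submission done by iomap_dio_zero().

```c
#include <assert.h>
#include <stdint.h>

/* Use the extent size when the FS set one, else fall back to the block size. */
static inline uint32_t pick_zeroing_size(uint32_t extent_size, uint32_t blocksize)
{
	return extent_size ? extent_size : blocksize;
}

/* Bytes to zero between the start of the region and the write offset. */
static inline uint64_t head_pad(uint64_t pos, uint32_t zeroing_size)
{
	assert(zeroing_size != 0 && (zeroing_size & (zeroing_size - 1)) == 0);
	return pos & (zeroing_size - 1);
}

/* Bytes to zero between the end of the write and the end of the region. */
static inline uint64_t tail_pad(uint64_t end_pos, uint32_t zeroing_size)
{
	uint64_t pad;

	assert(zeroing_size != 0 && (zeroing_size & (zeroing_size - 1)) == 0);
	pad = end_pos & (zeroing_size - 1);
	return pad ? zeroing_size - pad : 0;
}
```

For a 4096-byte write at offset 20480 into an unwritten 16384-byte
extent, the head pad is 4096 bytes and the tail pad is 8192 bytes. With
plain 4096-byte block zeroing the same write needs no padding.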

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c  | 17 +++++++++++------
 include/linux/iomap.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..a3ed7cfa95bc 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 {
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
-	unsigned int fs_block_size = i_blocksize(inode), pad;
+	unsigned int zeroing_size, pad;
 	loff_t length = iomap_length(iter);
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf;
@@ -288,6 +288,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
+	if (iomap->extent_size)
+		zeroing_size = iomap->extent_size;
+	else
+		zeroing_size = i_blocksize(inode);
+
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
 		return -EINVAL;
@@ -354,8 +359,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
 
 	if (need_zeroout) {
-		/* zero out from the start of the block to the write offset */
-		pad = pos & (fs_block_size - 1);
+		/* zero out from the start of the region to the write offset */
+		pad = pos & (zeroing_size - 1);
 		if (pad)
 			iomap_dio_zero(iter, dio, pos - pad, pad);
 	}
@@ -428,10 +433,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 zero_tail:
 	if (need_zeroout ||
 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
-		/* zero out from the end of the write to the end of the block */
-		pad = pos & (fs_block_size - 1);
+		/* zero out from the end of the write to the end of the region */
+		pad = pos & (zeroing_size - 1);
 		if (pad)
-			iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+			iomap_dio_zero(iter, dio, pos, zeroing_size - pad);
 	}
 out:
 	/* Undo iter limitation to current extent */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..42623b1cdc04 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -97,6 +97,7 @@ struct iomap {
 	u64			length;	/* length of mapping, bytes */
 	u16			type;	/* type of mapping */
 	u16			flags;	/* flags for mapping */
+	unsigned int		extent_size;
 	struct block_device	*bdev;	/* block device for I/O */
 	struct dax_device	*dax_dev; /* dax_dev for dax operations */
 	void			*inline_data;
-- 
2.31.1



* [PATCH v3 15/21] fs: xfs: iomap: Sub-extent zeroing
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (13 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 14/21] iomap: Sub-extent zeroing John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-05-01  1:32   ` Dave Chinner
  2024-04-29 17:47 ` [PATCH v3 16/21] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Set iomap->extent_size when sub-extent zeroing is required.

We treat a sub-extent write the same as an unaligned write, so we can
leverage the existing sub-FSblock unaligned write support, i.e. try a
shared lock with the IOMAP_DIO_OVERWRITE_ONLY flag and, if this fails,
retry with the exclusive lock.

In xfs_iomap_write_unwritten(), FSB calculations are now based on the
extsize.
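
The extsize-based range calculation can be modelled in userspace like
this. It is a sketch under stated assumptions: struct fsb_span and
unwritten_span() are hypothetical names, and plain division replaces the
XFS_B_TO_FSBT(), XFS_B_TO_FSB(), round_down() and round_up() macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result: start and length of the conversion, in FS blocks. */
struct fsb_span {
	uint64_t offset_fsb;
	uint64_t count_fsb;
};

/*
 * Widen the byte range [offset, offset + count) to the conversion unit:
 * one block normally, one whole extent when forcealign applies.
 */
static inline struct fsb_span unwritten_span(uint64_t offset, uint64_t count,
					     uint32_t blocksize, uint32_t extsize,
					     bool forcealign)
{
	uint64_t unit = blocksize;
	uint64_t start, end;
	struct fsb_span s;

	assert(blocksize > 0);
	if (forcealign && extsize > 1)
		unit = (uint64_t)blocksize * extsize;

	start = (offset / unit) * unit;                     /* round down */
	end = ((offset + count + unit - 1) / unit) * unit;  /* round up */

	s.offset_fsb = start / blocksize;
	s.count_fsb = (end - start) / blocksize;
	return s;
}
```

A 4096-byte write at offset 20480 converts one block (block 5) without
forcealign, but the whole 4-block extent starting at block 4 with
forcealign and an extsize of 4.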

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c  | 35 ++++++++++++++++++++++-------------
 fs/xfs/xfs_iomap.c | 13 +++++++++++--
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e81e01e6b22b..ee4f94cf6f4e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -620,18 +620,19 @@ xfs_file_dio_write_aligned(
  * Handle block unaligned direct I/O writes
  *
  * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
- * them to be done in parallel with reads and other direct I/O writes.  However,
- * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
- * to do sub-block zeroing and that requires serialisation against other direct
- * I/O to the same block.  In this case we need to serialise the submission of
- * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
- * In the case where sub-block zeroing is not required, we can do concurrent
- * sub-block dios to the same block successfully.
+ * them to be done in parallel with reads and other direct I/O writes.
+ * However if the I/O is not aligned to filesystem blocks/extent, the direct
+ * I/O layer may need to do sub-block/extent zeroing and that requires
+ * serialisation against other direct I/O to the same block/extent.  In this
+ * case we need to serialise the submission of the unaligned I/O so that we
+ * don't get racing block/extent zeroing in the dio layer.
+ * In the case where sub-block/extent zeroing is not required, we can do
+ * concurrent sub-block/extent dios to the same block/extent successfully.
  *
  * Optimistically submit the I/O using the shared lock first, but use the
  * IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
- * if block allocation or partial block zeroing would be required.  In that case
- * we try again with the exclusive lock.
+ * if block/extent allocation or partial block/extent zeroing would be
+ * required.  In that case we try again with the exclusive lock.
  */
 static noinline ssize_t
 xfs_file_dio_write_unaligned(
@@ -646,9 +647,9 @@ xfs_file_dio_write_unaligned(
 	ssize_t			ret;
 
 	/*
-	 * Extending writes need exclusivity because of the sub-block zeroing
-	 * that the DIO code always does for partial tail blocks beyond EOF, so
-	 * don't even bother trying the fast path in this case.
+	 * Extending writes need exclusivity because of the sub-block/extent
+	 * zeroing that the DIO code always does for partial tail blocks
+	 * beyond EOF, so don't even bother trying the fast path in this case.
 	 */
 	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
 		if (iocb->ki_flags & IOCB_NOWAIT)
@@ -714,11 +715,19 @@ xfs_file_dio_write(
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
 	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
 	size_t			count = iov_iter_count(from);
+	struct xfs_mount	*mp = ip->i_mount;
+	unsigned int		blockmask;
 
 	/* direct I/O must be aligned to device logical sector size */
 	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
 		return -EINVAL;
-	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
+
+	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
+		blockmask = XFS_FSB_TO_B(mp, ip->i_extsize) - 1;
+	else
+		blockmask = mp->m_blockmask;
+
+	if ((iocb->ki_pos | count) & blockmask)
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
 	return xfs_file_dio_write_aligned(ip, iocb, from);
 }
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 4087af7f3c9f..1a3692bbc84d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -138,6 +138,8 @@ xfs_bmbt_to_iomap(
 
 	iomap->validity_cookie = sequence_cookie;
 	iomap->folio_ops = &xfs_iomap_folio_ops;
+	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
+		iomap->extent_size = XFS_FSB_TO_B(mp, ip->i_extsize);
 	return 0;
 }
 
@@ -570,8 +572,15 @@ xfs_iomap_write_unwritten(
 
 	trace_xfs_unwritten_convert(ip, offset, count);
 
-	offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
+		xfs_extlen_t extsize_bytes = mp->m_sb.sb_blocksize * ip->i_extsize;
+
+		offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
+		count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
+	} else {
+		offset_fsb = XFS_B_TO_FSBT(mp, offset);
+		count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+	}
 	count_fsb = (xfs_filblks_t)(count_fsb - offset_fsb);
 
 	/*
-- 
2.31.1



* [PATCH v3 16/21] fs: Add FS_XFLAG_ATOMICWRITES flag
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (14 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 15/21] fs: xfs: " John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 17/21] iomap: Atomic write support John Garry
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Add a flag indicating that a regular file is enabled for atomic writes.

This is a file attribute that mirrors an ondisk inode flag.  Actual support
for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time.  This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STATX_ATTR_DAX
to find out if they really got that IO path.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/uapi/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 6a6bcb53594a..0eae5383a0b4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -160,6 +160,7 @@ struct fsxattr {
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
 /* data extent mappings for regular files must be aligned to extent size hint */
 #define FS_XFLAG_FORCEALIGN	0x00020000
+#define FS_XFLAG_ATOMICWRITES	0x00040000	/* atomic writes enabled */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1



* [PATCH v3 17/21] iomap: Atomic write support
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (15 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 16/21] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-05-01  1:47   ` Dave Chinner
  2024-04-29 17:47 ` [PATCH v3 18/21] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Support atomic writes by producing a single BIO with the REQ_ATOMIC flag
set.

We rely on the FS to guarantee extent alignment, such that an atomic write
should never straddle two or more extents. The FS should also check for
validity of an atomic write length/alignment.
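
The two conditions above can be sketched in userspace as below. Both
helper names are hypothetical. The first models the "never straddle
extents" property the FS is expected to guarantee, and the second models
the check that a single bio covered the whole request.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if [pos, pos + len) touches more than one extent_bytes-sized extent. */
static inline bool write_straddles_extents(uint64_t pos, uint64_t len,
					   uint64_t extent_bytes)
{
	assert(extent_bytes > 0 && len > 0);
	return (pos / extent_bytes) != ((pos + len - 1) / extent_bytes);
}

/* An atomic write must be fully covered by the one bio that was built. */
static inline int atomic_bio_check(bool is_atomic, size_t bio_bytes,
				   size_t orig_count)
{
	if (is_atomic && bio_bytes != orig_count)
		return -EINVAL;
	return 0;
}
```

With 16384-byte extents, a 16384-byte write at offset 16384 stays inside
one extent, whereas the same write at offset 8192 straddles two.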

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index a3ed7cfa95bc..d7bdeb675068 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -275,6 +275,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		struct iomap_dio *dio)
 {
+	bool is_atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
 	unsigned int zeroing_size, pad;
@@ -387,6 +388,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 		bio->bi_write_hint = inode->i_write_hint;
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
+		if (is_atomic)
+			bio->bi_opf |= REQ_ATOMIC;
+
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
@@ -403,6 +407,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		}
 
 		n = bio->bi_iter.bi_size;
+		if (is_atomic && n != orig_count) {
+			/* This bio should have covered the complete length */
+			ret = -EINVAL;
+			bio_put(bio);
+			goto out;
+		}
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			task_io_account_write(n);
 		} else {
-- 
2.31.1



* [PATCH v3 18/21] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (16 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 17/21] iomap: Atomic write support John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 19/21] xfs: Support atomic write for statx John Garry
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Add initial support for FS_XFLAG_ATOMICWRITES when forcealign is enabled.

Current kernel support for atomic writes is based on HW support. As such,
extents must be aligned with atomic_write_unit_max so that an atomic write
can result in a single HW-compliant IO operation.

rtvol also guarantees extent alignment, but we are basing support initially
on forcealign, which is not supported for rtvol yet.
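
The conditions under which the flag may be set can be summarised with a
userspace sketch. The function name and its flat parameter list are
hypothetical. It mirrors the checks this patch adds to
xfs_ioctl_setattr_xflags(), with the extent size given in bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the checks made before FS_XFLAG_ATOMICWRITES is
 * accepted. awu_min and awu_max are the device atomic write unit limits
 * in bytes, and are 0 if the device cannot do atomic writes.
 */
static inline int atomicwrites_setattr_check(bool sb_feature, bool forcealign,
					     uint32_t blocksize,
					     uint32_t extsize_bytes,
					     uint32_t awu_min, uint32_t awu_max)
{
	assert(blocksize > 0);

	if (!sb_feature)			/* superblock feature bit missing */
		return -EINVAL;
	if (awu_min > blocksize)		/* device minimum exceeds one block */
		return -EINVAL;
	if (awu_max < extsize_bytes)		/* extent larger than device maximum */
		return -EINVAL;
	if (!forcealign)			/* forcealign is a prerequisite */
		return -EINVAL;
	return 0;
}
```

A device reporting limits of 0 fails the awu_max check for any nonzero
extent size, so the flag cannot be set on storage without atomic write
support.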

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h    | 11 +++++++++--
 fs/xfs/libxfs/xfs_inode_buf.c | 36 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_sb.c        |  2 ++
 fs/xfs/xfs_buf.c              | 15 ++++++++++++++-
 fs/xfs/xfs_buf.h              |  4 +++-
 fs/xfs/xfs_inode.c            |  2 ++
 fs/xfs/xfs_inode.h            |  5 +++++
 fs/xfs/xfs_ioctl.c            | 21 ++++++++++++++++++--
 fs/xfs/xfs_mount.h            |  2 ++
 fs/xfs/xfs_super.c            |  4 ++++
 10 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 0c73b96dbefc..8e32fb068430 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -354,12 +354,16 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
 #define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
+#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31)	/* atomicwrites enabled */
+
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
 		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
 		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
-		 XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+		 XFS_SB_FEAT_RO_COMPAT_FORCEALIGN | \
+		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -1088,6 +1092,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
 /* data extent mappings for regular files must be aligned to extent size hint */
 #define XFS_DIFLAG2_FORCEALIGN_BIT 5
+#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
@@ -1095,10 +1100,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
 #define XFS_DIFLAG2_FORCEALIGN	(1 << XFS_DIFLAG2_FORCEALIGN_BIT)
+#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
+	 XFS_DIFLAG2_ATOMICWRITES)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 12f128f12824..5e42ec1dadb6 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -178,7 +178,10 @@ xfs_inode_from_disk(
 	struct xfs_inode	*ip,
 	struct xfs_dinode	*from)
 {
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
 	struct inode		*inode = VFS_I(ip);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*sbp = &mp->m_sb;
 	int			error;
 	xfs_failaddr_t		fa;
 
@@ -261,6 +264,13 @@ xfs_inode_from_disk(
 	}
 	if (xfs_is_reflink_inode(ip))
 		xfs_ifork_init_cow(ip);
+
+	if (xfs_inode_has_atomicwrites(ip)) {
+		if (sbp->sb_blocksize < target->bt_bdev_awu_min ||
+		    sbp->sb_blocksize * ip->i_extsize > target->bt_bdev_awu_max)
+			ip->i_diflags2 &= ~XFS_DIFLAG2_ATOMICWRITES;
+	}
+
 	return 0;
 
 out_destroy_data_fork:
@@ -460,6 +470,25 @@ xfs_dinode_verify_nrext64(
 	return NULL;
 }
 
+static xfs_failaddr_t
+xfs_inode_validate_atomicwrites(
+	struct xfs_mount	*mp,
+	bool			forcealign)
+{
+	/* superblock rocompat feature flag */
+	if (!xfs_has_atomicwrites(mp))
+		return __this_address;
+
+	/*
+	 * forcealign is required, so rely on sanity checks in
+	 * xfs_inode_validate_forcealign()
+	 */
+	if (!forcealign)
+		return __this_address;
+
+	return NULL;
+}
+
 xfs_failaddr_t
 xfs_dinode_verify(
 	struct xfs_mount	*mp,
@@ -624,6 +653,13 @@ xfs_dinode_verify(
 			return fa;
 	}
 
+	if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
+		fa = xfs_inode_validate_atomicwrites(mp,
+			flags2 & XFS_DIFLAG2_FORCEALIGN);
+		if (fa)
+			return fa;
+	}
+
 	return NULL;
 }
 
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index e746c57c4cc4..a9ae8ab7d610 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -165,6 +165,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_INOBTCNT;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
 		features |= XFS_FEAT_FORCEALIGN;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+		features |= XFS_FEAT_ATOMICWRITES;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1a18c381127e..6e7ac6c90ec1 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2057,6 +2057,8 @@ int
 xfs_init_buftarg(
 	struct xfs_buftarg		*btp,
 	size_t				logical_sectorsize,
+	unsigned int			awu_min,
+	unsigned int			awu_max,
 	const char			*descr)
 {
 	/* Set up device logical sector size mask */
@@ -2083,6 +2085,9 @@ xfs_init_buftarg(
 	btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker->private_data = btp;
 	shrinker_register(btp->bt_shrinker);
+
+	btp->bt_bdev_awu_min = awu_min;
+	btp->bt_bdev_awu_max = awu_max;
 	return 0;
 
 out_destroy_io_count:
@@ -2099,6 +2104,7 @@ xfs_alloc_buftarg(
 {
 	struct xfs_buftarg	*btp;
 	const struct dax_holder_operations *ops = NULL;
+	unsigned int awu_min = 0, awu_max = 0;
 
 #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
 	ops = &xfs_dax_holder_operations;
@@ -2112,6 +2118,13 @@ xfs_alloc_buftarg(
 	btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
 					    mp, ops);
 
+	if (bdev_can_atomic_write(btp->bt_bdev)) {
+		struct request_queue *q = bdev_get_queue(btp->bt_bdev);
+
+		awu_min = queue_atomic_write_unit_min_bytes(q);
+		awu_max = queue_atomic_write_unit_max_bytes(q);
+	}
+
 	/*
 	 * When allocating the buftargs we have not yet read the super block and
 	 * thus don't know the file system sector size yet.
@@ -2119,7 +2132,7 @@ xfs_alloc_buftarg(
 	if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
 		goto error_free;
 	if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
-			mp->m_super->s_id))
+			awu_min, awu_max, mp->m_super->s_id))
 		goto error_free;
 
 	return btp;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b1580644501f..3bcd8137d739 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -124,6 +124,8 @@ struct xfs_buftarg {
 	struct percpu_counter	bt_io_count;
 	struct ratelimit_state	bt_ioerror_rl;
 
+	unsigned int		bt_bdev_awu_min, bt_bdev_awu_max;
+
 	/* built-in cache, if we're not using the perag one */
 	struct xfs_buf_cache	bt_cache[];
 };
@@ -393,7 +395,7 @@ bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
 
 /* for xfs_buf_mem.c only: */
 int xfs_init_buftarg(struct xfs_buftarg *btp, size_t logical_sectorsize,
-		const char *descr);
+		unsigned int awu_min, unsigned int awu_max, const char *descr);
 void xfs_destroy_buftarg(struct xfs_buftarg *btp);
 
 #endif	/* __XFS_BUF_H__ */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index db5a0f66a121..d674fca22de9 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -609,6 +609,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_COWEXTSIZE;
 		if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
 			flags |= FS_XFLAG_FORCEALIGN;
+		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+			flags |= FS_XFLAG_ATOMICWRITES;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3f13943ab3a3..d796456215e2 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -321,6 +321,11 @@ static inline bool xfs_inode_has_extsize(struct xfs_inode *ip)
 	return ip->i_diflags & XFS_DIFLAG_EXTSIZE;
 }
 
+static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
+{
+	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d1126509ceb9..94a6d9f6d0d8 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1112,6 +1112,8 @@ xfs_flags2diflags2(
 		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
 	if (xflags & FS_XFLAG_FORCEALIGN)
 		di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
+	if (xflags & FS_XFLAG_ATOMICWRITES)
+		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
 
 	return di_flags2;
 }
@@ -1122,12 +1124,16 @@ xfs_ioctl_setattr_xflags(
 	struct xfs_inode	*ip,
 	struct fileattr		*fa)
 {
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*sbp = &mp->m_sb;
 	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
+	bool			atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
 	uint64_t		i_flags2;
 
-	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
-		/* Can't change realtime flag if any extents are allocated. */
+	/* Can't change RT or atomic flags if any extents are allocated. */
+	if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
+	    atomic_writes != xfs_inode_has_atomicwrites(ip)) {
 		if (ip->i_df.if_nextents || ip->i_delayed_blks)
 			return -EINVAL;
 	}
@@ -1164,6 +1170,17 @@ xfs_ioctl_setattr_xflags(
 			return -EINVAL;
 	}
 
+	if (atomic_writes) {
+		if (!xfs_has_atomicwrites(mp))
+			return -EINVAL;
+		if (target->bt_bdev_awu_min > sbp->sb_blocksize)
+			return -EINVAL;
+		if (target->bt_bdev_awu_max < fa->fsx_extsize)
+			return -EINVAL;
+		if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
+			return -EINVAL;
+	}
+
 	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_diflags2 = i_flags2;
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a8266cf654c4..5856a72d431e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -293,6 +293,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
 #define XFS_FEAT_FORCEALIGN	(1ULL << 27)	/* aligned file data extents */
+#define XFS_FEAT_ATOMICWRITES	(1ULL << 28)	/* atomic writes support */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -357,6 +358,7 @@ __XFS_HAS_FEAT(bigtime, BIGTIME)
 __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_HAS_FEAT(large_extent_counts, NREXT64)
 __XFS_HAS_FEAT(forcealign, FORCEALIGN)
+__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
 
 /*
  * Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 63d4312785ef..757c90b3d71b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1710,6 +1710,10 @@ xfs_fs_fill_super(
 		xfs_warn(mp,
 "EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
 
+	if (xfs_has_atomicwrites(mp))
+		xfs_warn(mp,
+"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
+
 	if (xfs_has_reflink(mp)) {
 		if (mp->m_sb.sb_rblocks) {
 			xfs_alert(mp,
-- 
2.31.1


* [PATCH v3 19/21] xfs: Support atomic write for statx
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (17 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 18/21] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 20/21] xfs: Validate atomic writes John Garry
  2024-04-29 17:47 ` [PATCH v3 21/21] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Support providing info on atomic write unit min and max for an inode.

For simplicity, the min is currently limited to the FS block size, as
required by iomap DIO; a lower limit could be supported in future.

The atomic write unit min and max are limited by the guaranteed extent
alignment for the inode.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66f8c47642e8..7d2ef3059ca5 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -546,6 +546,27 @@ xfs_stat_blksize(
 	return PAGE_SIZE;
 }
 
+static void
+xfs_get_atomic_write_attr(
+	struct xfs_inode	*ip,
+	unsigned int		*unit_min,
+	unsigned int		*unit_max)
+{
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*sbp = &mp->m_sb;
+	unsigned int		extsz_bytes = XFS_FSB_TO_B(mp, ip->i_extsize);
+
+	if (!xfs_inode_has_atomicwrites(ip)) {
+		*unit_min = 0;
+		*unit_max = 0;
+		return;
+	}
+
+	*unit_min = sbp->sb_blocksize;
+	*unit_max = min(target->bt_bdev_awu_max, extsz_bytes);
+}
+
 STATIC int
 xfs_vn_getattr(
 	struct mnt_idmap	*idmap,
@@ -619,6 +640,13 @@ xfs_vn_getattr(
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
 		}
+		if (request_mask & STATX_WRITE_ATOMIC) {
+			unsigned int unit_min, unit_max;
+
+			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+			generic_fill_statx_atomic_writes(stat,
+				unit_min, unit_max);
+		}
 		fallthrough;
 	default:
 		stat->blksize = xfs_stat_blksize(ip);
-- 
2.31.1


* [PATCH v3 20/21] xfs: Validate atomic writes
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (18 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 19/21] xfs: Support atomic write for statx John Garry
@ 2024-04-29 17:47 ` John Garry
  2024-04-29 17:47 ` [PATCH v3 21/21] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Validate that an atomic write adheres to length/offset rules. Since we
require extent alignment for atomic writes, this effectively also enforces
that the BIO which iomap produces is aligned.
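
As a rough illustration of the length/offset rules referred to above, here
is a minimal userspace sketch. It assumes the rules are: the length is a
power of two, lies between the unit min and unit max, and the file offset is
naturally aligned to the length. The function name and this exact rule set
are assumptions made for illustration; the real check is
generic_atomic_write_valid_size() from patch 01, which is not reproduced
here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of an atomic write size check.
 * Assumed rules (not taken verbatim from the series):
 *  - len is a non-zero power of two
 *  - unit_min <= len <= unit_max
 *  - pos is a multiple of len (natural alignment)
 */
static bool atomic_write_len_ok(uint64_t pos, uint64_t len,
				uint64_t unit_min, uint64_t unit_max)
{
	/* Reject zero and any length with more than one bit set. */
	if (len == 0 || (len & (len - 1)) != 0)
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	/* len is a power of two here, so masking tests alignment. */
	return (pos & (len - 1)) == 0;
}
```

With a 4096-byte block size and a 16384-byte extent size, an 8192-byte write
at offset 8192 would pass under these assumptions, while the same write at
offset 4096 would not.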

For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this, both
the FORCEALIGN and ATOMICWRITES flags need to be set for the inode.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ee4f94cf6f4e..256d05c1be6a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -712,12 +712,20 @@ xfs_file_dio_write(
 	struct kiocb		*iocb,
 	struct iov_iter		*from)
 {
-	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
+	struct inode		*inode = file_inode(iocb->ki_filp);
+	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
 	size_t			count = iov_iter_count(from);
 	struct xfs_mount	*mp = ip->i_mount;
 	unsigned int		blockmask;
 
+	if (iocb->ki_flags & IOCB_ATOMIC) {
+		if (!generic_atomic_write_valid_size(iocb->ki_pos, from,
+			i_blocksize(inode), XFS_FSB_TO_B(mp, ip->i_extsize))) {
+			return -EINVAL;
+		}
+	}
+
 	/* direct I/O must be aligned to device logical sector size */
 	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
 		return -EINVAL;
-- 
2.31.1


* [PATCH v3 21/21] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
  2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
                   ` (19 preceding siblings ...)
  2024-04-29 17:47 ` [PATCH v3 20/21] xfs: Validate atomic writes John Garry
@ 2024-04-29 17:47 ` John Garry
  20 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

When an inode is enabled for atomic writes, set the FMODE_CAN_ATOMIC_WRITE
flag. Only direct IO is currently supported, so check for that also.

We rely on the block layer to reject atomic writes which exceed the bdev
request_queue limits, so don't bother checking any such thing here.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 256d05c1be6a..5a17748eb6bd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1243,6 +1243,18 @@ xfs_file_remap_range(
 	return remapped > 0 ? remapped : ret;
 }
 
+static bool xfs_file_open_can_atomicwrite(
+	struct inode		*inode,
+	struct file		*file)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	if (!(file->f_flags & O_DIRECT))
+		return false;
+
+	return xfs_inode_has_atomicwrites(ip);
+}
+
 STATIC int
 xfs_file_open(
 	struct inode	*inode,
@@ -1252,6 +1264,8 @@ xfs_file_open(
 		return -EIO;
 	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
 			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
+	if (xfs_file_open_can_atomicwrite(inode, file))
+		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
 	return generic_file_open(inode, file);
 }
 
-- 
2.31.1


* Re: [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign
  2024-04-29 17:47 ` [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign John Garry
@ 2024-04-30 22:54   ` Dave Chinner
  2024-05-01  8:30     ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-04-30 22:54 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:34PM +0000, John Garry wrote:
> For when forcealign is enabled, we want the EOF to be aligned as well, so
> do not free EOF blocks.

This doesn't match what the code does. The code is correct - it
rounds the range to be trimmed up to the aligned offset beyond EOF
and then frees them. The description needs to be updated to reflect
this.

> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 19e11d1da660..f26d1570b9bd 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
>  	 * forever.
>  	 */
>  	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> -	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> +
> +	/* Do not free blocks when forcing extent sizes */
> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)

I see this sort of check all through the remaining patches.

Given there are significant restrictions on forced alignment,
shouldn't all the details be pushed inside the helper function?
e.g.

/*
 * Forced extent alignment is dependent on extent size hints being
 * set to define the alignment. Alignment is only necessary when the
 * extent size hint is larger than a single block.
 *
 * If reflink is enabled on the file or we are in always_cow mode,
 * we can't easily do forced alignment.
 *
 * We don't support forced alignment on realtime files.
 * XXX(dgc): why not?
 */
static inline bool
xfs_inode_has_forcealign(struct xfs_inode *ip)
{
	if (!(ip->i_diflags & XFS_DIFLAG_EXTSIZE))
		return false;
	if (ip->i_extsize <= 1)
		return false;

	if (xfs_is_cow_inode(ip))
		return false;
	if (ip->i_diflags & XFS_DIFLAG_REALTIME)
		return false;

	return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
}

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-04-29 17:47 ` [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag John Garry
@ 2024-04-30 23:22   ` Dave Chinner
  2024-05-01 10:03     ` John Garry
  2024-06-12  2:10   ` Long Li
  1 sibling, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-04-30 23:22 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:33PM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> Add a new inode flag to require that all file data extent mappings must
> be aligned (both the file offset range and the allocated space itself)
> to the extent size hint.  Having a separate COW extent size hint is no
> longer allowed.
> 
> The goal here is to enable sysadmins and users to mandate that all space
> mappings in a file must have a startoff/blockcount that are aligned to
> (say) a 2MB alignment and that the startblock/blockcount will follow the
> same alignment.
> 
> jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
>      alignment for forcealign
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> Co-developed-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---

....

> @@ -783,3 +791,45 @@ xfs_inode_validate_cowextsize(
>  
>  	return NULL;
>  }
> +
> +/* Validate the forcealign inode flag */
> +xfs_failaddr_t
> +xfs_inode_validate_forcealign(
> +	struct xfs_mount	*mp,
> +	uint16_t		mode,

	umode_t			mode,

> +	uint16_t		flags,
> +	uint32_t		extsize,
> +	uint32_t		cowextsize)

extent sizes are xfs_extlen_t types.

> +{
> +	/* superblock rocompat feature flag */
> +	if (!xfs_has_forcealign(mp))
> +		return __this_address;
> +
> +	/* Only regular files and directories */
> +	if (!S_ISDIR(mode) && !S_ISREG(mode))
> +		return __this_address;
> +
> +	/* Doesn't apply to realtime files */
> +	if (flags & XFS_DIFLAG_REALTIME)
> +		return __this_address;

Why not? A rt device with an extsize of 1 fsb could make use of
forced alignment just like the data device to allow larger atomic
writes to be done. I mean, just because we haven't written the code
to do this yet doesn't mean it is an illegal on-disk format state.

> +	/* Requires a non-zero power-of-2 extent size hint */
> +	if (extsize == 0 || !is_power_of_2(extsize) ||
> +	    (mp->m_sb.sb_agblocks % extsize))
> +		return __this_address;

Please do these as individual checks with their own fail address.
That way we can tell which check failed from the console output.
Also, the agblocks check is already split out below, so it's being
checked twice...

Also, why does force-align require a power-of-2 extent size? Why
does it require the extent size to be an exact divisor of the AG
size? Aren't these atomic write alignment restrictions? i.e.
shouldn't these only be enforced when the atomic writes inode flag
is set?

> +	/* Requires agsize be a multiple of extsize */
> +	if (mp->m_sb.sb_agblocks % extsize)
> +		return __this_address;
> +
> +	/* Requires stripe unit+width (if set) be a multiple of extsize */
> +	if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
> +	    (mp->m_swidth && (mp->m_swidth % extsize)))
> +		return __this_address;

Again, this is an atomic write constraint, isn't it?

> +	/* Requires no cow extent size hint */
> +	if (cowextsize != 0)
> +		return __this_address;

What if it's a reflinked file?

.....

> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d0e2cec6210d..d1126509ceb9 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
>  		di_flags2 |= XFS_DIFLAG2_DAX;
>  	if (xflags & FS_XFLAG_COWEXTSIZE)
>  		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> +	if (xflags & FS_XFLAG_FORCEALIGN)
> +		di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
>  
>  	return di_flags2;
>  }
> @@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
>  	if (i_flags2 && !xfs_has_v3inodes(mp))
>  		return -EINVAL;
>  
> +	/*
> +	 * Force-align requires a nonzero extent size hint and a zero cow
> +	 * extent size hint.  It doesn't apply to realtime files.
> +	 */
> +	if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
> +		if (!xfs_has_forcealign(mp))
> +			return -EINVAL;
> +		if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
> +			return -EINVAL;
> +		if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
> +					FS_XFLAG_EXTSZINHERIT)))
> +			return -EINVAL;
> +		if (fa->fsx_xflags & FS_XFLAG_REALTIME)
> +			return -EINVAL;
> +	}

What about if the file already has shared extents on it (i.e.
reflinked or deduped?)

Also, why is this getting checked here instead of in
xfs_ioctl_setattr_check_extsize()?


> @@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
>  	failaddr = xfs_inode_validate_extsize(ip->i_mount,
>  			XFS_B_TO_FSB(mp, fa->fsx_extsize),
>  			VFS_I(ip)->i_mode, new_diflags);
> -	return failaddr != NULL ? -EINVAL : 0;
> +	if (failaddr)
> +		return -EINVAL;
> +
> +	if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
> +		failaddr = xfs_inode_validate_forcealign(ip->i_mount,
> +				VFS_I(ip)->i_mode, new_diflags,
> +				XFS_B_TO_FSB(mp, fa->fsx_extsize),
> +				XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
> +		if (failaddr)
> +			return -EINVAL;
> +	}

Oh, it's because you're trying to use on-disk format validation
routines for user API validation. That, IMO, is a bad idea because
the on-disk format and kernel/user APIs should not be tied
together as they have different constraints and error conditions.

That also explains why xfs_inode_validate_forcealign() doesn't just
get passed the inode to validate - it's because you want to pass
information from the user API to it. This results in sub-optimal
code for both on-disk format validation and user API validation.

Can you please separate these and put all the force align user API
validation checks in the one function?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask for forcealign
  2024-04-29 17:47 ` [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask " John Garry
@ 2024-04-30 23:35   ` Dave Chinner
  2024-05-01 10:48     ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-04-30 23:35 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:35PM +0000, John Garry wrote:
> For when forcealign is enabled, we want the alignment mask to cover an
> aligned extent, similar to rtvol.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 632653e00906..e81e01e6b22b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -61,7 +61,10 @@ xfs_is_falloc_aligned(
>  		}
>  		mask = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) - 1;
>  	} else {
> -		mask = mp->m_sb.sb_blocksize - 1;
> +		if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
> +			mask = (mp->m_sb.sb_blocksize * ip->i_extsize) - 1;
> +		else
> +			mask = mp->m_sb.sb_blocksize - 1;
>  	}
>  
>  	return !((pos | len) & mask);

I think this whole function needs to be rewritten so that
non-power-of-2 extent sizes are supported on both devices properly.

	xfs_extlen_t	fsbs = 1;
	u64		bytes;
	u32		mod;

	if (xfs_inode_has_forcealign(ip))
		fsbs = ip->i_extsize;
	else if (XFS_IS_REALTIME_INODE(ip))
		fsbs = mp->m_sb.sb_rextsize;

	bytes = XFS_FSB_TO_B(mp, fsbs);
	if (is_power_of_2(fsbs))
		return !((pos | len) & (bytes - 1));

	div_u64_rem(pos, bytes, &mod);
	if (mod)
		return false;
	div_u64_rem(len, bytes, &mod);
	return mod == 0;
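
The same check can be modelled outside the kernel. The following is a
minimal userspace sketch, not the XFS implementation: is_pow2() and
range_is_aligned() are hypothetical names, and plain 64-bit modulo stands in
for div_u64_rem(). It shows the mask fast path for power-of-2 alignments and
the remainder fallback for everything else.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's is_power_of_2(). */
static bool is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return true if both pos and len are multiples of align_bytes.
 * align_bytes must be non-zero.
 */
static bool range_is_aligned(uint64_t pos, uint64_t len, uint64_t align_bytes)
{
	/* Power-of-2 alignment: one mask covers both values. */
	if (is_pow2(align_bytes))
		return ((pos | len) & (align_bytes - 1)) == 0;

	/* Otherwise each value needs its own remainder check. */
	return pos % align_bytes == 0 && len % align_bytes == 0;
}
```

For example, with a 12288-byte alignment (three 4096-byte blocks), offset
24576 with length 12288 is aligned, while offset 16384 is not.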

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign
  2024-04-29 17:47 ` [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign John Garry
@ 2024-05-01  0:10   ` Dave Chinner
  2024-05-01 10:54     ` John Garry
  2024-06-06  9:50     ` John Garry
  0 siblings, 2 replies; 60+ messages in thread
From: Dave Chinner @ 2024-05-01  0:10 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:36PM +0000, John Garry wrote:
> For when forcealign is enabled, blocks in an inode need to be unmapped
> according to extent alignment, like what is already done for rtvol.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 39 +++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_inode.h       |  5 +++++
>  2 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4f39a43d78a7..4a78ab193753 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5339,6 +5339,15 @@ xfs_bmap_del_extent_real(
>  	return 0;
>  }
>  
> +/* Return the offset of an block number within an extent for forcealign. */
> +static xfs_extlen_t
> +xfs_forcealign_extent_offset(
> +	struct xfs_inode	*ip,
> +	xfs_fsblock_t		bno)
> +{
> +	return bno & (ip->i_extsize - 1);
> +}
> +
>  /*
>   * Unmap (remove) blocks from a file.
>   * If nexts is nonzero then the number of extents to remove is limited to
> @@ -5361,6 +5370,7 @@ __xfs_bunmapi(
>  	struct xfs_bmbt_irec	got;		/* current extent record */
>  	struct xfs_ifork	*ifp;		/* inode fork pointer */
>  	int			isrt;		/* freeing in rt area */
> +	int			isforcealign;	/* freeing for file inode with forcealign */
>  	int			logflags;	/* transaction logging flags */
>  	xfs_extlen_t		mod;		/* rt extent offset */
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -5397,7 +5407,10 @@ __xfs_bunmapi(
>  		return 0;
>  	}
>  	XFS_STATS_INC(mp, xs_blk_unmap);
> -	isrt = xfs_ifork_is_realtime(ip, whichfork);
> +	isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);

Why did you change this check? What's wrong with
xfs_ifork_is_realtime(), and if there is something wrong, why
shouldn't xfs_ifork_is_realtime() get fixed?

> +	isforcealign = (whichfork == XFS_DATA_FORK) &&
> +			xfs_inode_has_forcealign(ip) &&
> +			xfs_inode_has_extsize(ip) && ip->i_extsize > 1;

This is one of the reasons why I said xfs_inode_has_forcealign()
should be checking the extent size hint inside that helper....

>  	end = start + len;
>  
>  	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
> @@ -5459,11 +5472,15 @@ __xfs_bunmapi(
>  		if (del.br_startoff + del.br_blockcount > end + 1)
>  			del.br_blockcount = end + 1 - del.br_startoff;
>  
> -		if (!isrt || (flags & XFS_BMAPI_REMAP))
> +		if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
>  			goto delete;
>  
> -		mod = xfs_rtb_to_rtxoff(mp,
> -				del.br_startblock + del.br_blockcount);
> +		if (isrt)
> +			mod = xfs_rtb_to_rtxoff(mp,
> +					del.br_startblock + del.br_blockcount);
> +		else if (isforcealign)
> +			mod = xfs_forcealign_extent_offset(ip,
> +					del.br_startblock + del.br_blockcount);

There's got to be a cleaner way to do this.

We already know that either isrt or isforcealign must be set here,
so there's no need for the "else if" construct.

Also, forcealign should take precedence over realtime, so that
forcealign will work on realtime devices as well. I'd change this
code to call a wrapper like:

		mod = xfs_bunmapi_align(ip, del.br_startblock + del.br_blockcount);

static xfs_extlen_t
xfs_bunmapi_align(
	struct xfs_inode	*ip,
	xfs_fsblock_t		bno)
{
	if (!XFS_IS_REALTIME_INODE(ip)) {
		ASSERT(xfs_inode_has_forcealign(ip));
		if (is_power_of_2(ip->i_extsize))
			return bno & (ip->i_extsize - 1);
		return do_div(bno, ip->i_extsize);
	}
	return xfs_rtb_to_rtxoff(ip->i_mount, bno);
}



>  		if (mod) {
>  			/*
>  			 * Realtime extent not lined up at the end.
> @@ -5511,9 +5528,19 @@ __xfs_bunmapi(
>  			goto nodelete;
>  		}
>  
> -		mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
> +		if (isrt)
> +			mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
> +		else if (isforcealign)
> +			mod = xfs_forcealign_extent_offset(ip,
> +					del.br_startblock);
> +
		mod = xfs_bunmapi_align(ip, del.br_startblock);

>  		if (mod) {
> -			xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
> +			xfs_extlen_t off;
> +
> +			if (isrt)
> +				off = mp->m_sb.sb_rextsize - mod;
> +			else if (isforcealign)
> +				off = ip->i_extsize - mod;

			if (isforcealign)
				off = ip->i_extsize - mod;
			else
				off = mp->m_sb.sb_rextsize - mod;

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign
  2024-04-29 17:47 ` [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign John Garry
@ 2024-05-01  0:53   ` Dave Chinner
  2024-05-01 11:24     ` John Garry
  2024-05-01 23:53     ` Darrick J. Wong
  0 siblings, 2 replies; 60+ messages in thread
From: Dave Chinner @ 2024-05-01  0:53 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:37PM +0000, John Garry wrote:
> Like we already do for rtvol, only free full extents for forcealign in
> xfs_free_file_space().
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index f26d1570b9bd..1dd45dfb2811 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -847,8 +847,11 @@ xfs_free_file_space(
>  	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
>  	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
>  
> -	/* We can only free complete realtime extents. */
> -	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
> +	/* Free only complete extents. */
> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
> +		startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
> +		endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
> +	} else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
>  		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
>  		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
>  	}

When you look at xfs_rtb_roundup_rtx() you'll find it's just a one
line wrapper around roundup_64().

So let's get rid of the obfuscation that the one line RT wrapper
introduces, and it turns into this:

	rounding = 1;
	if (xfs_inode_has_forcealign(ip))
		rounding = ip->i_extsize;
	else if (XFS_IS_REALTIME_INODE(ip))
		rounding = mp->m_sb.sb_rextsize;

	if (rounding > 1) {
		startoffset_fsb = roundup_64(startoffset_fsb, rounding);
		endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
	}

What this points out is that the prep steps for fallocate operations
also need to handle both forced alignment and rtextsize rounding,
and they do neither right now.  xfs_flush_unmap_range() is the main
offender here, but xfs_prepare_shift() also needs fixing.

Hence:

static inline xfs_extlen_t
xfs_extent_alignment(
	struct xfs_inode	*ip)
{
	if (xfs_inode_has_forcealign(ip))
		return ip->i_extsize;
	if (XFS_IS_REALTIME_INODE(ip))
		return ip->i_mount->m_sb.sb_rextsize;
	return 1;
}


In xfs_flush_unmap_range():

	/*
	 * Make sure we extend the flush out to extent alignment
	 * boundaries so any extent range overlapping the start/end
	 * of the modification we are about to do is clean and idle.
	 */
	rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
	rounding = max(rounding, PAGE_SIZE);
	...

in xfs_free_file_space()

	/*
	 * Round the range we are going to free inwards to extent
	 * alignment boundaries so we don't free blocks outside the
	 * range requested.
	 */
	rounding = xfs_extent_alignment(ip);
	if (rounding > 1) {
		startoffset_fsb = roundup_64(startoffset_fsb, rounding);
		endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
	}

and in xfs_prepare_shift()

	/*
	 * Shift operations must stabilize the start block offset boundary along
	 * with the full range of the operation. If we don't, a COW writeback
	 * completion could race with an insert, front merge with the start
	 * extent (after split) during the shift and corrupt the file. Start
	 * with the aligned block just prior to the start to stabilize the boundary.
	 */
	rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
	offset = round_down(offset, rounding);
	if (offset)
		offset -= rounding;

Also, I think that the changes I suggested earlier to 
xfs_is_falloc_aligned() could use this xfs_extent_alignment()
helper...

Overall this makes the code a whole lot easier to read and it also
allows forced alignment to work correctly on RT devices...
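
The inward rounding used for xfs_free_file_space() above can be sketched in
userspace as follows. This is an illustration only: struct fsb_range,
round_up_to(), round_down_to() and shrink_to_alignment() are hypothetical
names standing in for the kernel's roundup_64()/rounddown_64() usage, the
range is treated as half-open [start, end) in filesystem blocks, and
overflow near UINT64_MAX is not handled.

```c
#include <stdint.h>

/* Half-open block range [start, end). */
struct fsb_range {
	uint64_t start;
	uint64_t end;
};

/* Round x up to the next multiple of align (align > 0, any value). */
static uint64_t round_up_to(uint64_t x, uint64_t align)
{
	return ((x + align - 1) / align) * align;
}

/* Round x down to the previous multiple of align (align > 0). */
static uint64_t round_down_to(uint64_t x, uint64_t align)
{
	return (x / align) * align;
}

/*
 * Shrink [start, end) inwards to alignment boundaries so that only
 * whole aligned extents remain. If the result has start >= end, no
 * complete extent lies inside the requested range.
 */
static struct fsb_range shrink_to_alignment(uint64_t start, uint64_t end,
					    uint64_t align)
{
	struct fsb_range r = { start, end };

	if (align > 1) {
		r.start = round_up_to(start, align);
		r.end = round_down_to(end, align);
	}
	return r;
}
```

With a 4-block alignment, the range [3, 13) shrinks to [4, 12), while
[5, 7) shrinks to start 8, end 4, so nothing is freed.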

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v3 14/21] iomap: Sub-extent zeroing
  2024-04-29 17:47 ` [PATCH v3 14/21] iomap: Sub-extent zeroing John Garry
@ 2024-05-01  1:07   ` Dave Chinner
  2024-05-01 10:23     ` John Garry
  2024-05-30 10:40     ` John Garry
  2024-06-11  3:10   ` Long Li
  1 sibling, 2 replies; 60+ messages in thread
From: Dave Chinner @ 2024-05-01  1:07 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:39PM +0000, John Garry wrote:
> For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
> sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
> unwritten.
> 
> This will be important for atomic writes support, in that atomically
> writing over a partially written extent would mean that we would need to
> do the unwritten extent conversion write separately, and the write could
> no longer be atomic.
> 
> It is the task of the FS to set iomap.extent_size per iter to indicate
> sub-extent zeroing required.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Shouldn't this be done before the XFS feature is enabled in the
series?

> ---
>  fs/iomap/direct-io.c  | 17 +++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..a3ed7cfa95bc 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  {
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
> -	unsigned int fs_block_size = i_blocksize(inode), pad;
> +	unsigned int zeroing_size, pad;
>  	loff_t length = iomap_length(iter);
>  	loff_t pos = iter->pos;
>  	blk_opf_t bio_opf;
> @@ -288,6 +288,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	size_t copied = 0;
>  	size_t orig_count;
>  
> +	if (iomap->extent_size)
> +		zeroing_size = iomap->extent_size;
> +	else
> +		zeroing_size = i_blocksize(inode);

Oh, the dissonance!

iomap->extent_size isn't an extent size at all.

The size of the extent the iomap returns is iomap->length. This new
variable is the IO specific "block size" that should be assumed by
the dio code to determine if padding should be done.

IOWs, I think we should add an "io_block_size" field to the iomap,
and every filesystem that supports iomap should set it to the
filesystem block size (i_blocksize(inode)). Then the changes to the
iomap code end up just being:


-	unsigned int fs_block_size = i_blocksize(inode), pad;
+	unsigned int fs_block_size = iomap->io_block_size, pad;

And the patch that introduces that infrastructure change will also
change all the filesystem implementations to unconditionally set
iomap->io_block_size to i_blocksize().

Then, in a separate patch, you can add XFS support for large IO
block sizes when we have either a large rtextsize or extent size
hints set.
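
To make the effect of a larger IO block size concrete, here is a minimal
userspace sketch of the head and tail padding arithmetic that the dio code
applies around a write into an unwritten region. struct zero_pads and
dio_zero_pads() are hypothetical names, and the sketch assumes the IO block
size is a power of two, as the mask arithmetic in the quoted hunks does.

```c
#include <stdint.h>

/* Bytes to zero before and after a write, within its IO block(s). */
struct zero_pads {
	uint64_t head;
	uint64_t tail;
};

/*
 * Compute the zeroing needed for a write of len bytes at pos.
 * io_block_size must be a non-zero power of two.
 */
static struct zero_pads dio_zero_pads(uint64_t pos, uint64_t len,
				      uint64_t io_block_size)
{
	uint64_t mask = io_block_size - 1;
	uint64_t end = pos + len;
	struct zero_pads z;

	/* From the start of the IO block up to the write offset. */
	z.head = pos & mask;
	/* From the end of the write up to the end of the IO block. */
	z.tail = (end & mask) ? io_block_size - (end & mask) : 0;
	return z;
}
```

A 4096-byte write at offset 4096 needs no padding with a 4096-byte IO block
size, but with a 16384-byte IO block size it needs 4096 bytes zeroed in
front and 8192 bytes zeroed behind.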

> +
>  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>  		return -EINVAL;
> @@ -354,8 +359,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>  
>  	if (need_zeroout) {
> -		/* zero out from the start of the block to the write offset */
> -		pad = pos & (fs_block_size - 1);
> +		/* zero out from the start of the region to the write offset */
> +		pad = pos & (zeroing_size - 1);
>  		if (pad)
>  			iomap_dio_zero(iter, dio, pos - pad, pad);
>  	}
> @@ -428,10 +433,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  zero_tail:
>  	if (need_zeroout ||
>  	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
> -		/* zero out from the end of the write to the end of the block */
> -		pad = pos & (fs_block_size - 1);
> +		/* zero out from the end of the write to the end of the region */
> +		pad = pos & (zeroing_size - 1);
>  		if (pad)
> -			iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
> +			iomap_dio_zero(iter, dio, pos, zeroing_size - pad);
>  	}
>  out:
>  	/* Undo iter limitation to current extent */
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6fc1c858013d..42623b1cdc04 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -97,6 +97,7 @@ struct iomap {
>  	u64			length;	/* length of mapping, bytes */
>  	u16			type;	/* type of mapping */
>  	u16			flags;	/* flags for mapping */
> +	unsigned int		extent_size;

This needs a descriptive comment. At minimum, it should tell the
reader what units are used for the variable.  If it is bytes, then
it needs to be a u64, because XFS can have extent size hints well
beyond 2^32 bytes in length.
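
As a rough illustration of the truncation risk (userspace sketch, the
helper name is hypothetical and the 8GiB value is just an example of a
byte count above 2^32):

```c
#include <stdint.h>

/* Mimics storing a byte count in a 32-bit field such as "unsigned int". */
static uint32_t demo_store_in_u32(uint64_t bytes)
{
	return (uint32_t)bytes;	/* high 32 bits are silently dropped */
}
```

An 8GiB value stored this way reads back as 0, and 2^32 + 4096 reads
back as 4096, so the dio code would pad to the wrong boundary without
any error being reported.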

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 15/21] fs: xfs: iomap: Sub-extent zeroing
  2024-04-29 17:47 ` [PATCH v3 15/21] fs: xfs: " John Garry
@ 2024-05-01  1:32   ` Dave Chinner
  2024-05-01 11:36     ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-05-01  1:32 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:40PM +0000, John Garry wrote:
> Set iomap->extent_size when sub-extent zeroing is required.
> 
> We treat a sub-extent write same as an unaligned write, so we can leverage
> the existing sub-FSblock unaligned write support, i.e. try a shared lock
> with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
> lock.
> 
> In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.

If forcealign is set, should we just reject unaligned DIOs?

.....
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c  | 35 ++++++++++++++++++++++-------------
>  fs/xfs/xfs_iomap.c | 13 +++++++++++--
>  2 files changed, 33 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e81e01e6b22b..ee4f94cf6f4e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -620,18 +620,19 @@ xfs_file_dio_write_aligned(
>   * Handle block unaligned direct I/O writes

 * Handle unaligned direct IO writes.

>   *
>   * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
> - * them to be done in parallel with reads and other direct I/O writes.  However,
> - * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
> - * to do sub-block zeroing and that requires serialisation against other direct
> - * I/O to the same block.  In this case we need to serialise the submission of
> - * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
> - * In the case where sub-block zeroing is not required, we can do concurrent
> - * sub-block dios to the same block successfully.
> + * them to be done in parallel with reads and other direct I/O writes.
> + * However if the I/O is not aligned to filesystem blocks/extent, the direct
> + * I/O layer may need to do sub-block/extent zeroing and that requires
> + * serialisation against other direct I/O to the same block/extent.  In this
> + * case we need to serialise the submission of the unaligned I/O so that we
> + * don't get racing block/extent zeroing in the dio layer.
> + * In the case where sub-block/extent zeroing is not required, we can do
> + * concurrent sub-block/extent dios to the same block/extent successfully.
>   *
>   * Optimistically submit the I/O using the shared lock first, but use the
>   * IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
> - * if block allocation or partial block zeroing would be required.  In that case
> - * we try again with the exclusive lock.
> + * if block/extent allocation or partial block/extent zeroing would be
> + * required.  In that case we try again with the exclusive lock.

Rather than changing every "block" to "block/extent", leave the bulk
of the comment unchanged and add another paragraph to it that says
something like:

 * If forced extent alignment is turned on, then serialisation
 * constraints are extended from filesystem block alignment
 * to extent alignment boundaries. In this case, we treat any
 * non-extent-aligned DIO the same as a sub-block DIO.

>   */
>  static noinline ssize_t
>  xfs_file_dio_write_unaligned(
> @@ -646,9 +647,9 @@ xfs_file_dio_write_unaligned(
>  	ssize_t			ret;
>  
>  	/*
> -	 * Extending writes need exclusivity because of the sub-block zeroing
> -	 * that the DIO code always does for partial tail blocks beyond EOF, so
> -	 * don't even bother trying the fast path in this case.
> +	 * Extending writes need exclusivity because of the sub-block/extent
> +	 * zeroing that the DIO code always does for partial tail blocks
> +	 * beyond EOF, so don't even bother trying the fast path in this case.
>  	 */
>  	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>  		if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -714,11 +715,19 @@ xfs_file_dio_write(
>  	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
>  	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
>  	size_t			count = iov_iter_count(from);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	unsigned int		blockmask;
>  
>  	/* direct I/O must be aligned to device logical sector size */
>  	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
> -	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
> +
> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
> +		blockmask = XFS_FSB_TO_B(mp, ip->i_extsize) - 1;
> +	else
> +		blockmask = mp->m_blockmask;

	alignmask = XFS_FSB_TO_B(mp, xfs_inode_alignment(ip)) - 1;

Note that this would consider sub rt_extsize IO as unaligned, which
may be undesirable. In that case, we should define a second helper
such as xfs_inode_io_alignment() that doesn't take into account RT
extent sizes because we can still do filesystem block sized
unwritten extent conversion on those devices. The same IO-specific
wrapper would be used for the other cases in this patch, too.
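
A minimal userspace sketch of the aligned/unaligned decision, assuming
the alignment is a power-of-2 number of filesystem blocks. The struct
and function names are invented for illustration and are not the XFS
ones:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode state the check depends on. */
struct demo_dio_inode {
	uint32_t blocksize;	/* filesystem block size, bytes */
	uint32_t align_fsb;	/* IO alignment in filesystem blocks, 1 = none */
};

/* True if the write can take the aligned (shared lock) path. */
static bool demo_dio_is_aligned(const struct demo_dio_inode *ip,
				uint64_t pos, uint64_t count)
{
	uint64_t align_bytes = (uint64_t)ip->blocksize * ip->align_fsb;

	/* A mask is only valid for power-of-2 alignments. */
	assert(align_bytes && (align_bytes & (align_bytes - 1)) == 0);
	return ((pos | count) & (align_bytes - 1)) == 0;
}
```

With 4kB blocks and a 4 block alignment, a 16kB write at offset 16kB
is aligned, while a 4kB write at offset 4kB is not and would go down
the unaligned path.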

> +
> +	if ((iocb->ki_pos | count) & blockmask)
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from);
>  }
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 4087af7f3c9f..1a3692bbc84d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -138,6 +138,8 @@ xfs_bmbt_to_iomap(
>  
>  	iomap->validity_cookie = sequence_cookie;
>  	iomap->folio_ops = &xfs_iomap_folio_ops;
> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
> +		iomap->extent_size = XFS_FSB_TO_B(mp, ip->i_extsize);

	iomap->io_block_size = XFS_FSB_TO_B(mp, xfs_inode_alignment(ip));

>  	return 0;
>  }
>  
> @@ -570,8 +572,15 @@ xfs_iomap_write_unwritten(
>  
>  	trace_xfs_unwritten_convert(ip, offset, count);
>  
> -	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> -	count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
> +		xfs_extlen_t extsize_bytes = mp->m_sb.sb_blocksize * ip->i_extsize;
> +
> +		offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
> +		count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
> +	} else {
> +		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +		count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> +	}

More places we can use a xfs_inode_alignment() helper.

	offset_fsb = XFS_B_TO_FSBT(mp, offset);
	count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
	rounding = xfs_inode_alignment(ip);
	if (rounding > 1) {
		offset_fsb = rounddown_64(offset_fsb, rounding);
		count_fsb = roundup_64(count_fsb, rounding);
	}
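
For reference, the same rounding as a standalone userspace sketch. The
names are hypothetical, and it assumes the alignment is expressed in
filesystem blocks so that it can be applied directly to block numbers:

```c
#include <assert.h>
#include <stdint.h>

struct demo_fsb_range {
	uint64_t start_fsb;	/* first block to convert */
	uint64_t end_fsb;	/* one past the last block to convert */
};

/* Expand a byte range outwards to alignment boundaries, in block units. */
static struct demo_fsb_range
demo_unwritten_range(uint64_t offset, uint64_t count,
		     uint32_t blocksize, uint32_t align_fsb)
{
	struct demo_fsb_range r;

	assert(blocksize > 0 && align_fsb > 0);
	r.start_fsb = offset / blocksize;
	r.end_fsb = (offset + count + blocksize - 1) / blocksize;
	if (align_fsb > 1) {
		r.start_fsb -= r.start_fsb % align_fsb;
		r.end_fsb = (r.end_fsb + align_fsb - 1) / align_fsb * align_fsb;
	}
	return r;
}
```

With 4kB blocks and a 4 block alignment, a 4kB write at offset 20480
(block 5) expands to blocks [4, 8). Without alignment it stays at
[5, 6).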

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 17/21] iomap: Atomic write support
  2024-04-29 17:47 ` [PATCH v3 17/21] iomap: Atomic write support John Garry
@ 2024-05-01  1:47   ` Dave Chinner
  2024-05-01 11:08     ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-05-01  1:47 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:42PM +0000, John Garry wrote:
> Support atomic writes by producing a single BIO with REQ_ATOMIC flag set.
> 
> We rely on the FS to guarantee extent alignment, such that an atomic write
> should never straddle two or more extents. The FS should also check for
> validity of an atomic write length/alignment.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index a3ed7cfa95bc..d7bdeb675068 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -275,6 +275,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		struct iomap_dio *dio)
>  {
> +	bool is_atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
>  	unsigned int zeroing_size, pad;
> @@ -387,6 +388,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  		bio->bi_write_hint = inode->i_write_hint;
>  		bio->bi_ioprio = dio->iocb->ki_ioprio;
> +		if (is_atomic)
> +			bio->bi_opf |= REQ_ATOMIC;

REQ_ATOMIC is only valid for write IO, isn't it?

This should be added in iomap_dio_bio_opflags() after it is
determined we are doing a write operation.  Regardless, it should be
added in iomap_dio_bio_opflags(), not here. That also allows us to
get rid of the is_atomic variable.

> +
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> @@ -403,6 +407,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> +		if (is_atomic && n != orig_count) {
> +			/* This bio should have covered the complete length */
> +			ret = -EINVAL;
> +			bio_put(bio);
> +			goto out;
> +		}

What happens now if we've done zeroing IO before this? I suspect we
might expose stale data if the partial block zeroing converts the
unwritten extent in full...

>  		if (dio->flags & IOMAP_DIO_WRITE) {
>  			task_io_account_write(n);
>  		} else {

Ignoring the error handling issues, this code might be better as:

		if (dio->flags & IOMAP_DIO_WRITE) {
			if ((opflags & REQ_ATOMIC) && n != orig_count) {
				/* atomic writes are all or nothing */
				ret = -EIO;
				bio_put(bio);
				goto out;
			}
		}

so that we are not putting atomic write error checks in the read IO
submission path.
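
A self-contained userspace sketch of that split, with the atomic flag
only ever set for writes and the all-or-nothing check keyed off the
flag. The DEMO_* names and both helpers are stand-ins, not the real
block layer definitions:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_REQ_ATOMIC	(1u << 0)	/* stand-in for REQ_ATOMIC */

/* Build op flags; the atomic flag is only meaningful for writes. */
static unsigned int demo_bio_opflags(bool is_write, bool want_atomic)
{
	unsigned int opflags = 0;

	if (is_write && want_atomic)
		opflags |= DEMO_REQ_ATOMIC;
	return opflags;
}

/* Atomic writes are all or nothing: one bio must cover the whole IO. */
static int demo_check_atomic(unsigned int opflags, size_t bio_bytes,
			     size_t orig_count)
{
	if ((opflags & DEMO_REQ_ATOMIC) && bio_bytes != orig_count)
		return -EIO;
	return 0;
}
```

A read never gets the flag, so it never hits the size check. A 16kB
atomic write that only produced an 8kB bio fails with -EIO.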

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign
  2024-04-30 22:54   ` Dave Chinner
@ 2024-05-01  8:30     ` John Garry
  2024-05-02  1:11       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-05-01  8:30 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 30/04/2024 23:54, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:34PM +0000, John Garry wrote:
>> For when forcealign is enabled, we want the EOF to be aligned as well, so
>> do not free EOF blocks.
> 
> This doesn't match what the code does. The code is correct - it
> rounds the range to be trimmed up to the aligned offset beyond EOF
> and then frees them. The description needs to be updated to reflect
> this.

ok, fine

> 
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_bmap_util.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 19e11d1da660..f26d1570b9bd 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
>>   	 * forever.
>>   	 */
>>   	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
>> -	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
>> +
>> +	/* Do not free blocks when forcing extent sizes */
>> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
> 
> I see this sort of check all through the remaining patches.
> 
> Given there are significant restrictions on forced alignment,
> shouldn't all the details be pushed inside the helper function?
> e.g.
> 
> /*
>   * Forced extent alignment is dependent on extent size hints being
>   * set to define the alignment. Alignment is only necessary when the
>   * extent size hint is larger than a single block.
>   *
>   * If reflink is enabled on the file or we are in always_cow mode,
>   * we can't easily do forced alignment.
>   *
>   * We don't support forced alignment on realtime files.
>   * XXX(dgc): why not?

There is no technical reason to not be able to support forcealign on RT, 
AFAIK. My idea is to support RT after non-RT is supported.

>   */
> static inline bool
> xfs_inode_has_forcealign(struct xfs_inode *ip)
> {
> 	if (!(ip->di_flags & XFS_DIFLAG_EXTSIZE))
> 		return false;
> 	if (ip->i_extsize <= 1)
> 		return false;
> 
> 	if (xfs_is_cow_inode(ip))
> 		return false;

Could we just include this in the forcealign validate checks? Currently 
we just check CoW extsize is zero there.

> 	if (ip->di_flags & XFS_DIFLAG_REALTIME)
> 		return false;

We check this in xfs_inode_validate_forcealign()

> 
> 	return ip->di_flags2 & XFS_DIFLAG2_FORCEALIGN;
> }
> 

So can we simply have:

static inline bool
xfs_inode_has_forcealign(struct xfs_inode *ip)
{

	if (!(ip->di_flags & XFS_DIFLAG_EXTSIZE))
		return false;
  	if (ip->i_extsize <= 1)
  		return false;
  	return ip->di_flags2 & XFS_DIFLAG2_FORCEALIGN;
}
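
To show what that reduced helper would and would not accept, a
userspace sketch with mock types. The DEMO_* flag values and the struct
are placeholders, not the on-disk definitions:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_DIFLAG_EXTSIZE	(1u << 0)	/* placeholder bit value */
#define DEMO_DIFLAG2_FORCEALIGN	(1u << 0)	/* placeholder bit value */

/* Mock of the three inode fields the helper looks at. */
struct demo_xfs_inode {
	uint32_t di_flags;
	uint32_t di_flags2;
	uint32_t i_extsize;	/* extent size hint, filesystem blocks */
};

static bool demo_has_forcealign(const struct demo_xfs_inode *ip)
{
	if (!(ip->di_flags & DEMO_DIFLAG_EXTSIZE))
		return false;
	if (ip->i_extsize <= 1)
		return false;
	return (ip->di_flags2 & DEMO_DIFLAG2_FORCEALIGN) != 0;
}
```

So a file with the forcealign flag but a 1 block hint, or with no
extsize flag at all, is treated as not force-aligned, and callers no
longer need their own "i_extsize > 1" test.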

Thanks,
John


* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-04-30 23:22   ` Dave Chinner
@ 2024-05-01 10:03     ` John Garry
  2024-05-02  0:50       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-05-01 10:03 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang


>> +/* Validate the forcealign inode flag */
>> +xfs_failaddr_t
>> +xfs_inode_validate_forcealign(
>> +	struct xfs_mount	*mp,
>> +	uint16_t		mode,
> 
> 	umode_t			mode,

ok. BTW, other functions like xfs_inode_validate_extsize() use uint16_t

> 
>> +	uint16_t		flags,
>> +	uint32_t		extsize,
>> +	uint32_t		cowextsize)
> 
> extent sizes are xfs_extlen_t types.

ok

> 
>> +{
>> +	/* superblock rocompat feature flag */
>> +	if (!xfs_has_forcealign(mp))
>> +		return __this_address;
>> +
>> +	/* Only regular files and directories */
>> +	if (!S_ISDIR(mode) && !S_ISREG(mode))
>> +		return __this_address;
>> +
>> +	/* Doesn't apply to realtime files */
>> +	if (flags & XFS_DIFLAG_REALTIME)
>> +		return __this_address;
> 
> Why not? A rt device with an extsize of 1 fsb could make use of
> forced alignment just like the data device to allow larger atomic
> writes to be done. I mean, just because we haven't written the code
> to do this yet doesn't mean it is an illegal on-disk format state.

ok, so where is a better place to disallow forcealign for RT now (since 
we have not written the code to support it nor verified it)?

> 
>> +	/* Requires a non-zero power-of-2 extent size hint */
>> +	if (extsize == 0 || !is_power_of_2(extsize) ||
>> +	    (mp->m_sb.sb_agblocks % extsize))
>> +		return __this_address;
> 
> Please do these as individual checks with their own fail address.

ok

> That way we can tell which check failed from the console output.
> Also, the agblocks check is already split out below, so it's being
> checked twice...
> 
> Also, why does force-align require a power-of-2 extent size? Why
> does it require the extent size to be an exact divisor of the AG
> size? Aren't these atomic write alignment restrictions? i.e.
> shouldn't these only be enforced when the atomic writes inode flag
> is set?

With regard to the power-of-2 restriction, I think that the code changes 
are going to become a lot more complex if we don't enforce this for 
forcealign.

For example, consider xfs_file_dio_write(), where we check for an 
unaligned write based on forcealign extent mask. It's much simpler to 
rely on a power-of-2 size. And same for iomap extent zeroing.

So then it can be asked, for what reason do we want to support 
unorthodox, non-power-of-2 sizes? Who would want this?

As for AG size, again I think that it is required to be aligned to the 
forcealign extsize. As I remember, when converting from an FSB to a DB, 
if the AG itself is not aligned to the forcealign extsize, then the DB 
will not be aligned to the forcealign extsize. More below...
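
A simplified userspace sketch of the AG size concern. It assumes AGs
are laid out back to back, so that a disk block number is just
agno * agblocks + agbno, and the function name is made up:

```c
#include <stdint.h>

/* Disk block number (in FSB units) for a block at agbno within AG agno. */
static uint64_t demo_disk_block(uint32_t agno, uint32_t agbno,
				uint32_t agblocks)
{
	return (uint64_t)agno * agblocks + agbno;
}
```

Take an extsize of 16 blocks and an extent at agbno 16 in AG 1. With
agblocks = 1024 the disk block is 1040, which is 16 block aligned. With
agblocks = 1000 it is 1016, which is 8 blocks off, even though the
extent is aligned within its AG.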

> 
>> +	/* Requires agsize be a multiple of extsize */
>> +	if (mp->m_sb.sb_agblocks % extsize)
>> +		return __this_address;
>> +
>> +	/* Requires stripe unit+width (if set) be a multiple of extsize */
>> +	if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
>> +	    (mp->m_swidth && (mp->m_swidth % extsize)))
>> +		return __this_address;
> 
> Again, this is an atomic write constraint, isn't it?

So why do we want forcealign? Is it only to align extent FSBs? Or to 
align extents to DBs? I would have thought the latter. If so, it seems 
sensible to do this check also.

> 
>> +	/* Requires no cow extent size hint */
>> +	if (cowextsize != 0)
>> +		return __this_address;
> 
> What if it's a reflinked file?

Yeah, I think that we want to disallow that.

> 
> .....
> 
>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index d0e2cec6210d..d1126509ceb9 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
>>   		di_flags2 |= XFS_DIFLAG2_DAX;
>>   	if (xflags & FS_XFLAG_COWEXTSIZE)
>>   		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
>> +	if (xflags & FS_XFLAG_FORCEALIGN)
>> +		di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
>>   
>>   	return di_flags2;
>>   }
>> @@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
>>   	if (i_flags2 && !xfs_has_v3inodes(mp))
>>   		return -EINVAL;
>>   
>> +	/*
>> +	 * Force-align requires a nonzero extent size hint and a zero cow
>> +	 * extent size hint.  It doesn't apply to realtime files.
>> +	 */
>> +	if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
>> +		if (!xfs_has_forcealign(mp))
>> +			return -EINVAL;
>> +		if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
>> +			return -EINVAL;
>> +		if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
>> +					FS_XFLAG_EXTSZINHERIT)))
>> +			return -EINVAL;
>> +		if (fa->fsx_xflags & FS_XFLAG_REALTIME)
>> +			return -EINVAL;
>> +	}
> 
> What about if the file already has shared extents on it (i.e.
> reflinked or deduped?)

At the top of the function we have this check for RT:

	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
		/* Can't change realtime flag if any extents are allocated. */
		if (ip->i_df.if_nextents || ip->i_delayed_blks)
			return -EINVAL;
	}

Would expanding that check for forcealign also suffice? Indeed, later in 
this series I expanded this check to cover atomicwrites (when I really 
intended it for forcealign).

> 
> Also, why is this getting checked here instead of in
> xfs_ioctl_setattr_check_extsize()?
> 
> 
>> @@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
>>   	failaddr = xfs_inode_validate_extsize(ip->i_mount,
>>   			XFS_B_TO_FSB(mp, fa->fsx_extsize),
>>   			VFS_I(ip)->i_mode, new_diflags);
>> -	return failaddr != NULL ? -EINVAL : 0;
>> +	if (failaddr)
>> +		return -EINVAL;
>> +
>> +	if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
>> +		failaddr = xfs_inode_validate_forcealign(ip->i_mount,
>> +				VFS_I(ip)->i_mode, new_diflags,
>> +				XFS_B_TO_FSB(mp, fa->fsx_extsize),
>> +				XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
>> +		if (failaddr)
>> +			return -EINVAL;
>> +	}
> 
> Oh, it's because you're trying to use on-disk format validation
> routines for user API validation. That, IMO, is a bad idea because
> the on-disk format and kernel/user APIs should not be tied
> together as they have different constraints and error conditions.
> 
> That also explains why xfs_inode_validate_forcealign() doesn't just
> get passed the inode to validate - it's because you want to pass
> information from the user API to it. This results in sub-optimal
> code for both on-disk format validation and user API validation.
> 
> Can you please separate these and put all the force align user API
> validation checks in the one function?
> 

ok, fine. But it would be good to have clarification on the function of 
forcealign, above, i.e. does it always align extents to disk blocks?

Thanks,
John



* Re: [PATCH v3 14/21] iomap: Sub-extent zeroing
  2024-05-01  1:07   ` Dave Chinner
@ 2024-05-01 10:23     ` John Garry
  2024-05-30 10:40     ` John Garry
  1 sibling, 0 replies; 60+ messages in thread
From: John Garry @ 2024-05-01 10:23 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 01/05/2024 02:07, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:39PM +0000, John Garry wrote:
>> For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
>> sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
>> unwritten.
>>
>> This will be important for atomic writes support, in that atomically
>> writing over a partially written extent would mean that we would need to
>> do the unwritten extent conversion write separately, and the write could
>> no longer be atomic.
>>
>> It is the task of the FS to set iomap.extent_size per iter to indicate
>> sub-extent zeroing required.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
> 
> Shouldn't this be done before the XFS feature is enabled in the
> series?

Well, it is done before XFS iomap zeroing support patch. But I can move 
this patch to the very beginning of the series.

> 
>> ---
>>   fs/iomap/direct-io.c  | 17 +++++++++++------
>>   include/linux/iomap.h |  1 +
>>   2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index f3b43d223a46..a3ed7cfa95bc 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   {
>>   	const struct iomap *iomap = &iter->iomap;
>>   	struct inode *inode = iter->inode;
>> -	unsigned int fs_block_size = i_blocksize(inode), pad;
>> +	unsigned int zeroing_size, pad;
>>   	loff_t length = iomap_length(iter);
>>   	loff_t pos = iter->pos;
>>   	blk_opf_t bio_opf;
>> @@ -288,6 +288,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	size_t copied = 0;
>>   	size_t orig_count;
>>   
>> +	if (iomap->extent_size)
>> +		zeroing_size = iomap->extent_size;
>> +	else
>> +		zeroing_size = i_blocksize(inode);
> 
> Oh, the dissonance!
> 
> iomap->extent_size isn't an extent size at all.

Right, it's a poorly chosen name

> 
> The size of the extent the iomap returns is iomap->length. This new
> variable is the IO specific "block size" that should be assumed by
> the dio code to determine if padding should be done.
> 
> IOWs, I think we should add an "io_block_size" field to the iomap,
> and every filesystem that supports iomap should set it to the
> filesystem block size (i_blocksize(inode)). Then the changes to the
> iomap code end up just being:
> 
> 
> -	unsigned int fs_block_size = i_blocksize(inode), pad;
> +	unsigned int fs_block_size = iomap->io_block_size, pad;
> 
> And the patch that introduces that infrastructure change will also
> change all the filesystem implementations to unconditionally set
> iomap->io_block_size to i_blocksize().

ok

> 
> Then, in a separate patch, you can add XFS support for large IO
> block sizes when we have either a large rtextsize or extent size
> hints set.

I hadn't been considering large rtextsize for this. I suppose that it 
could be added.

> 
>> +
>>   	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>>   	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>>   		return -EINVAL;
>> @@ -354,8 +359,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>   
>>   	if (need_zeroout) {
>> -		/* zero out from the start of the block to the write offset */
>> -		pad = pos & (fs_block_size - 1);
>> +		/* zero out from the start of the region to the write offset */
>> +		pad = pos & (zeroing_size - 1);
>>   		if (pad)
>>   			iomap_dio_zero(iter, dio, pos - pad, pad);
>>   	}
>> @@ -428,10 +433,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   zero_tail:
>>   	if (need_zeroout ||
>>   	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>> -		/* zero out from the end of the write to the end of the block */
>> -		pad = pos & (fs_block_size - 1);
>> +		/* zero out from the end of the write to the end of the region */
>> +		pad = pos & (zeroing_size - 1);
>>   		if (pad)
>> -			iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
>> +			iomap_dio_zero(iter, dio, pos, zeroing_size - pad);
>>   	}
>>   out:
>>   	/* Undo iter limitation to current extent */
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 6fc1c858013d..42623b1cdc04 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -97,6 +97,7 @@ struct iomap {
>>   	u64			length;	/* length of mapping, bytes */
>>   	u16			type;	/* type of mapping */
>>   	u16			flags;	/* flags for mapping */
>> +	unsigned int		extent_size;
> 
> This needs a descriptive comment. At minimum, it should tell the
> reader what units are used for the variable.  If it is bytes, then
> it needs to be a u64, because XFS can have extent size hints well
> beyond 2^32 bytes in length.
> 

ok

Thanks,
John



* Re: [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask for forcealign
  2024-04-30 23:35   ` Dave Chinner
@ 2024-05-01 10:48     ` John Garry
  2024-05-01 23:45       ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-05-01 10:48 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 01/05/2024 00:35, Dave Chinner wrote:
>>   	return !((pos | len) & mask);
> I think this whole function needs to be rewritten so that
> non-power-of-2 extent sizes are supported on both devices properly.
> 
> 	xfs_extlen_t	fsbs = 1;
> 	u64		bytes;
> 	u32		mod;
> 
> 	if (xfs_inode_has_forcealign(ip))
> 		fsbs = ip->i_extsize;
> 	else if (XFS_IS_REALTIME_INODE(ip))
> 		fsbs = mp->m_sb.sb_rextsize;
> 
> 	bytes = XFS_FSB_TO_B(mp, fsbs);
> 	if (is_power_of_2(fsbs))
> 		return !((pos | len) & (bytes - 1));
> 
> 	div_u64_rem(pos, bytes, &mod);
> 	if (mod)
> 		return false;
> 	div_u64_rem(len, bytes, &mod);
> 	return mod == 0;

ok, but I still have a doubt about non-power-of-2 forcealign extsize 
support.
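
For comparison, a userspace sketch of the two paths in that check: a
mask for power-of-2 alignments and a modulo fallback otherwise. The
function name is hypothetical and the alignment is passed in bytes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if both pos and len are multiples of align_bytes. */
static bool demo_falloc_is_aligned(uint64_t pos, uint64_t len,
				   uint64_t align_bytes)
{
	assert(align_bytes > 0);

	/* Fast path: power-of-2 alignment can use a mask. */
	if ((align_bytes & (align_bytes - 1)) == 0)
		return !((pos | len) & (align_bytes - 1));

	/* Slow path: non-power-of-2 alignment needs a division. */
	return pos % align_bytes == 0 && len % align_bytes == 0;
}
```

With a 3 block (12288 byte) alignment on 4kB blocks, a range at offset
24576 of length 12288 is aligned, but the same length at offset 16384
is not. The division is the extra cost that the power-of-2 restriction
avoids.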

Thanks,
John


* Re: [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign
  2024-05-01  0:10   ` Dave Chinner
@ 2024-05-01 10:54     ` John Garry
  2024-06-06  9:50     ` John Garry
  1 sibling, 0 replies; 60+ messages in thread
From: John Garry @ 2024-05-01 10:54 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 01/05/2024 01:10, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:36PM +0000, John Garry wrote:
>> For when forcealign is enabled, blocks in an inode need to be unmapped
>> according to extent alignment, like what is already done for rtvol.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_bmap.c | 39 +++++++++++++++++++++++++++++++++------
>>   fs/xfs/xfs_inode.h       |  5 +++++
>>   2 files changed, 38 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 4f39a43d78a7..4a78ab193753 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -5339,6 +5339,15 @@ xfs_bmap_del_extent_real(
>>   	return 0;
>>   }
>>   
>> +/* Return the offset of an block number within an extent for forcealign. */
>> +static xfs_extlen_t
>> +xfs_forcealign_extent_offset(
>> +	struct xfs_inode	*ip,
>> +	xfs_fsblock_t		bno)
>> +{
>> +	return bno & (ip->i_extsize - 1);
>> +}
>> +
>>   /*
>>    * Unmap (remove) blocks from a file.
>>    * If nexts is nonzero then the number of extents to remove is limited to
>> @@ -5361,6 +5370,7 @@ __xfs_bunmapi(
>>   	struct xfs_bmbt_irec	got;		/* current extent record */
>>   	struct xfs_ifork	*ifp;		/* inode fork pointer */
>>   	int			isrt;		/* freeing in rt area */
>> +	int			isforcealign;	/* freeing for file inode with forcealign */
>>   	int			logflags;	/* transaction logging flags */
>>   	xfs_extlen_t		mod;		/* rt extent offset */
>>   	struct xfs_mount	*mp = ip->i_mount;
>> @@ -5397,7 +5407,10 @@ __xfs_bunmapi(
>>   		return 0;
>>   	}
>>   	XFS_STATS_INC(mp, xs_blk_unmap);
>> -	isrt = xfs_ifork_is_realtime(ip, whichfork);
>> +	isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);
> 
> Why did you change this check? What's wrong with
> xfs_ifork_is_realtime(), and if there is something wrong, why
> shouldn't xfs_ifork_is_realtime() get fixed?

oops, I should not have made that change. I must have changed it when 
debugging and not reverted it.

> 
>> +	isforcealign = (whichfork == XFS_DATA_FORK) &&
>> +			xfs_inode_has_forcealign(ip) &&
>> +			xfs_inode_has_extsize(ip) && ip->i_extsize > 1;
> 
> This is one of the reasons why I said the extent size hint checks
> should be done inside the xfs_inode_has_forcealign() helper....

Right. In this particular case, I found that directories may be 
considered as well if we don't check for xfs_inode_has_extsize() (which 
we don't want).

> 
>>   	end = start + len;
>>   
>>   	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
>> @@ -5459,11 +5472,15 @@ __xfs_bunmapi(
>>   		if (del.br_startoff + del.br_blockcount > end + 1)
>>   			del.br_blockcount = end + 1 - del.br_startoff;
>>   
>> -		if (!isrt || (flags & XFS_BMAPI_REMAP))
>> +		if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
>>   			goto delete;
>>   
>> -		mod = xfs_rtb_to_rtxoff(mp,
>> -				del.br_startblock + del.br_blockcount);
>> +		if (isrt)
>> +			mod = xfs_rtb_to_rtxoff(mp,
>> +					del.br_startblock + del.br_blockcount);
>> +		else if (isforcealign)
>> +			mod = xfs_forcealign_extent_offset(ip,
>> +					del.br_startblock + del.br_blockcount);
> 
> There's got to be a cleaner way to do this.
> 
> We already know that either isrt or isforcealign must be set here,
> so there's no need for the "else if" construct.

right

> 
> Also, forcealign should take precedence over realtime, so that
> forcealign will work on realtime devices as well. I'd change this
> code to call a wrapper like:
> 
> 		mod = xfs_bunmapi_align(ip, del.br_startblock + del.br_blockcount);
> 
> static xfs_extlen_t
> xfs_bunmapi_align(
> 	struct xfs_inode	*ip,
> 	xfs_fsblock_t		bno)
> {
> 	if (!XFS_IS_REALTIME_INODE(ip)) {
> 		ASSERT(xfs_inode_has_forcealign(ip));
> 		if (is_power_of_2(ip->i_extsize))
> 			return bno & (ip->i_extsize - 1);
> 		return do_div(bno, ip->i_extsize);
> 	}
> 	return xfs_rtb_to_rtxoff(ip->i_mount, bno);
> }

ok, that's neater
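
For reference, a minimal userspace sketch of the remainder calculation inside the suggested wrapper. It assumes plain 64-bit block numbers; extent_offset() and is_pow2() are hypothetical stand-ins for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when v is a non-zero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Offset of block number bno within its alignment unit of extsize blocks.
 * Hypothetical helper: power-of-2 sizes use a mask, all others use modulo.
 * extsize must be non-zero.
 */
static uint64_t extent_offset(uint64_t bno, uint32_t extsize)
{
	assert(extsize != 0);
	if (is_pow2(extsize))
		return bno & (extsize - 1);
	return bno % extsize;
}
```

A result of 0 means the end of the range being unmapped already sits on an alignment boundary, so no trimming is needed.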

Thanks,
John

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 17/21] iomap: Atomic write support
  2024-05-01  1:47   ` Dave Chinner
@ 2024-05-01 11:08     ` John Garry
  2024-05-02  1:43       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-05-01 11:08 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 01/05/2024 02:47, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:42PM +0000, John Garry wrote:
>> Support atomic writes by producing a single BIO with REQ_ATOMIC flag set.
>>
>> We rely on the FS to guarantee extent alignment, such that an atomic write
>> should never straddle two or more extents. The FS should also check for
>> validity of an atomic write length/alignment.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/iomap/direct-io.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index a3ed7cfa95bc..d7bdeb675068 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -275,6 +275,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>>   static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		struct iomap_dio *dio)
>>   {
>> +	bool is_atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
>>   	const struct iomap *iomap = &iter->iomap;
>>   	struct inode *inode = iter->inode;
>>   	unsigned int zeroing_size, pad;
>> @@ -387,6 +388,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>>   		bio->bi_write_hint = inode->i_write_hint;
>>   		bio->bi_ioprio = dio->iocb->ki_ioprio;
>> +		if (is_atomic)
>> +			bio->bi_opf |= REQ_ATOMIC;
> 
> REQ_ATOMIC is only valid for write IO, isn't it?

yes, it is. We reject RWF_ATOMIC for a READ.

> 
> This should be added in iomap_dio_bio_opflags() after it is
> determined we are doing a write operation.  Regardless, it should be
> added in iomap_dio_bio_opflags(), not here. That also allows us to
> get rid of the is_atomic variable.

ok
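
A rough userspace sketch of that suggestion, assuming made-up flag values (the SK_* names are hypothetical and do not match the kernel's definitions). The point is only that the atomic flag is added on the write branch of a single opflags helper, so a read can never carry it:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration only; not the kernel's. */
#define SK_OP_READ	0x0u
#define SK_OP_WRITE	0x1u
#define SK_REQ_ATOMIC	0x100u

/*
 * Build the operation flags for one bio. The atomic flag is only ever
 * added once we know this is a write.
 */
static uint32_t sk_bio_opflags(bool is_write, bool want_atomic)
{
	uint32_t opflags;

	if (!is_write)
		return SK_OP_READ;

	opflags = SK_OP_WRITE;
	if (want_atomic)
		opflags |= SK_REQ_ATOMIC;
	return opflags;
}
```

With the flag computed here, the submission loop no longer needs its own is_atomic variable.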

> 
>> +
>>   		bio->bi_private = dio;
>>   		bio->bi_end_io = iomap_dio_bio_end_io;
>>   
>> @@ -403,6 +407,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		}
>>   
>>   		n = bio->bi_iter.bi_size;
>> +		if (is_atomic && n != orig_count) {
>> +			/* This bio should have covered the complete length */
>> +			ret = -EINVAL;
>> +			bio_put(bio);
>> +			goto out;
>> +		}
> 
> What happens now if we've done zeroing IO before this? I suspect we
> might expose stale data if the partial block zeroing converts the
> unwritten extent in full...

We use iomap_dio.ref to ensure that __iomap_dio_rw() does not return 
until any zeroing and actual sub-io block write completes. See 
iomap_dio_zero() -> iomap_dio_submit_bio() -> atomic_inc(&dio->ref) 
callchain. I meant to add such info to the commit message, as you 
questioned this previously.

> 
>>   		if (dio->flags & IOMAP_DIO_WRITE) {
>>   			task_io_account_write(n);
>>   		} else {
> 
> Ignoring the error handling issues, this code might be better as:
> 
> 		if (dio->flags & IOMAP_DIO_WRITE) {
> 			if ((opflags & REQ_ATOMIC) && n != orig_count) {
> 				/* atomic writes are all or nothing */
> 				ret = -EIO;
> 				bio_put(bio);
> 				goto out;
> 			}
> 		}
> 
> so that we are not putting atomic write error checks in the read IO
> submission path.
> 

Maybe, I'll look at a rework with the suggested change to use 
iomap_dio_bio_opflags() - I actually thought that I introduced a change 
to use iomap_dio_bio_opflags() previously...

BTW, we need to return -EINVAL, as this is what userspace expects for 
such an error.
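
For illustration, the all-or-nothing check with the -EINVAL return value could look like the sketch below. SK_REQ_ATOMIC and sk_check_atomic_bio() are hypothetical names, and the sizes are plain byte counts rather than real bio fields:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_REQ_ATOMIC	0x100u	/* hypothetical flag value */

/*
 * An atomic write must be covered by exactly one bio. If the bio that was
 * built is shorter than the original request, reject the write.
 */
static int sk_check_atomic_bio(uint32_t opflags, size_t bio_bytes,
			       size_t orig_count)
{
	if ((opflags & SK_REQ_ATOMIC) && bio_bytes != orig_count)
		return -EINVAL;
	return 0;
}
```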

Thanks,
John


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign
  2024-05-01  0:53   ` Dave Chinner
@ 2024-05-01 11:24     ` John Garry
  2024-05-01 23:53     ` Darrick J. Wong
  1 sibling, 0 replies; 60+ messages in thread
From: John Garry @ 2024-05-01 11:24 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 01/05/2024 01:53, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:37PM +0000, John Garry wrote:
>> Like we already do for rtvol, only free full extents for forcealign in
>> xfs_free_file_space().
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_bmap_util.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index f26d1570b9bd..1dd45dfb2811 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -847,8 +847,11 @@ xfs_free_file_space(
>>   	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
>>   	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
>>   
>> -	/* We can only free complete realtime extents. */
>> -	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
>> +	/* Free only complete extents. */
>> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
>> +		startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
>> +		endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
>> +	} else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
>>   		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
>>   		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
>>   	}
> 
> When you look at xfs_rtb_roundup_rtx() you'll find it's just a one
> line wrapper around roundup_64().
> 
> So lets get rid of the obfuscation that the one line RT wrapper
> introduces, and it turns into this:
> 
> 	rounding = 1;
> 	if (xfs_inode_has_forcealign(ip))
> 		rounding = ip->i_extsize;
> 	else if (XFS_IS_REALTIME_INODE(ip))
> 		rounding = mp->m_sb.sb_rextsize;
> 
> 	if (rounding > 1) {
> 		startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> 		endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> 	}

ok, and the same idea for xfs_can_free_eofblocks() with 
xfs_rtb_roundup_rtx(), right?
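
The inward rounding in the snippet above can be sketched in userspace as follows. This assumes offsets counted in filesystem blocks and an exclusive end offset; sk_free_start() and sk_free_end() are hypothetical names, and the division-based rounding works for any non-zero alignment, power-of-2 or not:

```c
#include <assert.h>
#include <stdint.h>

/* First block to free: start rounded up to a multiple of align. */
static uint64_t sk_free_start(uint64_t start_fsb, uint64_t align)
{
	assert(align != 0);
	return ((start_fsb + align - 1) / align) * align;
}

/* End (exclusive) of the range to free: end rounded down to align. */
static uint64_t sk_free_end(uint64_t end_fsb, uint64_t align)
{
	assert(align != 0);
	return (end_fsb / align) * align;
}
```

When the rounded start is not below the rounded end, the requested range contains no complete extent and nothing would be freed.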

> 
> What this points out is that the prep steps for fallocate operations
> also need to handle both forced alignment and rtextsize rounding,
> and it does neither right now.  xfs_flush_unmap_range() is the main
> offender here, but xfs_prepare_shift() also needs fixing.

When you say fix, is this something to spin off separately for RT? This 
series is big enough already...

> 
> Hence:
> 
> static inline xfs_extlen_t
> xfs_extent_alignment(
> 	struct xfs_inode	*ip)
> {
> 	if (xfs_inode_has_forcealign(ip))
> 		return ip->i_extsize;
> 	if (XFS_IS_REALTIME_INODE(ip))
> 		return ip->i_mount->m_sb.sb_rextsize;
> 	return 1;
> }
> 
> 
> In xfs_flush_unmap_range():
> 
> 	/*
> 	 * Make sure we extend the flush out to extent alignment
> 	 * boundaries so any extent range overlapping the start/end
> 	 * of the modification we are about to do is clean and idle.
> 	 */
> 	rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> 	rounding = max(rounding, PAGE_SIZE);
> 	...
> 
> in xfs_free_file_space()
> 
> 	/*
> 	 * Round the range we are going to free inwards to extent
> 	 * alignment boundaries so we don't free blocks outside the
> 	 * range requested.
> 	 */
> 	rounding = xfs_extent_alignment(ip);
> 	if (rounding > 1) {
> 		startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> 		endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> 	}
> 
> and in xfs_prepare_shift()
> 
> 	/*
> 	 * Shift operations must stabilize the start block offset boundary along
> 	 * with the full range of the operation. If we don't, a COW writeback
> 	 * completion could race with an insert, front merge with the start
> 	 * extent (after split) during the shift and corrupt the file. Start
> 	 * with the aligned block just prior to the start to stabilize the boundary.
> 	 */
> 	rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> 	offset = round_down(offset, rounding);
> 	if (offset)
> 		offset -= rounding;
> 
> Also, I think that the changes I suggested earlier to
> xfs_is_falloc_aligned() could use this xfs_extent_alignment()
> helper...
> 
> Overall this makes the code a whole lot easier to read and it also
> allows forced alignment to work correctly on RT devices...
> 

ok, fine
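
As a rough illustration of the outward rounding proposed for xfs_flush_unmap_range(), here is a userspace sketch. It assumes a 4096-byte page size and byte offsets with an exclusive end; all sk_* names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE	4096u	/* assumed page size */

/* Flush granule in bytes: the extent alignment, but never below a page. */
static uint64_t sk_flush_granule(uint64_t align_fsb, uint32_t blocksize)
{
	uint64_t bytes = align_fsb * blocksize;

	return bytes > SK_PAGE_SIZE ? bytes : SK_PAGE_SIZE;
}

/* Start of the flush: offset rounded down to the granule. */
static uint64_t sk_flush_start(uint64_t offset, uint64_t granule)
{
	assert(granule != 0);
	return (offset / granule) * granule;
}

/* End (exclusive) of the flush: offset + len rounded up to the granule. */
static uint64_t sk_flush_end(uint64_t offset, uint64_t len, uint64_t granule)
{
	assert(granule != 0);
	return ((offset + len + granule - 1) / granule) * granule;
}
```

Unlike the free path, which rounds inwards, the flush rounds outwards so that every extent touched by the operation is clean before it starts.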

Thanks,
John


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 15/21] fs: xfs: iomap: Sub-extent zeroing
  2024-05-01  1:32   ` Dave Chinner
@ 2024-05-01 11:36     ` John Garry
  2024-05-02  1:26       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-05-01 11:36 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 01/05/2024 02:32, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:40PM +0000, John Garry wrote:
>> Set iomap->extent_size when sub-extent zeroing is required.
>>
>> We treat a sub-extent write same as an unaligned write, so we can leverage
>> the existing sub-FSblock unaligned write support, i.e. try a shared lock
>> with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
>> lock.
>>
>> In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.
> 
> If forcealign is set, should we just reject unaligned DIOs?

Why would we? That's very restrictive. Indeed, we got to the point of 
adding the sub-extent zeroing just for supporting that.

> 
> .....
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_file.c  | 35 ++++++++++++++++++++++-------------
>>   fs/xfs/xfs_iomap.c | 13 +++++++++++--
>>   2 files changed, 33 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index e81e01e6b22b..ee4f94cf6f4e 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -620,18 +620,19 @@ xfs_file_dio_write_aligned(
>>    * Handle block unaligned direct I/O writes
> 
>   * Handle unaligned direct IO writes.
> 
>>    *
>>    * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
>> - * them to be done in parallel with reads and other direct I/O writes.  However,
>> - * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
>> - * to do sub-block zeroing and that requires serialisation against other direct
>> - * I/O to the same block.  In this case we need to serialise the submission of
>> - * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
>> - * In the case where sub-block zeroing is not required, we can do concurrent
>> - * sub-block dios to the same block successfully.
>> + * them to be done in parallel with reads and other direct I/O writes.
>> + * However if the I/O is not aligned to filesystem blocks/extent, the direct
>> + * I/O layer may need to do sub-block/extent zeroing and that requires
>> + * serialisation against other direct I/O to the same block/extent.  In this
>> + * case we need to serialise the submission of the unaligned I/O so that we
>> + * don't get racing block/extent zeroing in the dio layer.
>> + * In the case where sub-block/extent zeroing is not required, we can do
>> + * concurrent sub-block/extent dios to the same block/extent successfully.
>>    *
>>    * Optimistically submit the I/O using the shared lock first, but use the
>>    * IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
>> - * if block allocation or partial block zeroing would be required.  In that case
>> - * we try again with the exclusive lock.
>> + * if block/extent allocation or partial block/extent zeroing would be
>> + * required.  In that case we try again with the exclusive lock.
> 
> Rather than changing every "block" to "block/extent", leave the bulk
> of the comment unchanged and add another paragraph to it that says
> something like:
> 
>   * If forced extent alignment is turned on, then serialisation
>   * constraints are extended from filesystem block alignment
>   * to extent alignment boundaries. In this case, we treat any
>   * non-extent-aligned DIO the same as a sub-block DIO.

ok, fine

> 
>>    */
>>   static noinline ssize_t
>>   xfs_file_dio_write_unaligned(
>> @@ -646,9 +647,9 @@ xfs_file_dio_write_unaligned(
>>   	ssize_t			ret;
>>   
>>   	/*
>> -	 * Extending writes need exclusivity because of the sub-block zeroing
>> -	 * that the DIO code always does for partial tail blocks beyond EOF, so
>> -	 * don't even bother trying the fast path in this case.
>> +	 * Extending writes need exclusivity because of the sub-block/extent
>> +	 * zeroing that the DIO code always does for partial tail blocks
>> +	 * beyond EOF, so don't even bother trying the fast path in this case.
>>   	 */
>>   	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>>   		if (iocb->ki_flags & IOCB_NOWAIT)
>> @@ -714,11 +715,19 @@ xfs_file_dio_write(
>>   	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
>>   	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
>>   	size_t			count = iov_iter_count(from);
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +	unsigned int		blockmask;
>>   
>>   	/* direct I/O must be aligned to device logical sector size */
>>   	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>>   		return -EINVAL;
>> -	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
>> +
>> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
>> +		blockmask = XFS_FSB_TO_B(mp, ip->i_extsize) - 1;
>> +	else
>> +		blockmask = mp->m_blockmask;
> 
> 	alignmask = XFS_FSB_TO_B(mp, xfs_inode_alignment(ip)) - 1;

Do you mean xfs_extent_alignment() instead of xfs_inode_alignment()?

> 
> Note that this would consider sub rt_extsize IO as unaligned, which
> may be undesirable. In that case, we should define a second helper
> such as xfs_inode_io_alignment() that doesn't take into account RT
> extent sizes because we can still do filesystem block sized
> unwritten extent conversion on those devices. The same IO-specific
> wrapper would be used for the other cases in this patch, too.

ok, fine
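
A small sketch of the two-helper idea, assuming the inode state is passed in as plain values. The sk_* names are hypothetical: one helper returns the allocation alignment (forcealign first, then the RT extent size), the other returns the IO alignment and deliberately ignores the RT extent size:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Allocation alignment in fs blocks: forcealign wins over realtime. */
static uint32_t sk_extent_alignment(bool forcealign, bool realtime,
				    uint32_t extsize, uint32_t rextsize)
{
	if (forcealign)
		return extsize;
	if (realtime)
		return rextsize;
	return 1;
}

/* IO alignment in fs blocks: RT extent size is not taken into account. */
static uint32_t sk_io_alignment(bool forcealign, uint32_t extsize)
{
	return forcealign ? extsize : 1;
}

/* True when a DIO at pos/count must take the unaligned write path. */
static bool sk_dio_is_unaligned(uint64_t pos, uint64_t count,
				uint32_t io_align_fsb, uint32_t blocksize)
{
	uint64_t bytes = (uint64_t)io_align_fsb * blocksize;

	assert(bytes != 0);
	return (pos % bytes) != 0 || (count % bytes) != 0;
}
```

With this split, a block-aligned write to an RT file with a larger rextsize still takes the aligned path, while a sub-extent write to a forcealign file does not.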

> 
>> +
>> +	if ((iocb->ki_pos | count) & blockmask)
>>   		return xfs_file_dio_write_unaligned(ip, iocb, from);
>>   	return xfs_file_dio_write_aligned(ip, iocb, from);
>>   }
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 4087af7f3c9f..1a3692bbc84d 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -138,6 +138,8 @@ xfs_bmbt_to_iomap(
>>   
>>   	iomap->validity_cookie = sequence_cookie;
>>   	iomap->folio_ops = &xfs_iomap_folio_ops;
>> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
>> +		iomap->extent_size = XFS_FSB_TO_B(mp, ip->i_extsize);
> 
> 	iomap->io_block_size = XFS_FSB_TO_B(mp, xfs_inode_alignment(ip));
> 
>>   	return 0;
>>   }
>>   
>> @@ -570,8 +572,15 @@ xfs_iomap_write_unwritten(
>>   
>>   	trace_xfs_unwritten_convert(ip, offset, count);
>>   
>> -	offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> -	count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
>> +		xfs_extlen_t extsize_bytes = mp->m_sb.sb_blocksize * ip->i_extsize;
>> +
>> +		offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
>> +		count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
>> +	} else {
>> +		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> +		count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> +	}
> 
> More places we can use a xfs_inode_alignment() helper.
> 
> 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> 	count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> 	rounding = xfs_inode_alignment(ip);
> 	if (rounding > 1) {
> 		offset_fsb = rounddown_64(offset_fsb, rounding);
> 		count_fsb = roundup_64(count_fsb, rounding);
> 	}

ok, but again I assume you mean xfs_extent_alignment()

Thanks,
John


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask for forcealign
  2024-05-01 10:48     ` John Garry
@ 2024-05-01 23:45       ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2024-05-01 23:45 UTC (permalink / raw
  To: John Garry
  Cc: Dave Chinner, hch, viro, brauner, jack, chandan.babu, willy,
	axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, May 01, 2024 at 11:48:59AM +0100, John Garry wrote:
> On 01/05/2024 00:35, Dave Chinner wrote:
> > >   	return !((pos | len) & mask);
> > I think this whole function needs to be rewritten so that
> > non-power-of-2 extent sizes are supported on both devices properly.
> > 
> > 	xfs_extlen_t	fsbs = 1;
> > 	u64		bytes;
> > 	u32		mod;
> > 
> > 	if (xfs_inode_has_forcealign(ip))
> > 		fsbs = ip->i_extsize;
> > 	else if (XFS_IS_REALTIME_INODE(ip))
> > 		fsbs = mp->m_sb.sb_rextsize;
> > 
> > 	bytes = XFS_FSB_TO_B(mp, fsbs);
> > 	if (is_power_of_2(fsbs))
> > 		return !((pos | len) & (bytes - 1));
> > 
> > 	div_u64_rem(pos, bytes, &mod);
> > 	if (mod)
> > 		return false;
> > 	div_u64_rem(len, bytes, &mod);
> > 	return mod == 0;
> 
> ok, but I still have a doubt about non-power-of-2 forcealign extsize
> support.

The trouble is, non-power-of-2 extent size hints are supported for
regular and realtime files for funny cases like trying to align
allocations to RAID stripes.  I think it would be hard to drop support
for this, given that means that old filesystems can't ever get upgraded
to forcealign.
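
To make the non-power-of-2 case concrete, here is a userspace sketch of the alignment check quoted above. It assumes the block size itself is a power of 2; sk_falloc_aligned() and is_pow2() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when v is a non-zero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Check that pos and len are both multiples of fsbs filesystem blocks.
 * Power-of-2 alignments use a mask, anything else falls back to modulo.
 */
static bool sk_falloc_aligned(uint64_t pos, uint64_t len,
			      uint32_t fsbs, uint32_t blocksize)
{
	uint64_t bytes = (uint64_t)fsbs * blocksize;

	assert(bytes != 0);
	if (is_pow2(fsbs))
		return !((pos | len) & (bytes - 1));
	return pos % bytes == 0 && len % bytes == 0;
}
```

With a 3-block hint and 4096-byte blocks the unit is 12288 bytes. A request at pos 24576 with len 12288 is aligned, yet a blind mask test against 12287 would reject it, which is why the modulo fallback is needed.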

--D

> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign
  2024-05-01  0:53   ` Dave Chinner
  2024-05-01 11:24     ` John Garry
@ 2024-05-01 23:53     ` Darrick J. Wong
  2024-05-02  3:12       ` Dave Chinner
  1 sibling, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-05-01 23:53 UTC (permalink / raw
  To: Dave Chinner
  Cc: John Garry, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, May 01, 2024 at 10:53:28AM +1000, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:37PM +0000, John Garry wrote:
> > Like we already do for rtvol, only free full extents for forcealign in
> > xfs_free_file_space().
> > 
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index f26d1570b9bd..1dd45dfb2811 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -847,8 +847,11 @@ xfs_free_file_space(
> >  	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
> >  	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
> >  
> > -	/* We can only free complete realtime extents. */
> > -	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
> > +	/* Free only complete extents. */
> > +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
> > +		startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
> > +		endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
> > +	} else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
> >  		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
> >  		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
> >  	}
> 
> When you look at xfs_rtb_roundup_rtx() you'll find it's just a one
> line wrapper around roundup_64().

I added this a couple of cycles ago to get ready for realtime
modernization.  That will create a bunch *more* churn in my tree just to
convert everything *back*.

Where the hell were you when that was being reviewed?!!!

NO!  This is pointless busywork!

--D

> So lets get rid of the obfuscation that the one line RT wrapper
> introduces, and it turns into this:
> 
> 	rounding = 1;
> 	if (xfs_inode_has_forcealign(ip))
> 		rounding = ip->i_extsize;
> 	else if (XFS_IS_REALTIME_INODE(ip))
> 		rounding = mp->m_sb.sb_rextsize;
> 
> 	if (rounding > 1) {
> 		startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> 		endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> 	}
> 
> What this points out is that the prep steps for fallocate operations
> also need to handle both forced alignment and rtextsize rounding,
> and it does neither right now.  xfs_flush_unmap_range() is the main
> offender here, but xfs_prepare_shift() also needs fixing.
> 
> Hence:
> 
> static inline xfs_extlen_t
> xfs_extent_alignment(
> 	struct xfs_inode	*ip)
> {
> 	if (xfs_inode_has_forcealign(ip))
> 		return ip->i_extsize;
> 	if (XFS_IS_REALTIME_INODE(ip))
> 		return ip->i_mount->m_sb.sb_rextsize;
> 	return 1;
> }
> 
> 
> In xfs_flush_unmap_range():
> 
> 	/*
> 	 * Make sure we extend the flush out to extent alignment
> 	 * boundaries so any extent range overlapping the start/end
> 	 * of the modification we are about to do is clean and idle.
> 	 */
> 	rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> 	rounding = max(rounding, PAGE_SIZE);
> 	...
> 
> in xfs_free_file_space()
> 
> 	/*
> 	 * Round the range we are going to free inwards to extent
> 	 * alignment boundaries so we don't free blocks outside the
> 	 * range requested.
> 	 */
> 	rounding = xfs_extent_alignment(ip);
> 	if (rounding > 1) {
> 		startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> 		endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> 	}
> 
> and in xfs_prepare_shift()
> 
> 	/*
> 	 * Shift operations must stabilize the start block offset boundary along
> 	 * with the full range of the operation. If we don't, a COW writeback
> 	 * completion could race with an insert, front merge with the start
> 	 * extent (after split) during the shift and corrupt the file. Start
> 	 * with the aligned block just prior to the start to stabilize the boundary.
> 	 */
> 	rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> 	offset = round_down(offset, rounding);
> 	if (offset)
> 		offset -= rounding;
> 
> Also, I think that the changes I suggested earlier to 
> xfs_is_falloc_aligned() could use this xfs_extent_alignment()
> helper...
> 
> Overall this makes the code a whole lot easier to read and it also
> allows forced alignment to work correctly on RT devices...
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-05-01 10:03     ` John Garry
@ 2024-05-02  0:50       ` Dave Chinner
  2024-05-02  7:56         ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-05-02  0:50 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, May 01, 2024 at 11:03:06AM +0100, John Garry wrote:
> 
> > > +/* Validate the forcealign inode flag */
> > > +xfs_failaddr_t
> > > +xfs_inode_validate_forcealign(
> > > +	struct xfs_mount	*mp,
> > > +	uint16_t		mode,
> > 
> > 	umode_t			mode,
> 
> ok. BTW, other functions like xfs_inode_validate_extsize() use uint16_t
> 
> > 
> > > +	uint16_t		flags,
> > > +	uint32_t		extsize,
> > > +	uint32_t		cowextsize)
> > 
> > extent sizes are xfs_extlen_t types.
> 
> ok
> 
> > 
> > > +{
> > > +	/* superblock rocompat feature flag */
> > > +	if (!xfs_has_forcealign(mp))
> > > +		return __this_address;
> > > +
> > > +	/* Only regular files and directories */
> > > +	if (!S_ISDIR(mode) && !S_ISREG(mode))
> > > +		return __this_address;
> > > +
> > > +	/* Doesn't apply to realtime files */
> > > +	if (flags & XFS_DIFLAG_REALTIME)
> > > +		return __this_address;
> > 
> > Why not? A rt device with an extsize of 1 fsb could make use of
> > forced alignment just like the data device to allow larger atomic
> > writes to be done. I mean, just because we haven't written the code
> > to do this yet doesn't mean it is an illegal on-disk format state.
> 
> ok, so where is a better place to disallow forcealign for RT now (since we
> have not written the code to support it nor verified it)?

Just don't allow it to be set in the setattr ioctl if the inode is
RT. And don't let an inode be marked RT if forcealign is already
set.
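
A minimal sketch of such a user API check, assuming the ioctl receives the complete new flag set (so one test covers both directions). The SK_XFLAG_* values and sk_setattr_check() are hypothetical and do not match the real FS_XFLAG_* definitions:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration only. */
#define SK_XFLAG_REALTIME	0x1u
#define SK_XFLAG_FORCEALIGN	0x2u

/*
 * Reject any requested flag set that combines forcealign with realtime.
 * This is a user API policy check, separate from on-disk validation.
 */
static int sk_setattr_check(uint32_t requested_xflags)
{
	if ((requested_xflags & SK_XFLAG_FORCEALIGN) &&
	    (requested_xflags & SK_XFLAG_REALTIME))
		return -EINVAL;
	return 0;
}
```

Keeping this in the ioctl path leaves the on-disk verifier free to accept the combination later, once the RT code supports it.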

> 
> > 
> > > +	/* Requires a non-zero power-of-2 extent size hint */
> > > +	if (extsize == 0 || !is_power_of_2(extsize) ||
> > > +	    (mp->m_sb.sb_agblocks % extsize))
> > > +		return __this_address;
> > 
> > > Please do these as individual checks with their own fail address.
> 
> ok
> 
> > That way we can tell which check failed from the console output.
> > Also, the agblocks check is already split out below, so it's being
> > checked twice...
> > 
> > Also, why does force-align require a power-of-2 extent size? Why
> > does it require the extent size to be an exact divisor of the AG
> > size? Aren't these atomic write alignment restrictions? i.e.
> > shouldn't these only be enforced when the atomic writes inode flag
> > is set?
> 
> With regard to the power-of-2 restriction, I think that the code changes are
> going to become a lot more complex if we don't enforce this for forcealign.
> 
> For example, consider xfs_file_dio_write(), where we check for an unaligned
> write based on forcealign extent mask. It's much simpler to rely on a
> power-of-2 size. And same for iomap extent zeroing.

But it's not more complex - we already do this non-power-of-2
alignment stuff for all the realtime code, so it's just a matter
of not blindly using bit masking in alignment checks.

> So then it can be asked, for what reason do we want to support unorthodox,
> non-power-of-2 sizes? Who would want this?

I'm constantly surprised by the way people use stuff like this.
Filesystem and storage alignment constraints are not arbitrarily
limited to power-of-2 sizes.

For example, code implementation is simple in RAID setups when you
use power-of-2 chunk sizes and stripe widths. But not all storage
hardware fits power-of-2 configs like 4+1, 4+2, 8+1, 8+2, etc. This
is pretty common - 2.5" 2U drive trays have 24 drive bays. If you
want to give up 33% of the storage capacity just to use power-of-2
stripe widths then you would use 4x4+2 RAID6 luns. However, most
people don't want to waste that much money on redundancy. They are
much more likely to use 2x10+2 RAID6 luns or 1x21+2 with a hot spare
to maximise the data storage capacity.

If someone wants to force-align allocation to stripe widths on such
a RAID array config rather than trying to rely on the best effort
swalloc mount option, then they need non-power-of-2
alignments to be supported.

It's pretty much a no-brainer - the alignment code already handles
non-power-of-2 alignments, and it's not very much additional code to
ensure we can handle any alignment the user specified.
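
The arithmetic behind the 24-bay example can be worked through with a short sketch. It assumes the stripe width is simply the number of data disks per LUN, counted in chunks; the sk_* helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when v is a non-zero power of two. */
static bool is_pow2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Total data disks across all LUNs in the tray. */
static uint32_t sk_data_disks(uint32_t luns, uint32_t data_per_lun)
{
	return luns * data_per_lun;
}

/* Whole-number percentage of bays not holding data (parity or spare). */
static uint32_t sk_non_data_percent(uint32_t total_bays, uint32_t data_disks)
{
	assert(total_bays != 0 && data_disks <= total_bays);
	return (total_bays - data_disks) * 100 / total_bays;
}
```

For 24 bays, 4x4+2 leaves 16 data disks (33% overhead) with a power-of-2 stripe width of 4 chunks, while 2x10+2 leaves 20 data disks (16%) and 1x21+2 plus a spare leaves 21 (12%), with stripe widths of 10 and 21 chunks that are not powers of 2.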

> As for AG size, again I think that it is required to be aligned to the
> forcealign extsize. As I remember, when converting from an FSB to a DB, if
> the AG itself is not aligned to the forcealign extsize, then the DB will not
> be aligned to the forcealign extsize. More below...
> 
> > 
> > > +	/* Requires agsize be a multiple of extsize */
> > > +	if (mp->m_sb.sb_agblocks % extsize)
> > > +		return __this_address;
> > > +
> > > +	/* Requires stripe unit+width (if set) be a multiple of extsize */
> > > +	if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
> > > +	    (mp->m_swidth && (mp->m_swidth % extsize)))
> > > +		return __this_address;
> > 
> > Again, this is an atomic write constraint, isn't it?
> 
> So why do we want forcealign? Is it only to align extent FSBs?

Yes. forced alignment is essentially just extent size guarantees.

This is part of what is needed for atomic writes, but atomic writes
also require specific physical storage alignment between the
filesystem and the device. The filesystem setup has to correctly
align AGs to the physical storage, and stuff like RAID
configurations need to be specifically compatible with the atomic
write capabilities of the underlying hardware.

None of these hardware and storage stack alignment constraints have
any relevance to the filesystem forced alignment functionality. They
are completely independent. All forced alignment does is guarantee
that allocation is aligned according to the extent size hint on the
inode, or it fails with ENOSPC.

> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index d0e2cec6210d..d1126509ceb9 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > > @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
> > >   		di_flags2 |= XFS_DIFLAG2_DAX;
> > >   	if (xflags & FS_XFLAG_COWEXTSIZE)
> > >   		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> > > +	if (xflags & FS_XFLAG_FORCEALIGN)
> > > +		di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
> > >   	return di_flags2;
> > >   }
> > > @@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
> > >   	if (i_flags2 && !xfs_has_v3inodes(mp))
> > >   		return -EINVAL;
> > > +	/*
> > > +	 * Force-align requires a nonzero extent size hint and a zero cow
> > > +	 * extent size hint.  It doesn't apply to realtime files.
> > > +	 */
> > > +	if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
> > > +		if (!xfs_has_forcealign(mp))
> > > +			return -EINVAL;
> > > +		if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
> > > +			return -EINVAL;
> > > +		if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
> > > +					FS_XFLAG_EXTSZINHERIT)))
> > > +			return -EINVAL;
> > > +		if (fa->fsx_xflags & FS_XFLAG_REALTIME)
> > > +			return -EINVAL;
> > > +	}
> > 
> > What about if the file already has shared extents on it (i.e.
> > reflinked or deduped?)
> 
> At the top of the function we have this check for RT:
> 
> 	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
> 		/* Can't change realtime flag if any extents are allocated. */
> 		if (ip->i_df.if_nextents || ip->i_delayed_blks)
> 			return -EINVAL;
> 	}
> 
> Would expanding that check for forcealign also suffice? Indeed, later in
> this series I expanded this check to cover atomicwrites (when I really
> intended it for forcealign).

For the moment, yes.

> > > @@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
> > >   	failaddr = xfs_inode_validate_extsize(ip->i_mount,
> > >   			XFS_B_TO_FSB(mp, fa->fsx_extsize),
> > >   			VFS_I(ip)->i_mode, new_diflags);
> > > -	return failaddr != NULL ? -EINVAL : 0;
> > > +	if (failaddr)
> > > +		return -EINVAL;
> > > +
> > > +	if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
> > > +		failaddr = xfs_inode_validate_forcealign(ip->i_mount,
> > > +				VFS_I(ip)->i_mode, new_diflags,
> > > +				XFS_B_TO_FSB(mp, fa->fsx_extsize),
> > > +				XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
> > > +		if (failaddr)
> > > +			return -EINVAL;
> > > +	}
> > 
> > Oh, it's because you're trying to use on-disk format validation
> > routines for user API validation. That, IMO, is a bad idea because
> > the on-disk format and kernel/user APIs should not be tied
> > together as they have different constraints and error conditions.
> > 
> > That also explains why xfs_inode_validate_forcealign() doesn't just
> > get passed the inode to validate - it's because you want to pass
> > information from the user API to it. This results in sub-optimal
> > code for both on-disk format validation and user API validation.
> > 
> > Can you please separate these and put all the force align user API
> > validation checks in the one function?
> > 
> 
> ok, fine. But it would be good to have clarification on function of
> forcealign, above, i.e. does it always align extents to disk blocks?

No, it doesn't. XFS has never done this - physical extent alignment
is always done relative to the start of the AG, not the underlying
disk geometry.

IOWs, forced alignment is not aligning to disk blocks at all - it
is aligning extents logically to file offset and physically to the
offset from the start of the allocation group.  Hence there are no
real constraints on forced alignment - we can do any sort of
alignment as long as it is smaller than half the max size of a physical
extent.

For allocation to then be aligned to physical storage, we need mkfs
to physically align the start of each AG to the geometry of the
underlying storage. We already do this for filesystems with a stripe
unit defined, hence stripe aligned allocation is physically aligned
to the underlying storage.

However, if mkfs doesn't get the physical layout of AGs right, there
is nothing the mounted filesystem can do to guarantee extent
allocation is aligned to physical disk blocks regardless of whether
forced alignment is enabled or not...
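
To illustrate the point above, here is a minimal user-space sketch. It
is not XFS code: struct geom, ag_to_disk() and the two predicates are
hypothetical names, and the geometry is simplified to plain block
counts. It shows that an extent aligned relative to the AG start is
only aligned on disk when the AG start itself is aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified geometry in filesystem blocks (not XFS structures). */
struct geom {
	uint64_t ag_start;	/* disk block at which AG 0 starts */
	uint64_t ag_blocks;	/* size of every AG */
};

/* Convert an (AG number, AG-relative block) pair to an absolute disk block. */
static uint64_t ag_to_disk(const struct geom *g, uint64_t agno, uint64_t agbno)
{
	return g->ag_start + agno * g->ag_blocks + agbno;
}

/* Alignment as the allocator sees it: relative to the start of the AG. */
static bool aligned_in_ag(uint64_t agbno, uint64_t align)
{
	assert(align != 0);
	return agbno % align == 0;
}

/* Alignment as the storage sees it: relative to the start of the device. */
static bool aligned_on_disk(const struct geom *g, uint64_t agno,
			    uint64_t agbno, uint64_t align)
{
	assert(align != 0);
	return ag_to_disk(g, agno, agbno) % align == 0;
}
```

With an AG size of 1022 blocks and an alignment of 4, block 8 of AG 1
is aligned within the AG but lands on disk block 1030, which is not a
multiple of 4.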

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign
  2024-05-01  8:30     ` John Garry
@ 2024-05-02  1:11       ` Dave Chinner
  2024-05-02  8:55         ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-05-02  1:11 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, May 01, 2024 at 09:30:37AM +0100, John Garry wrote:
> On 30/04/2024 23:54, Dave Chinner wrote:
> > On Mon, Apr 29, 2024 at 05:47:34PM +0000, John Garry wrote:
> > > For when forcealign is enabled, we want the EOF to be aligned as well, so
> > > do not free EOF blocks.
> > 
> > > This doesn't match what the code does. The code is correct - it
> > rounds the range to be trimmed up to the aligned offset beyond EOF
> > and then frees them. The description needs to be updated to reflect
> > this.
> 
> ok, fine
> 
> > 
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_bmap_util.c | 7 ++++++-
> > >   1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index 19e11d1da660..f26d1570b9bd 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
> > >   	 * forever.
> > >   	 */
> > >   	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> > > -	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> > > +
> > > +	/* Do not free blocks when forcing extent sizes */
> > > +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
> > 
> > I see this sort of check all through the remaining patches.
> > 
> > > Given there are significant restrictions on forced alignment,
> > > shouldn't all of these details be pushed inside the helper function?
> > e.g.
> > 
> > /*
> >   * Forced extent alignment is dependent on extent size hints being
> >   * set to define the alignment. Alignment is only necessary when the
> >   * extent size hint is larger than a single block.
> >   *
> >   * If reflink is enabled on the file or we are in always_cow mode,
> >   * we can't easily do forced alignment.
> >   *
> >   * We don't support forced alignment on realtime files.
> >   * XXX(dgc): why not?
> 
> There is no technical reason to not be able to support forcealign on RT,
> AFAIK. My idea is to support RT after non-RT is supported.
> 
> >   */
> > static inline bool
> > xfs_inode_has_forcealign(struct xfs_inode *ip)
> > {
> > 	if (!(ip->di_flags & XFS_DIFLAG_EXTSIZE))
> > 		return false;
> > 	if (ip->i_extsize <= 1)
> > 		return false;
> > 
> > 	if (xfs_is_cow_inode(ip))
> > 		return false;
> 
> Could we just include this in the forcealign validate checks? Currently we
> just check CoW extsize is zero there.

Checking COW extsize is zero doesn't tell us anything useful about
whether the inode might have shared extents, or that the filesystem
has had the sysfs "always cow" debug knob turned on. That changes
filesystem behaviour at mount time and has nothing to do with the
on-disk format constraints.

And now that I think about it, checking for COW extsize is
completely the wrong thing to do because it doesn't get used until
an extent is shared and a COW trigger is hit. So the presence of COW
extsize has zero impact on whether we can use forced alignment or
not.

IOWs, we have to check for shared extents or always cow here,
because even a file with correctly set up forced alignment needs to
have forced alignment disabled when always_cow is enabled. Every
write is going to use the COW path and AFAICT we don't support
forced alignment through that path yet.

> 
> > 	if (ip->di_flags & XFS_DIFLAG_REALTIME)
> > 		return false;
> 
> We check this in xfs_inode_validate_forcealign()

That's kinda my point - we have a random smattering of different
checks at different layers and in different contexts. i.e.  There's
no one function that performs -all- the "can we do forced alignment"
checks that allow forced alignment to be used. This simply adds all
those checks in the one place and ensures that even if other code
gets checks wrong, we won't use forcealign inappropriately.
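
As a rough user-space model of that idea (struct model_inode and its
fields are hypothetical stand-ins, not the real struct xfs_inode), a
single predicate carrying every runtime check could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode state the checks look at. */
struct model_inode {
	bool forcealign_flag;	/* assumption: the on-disk forcealign flag */
	bool extsize_flag;	/* extent size hint flag is set */
	bool realtime;		/* file lives on the realtime device */
	bool cow;		/* reflinked, or always_cow mode is enabled */
	uint32_t extsize;	/* extent size hint, in filesystem blocks */
};

/* One place that answers "can forced alignment be used on this inode?" */
static bool model_has_forcealign(const struct model_inode *ip)
{
	assert(ip != NULL);

	if (!ip->forcealign_flag)
		return false;
	/* Alignment is defined by the extent size hint... */
	if (!ip->extsize_flag)
		return false;
	/* ...and only matters when the hint is larger than one block. */
	if (ip->extsize <= 1)
		return false;
	/* Writes through the COW path are not force aligned. */
	if (ip->cow)
		return false;
	/* Realtime files are not supported (yet). */
	if (ip->realtime)
		return false;
	return true;
}
```

Callers then only ask the one question, rather than repeating
"has forcealign && extsize > 1" at every call site.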

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 15/21] fs: xfs: iomap: Sub-extent zeroing
  2024-05-01 11:36     ` John Garry
@ 2024-05-02  1:26       ` Dave Chinner
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2024-05-02  1:26 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, May 01, 2024 at 12:36:02PM +0100, John Garry wrote:
> On 01/05/2024 02:32, Dave Chinner wrote:
> > On Mon, Apr 29, 2024 at 05:47:40PM +0000, John Garry wrote:
> > > Set iomap->extent_size when sub-extent zeroing is required.
> > > 
> > > We treat a sub-extent write same as an unaligned write, so we can leverage
> > > the existing sub-FSblock unaligned write support, i.e. try a shared lock
> > > with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
> > > lock.
> > > 
> > > In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.
> > 
> > If forcedalign is set, should we just reject unaligned DIOs?
> 
> Why would we? That's very restrictive. Indeed, we got to the point of adding
> the sub-extent zeroing just for supporting that.
> > > @@ -646,9 +647,9 @@ xfs_file_dio_write_unaligned(
> > >   	ssize_t			ret;
> > >   	/*
> > > -	 * Extending writes need exclusivity because of the sub-block zeroing
> > > -	 * that the DIO code always does for partial tail blocks beyond EOF, so
> > > -	 * don't even bother trying the fast path in this case.
> > > +	 * Extending writes need exclusivity because of the sub-block/extent
> > > +	 * zeroing that the DIO code always does for partial tail blocks
> > > +	 * beyond EOF, so don't even bother trying the fast path in this case.
> > >   	 */
> > >   	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
> > >   		if (iocb->ki_flags & IOCB_NOWAIT)
> > > @@ -714,11 +715,19 @@ xfs_file_dio_write(
> > >   	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
> > >   	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
> > >   	size_t			count = iov_iter_count(from);
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	unsigned int		blockmask;
> > >   	/* direct I/O must be aligned to device logical sector size */
> > >   	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
> > >   		return -EINVAL;
> > > -	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
> > > +
> > > +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1)
> > > +		blockmask = XFS_FSB_TO_B(mp, ip->i_extsize) - 1;
> > > +	else
> > > +		blockmask = mp->m_blockmask;
> > 
> > 	alignmask = XFS_FSB_TO_B(mp, xfs_inode_alignment(ip)) - 1;
> 
> Do you mean xfs_extent_alignment() instead of xfs_inode_alignment()?

Yes, I was.

I probably should have named it xfs_inode_extent_alignment() because
clearly I kept thinking of it as "inode alignment"... :)

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 17/21] iomap: Atomic write support
  2024-05-01 11:08     ` John Garry
@ 2024-05-02  1:43       ` Dave Chinner
  2024-05-02  9:12         ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-05-02  1:43 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, May 01, 2024 at 12:08:34PM +0100, John Garry wrote:
> On 01/05/2024 02:47, Dave Chinner wrote:
> > On Mon, Apr 29, 2024 at 05:47:42PM +0000, John Garry wrote:
> > > Support atomic writes by producing a single BIO with REQ_ATOMIC flag set.
> > > 
> > > We rely on the FS to guarantee extent alignment, such that an atomic write
> > > should never straddle two or more extents. The FS should also check for
> > > validity of an atomic write length/alignment.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
...
> > > +
> > >   		bio->bi_private = dio;
> > >   		bio->bi_end_io = iomap_dio_bio_end_io;
> > > @@ -403,6 +407,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >   		}
> > >   		n = bio->bi_iter.bi_size;
> > > +		if (is_atomic && n != orig_count) {
> > > +			/* This bio should have covered the complete length */
> > > +			ret = -EINVAL;
> > > +			bio_put(bio);
> > > +			goto out;
> > > +		}
> > 
> > What happens now if we've done zeroing IO before this? I suspect we
> > might expose stale data if the partial block zeroing converts the
> > unwritten extent in full...
> 
> We use iomap_dio.ref to ensure that __iomap_dio_rw() does not return until
> any zeroing and actual sub-io block write completes. See iomap_dio_zero() ->
> iomap_dio_submit_bio() -> atomic_inc(&dio->ref) callchain. I meant to add
> such info to the commit message, as you questioned this previously.

Yes, I get that. But my point is that we may have only done -part-
of a block unaligned IO.

This is effectively a failure from a bio_iov_iter_get_pages() call.
What does the bio_iov_iter_get_pages() failure case do that this new
failure case not do? Why does this case have different failure
handling?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign
  2024-05-01 23:53     ` Darrick J. Wong
@ 2024-05-02  3:12       ` Dave Chinner
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2024-05-02  3:12 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: John Garry, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, May 01, 2024 at 04:53:10PM -0700, Darrick J. Wong wrote:
> On Wed, May 01, 2024 at 10:53:28AM +1000, Dave Chinner wrote:
> > On Mon, Apr 29, 2024 at 05:47:37PM +0000, John Garry wrote:
> > > Like we already do for rtvol, only free full extents for forcealign in
> > > xfs_free_file_space().
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >  fs/xfs/xfs_bmap_util.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index f26d1570b9bd..1dd45dfb2811 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -847,8 +847,11 @@ xfs_free_file_space(
> > >  	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
> > >  	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
> > >  
> > > -	/* We can only free complete realtime extents. */
> > > -	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
> > > +	/* Free only complete extents. */
> > > +	if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
> > > +		startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
> > > +		endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
> > > +	} else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
> > >  		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
> > >  		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
> > >  	}
> > 
> > When you look at xfs_rtb_roundup_rtx() you'll find it's just a one
> > line wrapper around roundup_64().
> 
> I added this a couple of cycles ago to get ready for realtime
> modernization.

Yes, I know. I'm not suggesting that there's anything wrong with
this code, just pointing out that the RT wrappers are doing the
exact same conversion as the force-align code is doing. And from
that observation, a common implementation makes a lot of sense
because that same logic is repeated in quite a few places....
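
For reference, the shared logic is small. A user-space sketch follows;
round_up_u64(), round_down_u64() and trim_to_full_extents() are
hypothetical names for illustration, not the kernel helpers. It shrinks
a block range inwards so that it covers only whole allocation units,
whatever defines the unit size:

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of y (y need not be a power of 2). */
static uint64_t round_up_u64(uint64_t x, uint64_t y)
{
	return ((x + y - 1) / y) * y;
}

/* Round x down to the previous multiple of y. */
static uint64_t round_down_u64(uint64_t x, uint64_t y)
{
	return (x / y) * y;
}

struct fsb_range {
	uint64_t start;	/* first block to free */
	uint64_t end;	/* one past the last block to free */
};

/*
 * Shrink [start, end) so it covers only complete allocation units of
 * 'align' blocks. If the result has start >= end, there is no complete
 * unit inside the range and nothing can be freed.
 */
static struct fsb_range trim_to_full_extents(uint64_t start, uint64_t end,
					      uint64_t align)
{
	struct fsb_range r = { start, end };

	assert(align != 0);
	if (align > 1) {
		r.start = round_up_u64(start, align);
		r.end = round_down_u64(end, align);
	}
	return r;
}
```

The only thing that differs between the realtime and forcealign cases
is where the value passed as 'align' comes from.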

> That will create a bunch *more* churn in my tree just to
> convert everything *back*.

This doesn't change anything significant in your tree, nor do you
need to "convert everything back". The RT wrappers are unchanged,
and the only material difference in your tree vs the upstream
xfs_free_file_space() this patchset is based on is this:

-	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
+	if (xfs_inode_has_bigrtalloc(ip)) {

That's it.

All the suggestion I made does is change where you need to make this
one line change. It would also remove the need to do this one line
change in multiple other places, so it would actually -reduce- your
ongoing rebase pain, not make it worse.

That's a net win for everyone, and it's most definitely not a reason
to shout at people and threaten to revert any changes they might
make in this area of the code.

> Where the hell were you when that was being reviewed?!!!

How is this sort of unhelpful statement in any way relevant to
improving the forcealign functionality to the point where we can
actually merge it and start making use of it for atomic writes?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-05-02  0:50       ` Dave Chinner
@ 2024-05-02  7:56         ` John Garry
  0 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-05-02  7:56 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 02/05/2024 01:50, Dave Chinner wrote:
>> For example, consider xfs_file_dio_write(), where we check for an unaligned
>> write based on forcealign extent mask. It's much simpler to rely on a
>> power-of-2 size. And same for iomap extent zeroing.
> But it's not more complex - we already do this non-power-of-2
> alignment stuff for all the realtime code, so it's just a matter
> of not blindly using bit masking in alignment checks.
> 
>> So then it can be asked, for what reason do we want to support unorthodox,
>> non-power-of-2 sizes? Who would want this?
> I'm constantly surprised by the way people use stuff like this.
> Filesystem and storage alignment constraints are not arbitrarily
> limited to power-of-2 sizes.
> 
> For example, code implementation is simple in RAID setups when you
> use power-of-2 chunk sizes and stripe widths. But not all storage
> hardware fits power-of-2 configs like 4+1, 4+2, 8+1, 8+2, etc. This
> is pretty common - 2.5" 2U drive trays have 24 drive bays. If you
> want to give up 33% of the storage capacity just to use power-of-2
> stripe widths then you would use 4x4+2 RAID6 luns. However, most
> people don't want to waste that much money on redundancy. They are
> much more likely to use 2x10+2 RAID6 luns or 1x21+2 with a hot spare
> to maximise the data storage capacity.

Thanks for sharing this info

> 
> If someone wants to force-align allocation to stripe widths on such
> a RAID array config rather than trying to rely on the best effort
> swalloc mount option, then they need non-power-of-2
> alignments to be supported.
> 
> It's pretty much a no-brainer - the alignment code already handles
> non-power-of-2 alignments, and it's not very much additional code to
> ensure we can handle any alignment the user specified.

ok, fine
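
To make the "not blindly using bit masking" point concrete, here is a
small user-space sketch. The function names are hypothetical and this
is not the code from the series. A mask test gives wrong answers for a
non-power-of-2 alignment, while a modulo test works for any alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Naive check: only correct when 'align' is a power of 2. */
static bool naive_mask_aligned(uint64_t pos, uint64_t align)
{
	return (pos & (align - 1)) == 0;
}

/* Check that both position and length are multiples of 'align' bytes. */
static bool is_aligned(uint64_t pos, uint64_t len, uint64_t align)
{
	assert(align != 0);
	if (is_pow2(align))
		return ((pos | len) & (align - 1)) == 0;
	return pos % align == 0 && len % align == 0;
}
```

For example, with a 40960-byte alignment (10 blocks of 4096 bytes), the
naive mask test accepts offset 16384 even though it is not a multiple
of 40960.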

> 
>> As for AG size, again I think that it is required to be aligned to the
>> forcealign extsize. As I remember, when converting from an FSB to a DB, if
>> the AG itself is not aligned to the forcealign extsize, then the DB will not
>> be aligned to the forcealign extsize. More below...
>>
>>>> +	/* Requires agsize be a multiple of extsize */
>>>> +	if (mp->m_sb.sb_agblocks % extsize)
>>>> +		return __this_address;
>>>> +
>>>> +	/* Requires stripe unit+width (if set) be a multiple of extsize */
>>>> +	if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
>>>> +	    (mp->m_swidth && (mp->m_swidth % extsize)))
>>>> +		return __this_address;
>>> Again, this is an atomic write constraint, isn't it?
>> So why do we want forcealign? It is to only align extent FSBs?
> Yes. forced alignment is essentially just extent size guarantees.
> 
> This is part of what is needed for atomic writes, but atomic writes
> also require specific physical storage alignment between the
> filesystem and the device. The filesystem setup has to correctly
> align AGs to the physical storage, and stuff like RAID
> configurations need to be specifically compatible with the atomic
> write capabilities of the underlying hardware.
> 
> None of these hardware and storage stack alignment constraints have
> any relevance to the filesystem forced alignment functionality. They
> are completely independent. All the forced alignment does is
> guarantee that allocation is aligned according to the extent size hint
> on the inode or it fails with ENOSPC.

Fine, so only for atomic writes we just need to ensure FSBs are aligned 
to DBs.

And so it is the responsibility of mkfs to ensure AG size aligns to any 
forcealign extsize specified and also disk atomic write geometry.

For atomic write only, it is the responsibility of the kernel to check 
the forcealign extsize is compatible with any stripe alignment and AG size.
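
As a sketch of that atomic-write-only check, modelled on the hunk
quoted above but written as stand-alone user-space code (struct
model_mount and its fields are hypothetical, all values in filesystem
blocks):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the mount geometry, in filesystem blocks. */
struct model_mount {
	uint64_t agblocks;	/* AG size */
	uint64_t dalign;	/* stripe unit, 0 if unset */
	uint64_t swidth;	/* stripe width, 0 if unset */
};

/* Is 'extsize' usable as an atomic write unit on this geometry? */
static bool atomic_extsize_compatible(const struct model_mount *mp,
				      uint64_t extsize)
{
	assert(extsize != 0);

	/* AG size must be a multiple of extsize. */
	if (mp->agblocks % extsize)
		return false;
	/* Stripe unit and width, when set, must be multiples of extsize. */
	if (mp->dalign && mp->dalign % extsize)
		return false;
	if (mp->swidth && mp->swidth % extsize)
		return false;
	return true;
}
```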

>>>
>>> Can you please separate these and put all the force align user API
>>> validation checks in the one function?
>>>
>> ok, fine. But it would be good to have clarification on function of
>> forcealign, above, i.e. does it always align extents to disk blocks?
> No, it doesn't. XFS has never done this - physical extent alignment
> is always done relative to the start of the AG, not the underlying
> disk geometry.
> 
> IOWs, forced alignment is not aligning to disk blocks at all - it
> is aligning extents logically to file offset and physically to the
> offset from the start of the allocation group.  Hence there are no
> real constraints on forced alignment - we can do any sort of
> alignment as long as it is smaller than half the max size of a physical
> extent.
> 
> For allocation to then be aligned to physical storage, we need mkfs
> to physically align the start of each AG to the geometry of the
> underlying storage. We already do this for filesystems with a stripe
> unit defined, hence stripe aligned allocation is physically aligned
> to the underlying storage.

Sure

> 
> However, if mkfs doesn't get the physical layout of AGs right, there
> is nothing the mounted filesystem can do to guarantee extent
> allocation is aligned to physical disk blocks regardless of whether
> forced alignment is enabled or not...

ok, understood.

Thanks,
John


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign
  2024-05-02  1:11       ` Dave Chinner
@ 2024-05-02  8:55         ` John Garry
  0 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-05-02  8:55 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 02/05/2024 02:11, Dave Chinner wrote:
>>> static inline bool
>>> xfs_inode_has_forcealign(struct xfs_inode *ip)
>>> {
>>> 	if (!(ip->di_flags & XFS_DIFLAG_EXTSIZE))
>>> 		return false;
>>> 	if (ip->i_extsize <= 1)
>>> 		return false;
>>>
>>> 	if (xfs_is_cow_inode(ip))
>>> 		return false;
>> Could we just include this in the forcealign validate checks? Currently we
>> just check CoW extsize is zero there.
> Checking COW extsize is zero doesn't tell us anything useful about
> whether the inode might have shared extents, or that the filesystem
> has had the sysfs "always cow" debug knob turned on. That changes
> filesystem behaviour at mount time and has nothing to do with the
> on-disk format constraints.
> 
> And now that I think about it, checking for COW extsize is
> completely the wrong thing to do because it doesn't get used until
> an extent is shared and a COW trigger is hit. So the presence of COW
> extsize has zero impact on whether we can use forced alignment or
> not.

ok

> 
> IOWs, we have to check for shared extents or always cow here,
> because even a file with correctly set up forced alignment needs to
> have forced alignment disabled when always_cow is enabled. Every
> write is going to use the COW path and AFAICT we don't support
> forced alignment through that path yet.

ok

> 
>>> 	if (ip->di_flags & XFS_DIFLAG_REALTIME)
>>> 		return false;
>> We check this in xfs_inode_validate_forcealign()
> That's kinda my point - we have a random smattering of different
> checks at different layers and in different contexts. i.e.  There's
> no one function that performs -all- the "can we do forced alignment"
> checks that allow forced alignment to be used. This simply adds all
> those checks in the one place and ensures that even if other code
> gets checks wrong, we won't use forcealign inappropriately.

Fine, I can do that if you think it is the best strategy.

Thanks,
John

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 17/21] iomap: Atomic write support
  2024-05-02  1:43       ` Dave Chinner
@ 2024-05-02  9:12         ` John Garry
  0 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-05-02  9:12 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 02/05/2024 02:43, Dave Chinner wrote:
> On Wed, May 01, 2024 at 12:08:34PM +0100, John Garry wrote:
>> On 01/05/2024 02:47, Dave Chinner wrote:
>>> On Mon, Apr 29, 2024 at 05:47:42PM +0000, John Garry wrote:
>>>> Support atomic writes by producing a single BIO with REQ_ATOMIC flag set.
>>>>
>>>> We rely on the FS to guarantee extent alignment, such that an atomic write
>>>> should never straddle two or more extents. The FS should also check for
>>>> validity of an atomic write length/alignment.
>>>>
>>>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>>>> ---
> ...
>>>> +
>>>>    		bio->bi_private = dio;
>>>>    		bio->bi_end_io = iomap_dio_bio_end_io;
>>>> @@ -403,6 +407,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>>>    		}
>>>>    		n = bio->bi_iter.bi_size;
>>>> +		if (is_atomic && n != orig_count) {
>>>> +			/* This bio should have covered the complete length */
>>>> +			ret = -EINVAL;
>>>> +			bio_put(bio);
>>>> +			goto out;
>>>> +		}
>>>
>>> What happens now if we've done zeroing IO before this? I suspect we
>>> might expose stale data if the partial block zeroing converts the
>>> unwritten extent in full...
>>
>> We use iomap_dio.ref to ensure that __iomap_dio_rw() does not return until
>> any zeroing and actual sub-io block write completes. See iomap_dio_zero() ->
>> iomap_dio_submit_bio() -> atomic_inc(&dio->ref) callchain. I meant to add
>> such info to the commit message, as you questioned this previously.
> 
> Yes, I get that. But my point is that we may have only done -part-
> of a block unaligned IO.
> 
> This is effectively a failure from a bio_iov_iter_get_pages() call.
> What does the bio_iov_iter_get_pages() failure case do that this new
> failure case not do? Why does this case have different failure
> handling?
> 

So you are saying that if we fail here (that is the (is_atomic && n != 
orig_count) check), anything unwritten in the atomic write region and 
zerotail region could expose stale data, right?

If so, I would say that we need to zero the complete unwritten atomic 
write and zerotail regions - similar to the bio_iov_iter_get_pages() 
failure handling - and still report an -EINVAL error.

How does that sound?

Thanks,
John



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 14/21] iomap: Sub-extent zeroing
  2024-05-01  1:07   ` Dave Chinner
  2024-05-01 10:23     ` John Garry
@ 2024-05-30 10:40     ` John Garry
  1 sibling, 0 replies; 60+ messages in thread
From: John Garry @ 2024-05-30 10:40 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 01/05/2024 02:07, Dave Chinner wrote:
>>   	blk_opf_t bio_opf;
>> @@ -288,6 +288,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	size_t copied = 0;
>>   	size_t orig_count;
>>   
>> +	if (iomap->extent_size)
>> +		zeroing_size = iomap->extent_size;
>> +	else
>> +		zeroing_size = i_blocksize(inode);
> Oh, the dissonance!
> 
> iomap->extent_size isn't an extent size at all.
> 
> The size of the extent the iomap returns is iomap->length. This new
> variable is the IO specific "block size" that should be assumed by
> the dio code to determine if padding should be done.
> 
> IOWs, I think we should add an "io_block_size" field to the iomap,
> and every filesystem that supports iomap should set it to the
> filesystem block size (i_blocksize(inode)). Then the changes to the
> iomap code end up just being:
> 
> 
> -	unsigned int fs_block_size = i_blocksize(inode), pad;
> +	unsigned int fs_block_size = iomap->io_block_size, pad;
> 
> And the patch that introduces that infrastructure change will also
> change all the filesystem implementations to unconditionally set
> iomap->io_block_size to i_blocksize().
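
For illustration, the padding arithmetic that depends on this value can
be modelled in user space as below. head_pad() and tail_pad() are
hypothetical names, and the sketch assumes io_block_size is a power of
2, as the mask-based calculation in the dio code does.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes to zero from the start of the I/O block up to the write offset. */
static uint64_t head_pad(uint64_t pos, uint64_t io_block_size)
{
	/* The mask trick below requires a power-of-2 block size. */
	assert(io_block_size != 0 &&
	       (io_block_size & (io_block_size - 1)) == 0);
	return pos & (io_block_size - 1);
}

/* Bytes to zero from the end of the write to the end of the I/O block. */
static uint64_t tail_pad(uint64_t end_pos, uint64_t io_block_size)
{
	uint64_t pad = head_pad(end_pos, io_block_size);

	return pad ? io_block_size - pad : 0;
}
```

A 4096-byte write at offset 20480 needs no zeroing with a 4096-byte
I/O block size, but with a 16384-byte I/O block size it needs 4096
bytes zeroed in front and 8192 bytes zeroed behind.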

JFYI, this is how that change looks:

----8<----

Subject: [PATCH] iomap: Allow filesystems to set sub-fs block zeroing size

Allow filesystems to set the sub-fs block zeroing size, as in future we
will want to extend this feature to support zeroing of block sizes larger
than the inode block size.

Signed-off-by: John Garry <john.g.garry@oracle.com>

diff --git a/block/fops.c b/block/fops.c
index 9d6d86ebefb9..020443078630 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -402,6 +402,7 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
  	iomap->addr = iomap->offset;
  	iomap->length = isize - iomap->offset;
  	iomap->flags |= IOMAP_F_BUFFER_HEAD; /* noop for !CONFIG_BUFFER_HEAD */
+	iomap->io_block_size = i_blocksize(inode);
  	return 0;
  }

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 753db965f7c0..665811b1578b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7740,6 +7740,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
  	iomap->offset = start;
  	iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
  	iomap->length = len;
+	iomap->io_block_size = i_blocksize(inode);
  	free_extent_map(em);

  	return 0;
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 8be60797ea2f..ea9d2f3eadb3 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -305,6 +305,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
  		if (flags & IOMAP_DAX)
  			iomap->addr += mdev.m_dax_part_off;
  	}
+	iomap->io_block_size = i_blocksize(inode);
  	return 0;
  }

diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index 9b248ee5fef2..6ee89f6a078c 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -749,6 +749,7 @@ static int z_erofs_iomap_begin_report(struct inode *inode, loff_t offset,
  		if (iomap->offset >= inode->i_size)
  			iomap->length = length + offset - map.m_la;
  	}
+	iomap->io_block_size = i_blocksize(inode);
  	iomap->flags = 0;
  	return 0;
  }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0caa1650cee8..7a5539a52844 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -862,6 +862,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
  		iomap->length = (u64)ret << blkbits;
  		iomap->flags |= IOMAP_F_MERGED;
  	}
+	iomap->io_block_size = i_blocksize(inode);

  	if (new)
  		iomap->flags |= IOMAP_F_NEW;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e067f2dd0335..ce3269874fde 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4933,6 +4933,7 @@ static int ext4_iomap_xattr_fiemap(struct inode *inode, struct iomap *iomap)
  	iomap->length = length;
  	iomap->type = iomap_type;
  	iomap->flags = 0;
+	iomap->io_block_size = i_blocksize(inode);
  out:
  	return error;
  }
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4bae9ccf5fe0..3ec82e4d71c4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3235,6 +3235,7 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
  		iomap->bdev = inode->i_sb->s_bdev;
  	iomap->offset = (u64) map->m_lblk << blkbits;
  	iomap->length = (u64) map->m_len << blkbits;
+	iomap->io_block_size = i_blocksize(inode);

  	if ((map->m_flags & EXT4_MAP_MAPPED) &&
  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b9b0debc6b3d..6c12641b9a7b 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4233,6 +4233,7 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
  		}
  		iomap->addr = IOMAP_NULL_ADDR;
  	}
+	iomap->io_block_size = i_blocksize(inode);

  	if (map.m_flags & F2FS_MAP_NEW)
  		iomap->flags |= IOMAP_F_NEW;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d170bb..68ddc74cb31e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -577,6 +577,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
  	iomap->flags = 0;
  	iomap->bdev = NULL;
  	iomap->dax_dev = fc->dax->dev;
+	iomap->io_block_size = i_blocksize(inode);

  	/*
  	 * Both read/write and mmap path can race here. So we need something
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 1795c4e8dbf6..8d2de42b1da9 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -927,6 +927,7 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,

  out:
  	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->io_block_size = i_blocksize(inode);
  unlock:
  	up_read(&ip->i_rw_mutex);
  	return ret;
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 1bb8d97cd9ae..5d2718faf520 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -149,6 +149,7 @@ static int hpfs_iomap_begin(struct inode *inode, 
loff_t offset, loff_t length,
  		iomap->addr = IOMAP_NULL_ADDR;
  		iomap->length = 1 << blkbits;
  	}
+	iomap->io_block_size = i_blocksize(inode);

  	hpfs_unlock(sb);
  	return 0;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..1e6eb59cac6c 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct 
iomap_iter *iter,
  {
  	const struct iomap *iomap = &iter->iomap;
  	struct inode *inode = iter->inode;
-	unsigned int fs_block_size = i_blocksize(inode), pad;
+	u64 io_block_size = iomap->io_block_size;
  	loff_t length = iomap_length(iter);
  	loff_t pos = iter->pos;
  	blk_opf_t bio_opf;
@@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct 
iomap_iter *iter,
  	int nr_pages, ret = 0;
  	size_t copied = 0;
  	size_t orig_count;
+	unsigned int pad;

  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
@@ -355,7 +356,7 @@ static loff_t iomap_dio_bio_iter(const struct 
iomap_iter *iter,

  	if (need_zeroout) {
  		/* zero out from the start of the block to the write offset */
-		pad = pos & (fs_block_size - 1);
+		pad = pos & (io_block_size - 1);
  		if (pad)
  			iomap_dio_zero(iter, dio, pos - pad, pad);
  	}
@@ -429,9 +430,9 @@ static loff_t iomap_dio_bio_iter(const struct 
iomap_iter *iter,
  	if (need_zeroout ||
  	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
  		/* zero out from the end of the write to the end of the block */
-		pad = pos & (fs_block_size - 1);
+		pad = pos & (io_block_size - 1);
  		if (pad)
-			iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+			iomap_dio_zero(iter, dio, pos, io_block_size - pad);
  	}
  out:
  	/* Undo iter limitation to current extent */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 378342673925..ecb4cae88248 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -127,6 +127,7 @@ xfs_bmbt_to_iomap(
  	}
  	iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
  	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
+	iomap->io_block_size = i_blocksize(VFS_I(ip));
  	if (mapping_flags & IOMAP_DAX)
  		iomap->dax_dev = target->bt_daxdev;
  	else
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 3b103715acc9..bf2cc4bee309 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -50,6 +50,7 @@ static int zonefs_read_iomap_begin(struct inode 
*inode, loff_t offset,
  		iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
  		iomap->length = isize - iomap->offset;
  	}
+	iomap->io_block_size = i_blocksize(inode);
  	mutex_unlock(&zi->i_truncate_mutex);

  	trace_zonefs_iomap_begin(inode, iomap);
@@ -99,6 +100,7 @@ static int zonefs_write_iomap_begin(struct inode 
*inode, loff_t offset,
  		iomap->type = IOMAP_MAPPED;
  		iomap->length = isize - iomap->offset;
  	}
+	iomap->io_block_size = i_blocksize(inode);
  	mutex_unlock(&zi->i_truncate_mutex);

  	trace_zonefs_iomap_begin(inode, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..c6ae6fdcec00 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -103,6 +103,7 @@ struct iomap {
  	void			*private; /* filesystem private */
  	const struct iomap_folio_ops *folio_ops;
  	u64			validity_cookie; /* used with .iomap_valid() */
+	u64			io_block_size; /* sub-FS block zeroing size  */
  };

  static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)


---->8----

That's a lot of changes... In addition, if rtextsize is to be considered in 
setting io_block_size, what about ext4 bigalloc and other similar features?

> 
> Then, in a separate patch, you can add XFS support for large IO
> block sizes when we have either a large rtextsize or extent size
> hints set.



> 
>> +
>>   	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>>   	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 07/21] fs: xfs: align args->minlen for forced allocation alignment
  2024-04-29 17:47 ` [PATCH v3 07/21] fs: xfs: align args->minlen for " John Garry
@ 2024-06-05 14:26   ` John Garry
  2024-06-06  8:47     ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-06-05 14:26 UTC (permalink / raw
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner

On 29/04/2024 18:47, John Garry wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If args->minlen is not aligned to the constraints of forced
> alignment, we may do minlen allocations that are not aligned when we
> approach ENOSPC. Avoid this by always aligning args->minlen
> appropriately. If alignment of minlen results in a value smaller
> than the alignment constraint, fail the allocation immediately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>   fs/xfs/libxfs/xfs_bmap.c | 45 +++++++++++++++++++++++++++-------------
>   1 file changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 7a0ef0900097..4f39a43d78a7 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3288,33 +3288,48 @@ xfs_bmap_longest_free_extent(
>   	return 0;
>   }
>   
> -static xfs_extlen_t
> +static int
>   xfs_bmap_select_minlen(
>   	struct xfs_bmalloca	*ap,
>   	struct xfs_alloc_arg	*args,
>   	xfs_extlen_t		blen)
>   {
> -
>   	/* Adjust best length for extent start alignment. */
>   	if (blen > args->alignment)
>   		blen -= args->alignment;
>   
>   	/*
>   	 * Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
> -	 * possible that there is enough contiguous free space for this request.
> +	 * possible that there is enough contiguous free space for this request
> +	 * even if best length is less that the minimum length we need.
> +	 *
> +	 * If the best length won't satisfy the maximum length we requested,
> +	 * then use it as the minimum length so we get as large an allocation
> +	 * as possible.
>   	 */
>   	if (blen < ap->minlen)
> -		return ap->minlen;
> +		blen = ap->minlen;
> +	else if (blen > args->maxlen)
> +		blen = args->maxlen;
>   
>   	/*
> -	 * If the best seen length is less than the request length,
> -	 * use the best as the minimum, otherwise we've got the maxlen we
> -	 * were asked for.
> +	 * If we have alignment constraints, round the minlen down to match the
> +	 * constraint so that alignment will be attempted. This may reduce the
> +	 * allocation to smaller than was requested, so clamp the minimum to
> +	 * ap->minlen to allow unaligned allocation to succeed. If we are forced
> +	 * to align the allocation, return ENOSPC at this point because we don't
> +	 * have enough contiguous free space to guarantee aligned allocation.
>   	 */
> -	if (blen < args->maxlen)
> -		return blen;
> -	return args->maxlen;
> -
> +	if (args->alignment > 1) {
> +		blen = rounddown(blen, args->alignment);
> +		if (blen < ap->minlen) {
> +			if (args->datatype & XFS_ALLOC_FORCEALIGN)
> +				return -ENOSPC;
> +			blen = ap->minlen;
> +		}
> +	}

Hi Dave,

I still think that there is a problem with this code or some other 
allocator code which gives rise to unexpected -ENOSPC. I just highlight 
this code, above, as I get an unexpected -ENOSPC failure here when the 
fs does have many free (big enough) extents. I think that the problem 
may be elsewhere, though.

Initially we have a file like this:

  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
    0: [0..127]:        62592..62719      0 (62592..62719)     128
    1: [128..895]:      hole                                   768
    2: [896..1023]:     63616..63743      0 (63616..63743)     128
    3: [1024..1151]:    64896..65023      0 (64896..65023)     128
    4: [1152..1279]:    65664..65791      0 (65664..65791)     128
    5: [1280..1407]:    68224..68351      0 (68224..68351)     128
    6: [1408..1535]:    76416..76543      0 (76416..76543)     128
    7: [1536..1791]:    62720..62975      0 (62720..62975)     256
    8: [1792..1919]:    60032..60159      0 (60032..60159)     128
    9: [1920..2047]:    63488..63615      0 (63488..63615)     128
   10: [2048..2303]:    63744..63999      0 (63744..63999)     256

forcealign extsize is 16 4k fsb, so the layout looks ok.

Then we truncate the file to 454 sectors (or 56.75 fsb). This gives:

EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
    0: [0..127]:        62592..62719      0 (62592..62719)     128
    1: [128..455]:      hole                                   328

We have 57 fsb.

Then I attempt to write from byte offset 232448 (sector 454) and get a 
write failure in xfs_bmap_select_minlen() returning -ENOSPC; at that 
point the file looks like this:

  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
    0: [0..127]:        62592..62719      0 (62592..62719)     128
    1: [128..447]:      hole                                   320
    2: [448..575]:      62720..62847      0 (62720..62847)     128

That hole in ext #1 is 40 fsb, and not aligned with forcealign 
granularity. This means that ext #2 is misaligned wrt forcealign 
granularity.

This is strange.

I notice that when we allocate ext #2, xfs_bmap_btalloc() returns 
ap->blkno=7840, length=16, offset=56. I would expect offset % 16 == 0, 
which it is not.

In the following sub-io block zeroing, I note that we zero the front 
padding from pos=196608 (or fsb 48 or sector 384) for len=35840, and 
back padding from pos=263680 for len=64000 (up to sector 640 or fsb 80). 
That seems wrong, as we are zeroing data in the ext #1 hole, right?
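The zeroing ranges quoted above can be reproduced with a small standalone sketch. It assumes a 64k zeroing granularity (16 x 4k fsb, matching the forcealign extsize) and that the write ends at byte 263680; front_pad() and back_pad() are hypothetical helper names which only mirror the pad arithmetic in iomap_dio_bio_iter(), they are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes between the start of the enclosing zeroing block and pos.
 * io_block_size is assumed to be a power of two, as the mask requires.
 */
static uint64_t front_pad(uint64_t pos, uint64_t io_block_size)
{
	assert(io_block_size && (io_block_size & (io_block_size - 1)) == 0);
	return pos & (io_block_size - 1);
}

/*
 * Bytes between end_pos and the end of the enclosing zeroing block,
 * or 0 when end_pos is already block aligned.
 */
static uint64_t back_pad(uint64_t end_pos, uint64_t io_block_size)
{
	uint64_t pad = front_pad(end_pos, io_block_size);

	return pad ? io_block_size - pad : 0;
}
```

With pos=232448 the front pad is 35840 bytes starting at 196608, and with end_pos=263680 the back pad is 64000 bytes ending at 327680 (fsb 80).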

Now the actual -ENOSPC comes from xfs_bmap_btalloc() -> ... -> 
xfs_bmap_select_minlen() with initially blen=32 args->alignment=16 
ap->minlen=1 args->maxlen=8. There xfs_bmap_btalloc() has ap->length=8 
initially. This may be just a symptom.

With args->maxlen < args->alignment, we fail with -ENOSPC in 
xfs_bmap_select_minlen()
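To make the failure path concrete, here is a userspace model of the patched xfs_bmap_select_minlen() logic quoted above. struct minlen_model and select_minlen_model() are hypothetical names, and the forcealign flag stands in for the XFS_ALLOC_FORCEALIGN datatype check; this is only a sketch of the arithmetic, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Inputs that the quoted function reads from ap and args. */
struct minlen_model {
	uint32_t alignment;	/* args->alignment */
	uint32_t maxlen;	/* args->maxlen */
	uint32_t ap_minlen;	/* ap->minlen */
	bool forcealign;	/* models XFS_ALLOC_FORCEALIGN being set */
};

/* Returns 0 and stores the chosen minlen, or -ENOSPC. */
static int select_minlen_model(const struct minlen_model *m, uint32_t blen,
			       uint32_t *minlen)
{
	assert(m->alignment > 0);

	/* Adjust best length for extent start alignment. */
	if (blen > m->alignment)
		blen -= m->alignment;

	/* Clamp to the [ap_minlen, maxlen] window. */
	if (blen < m->ap_minlen)
		blen = m->ap_minlen;
	else if (blen > m->maxlen)
		blen = m->maxlen;

	/* Round down to the alignment; forced alignment cannot fall back. */
	if (m->alignment > 1) {
		blen -= blen % m->alignment;
		if (blen < m->ap_minlen) {
			if (m->forcealign)
				return -ENOSPC;
			blen = m->ap_minlen;
		}
	}

	*minlen = blen;
	return 0;
}
```

With blen=32, alignment=16, ap_minlen=1 and maxlen=8, blen is clamped to 8 and then rounded down to 0, so the forcealign case returns -ENOSPC regardless of how much free space exists.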

I guess that there is something wrong in the block allocator for ext #2. 
Any idea where to check?

I'll send a new v4 series soon which has this problem, so as to share the 
exact full code changes.

Thanks,
John


> +	args->minlen = blen;
> +	return 0;
>   }
>   
>   static int
> @@ -3350,8 +3365,7 @@ xfs_bmap_btalloc_select_lengths(
>   	if (pag)
>   		xfs_perag_rele(pag);
>   
> -	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
> -	return error;
> +	return xfs_bmap_select_minlen(ap, args, blen);
>   }
>   
>   /* Update all inode and quota accounting for the allocation we just did. */
> @@ -3671,7 +3685,10 @@ xfs_bmap_btalloc_filestreams(
>   		goto out_low_space;
>   	}
>   
> -	args->minlen = xfs_bmap_select_minlen(ap, args, blen);
> +	error = xfs_bmap_select_minlen(ap, args, blen);
> +	if (error)
> +		goto out_low_space;
> +
>   	if (ap->aeof && ap->offset)
>   		error = xfs_bmap_btalloc_at_eof(ap, args);
>   


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 07/21] fs: xfs: align args->minlen for forced allocation alignment
  2024-06-05 14:26   ` John Garry
@ 2024-06-06  8:47     ` Dave Chinner
  2024-06-06 16:22       ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-06-06  8:47 UTC (permalink / raw
  To: John Garry
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner

On Wed, Jun 05, 2024 at 03:26:11PM +0100, John Garry wrote:
> Hi Dave,
> 
> I still think that there is a problem with this code or some other allocator
> code which gives rise to unexpected -ENOSPC. I just highlight this code,
> above, as I get an unexpected -ENOSPC failure here when the fs does have
> many free (big enough) extents. I think that the problem may be elsewhere,
> though.
> 
> Initially we have a file like this:
> 
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>    0: [0..127]:        62592..62719      0 (62592..62719)     128
>    1: [128..895]:      hole                                   768
>    2: [896..1023]:     63616..63743      0 (63616..63743)     128
>    3: [1024..1151]:    64896..65023      0 (64896..65023)     128
>    4: [1152..1279]:    65664..65791      0 (65664..65791)     128
>    5: [1280..1407]:    68224..68351      0 (68224..68351)     128
>    6: [1408..1535]:    76416..76543      0 (76416..76543)     128
>    7: [1536..1791]:    62720..62975      0 (62720..62975)     256
>    8: [1792..1919]:    60032..60159      0 (60032..60159)     128
>    9: [1920..2047]:    63488..63615      0 (63488..63615)     128
>   10: [2048..2303]:    63744..63999      0 (63744..63999)     256
> 
> forcealign extsize is 16 4k fsb, so the layout looks ok.
> 
> Then we truncate the file to 454 sectors (or 56.75 fsb). This gives:
> 
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>    0: [0..127]:        62592..62719      0 (62592..62719)     128
>    1: [128..455]:      hole                                   328
>
> We have 57 fsb.
> 
> Then I attempt to write from byte offset 232448 (454 sector) and a get a
> write failure in xfs_bmap_select_minlen() returning -ENOSPC; at that point
> the file looks like this:

So you are doing an unaligned write of some size at EOF and EOF is
not aligned to the extsize?

>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>    0: [0..127]:        62592..62719      0 (62592..62719)     128
>    1: [128..447]:      hole                                   320
>    2: [448..575]:      62720..62847      0 (62720..62847)     128
> 
> That hole in ext #1 is 40 fsb, and not aligned with forcealign granularity.
> This means that ext #2 is misaligned wrt forcealign granularity.

OK, so the command to produce this would be something like this?

# xfs_io -fd -c "truncate 0" \
	-c "chattr +<forcealign>" -c "extsize 64k" \
	-c "pwrite 0 64k -b 64k" -c "pwrite 448k 64k -b 64k" \
	-c "bmap -vvp" \
	-c "truncate 227k" \
	-c "bmap -vvp" \
	-c "pwrite 227k 64k -b 64k" \
	-c "bmap -vvp" \
	/mnt/scratch/testfile

> This is strange.
> 
> I notice that we when allocate ext #2, xfs_bmap_btalloc() returns
> ap->blkno=7840, length=16, offset=56. I would expect offset % 16 == 0, which
> it is not.

IOWs, the allocation was not correctly rounded down to an aligned
start offset.  What were the initial parameters passed to this
allocation? i.e. why didn't it round the start offset down to 48?
Answering that question will tell you where the bug is.

Of course, if the allocation start is rounded down to 48, then
the length should be rounded up to 32 to cover the entire range we
are writing new data to.
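As a rough illustration of that rounding, the sketch below expands a block range outwards to an alignment boundary. struct fsb_range and align_range() are hypothetical names, and the input of 9 blocks starting at fsb 56 is an assumption derived from the byte range 232448..263680 quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* A range of filesystem blocks: [start, start + len). */
struct fsb_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Round the start down and the end up to a multiple of align, so that
 * the result covers the whole requested range.
 */
static struct fsb_range align_range(uint64_t start, uint64_t len,
				    uint64_t align)
{
	struct fsb_range r;
	uint64_t end = start + len;

	assert(align > 0);

	r.start = start - start % align;
	end = (end + align - 1) / align * align;
	r.len = end - r.start;
	return r;
}
```

For start=56, len=9 and align=16 this gives start=48 and len=32, i.e. the range [48, 80).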

> In the following sub-io block zeroing, I note that we zero the front padding
> from pos=196608 (or fsb 48 or sector 384) for len=35840, and back padding
> from pos=263680 for len=64000 (upto sector 640 or fsb 80). That seems wrong,
> as we are zeroing data in the ext #1 hole, right?

The sub block zeroing is doing exactly the right thing - it is
demonstrating the exact range that the force aligned allocation
should have covered.

> Now the actual -ENOSPC comes from xfs_bmap_btalloc() -> ... ->
> xfs_bmap_select_minlen() with initially blen=32 args->alignment=16
> ap->minlen=1 args->maxlen=8. There xfs_bmap_btalloc() has ap->length=8
> initially. This may be just a symptom.

Yeah, now the allocator is trying to fix up the mess that the first unaligned
allocation created, and it's tripping over ENOSPC because it's not
allowed to do sub-extent size hint allocations when forced alignment
is enabled....

> I guess that there is something wrong in the block allocator for ext #2. Any
> idea where to check?

Start with tracing exactly what range iomap is requesting be
allocated, and then follow that through into the allocator to work
out why the offset being passed to the allocation never gets rounded
down to be aligned. There's a mistake in the logic somewhere that is
failing to apply the start alignment to the allocation request, i.e.
the bug will be in the allocation setup code path, somewhere
in the xfs_bmapi_write -> xfs_bmap_btalloc path *before* we get to
the xfs_alloc_vextent...() calls.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign
  2024-05-01  0:10   ` Dave Chinner
  2024-05-01 10:54     ` John Garry
@ 2024-06-06  9:50     ` John Garry
  1 sibling, 0 replies; 60+ messages in thread
From: John Garry @ 2024-06-06  9:50 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

Hi Dave,

>>   
>>   	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
>> @@ -5459,11 +5472,15 @@ __xfs_bunmapi(
>>   		if (del.br_startoff + del.br_blockcount > end + 1)
>>   			del.br_blockcount = end + 1 - del.br_startoff;
>>   
>> -		if (!isrt || (flags & XFS_BMAPI_REMAP))
>> +		if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
>>   			goto delete;
>>   
>> -		mod = xfs_rtb_to_rtxoff(mp,
>> -				del.br_startblock + del.br_blockcount);
>> +		if (isrt)
>> +			mod = xfs_rtb_to_rtxoff(mp,
>> +					del.br_startblock + del.br_blockcount);
>> +		else if (isforcealign)
>> +			mod = xfs_forcealign_extent_offset(ip,
>> +					del.br_startblock + del.br_blockcount);
> There's got to be a cleaner way to do this.
> 
> We already know that either isrt or isforcealign must be set here,
> so there's no need for the "else if" construct.
> 
> Also, forcealign should take precedence over realtime, so that
> forcealign will work on realtime devices as well. I'd change this
> code to call a wrapper like:
> 
> 		mod = xfs_bunmapi_align(ip, del.br_startblock + del.br_blockcount);
> 
> static xfs_extlen_t
> xfs_bunmapi_align(
> 	struct xfs_inode	*ip,
> 	xfs_fsblock_t		bno)
> {
> 	if (!XFS_INODE_IS_REALTIME(ip)) {
> 		ASSERT(xfs_inode_has_forcealign(ip));
> 		if (is_power_of_2(ip->i_extsize))
> 			return bno & (ip->i_extsize - 1);
> 		return do_div(bno, ip->i_extsize);
> 	}
> 	return xfs_rtb_to_rtxoff(ip->i_mount, bno);
> }

I have made that change according to your suggestion.

However, now that the forcealign power-of-2 extsize restriction has been 
lifted, I am finding another bug.

I will just mention this now, as I want to go back to that other issue 
https://lore.kernel.org/linux-xfs/20240429174746.2132161-1-john.g.garry@oracle.com/T/#mebd7e97dfd0f12219bf92f289c41f62bf2abcff5

However this new one is pretty simple to reproduce.

We create a file:

ext:  logical_offset:        physical_offset: length:   expected: flags:
0:    0..    1775:      40200..     41975:   1776:            last,eof
/root/mnt2/file_22: 1 extent found

The forcealign extsize is 98304, i.e. 24 blks.

And then try to delete it, and get this:

[   17.604237] XFS: Assertion failed: tp->t_blk_res > 0, file: 
fs/xfs/libxfs/xfs_bmap.c, line: 5599
[   17.605908] ------------[ cut here ]------------
[   17.606884] kernel BUG at fs/xfs/xfs_message.c:102!
[   17.607917] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   17.609134] CPU: 3 PID: 240 Comm: kworker/3:2 Not tainted 
6.10.0-rc1-00096-g759a4497daa7-dirty #2553
[   17.610606] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[   17.612619] Workqueue: xfs-inodegc/sda xfs_inodegc_worker
[   17.613682] RIP: 0010:assfail+0x36/0x40
[   17.614134] Code: c2 18 1e 1d 9d 48 89 f1 48 89 fe 48 c7 c7 43 f2 12 
9d e8 7d fd ff ff 80 3d ae 86 8d 01 00 75 09 90 0f 0b 90 c3 cc cc cc cc 
90 <0f> 0b 0f 1f 84 00 00 00 00 00 90 90 90 90 900
[   17.616478] RSP: 0018:ff4887cac0973c28 EFLAGS: 00010202
[   17.617080] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
000000007fffffff
[   17.617899] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffffffff9d12f243
[   17.618717] RBP: ff4887cac0973d40 R08: 0000000000000000 R09: 
000000000000000a
[   17.619548] R10: 000000000000000a R11: f000000000000000 R12: 
ff360ff881d88000
[   17.620360] R13: 0000000000000000 R14: 
ff360ff8efa32040 R15: 000ffffffffe0000
[   17.620367] FS:  0000000000000000(0000) GS:ff360ffa75cc0000(0000) 
knlGS:0000000000000000
[   17.620369] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.620371] CR2: 00007f213b4ef008 CR3: 000000011cfca003 CR4: 
0000000000771ef0
[   17.620372] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[   17.620374] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[   17.620375] PKRU: 55555554
[   17.620376] Call Trace:
[   17.620384]  <TASK>
[   17.624278]  ? die+0x32/0x90
[   17.624284]  ? do_trap+0xd8/0x100
[   17.624286]  ? assfail+0x36/0x40
[   17.624288]  ? do_error_trap+0x60/0x80
[   17.624289]  ? assfail+0x36/0x40
[   17.624292]  ? exc_invalid_op+0x53/0x70
[   17.628287]  ? assfail+0x36/0x40
[   17.628291]  ? asm_exc_invalid_op+0x1a/0x20
[   17.628295]  ? assfail+0x36/0x40
[   17.628297]  ? assfail+0x23/0x40
[   17.628299]  __xfs_bunmapi+0xb87/0xeb0
[   17.628304]  ? xfs_log_reserve+0x18f/0x210
[   17.629447]  xfs_bunmapi_range+0x62/0xd0
[   17.631699]  xfs_itruncate_extents_flags+0x1c4/0x410
[   17.631703]  xfs_inactive_truncate+0xba/0x140
[   17.631705]  xfs_inactive+0x331/0x3d0
[   17.631707]  xfs_inodegc_worker+0xb8/0x190
[   17.631709]  process_one_work+0x157/0x380
[   17.633949]  worker_thread+0x2ba/0x3e0
[   17.633953]  ? __pfx_worker_thread+0x10/0x10
[   17.633954]  kthread+0xce/0x100
[   17.633958]  ? __pfx_kthread+0x10/0x10
[   17.633960]  ret_from_fork+0x2c/0x50
[   17.633963]  ? __pfx_kthread+0x10/0x10
[   17.633965]  ret_from_fork_asm+0x1a/0x30
[   17.636314]  </TASK>
[   17.640377] Modules linked in:
[   17.642375] ---[ end trace 0000000000000000 ]---

Maybe something is going wrong with the AG bno vs fsbno indexing.

That extent allocated has fsbno=50552 (% 24 != 0). The agsize is 22416 fsb.

That 50552 comes from xfs_alloc_vextent_finish() with args->fsbno=50552 
= XFS_AGB_TO_FSB(mp, agno=1, agbno=17784) = 32K 
(=roundup_power_of_2(22416)) + 17784

So the agbno is aligned, but the fsbno is not.
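The arithmetic can be checked with a standalone model of the AG-relative to filesystem block number conversion. agblklog(), agb_to_fsb() and fsb_to_agbno() are hypothetical userspace helpers that assume the shift-and-or encoding used by XFS_AGB_TO_FSB() with sb_agblklog = log2 of the rounded-up AG size; they are a sketch, not the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest log such that (1 << log) >= agsize. */
static unsigned int agblklog(uint32_t agsize)
{
	unsigned int log = 0;

	assert(agsize > 0);
	while (((uint64_t)1 << log) < agsize)
		log++;
	return log;
}

/* AG number in the high bits, AG-relative block number in the low bits. */
static uint64_t agb_to_fsb(uint32_t agsize, uint32_t agno, uint32_t agbno)
{
	return ((uint64_t)agno << agblklog(agsize)) | agbno;
}

/* Recover the AG-relative block number from an fsbno. */
static uint32_t fsb_to_agbno(uint32_t agsize, uint64_t fsbno)
{
	return (uint32_t)(fsbno & (((uint64_t)1 << agblklog(agsize)) - 1));
}
```

For agsize=22416 the shift is 15, so agno=1, agbno=17784 encodes to fsbno 50552. 17784 % 24 is 0 while 50552 % 24 is 8, because 32768 is not a multiple of 24. With a power-of-2 extsize no larger than the AG stride, the two remainders agree.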

In __xfs_bunmapi(), at this point:

	mod = xfs_bunmapi_align(ip, del.br_startblock);

	if (mod) {
		xfs_extlen_t off;

		if (isforcealign) {
			off = ip->i_extsize - mod;
		} else {
			ASSERT(isrt);
			off = mp->m_sb.sb_rextsize - mod;
		}

		/*
		 * Realtime extent is lined up at the end but not
		 * at the front.  We'll get rid of full extents if
		 * we can.
		 */

mod=8 del.br_startblock(48776) + del.br_blockcount(1776)=50552

Since this code was originally only used for rt, we may be missing 
setting some struct members which were being set for rt. For example, 
xfs_trans_alloc() accepts rtextents value, and maybe we should be doing 
similar for forcealign. Or xfs_fsb_to_db() has special RT handling, but 
I doubt this is the problem.

I have no such issue in using a power-of-2 extsize.

I do realise that I need to share the full code, but I'm reluctant to 
post with known bugs.

Please let me know if you have an idea. I'll look at this further when I 
get a chance.

Thanks,
John

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 07/21] fs: xfs: align args->minlen for forced allocation alignment
  2024-06-06  8:47     ` Dave Chinner
@ 2024-06-06 16:22       ` John Garry
  2024-06-07  6:04         ` John Garry
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-06-06 16:22 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner

On 06/06/2024 09:47, Dave Chinner wrote:
> On Wed, Jun 05, 2024 at 03:26:11PM +0100, John Garry wrote:
>> Hi Dave,
>>
>> I still think that there is a problem with this code or some other allocator
>> code which gives rise to unexpected -ENOSPC. I just highlight this code,
>> above, as I get an unexpected -ENOSPC failure here when the fs does have
>> many free (big enough) extents. I think that the problem may be elsewhere,
>> though.
>>
>> Initially we have a file like this:
>>
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>     0: [0..127]:        62592..62719      0 (62592..62719)     128
>>     1: [128..895]:      hole                                   768
>>     2: [896..1023]:     63616..63743      0 (63616..63743)     128
>>     3: [1024..1151]:    64896..65023      0 (64896..65023)     128
>>     4: [1152..1279]:    65664..65791      0 (65664..65791)     128
>>     5: [1280..1407]:    68224..68351      0 (68224..68351)     128
>>     6: [1408..1535]:    76416..76543      0 (76416..76543)     128
>>     7: [1536..1791]:    62720..62975      0 (62720..62975)     256
>>     8: [1792..1919]:    60032..60159      0 (60032..60159)     128
>>     9: [1920..2047]:    63488..63615      0 (63488..63615)     128
>>    10: [2048..2303]:    63744..63999      0 (63744..63999)     256
>>
>> forcealign extsize is 16 4k fsb, so the layout looks ok.
>>
>> Then we truncate the file to 454 sectors (or 56.75 fsb). This gives:
>>
>> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>     0: [0..127]:        62592..62719      0 (62592..62719)     128
>>     1: [128..455]:      hole                                   328
>>
>> We have 57 fsb.
>>
>> Then I attempt to write from byte offset 232448 (454 sector) and a get a
>> write failure in xfs_bmap_select_minlen() returning -ENOSPC; at that point
>> the file looks like this:
> 
> So you are doing an unaligned write of some size at EOF and EOF is
> not aligned to the extsize?

Correct

> 
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>     0: [0..127]:        62592..62719      0 (62592..62719)     128
>>     1: [128..447]:      hole                                   320
>>     2: [448..575]:      62720..62847      0 (62720..62847)     128
>>
>> That hole in ext #1 is 40 fsb, and not aligned with forcealign granularity.
>> This means that ext #2 is misaligned wrt forcealign granularity.
> 
> OK, so the command to produce this would be something like this?
> 
> # xfs_io -fd -c "truncate 0" \
> 	-c "chattr +<forcealign>" -c "extsize 64k" \
> 	-c "pwrite 0 64k -b 64k" -c "pwrite 448k 64k -b 64k" \
> 	-c "bmap -vvp" \
> 	-c "truncate 227k" \
> 	-c "bmap -vvp" \
> 	-c "pwrite 227k 64k -b 64k" \
> 	-c "bmap -vvp" \
> 	/mnt/scratch/testfile

No, unfortunately not. Well maybe not on a clean filesystem. In my 
stress test, something else is causing this. Probably heavy fragmentation.

> 
>> This is strange.
>>
>> I notice that we when allocate ext #2, xfs_bmap_btalloc() returns
>> ap->blkno=7840, length=16, offset=56. I would expect offset % 16 == 0, which
>> it is not.
> 
> IOWs, the allocation was not correctly rounded down to an aligned
> start offset.  What were the initial parameters passed to this
> allocation?

For xfs_bmap_btalloc() entry,

ap->offset=48, length=32, blkno=0, total=0, minlen=1, minleft=1, eof=1, 
wasdel=0, aeof=0, conv=0, datatype=5, flags=0x8

> i.e. why didn't it round the start offset down to 48?
> Answering that question will tell you where the bug is.

After xfs_bmap_compute_alignments() -> xfs_bmap_extsize_align(), 
ap->offset=48 - that seems ok.

Maybe the problem is in xfs_bmap_process_allocated_extent(). For the 
problematic case when calling that function:

args->fsbno=7840 args->len=16 ap->offset=48 orig_offset=56 orig_length=24

So, as the comment reads there, we could not satisfy the original length 
request, so we move up the position of the extent.

I assume that we just don't want to do that for forcealign, correct?
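To show how the values above turn an aligned offset into an unaligned one, here is a userspace model of the offset fix-up. fixup_offset() is a hypothetical name, and the forcealign flag models one possible way of skipping the fix-up; it assumes ap->length has already been set to args->len (16) by this point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Models the post-allocation fix-up of the file offset. offset and
 * length describe the allocated extent; orig_offset and orig_length
 * describe the caller's original request. All values are in fsb.
 */
static uint64_t fixup_offset(uint64_t offset, uint64_t length,
			     uint64_t orig_offset, uint64_t orig_length,
			     bool forcealign)
{
	assert(length > 0);

	/* Hypothetical: leave the aligned offset alone for forcealign. */
	if (forcealign)
		return offset;

	/* Allocation no larger than the request: start at the request. */
	if (length <= orig_length)
		return orig_offset;

	/* Allocation ends short of the request: slide it up to the end. */
	if (offset + length < orig_offset + orig_length)
		return orig_offset + orig_length - length;

	return offset;
}
```

With offset=48, length=16, orig_offset=56 and orig_length=24, the first branch moves the extent to offset 56, which matches the misaligned ext #2 seen earlier. With the flag set the offset stays at 48.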

> 
> Of course, if the allocation start is rounded down to 48, then
> the length should be rounded up to 32 to cover the entire range we
> are writing new data to.
> 
>> In the following sub-io block zeroing, I note that we zero the front padding
>> from pos=196608 (or fsb 48 or sector 384) for len=35840, and back padding
>> from pos=263680 for len=64000 (upto sector 640 or fsb 80). That seems wrong,
>> as we are zeroing data in the ext #1 hole, right?
> 
> The sub block zeroing is doing exactly the right thing - it is
> demonstrating the exact range that the force aligned allocation
> should have covered.

Agreed

> 
>> Now the actual -ENOSPC comes from xfs_bmap_btalloc() -> ... ->
>> xfs_bmap_select_minlen() with initially blen=32 args->alignment=16
>> ap->minlen=1 args->maxlen=8. There xfs_bmap_btalloc() has ap->length=8
>> initially. This may be just a symptom.
> 
> Yeah, now the allocator is trying to fix up the mess that the first unaligned
> allocation created, and it's tripping over ENOSPC because it's not
> allowed to do sub-extent size hint allocations when forced alignment
> is enabled....
> 
>> I guess that there is something wrong in the block allocator for ext #2. Any
>> idea where to check?
> 
> Start with tracing exactly what range iomap is requesting be
> allocated, and then follow that through into the allocator to work
> out why the offset being passed to the allocation never gets rounded
> down to be aligned. There's a mistake in the logic somewhere that is
> failing to apply the start alignment to the allocation request (i.e.
> the bug will be in the allocation setup code path. i.e. somewhere
> in the xfs_bmapi_write -> xfs_bmap_btalloc path *before* we get to
> the xfs_alloc_vextent...() calls.
> 
As above, the problem seems to be in the processing fix-up.

Thanks,
John


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 07/21] fs: xfs: align args->minlen for forced allocation alignment
  2024-06-06 16:22       ` John Garry
@ 2024-06-07  6:04         ` John Garry
  0 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-06-07  6:04 UTC (permalink / raw
  To: Dave Chinner
  Cc: djwong, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, Dave Chinner

On 06/06/2024 17:22, John Garry wrote:
> 
>> i.e. why didn't it round the start offset down to 48?
>> Answering that question will tell you where the bug is.
> 
> After xfs_bmap_compute_alignments() -> xfs_bmap_extsize_align(), 
> ap->offset=48 - that seems ok.
> 
> Maybe the problem is in xfs_bmap_process_allocated_extent(). For the 
> problematic case when calling that function:
> 
> args->fsbno=7840 args->len=16 ap->offset=48 orig_offset=56 orig_length=24
> 
> So, as the comment reads there, we could not satisfy the original length 
> request, so we move up the position of the extent.
> 
> I assume that we just don't want to do that for forcealign, correct?
> 

JFYI, after making the following change, my stress test ran overnight:

@@ -3506,13 +3513,15 @@ xfs_bmap_process_allocated_extent(
          * very fragmented so we're unlikely to be able to satisfy the
          * hints anyway.
          */
+       if (!xfs_inode_has_forcealign(ap->ip)) {
         if (ap->length <= orig_length)
                 ap->offset = orig_offset;
         else if (ap->offset + ap->length < orig_offset + orig_length)
                 ap->offset = orig_offset + orig_length - ap->length;
-
+       }
+


>>
>> Of course, if the allocation start is rounded down to 48, then
>> the length should be rounded up to 32 to cover the entire range we
>> are writing new data to. 



* Re: [PATCH v3 14/21] iomap: Sub-extent zeroing
  2024-04-29 17:47 ` [PATCH v3 14/21] iomap: Sub-extent zeroing John Garry
  2024-05-01  1:07   ` Dave Chinner
@ 2024-06-11  3:10   ` Long Li
  2024-06-11  7:29     ` John Garry
  1 sibling, 1 reply; 60+ messages in thread
From: Long Li @ 2024-06-11  3:10 UTC (permalink / raw
  To: John Garry, david, djwong, hch, viro, brauner, jack, chandan.babu,
	willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:39PM +0000, John Garry wrote:
> For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
> sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
> unwritten.
> 
> This will be important for atomic writes support, in that atomically
> writing over a partially written extent would mean that we would need to
> do the unwritten extent conversion write separately, and the write could
> no longer be atomic.
> 
> It is the task of the FS to set iomap.extent_size per iter to indicate
> sub-extent zeroing required.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c  | 17 +++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..a3ed7cfa95bc 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  {
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
> -	unsigned int fs_block_size = i_blocksize(inode), pad;
> +	unsigned int zeroing_size, pad;
>  	loff_t length = iomap_length(iter);
>  	loff_t pos = iter->pos;
>  	blk_opf_t bio_opf;
> @@ -288,6 +288,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	size_t copied = 0;
>  	size_t orig_count;
>  
> +	if (iomap->extent_size)
> +		zeroing_size = iomap->extent_size;
> +	else
> +		zeroing_size = i_blocksize(inode);
> +
>  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>  		return -EINVAL;
> @@ -354,8 +359,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>  
>  	if (need_zeroout) {
> -		/* zero out from the start of the block to the write offset */
> -		pad = pos & (fs_block_size - 1);
> +		/* zero out from the start of the region to the write offset */
> +		pad = pos & (zeroing_size - 1);
>  		if (pad)
>  			iomap_dio_zero(iter, dio, pos - pad, pad);
 
Hi, John

I've been testing and using your atomic write patch series recently. I noticed
that if zeroing_size is larger than a single page, the length passed to
iomap_dio_zero() could also be larger than a page size. This seems incorrect
because iomap_dio_zero() utilizes ZERO_PAGE(0), which is only a single page
in size.

Thanks,
Long Li


* Re: [PATCH v3 14/21] iomap: Sub-extent zeroing
  2024-06-11  3:10   ` Long Li
@ 2024-06-11  7:29     ` John Garry
  0 siblings, 0 replies; 60+ messages in thread
From: John Garry @ 2024-06-11  7:29 UTC (permalink / raw
  To: Long Li, david, djwong, hch, viro, brauner, jack, chandan.babu,
	willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 11/06/2024 04:10, Long Li wrote:
>>   	if (need_zeroout) {
>> -		/* zero out from the start of the block to the write offset */
>> -		pad = pos & (fs_block_size - 1);
>> +		/* zero out from the start of the region to the write offset */
>> +		pad = pos & (zeroing_size - 1);
>>   		if (pad)
>>   			iomap_dio_zero(iter, dio, pos - pad, pad);
>   
> Hi, John
> 
> I've been testing and using your atomic write patch series recently. I noticed
> that if zeroing_size is larger than a single page, the length passed to
> iomap_dio_zero() could also be larger than a page size. This seems incorrect
> because iomap_dio_zero() utilizes ZERO_PAGE(0), which is only a single page
> in size.

ok, thanks for the notice.

So 
https://lore.kernel.org/linux-xfs/20240607145902.1137853-1-kernel@pankajraghav.com/T/#m7ba4ed4f0f0f48be99042703c10b42b72c9fe37c 
is changing that same function to increase the zero range past PAGE_SIZE. 
I'll just need to figure out how to make it support an arbitrarily larger 
size.

Thanks,
John



* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-04-29 17:47 ` [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag John Garry
  2024-04-30 23:22   ` Dave Chinner
@ 2024-06-12  2:10   ` Long Li
  2024-06-12  6:55     ` John Garry
  1 sibling, 1 reply; 60+ messages in thread
From: Long Li @ 2024-06-12  2:10 UTC (permalink / raw
  To: John Garry, david, djwong, hch, viro, brauner, jack, chandan.babu,
	willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Mon, Apr 29, 2024 at 05:47:33PM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> Add a new inode flag to require that all file data extent mappings must
> be aligned (both the file offset range and the allocated space itself)
> to the extent size hint.  Having a separate COW extent size hint is no
> longer allowed.
> 
> The goal here is to enable sysadmins and users to mandate that all space
> mappings in a file must have a startoff/blockcount that are aligned to
> (say) a 2MB alignment and that the startblock/blockcount will follow the
> same alignment.
> 
> jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
>      alignment for forcealign
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> Co-developed-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_format.h    |  6 ++++-
>  fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_inode_buf.h |  3 +++
>  fs/xfs/libxfs/xfs_sb.c        |  2 ++
>  fs/xfs/xfs_inode.c            | 12 +++++++++
>  fs/xfs/xfs_inode.h            |  2 +-
>  fs/xfs/xfs_ioctl.c            | 34 +++++++++++++++++++++++-
>  fs/xfs/xfs_mount.h            |  2 ++
>  fs/xfs/xfs_super.c            |  4 +++
>  include/uapi/linux/fs.h       |  2 ++
>  10 files changed, 114 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 2b2f9050fbfb..4dd295b047f8 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
>  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>  #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> +#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
 
Hi, John

You know I've been using and testing your atomic writes patch series recently,
and I'm particularly interested in the changes to the on-disk format. I noticed
that XFS_SB_FEAT_RO_COMPAT_FORCEALIGN uses bit 30 instead of bit 4, which would
be the next available bit in sequence.

I'm wondering if using bit 30 is just a temporary solution to avoid conflicts, 
and if the plan is to eventually use bits sequentially, for example, using bit 4?
I'm looking forward to your explanation.

Thanks,
Long Li


* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-06-12  2:10   ` Long Li
@ 2024-06-12  6:55     ` John Garry
  2024-06-12 15:43       ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: John Garry @ 2024-06-12  6:55 UTC (permalink / raw
  To: Long Li, david, djwong, hch, viro, brauner, jack, chandan.babu,
	willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On 12/06/2024 03:10, Long Li wrote:
> On Mon, Apr 29, 2024 at 05:47:33PM +0000, John Garry wrote:
>> From: "Darrick J. Wong"<djwong@kernel.org>
>>
>> Add a new inode flag to require that all file data extent mappings must
>> be aligned (both the file offset range and the allocated space itself)
>> to the extent size hint.  Having a separate COW extent size hint is no
>> longer allowed.
>>
>> The goal here is to enable sysadmins and users to mandate that all space
>> mappings in a file must have a startoff/blockcount that are aligned to
>> (say) a 2MB alignment and that the startblock/blockcount will follow the
>> same alignment.
>>
>> jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
>>       alignment for forcealign
>> Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
>> Co-developed-by: John Garry<john.g.garry@oracle.com>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_format.h    |  6 ++++-
>>   fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
>>   fs/xfs/libxfs/xfs_inode_buf.h |  3 +++
>>   fs/xfs/libxfs/xfs_sb.c        |  2 ++
>>   fs/xfs/xfs_inode.c            | 12 +++++++++
>>   fs/xfs/xfs_inode.h            |  2 +-
>>   fs/xfs/xfs_ioctl.c            | 34 +++++++++++++++++++++++-
>>   fs/xfs/xfs_mount.h            |  2 ++
>>   fs/xfs/xfs_super.c            |  4 +++
>>   include/uapi/linux/fs.h       |  2 ++
>>   10 files changed, 114 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> index 2b2f9050fbfb..4dd295b047f8 100644
>> --- a/fs/xfs/libxfs/xfs_format.h
>> +++ b/fs/xfs/libxfs/xfs_format.h
>> @@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
>>   #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>>   #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>>   #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
>> +#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
>   
> Hi, John
> 
> You know I've been using and testing your atomic writes patch series recently,
> and I'm particularly interested in the changes to the on-disk format. I noticed
> that XFS_SB_FEAT_RO_COMPAT_FORCEALIGN uses bit 30 instead of bit 4, which would
> be the next available bit in sequence.
> 
> I'm wondering if using bit 30 is just a temporary solution to avoid conflicts,
> and if the plan is to eventually use bits sequentially, for example, using bit 4?
> I'm looking forward to your explanation.

I really don't know. I'm looking through the history and it has been 
like that since the start of my source control records.

Maybe it was a copy-and-paste error from XFS_FEAT_FORCEALIGN, whose 
value has changed since.

Anyway, I'll ask a bit more internally, and I'll look to change to (1 << 
4) if ok.

Thanks,
John


* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-06-12  6:55     ` John Garry
@ 2024-06-12 15:43       ` Darrick J. Wong
  2024-06-13  2:04         ` Long Li
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-06-12 15:43 UTC (permalink / raw
  To: John Garry
  Cc: Long Li, david, hch, viro, brauner, jack, chandan.babu, willy,
	axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, Jun 12, 2024 at 07:55:31AM +0100, John Garry wrote:
> On 12/06/2024 03:10, Long Li wrote:
> > On Mon, Apr 29, 2024 at 05:47:33PM +0000, John Garry wrote:
> > > From: "Darrick J. Wong"<djwong@kernel.org>
> > > 
> > > Add a new inode flag to require that all file data extent mappings must
> > > be aligned (both the file offset range and the allocated space itself)
> > > to the extent size hint.  Having a separate COW extent size hint is no
> > > longer allowed.
> > > 
> > > The goal here is to enable sysadmins and users to mandate that all space
> > > mappings in a file must have a startoff/blockcount that are aligned to
> > > (say) a 2MB alignment and that the startblock/blockcount will follow the
> > > same alignment.
> > > 
> > > jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
> > >       alignment for forcealign
> > > Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
> > > Co-developed-by: John Garry<john.g.garry@oracle.com>
> > > Signed-off-by: John Garry<john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/libxfs/xfs_format.h    |  6 ++++-
> > >   fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
> > >   fs/xfs/libxfs/xfs_inode_buf.h |  3 +++
> > >   fs/xfs/libxfs/xfs_sb.c        |  2 ++
> > >   fs/xfs/xfs_inode.c            | 12 +++++++++
> > >   fs/xfs/xfs_inode.h            |  2 +-
> > >   fs/xfs/xfs_ioctl.c            | 34 +++++++++++++++++++++++-
> > >   fs/xfs/xfs_mount.h            |  2 ++
> > >   fs/xfs/xfs_super.c            |  4 +++
> > >   include/uapi/linux/fs.h       |  2 ++
> > >   10 files changed, 114 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 2b2f9050fbfb..4dd295b047f8 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
> > >   #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> > >   #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> > >   #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> > > +#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
> > Hi, John
> > 
> > You know I've been using and testing your atomic writes patch series recently,
> > and I'm particularly interested in the changes to the on-disk format. I noticed
> > that XFS_SB_FEAT_RO_COMPAT_FORCEALIGN uses bit 30 instead of bit 4, which would
> > be the next available bit in sequence.
> > 
> > I'm wondering if using bit 30 is just a temporary solution to avoid conflicts,
> > and if the plan is to eventually use bits sequentially, for example, using bit 4?
> > I'm looking forward to your explanation.
> 
> I really don't know. I'm looking through the history and it has been like
> that since the start of my source control records.
> 
> Maybe it was a copy-and-paste error from XFS_FEAT_FORCEALIGN, whose value
> has changed since.
> 
> Anyway, I'll ask a bit more internally, and I'll look to change to (1 << 4)
> if ok.

I tend to use upper bits for ondisk features that are still under
development so that (a) there won't be collisions with other features
getting merged and (b) after the feature I'm working on gets merged, any
old fs images in my zoo will no longer mount.

--D

> Thanks,
> John
> 


* Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
  2024-06-12 15:43       ` Darrick J. Wong
@ 2024-06-13  2:04         ` Long Li
  0 siblings, 0 replies; 60+ messages in thread
From: Long Li @ 2024-06-13  2:04 UTC (permalink / raw
  To: Darrick J. Wong, John Garry
  Cc: david, hch, viro, brauner, jack, chandan.babu, willy, axboe,
	martin.petersen, linux-kernel, linux-fsdevel, tytso, jbongio,
	ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang

On Wed, Jun 12, 2024 at 08:43:42AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 12, 2024 at 07:55:31AM +0100, John Garry wrote:
> > On 12/06/2024 03:10, Long Li wrote:
> > > On Mon, Apr 29, 2024 at 05:47:33PM +0000, John Garry wrote:
> > > > From: "Darrick J. Wong"<djwong@kernel.org>
> > > > 
> > > > Add a new inode flag to require that all file data extent mappings must
> > > > be aligned (both the file offset range and the allocated space itself)
> > > > to the extent size hint.  Having a separate COW extent size hint is no
> > > > longer allowed.
> > > > 
> > > > The goal here is to enable sysadmins and users to mandate that all space
> > > > mappings in a file must have a startoff/blockcount that are aligned to
> > > > (say) a 2MB alignment and that the startblock/blockcount will follow the
> > > > same alignment.
> > > > 
> > > > jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
> > > >       alignment for forcealign
> > > > Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
> > > > Co-developed-by: John Garry<john.g.garry@oracle.com>
> > > > Signed-off-by: John Garry<john.g.garry@oracle.com>
> > > > ---
> > > >   fs/xfs/libxfs/xfs_format.h    |  6 ++++-
> > > >   fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
> > > >   fs/xfs/libxfs/xfs_inode_buf.h |  3 +++
> > > >   fs/xfs/libxfs/xfs_sb.c        |  2 ++
> > > >   fs/xfs/xfs_inode.c            | 12 +++++++++
> > > >   fs/xfs/xfs_inode.h            |  2 +-
> > > >   fs/xfs/xfs_ioctl.c            | 34 +++++++++++++++++++++++-
> > > >   fs/xfs/xfs_mount.h            |  2 ++
> > > >   fs/xfs/xfs_super.c            |  4 +++
> > > >   include/uapi/linux/fs.h       |  2 ++
> > > >   10 files changed, 114 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > index 2b2f9050fbfb..4dd295b047f8 100644
> > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > @@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
> > > >   #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> > > >   #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> > > >   #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> > > > +#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
> > > Hi, John
> > > 
> > > You know I've been using and testing your atomic writes patch series recently,
> > > and I'm particularly interested in the changes to the on-disk format. I noticed
> > > that XFS_SB_FEAT_RO_COMPAT_FORCEALIGN uses bit 30 instead of bit 4, which would
> > > be the next available bit in sequence.
> > > 
> > > I'm wondering if using bit 30 is just a temporary solution to avoid conflicts,
> > > and if the plan is to eventually use bits sequentially, for example, using bit 4?
> > > I'm looking forward to your explanation.
> > 
> > > I really don't know. I'm looking through the history and it has been like
> > > that since the start of my source control records.
> > 
> > Maybe it was a copy-and-paste error from XFS_FEAT_FORCEALIGN, whose value
> > has changed since.
> > 
> > Anyway, I'll ask a bit more internally, and I'll look to change to (1 << 4)
> > if ok.
> 
> I tend to use upper bits for ondisk features that are still under
> development so that (a) there won't be collisions with other features
> getting merged and (b) after the feature I'm working on gets merged, any
> old fs images in my zoo will no longer mount.
> 

I get it, thank you very much for your response.


Thread overview: 60+ messages
2024-04-29 17:47 [PATCH v3 00/21] block atomic writes for XFS John Garry
2024-04-29 17:47 ` [PATCH v3 01/21] fs: Add generic_atomic_write_valid_size() John Garry
2024-04-29 17:47 ` [PATCH v3 02/21] xfs: only allow minlen allocations when near ENOSPC John Garry
2024-04-29 17:47 ` [PATCH v3 03/21] xfs: always tail align maxlen allocations John Garry
2024-04-29 17:47 ` [PATCH v3 04/21] xfs: simplify extent allocation alignment John Garry
2024-04-29 17:47 ` [PATCH v3 05/21] xfs: make EOF allocation simpler John Garry
2024-04-29 17:47 ` [PATCH v3 06/21] xfs: introduce forced allocation alignment John Garry
2024-04-29 17:47 ` [PATCH v3 07/21] fs: xfs: align args->minlen for " John Garry
2024-06-05 14:26   ` John Garry
2024-06-06  8:47     ` Dave Chinner
2024-06-06 16:22       ` John Garry
2024-06-07  6:04         ` John Garry
2024-04-29 17:47 ` [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag John Garry
2024-04-30 23:22   ` Dave Chinner
2024-05-01 10:03     ` John Garry
2024-05-02  0:50       ` Dave Chinner
2024-05-02  7:56         ` John Garry
2024-06-12  2:10   ` Long Li
2024-06-12  6:55     ` John Garry
2024-06-12 15:43       ` Darrick J. Wong
2024-06-13  2:04         ` Long Li
2024-04-29 17:47 ` [PATCH v3 09/21] xfs: Do not free EOF blocks for forcealign John Garry
2024-04-30 22:54   ` Dave Chinner
2024-05-01  8:30     ` John Garry
2024-05-02  1:11       ` Dave Chinner
2024-05-02  8:55         ` John Garry
2024-04-29 17:47 ` [PATCH v3 10/21] xfs: Update xfs_is_falloc_aligned() mask " John Garry
2024-04-30 23:35   ` Dave Chinner
2024-05-01 10:48     ` John Garry
2024-05-01 23:45       ` Darrick J. Wong
2024-04-29 17:47 ` [PATCH RFC v3 11/21] xfs: Unmap blocks according to forcealign John Garry
2024-05-01  0:10   ` Dave Chinner
2024-05-01 10:54     ` John Garry
2024-06-06  9:50     ` John Garry
2024-04-29 17:47 ` [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign John Garry
2024-05-01  0:53   ` Dave Chinner
2024-05-01 11:24     ` John Garry
2024-05-01 23:53     ` Darrick J. Wong
2024-05-02  3:12       ` Dave Chinner
2024-04-29 17:47 ` [PATCH v3 13/21] xfs: Enable file data forcealign feature John Garry
2024-04-29 17:47 ` [PATCH v3 14/21] iomap: Sub-extent zeroing John Garry
2024-05-01  1:07   ` Dave Chinner
2024-05-01 10:23     ` John Garry
2024-05-30 10:40     ` John Garry
2024-06-11  3:10   ` Long Li
2024-06-11  7:29     ` John Garry
2024-04-29 17:47 ` [PATCH v3 15/21] fs: xfs: " John Garry
2024-05-01  1:32   ` Dave Chinner
2024-05-01 11:36     ` John Garry
2024-05-02  1:26       ` Dave Chinner
2024-04-29 17:47 ` [PATCH v3 16/21] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
2024-04-29 17:47 ` [PATCH v3 17/21] iomap: Atomic write support John Garry
2024-05-01  1:47   ` Dave Chinner
2024-05-01 11:08     ` John Garry
2024-05-02  1:43       ` Dave Chinner
2024-05-02  9:12         ` John Garry
2024-04-29 17:47 ` [PATCH v3 18/21] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
2024-04-29 17:47 ` [PATCH v3 19/21] xfs: Support atomic write for statx John Garry
2024-04-29 17:47 ` [PATCH v3 20/21] xfs: Validate atomic writes John Garry
2024-04-29 17:47 ` [PATCH v3 21/21] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
